An improved algorithm for the maximal information coefficient and its application

The maximal information coefficient (MIC) captures both linear and nonlinear correlations between variable pairs. In this paper, we propose the BackMIC algorithm for MIC estimation. The BackMIC algorithm adds a searching back process on the equipartitioned axis to obtain a better grid partition than the original implementation algorithm ApproxMaxMI. Similar to the ChiMIC algorithm, it terminates the grid search process by the χ2-test instead of the maximum number of bins B(n, α). Results on simulated data show that the BackMIC algorithm maintains the generality of MIC and gives more reasonable grid partitions and MIC values for independent and dependent variable pairs under comparable running times. Moreover, it is robust under different α in B(n, α). MIC calculated by the BackMIC algorithm shows improved statistical power and equitability. We applied (1-MIC) as the distance measurement in the K-means algorithm to cluster cancer/normal samples. The results on four cancer datasets demonstrated that the MIC values calculated by the BackMIC algorithm yield better clustering results, indicating that the correlations between samples measured by the BackMIC algorithm were more credible than those measured by other algorithms.

ZY, 0000-0002-7444-8743

Comparison of grids and estimated MICs for independent variable pairs
The expected grid for independent variable pairs is 2 × 2 [13]. Figure 1 shows the grid frequency distribution obtained by the AppMIC, ChiMIC and BackMIC algorithms when computing the MIC values of independent variable pairs over 1000 repetitions. When data size n = 100 and B(n, α) = 16, almost all the grids of the AppMIC algorithm were concentrated in 2 × 8 and 8 × 2. Most grids of the ChiMIC algorithm were concentrated in a × b, where a ≤ 5 and b ≤ 5. By contrast, almost all the grids of the BackMIC algorithm were concentrated in 2 × 2, 2 × 3 and 3 × 2; the most frequent grid was 2 × 2. The grids of the BackMIC algorithm were the closest to the expected one.

Figure 1. Grid frequency distribution of the three algorithms for independent variable pairs. The y-axis and x-axis are the number of bins. Data size n = 100, 1000 replicates.

royalsocietypublishing.org/journal/rsos R. Soc. Open Sci. 8: 201424

The MIC of independent variable pairs should converge to 0 as data size n → ∞ [7]. Table 1 shows the MIC values estimated by the three algorithms (i.e. AppMIC, ChiMIC and BackMIC) for different data sizes. BackMIC gave values similar to ChiMIC without expanding the normalization term, and both were closer to zero than AppMIC at the same data size.

Comparison of grids and estimated MICs for dependent variable pairs
The MIC of variable pairs with noiseless functional correlations should be 1 [7]. We used the BackMIC algorithm to calculate the MIC values of 13 pairs of noiseless functional correlations (table 2). All MIC values were 1, indicating that the BackMIC algorithm maintained the generality of MIC.
For the 'I'-type dataset, the expected grid and MIC were 3 × 3 and 0.2561, respectively (figure 2a). The grids of the AppMIC, ChiMIC and BackMIC algorithms were 14 × 3, 3 × 3 and 3 × 3, respectively, and the estimated MICs were 0.2457, 0.1808 and 0.2561, respectively (figure 2b-d). Only the grid and estimated value of the BackMIC algorithm were in line with expectations. For the chequerboard dataset, the expected grid and MIC were 5 × 5 and 0.3835, respectively (figure 2e). The grids of the AppMIC, ChiMIC and BackMIC algorithms were 9 × 5, 5 × 6 and 5 × 6, respectively; the estimated MICs are given in figure 2. The grids for three noisy functional correlations are shown in figure 3. We compared the three algorithms under the criterion that, with the same number of bins, the larger the MIC value, the better the grid. The AppMIC algorithm was excluded from this comparison because of its excessive bins (all grids reached 2 × 21 for data size n = 500 and B(n, α) = 42; figure 3a, d and g). Comparing the MIC values of the three functional correlations obtained by the ChiMIC and BackMIC algorithms shows that the latter always achieved higher MIC values with the same number of bins or fewer. Thus, the grids and estimated MICs obtained by the BackMIC algorithm were more reasonable than those of the ChiMIC algorithm, and no axis was left equipartitioned.

Comparison of robustness
The correlation strength of a given variable pair should be a fixed quantity. However, as shown in figure 4, AppMIC varied with α in B(n, α) for noisy linear, parabolic and sinusoidal correlations, because more bins generally result in larger MIC values. BackMIC remained almost constant because the χ2-test, rather than B(n, α), was used to terminate grid optimization. Therefore, the BackMIC algorithm was more robust in measuring the correlation between variables than the AppMIC algorithm. The variation in ChiMIC was not obvious either; however, ChiMIC values were always lower than BackMIC values because of the equipartition restriction and the harsh normalization term in the ChiMIC algorithm.

Comparison of statistical power
Statistical power refers to the probability of correctly rejecting the null hypothesis when the alternative hypothesis is true [14]. As statistical power increases, the probability of making a type II error decreases [15,16].
For the null hypothesis of statistical independence, statistical power is computed for each dataset on both dependent and independent variable pairs: the statistical power of a statistic is defined as the fraction of dependent variable pairs yielding a statistic value greater than 95% (significance level 0.05) of the values yielded by the independent variable pairs [13]. The statistical power of AppMIC, ChiMIC and BackMIC for the above five functional correlations (figures 2 and 3) at different noise amplitudes is shown in figure 5. The power of BackMIC was significantly higher than that of AppMIC and ChiMIC.
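The power computation described above reduces to a quantile comparison: take the 95th percentile of the statistic over the independent (null) pairs as the threshold, and count how many dependent pairs exceed it. A minimal sketch, using uniform toy values as stand-ins for real MIC scores:

```python
import numpy as np

def statistical_power(dep_stats, indep_stats, alpha=0.05):
    """Fraction of dependent-pair statistic values exceeding the
    (1 - alpha) quantile of the independent-pair (null) values."""
    threshold = np.quantile(np.asarray(indep_stats), 1.0 - alpha)
    return float(np.mean(np.asarray(dep_stats) > threshold))

# toy illustration: null and dependent scores drawn from overlapping ranges
rng = np.random.default_rng(0)
null_vals = rng.uniform(0.0, 0.2, size=1000)   # stand-in for independent pairs
dep_vals = rng.uniform(0.15, 0.6, size=1000)   # stand-in for dependent pairs
power = statistical_power(dep_vals, null_vals)
```

In the paper's setting, `dep_stats` and `indep_stats` would each hold the MIC values of many simulated variable pairs at a given noise amplitude.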

Comparison of equitability
If a statistic assigns similar scores to equally noisy correlations of different types, then the statistic has the property of equitability [17]. Equitability allows us to specify a threshold correlation strength below which we are uninterested and to search for correlations whose strength exceeds the threshold [18]. Perfect equitability does not exist [19]. We tested the approximate equitability of AppMIC, ChiMIC and BackMIC on the 13 functional correlations listed in table 2. For each correlation, we analysed equitability by generating a noiseless data sequence with data size n = 500 and 301 data series with noise ε added to f(X), where ε is a uniformly distributed random variable from −b to b, and b denotes the noise level selected from [0, 3] with a step size of 0.01. The results confirmed that the approximate equitability of BackMIC was better than that of AppMIC and ChiMIC (figure 6).
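The data generation behind the equitability test can be sketched directly from the description above: one noiseless series, plus one noisy series per noise level b in [0, 3] with step 0.01 (301 levels) and ε ~ Uniform(−b, b). The sampling of X from [0, 1] is an assumption for illustration:

```python
import numpy as np

def noisy_series(f, n=500, b_max=3.0, step=0.01, seed=0):
    """For a function f, generate X and one (b, Y) series per noise
    level b, where Y = f(X) + eps and eps ~ Uniform(-b, b).
    The b = 0 entry is the noiseless series."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)           # assumed domain for X
    levels = np.arange(0.0, b_max + step / 2, step)  # 301 levels incl. b = 0
    series = [(b, f(x) + rng.uniform(-b, b, size=n)) for b in levels]
    return x, series

x, series = noisy_series(np.sin)
```

Each statistic would then be evaluated on every (X, Y) pair and plotted against the noise level to compare equitability curves.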

Comparison of computational cost
We compared the computational time of the AppMIC, ChiMIC and BackMIC algorithms using different data sizes for two types of variable pairs (independent variable pairs and parabolic functional variable pairs with a noise level of 0.5). The results in table 3 show that the ChiMIC algorithm ran faster than the AppMIC algorithm because the former used the χ2-test to terminate grid optimization earlier than the latter. The running time of the BackMIC algorithm was almost twice that of the ChiMIC algorithm, because the BackMIC algorithm adds a searching back process for an optimal partition on the originally equipartitioned axis. Compared with the AppMIC algorithm, the BackMIC algorithm was slower when the data size was small; as the data size increased, the BackMIC algorithm caught up with the AppMIC algorithm. For independent variable pairs, the BackMIC algorithm ran faster than the AppMIC algorithm.

Comparison of AppMIC, ChiMIC and BackMIC applied in clustering for cancer classification
Given that cancer samples and normal samples have different gene expression levels [20], clustering algorithms such as K-means locate samples into different clusters on the basis of the similarity (distance) between samples; these clusters can then be used for cancer classification [21,22].
To evaluate the performance of MIC obtained by the AppMIC, ChiMIC and BackMIC algorithms, we replaced the Euclidean distance with (1-MIC) as the distance measurement between two samples in the K-means algorithm. The cancer gene expression datasets GSE37023, GSE29272 and GSE35602 were used in our work. Table 4 shows that the purity and Rand index (RI) of the clustering results based on (1-BackMIC) were higher than those based on (1-AppMIC) and (1-ChiMIC), and even better than those based on the Euclidean distance in the original K-means.

Figure 4. Comparison of AppMIC, ChiMIC and BackMIC for linear, parabolic and sinusoidal correlations at different α in B(n, α). R² in 'noise' is the squared Pearson correlation coefficient of f(X) and Y, where f(X) is the same as in figure 3 and Y = f(X) + ζ. For linear and parabolic correlations, ζ is drawn uniformly at noise levels 0.2, 0.35 and 0.5, respectively; for sinusoidal correlation, ζ is drawn uniformly at noise levels 0.6, 0.9 and 1.2, respectively. The MIC values were averaged over 500 repetitions with data size n = 500.
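Because (1 − MIC) enters K-means only through pairwise dissimilarities, the clustering step can be organised as a medoid-based loop over a precomputed distance matrix. A minimal sketch under that assumption (the toy matrix below is illustrative, not real MIC output, and the medoid update is one common way to avoid coordinate means):

```python
import numpy as np

def kmedoids(D, k, n_iter=100, seed=0):
    """K-means-style clustering driven by a precomputed symmetric
    distance matrix D (e.g. 1 - MIC between samples).  Medoids play
    the role of cluster centres, so no coordinate mean is needed."""
    rng = np.random.default_rng(seed)
    m = D.shape[0]
    medoids = rng.choice(m, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)       # nearest medoid
        new = np.array([
            np.where(labels == j)[0][
                np.argmin(D[np.ix_(labels == j, labels == j)].sum(axis=1))]
            for j in range(k)])                          # most central member
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels

# toy (1 - MIC)-style dissimilarity with two clear sample groups
D = np.array([[0.0, 0.1, 0.9, 0.9],
              [0.1, 0.0, 0.9, 0.9],
              [0.9, 0.9, 0.0, 0.1],
              [0.9, 0.9, 0.1, 0.0]])
labels = kmedoids(D, k=2)
```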
[18], where each square has sides of length 1. In figure 2, a and η are equal to 0 in the 'I'-type pattern and the 5 × 5 chequerboard. In figure 5, the noise amplitude a for the statistical power calculation takes 25 values with a logarithmic distribution from 1 to 10; X, ξ and η are random variables drawn from the normal distribution N(0, 1). In figure 4, the noise added to the linear, parabolic and sinusoidal functional correlations is defined as Y = f(X) + noise_level × (2 rand(n, 1) − 1), where rand(n, 1) generates n uniformly distributed numbers in [0, 1].

Real datasets
We used GSE37023 [23], GSE29272 [24] and GSE35602 [25] to verify the reliability of BackMIC, as described in table 6. In GSE37023, two datasets from platforms GPL96 and GPL97 were used, denoted as GSE37023_1 and GSE37023_2. In GSE35602, only the dataset from platform GPL6480 was used. All the datasets were obtained from the Gene Expression Omnibus database (https://www.ncbi.nlm.nih.gov/geo/). Probe IDs were converted to gene symbols according to the GEO platform (GPL). If several probes mapped to the same gene symbol, their average value was taken as the expression value of that gene. An implementation of the BackMIC algorithm can be downloaded at https://github.com/Caodan82/BackMIC.

The BackMIC algorithm uses the χ2-test to terminate grid optimization, similar to the ChiMIC algorithm. Given an optimal segment point, if the p-value of the χ2-test for that point is lower than a given threshold (here, threshold = 0.01), the segment point is valid and the BackMIC algorithm continues to search for the next optimal segment point; otherwise, the algorithm stops searching. Take the kth optimal segment point SP_k on the x-axis as an example. Suppose that the y-axis is partitioned into n_y bins and SP_k lies between SP_{k−2} and SP_{k−1}, dividing that interval into the (s − 1)th and sth columns (figure 7). The coloured n_y × 2 contingency table in figure 7 is called the detection area of SP_k for the χ2-test. The χ2 statistic is defined as follows [13]:

χ2 = Σ_j Σ_i (n_{j,i} − E_{j,i})² / E_{j,i}, where E_{j,i} = n_{j,*} n_{*,i} / N_d,

where n_{j,i} is the number of data points in the bin of row j and column i of the detection area, n_{*,i} is the number of data points in the bins of column i, n_{j,*} is the number of data points in the bins of row j of the detection area, and N_d is the total number of data points in the bins of the detection area. If n_y = 2, the χ2 statistic needs to be corrected according to the following formula (the Yates continuity correction) [26]:

χ2 = Σ_j Σ_i (|n_{j,i} − E_{j,i}| − 0.5)² / E_{j,i}.
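The validity check on a detection area can be sketched as follows. This is an illustrative implementation, not the authors' code: scipy is assumed only for the χ2 tail probability, and the contingency tables below are toy examples.

```python
import numpy as np
from scipy.stats import chi2

def segment_point_valid(table, threshold=0.01):
    """Pearson chi-square independence test on an n_y x 2 detection
    area.  The segment point is kept only when p < threshold; the
    Yates continuity correction is applied when n_y = 2."""
    table = np.asarray(table, dtype=float)
    n_y = table.shape[0]
    row = table.sum(axis=1)                  # n_{j,*}
    col = table.sum(axis=0)                  # n_{*,i}
    N_d = table.sum()
    expected = np.outer(row, col) / N_d      # E_{j,i}
    diff = np.abs(table - expected)
    if n_y == 2:                             # Yates correction
        diff = np.maximum(diff - 0.5, 0.0)
    stat = np.sum(diff ** 2 / expected)
    p = chi2.sf(stat, df=(n_y - 1) * (2 - 1))
    return p < threshold, stat, p

# strongly dependent columns: the segment point should be accepted
valid, stat, p = segment_point_valid([[40, 5], [5, 40]])
```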

BackMIC algorithm
The BackMIC algorithm involves two phases to obtain the optimal partition on the scatterplot of a variable pair (see algorithm 1). First, based on an equipartition of n_y bins on one axis (take the y-axis as an example), the BackMIC algorithm locates an optimal partition of the x-axis through dynamic programming to achieve the largest normalized mutual information under the restriction of the χ2-test, similar to the ChiMIC algorithm [13]. Second, unlike the AppMIC and ChiMIC algorithms, the BackMIC algorithm fixes the partition of the x-axis obtained in the first phase and searches back for an optimal partition of the y-axis instead of keeping it equipartitioned. Therefore, the BackMIC algorithm controls the bins of both the y- and x-axes by the χ2-test and allows unequal partitions on both axes. For n_y = 2, the simulation process of the BackMIC algorithm is shown in figure 8. The results show that, compared with equipartitioning the y-axis, an unequal partition of the y-axis yields larger normalized mutual information (0.5414 versus 0.3113).
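The score that both phases try to raise is the normalized mutual information of the candidate grid. A minimal sketch of that score alone (not the dynamic-programming search), assuming the usual MIC normalization by log2 min(n_x, n_y):

```python
import numpy as np

def normalized_mi(x, y, x_edges, y_edges):
    """Mutual information of the grid induced by the given bin edges,
    normalised by log2(min(n_x, n_y)) as in the MIC characteristic
    matrix; the searching-back phase tries to increase this value."""
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)        # marginal over x bins
    py = p.sum(axis=0, keepdims=True)        # marginal over y bins
    mask = p > 0
    mi = np.sum(p[mask] * np.log2(p[mask] / (px @ py)[mask]))
    n_x, n_y = len(x_edges) - 1, len(y_edges) - 1
    return mi / np.log2(min(n_x, n_y))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)
y = x                                        # perfectly dependent pair
score = normalized_mi(x, y, x_edges=[0, 0.5, 1], y_edges=[0, 0.5, 1])
```

For a noiseless dependent pair and a well-aligned grid the score approaches 1, which matches the expectation that MIC equals 1 for noiseless functional correlations.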

K-means clustering algorithm
Suppose there are M samples with K classes in a dataset, C = {c_1, c_2, …, c_K} is the sample set of the real classification, and c_i is the sample set of the ith class (given that the four gene expression datasets are binary-class, K is 2 here). The K-means clustering algorithm is performed using Matlab scripts downloaded from http://people.revoledu.com/kardi/tutorial/kMean/download.htm [27]; it proceeds by randomly selecting K initial clustering centres and then assigning each sample to the nearest clustering centre [28]. For each dataset, we used the top 1000 genes with the largest variance to calculate the distance between samples. Suppose that the clustering result of K-means is Ω = {ω_1, ω_2, …, ω_K}, where ω_i is the set of the ith cluster.

Two commonly used evaluation criteria for clustering algorithms, namely purity and RI, were used in this paper. Purity is the proportion of correctly clustered samples among the total samples, and it can be calculated by [29]

Purity = (1/M) Σ_i max_j |c_i ∩ ω_j|.

RI refers to the proportion of concordant sample pairs among the total number of sample pairs [30]. Let A be the number of sample pairs placed in the same group in both C and Ω, and B be the number of sample pairs placed in different groups in both C and Ω. RI is defined as follows [31]:

RI = (A + B) / (M(M − 1)/2).
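Both criteria follow directly from the definitions above; this is an independent sketch in Python, not the Matlab evaluation code used in the paper:

```python
import numpy as np
from itertools import combinations

def purity(true_labels, cluster_labels):
    """Purity = (1/M) * sum_i max_j |c_i intersect w_j|."""
    t = np.asarray(true_labels)
    w = np.asarray(cluster_labels)
    total = sum(max(np.sum((t == c) & (w == cl)) for cl in np.unique(w))
                for c in np.unique(t))
    return total / len(t)

def rand_index(true_labels, cluster_labels):
    """RI = (A + B) / C(M, 2): pairs grouped concordantly in the real
    classification and the clustering, over all sample pairs."""
    pairs = list(combinations(range(len(true_labels)), 2))
    concordant = sum(
        (true_labels[i] == true_labels[j]) == (cluster_labels[i] == cluster_labels[j])
        for i, j in pairs)
    return concordant / len(pairs)

# toy example: one sample of class 0 is mis-clustered
truth = [0, 0, 0, 1, 1, 1]
clusters = [0, 0, 1, 1, 1, 1]
p = purity(truth, clusters)
ri = rand_index(truth, clusters)
```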

Conclusion
In this paper, we introduced the BackMIC algorithm for better MIC estimation. The BackMIC algorithm added a searching back process to obtain an optimal partition of the originally equipartitioned axis, making it more likely to obtain the true MIC value. Meanwhile, the BackMIC algorithm used the χ2-test to ensure that each introduced optimal segment point significantly increases the MIC value. This effectively avoided unreasonable grid refinement and made the MIC value independent of B(n, α), which improved the robustness of MIC.
The results on simulated data showed that, compared with the AppMIC and ChiMIC algorithms, the BackMIC algorithm can effectively reduce the MIC value of independent variable pairs without expanding the normalization term in the MIC definition; if there is a noiseless functional correlation between a variable pair, the MIC calculated by the BackMIC algorithm is equal to 1, maintaining the generality of MIC; if there is a noisy correlation between a variable pair, the BackMIC algorithm usually obtains larger MIC values with fewer bins; moreover, the statistical power and equitability of MIC calculated by the BackMIC algorithm are better. When applying (1-MIC) as the distance measurement between cancer and normal samples in the K-means algorithm, experiments on four cancer datasets showed that the MIC values calculated by the BackMIC algorithm yield better clustering results. All evidence verifies that the BackMIC algorithm improves MIC estimation.