Large numbers of explanatory variables: a probabilistic assessment

Recently, Cox and Battey (2017 Proc. Natl Acad. Sci. USA 114, 8592–8595 (doi:10.1073/pnas.1703764114)) outlined a procedure for regression analysis when there are a small number of study individuals and a large number of potential explanatory variables, but relatively few of the latter have a real effect. The present paper reports more formal statistical properties. The results are intended primarily to guide the choice of key tuning parameters.


Introduction
We consider studies of dependence in which there are relatively few individuals on each of which a large number of potential explanatory variables are measured. Genetic microarray experiments are a prime example. For progress, an implicit or explicit assumption of sparsity is required, namely that the great majority of the explanatory variables have no effect. We call explanatory variables with an effect on response signal variables, and those with no effect noise variables.
Powerful procedures for analysis emanate from the lasso [1] (for a discussion of the mathematical aspects, see [2]). They aim to uncover a single small set of signal variables effective for prediction. It is possible, however, that many different choices of explanatory variables are essentially equally effective and have quite different subject-matter implications. Cox & Battey [3] outline a different approach aiming to specify those small sets of potential signal variables that give essentially the same fit; choice between different sets requires either more data or specific subject-matter information. Their procedure also allowed informal checks standard in much statistical work, such as those for nonlinearity, interaction or anomalous individuals.
In the present paper, we outline a probabilistic base for such an analysis. The emphasis is on how the procedure would perform under various idealized scenarios, as such providing guidance on the choice of key tuning parameters. The focus is on two key aspects at each stage and in the overall process of analysis. What is the probability that a signal variable is falsely discarded? How many of the variables ultimately suggested as potentially important are in fact noise variables? Our objective is essentially to calibrate the analytical procedure by specifying its behaviour under idealized conditions, not to set up a procedure to achieve preassigned error rates.
The ideas involved apply rather generally, for example, to likelihood-based fitting of logistic models for binary data.

Broad outline of procedure
Suppose that v variables are to be assessed for their explanatory power for a response variable. It is assumed that v is roughly k d for k ≤ 15 and d a small positive integer. In the method as outlined by Cox & Battey [3], the indices of these variables are arranged in a k × k × k cube for d = 3 or k × k × k × k hypercube for d = 4, etc. Sets of variables are then selected for test k at a time by traversing the cube by rows, columns, etc. Every variable thus has associated with it d sets of (k − 1) companion variables from each time the traversal of the cube passes through that variable. The explanatory power of each variable is assessed alongside its d sets of companions and, on the basis of those analyses, variables are either discarded or retained according to a suitable decision rule. For instance, a variable might be retained if it is among the two most significant in at least d/2 of the d analyses in which it appears. This is repeated with successively lower dimensional hypercubes until ideally fewer than 20 variables remain on which informal diagnostic checks are then performed.
The final phase of the analysis is to find small subsets of variables among the augmented set of retained variables and any square or interaction terms suggested at the informal phase. See Cox & Battey [3] for a more detailed description of the exploratory and subset selection phases.

Specification
At the ith stage of the process, let there be respectively v Si signal variables and v Ni noise variables. In the specific example below, i = 0, 1, 2. Thus of the v S0 signal variables initially present, v S2 are chosen at the end for detailed study. Given v S0 , the numbers of signal variables chosen at the first and second stage have binomial distributions with parameters θ S1 and θ S2 = θ S1 θ S2.1 , where, in particular, θ S2.1 is the conditional probability, given that a specific signal variable is chosen at the first stage, that it is chosen again at the second stage. A different specification is needed for the noise variable because, initially at least, there are a large number of them. In particular, the v Ni have Poisson distributions with means μ Ni . We write φ N2.1 = μ N2 /μ N1 for the probability that a noise variable retained after step 1 is still retained at step 2.
A summary of the properties of a two-stage procedure is thus provided by θ S2 and μ N2 , the probability that a signal variable is retained and the expected number of noise variables not rejected.
We now study these in order to compare different strategies of analysis.

Some reduction strategies
We compare three possible approaches to each analysis of the first stage reduction. Thus, we may -take the single variable with highest score (most significant); -take the two variables with highest scores; and -take all those variables, if any, whose scores exceed a threshold.
Significance tests based on the Wald, score or likelihood ratio statistics are natural choices. The former does, however, have the disadvantage of not being parameterization invariant. In a set of k variables chosen at random for test, the number of signal variables has a Poisson distribution of mean kv S0 and v S0 is assumed modest, and hence has two or more signal variables with probability approximately . This is small for the situations to be considered and does not affect the qualitative comparison of procedures. It is, therefore, ignored in later calculations.
It is convenient to phrase the initial discussion in terms of significance tests and their associated p-values. For noise variables, these are uniformly distributed on (0, 1), and for signal variables their density can be modelled as (1 − γ )x −γ , where 0 < γ < 1.
In the following calculations of the probability ϑ (j) that the jth procedure listed above chooses the signal variable, it is convenient to use Stirling's formula in the form that for large k and fixed a the ratio Γ (k + a)/Γ (k) is close to k a .
If only the most significant out of k is chosen, the probability that the signal variable is taken is after simplifying by Stirling's formula. Note that the Gamma function is close to 1 over the range of interest. If γ = 0, there is no distinction between signal and noise variables and the notional signal variable is selected with probability 1/k, that is, essentially at random. If now we take the two most significant values out of k, then ϑ (2) = ϑ (1) + ϑ (2.1) , where This is approximately That is, approximately, ϑ (2.1) /ϑ (1) = 1 − γ . The difference between ϑ (2) and ϑ (1) is maximized at approximately γ 0 = (log k − 1)/ log k, at which ϑ (2.1) is Γ {(log k) −1 }/(e log k) (figure 1). If we were to set a critical level at α, then the corresponding probability for a signal is so that the equalities ϑ (3) = ϑ (1) and ϑ (3) = ϑ (2) are achieved for any γ by setting, respectively, This shows that strategy 3 with α = 1/k and strategy 1 are equivalent as this choice also makes μ (3) = μ (1) . By contrast, (2 − γ ) 1/(1−γ ) > 2 for all 0 < γ < 1, meaning that, for the same probability of selecting a signal variable, strategy 3 necessarily selects more noise variables than strategy 2 and, therefore, is strictly inferior.
In replacing the second strategy by the first, μ N1 and consequently μ N2 are reduced by roughly a factor of four independently of γ . However, the penalties of including noise variables are mainly incidental in view of the exhaustive search over models that must later take place. The associated computational burdens are reasonable unless μ N2 is, for example, 20 or more, and so strategy 2 would normally be preferred for its higher θ S1 . The probability θ (j) S1 that the jth procedure retains a signal variable through the first stage is θ with equality at γ = 1, is the reduction in survival probability of signal variables from replacing the second procedure by the first. Here we have approximated by treating analyses in which the same variable appears as independent. The difference (4.3) is approximately at the point of maximum distinction γ = γ 0 , and as the slowest decaying term in (4.4) is of the order of 1/(e 2 log k) this shows the potential advantages of strategy 2 over strategy 1 to decrease only very slowly with k.
A less general but more conventional formulation is to assume a normal theory linear model. Define to be signal strength multiplied by the square root of the sample size. The values of ϑ (j) are then where φ(x) and Φ(x) are the standard normal probability density and cumulative distribution function at x, respectively. The threshold in the third strategy is, by convention, calibrated as the upper quantile κ * of the standard normal distribution. We thereby approximate ϑ (1) and dx and show that the two formulations are qualitatively equivalent.
The integral in ϑ (2.1) is negligible over R − for > 0 and over R + is of the form of a generalized Laplace integral for fixed, that is, where g(x, ν) is uniformly bounded in x as ν → ∞ and h has a single maximum, x 0 (ν), which varies with ν. Specifically we take g( where the symbol denotes the partial derivative with respect to the first argument. It is customary when constructing Laplace integrals to define h such that its maximum is independent of ν. This is not possible here as, without the contribution of the Φ(−x) term, h has no sharp maximum. That the form in (4.5) is permissible is discussed by Copson [5, pp. 42-47].
The second derivative at x is and an asymptotic approximation to ϑ (2.1) is (4.11) Figure 2a shows the accuracy of the approximation (4.9) for k = 10. The qualitative implications of the normal theory formulation are equivalent to those of the formulation in terms of p-values. For instance, at = 0, which corresponds to γ = 0, equation (4.9) is (2π )(k − 1) k k −(k+1) . This is 1/k to two decimal places for k sufficiently large. If γ is redefined as γ = {cosh( ) − 1}/cosh( ), then equation (4.2) has a unique maximum at sech −1 (1/ log k), which, like 0 (k), is a very slowly growing function of k. Figure 2b shows the approximation in equation (0, 1) is monotonically increasing with g(0) = 0 and g(x) ≈ 1 for x > 4 gives essentially the same qualitative conclusions.

Relative severity of each stage
We now consider the relative advantages of -a mild first stage and severe second stage -a severe first stage and a mild second stage, where, for clarity of exposition, 'mild' and 'severe' are associated with the same quantitative values in both versions. In particular, let 1 > ϑ H > ϑ L > 0, where ϑ H and ϑ L are the probabilities that a signal variable is retained in a single analysis of a mild and severe reduction, respectively. Let θ HL S2 and θ LH S2 denote the overall survival probabilities of the first and the second of these procedures, respectively. To demonstrate that suppose for a contradiction that The supremum of 3ϑ L + 2ϑ H (2 − ϑ L ) is 5, attained only in the limit as ϑ L → ϑ H → 1. But 5 < 6 − ϑ L because ϑ L < 1 and so equation (5.1) implies a contradiction, and we conclude that θ HL S2 > θ LH S2 . That is, the first procedure gives a higher probability of a signal variable being retained than the second. The same argument shows that the first procedure results also in a higher probability of noise variables being retained.

A simple example
The performance of various stages of the procedure may be investigated as follows. Generate n = 10 2 replicates of v = 10 3 variables from a normal distribution of zero mean and covariance matrix PΣP −1 , where P is a permutation matrix and Σ is an identity matrix with one diagonal block replaced by a correlation matrix of dimension v S0 + v C0 and equal correlation ρ. For each replicate, v S0 of the v S0 + v C0 correlated variables is multiplied by a constant signal and added, together with standard normal noise. Conditional on a realization x S of the v S0 signal variables, the resulting response variable is then normally distributed of mean γ T x S and unit variance, where γ is a v S0 vector of constants. Arrange the indices of the v variables in a 10 × 10 × 10 cube. The rows, the columns, etc. of the cube form 3k 2 = 300 sets, each of k = 10 variables. Reduction is performed by taking the top two scoring variables in each analysis in the first stage and by taking all those exceeding the threshold of a 0.1% level test in the second stage. Finally, find small subsets of variables that give adequate fit using a likelihood ratio test against the comprehensive model.
Summaries estimated from 500 Monte Carlo replications are reported in table 1. The general conclusion is that such correlation slightly degrades the probability of a signal variable being retained and increases the number of false models not rejected. The comprehensive model in the likelihood ratio test is taken as the model with all variables retained through the reduction phase. Having been selected in the light of the data, it achieves a better fit to the data than an Table 1. S is the true set of signal variables,Ŝ is the set of variables surviving the reduction phase, M is the set of low-dimensional models whose likelihood ratio test against the comprehensive model is not rejected at the 1% level. Empirical standard errors in parenthesis.  arbitrary model embedding the one to be tested and so the probability of the true model being accepted, conditional on it having been retained, is lower than the nominal coverage probability of the likelihood ratio test. A simple way to recover the conditional nominal coverage is to split the sample. Table 1 reports the improvement in coverage probability and the associated increase in the number of false models not rejected from using 70 observations for reduction and the remaining 30 to construct the final set of low-dimensional models.

Discussion
(a) On the choice of k The combinatorial arrangements used are essentially the partially balanced incomplete block designs introduced 80 years ago in the context of plant breeding trials [6]. Our treatment has thus taken k as determined a priori, preferably below 15. An alternative would be to draw random sets of size k for each variable, leaving the choice of k undetermined. The argument against too large a value of k is that the precision of the resulting estimates is degraded by the correlations induced when there are many correlated variables among those selected for test.
The following is a simplistic formulation outlining the effect of k on the estimated standard errors and ignoring other aspects.
Suppose Z 1 , . . . , Z k have zero mean, unit variance and are weakly correlated with average correlationρ. Then the ratio of the variance of l T Z, an estimated effect, to that assuming independence is approximately 1 −ρ +ρ(Σl i ) 2 /Σl 2 i . The worst case is where, if the correlations are all of the same sign, the l i are approximately equal and the ratio becomes 1 + (k − 1)ρ. Thus, the calculation of standard errors assuming independence is not unduly distorted so long as this ratio does not exceed 2, that is, the average correlation does not exceed about 1/k. If correlations of the order 0.05-0.1 are likely to be present, k of the order 10-20 is a reasonable choice, justifying the choice described here and in the previous paper.

(b) Arrangement randomization
Strong reassurance of the security of one's conclusions is given if, upon re-randomization of the arrangement of the variable indices in the cube, the outcome is relatively stable. An unstable outcome would most likely indicate that too severe a reduction has been used. The probabilistic properties set forth in §4 guarantee robustness to re-randomization under idealized conditions provided that the decision rules used are not too severe, and empirical experience with microarray data supports this for continuous responses. Some applications with binary outcome require caution. In particular, when v is very large, there may be an appreciable number of relatively low-dimensional sets of variables perfectly predicting the outcome. If a set of variables forms such a separating hyperplane, then so does any larger set. The likelihood would, in such a case, be theoretically unbounded and consequently all sets of k variables containing a primitive set of separating variables would be retained, leading to too mild a reduction.
The special features of the procedure with binary responses will not be explored in the present paper. Note, however, that agreement of the outcome with the lasso solution should not be expected with binary responses for the reasons outlined by Cox & Battey [3].

(c) Further remarks
The analysis discussed in the paper is intended to be largely exploratory. The object of the formal probabilistic discussion is to calibrate the procedure to show how it performs under idealized conditions. It is not aimed to justify an explicit probabilistically based assessment, such as a confidence coefficient, attached to a specific analysis. To develop such an assessment, a Bayesian analysis might be considered. For this, meaningful prior probabilities have to be attached to key