Royal Society Open Science
Open AccessResearch articles

On the linear in probability model for binary data

    Abstract

    The analysis of binary response data commonly uses models linear in the logistic transform of probabilities. This paper considers some of the advantages and disadvantages of simple least-squares estimates based on a linear representation of the probabilities themselves, this in particular sometimes allowing a more direct empirical interpretation of underlying parameters. A sociological study is used in illustration.

    1. Introduction

    The interpretation of data in the form of binary outcomes arises in many areas of science from the primary physical and biological sciences and their application through to more directly applied areas and the social sciences.

    Two distinct themes in the analysis of binary data go back at least to the beginning of the twentieth century with the contrast between Karl Pearson who, in his biserial correlation coefficient, treated a pair of possibly related binary variables as derived from an unobserved bivariate normally distributed variable, and Yule who worked directly with observed proportions of outcomes. When the hypothesized latent variables have a tangible interpretation, as in quantal bioassays, the former approach is preferable, but in the present paper we consider only situations in which observed proportions of outcomes are represented directly and relations concerning them interpreted.

    Suppose that for n independent individuals, we observe a realization of a binary outcome variable Yi (1 ≤ in) taking values 1 or −1, and that for individual i there is a p × 1 vector xi of explanatory variables. A widely used representation is the linear logistic form in which log{pr(Yi = 1)/pr(Yi = −1)} is assumed to depend linearly on xi. This leads to a simple interpretation of regression coefficients as ratios of effects when the binary responses are concentrated at one of the two levels but otherwise the interpretation is less direct. For a discussion from a sociological perspective of the difficulties of interpreting logistic coefficients, see [1] and, for a wide-ranging review, see [2].

    The linear in probability model to be considered in the present paper specifies the probabilities as linear functions of the explanatory variables, that is for y = −1, 1 and with xi typically including a constant term

    pr(Yi=y)=pβ(y)=12(1+yβTxi),1.1
    so that E(Yi) = βTxi. There are implicit restrictions on the parameter space, namely that for all data x, |βTx| ≤ 1.

    If both the linear in probability and linear logistic models give adequate fit, the former has the advantage that the linear regression coefficients have a clearer operational interpretation in terms of numbers of individuals potentially influenced by a unit change of an explanatory variable. Emphasis sometimes lies on testing the significance of individual effects and comparison of their relative magnitudes. For this, the exponential family form of the linear logistic model [3,4] brings substantial simplification and other advantages. Furthermore, the logistic dependence has the potential to apply over a wide range of future conditions excluded by the positivity constraints on the linear form.

    The discussion highlights a context in which maximum-likelihood estimation is very sensitive to aberrant observations, whereas ordinary least squares is insensitive yet typically achieves high efficiency.

    A limiting case which sharply illustrates these distinctions concerns the comparison of data (Y1, Y2) formed from counts of events from two Poisson processes of rates, say, ρ1 and ρ1ψ or ρ1 and ρ1 + θ for the multiplicative and additive representations, respectively. That is, Y2 represents either a multiplication of the baseline rate by a constant or the addition of a separate signal. The former model falls within the exponential family of distributions and leads to an analysis based on a 2 × 2 contingency table. The second calls for a different analysis based on large-sample maximum-likelihood theory. For a further discussion concerning a similar model for Poisson variables, see [5].

    2. Inferential aspects

    2.1. Second-moment theory

    We now consider properties of the linear in probability model based only on first and second moments. First, we define the least-squares estimate of β by projecting the vector Y = (Y1, …, Yn)T orthogonally onto the space spanned by the columns of x, thus giving

    β^OLS=(xTx)1xTY.
    In the present context, x is a matrix whose ith row is xiT. The estimate is unbiased but does not have second-moment optimality unless β = 0 because the components of Y in general do not have equal variance. Nor is the covariance matrix of the estimates given by the standard formulae unless β is small.

    In fact

    var(β^OLS)=Σβ=(xTx)1(xTx)1xTΔx(xTx)1,2.1
    where Δ=diag(xiTβ)2. One simple and often satisfactory estimate of the covariance matrix of β^OLS is to replace Δ by Δ^ in which β is replaced by β^OLS.

    A more elaborate second moment approach is to replace β^OLS by a weighted least-squares estimate β^WLS in which var(Yi) is estimated as 1(xiTβ^OLS)2. Since 1(xiTβ)2 is not bounded away from zero, weighted least squares is inappropriate as a general method.

    The calculation of approximate confidence intervals and significance tests may be based on the asymptotic normality of β^OLS.

    2.2. Maximum-likelihood estimation

    The log likelihood corresponding to (1.1) is

    (β)=log(1+xiTβYi)2.2
    provided that for all i, 1<xiTβ<1. We return to the relevance of this condition later. A stationary value of the log likelihood occurs where
    xiYi1+xiTβ^MLYi=0.
    If 1/(1 + a) is expanded as 1 − a and higher terms neglected, that is the regression assumed small, the least-squares estimate β^OLS is recovered.

    There is a strong argument for using ordinary least squares rather than maximum likelihood in this context despite sufficiency of pβ^ML under model (1.1). In the present context, the two estimators are virtually equivalent in terms of their efficiency, while maximum likelihood suffers extreme fragility, as explained below.

    There is the following expansion of the second derivative of ℓ(β), valid for small xiTβ,

    ββ(β)=xixiTYi2(1+xiTβYi)2=ixixiT(12xiTβYi+3(xiTβ)2)+O{(xiTβ)3}.
    Here ββ denotes the matrix of second partial derivatives with respect to β. On taking expectations, an approximation to the asymptotic variance of the maximum-likelihood estimator is obtained as {xT(I + Δ)x}−1. For comparison to (2.1), it is more convenient to work with {xT(I − Δ)−1x}−1, which is a lower bound for {xT(I + Δ)x}−1. Using the geometric series expansion (IΔ)1= I+Δ+Δ2+=I+Υ, say, and the formula
    (A+BC)1=A1A1B(I+CA1B)1CA1,2.3
    we write, with A = xTx, B = I and C=xTΥx in (2.3) and M={I+(xTΥx)(xTx)1}1,
    var(β^ML)=(xTx)1{IM(xTΥx)(xTx)1}.2.4
    Because MI, where the notation AB means that AB is a negative definite matrix, the inflation in variance from using β^OLS rather than β^ML is
    (xTx)1{(MI)xTΔx(xTx)1+MxT(ΥΔ)x(xTx)1}(xTx)1xT(ΥΔ)x(xTx)1.
    Write δi = βTxi. From the geometric series, we deduce that
    ΥΔ=diag{δ14(1δ12),,δn4(1δn2)}.
    Thus var(β^OLS)var(β^ML)=O(n1max{δi4/(1δi2)}) showing that the loss in efficiency is typically very small.

    On the other hand, from the perspective of formal likelihood theory even one individual out of range, in the sense that |βTxi| > 1, would refute the parameter value in question. That is, maximum likelihood is extremely sensitive in the present context to observations measured with error or drawn from a model even slightly different from that postulated. Ordinary least squares is by contrast relatively unaffected by such anomalies.

    2.3. Interpretation of analysis

    The interpretation of the regression coefficients in the linear in probability model is similar to that in a normal theory linear regression model. Let x* and x** be two different vectors of covariate information, differing by 1 unit in variable j and otherwise the same. The number of positive outcomes is S=iZi where Zi = (Yi + 1)/2. Therefore, the hypothetical change in E(S) for a hypothetical replacement of m individuals who differ by one unit in the jth component but are otherwise the same is

    i=1m{E(Zix)E(Zix)}=mβj2.
    If there are binary covariates, it is natural to code them as {−1, 1}, in which case division of two is not needed because a unit change in the level corresponds to a numerical difference of two units.

    If, upon fitting the linear in probability model, it is found that the number of least-squares fitted values xiTβ^OLS outside [−1, 1] is appreciably larger than could be attributed to chance under the linear in probability model, some doubt would be cast upon the plausibility of the model. The expected number out of range, assuming that the linear in probability model is valid for all observations, is λ=ipr(|xiTβ^OLS|>1)=ipi where, by the asymptotic normality of β^OLSβ,

    piΦ{1+βTxi(xiTΣβxi)}+Φ{1βTxi(xiTΣβxi)}(n).
    Thus, a predicted number of out of range values is an estimate of λ, obtained by replacing β and Σβ by estimates in the expression for each pi. A crude lower bound on the variance of the sum, R, of out of range values is λ, obtained by incorrectly assuming that R is approximately Poisson distributed for large n. The variance of R is larger than λ due to dependence between the summands, induced by β^OLS. In particular,
    var(R)=ipi(1pi)+ij{pr(|β^OLSTxi|>1,|β^OLSTxj|>1)pipj}.2.5
    Write
    Zi=(β^OLSβ)Txi(xiTΣβxi),zi=1βTxi(xiTΣβxi),
    so that Zi and Zj are bivariate normally distributed of zero means, unit variances and correlation coefficient
    ρij=xiTΣβxj(xiTΣβxi)(xjTΣβxj).
    Then pr(|β^OLSTxi|>1,|β^OLSTxj|>1) is the sum of the quadrant probabilities,
    pr(Zi>zi,Zj>zj)=Φ(zi)ziΦ{ρijszj(1ρij2)}Φ(s)ds,pr(Zi<zi,Zj<zj)=Φ(zi)ziΦ{zj+ρijs(1ρij2)}Φ(s)ds,pr(Zi>zi,Zj<zj)=Φ(zi)ziΦ{zjρijs(1ρij2)}Φ(s)dsandpr(Zi<zi,Zj>zj)=Φ(zj)zjΦ{ziρijs(1ρij2)}Φ(s)ds.
    While there is no closed-form expression for these, close approximations are obtained by replacing the conditional expectations of the functions of interest by the corresponding functions of the conditional expectations, with approximation error established by Taylor series expansion. Depending on the signs of zi, zj and ρij, the approximation so obtained might be improved by interchanging the roles of zi and zj on the right-hand side of the above display. For a further discussion, see [6].

    3. Socio-economic inequalities in educational attainment

    We use US data from the National Longitudinal Study of Youth (1979), a nationally representative longitudinal study of people aged 14–22. Our binary outcome, coded as {−1, 1}, specifies whether the individual enrolled in a 4-year-degree-granting institution for at least 1 year. There are five potential explanatory variables. Ability is measured as the respondent’s score on the Armed Forces Qualifying Test, administered to all respondents in the 1981 wave of the survey. Family income in childhood is measured as the log of total net family income in 1979. All respondents identified themselves as male or female but race was measured via interviewer observation, and we here limit our sample to those respondents who were classified as black or non-black and non-Hispanic. Finally, we include an indicator of whether respondents were living with at least one parent at the time of the first survey.

    As is common with extensive observational data, some observations on explanatory variables are missing, as shown in table 1. Because we are concerned with the dependence of outcome on explanatory variables, individuals with missing outcome are treated as uninformative about that dependence. A sensitivity analysis examined how the regression coefficients of interest changed when rather extreme assignments were made to the three explanatory variables with missing values, treating binary variables as all at one or other extreme and continuous variables as at their upper and lower quartile. The levels used were 68.33 and 17.28 for the Armed Forces Qualifying Test score and 10.00 and 8.79 for the logarithm of family income when the individual was in childhood. Estimates from the eight patterns of missingness are in table 2. While there is some dependence on the missing values, that dependence is very minor and without qualitative impact on the conclusions of the analysis. If a larger number of explanatory variables have missing values the sensitivity analysis should be based on a suitable fraction of the two-level factorial system of potential missing values, allowing estimation of main effects from missingness [7, §12.2].

    Table 1. Summary of data.

    covariate description sample range per cent missing
    x1 gender {1 = male, −1 = female} 0
    x2 AFQT score percentage (0–100) 4.3
    x3 log income continuous (3.00–11.23) 51.2
    x4 race {1 = black, −1 = non-black/non-Hispanic} 0
    x5 lives with parent {1 = yes, −1 = no} 5.1

    Table 2. Sensitivity analysis of least squares estimates and their estimated standard errors from replacing all missing values of xj by high and low levels. The estimated standard errors are obtained by replacing Δ by Δ^ in equation (2.1). The sample size is 9043.

    least squares estimates of regression coefficients (estimated standard errors)
    x2 x3 x5 β^0 β^1 β^2 β^3 β^4 β^5 number out of range predicted number out of range
    L H H −1.51 (0.13) −0.061 (0.0092) 0.0201 (0.00031) 0.064 (0.011) 0.224 (0.011) −0.034 (0.011) 394 396
    L H L −1.51 (0.13) −0.062 (0.0092) 0.0202 (0.00031) 0.063 (0.014) 0.223 (0.011) −0.021 (0.010) 383 388
    L L H −1.32 (0.12) −0.060 (0.0092) 0.0202 (0.00031) 0.048 (0.014) 0.222 (0.011) −0.038 (0.011) 384 391
    L L L −1.31 (0.12) −0.061 (0.0092) 0.0203 (0.00031) 0.046 (0.014) 0.221 (0.011) −0.025 (0.011) 377 384
    H H H −1.57 (0.13) −0.065 (0.0093) 0.0198 (0.00033) 0.068 (0.014) 0.225 (0.011) −0.028 (0.011) 444 441
    H H L −1.57 (0.13) −0.067 (0.0094) 0.0198 (0.00032) 0.066 (0.014) 0.223 (0.011) −0.011 (0.010) 434 436
    H L H −1.45 (0.13) −0.065 (0.0094) 0.0198 (0.00033) 0.059 (0.014) 0.224 (0.011) −0.034 (0.012) 451 450
    H L L −1.44 (0.13) −0.066 (0.0094) 0.0199 (0.00033) 0.056 (0.014) 0.222 (0.011) −0.017 (0.011) 453 443
    max absolute difference 0.23 0.0061 0.00050 0.022 0.0040 0.026

    The sensitivity analysis used here may be contrasted with procedures of multiple imputation based on the untestable assumption that observations are missing at random.

    An informal preliminary analysis involved tests for interactions and inspection of interaction plots. None was strongly suggested. Table 2 reports least squares estimates of regression coefficients and their estimated standard errors from a model with main effects for the five explanatory variables.

    The suggestion is that hypothetically increasing the number of males and correspondingly reducing the number of females in the population by m units, say, would correspond to a 6–7% of m decrease in the expected number of individuals receiving higher education, all other things equal. The coefficient of the race variable is similarly interpreted, the suggestion being that in a hypothetical population, demographically equivalent to the one under study except for having m more black children than white children, the expected number of individuals experiencing the positive outcome would be 22–23% higher.

    It is suggested, all other things being equal, that a 1% increase in family income, i.e. an increase of 0.01 in log family income, would correspond to a 0.02–0.03% increase in the expected number of positive outcomes and that a 1% increase in ability, to the extent that it can be measured by the Armed Forces Qualifying Test score, would correspond to a 1% increase. An absolute change at the bottom of the income scale has a relatively greater effect than the same absolute change at the top. Finally, accounting for other factors, individuals living with someone other than one of their parents are perhaps slightly more likely to experience the positive outcome, although the evidence for this is rather weak.

    In the above interpretation of the estimated coefficients on the continuous variables, division by 2 is needed, as described in §2.3. Division by 2 is not needed for the three binary explanatory variables because they are coded as {−1, 1}.

    The last two columns of table 2 show the actual and predicted number of least squares fitted values xiTβ^OLS that are outside [−1, 1]. The individuals whose fitted values are out of range are almost all at the two edges of the sample space for the Armed Forces Qualifying Test score.

    While the numerical values of the coefficient estimates from a linear logistic model are not comparable to those from a linear in probability model, the ratios of these coefficients are remarkably similar. The code for verifying this statement and the analysis of §3 is available as outlined in the data accessibility statement.

    4. Discussion

    As with other statistical methods care is needed especially when relatively complex data are involved. In the present context, a reasonable approach for general use is to base the analysis on β^OLS with the improved estimate of its covariance matrix, given by (2.1). Examination of model adequacy should include a check of the number of fitted values outside [−1, 1]. Do such values form a rationally identifiable subgroup to be analysed separately? Does their omission or exclusion materially affect the conclusions? Does the number of anomalous observations suggest major change to the whole analysis? A large number of anomalous observations may suggest that a model linear on the logit scale would be more appropriate.

    From the perspective of formal likelihood theory, even one individual out of range would refute the parameter value in question in the linear in probability model. Thus, the paper illustrates an empirical context in which the formal optimality of maximum-likelihood estimates is achieved only at the cost of extreme fragility. A formally slightly less efficient method is much to be preferred.

    Data accessibility

    The data and code used in the example, together with further details about the data source, can be accessed from the Royal Society's repository: https://rs.figshare.com.

    Authors' contributions

    H.S.B., D.R.C. and M.V.J. designed the research, performed the research and wrote the paper.

    Competing interests

    The authors confirm that there are no competing interests.

    Funding

    The work was supported by the UK Engineering and Physical Sciences Research Council under grant no. EP/P002757/1 and by the Russell Sage Foundation and the Andrew W. Mellon Foundation Fellowship at the Center for Advanced Study in the Behavioral Sciences, Stanford University.

    Acknowledgements

    We thank the referees for constructive comments.

    Footnotes

    Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.4479830.

    Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

    Comments