On the linear in probability model for binary data

The analysis of binary response data commonly uses models linear in the logistic transform of probabilities. This paper considers some of the advantages and disadvantages of simple least-squares estimates based on a linear representation of the probabilities themselves, this in particular sometimes allowing a more direct empirical interpretation of underlying parameters. A sociological study is used in illustration.


Introduction
The interpretation of data in the form of binary outcomes arises in many areas of science from the primary physical and biological sciences and their application through to more directly applied areas and the social sciences.
Two distinct themes in the analysis of binary data go back at least to the beginning of the twentieth century with the contrast between Karl Pearson who, in his biserial correlation coefficient, treated a pair of possibly related binary variables as derived from an unobserved bivariate normally distributed variable, and Yule who worked directly with observed proportions of outcomes. When the hypothesized latent variables have a tangible interpretation, as in quantal bioassays, the former approach is preferable, but in the present paper we consider only situations in which observed proportions of outcomes are represented directly and relations concerning them interpreted.
Suppose that for $n$ independent individuals, we observe a realization of a binary outcome variable $Y_i$ ($1 \le i \le n$) taking values $1$ or $-1$, and that for individual $i$ there is a $p \times 1$ vector $x_i$ of explanatory variables. A widely used representation is the linear logistic form in which $\log\{\mathrm{pr}(Y_i = 1)/\mathrm{pr}(Y_i = -1)\}$ is assumed to depend linearly on $x_i$. This leads to a simple interpretation of regression coefficients as ratios of effects when the binary responses are concentrated at one of the two levels but otherwise the interpretation is less direct. For a discussion from a sociological perspective of the difficulties of interpreting logistic coefficients, see [1] and, for a wide-ranging review, see [2].
The linear in probability model to be considered in the present paper specifies the probabilities as linear functions of the explanatory variables, that is, for $y = -1, 1$ and with $x_i$ typically including a constant term,
$$\mathrm{pr}(Y_i = y) = \tfrac{1}{2}(1 + y\,\beta^T x_i). \qquad (1.1)$$
There are implicit restrictions on the parameter space, namely that for all data $x$, $|\beta^T x| \le 1$. If both the linear in probability and linear logistic models give adequate fit, the former has the advantage that the linear regression coefficients have a clearer operational interpretation in terms of numbers of individuals potentially influenced by a unit change of an explanatory variable. Emphasis sometimes lies on testing the significance of individual effects and comparison of their relative magnitudes. For this, the exponential family form of the linear logistic model [3,4] brings substantial simplification and other advantages. Furthermore, the logistic dependence has the potential to apply over a wide range of future conditions excluded by the positivity constraints on the linear form.

© 2019 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.
The discussion highlights a context in which maximum-likelihood estimation is very sensitive to aberrant observations, whereas ordinary least squares is insensitive yet typically achieves high efficiency.
A limiting case which sharply illustrates these distinctions concerns the comparison of data $(Y_1, Y_2)$ formed from counts of events from two Poisson processes of rates, say, $\rho_1$ and $\rho_1\psi$ or $\rho_1$ and $\rho_1 + \theta$ for the multiplicative and additive representations, respectively. That is, $Y_2$ represents either a multiplication of the baseline rate by a constant or the addition of a separate signal. The former model falls within the exponential family of distributions and leads to an analysis based on a $2 \times 2$ contingency table. The second calls for a different analysis based on large-sample maximum-likelihood theory. For a further discussion concerning a similar model for Poisson variables, see [5].

Inferential aspects

Second-moment theory
We now consider properties of the linear in probability model based only on first and second moments. First, we define the least-squares estimate of $\beta$ by projecting the vector $Y = (Y_1, \ldots, Y_n)^T$ orthogonally onto the space spanned by the columns of $x$, thus giving
$$\hat\beta_{OLS} = (x^T x)^{-1} x^T Y.$$
In the present context, $x$ is a matrix whose $i$th row is $x_i^T$. The estimate is unbiased but does not have second-moment optimality unless $\beta = 0$ because the components of $Y$ in general do not have equal variance. Nor is the covariance matrix of the estimates given by the standard formulae unless $\beta$ is small.
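In fact, since $\mathrm{var}(Y_i) = 1 - (x_i^T\beta)^2$,
$$\mathrm{cov}(\hat\beta_{OLS}) = (x^T x)^{-1} x^T (I - D)\, x\, (x^T x)^{-1}, \qquad (2.1)$$
where $D = \mathrm{diag}\{(x_i^T\beta)^2\}$. One simple and often satisfactory estimate of the covariance matrix of $\hat\beta_{OLS}$ is to replace $D$ by $\hat D$ in which $\beta$ is replaced by $\hat\beta_{OLS}$.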
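A more elaborate second-moment approach is to replace $\hat\beta_{OLS}$ by a weighted least-squares estimate
$$\hat\beta_{WLS} = \{x^T (I - \hat D)^{-1} x\}^{-1} x^T (I - \hat D)^{-1} Y,$$
where $I - \hat D$ is the diagonal matrix of estimated variances $1 - (x_i^T\hat\beta_{OLS})^2$. Because $1 - (x_i^T\beta)^2$ is not bounded away from zero, weighted least squares is inappropriate as a general method.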
The calculation of approximate confidence intervals and significance tests may be based on the asymptotic normality of $\hat\beta_{OLS}$.
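The second-moment calculations above can be sketched in a few lines of Python. This is a minimal illustration for an intercept plus a single covariate, assuming that the covariance estimate takes the sandwich form $(x^Tx)^{-1}x^T(I-\hat D)x(x^Tx)^{-1}$ implied by $\mathrm{var}(Y_i) = 1 - (x_i^T\beta)^2$, with squared fitted values on the diagonal of $\hat D$. The function name `ols_linear_probability`, the simulated data and the parameter values $(0.1, 0.5)$ are hypothetical and are not taken from the study reported in this paper.

```python
import random


def ols_linear_probability(xs, ys):
    """OLS fit of y in {-1, 1} on (1, x) with a sandwich covariance estimate.

    Returns (beta_hat, cov) where
    cov = (X'X)^{-1} X'(I - D_hat) X (X'X)^{-1}
    and D_hat is diagonal with entries (fitted value)^2.
    """
    n = len(xs)
    # Entries of the 2x2 matrix X'X and of the vector X'y.
    s0, s1, s2 = float(n), sum(xs), sum(x * x for x in xs)
    t0, t1 = sum(ys), sum(x * y for x, y in zip(xs, ys))
    det = s0 * s2 - s1 * s1
    inv = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]  # (X'X)^{-1}
    beta = [inv[0][0] * t0 + inv[0][1] * t1,
            inv[1][0] * t0 + inv[1][1] * t1]
    # Estimated variances of the Y_i: 1 - (fitted value)^2.
    w = [1.0 - (beta[0] + beta[1] * x) ** 2 for x in xs]
    m0 = sum(w)
    m1 = sum(wi * x for wi, x in zip(w, xs))
    m2 = sum(wi * x * x for wi, x in zip(w, xs))
    mid = [[m0, m1], [m1, m2]]  # X'(I - D_hat)X

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    cov = matmul(matmul(inv, mid), inv)
    return beta, cov


# Simulated data under hypothetical parameter values (0.1, 0.5).
random.seed(1)
true_beta = (0.1, 0.5)
xs = [random.uniform(-1.0, 1.0) for _ in range(5000)]
ys = [1 if random.random() < 0.5 * (1.0 + true_beta[0] + true_beta[1] * x) else -1
      for x in xs]
beta_hat, cov = ols_linear_probability(xs, ys)
print("estimate:", beta_hat)
print("standard errors:", [cov[0][0] ** 0.5, cov[1][1] ** 0.5])
```

Because the estimated variances $1 - (x_i^T\hat\beta_{OLS})^2$ never exceed one, the standard errors from this form are no larger than those from the usual equal-variance formula with unit error variance.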

Maximum-likelihood estimation
The log likelihood corresponding to (1.1) is, up to an additive constant,
$$\ell(\beta) = \sum_{i=1}^n \log(1 + y_i x_i^T\beta),$$
provided that for all $i$, $-1 < x_i^T\beta < 1$. We return to the relevance of this condition later. A stationary value of the log likelihood occurs where
$$\sum_{i=1}^n \frac{y_i x_i}{1 + y_i x_i^T\beta} = 0.$$
If $1/(1 + a)$ is expanded as $1 - a$ and higher terms neglected, that is the regression assumed small, the equation becomes $x^T Y - x^T x\beta = 0$ because $y_i^2 = 1$, and the least-squares estimate $\hat\beta_{OLS}$ is recovered. There is a strong argument for using ordinary least squares rather than maximum likelihood in this context despite sufficiency of $\hat\beta_{ML}$ under model (1.1). In the present context, the two estimators are virtually equivalent in terms of their efficiency, while maximum likelihood suffers extreme fragility, as explained below.

royalsocietypublishing.org/journal/rsos R. Soc. open sci. 6: 190067
There is the following expansion of the second derivative of $\ell(\beta)$, valid for small $x_i^T\beta$,
$$-\nabla_{\beta\beta}\,\ell(\beta) = \sum_{i=1}^n x_i x_i^T \{1 - 2 y_i x_i^T\beta + 3 (x_i^T\beta)^2 - \cdots\}.$$
Here $\nabla_{\beta\beta}$ denotes the matrix of second partial derivatives with respect to $\beta$. On taking expectations, an approximation to the asymptotic variance of the maximum-likelihood estimator is obtained as $\{x^T(I + D)x\}^{-1}$, where $D$ is the diagonal matrix with entries $(x_i^T\beta)^2$. For comparison to (2.1), it is more convenient to work with $\{x^T(I - D)^{-1}x\}^{-1}$, which is a lower bound for $\{x^T(I + D)x\}^{-1}$. Using the geometric series expansion $(I - D)^{-1} = I + D + D^2 + \cdots$ […]. Because $M \preceq I$, where the notation $A \preceq B$ means that $A - B$ is a negative definite matrix, the inflation in variance from using $\hat\beta_{OLS}$ rather than $\hat\beta_{ML}$ is […], showing that the loss in efficiency is typically very small.
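The size of the efficiency loss can be checked numerically. The sketch below compares, for an intercept plus one covariate, the diagonal entries of the least-squares covariance $(x^Tx)^{-1}x^T(I-D)x(x^Tx)^{-1}$ with those of $\{x^T(I-D)^{-1}x\}^{-1}$, taking $D$ to have diagonal entries $(x_i^T\beta)^2$. The equally spaced design on $[-1, 1]$, the parameter value $(0.1, 0.5)$ and all function names are hypothetical choices made only for illustration.

```python
def inv2(a):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def weighted_gram(xs, weights):
    """Return X'WX for design rows (1, x) and diagonal weights W."""
    g00 = sum(weights)
    g01 = sum(w * x for w, x in zip(weights, xs))
    g11 = sum(w * x * x for w, x in zip(weights, xs))
    return [[g00, g01], [g01, g11]]


def variance_inflation(xs, beta):
    """Ratio var(OLS) / var(ML) for the intercept and the slope."""
    d = [(beta[0] + beta[1] * x) ** 2 for x in xs]  # diagonal of D
    gram_inv = inv2(weighted_gram(xs, [1.0] * len(xs)))
    v_ols = matmul(matmul(gram_inv, weighted_gram(xs, [1.0 - di for di in d])),
                   gram_inv)
    v_ml = inv2(weighted_gram(xs, [1.0 / (1.0 - di) for di in d]))
    return [v_ols[j][j] / v_ml[j][j] for j in range(2)]


# Equally spaced covariate values on [-1, 1]; fitted means lie in [-0.4, 0.6].
xs = [-1.0 + 2.0 * k / 200 for k in range(201)]
ratios = variance_inflation(xs, (0.1, 0.5))
print(ratios)
```

In this configuration both ratios exceed one by only a small amount, which is in line with the statement that the loss in efficiency is typically very small; designs with fitted means closer to $\pm 1$ would show a larger inflation.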
On the other hand, from the perspective of formal likelihood theory even one individual out of range, in the sense that $|\beta^T x_i| > 1$, would refute the parameter value in question. That is, maximum likelihood is extremely sensitive in the present context to observations measured with error or drawn from a model even slightly different from that postulated. Ordinary least squares is by contrast relatively unaffected by such anomalies.
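The contrast between the two methods can be illustrated with a small simulation. In this sketch the log likelihood is taken to be $\sum_i \log\{\tfrac12(1 + y_i x_i^T\beta)\}$ and is set to $-\infty$ whenever some linear predictor falls outside $[-1, 1]$. The data, the parameter value $(0.1, 0.5)$, the single aberrant covariate value $2.5$ and the function names are hypothetical.

```python
import math
import random


def log_likelihood(beta, xs, ys):
    """Log likelihood of the linear in probability model with rows (1, x).

    Returns -inf when any linear predictor lies outside [-1, 1], or when an
    observed outcome has probability zero, so that beta is refuted.
    """
    total = 0.0
    for x, y in zip(xs, ys):
        eta = beta[0] + beta[1] * x
        prob = 0.5 * (1.0 + y * eta)
        if abs(eta) > 1.0 or prob <= 0.0:
            return -math.inf
        total += math.log(prob)
    return total


def ols(xs, ys):
    """Least-squares intercept and slope for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return (my - slope * mx, slope)


random.seed(2)
beta = (0.1, 0.5)  # hypothetical parameter value
xs = [random.uniform(-1.0, 1.0) for _ in range(2000)]
ys = [1 if random.random() < 0.5 * (1.0 + beta[0] + beta[1] * x) else -1
      for x in xs]

before = (log_likelihood(beta, xs, ys), ols(xs, ys))
# One aberrant individual whose linear predictor 0.1 + 0.5 * 2.5 = 1.35 exceeds 1.
xs_bad, ys_bad = xs + [2.5], ys + [1]
after = (log_likelihood(beta, xs_bad, ys_bad), ols(xs_bad, ys_bad))
print("log likelihood at beta:", before[0], "->", after[0])
print("OLS estimate:", before[1], "->", after[1])
```

A single additional observation is enough to send the log likelihood at the generating parameter value to $-\infty$, whereas the least-squares estimate moves only slightly.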

Interpretation of analysis
The interpretation of the regression coefficients in the linear in probability model is similar to that in a normal theory linear regression model. Let $x^*$ and $x^{**}$ be two different vectors of covariate information, differing by 1 unit in variable $j$ and otherwise the same. The number of positive outcomes is
$$S = \sum_{i=1}^n \tfrac{1}{2}(1 + Y_i), \qquad E(S) = \tfrac{1}{2}\sum_{i=1}^n (1 + x_i^T\beta).$$
Therefore, the hypothetical change in $E(S)$ for a hypothetical replacement of $m$ individuals who differ by one unit in the $j$th component but are otherwise the same is
$$\tfrac{1}{2} m \beta_j.$$
If there are binary covariates, it is natural to code them as $\{-1, 1\}$, in which case division by two is not needed because a unit change in the level corresponds to a numerical difference of two units.

If, upon fitting the linear in probability model, it is found that the number of least-squares fitted values $x_i^T\hat\beta_{OLS}$ outside $[-1, 1]$ is appreciably larger than could be attributed to chance under the linear in probability model, some doubt would be cast upon the plausibility of the model. The expected number out of range, assuming that the linear in probability model is valid for all observations, is $\lambda = \sum_i \mathrm{pr}(|x_i^T\hat\beta_{OLS}| > 1) = \sum_i p_i$ where, by the asymptotic normality of $\hat\beta_{OLS} - \beta$, each $p_i$ is approximated by the probability that a normal variable with mean $x_i^T\beta$ and variance $x_i^T V x_i$ lies outside $[-1, 1]$, $V$ denoting the covariance matrix of $\hat\beta_{OLS}$. Writing $R$ for the number of out-of-range fitted values, a first approximation to the variance of $R$ is $\lambda$, obtained by incorrectly assuming that $R$ is approximately Poisson distributed for large $n$. The variance of $R$ is larger than $\lambda$ due to dependence between the summands, induced by $\hat\beta_{OLS}$. In particular, the standardized fitted values $Z_i$ and $Z_j$ are, asymptotically, bivariate normally distributed with zero means, unit variances and correlation coefficient $\rho_{ij} = x_i^T V x_j / \{(x_i^T V x_i)(x_j^T V x_j)\}^{1/2}$. Then $\mathrm{pr}(|\hat\beta_{OLS}^T x_i| > 1, |\hat\beta_{OLS}^T x_j| > 1)$ is the sum of the quadrant probabilities, each an integral over the standard normal distribution […].
While there is no closed-form expression for these, close approximations are obtained by replacing the conditional expectations of the functions of interest by the corresponding functions of the conditional expectations, with approximation error established by Taylor series expansion. Depending on the signs of $z_i$, $z_j$ and $\rho_{ij}$, the approximation so obtained might be improved by interchanging the roles of $z_i$ and $z_j$ on the right-hand side of the above display. For a further discussion, see [6].
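The expected number of out-of-range fitted values is straightforward to evaluate once a mean and a standard deviation are available for each fitted value. The following sketch assumes the normal approximation $x_i^T\hat\beta_{OLS} \sim N(\mu_i, \sigma_i^2)$, with $\mu_i = x_i^T\beta$ and $\sigma_i^2$ the variance of the fitted value; in practice both would be replaced by estimates. The function names and the three illustrative (mean, standard deviation) pairs are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_out_of_range(means, sds):
    """Sum over i of pr(|N(mean_i, sd_i^2)| > 1)."""
    lam = 0.0
    for mu, sd in zip(means, sds):
        below = norm_cdf((-1.0 - mu) / sd)       # pr(fitted value < -1)
        above = 1.0 - norm_cdf((1.0 - mu) / sd)  # pr(fitted value > 1)
        lam += below + above
    return lam


# Three hypothetical individuals: one central, two near the boundaries.
means = [0.0, 0.9, -0.95]
sds = [0.05, 0.05, 0.05]
lam = expected_out_of_range(means, sds)
print(lam)
```

Only individuals whose mean lies within a few standard deviations of $\pm 1$ contribute appreciably to the sum, so the comparison of observed and expected counts is driven by the edges of the covariate space.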

Socio-economic inequalities in educational attainment
We use US data from the National Longitudinal Survey of Youth (1979), a nationally representative longitudinal study of people aged 14-22. Our binary outcome, coded as $\{-1, 1\}$, specifies whether the individual enrolled in a 4-year-degree-granting institution for at least 1 year. There are five potential explanatory variables. Ability is measured as the respondent's score on the Armed Forces Qualifying Test, administered to all respondents in the 1981 wave of the survey. Family income in childhood is measured as the log of total net family income in 1979. All respondents identified themselves as male or female but race was measured via interviewer observation, and we here limit our sample to those respondents who were classified as black or non-black and non-Hispanic. Finally, we include an indicator of whether respondents were living with at least one parent at the time of the first survey. The sensitivity analysis used here may be contrasted with procedures of multiple imputation based on the untestable assumption that observations are missing at random.
An informal preliminary analysis involved tests for interactions and inspection of interaction plots. None was strongly suggested. Table 2 reports least squares estimates of regression coefficients and their estimated standard errors from a model with main effects for the five explanatory variables.
The suggestion is that hypothetically increasing the number of males and correspondingly reducing the number of females in the population by $m$ units, say, would correspond to a decrease of 6-7% of $m$ in the expected number of individuals receiving higher education, all other things equal. The coefficient of the race variable is similarly interpreted, the suggestion being that in a hypothetical population, demographically equivalent to the one under study except for having $m$ more black children than white children, the expected number of individuals experiencing the positive outcome would be 22-23% higher.
It is suggested, all other things being equal, that a 1% increase in family income, i.e. an increase of 0.01 in log family income, would correspond to a 0.02-0.03% increase in the expected number of positive outcomes and that a 1% increase in ability, to the extent that it can be measured by the Armed Forces Qualifying Test score, would correspond to a 1% increase. An absolute change at the bottom of the income scale has a relatively greater effect than the same absolute change at the top. Finally, accounting for other factors, individuals living with someone other than one of their parents are perhaps slightly more likely to experience the positive outcome, although the evidence for this is rather weak.
In the above interpretation of the estimated coefficients on the continuous variables, division by 2 is needed, as described in §2.3. Division by 2 is not needed for the three binary explanatory variables because they are coded as $\{-1, 1\}$.
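The two conventions can be reconciled in one line. The following worked step assumes the model form $\mathrm{pr}(Y_i = 1) = \tfrac{1}{2}(1 + x_i^T\beta)$ and writes $\beta_j$ for the coefficient of the $j$th explanatory variable and $\Delta x_j$ for the change in that variable, all other components being held fixed:

```latex
\begin{align*}
\text{continuous } x_j,\ \Delta x_j = 1: \quad
  & \Delta\,\mathrm{pr}(Y_i = 1) = \tfrac{1}{2}\,\beta_j\,\Delta x_j = \tfrac{1}{2}\,\beta_j, \\
\text{binary } x_j \in \{-1, 1\},\ \Delta x_j = 2: \quad
  & \Delta\,\mathrm{pr}(Y_i = 1) = \tfrac{1}{2}\,\beta_j\,\Delta x_j = \beta_j .
\end{align*}
```

A switch of level in a binary variable therefore changes the probability of a positive outcome by the coefficient itself, whereas a unit change in a continuous variable changes it by half the coefficient.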
The last two columns of table 2 show the actual and predicted number of least squares fitted values $x_i^T\hat\beta_{OLS}$ that are outside $[-1, 1]$. The individuals whose fitted values are out of range are almost all at the two edges of the sample space for the Armed Forces Qualifying Test score. While the numerical values of the coefficient estimates from a linear logistic model are not comparable to those from a linear in probability model, the ratios of these coefficients are remarkably similar. The code for verifying this statement and the analysis of §3 is available as outlined in the data accessibility statement.

Discussion
As with other statistical methods, care is needed, especially when relatively complex data are involved. In the present context, a reasonable approach for general use is to base the analysis on $\hat\beta_{OLS}$ with the improved estimate of its covariance matrix, given by (2.1). Examination of model adequacy should include a check of the number of fitted values outside $[-1, 1]$. Do such values form a rationally identifiable subgroup to be analysed separately? Does their omission or exclusion materially affect the conclusions? Does the number of anomalous observations suggest major change to the whole analysis? A large number of anomalous observations may suggest that a model linear on the logit scale would be more appropriate.
From the perspective of formal likelihood theory, even one individual out of range would refute the parameter value in question in the linear in probability model. Thus, the paper illustrates an empirical context in which the formal optimality of maximum-likelihood estimates is achieved only at the cost of extreme fragility. A formally slightly less efficient method is much to be preferred.

Table 2.