Abstract
The analysis of binary response data commonly uses models linear in the logistic transform of probabilities. This paper considers some of the advantages and disadvantages of simple least-squares estimates based on a linear representation of the probabilities themselves, this in particular sometimes allowing a more direct empirical interpretation of underlying parameters. A sociological study is used in illustration.
1. Introduction
The interpretation of data in the form of binary outcomes arises in many areas of science from the primary physical and biological sciences and their application through to more directly applied areas and the social sciences.
Two distinct themes in the analysis of binary data go back at least to the beginning of the twentieth century with the contrast between Karl Pearson who, in his biserial correlation coefficient, treated a pair of possibly related binary variables as derived from an unobserved bivariate normally distributed variable, and Yule who worked directly with observed proportions of outcomes. When the hypothesized latent variables have a tangible interpretation, as in quantal bioassays, the former approach is preferable, but in the present paper we consider only situations in which observed proportions of outcomes are represented directly and relations concerning them interpreted.
Suppose that for n independent individuals, we observe a realization of a binary outcome variable Yi (1 ≤ i ≤ n) taking values 1 or −1, and that for individual i there is a p × 1 vector xi of explanatory variables. A widely used representation is the linear logistic form in which log{pr(Yi = 1)/pr(Yi = −1)} is assumed to depend linearly on xi. This leads to a simple interpretation of regression coefficients as ratios of effects when the binary responses are concentrated at one of the two levels but otherwise the interpretation is less direct. For a discussion from a sociological perspective of the difficulties of interpreting logistic coefficients, see [1] and, for a wide-ranging review, see [2].
The linear in probability model to be considered in the present paper specifies the probabilities as linear functions of the explanatory variables, that is for y = −1, 1 and with xi typically including a constant term
If both the linear in probability and linear logistic models give adequate fit, the former has the advantage that the linear regression coefficients have a clearer operational interpretation in terms of numbers of individuals potentially influenced by a unit change of an explanatory variable. Emphasis sometimes lies on testing the significance of individual effects and comparison of their relative magnitudes. For this, the exponential family form of the linear logistic model [3,4] brings substantial simplification and other advantages. Furthermore, the logistic dependence has the potential to apply over a wide range of future conditions excluded by the positivity constraints on the linear form.
The discussion highlights a context in which maximum-likelihood estimation is very sensitive to aberrant observations, whereas ordinary least squares is insensitive yet typically achieves high efficiency.
A limiting case which sharply illustrates these distinctions concerns the comparison of data (Y1, Y2) formed from counts of events from two Poisson processes of rates, say, ρ1 and ρ1ψ or ρ1 and ρ1 + θ for the multiplicative and additive representations, respectively. That is, Y2 represents either a multiplication of the baseline rate by a constant or the addition of a separate signal. The former model falls within the exponential family of distributions and leads to an analysis based on a 2 × 2 contingency table. The second calls for a different analysis based on large-sample maximum-likelihood theory. For a further discussion concerning a similar model for Poisson variables, see [5].
2. Inferential aspects
2.1. Second-moment theory
We now consider properties of the linear in probability model based only on first and second moments. First, we define the least-squares estimate of β by projecting the vector Y = (Y1, …, Yn)T orthogonally onto the space spanned by the columns of x, thus giving
In fact
A more elaborate second moment approach is to replace by a weighted least-squares estimate in which var(Yi) is estimated as . Since is not bounded away from zero, weighted least squares is inappropriate as a general method.
The calculation of approximate confidence intervals and significance tests may be based on the asymptotic normality of .
2.2. Maximum-likelihood estimation
The log likelihood corresponding to (1.1) is
There is a strong argument for using ordinary least squares rather than maximum likelihood in this context despite sufficiency of under model (1.1). In the present context, the two estimators are virtually equivalent in terms of their efficiency, while maximum likelihood suffers extreme fragility, as explained below.
There is the following expansion of the second derivative of ℓ(β), valid for small ,
On the other hand, from the perspective of formal likelihood theory even one individual out of range, in the sense that |βTxi| > 1, would refute the parameter value in question. That is, maximum likelihood is extremely sensitive in the present context to observations measured with error or drawn from a model even slightly different from that postulated. Ordinary least squares is by contrast relatively unaffected by such anomalies.
2.3. Interpretation of analysis
The interpretation of the regression coefficients in the linear in probability model is similar to that in a normal theory linear regression model. Let x* and x** be two different vectors of covariate information, differing by 1 unit in variable j and otherwise the same. The number of positive outcomes is where Zi = (Yi + 1)/2. Therefore, the hypothetical change in E(S) for a hypothetical replacement of m individuals who differ by one unit in the jth component but are otherwise the same is
If, upon fitting the linear in probability model, it is found that the number of least-squares fitted values outside [−1, 1] is appreciably larger than could be attributed to chance under the linear in probability model, some doubt would be cast upon the plausibility of the model. The expected number out of range, assuming that the linear in probability model is valid for all observations, is where, by the asymptotic normality of ,
3. Socio-economic inequalities in educational attainment
We use US data from the National Longitudinal Study of Youth (1979), a nationally representative longitudinal study of people aged 14–22. Our binary outcome, coded as {−1, 1}, specifies whether the individual enrolled in a 4-year-degree-granting institution for at least 1 year. There are five potential explanatory variables. Ability is measured as the respondent’s score on the Armed Forces Qualifying Test, administered to all respondents in the 1981 wave of the survey. Family income in childhood is measured as the log of total net family income in 1979. All respondents identified themselves as male or female but race was measured via interviewer observation, and we here limit our sample to those respondents who were classified as black or non-black and non-Hispanic. Finally, we include an indicator of whether respondents were living with at least one parent at the time of the first survey.
As is common with extensive observational data, some observations on explanatory variables are missing, as shown in table 1. Because we are concerned with the dependence of outcome on explanatory variables, individuals with missing outcome are treated as uninformative about that dependence. A sensitivity analysis examined how the regression coefficients of interest changed when rather extreme assignments were made to the three explanatory variables with missing values, treating binary variables as all at one or other extreme and continuous variables as at their upper and lower quartile. The levels used were 68.33 and 17.28 for the Armed Forces Qualifying Test score and 10.00 and 8.79 for the logarithm of family income when the individual was in childhood. Estimates from the eight patterns of missingness are in table 2. While there is some dependence on the missing values, that dependence is very minor and without qualitative impact on the conclusions of the analysis. If a larger number of explanatory variables have missing values the sensitivity analysis should be based on a suitable fraction of the two-level factorial system of potential missing values, allowing estimation of main effects from missingness [7, §12.2].
covariate | description | sample range | per cent missing |
---|---|---|---|
x1 | gender | {1 = male, −1 = female} | 0 |
x2 | AFQT score | percentage (0–100) | 4.3 |
x3 | log income | continuous (3.00–11.23) | 51.2 |
x4 | race | {1 = black, −1 = non-black/non-Hispanic} | 0 |
x5 | lives with parent | {1 = yes, −1 = no} | 5.1 |
least squares estimates of regression coefficients (estimated standard errors) |
||||||||||
---|---|---|---|---|---|---|---|---|---|---|
x2 | x3 | x5 | number out of range | predicted number out of range | ||||||
L | H | H | −1.51 (0.13) | −0.061 (0.0092) | 0.0201 (0.00031) | 0.064 (0.011) | 0.224 (0.011) | −0.034 (0.011) | 394 | 396 |
L | H | L | −1.51 (0.13) | −0.062 (0.0092) | 0.0202 (0.00031) | 0.063 (0.014) | 0.223 (0.011) | −0.021 (0.010) | 383 | 388 |
L | L | H | −1.32 (0.12) | −0.060 (0.0092) | 0.0202 (0.00031) | 0.048 (0.014) | 0.222 (0.011) | −0.038 (0.011) | 384 | 391 |
L | L | L | −1.31 (0.12) | −0.061 (0.0092) | 0.0203 (0.00031) | 0.046 (0.014) | 0.221 (0.011) | −0.025 (0.011) | 377 | 384 |
H | H | H | −1.57 (0.13) | −0.065 (0.0093) | 0.0198 (0.00033) | 0.068 (0.014) | 0.225 (0.011) | −0.028 (0.011) | 444 | 441 |
H | H | L | −1.57 (0.13) | −0.067 (0.0094) | 0.0198 (0.00032) | 0.066 (0.014) | 0.223 (0.011) | −0.011 (0.010) | 434 | 436 |
H | L | H | −1.45 (0.13) | −0.065 (0.0094) | 0.0198 (0.00033) | 0.059 (0.014) | 0.224 (0.011) | −0.034 (0.012) | 451 | 450 |
H | L | L | −1.44 (0.13) | −0.066 (0.0094) | 0.0199 (0.00033) | 0.056 (0.014) | 0.222 (0.011) | −0.017 (0.011) | 453 | 443 |
max absolute difference | 0.23 | 0.0061 | 0.00050 | 0.022 | 0.0040 | 0.026 |
The sensitivity analysis used here may be contrasted with procedures of multiple imputation based on the untestable assumption that observations are missing at random.
An informal preliminary analysis involved tests for interactions and inspection of interaction plots. None was strongly suggested. Table 2 reports least squares estimates of regression coefficients and their estimated standard errors from a model with main effects for the five explanatory variables.
The suggestion is that hypothetically increasing the number of males and correspondingly reducing the number of females in the population by m units, say, would correspond to a 6–7% of m decrease in the expected number of individuals receiving higher education, all other things equal. The coefficient of the race variable is similarly interpreted, the suggestion being that in a hypothetical population, demographically equivalent to the one under study except for having m more black children than white children, the expected number of individuals experiencing the positive outcome would be 22–23% higher.
It is suggested, all other things being equal, that a 1% increase in family income, i.e. an increase of 0.01 in log family income, would correspond to a 0.02–0.03% increase in the expected number of positive outcomes and that a 1% increase in ability, to the extent that it can be measured by the Armed Forces Qualifying Test score, would correspond to a 1% increase. An absolute change at the bottom of the income scale has a relatively greater effect than the same absolute change at the top. Finally, accounting for other factors, individuals living with someone other than one of their parents are perhaps slightly more likely to experience the positive outcome, although the evidence for this is rather weak.
In the above interpretation of the estimated coefficients on the continuous variables, division by 2 is needed, as described in §2.3. Division by 2 is not needed for the three binary explanatory variables because they are coded as {−1, 1}.
The last two columns of table 2 show the actual and predicted number of least squares fitted values that are outside [−1, 1]. The individuals whose fitted values are out of range are almost all at the two edges of the sample space for the Armed Forces Qualifying Test score.
While the numerical values of the coefficient estimates from a linear logistic model are not comparable to those from a linear in probability model, the ratios of these coefficients are remarkably similar. The code for verifying this statement and the analysis of §3 is available as outlined in the data accessibility statement.
4. Discussion
As with other statistical methods care is needed especially when relatively complex data are involved. In the present context, a reasonable approach for general use is to base the analysis on with the improved estimate of its covariance matrix, given by (2.1). Examination of model adequacy should include a check of the number of fitted values outside [−1, 1]. Do such values form a rationally identifiable subgroup to be analysed separately? Does their omission or exclusion materially affect the conclusions? Does the number of anomalous observations suggest major change to the whole analysis? A large number of anomalous observations may suggest that a model linear on the logit scale would be more appropriate.
From the perspective of formal likelihood theory, even one individual out of range would refute the parameter value in question in the linear in probability model. Thus, the paper illustrates an empirical context in which the formal optimality of maximum-likelihood estimates is achieved only at the cost of extreme fragility. A formally slightly less efficient method is much to be preferred.
Data accessibility
The data and code used in the example, together with further details about the data source, can be accessed from the Royal Society's repository: https://rs.figshare.com.
Authors' contributions
H.S.B., D.R.C. and M.V.J. designed the research, performed the research and wrote the paper.
Competing interests
The authors confirm that there are no competing interests.
Funding
The work was supported by the UK Engineering and Physical Sciences Research Council under grant no. EP/P002757/1 and by the Russell Sage Foundation and the Andrew W. Mellon Foundation Fellowship at the Center for Advanced Study in the Behavioral Sciences, Stanford University.
Acknowledgements
We thank the referees for constructive comments.