Logistic Regression
Instructor: Yuqi Gu
Linear Regression Models, Fall 2022
Statistics GR 5205 004 / GU 4205 005
Columbia University

Binary Response Variable

In many regression applications, the response Y has only two possible qualitative outcomes:
– Financial status of a firm: sound status / headed toward insolvency
– Coronary heart disease status: has the disease / does not have the disease
– In a study of labor force participation of married women: married woman in the labor force / married woman not in the labor force

The constraint that the mean response must belong to [0, 1] rules out a linear response function.

Heart Disease Data

Response "chd": indicates whether the person has heart disease or not.
The men vary in height (in inches) and in the number of cigarettes (cigs) smoked per day.

> data(wcgs, package="faraway")
> summary(wcgs[,c("chd","height","cigs")])
  chd          height           cigs
 no :2897   Min.   :60.00   Min.   : 0.0
 yes: 257   1st Qu.:68.00   1st Qu.: 0.0
            Median :70.00   Median : 0.0
            Mean   :69.78   Mean   :11.6
            3rd Qu.:72.00   3rd Qu.:20.0
            Max.   :78.00   Max.   :99.0

Heart Disease Data

> plot(height ~ chd, wcgs)
> wcgs$y <- ifelse(wcgs$chd == "no", 0, 1)
> plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",
+      ylab="Heart Disease", pch=".")

Figure: Plots of the presence/absence of heart disease according to height.

Heart Disease Data

Goal: predict heart disease and explain the relationship between height, cigarette usage, and heart disease.
For the same height and cigs, both outcomes occur, so we model the probability of getting heart disease, P(Y = 1 | X), rather than Y itself.

Figure: Interleaved histograms of the distribution of heights and cigarette usage for men with and without heart disease.

Logistic Regression

Logistic regression defines the probability mass function
$$P(Y = 1 \mid X) = \frac{\exp(\beta^\top X)}{1 + \exp(\beta^\top X)},$$
which implies that
$$P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X) = \frac{1}{1 + \exp(\beta^\top X)},$$
where $X$ is a $(p + 1)$-dimensional vector with $X_0 \equiv 1$, and $\beta_0$ is the intercept.

Logistic Regression

Figure: $P(Y = 1 \mid X)$ and $P(Y = 0 \mid X)$ plotted as functions of $\beta^\top X$.

Logistic Regression

The logit function
$$\mathrm{logit}(x) = \log\left(\frac{x}{1 - x}\right)$$
maps the unit interval $(0, 1)$ to the entire real line $(-\infty, \infty)$.

The inverse logit function, or expit function,
$$\mathrm{expit}(x) = \mathrm{logit}^{-1}(x) = \frac{\exp(x)}{1 + \exp(x)}$$
maps the real line to the unit interval.

In logistic regression, the inverse logit function is used to map the linear predictor $\beta^\top X$ to the probability of $Y = 1$:
$$P(Y = 1 \mid X) = \mathrm{logit}^{-1}(\beta^\top X).$$

Logistic Regression

Geometric interpretation: a logistic regression fit based on two predictors can be represented by an S-shaped surface in 3D space.

Logistic Regression

The linear predictor in logistic regression is the conditional log odds:
$$\log\left[\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)}\right] = \beta^\top X = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.$$
Interpreting logistic regression: a one-unit increase in $X_j$ results in a change of $\beta_j$ in the (conditional) log odds.
Equivalently, a one-unit increase in $X_j$ results in a multiplicative change of $\exp(\beta_j)$ in the conditional odds.
$\exp(\beta_j)$ is also called the odds ratio, as it is the ratio of the two odds corresponding to two scenarios in which the values of $X_j$ differ by one unit.

Heart Disease Example

> lmod <- glm(chd ~ height + cigs, family = binomial, wcgs)
> summary(lmod)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.0041 -0.4425 -0.3630 -0.3499  2.4357
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.50161    1.84186  -2.444   0.0145 *
height       0.02521    0.02633   0.957   0.3383
cigs         0.02313    0.00404   5.724 1.04e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 1781.2 on 3153 degrees of freedom
Residual deviance: 1749.0 on 3151 degrees of freedom
AIC: 1755
Number of Fisher Scoring iterations: 5
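Since $\exp(\beta_j)$ is the odds ratio for a one-unit change in $X_j$, the fitted coefficients can be read on the odds scale directly. A minimal sketch, assuming lmod is the fit above:

```r
# Odds ratios implied by the fitted model
exp(coef(lmod))
# Each additional cigarette per day multiplies the odds of heart disease by
# about exp(0.0231) ~ 1.023; a pack a day (20 cigarettes) multiplies them by
exp(20 * coef(lmod)["cigs"])
```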
Heart Disease Example

(beta <- coef(lmod))
plot(jitter(y,0.1) ~ jitter(height), wcgs, xlab="Height",
     ylab="Heart Disease", pch=".")
curve(ilogit(beta[1] + beta[2]*x + beta[3]*0),  add=TRUE)
curve(ilogit(beta[1] + beta[2]*x + beta[3]*20), add=TRUE, lty=2)
plot(jitter(y,0.1) ~ jitter(cigs), wcgs, xlab="Cigarette Use",
     ylab="Heart Disease", pch=".")
curve(ilogit(beta[1] + beta[2]*60 + beta[3]*x), add=TRUE)
curve(ilogit(beta[1] + beta[2]*78 + beta[3]*x), add=TRUE, lty=2)

Heart Disease Example

Figure: Predicted probability of heart disease. Left: the solid line represents a nonsmoker, the dashed line a pack-a-day smoker. Right: the solid line represents a very short man (60 in.), the dashed line a very tall man (78 in.).

Latent variable model for logistic regression

It may make sense to view the binary outcome $Y$ as a dichotomization of a latent continuous outcome $Y_c$:
$$Y = I(Y_c \geq 0).$$
Suppose $Y_c \mid X$ follows a logistic distribution with CDF
$$F(y_c \mid X) = \frac{\exp(y_c - \beta^\top X)}{1 + \exp(y_c - \beta^\top X)}.$$
In this case, $Y \mid X$ follows the logistic regression model:
$$P(Y = 1 \mid X) = P(Y_c \geq 0 \mid X) = 1 - \frac{\exp(0 - \beta^\top X)}{1 + \exp(0 - \beta^\top X)} = \frac{\exp(\beta^\top X)}{1 + \exp(\beta^\top X)}.$$

Mean and variance relationship for logistic regression

Since $Y \mid X$ follows Bernoulli$(\mathrm{logit}^{-1}(\beta^\top X))$, the mean is
$$E[Y \mid X] = P(Y = 1 \mid X) = \frac{\exp(\beta^\top X)}{1 + \exp(\beta^\top X)},$$
and the variance is
$$\mathrm{Var}[Y \mid X] = P(Y = 1 \mid X) \cdot P(Y = 0 \mid X) = \frac{\exp(\beta^\top X)}{(1 + \exp(\beta^\top X))^2}.$$
Since the variance depends on $X$, logistic regression models are always heteroscedastic (unequal error variances).

Estimation in logistic regression

Assuming independent observations $(x_1, y_1), \ldots, (x_n, y_n)$, the log-likelihood for logistic regression is
$$L(\beta \mid Y, X) = \log \prod_{i=1}^n \left(\frac{\exp(\beta^\top x_i)}{1 + \exp(\beta^\top x_i)}\right)^{y_i} \left(\frac{1}{1 + \exp(\beta^\top x_i)}\right)^{1 - y_i} = \log \prod_{i=1}^n \frac{\exp(y_i \cdot \beta^\top x_i)}{1 + \exp(\beta^\top x_i)} = \sum_{i=1}^n y_i \cdot \beta^\top x_i - \sum_{i=1}^n \log\left(1 + \exp(\beta^\top x_i)\right).$$
This likelihood is for the conditional distribution of $Y$ given $X$. As in linear regression, we do not model the marginal distribution of $X$.

Estimation in logistic regression

Logistic regression models are usually fit using maximum likelihood estimation. This means that the parametric likelihood above is maximized as a function of $\beta$.
The gradient of the log-likelihood function (the score function) is
$$\nabla L(\beta \mid Y, X) = \frac{\partial}{\partial \beta} L(\beta \mid Y, X) = \sum_{i=1}^n y_i x_i - \sum_{i=1}^n \frac{\exp(\beta^\top x_i)}{1 + \exp(\beta^\top x_i)}\, x_i = \sum_{i=1}^n x_i \left(y_i - \frac{\exp(\beta^\top x_i)}{1 + \exp(\beta^\top x_i)}\right).$$
This is a $(p + 1) \times 1$ vector.

Estimation in logistic regression

The Hessian of the log-likelihood is
$$H(\beta \mid Y, X) = \frac{\partial^2}{\partial \beta \, \partial \beta^\top} L(\beta \mid Y, X) = -\sum_{i=1}^n \frac{\exp(\beta^\top x_i)}{(1 + \exp(\beta^\top x_i))^2}\, x_i x_i^\top.$$
This is a $(p + 1) \times (p + 1)$ symmetric matrix.
The Hessian is strictly negative definite as long as the design matrix $\mathbf{X}_{n \times (p+1)}$ has linearly independent columns (full column rank $p + 1$); see the proof on the next slide.
Therefore $L(\beta \mid Y, X)$ is a concave function of $\beta$, so it has a unique maximizer, and hence the MLE is unique.
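These formulas translate directly into code. A minimal sketch of the log-likelihood, score, and Hessian, assuming X is the $n \times (p+1)$ design matrix (including the intercept column), y is the 0/1 response vector, and the function names are chosen here only for illustration:

```r
# Log-likelihood: sum_i y_i * beta'x_i - sum_i log(1 + exp(beta'x_i))
loglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)                 # linear predictors beta'x_i
  sum(y * eta) - sum(log1p(exp(eta)))
}

# Score (gradient): sum_i x_i * (y_i - p_i), with p_i = exp(eta_i)/(1 + exp(eta_i))
score <- function(beta, X, y) {
  p <- plogis(drop(X %*% beta))           # fitted probabilities
  drop(crossprod(X, y - p))
}

# Hessian: -X' D X, with D = diag(p_i (1 - p_i))
hessian <- function(beta, X, y) {
  p <- plogis(drop(X %*% beta))
  -crossprod(X, X * (p * (1 - p)))        # row i of X scaled by p_i (1 - p_i)
}
```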
Hessian matrix in logistic regression

Rewrite the Hessian matrix as
$$H(\beta \mid Y, X) = -\mathbf{X}^\top \begin{pmatrix} \frac{\exp(\beta^\top x_1)}{(1 + \exp(\beta^\top x_1))^2} & & \\ & \ddots & \\ & & \frac{\exp(\beta^\top x_n)}{(1 + \exp(\beta^\top x_n))^2} \end{pmatrix} \mathbf{X} =: -\mathbf{X}^\top D \mathbf{X},$$
where the design matrix $\mathbf{X}$ of size $n \times (p + 1)$ is
$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}.$$
The $n \times n$ diagonal matrix $D$ has positive diagonal entries, so if $\mathbf{X}$ has full rank $p + 1$, the Hessian matrix is negative definite.

Estimation in logistic regression

We need a numerical optimization method to find the MLE under logistic regression, because it has no closed form.
Newton-Raphson (Newton's) numerical optimization method:
$$\beta^{(t+1)} = \beta^{(t)} - H\left(\beta^{(t)}\right)^{-1} \nabla L\left(\beta^{(t)}\right),$$
where $t = 0, 1, 2, \ldots$ indexes the iterations.
Newton's method for logistic regression can be shown to be equivalent to a version of Iteratively (Re-)Weighted Least Squares (IRLS).

MLE of logistic regression

Since $\hat{\beta}$ is an MLE for a regular problem, it is consistent, asymptotically unbiased, and asymptotically normal.
The estimated approximate covariance matrix of $\hat{\beta}$ (also the inverse of the Fisher information matrix) is
$$S^2(\hat{\beta}) = -H(\hat{\beta} \mid Y, X)^{-1}.$$
Then $\hat{\beta} \overset{\text{approx.}}{\sim} N\big(\beta, S^2(\hat{\beta})\big)$.
This provides a useful basis for statistical inference (confidence intervals, hypothesis testing).
Note: such inferences rely on large sample sizes (large $n$).

Inference for logistic regression

When $n$ is large,
$$\frac{\hat{\beta}_k - \beta_k}{S(\hat{\beta}_k)} \overset{\text{approx.}}{\sim} N(0, 1), \quad k = 0, 1, \ldots, p.$$
Consider
$$H_0: \beta_k = 0 \quad \text{vs.} \quad H_a: \beta_k \neq 0.$$
Wald test statistic: $z^* = \hat{\beta}_k / S(\hat{\beta}_k)$.
Decision rule: if $|z^*| > z(1 - \alpha/2)$, reject $H_0$; if $|z^*| \leq z(1 - \alpha/2)$, fail to reject $H_0$.

Likelihood Ratio Test

The Likelihood Ratio Test (LRT) can be used to test whether several $\beta_k = 0$. The LRT is a general procedure for use with maximum likelihood estimation, analogous to the general linear test procedure (F test) for linear models.
The LRT is also based on a comparison of full and reduced models, and is valid for large sample sizes.
$$H_0: \beta_{q+1} = \cdots = \beta_p = 0 \quad \text{vs.} \quad H_a: \text{not all of the } \beta_k \text{ in } H_0 \text{ equal zero}.$$
Reduced model (with $\pi = P(Y = 1 \mid X)$):
$$\pi = \mathrm{logit}^{-1}(\beta_R^\top X), \quad \text{where } \beta_R^\top X = \beta_0 + \beta_1 X_1 + \cdots + \beta_q X_q.$$

Likelihood Ratio Test

The maximum likelihood under the reduced model, $L(\text{Reduced})$, cannot exceed that under the full model, $L(\text{Full})$.
The actual test statistic for the LRT, denoted by $G^2$, is
$$G^2 = -2 \log\left[\frac{L(\text{Reduced})}{L(\text{Full})}\right].$$
If the ratio $L(\text{Reduced})/L(\text{Full})$ is small, indicating that $H_0$ seems inappropriate, then $G^2$ is large.
Large-sample theory for the LRT states that when $n$ is large and $H_0$ holds, $G^2$ is distributed approximately as $\chi^2(p - q)$, where
$$p - q = (n - q - 1) - (n - p - 1).$$
Decision rule: if $G^2 > \chi^2(1 - \alpha;\, p - q)$, reject $H_0$.

Deviance in logistic regression

In logistic regression analysis, deviance is used in place of the sum-of-squares calculations of linear regression.
Deviance $= -2 \times$ the maximum log-likelihood.
Deviance is a measure of the goodness of fit of a logistic regression model to the data (analogous to the residual sum of squares in linear models).
The smaller the deviance, the better the model fits the training data.

Heart Disease Data Example

> sumary(lmod)
              Estimate Std. Error z value  Pr(>|z|)
(Intercept) -4.5016140  1.8418627 -2.4441   0.01452
height       0.0252078  0.0263274  0.9575   0.33833
cigs         0.0231274  0.0040402  5.7243 1.038e-08

n = 3154 p = 3
Deviance = 1749.04923 Null Deviance = 1781.24374 (Difference = 32.19451)

Deviance is the deviance of the current model; Null Deviance is that of a model with no predictors, only an intercept.
Use the deviance to compare two nested models. Here the LRT becomes
$$G^2 = D_{\text{Reduced}} - D_{\text{Full}} \overset{\text{approx.}}{\sim} \chi^2_2.$$
The p-value for testing whether $\beta_{\text{height}} = \beta_{\text{cigs}} = 0$ is
> 1 - pchisq(32.2, 2)
[1] 1.01826e-07
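The z values and p-values in the table above are Wald tests, as described earlier. A minimal sketch of computing one by hand for the cigs coefficient, assuming lmod is the fit above:

```r
# Wald test of H0: beta_cigs = 0, computed from the fitted model
est <- coef(summary(lmod))["cigs", "Estimate"]
se  <- coef(summary(lmod))["cigs", "Std. Error"]
z   <- est / se              # Wald z statistic
2 * pnorm(-abs(z))           # two-sided p-value; matches Pr(>|z|) up to rounding
```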
Heart Disease Data Example

We can test the individual predictors by fitting models that drop each predictor and computing the difference in deviance.

> lmodc <- glm(chd ~ cigs, family = binomial, wcgs)
> anova(lmodc, lmod, test="Chi")
Analysis of Deviance Table
Model 1: chd ~ cigs
Model 2: chd ~ height + cigs
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      3152       1750
2      3151       1749  1  0.92025   0.3374

So height is not significant in a model that already includes cigarette consumption.

Heart Disease Data Example

We can test all the predictors in this fashion:

> drop1(lmod, test="Chi")
Single term deletions
Model: chd ~ height + cigs
       Df Deviance    AIC     LRT Pr(>Chi)
<none>      1749.0 1755.0
height  1   1750.0 1754.0  0.9202   0.3374
cigs    1   1780.1 1784.1 31.0695 2.49e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An alternative to this test is the aforementioned z value $\hat{\beta}/S(\hat{\beta})$; recall:
              Estimate Std. Error z value  Pr(>|z|)
(Intercept) -4.5016140  1.8418627 -2.4441   0.01452
height       0.0252078  0.0263274  0.9575   0.33833
cigs         0.0231274  0.0040402  5.7243 1.038e-08

Heart Disease Data Example

In this example, the p-values from the two tests are similar.
However, the outcomes of the deviance-based test and the z-value-based test are not identical in general, unlike in linear regression with normal errors.
In general, the deviance-based test is preferred.

Inference for logistic regression

An approximate $1 - \alpha$ confidence interval for $\beta_k$ is
$$\hat{\beta}_k \pm z(1 - \alpha/2)\, S(\hat{\beta}_k).$$
The corresponding confidence limits for the odds ratio $\exp(\beta_k)$ are
$$\exp\left\{\hat{\beta}_k \pm z(1 - \alpha/2)\, S(\hat{\beta}_k)\right\}.$$
95% confidence interval for $\beta_1$:
> 0.02521 + c(-1,1) * 1.96 * 0.02633
[1] -0.0263968  0.0768168
A better way is to construct a profile likelihood-based confidence interval:
> confint(lmod)
Waiting for profiling to be done...
                  2.5 %      97.5 %
(Intercept) -8.13475465 -0.91297018
height      -0.02619902  0.07702835
cigs         0.01514949  0.03100534

Model selection methods

For logistic regression, the AIC and BIC formulas incorporate the log-likelihood function instead of the RSS (residual sum of squares).
For a model containing $k$ predictors,
$$\mathrm{AIC} = -2\log(L) + 2(k + 1), \qquad \mathrm{BIC} = -2\log(L) + (k + 1)\log(n),$$
where $L$ refers to the maximum of the likelihood.
As with multiple linear regression, the smaller the AIC/BIC, the better the model.
BIC generally favors a more parsimonious model, i.e., a model containing fewer predictors.

Prediction of a New Observation

To predict $y_h$, a cutoff point $\tau$ for $\hat{\pi}_h = \mathrm{logit}^{-1}(\hat{\beta}^\top x_h)$ needs to be determined:
predict 1 if $\hat{\pi}_h \geq \tau$; predict 0 if $\hat{\pi}_h < \tau$.

Inferences about Mean Response

Given $x_h = (1, x_{h1}, \ldots, x_{hp})^\top$, the population mean response is
$$\pi_h = \frac{1}{1 + \exp(-x_h^\top \beta)}$$
and its point estimator is
$$\hat{\pi}_h = \frac{1}{1 + \exp(-x_h^\top \hat{\beta})}.$$
Interval estimation for $\pi_h$: first calculate confidence limits for the linear predictor $x_h^\top \beta$, then apply the inverse logit function to obtain confidence limits for $\pi_h$.

Inferences about Mean Response

Recall that the large-sample approximate distribution of $\hat{\beta}$ is
$$\hat{\beta} \overset{\text{approx.}}{\sim} N\left(\beta,\; -H(\hat{\beta} \mid Y, X)^{-1}\right),$$
so
$$x_h^\top \hat{\beta} \overset{\text{approx.}}{\sim} N\Big(x_h^\top \beta,\; \underbrace{-x_h^\top H(\hat{\beta} \mid Y, X)^{-1} x_h}_{=:\, s^2(x_h^\top \hat{\beta})}\Big).$$
Approximate $1 - \alpha$ large-sample confidence limits for $x_h^\top \beta$ are
$$L = x_h^\top \hat{\beta} - z(1 - \alpha/2) \cdot s(x_h^\top \hat{\beta}), \qquad U = x_h^\top \hat{\beta} + z(1 - \alpha/2) \cdot s(x_h^\top \hat{\beta}).$$
Finally, using the monotone inverse logit transformation, the approximate $1 - \alpha$ large-sample confidence limits for $\pi_h$ are
$$L^* = \frac{1}{1 + \exp(-L)}, \qquad U^* = \frac{1}{1 + \exp(-U)}.$$
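In R, these limits can be obtained from the standard error of the estimated linear predictor. A minimal sketch for the heart disease fit, assuming lmod is the glm fit above (the covariate values are illustrative):

```r
# Approximate 95% CI for P(heart disease) for a 70-inch man smoking 20 cigs/day
newx <- data.frame(height = 70, cigs = 20)
pr   <- predict(lmod, newdata = newx, type = "link", se.fit = TRUE)
ci   <- pr$fit + c(-1, 1) * qnorm(0.975) * pr$se.fit   # limits on the logit scale
plogis(c(ci[1], pr$fit, ci[2]))   # lower limit, point estimate, upper limit
```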
Simultaneous confidence intervals

We may estimate several mean responses $\pi_h$ corresponding to different $x_h$ vectors ($h = 1, \ldots, g$) with family confidence level $1 - \alpha$.
Bonferroni simultaneous confidence intervals for $g$ such mean responses: simply replace the multiplier $z(1 - \alpha/2)$ with $z(1 - \alpha/(2g))$.
The confidence limits for the linear predictor terms, $h = 1, \ldots, g$, are
$$L_h = x_h^\top \hat{\beta} - z(1 - \alpha/(2g)) \cdot s(x_h^\top \hat{\beta}), \qquad U_h = x_h^\top \hat{\beta} + z(1 - \alpha/(2g)) \cdot s(x_h^\top \hat{\beta}).$$

Polytomous Logistic Regression for Nominal Response

When the response variable has more than two qualitative levels, we can use logistic regression with a polytomous, or multicategory, response:
– In business, relate a consumer's choice of product (product A, B, C) to the consumer's age, gender, geographic location, and other variables.
Suppose the response $Y$ has $J$ response categories and define
$$y_{ij} = \begin{cases} 1 & \text{if case } i\text{'s response is category } j, \\ 0 & \text{otherwise.} \end{cases}$$
Denote $\pi_{ij} = P(y_{ij} = 1)$; then $\sum_{j=1}^J \pi_{ij} = 1$ and $\sum_{j=1}^J y_{ij} = 1$.
Recall the binary response case:
$$\log\left\{\frac{P(y_i = 1)}{1 - P(y_i = 1)}\right\} =: x_i^\top \beta_{1,\text{base}}.$$
If $y_i$ can be 1 or 2, then category 2 is the baseline category above.

Polytomous Logistic Regression for Nominal Response

For $J$ polytomous categories, arbitrarily specify one category, say the $J$th, as the baseline and define
$$\log\left\{\frac{\pi_{ij}}{\pi_{iJ}}\right\} = x_i^\top \beta_{j,\text{base}}, \quad j = 1, \ldots, J - 1.$$
Logits for any other comparisons can be obtained from the above:
$$\log\left\{\frac{\pi_{ij_1}}{\pi_{ij_2}}\right\} = \log\left\{\frac{\pi_{ij_1}}{\pi_{iJ}} \cdot \frac{\pi_{iJ}}{\pi_{ij_2}}\right\} = x_i^\top \beta_{j_1,\text{base}} - x_i^\top \beta_{j_2,\text{base}}.$$
After some algebra, we can show
$$\pi_{ij} = \frac{\exp(x_i^\top \beta_{j,\text{base}})}{1 + \sum_{j'=1}^{J-1} \exp(x_i^\top \beta_{j',\text{base}})}, \quad j = 1, \ldots, J - 1.$$
This is multinomial logistic regression.

Maximum Likelihood Estimation

Write $\beta_{j,\text{base}}$ simply as $\beta_j$. We can use MLE to estimate $\beta_1, \ldots, \beta_{J-1}$.
Since $y_i$ follows a Categorical distribution with parameters $(\pi_{i1}, \ldots, \pi_{iJ})$, its pmf can be written as $\prod_{j=1}^J \pi_{ij}^{y_{ij}}$; for example, when category 3 is observed, $y_{i3} = 1$ and this product equals $P(y_i = 3) = \pi_{i3}$.
The log-likelihood for $n$ independent observations and $J$ categories is
$$L(\beta \mid Y, X) = \log \prod_{i=1}^n \prod_{j=1}^J \pi_{ij}^{y_{ij}} = \sum_{i=1}^n \left[\sum_{j=1}^{J-1} y_{ij}\, x_i^\top \beta_j - \log\left(1 + \sum_{j=1}^{J-1} \exp(x_i^\top \beta_j)\right)\right].$$
The $J - 1$ fitted response functions $\hat{\pi}_{ij}$ may be obtained by substituting the MLEs $\hat{\beta}_j$ into the expressions for the $\pi_{ij}$.

Polytomous Logistic Regression for Ordinal Response

Ordinal responses, where the categories of $Y$ are ordered, are also common:
– Employees asked to rate working conditions using a 7-point scale (unacceptable, poor, fair, acceptable, good, excellent, outstanding)
– The severity of cancer is rated by stage on a 1-4 basis
A more effective model for such responses explicitly takes the ordering into account.
Proportional odds model:
$$P(y_i \leq j) = \frac{\exp(\alpha_j + x_i^\top \beta)}{1 + \exp(\alpha_j + x_i^\top \beta)}, \quad j = 1, \ldots, J - 1,$$
where there are $J - 1$ cumulative logits:
$$\log\left[\frac{P(y_i \leq j)}{1 - P(y_i \leq j)}\right] = \alpha_j + x_i^\top \beta.$$
The MLE for this model can be obtained similarly.
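In R, these two models are commonly fit with nnet::multinom (baseline-category multinomial logit) and MASS::polr (proportional odds). A minimal sketch; the data frame dat, the nominal response choice, the ordered-factor response rating, and the predictors age and income are hypothetical. Note that polr parameterizes the cumulative logits as $\zeta_j - x_i^\top \beta$, so its reported slopes have the opposite sign relative to the $\alpha_j + x_i^\top \beta$ form above.

```r
library(nnet)   # multinom(): baseline-category multinomial logistic regression
library(MASS)   # polr(): proportional odds model for ordinal responses

# Hypothetical data frame `dat` with a nominal response `choice`, an ordered
# factor `rating`, and predictors `age` and `income`

# Nominal response: one coefficient vector per non-baseline category
fit_nom <- multinom(choice ~ age + income, data = dat)
summary(fit_nom)

# Ordinal response: a single slope vector plus J - 1 intercepts (cutpoints)
fit_ord <- polr(rating ~ age + income, data = dat, method = "logistic")
summary(fit_ord)
```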