Programming Example: 3H

XX May 2021

EXAMINATION FOR THE DEGREES OF M.A., M.SCI. AND B.SC. (SCIENCE)

Statistics - 3H Linear Models

This paper consists of 9 pages and contains 3 question(s).
Candidates should attempt all questions.

Question 1    20 marks
Question 2    20 marks
Question 3    20 marks
Total         60 marks

The following material is made available to you:
Statistical tables
Formula sheet

NOTE: Candidates must attempt all questions.

1.
(a) We have the following mean model for a linear regression:

    $E(y_i \mid x_i, z_i) = \beta_0 + \beta_1 \mathbf{1}[x_i > 3.5] + \beta_2 z_i$,

where $\mathbf{1}[\cdot]$ is an indicator function equal to 1 if the statement in the brackets is true and 0 otherwise. We have the following values: x = (1, 6.4, 3, 5.1, 2.9, 11.3) and z = (15.2, 20.5, 12.4, 10.5, 21.0, 19.1). Write out the design matrix for this regression along with the corresponding defined parameter vector. [3 MARKS]

(b) List the assumptions required to fit this model and do inference. [2 MARKS]

(c) This is a special case of linear regression. What is the name for this type of model? [1 MARK]

(d) Suggest a reason why the indicator function of x might be used rather than the original x. What is a disadvantage of using the indicator? [2 MARKS]

(e) Is this a main effects or complete model? If it is a main effects model, explain how to make it a complete model; if it is a complete model, explain how to make it a main effects one. [2 MARKS]

(f) Figure 1 is a standard diagnostic plot for this model. What is this type of plot called? Comment on whether there is an issue with an assumption and, if so, which assumption and for what reason. [3 MARKS]

(g) Suppose we observed an issue with the homoscedasticity assumption in our diagnostic plots.
    i. Which plot would have been used to check this assumption and what should we have seen if the assumption was satisfied? [2 MARKS]
    ii. What two strategies could we take to try to deal with this problem? List a disadvantage of either one of the strategies suggested.
[3 MARKS]

(h) Comment on the DFFITS plot in Figure 2 for this model. [2 MARKS]

Figure 1: Question 1(f)

Figure 2: Question 1(h)

2. A biostatistician is interested in looking at the relationship between a range of recorded variables and the mean per capita cancer mortalities (TargetdeathRate) in some areas of the United States. He is curious to see what an automated model building approach suggests and uses an AIC backward search. Partial results from the first two steps of this search are given here:

    step(g, direction = "backward")

    Start:  AIC=3583.84
    TargetdeathRate ~ avgAnnCount + incidenceRate + medIncome + popEst2015 +
        povertyPercent + studyPerCap + binnedInc + MedianAge + MedianAgeMale +
        MedianAgeFemale + AvgHouseholdSize + PercentMarried + PctNoHS18_24 +
        PctHS18_24 + PctSomeCol18_24 + PctBachDeg18_24 + PctHS25_Over +
        PctBachDeg25_Over + PctEmployed16_Over + PctUnemployed16_Over +
        PctPrivateCoverage + PctPrivateCoverageAlone + PctEmpPrivCoverage +
        PctPublicCoverage + PctPublicCoverageAlone + PctWhite + PctBlack +
        PctAsian + PctOtherRace + PctMarriedHouseholds + BirthRate

                              Df Sum of Sq    RSS    AIC
    - studyPerCap              1         0 222012 3581.8
    - PctAsian                 1         2 222014 3581.8
    - PctPrivateCoverage       1        11 222023 3581.9
    - AvgHouseholdSize         1        96 222108 3582.1
    - PctPrivateCoverageAlone  1       256 222267 3582.5
    - avgAnnCount              1       264 222276 3582.5
    - BirthRate                1       265 222276 3582.5
    - popEst2015               1       309 222320 3582.7
    - povertyPercent           1       330 222342 3582.7
    - binnedInc                9      6536 228548 3583.0
    - MedianAge                1       517 222529 3583.2
    - MedianAgeFemale          1       524 222535 3583.2
    - PctUnemployed16_Over     1       639 222650 3583.5
    <none>                                 222012 3583.8
    - PctPublicCoverage        1       768 222780 3583.9
    - PctWhite                 1       941 222952 3584.3
    - PctPublicCoverageAlone   1       969 222980 3584.4
    - PctHS18_24               1       974 222986 3584.4
    - PctSomeCol18_24          1       980 222991 3584.4
    - PctNoHS18_24             1      1005 223016 3584.5
    - PctBachDeg18_24          1      1053 223065 3584.6
    - medIncome                1      1300 223311 3585.3
    - PctBlack                 1      1429 223441 3585.6
    - PctEmpPrivCoverage       1      1583 223595 3586.0
    - PctHS25_Over             1      1592 223604 3586.1
    - MedianAgeMale            1      2296 224308 3587.9
    - PctEmployed16_Over       1      2769 224781 3589.2
    - PctBachDeg25_Over        1      3075 225087 3590.0
    - PctOtherRace             1      3410 225421 3590.9
    - PercentMarried           1      8775 230786 3604.8
    - PctMarriedHouseholds     1     13638 235649 3617.1
    - incidenceRate            1     37992 260004 3675.2

    Step:  AIC=3581.84
    TargetdeathRate ~ avgAnnCount + incidenceRate + medIncome + popEst2015 +
        povertyPercent + binnedInc + MedianAge + MedianAgeMale +
        MedianAgeFemale + AvgHouseholdSize + PercentMarried + PctNoHS18_24 +
        PctHS18_24 + PctSomeCol18_24 + PctBachDeg18_24 + PctHS25_Over +
        PctBachDeg25_Over + PctEmployed16_Over + PctUnemployed16_Over +
        PctPrivateCoverage + PctPrivateCoverageAlone + PctEmpPrivCoverage +
        PctPublicCoverage + PctPublicCoverageAlone + PctWhite + PctBlack +
        PctAsian + PctOtherRace + PctMarriedHouseholds + BirthRate

                              Df Sum of Sq    RSS    AIC
    - PctAsian                 1         2 222014 3579.8
    - PctPrivateCoverage       1        11 222023 3579.9
    - AvgHouseholdSize         1        96 222108 3580.1
    - PctPrivateCoverageAlone  1       255 222267 3580.5
    - avgAnnCount              1       264 222276 3580.5
    - BirthRate                1       265 222277 3580.5
    - popEst2015               1       309 222321 3580.7
    - povertyPercent           1       331 222343 3580.7
    - binnedInc                9      6555 228567 3581.0
    - MedianAge                1       517 222529 3581.2
    - MedianAgeFemale          1       524 222536 3581.2
    - PctUnemployed16_Over     1       639 222651 3581.5
    <none>                                 222012 3581.8
    - PctPublicCoverage        1       768 222780 3581.9
    - PctWhite                 1       942 222954 3582.3
    - PctPublicCoverageAlone   1       969 222981 3582.4
    - PctHS18_24               1       975 222987 3582.4
    - PctSomeCol18_24          1       980 222992 3582.4
    - PctNoHS18_24             1      1005 223017 3582.5
    - PctBachDeg18_24          1      1054 223066 3582.6
    - medIncome                1      1303 223315 3583.3
    - PctBlack                 1      1429 223441 3583.6
    - PctEmpPrivCoverage       1      1586 223598 3584.1
    - PctHS25_Over             1      1610 223622 3584.1
    - MedianAgeMale            1      2302 224314 3585.9
    - PctEmployed16_Over       1      2788 224800 3587.2
    - PctBachDeg25_Over        1      3076 225088 3588.0
    - PctOtherRace             1      3410 225421 3588.9
    - PercentMarried           1      8774 230786 3602.8
    - PctMarriedHouseholds     1     13656 235668 3615.1
    - incidenceRate            1     38106 260118 3673.5

(a) What is the next decision made by the algorithm? [1 MARK]

(b) Which candidate variable is likely to be removed in the next (third) step following the last piece of R output? Will this definitely be the case? Why/why not? [3 MARKS]

(c) Why might this approach be preferable to one using hypothesis testing to choose a final model? How could one remedy the issue with using hypothesis tests? [2 MARKS]

(d) Once the search has concluded with a final model, suggest one piece of further exploration that could be worth doing. [1 MARK]

(e) The final model output is given here:

    summary(final)

    Call:
    lm(formula = TARGETdeathRate ~ incidenceRate + medIncome + MedianAgeMale +
        PercentMarried + PctNoHS18_24 + PctHS18_24 + PctSomeCol18_24 +
        PctBachDeg18_24 + PctHS25_Over + PctBachDeg25_Over + PctEmployed16_Over +
        PctPrivateCoverageAlone + PctEmpPrivCoverage + PctPublicCoverage +
        PctPublicCoverageAlone + PctWhite + PctBlack + PctOtherRace +
        PctMarriedHouseholds, data = data)

    Residuals:
       Min     1Q Median     3Q    Max
    -79.88 -10.83   0.16  10.97 107.32

    Coefficients:
                              Estimate Std. Error t value Pr(>|t|)
    (Intercept)              2.732e+03  1.484e+03   1.842  0.06604 .
    incidenceRate            1.619e-01  1.670e-02   9.696  < 2e-16 ***
    medIncome                3.088e-04  1.755e-04   1.760  0.07899 .
    MedianAgeMale           -6.698e-01  3.295e-01  -2.033  0.04255 *
    PercentMarried           1.939e+00  3.926e-01   4.939 1.03e-06 ***
    PctNoHS18_24            -2.578e+01  1.482e+01  -1.740  0.08247 .
    PctHS18_24              -2.539e+01  1.483e+01  -1.712  0.08738 .
    PctSomeCol18_24         -2.548e+01  1.483e+01  -1.718  0.08629 .
    PctBachDeg18_24         -2.618e+01  1.483e+01  -1.765  0.07806 .
    PctHS25_Over             5.834e-01  2.376e-01   2.456  0.01436 *
    PctBachDeg25_Over       -9.527e-01  3.788e-01  -2.515  0.01217 *
    PctEmployed16_Over      -8.915e-01  2.254e-01  -3.955 8.61e-05 ***
    PctPrivateCoverageAlone -7.551e-01  3.759e-01  -2.009  0.04503 *
    PctEmpPrivCoverage       6.314e-01  2.712e-01   2.328  0.02025 *
    PctPublicCoverage       -1.175e+00  4.431e-01  -2.652  0.00822 **
    PctPublicCoverageAlone   1.441e+00  4.841e-01   2.976  0.00304 **
    PctWhite                 2.010e-01  1.329e-01   1.512  0.13098
    PctBlack                 2.963e-01  1.244e-01   2.382  0.01755 *
    PctOtherRace            -9.089e-01  3.057e-01  -2.973  0.00307 **
    PctMarriedHouseholds    -2.191e+00  3.660e-01  -5.985 3.82e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 20.11 on 571 degrees of freedom
    Multiple R-squared:  0.4737, Adjusted R-squared:  0.4561
    F-statistic: 27.04 on 19 and 571 DF,  p-value: < 2.2e-16

    i. What distribution (along with any details like degrees of freedom) is being used to produce the p-values in the final column of this output? [2 MARKS]
    ii. What are the null and alternative hypotheses for the model F test and what is the conclusion here? [2 MARKS]
    iii. The condition number for this model is 39976.49. What can we conclude from this? Suggest a strategy for remedying this. [3 MARKS]
    iv. The parameters for MedianAgeMale and PctPrivateCoverageAlone were of particular interest to the investigator so a partial F-test for these two covariates was run. The p-value was 0.06. Give the null and alternative hypotheses for this test and comment on its result with respect to the previous model output. [2 MARKS]
    v. Interpret the coefficient of PercentMarried (which is the percentage of county residents who are married). [2 MARKS]

(f) Give one advantage and one disadvantage of using maximum likelihood with normality over ordinary least squares to estimate parameters in a linear model. [2 MARKS]

3.
(a) A researcher into effective study habits looks at a set of volunteers and measures 3 continuous scores related to their lifestyle (x1, x2 and x5) and 2 categorical scores (x3: whether they have a positive, neutral or negative outlook on life, and x4: assigned gender at birth). They fit a linear model with main effects for all variables and an interaction between x1 and x2. A partial anova table for the resulting model is given here:

    Analysis of Variance Table

    Response: y
              Df Sum Sq Mean Sq    F value    Pr(>F)
    x1         1  362.7   362.7          u < 2.2e-16 ***
    x2         1 5148.9  5148.9 17737.0072 < 2.2e-16 ***
    x3         2    5.9       v    10.1954 0.0002459 ***
    x4         1  183.3   183.3   631.3107 < 2.2e-16 ***
    x5         1    0.0     0.0     0.0081         w
    x1:x2      1  381.4   381.4  1313.8350 < 2.2e-16 ***
    Residuals 42   12.2     0.3
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    i. How many observations/subjects did the researcher have in their analysis? [1 MARK]
    ii. Calculate the missing values of u and v from the anova table. [2 MARKS]
    iii. What are the null and alternative hypotheses for the test whose p-value is the missing value w? Using the statistical tables, calculate a critical value (listing the distribution and degrees of freedom) for the test of the x5 coefficient. What conclusion do you reach about this term? What alternative test could you have run to check the significance of the x5 term? Would it definitely have given the same result or not? [6 MARKS]
    iv. Having stupidly thrown away their original data, the researcher now wants to run an F test comparing the model without the interaction and x5 term to the full model. Using the data in the table, produce the observed F statistic and run the test, giving a conclusion on which model to retain. [4 MARKS]

(b) The same researcher is looking at the proportion of people, y, in each area in Glasgow who smoke. They want to fit a linear model to model the effect of various covariates on this outcome.
A statistician convinces the researcher to use a transformed outcome $y^{*} = \arcsin(y) = \sin^{-1}(y)$ as the outcome instead. They end up with a simple linear regression of $y^{*}$ on a single covariate x. The researcher wants to get a prediction and range of plausible values for the proportion of people who smoke when the value of x is equal to 0. The statistician presents them with two outputs from R:

    > predict(mod, data.frame(x=0), interval="prediction")
            fit       lwr       upr
    1 0.2996512 0.2801427 0.3191596
    > predict(mod, data.frame(x=0), interval="confidence")
            fit       lwr       upr
    1 0.2996512 0.2977087 0.3015937

    i. The researcher decides to take the second interval as it's shorter. Comment on this decision and how they could have otherwise decided between the two. [2 MARKS]
    ii. Using the prediction interval output, explain what proportion of smokers are likely in an area with x = 0. [3 MARKS]
    iii. What checks should the statistician have done before producing this output? [2 MARKS]

END OF QUESTION PAPER.