Programming Example: STATS 330

Department of Statistics
STATS 330: Statistical Modelling
Exam, Semester 1, 2021
Total: 70 marks
Due: 20:00hrs (8:00pm) NZDT, 16 June, 2021

Notes:
(i) Attempt ALL 3 questions.
(ii) There are a total of 70 marks for this examination.
(iii) This is an open book examination. You may consult your notes when doing the test.
(iv) An Appendix contains useful information and R output for use in Questions 1 and 2.

Question 1.

The data for this question involves the migration of apprentices from Scottish counties into the city of Edinburgh in the late 18th century. The dataset apprentice.df has the following variables:

    County: The name of the county from which the apprentice migrated
    Distance: The distance of the county from Edinburgh (km)
    Apprentices: The number of apprentices from that county registered in Edinburgh between 1775 and 1799
    Population: The population of the county
    Urban: The percentage of the county's population living in an urban area

The first 5 entries of the data frame created for this data and the output from summary() are as follows:

            County Distance Apprentices Population Urban
    1   Midlothian       21         225      56000  18.8
    2 West Lothian       24          22      18000  37.9
    3 East Lothian       33          44      30000  43.4
    4      Kinross       33           3       7000  30.3
    5         Fife       36          41      94000  41.3

        Distance     Apprentices       Population          Urban
     Min.   : 21   Min.   :  0.0   Min.   :  5000   Min.   : 7.7
     1st Qu.: 54   1st Qu.:  1.0   1st Qu.: 22000   1st Qu.:12.9
     Median : 92   Median :  3.0   Median : 30000   Median :27.3
     Mean   :132   Mean   : 14.2   Mean   : 46576   Mean   :28.6
     3rd Qu.:174   3rd Qu.:  9.0   3rd Qu.: 72000   3rd Qu.:41.3
     Max.   :491   Max.   :225.0   Max.   :147000   Max.   :69.9

We wish to create a model for the number of apprentices migrating to Edinburgh from Scottish counties. Three types of models that may be used for count data are the Poisson model, the quasi-Poisson model and the negative binomial model.
These models have been created as follows:

    > library(MASS)  # provides glm.nb()
    > poisson.fit <- glm(Apprentices ~ log(Distance) + Urban,
    +                    offset=log(Population/1000),
    +                    family="poisson", data=apprentice.df)
    > qpoisson.fit <- glm(Apprentices ~ log(Distance) + Urban,
    +                     offset=log(Population/1000),
    +                     family="quasipoisson", data=apprentice.df)
    > negbin.fit <- glm.nb(Apprentices ~ log(Distance) + Urban +
    +                      offset(log(Population/1000)),
    +                      data=apprentice.df)

Output from each of these models can be found in the Appendix.

(a) Of these three models, negbin.fit is the most appropriate model for this data. Using the output from the three models provided in the Appendix, give reasons to support this claim. You should identify all relevant evidence from the output that supports the negative binomial model over the other two models. Marks will be deducted for invalid reasons. (10 marks)

(b) All three models contain the term log(Population/1000) as an offset.
(i) Why is this term needed?
(ii) How does the inclusion of this term affect the way in which we interpret the model?
Your answers need to be framed in the context of the situation. (6 marks)

(c) A further model was created, negbin2.fit, which includes the interaction between log(Distance) and Urban. The summary() output from this model is given on page X of the Appendix. Consider a hypothetical Scottish county with a population of 28000, 15% of whom lived in an urban area, situated a distance of 60km from Edinburgh. Based on the model negbin2.fit, calculate the expected number of apprentices migrating to Edinburgh from this county. Make sure you show how you calculated your answer. (6 marks)

(d) The authors who used this data in an analysis stated that they expected that the number of apprentices was negatively related to both the distance the county was from Edinburgh, i.e., Distance, and the degree of urbanisation in the county, i.e., Urban. Does the model negbin2.fit support this statement? Fully explain your answer. (8 marks)

Question 2.
The data for this question consists of a sample of 873 married women taken from the first representative health survey for Switzerland. The dataset has the following variables:

    LFpart: Factor. Did the individual participate in the labour force (yes or no)?
    inc: Logarithm of nonlabour income.
    age: Age in years.
    educ: Years of formal education.
    ykids: Number of children under 7 years old.
    okids: Number of children over 7 years old.
    foreign: Factor. Is the individual a foreigner (yes or no)?

(a) Use the fitted model in Appendix 2.1 to describe the impact that the number of children under 7 has on the odds that a married woman participates in the labour force. (5 marks)

(b) Suppose we wish to create a 95% confidence interval for the probability that LFpart=yes when inc=8, age=40, educ=15, ykids=0, okids=0, foreign=yes. Use the output in Appendix 2.2 to produce this confidence interval. Show your working. (5 marks)

(c) We could also use "bootstrapping" to produce a confidence interval. Appendix 2.3 contains computer code for a bootstrap simulation.
(i) Does the code in Appendix 2.3 indicate that a parametric bootstrap or a non-parametric bootstrap was used? Justify your answer.
(ii) Use the output in Appendix 2.3 to create a 95% confidence interval for the probability that LFpart=yes when inc=8, age=40, educ=15, ykids=0, okids=0, foreign=yes. Show your working.
(6 marks)

(d) Suppose we decide to use the logistic regression model swiss.glm for prediction.
(i) In the context of this data set, explain what the terms "sensitivity" and "specificity" represent.
(ii) Suppose we use the fitted model swiss.glm and predict LFpart=yes if the estimated probability is greater than 0.5 and LFpart=no otherwise for each observation in this data set. Then the following confusion table is obtained:

            predicted
    actual   no  yes
       no   337  134
       yes  146  255

Use this table to estimate the sensitivity, specificity and the error rate. Show your working.
(iii) The estimates obtained above are probably optimistic. Explain why this is the case.
Also briefly describe how more realistic estimates can be obtained.
(10 marks)

(e) The following plot contains the ROC curve corresponding to the model swiss.glm.

[Figure: ROC curve for swiss.glm. Horizontal axis: Specificity, running from 1.0 down to 0.0. Vertical axis: Sensitivity, running from 0.0 to 1.0. One point on the curve is marked and labelled 0.467 (0.677, 0.713).]

(i) Explain what the numbers "0.467 (0.677, 0.713)" which are printed on this plot represent.
(ii) Based on this plot, would it be possible to have both a true positive rate of at least 0.7 and a false positive rate of 0.2 or less? Justify your answer.
(4 marks)

Question 3.

Consider the following diagram, which represents the causal relationships between 7 variables labelled A through G.

[Diagram: causal graph with nodes A, B, C, D, E, F, G]

Use this diagram to answer the following questions.

(a) List all of the variables that have a direct causal effect on G. (2 marks)

(b) Suppose we wish to estimate the direct causal effect that A has on F. List all of the explanatory variables that should be included in the model. (2 marks)

(c) Suppose we wish to estimate the total causal effect that A has on F. List all of the explanatory variables that should be included in the model. (2 marks)

(d) Consider the direct effect of E on F.
(i) List all of the variables that are confounders for this effect.
(ii) List all of the variables that are colliders for this effect.
(4 marks)