Monday, 2nd August, 2021
EXAMINATION FOR THE DEGREES OF M.A., M.SCI. AND B.SC.
(SCIENCE)
Advanced Predictive Models (Level M)
Course code: STATS 5098
This paper consists of 9 pages and contains 6 question(s).
Candidates should attempt all questions.
Question 1 8 marks
Question 2 9 marks
Question 3 6 marks
Question 4 6 marks
Question 5 14 marks
Question 6 17 marks
Total 60 marks
The following material is made available to you:
Statistical tables
Formula sheet
Time allowed: 1.5 hours under normal conditions; 3 hours for the online exam.
CONTINUED OVERLEAF/
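1. The probability density function of a known distribution is
$$f(y \mid \theta) = \frac{1}{\Gamma(\nu)} \, \frac{[\ldots]}{\theta} \, [\ldots] \; y^{\nu-1} \exp\left( \frac{[\ldots]}{\theta} \right), \quad y > 0,$$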
where Γ(ν) is the Gamma function, and θ, ν > 0.
(a) Show that this distribution falls within the exponential family given ν is known.
Specify the components: a(y), b(θ), c(θ), d(y). [4 MARKS]
(b) Find the mean and variance of this distribution using the expressions obtained in
part (a). Results should be given in terms of θ and ν. [2 MARKS]
(c) Point out one practical issue with model fitting when the canonical link function
is used. [2 MARKS]
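For reference, the a(y), b(θ), c(θ), d(y) notation in Question 1 presumably follows the common textbook form of the one-parameter exponential family. This is an assumption, since the formula sheet is not reproduced in this paper. Under that convention the family and the moments of a(Y) are usually written as:

```latex
% Assumed convention (not stated in the paper): one-parameter exponential family
f(y \mid \theta) = \exp\{\, a(y)\, b(\theta) + c(\theta) + d(y) \,\}

% Standard moment results under this form
\mathrm{E}[a(Y)] = -\frac{c'(\theta)}{b'(\theta)}, \qquad
\mathrm{Var}[a(Y)] = \frac{b''(\theta)\, c'(\theta) - c''(\theta)\, b'(\theta)}{[\, b'(\theta) \,]^{3}}
```

When a(y) = y the distribution is said to be in canonical form, and b(θ) is the natural parameter.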
2. Table 1 shows data from a study on the relationship between an infant respiratory
disease and the type of feeding and sex for children in their first year of life. Below
you will find some R code for analysing the data. Answer the following questions based
on the code and the output.
Table 1: Incidence of respiratory disease in infants to the age of 1 year.
Group  Feeding Type            Sex   Disease  Nondisease
1      Bottle only             Boy        77         458
2      Bottle only             Girl       48         384
3      Breast with supplement  Boy        19         147
4      Breast with supplement  Girl       16         127
5      Breast only             Boy        47         494
6      Breast only             Girl       31         464
lmod <- glm(cbind(disease, nondisease) ~ sex + food, family=binomial, data=babyfood)
summary(lmod)

Call:
glm(formula = cbind(disease, nondisease) ~ sex + food, family = binomial,
    data = babyfood)

Deviance Residuals:
      1        2        3        4        5        6
 0.1096  -0.5052   0.1922  -0.1342   0.5896  -0.2284

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.6127     0.1124 -14.347  < 2e-16 ***
sexGirl      -0.3126     0.1410  -2.216   0.0267 *
foodBreast   -0.6693     0.1530  -4.374 1.22e-05 ***
foodSuppl    -0.1725     0.2056  -0.839   0.4013
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 26.37529  on 5  degrees of freedom
Residual deviance:  0.72192  on 2  degrees of freedom
AIC: 40.24

Number of Fisher Scoring iterations: 4

(a) Write down the systematic component of this logistic regression model. Define
your variables clearly. [2 MARKS]
(b) How does sex influence the odds of respiratory disease? [3 MARKS]
(c) Does the fitted model show any lack of fit? Explain. [2 MARKS]
(d) Is it necessary to further consider the interaction effect between variables food
and sex? Explain. [2 MARKS]
3. A dataset was recorded on 44 doctors working in an emergency service at a hospital
to study factors affecting the number of complaints received. The dataset consists of a
data frame with 44 observations on the following six variables:
visits: number of patient visits
complaints: number of complaints
residency: is the doctor in residency training (Y/N)
gender: gender of doctor (F/M)
revenue: dollars per hour earned by the doctor
hours: total number of hours worked
Below you will find some R code for analysing the data. Answer the following questions
based on the code and the output.

## Model 1
mod1 <- glm(complaints ~ residency + visits + gender + revenue, family=poisson, data=esdcomp, offset=log(hours))
summary(mod1)

Call:
glm(formula = complaints ~ residency + visits + gender + revenue,
    family = poisson, data = esdcomp, offset = log(hours))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8783  -0.9757  -0.2804   0.7702   1.6741

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.4904751  0.6855108 -10.927  < 2e-16 ***
residencyY  -0.1753176  0.1973189  -0.888  0.37427
visits       0.0005593  0.0001840   3.040  0.00237 **
genderM      0.1890678  0.2160386   0.875  0.38149
revenue     -0.0003377  0.0033956  -0.099  0.92077
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 69.659  on 43  degrees of freedom
Residual deviance: 52.181  on 39  degrees of freedom
AIC: 184.96

Number of Fisher Scoring iterations: 5

## Model 2
mod2 <- glm(complaints ~ visits, family=poisson, data=esdcomp, offset=log(hours))
summary(mod2)

Call:
glm(formula = complaints ~ visits, family = poisson, data = esdcomp,
    offset = log(hours))

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.8030  -0.9769  -0.2409   0.7348   1.9136

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.5331144  0.4030269  -18.69   <2e-16 ***
visits       0.0005716  0.0001469    3.89    1e-04 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 69.659  on 43  degrees of freedom
Residual deviance: 54.380  on 42  degrees of freedom
AIC: 181.16

Number of Fisher Scoring iterations: 5

anova(mod2, mod1, test="LRT")

Analysis of Deviance Table

Model 1: complaints ~ visits
Model 2: complaints ~ residency + visits + gender + revenue
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1        42     54.380
2        39     52.181  3   2.1985   0.5323
(a) Write down the formula for model 1. Define your notation clearly. [2 MARKS]
(b) According to the preferred model, how does the number of patient visits influence
complaints? [2 MARKS]
(c) Is there obvious evidence of overdispersion under the preferred model? Explain.
[2 MARKS]
4. Figure 1 is the time plot of a time series dataset on the end-of-day share price of a
newly established technology company over a period of two years.
(a) Describe the prominent features of the time series in Figure 1. [2 MARKS]
(b) The residuals after fitting a regression model to the data are shown in Figure 2.
Based on the figure, is the fitted regression model appropriate? Do the residuals
form a stationary series? Explain. [2 MARKS]
[Plot not reproduced; x-axis: Day, y-axis: Share price (pence).]
Figure 1: End-of-day share price of a newly established technology company over a period
of two years.
[Plot not reproduced; y-axis: Residuals.]
Figure 2: Residuals obtained after fitting a regression model to the original data.
(c) Based on the plots in Figure 3, is there short-term correlation remaining in the
residuals? What type of process do you recommend to model the residuals? Explain.
[2 MARKS]
[Plots not reproduced; top panel: ACF against Lag, bottom panel: PACF against Lag.]
Figure 3: Autocorrelation (top) and partial autocorrelation (bottom) from the residuals
obtained after fitting a regression model to the original data.
5. Suppose $Z_t$ is a purely random process with zero mean and variance $\sigma_z^2$.
(a) Consider the time series process
$$X_t = \exp\{(t-1)^2 + \sin(2\pi t/365) + 2\cos(2\pi t/365) + Z_t\}.$$
Show how to make this process weakly stationary. [4 MARKS]
(b) Define $\beta_t$ to be a set of independent and identically distributed random variables
from a normal distribution with both mean and variance being 1. Suppose $\beta_t$ and
$Z_t$ are independent of each other. Consider the following time series model:
$$X_t = Z_t + \beta_t Z_{t-1}.$$
i. Is this a standard MA(1) model? Explain. [2 MARKS]
ii. Compute the expectation $E(X_t)$ and variance $\mathrm{Var}(X_t)$ of the process $X_t$.
[3 MARKS]
iii. Determine if the process is weakly stationary and explain. [3 MARKS]
(c) Consider the ARIMA(p, d, q) process:
$$(1 - 4B + 3B^2)X_t = (1 + B - B^2)Z_t.$$
Specify the values of p, d, and q. Is this process weakly stationary? Explain.
[2 MARKS]
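For reference, Question 5 relies on two standard definitions that the paper does not restate. The sketch below assumes the usual textbook conventions; the symbols μ, σ² and γ_k are introduced here for illustration and do not appear in the paper.

```latex
% Weak (second-order) stationarity of a process {X_t}: for all t and every lag k,
\mathrm{E}(X_t) = \mu, \qquad
\mathrm{Var}(X_t) = \sigma^2 < \infty, \qquad
\mathrm{Cov}(X_t, X_{t+k}) = \gamma_k \quad \text{(depends on } k \text{ only)}

% Backshift operator B, as used in part (c)
B X_t = X_{t-1}, \qquad B^k X_t = X_{t-k}
```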
6. The ratdrink data consist of five weekly measurements of body weight for 27 randomly
selected rats. 10 rats are on a control treatment, seven rats have thyroxine added to
their drinking water, and the remaining 10 rats have thiouracil added to their water.
The aim of the study is to explore the effect of the treatment on body weight. The
data consist of four variables:
wt: weight of the rat (numeric)
weeks: number of the week when the rat is measured (integer, 0-4)
subject: the unique identifier for each rat (factor, 1-27)
treat: treatment given to the rat (control/thyroxine/thiouracil)
(a) A model was fitted to this data using the following R code:
mod1 <- lmer(wt ~ weeks + treat + (weeks | subject), data=ratdrink)
Write down the statistical model corresponding to this code. Clearly define your
notation and any distributional assumptions made for the model. [8 MARKS]
(b) The following models were also fitted to the data:
mod2 <- lmer(wt ~ weeks + treat + (1 | subject), data=ratdrink)
mod3 <- lmer(wt ~ weeks + treat + (1 | subject) + (0 + weeks | subject), data=ratdrink)
mod4 <- lm(wt ~ weeks + treat, data=ratdrink)
Explain how each of these models differs from the model in part (a). Be sure to make
reference to the parameters from the model in part (a). [5 MARKS]
(c) Partial output of fitting model 1 is given below.
> fixef(mod1)
    (Intercept)           weeks treatthiouracil  treatthyroxine
     54.2434978      23.1814815       0.9067537      -0.5202826
> ranef(mod1)
$subject
  (Intercept)        weeks
1  2.11474197  5.215274346
2  6.23082799  6.001228567
3 -5.33930997  9.398021552
4 -8.24640633  5.104327562
5  1.79522833 -0.005791506
Write down the estimated regression equation for rat 3. [2 MARKS]
(d) Consider a rat not included in the study. Based on model 1, write down an
equation giving the predicted values for its weights in each of the five successive
weeks immediately after a specific treatment is given. [2 MARKS]
END OF QUESTION PAPER.