UNIVERSITY OF SOUTHAMPTON MATH2010W1
SEMESTER 1 EXAMINATION 2021/22
MATH2010 Statistical Modelling I
Duration: 120 minutes (plus 30 minutes to upload PDF solutions to Blackboard)
This paper contains FOUR questions.
Answer ALL questions.
An outline marking scheme is shown in brackets for each question.
All students: This is an open book assessment. You may consult books, notes or internet
sources. You are permitted to use calculators or mathematical software, but to obtain full
marks, you must show and explain your working (including indicating any software commands used) as well as the final answer. Answers should be written by hand.
The assessment must be carried out in accordance with the University Academic Integrity
regulations. It is not permitted to communicate with anyone else (be it private or online)
about the content of this exam during the whole time it is open.
Start a new question on a fresh sheet of paper.
Make sure your page is in portrait orientation.
Write in Black or Blue pen.
On each page, write your page number in the top left and your module and student ID
in the top right.
In addition to the given duration of this paper (2 hours), you have 30 minutes to scan
and upload your work.
It is your responsibility to upload the correct file(s) to the correct link. No marks will be
given for the assessment if the wrong file is uploaded.
Copyright 2022 University of Southampton Page 1 of 11
Contains Answers
If you experience issues
… with your internet access
You should take screenshots of your computer screen or of your internet service provider’s
status. If possible, ensure that photographs include the time they were taken. These
should be submitted with a request for Special Consideration.
… during file upload within the assessment window
You should take a photograph or screenshot of the error notification that clearly displays
the time, and email your completed work to D.Woods@soton.ac.uk and
maths-studentoffice@soton.ac.uk before the end of the assessment window.
If you do not upload your file within the assessment window, you should send it as soon as
possible after the assessment to D.Woods@soton.ac.uk and
maths-studentoffice@soton.ac.uk together with any supporting information to explain
why it is late and a request for Special Consideration. The Special Considerations Board
will decide whether or not to recommend to the Board of Examiners that they accept and
mark your late work.
Note: No academic enquiries will be answered by staff and no amendments to papers will
be issued during the examination. If you believe there is a misprint/typo, note it in your
submission but answer the question as written.
1. [25 marks] Consider the following simple linear regression model without intercept:
\[ Y_i = \beta x_i + \varepsilon_i, \tag{1} \]
$i = 1, \ldots, n$, where $\varepsilon_i \sim N(0, \sigma^2)$.
(a) [5 marks] Show that the least-squares estimator for $\beta$ has the form:
\[ \hat{\beta} = \frac{\sum_{i=1}^{n} Y_i x_i}{\sum_{i=1}^{n} x_i^2}. \]
Show that $\hat{\beta}$ is unbiased, and compute its variance.
(b) [2 marks] Given explanatory variables $x_1, \ldots, x_n$ and associated response
variables $y_1, \ldots, y_n$, it is common to center the response and explanatory
variables, i.e. to compute
\[ Y_i' = Y_i - \bar{y}, \qquad x_i' = x_i - \bar{x}. \]
Consider the alternative model:
\[ Y_i' = \beta' x_i' + \varepsilon_i. \tag{2} \]
Derive the least-squares estimator for $\beta'$ in terms of $x_i$ and $Y_i$.
(c) [2 marks] Explain the difference between an estimator and an estimate. Write
down the estimate corresponding to $\beta'$ from model (2).
(d) [4 marks] Recall that under the simple linear regression model
\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{3} \]
the estimate for $\beta_1$ is given by
\[ b_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}. \tag{4} \]
Show that the estimate obtained in (c) is equal to $b_1$.
     Newspaper   Sales
1         69.2    22.1
2         45.1    10.4
3         69.3     9.3
4         58.5    18.5
5         58.4    12.9
Table 1: Advertising dataset
(e) The dataset in Table 1 gives the sales ($y_i$; in £) and the advertising budget spent
on newspaper advertisements ($x_i$; in £) for a variety of products.
These can be loaded into R using the command:
advertising.data = data.frame(
Newspaper=c(69.2, 45.1, 69.3, 58.5, 58.4),
Sales = c(22.1, 10.4, 9.3, 18.5, 12.9)
)
(i) [8 marks] Compute a 95% confidence interval for $\beta$ in model (1), and $\beta'$ in
model (2). Test the null hypothesis that $\beta = 0$, and the null hypothesis that
$\beta' = 0$, at the 95% level. You may find the following outputs from R helpful:
qt(0.95, 4)
## [1] 2.13
qt(0.95, 3)
## [1] 2.35
qt(0.975, 4)
## [1] 2.78
qt(0.975, 3)
## [1] 3.18
(ii) [4 marks] Consider the location $x_0 = 0$, and let $x_0' = x_0 - \bar{x}$. Let
$Y_0 = \beta x_0 + \varepsilon_0$ and $Y_0' = \beta' x_0' + \varepsilon_0$. Write down the conditional means
$E(Y_0)$ and $E(Y_0')$. Considering the computed conditional means, do you think
model (1) or model (2) is more appropriate for this data? Considering the
confidence intervals computed above, do you believe there is a significant
relationship between budget spent on newspaper advertisements and sales?
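As a rough illustration of how models (1) and (2) might be fitted by computer, the following R sketch uses `lm()` with the intercept suppressed. The object names `fit1`, `fit2` and `centred` are hypothetical, and the sketch is not the official solution. In particular, `lm()` assigns model (2) $n - 1 = 4$ residual degrees of freedom; whether a hand calculation should instead use $n - 2 = 3$ (to account for centring) is an assumption the reader should check against the course notes.

```r
# Advertising data from Table 1.
advertising.data <- data.frame(
  Newspaper = c(69.2, 45.1, 69.3, 58.5, 58.4),
  Sales = c(22.1, 10.4, 9.3, 18.5, 12.9)
)

# Model (1): regression through the origin ("0 +" removes the intercept).
fit1 <- lm(Sales ~ 0 + Newspaper, data = advertising.data)

# Model (2): centred response and centred explanatory variable, no intercept.
centred <- data.frame(
  x = advertising.data$Newspaper - mean(advertising.data$Newspaper),
  y = advertising.data$Sales - mean(advertising.data$Sales)
)
fit2 <- lm(y ~ 0 + x, data = centred)

# 95% confidence intervals as computed by R (both use 4 residual df here).
ci1 <- confint(fit1, level = 0.95)
ci2 <- confint(fit2, level = 0.95)

# For comparison: the slope of the usual model with an intercept, model (3).
slope3 <- unname(coef(lm(Sales ~ Newspaper, data = advertising.data))[2])
```

Printing `ci1`, `ci2` and `summary(fit1)` gives the intervals and t-statistics to compare with a calculation by hand.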
2. [25 marks] Consider the usual multiple linear regression model
\[ Y = X\beta + \varepsilon, \]
with $Y$ an $n \times 1$ response vector, $X$ an $n \times (k + 1)$ design matrix, $\beta$ a $(k + 1)$-vector
of unknown parameters and $\varepsilon \sim N(0, \sigma^2 I_n)$. Assume least squares will be used to
estimate $\beta$.
(a) [2 marks] Describe two diagnostic plots that are commonly used to check the
assumptions that underpin a linear model, and discuss which assumptions they
are intended to verify.
(b) [2 marks] Consider the following two plots. Explain what evidence of deviation
from the model assumptions is shown in each plot.
[Figure: two diagnostic plots. (i) Standardised residuals against fitted values. (ii) Sample quantiles against theoretical quantiles.]
(c) [8 marks] The leverage $\ell_i$ of the $i$th observation is defined to be the derivative of
the $i$th fitted value with respect to the $i$th observation, i.e.:
\[ \ell_i = \frac{\partial \hat{y}_i}{\partial y_i}. \]
Show that $\ell_i$ is given by
\[ \ell_i = x_i^\top (X^\top X)^{-1} x_i. \]
(d) [2 marks] A data point $(x_i, y_i)$ is said to be influential if the regression line
changes significantly when that point is excluded from the analysis. There can be
more than one influential point for a regression model. Consider a scatter plot of
leverages on the x-axis against standardised residuals on the y-axis such as the
figure below.
[Figure: scatter plot of standardised residuals against leverage, with four points labelled A, B, C and D.]
Which of the labelled points on the figure above do you think could be influential?
Explain why.
(e) Consider the “leave-one-out” procedure, in which observation $i$ is left out of the
regression analysis. Let $b_{(-i)}$ denote the parameter vector obtained when
observation $i$ is left out, and recall that we have the following formula for the
change in the parameter vector:
\[ \delta_i := b - b_{(-i)} = \frac{(X^\top X)^{-1} x_i}{1 - \ell_i} \times r_i, \]
where $r_i$ is the $i$th residual, $r_i = y_i - \hat{y}_i$. Consider the simple linear regression
problem $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. In this case, $\delta_i$ is a vector of length 2,
$\delta_i = [\delta_{i,0}, \delta_{i,1}]$. The dataset in Table 2 gives 4 observations of the response $y_i$,
with explanatory variable $x_i$, as well as the values of $\ell_i$ and $\delta_i$ for all but the 4th
observation.

i    x_i     y_i      ℓ_i      δ_{i,0}    δ_{i,1}
1   -1.0   0.7569   0.4237    0.0178    -0.0045
2   -0.9   0.6948   0.4070   -0.0325     0.0080
3    1.0   1.2881   0.2500    0.0133    -0.0001
4    5.0   2.2798
Table 2: Dataset for Q2.

The data can be read into R with the command:
data = data.frame(
x = c(-1, -0.9, 1, 5),
y = c(0.7569, 0.6948, 1.2881, 2.2798)
)
(i) [8 marks] Compute $\ell_4$, $\delta_{4,0}$ and $\delta_{4,1}$. If you use R, or other software, to
answer this question, please clearly state the commands you use.
(ii) [3 marks] Do you think the fourth point is more influential than the first three
points? Explain why, or why not, particularly commenting on the values of $\ell_4$,
$\delta_{4,0}$ and $\delta_{4,1}$ in your answer.
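The leverage and leave-one-out formulas quoted in this question can be evaluated directly with matrix algebra. The R sketch below is one possible way to do so, not the official solution; the names `XtX_inv`, `leverage` and `delta` are hypothetical. It assumes the simple linear regression model with an intercept, as stated in part (e).

```r
data <- data.frame(
  x = c(-1, -0.9, 1, 5),
  y = c(0.7569, 0.6948, 1.2881, 2.2798)
)

fit <- lm(y ~ x, data = data)
X <- model.matrix(fit)           # n x 2 design matrix: intercept column and x
XtX_inv <- solve(t(X) %*% X)     # (X'X)^{-1}

# Leverages: diagonal entries of the hat matrix X (X'X)^{-1} X'.
leverage <- diag(X %*% XtX_inv %*% t(X))

# Ordinary residuals r_i = y_i - fitted value.
r <- residuals(fit)

# Row i of delta holds (X'X)^{-1} x_i * r_i / (1 - leverage_i).
# (X'X)^{-1} is symmetric, so row i of X %*% XtX_inv is ((X'X)^{-1} x_i)'.
delta <- (X %*% XtX_inv) * (r / (1 - leverage))
colnames(delta) <- c("delta0", "delta1")
```

The first three rows of `leverage` and `delta` can be compared with Table 2, and row 4 against a direct refit such as `coef(fit) - coef(lm(y ~ x, data = data[-4, ]))`.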
3. [25 marks] In a particular data set, n = 20 observations were made on a response
variable (y) and three explanatory variables (x1, x2 and x3).
(a) [6 marks] Partial outputs from the summary command in R applied to lm objects
regressing the response on each of the explanatory variables separately are
given below. From this output, calculate AIC for each of the three models.
##
## Call:
## lm(formula = y ~ x1)
##
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  13.3004     0.5331   24.95  2.1e-15
## x1            0.0989     0.5857    0.17     0.87
##
## Residual standard error: 2.33 on 18 degrees of freedom
## F-statistic: 0.0285 on 1 and 18 DF, p-value: 0.868
##
## Call:
## lm(formula = y ~ x2)
##
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   13.334      0.262   50.96  < 2e-16
## x2             2.255      0.308    7.32  8.5e-07
##
## Residual standard error: 1.17 on 18 degrees of freedom
## F-statistic: 53.6 on 1 and 18 DF, p-value: 8.51e-07
##
## Call:
## lm(formula = y ~ x3)
##
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   13.302      0.250   53.18  < 2e-16
## x3             2.339      0.301    7.77  3.7e-07
##
## Residual standard error: 1.12 on 18 degrees of freedom
## F-statistic: 60.3 on 1 and 18 DF, p-value: 3.73e-07
(b) [2 marks] Given that AIC for the null model (containing no explanatory variables) is
31.783, which is the first variable that you would include when building a model
using forward selection and AIC?
(c) [2 marks] A multiple linear regression model containing all three variables is now
fitted using lm, and partial output from the summary command is given below.
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   13.186      0.244   54.09   <2e-16
## x1             0.515      0.281    1.83    0.086
## x2            -1.192      3.171   -0.38    0.712
## x3             3.660      3.227    1.13    0.273
##
## Residual standard error: 1.06 on 16 degrees of freedom
Using t-tests, which variables can be individually dropped from the model without
adversely affecting the quality of the fit?
(d) [5 marks] Give a possible explanation for the contradiction between your results
from parts (b) and (c).
(e) [4 marks] Given that the sample variance of the response is $s_y^2 = 5.157$, what is
the adjusted $R^2$ value for the above multiple regression model? Is the model a
good fit?
(f) [6 marks] Test the significance of the multiple regression model for these data;
that is, compare it to the null model. You may find the following quantities from R
useful:
qf(0.95, 4, 16)
## [1] 3.01
qf(0.95, 3, 16)
## [1] 3.24
qf(0.95, 3, 20)
## [1] 3.1
4. [25 marks] Every year Forbes produces a data set of the top 2000 companies
world-wide, ranked by quantities such as Sales and Assets, across different market
sectors, countries etc. A linear model was fitted to the 2017 data, regressing
log(Sales) against log(Assets) (a quantitative variable) and Sector (a qualitative
factor with 11 levels). The following analysis of variance table was obtained.
                     Df    Sum Sq   Mean Sq   F value   Pr(>F)
log(Assets)                561.84                       7.7510e-136
Sector                     893.35                       1.9527e-188
log(Assets):Sector          38.68                       3.7597e-07
Residuals                 1539.23
(a) [6 marks] Write down each of the nested model comparisons summarised in this
table.
(b) [8 marks] Complete the ANOVA table. Give reasons for your choices for each
entry in the “DF” column.
(c) [1 mark] What proportion of variation is explained by the model?
(d) [10 marks] Simplified output from the summary command from the fitted lm
object in R for this model is given on the next page. The units of each of Sales
and Assets are billions of dollars; in the data set, Apple had sales of 217.5 and
assets of 331.1, both in billions of dollars, larger than any other “Information
Technology” company.
What is the equation for the estimated mean response (log(Sales)) from a
company in the “Information Technology” sector in terms of log(Assets)? What
is it for the “Consumer Discretionary” sector? Sketch a plot with these two
equations. At what value of Assets is an information technology company
predicted to have greater mean sales than a company in the consumer
discretionary sector?
The company with largest sales in the data set is Wal-Mart (a consumer
discretionary company), with sales of 485.3. According to the regression model,
how large would Apple’s assets need to be to surpass Wal-Mart’s sales? What
caveats do you have about this prediction?
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4227 0.1535 2.75 0.0060
log(Assets) 0.3664 0.0551 6.65 0.0000
SectorConsumer Discretionary 0.2437 0.2176 1.12 0.2629
SectorConsumer Staples 0.1977 0.2791 0.71 0.4788
SectorEnergy 0.3432 0.2664 1.29 0.1978
SectorFinancials -1.4278 0.1964 -7.27 0.0000
SectorHealth Care -0.3529 0.2928 -1.21 0.2283
SectorIndustrials -0.0717 0.2697 -0.27 0.7904
SectorInformation Technology -0.3698 0.2678 -1.38 0.1675
SectorMaterials 0.2063 0.2720 0.76 0.4483
SectorTelecommunication Services -0.4817 0.4264 -1.13 0.2587
SectorUtilities -0.7474 0.3709 -2.01 0.0441
log(Assets):SectorConsumer Discretionary 0.3180 0.0755 4.21 0.0000
log(Assets):SectorConsumer Staples 0.3815 0.0999 3.82 0.0001
log(Assets):SectorEnergy 0.1716 0.0810 2.12 0.0342
log(Assets):SectorFinancials 0.2741 0.0615 4.46 0.0000
log(Assets):SectorHealth Care 0.4674 0.0970 4.82 0.0000
log(Assets):SectorIndustrials 0.3847 0.0913 4.21 0.0000
log(Assets):SectorInformation Technology 0.4947 0.0944 5.24 0.0000
log(Assets):SectorMaterials 0.2309 0.0956 2.42 0.0158
log(Assets):SectorTelecommunication Services 0.4140 0.1228 3.37 0.0008
log(Assets):SectorUtilities 0.3852 0.1125 3.42 0.0006
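To illustrate how sector-specific lines are read off a table like this, the R sketch below combines the baseline coefficients with the sector and interaction terms for two sectors. It is a hedged sketch rather than the official solution. It assumes R's default treatment contrasts (so the sector not listed in the output is the baseline absorbed into `(Intercept)`) and that `log` denotes the natural logarithm, as in R. The variable names are hypothetical.

```r
# Baseline coefficients copied from the summary table above.
base_intercept <- 0.4227
base_slope <- 0.3664

# Sector line = baseline + sector main effect (intercept) + interaction (slope).
it <- c(intercept = base_intercept - 0.3698, slope = base_slope + 0.4947)
cd <- c(intercept = base_intercept + 0.2437, slope = base_slope + 0.3180)

# log(Assets) at which the two fitted lines cross, and the matching Assets value.
log_assets_cross <- unname((cd["intercept"] - it["intercept"]) /
                           (it["slope"] - cd["slope"]))
assets_cross <- exp(log_assets_cross)

# Assets at which the Information Technology line reaches log(Sales) = log(485.3).
assets_needed <- unname(exp((log(485.3) - it["intercept"]) / it["slope"]))
```

Because the Information Technology slope is the larger of the two, its fitted line lies above the Consumer Discretionary line for Assets beyond `assets_cross`. Any value of `assets_needed` well outside the observed range of assets is an extrapolation and should be treated with caution.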
END OF PAPER
Learning objectives:
LO1 Use the theory of linear models and matrix algebra to investigate standard and non-standard problems.
LO2 Interpret the output from an analysis including the meaning of interactions and terms
based on qualitative factors.
LO3 Understand how to make a critical appraisal of a fitted model.
LO4 Carry out t-tests and calculate confidence intervals by hand and by computer.
LO5 Use a variety of procedures for variable selection.
LO6 Fit multiple regression models using the adopted software package.
LO7 Carry out simple linear regression by computer.
LO6 and LO7 are assessed via coursework.
Solutions
1. LO1, LO3, LO4; unseen example; part (a) is bookwork; CT18 31203-31209.
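(a) [5 marks] We begin by writing down the least-squares criterion for $\beta$:
\[ S(\beta) = \sum_{i=1}^{n} (Y_i - \beta x_i)^2. \]
The derivative is given by [1 mark]
\[ \frac{dS}{d\beta} = -2 \sum_{i=1}^{n} (Y_i - \beta x_i) \times x_i = -2 \sum_{i=1}^{n} Y_i x_i + 2\beta \sum_{i=1}^{n} x_i^2. \]
Setting this to zero we obtain the normal equation: [1 mark]
\[
\begin{aligned}
-2 \sum_{i=1}^{n} Y_i x_i + 2\hat{\beta} \sum_{i=1}^{n} x_i^2 &= 0 \\
\implies 2\hat{\beta} \sum_{i=1}^{n} x_i^2 &= 2 \sum_{i=1}^{n} Y_i x_i \\
\implies \hat{\beta} &= \frac{\sum_{i=1}^{n} Y_i x_i}{\sum_{i=1}^{n} x_i^2}.
\end{aligned}
\]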
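As an optional numerical sanity check on the closed form just derived, the R sketch below minimises $S(\beta)$ numerically for the Table 1 data and compares the result with the formula and with `lm()`. It is only an illustration: the search interval passed to `optimize()` is an arbitrary assumption, and the variable names are hypothetical.

```r
# Advertising data from Table 1 (x = Newspaper budget, y = Sales).
x <- c(69.2, 45.1, 69.3, 58.5, 58.4)
y <- c(22.1, 10.4, 9.3, 18.5, 12.9)

# Sum-of-squares criterion S(beta) for the no-intercept model (1).
S <- function(beta) sum((y - beta * x)^2)

# Closed-form least-squares estimate: sum(x * y) / sum(x^2).
beta_closed <- sum(x * y) / sum(x^2)

# Numerical minimiser of S over an illustrative interval (an assumption).
beta_numeric <- optimize(S, interval = c(-10, 10))$minimum

# The same estimate from lm(); "0 +" removes the intercept.
beta_lm <- unname(coef(lm(y ~ 0 + x)))
```

The three values should agree up to the tolerance of the numerical optimiser.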
To show that $\hat{\beta}$ is unbiased, we take its expectation: [2 marks]
\[
E(\hat{\beta}) = \frac{\sum_{i=1}^{n} E(Y_i) x_i}{\sum_{i=1}^{n} x_i^2}
= \frac{\sum_{i=1}^{n} \beta x_i \times x_i}{\sum_{i=1}^{n} x_i^2}
= \beta \frac{\sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} x_i^2}
= \beta
\]
so that it is unbiased. To compute its variance: [1 mark]
\[ \operatorname{Var}(\hat{\beta}) = \operatorname{Var}\left( \frac{\sum_{i=1}^{n} x_i Y_i}{\sum_{i=1}^{n} x_i^2} \right) \]