
Main Examination period 2021 – May/June – Semester B
Online Alternative Assessments
MTH6991: Computational Statistics with R
You should attempt ALL questions. Marks available are shown next to the questions.
In completing this assessment:
You may use books and notes.
You may use calculators and computers, but you must show your working
for any calculations you do.
You may use the Internet as a resource, but not to ask for the solution to an
exam question or to copy any solution you find.
You must not seek or obtain help from anyone else.
All work should be handwritten and should include your student number.
You have 24 hours to complete and submit this assessment. When you have finished:
scan your work, convert it to a single PDF file, and submit this file using the
tool below the link to the exam;
e-mail a copy to maths@qmul.ac.uk with your student number and the module
code in the subject line;
with your e-mail, include a photograph of the first page of your work together
with either yourself or your student ID card.
You are expected to spend about 2 hours to complete the assessment, plus the time
taken to scan and upload your work. Please try to upload your work well before the
end of the submission window, in case you experience computer problems. Only one
attempt is allowed – once you have submitted your work, it is final.
Examiners: J. Griffin, H. Maruri-Aguilar
Queen Mary University of London (2021) Continue to next page
MTH6991 (2021) Page 2
Some questions use digits from your 9-digit ID number. The digits used are A, B and
C, the third-to-last, second-to-last and last digits of your ID number . . .ABC.
Question 1 [30 marks]. Let A, B and C be the last three digits of your ID number.
Consider the samples x = (9,10+ A) and y = (7,20+ B,30+ C) from two
populations. We want to know if the mean in the population associated with the first
sample is different from the population mean for the second sample. We are not
prepared, however, to assume that the data are normally distributed.
(a) Suppose we want to perform a permutation test. State an appropriate null
hypothesis and a test statistic. Carry out this test at the 10% significance level to
test the hypothesis. In your answer, calculate the full null distribution. [13]
(b) Calculate the value of the Mann-Whitney statistic U_X for the two samples x and y. [4]
(c) In R, the function dwilcox calculates the probability mass function of the null
distribution for the Mann-Whitney statistic. Suppose that this function is run
and outputs the following:
> dwilcox(0:3, m=2, n=3)
[1] 0.1 0.1 0.2 0.2
Use this output to calculate a p-value for the same comparison as part (a). [5]
(d) Suppose that the x and y samples were of the same size as each other, m = n. If
we carried out a Mann-Whitney test with a two-sided significance level of 5%,
what is the smallest m that would allow us to reject the null hypothesis? [6]
(e) If the samples were stored in x and y in R, what would be the purpose of the
following code?
length(c(x, y)) == length(unique(c(x, y))) [2]
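For readers who want to see how the full null distribution in part (a) can be enumerated in R, here is a minimal sketch. The sample values are hypothetical (they are not derived from any ID number), the difference in sample means is only one possible choice of test statistic, and the names pooled, stat, splits and null_dist are introduced purely for illustration.

```r
# Hypothetical samples, for illustration only
x <- c(9, 12)
y <- c(7, 25, 33)

pooled <- c(x, y)
m <- length(x)

# Assumed test statistic: difference in sample means for a given relabelling,
# where idx holds the positions in pooled that are labelled as "x"
stat <- function(idx) mean(pooled[idx]) - mean(pooled[-idx])

# Every way of choosing which m of the pooled values are labelled "x"
splits <- combn(length(pooled), m)
null_dist <- apply(splits, 2, stat)

observed <- mean(x) - mean(y)

# Two-sided p-value: proportion of relabellings at least as extreme as observed
# (small tolerance guards against floating-point ties)
p_value <- mean(abs(null_dist) >= abs(observed) - 1e-12)

# Full null distribution as relative frequencies, then the p-value
table(round(null_dist, 3)) / length(null_dist)
p_value
```

With samples of size 2 and 3 there are only choose(5, 2) = 10 relabellings, so the smallest attainable p-value is limited by the sample sizes rather than by the data.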
Question 2 [14 marks].
Let (x1,y1), . . . , (xn,yn) be n pairs of data from continuous distributions. The null
hypothesis is that the differences xi - yi, i = 1, . . . ,n, have a distribution which is
symmetric about zero. Let the test statistic K be the number of pairs for which
xi - yi > 0.
(a) For sample size n = 3, using the assumptions of symmetry about zero and
continuous data, calculate the null distribution for K. In other words, calculate
P(K = k) for each k for which this probability is non-zero. [10]
(b) Let A, B and C be the last three digits of your ID number. Suppose that the
observed data are
(x1,y1), (x2,y2), (x3,y3) = (A,10+ B), (9,10+ C), (14,2)
Using the null distribution found in part (a), calculate the one-sided p-value
testing if the first member of each pair tends to be greater than the second
member. [4]
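The null distribution of a sign-count statistic such as K can be checked numerically by listing every sign pattern. The sketch below assumes that, under symmetry about zero and continuity, each difference is positive with probability 1/2 independently of the others; the names signs and null_K are illustrative.

```r
n <- 3

# All 2^n equally likely sign patterns (1 = positive difference, 0 = negative)
signs <- expand.grid(rep(list(c(0, 1)), n))

# K for each pattern: the number of positive differences
K <- rowSums(signs)

# Relative frequency of each value of K over the equally likely patterns
null_K <- table(K) / nrow(signs)
null_K

# Cross-check against the Binomial(n, 1/2) probability mass function
all.equal(as.numeric(null_K), dbinom(0:n, size = n, prob = 0.5))
```

A one-sided p-value for an observed count k can then be read off as the sum of the null probabilities for values of K at least as large as k.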
Question 3 [24 marks].
Suppose that we have bivariate data of the form (y1,x1), . . . , (yn,xn). We wish to fit
models of the form E(Yi) = f (xi,β), where f is a known functional form and β is a
vector of parameters to be estimated.
(a) Describe the procedure for using leave-one-out cross-validation to obtain both a
set of predictions ŷ[1], . . . , ŷ[n], and the predicted or cross-validated residuals. [7]
(b) Assume now that f (xi,β) is a linear model, and let H be the hat matrix when the
model is fitted to the whole dataset. You do not need to state the formula for H.
(i) State the formula relating the predicted residuals to the ordinary residuals
found when fitting the model to the original dataset. How does this
formula allow us to save computing time? [5]
(ii) How do the predicted residuals compare in magnitude to the ordinary
residuals? [2]
(c) Suppose that f depends on a set of spline functions, λ > 0 is a smoothing
parameter, and we estimate β by minimizing the penalized sum of squares
$$S_P = \sum_{i=1}^{n} \bigl(y_i - f(x_i,\beta)\bigr)^2 + \lambda \int_{-\infty}^{\infty} f''(x,\beta)^2 \, dx$$
The answers do not need any details about spline functions.
Suppose that we have fitted this model for a range of values of λ and calculated
the PRESS statistic each time, with the results as plotted below. The PRESS
statistic is defined as
$$\mathrm{PRESS} = \sum_{i=1}^{n} e_{[i]}^2,$$
where e[i] is the ith predicted residual.
Explain how this graph would be used to select a value of λ. By doing this,
what feature of the fitted model are we selecting for? Why would PRESS
initially decrease as λ increases for small values of λ? [10]
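As an illustration of the quantities in this question, the following sketch computes leave-one-out predicted residuals and the PRESS statistic for a simple linear model fitted with lm, and compares them with the standard single-fit identity e[i] = e_i / (1 - h_ii), where h_ii is the ith diagonal element of H. The data are simulated and hypothetical, and the straight-line model is an assumption made only to keep the example short.

```r
set.seed(1)

# Hypothetical simulated data, for illustration only
n <- 20
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.3)
dat <- data.frame(x = x, y = y)

# Explicit leave-one-out: refit n times, each time predicting the held-out point
e_loo <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- lm(y ~ x, data = dat[-i, ])
  y_hat_i <- predict(fit_i, newdata = dat[i, , drop = FALSE])
  e_loo[i] <- dat$y[i] - y_hat_i
}

# Single fit to the whole dataset: ordinary residuals and hat-matrix diagonal
fit <- lm(y ~ x, data = dat)
e_short <- residuals(fit) / (1 - hatvalues(fit))

# PRESS statistic from the predicted residuals
press <- sum(e_loo^2)
press

# The two routes to the predicted residuals should agree up to rounding error
all.equal(unname(e_short), e_loo)
```

The explicit loop needs n model fits, whereas the second route needs only one, which is why the identity matters when PRESS has to be recomputed for many candidate models.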
Question 4 [32 marks].
Suppose that we have two samples x = (x1, . . . ,xm) and y = (y1, . . . ,yn), which in R
are stored in vectors named x and y, respectively. Consider the following R code:
N = 5000
v = vector(length=N)
for(i in 1:N){
xb = sample(x, replace=TRUE)
yb = sample(y, replace=TRUE)
v[i] = mean(xb) - mean(yb)
}
sd(v)
a = 0.025
k = floor(a*(N+1))
sv = sort(v)
c(sv[k], sv[N+1-k])
(a) What is the name for the statistical procedure that the code in the loop is
carrying out? [3]
(b) Explain what each of the three lines of code inside the loop is doing. [7]
(c) In statistical terms, what will the command sd(v) output? [3]
(d) In statistical terms, what will the last line of code output? [4]
(e) What does the statistical method assume about:
(i) the observations in the sample x [3]
(ii) the relationship between the samples x and y [3]
Suppose now that the first two lines of code inside the loop are replaced by:
xb = rgamma(length(x), shape=ax, rate=bx)
yb = rgamma(length(y), shape=ay, rate=by)
where it is assumed that ax, bx, ay, by have been given values in previous code.
(f) With that change, what is the name of the statistical procedure? [2]
(g) Explain what the line starting with xb = rgamma is doing. [4]
(h) What does this method assume about the observations in the sample x [3]
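To experiment with the first version of the code in this question, it can be run end to end once x and y exist. The sketch below supplies hypothetical sample values and a fixed seed (both are assumptions for illustration, as is the name interval) and otherwise follows the code as printed.

```r
set.seed(42)  # fixed seed so that the sketch is reproducible

# Hypothetical samples, for illustration only
x <- c(4.1, 5.3, 2.8, 6.0, 3.7, 5.1)
y <- c(2.2, 3.9, 1.5, 2.7, 3.1)

N <- 5000
v <- numeric(N)
for (i in seq_len(N)) {
  # Draw length(x) and length(y) values, with replacement, from each sample
  xb <- sample(x, replace = TRUE)
  yb <- sample(y, replace = TRUE)
  v[i] <- mean(xb) - mean(yb)
}

a <- 0.025
k <- floor(a * (N + 1))
sv <- sort(v)
interval <- c(sv[k], sv[N + 1 - k])

sd(v)
interval
```

One practical caveat: if x were a single number greater than or equal to 1, sample(x, replace = TRUE) would draw from 1:x rather than from the value itself, so code of this kind is only safe for vectors of length two or more.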
End of Paper – An appendix of 1 page follows.
Appendix: Normal distribution function
Table 1: The standard normal cumulative distribution function (cdf) Φ(x) for the given
values of x. The cdf for x < 0 can be found using the fact that Φ(x) = 1 − Φ(−x).
For x ≥ 3.8, 1 − Φ(x) < 10^−4.

  x    Φ(x)      x    Φ(x)      x    Φ(x)      x    Φ(x)      x    Φ(x)
 0.0   0.500    0.8   0.788    1.6   0.945    2.4   0.992    3.2   0.9993
 0.1   0.540    0.9   0.816    1.7   0.955    2.5   0.994    3.3   0.9995
 0.2   0.579    1.0   0.841    1.8   0.964    2.6   0.995    3.4   0.9997
 0.3   0.618    1.1   0.864    1.9   0.971    2.7   0.997    3.5   0.9998
 0.4   0.655    1.2   0.885    2.0   0.977    2.8   0.997    3.6   0.9998
 0.5   0.691    1.3   0.903    2.1   0.982    2.9   0.998    3.7   0.9999
 0.6   0.726    1.4   0.919    2.2   0.986    3.0   0.999    3.8   0.9999
 0.7   0.758    1.5   0.933    2.3   0.989    3.1   0.999

End of Appendix.