程序案例-MATH6173 - Father Essays

MATH6173 Statistical Computing — Coursework 1
Instructions for Coursework 1
1. This coursework assignment is worth 50% of the overall mark for the module MATH6173.
2. The deadline is 1500 on Monday 22nd November 2021 and your completed work must be submitted
electronically via Blackboard.
3. Your submission must consist of exactly one written report including the answers and any tables
and graphs to the questions in Section 1: R Programming Part I, together with one R script file
containing all the R code you used to obtain your results for questions in both Section 1 and Section 2.
In the R script, you must use comments to make it clear to which questions the R codes are answering.
4. Your written report should be in pdf format. Please name your report report as .pdf.
For example, if your student ID is 12345678, then your report must have the filename 12345678.pdf.
5. Your R script file must have the filename .R. For example, if your student ID is
12345678, then your script must have the filename 12345678.R.
6. Your entire R script file should be reproducible and run without any errors.
7. You must present elegant and informative plots in the written report, including the use of appropriate
plotting area, points/lines sizes and colours. Also please include meaningful labels, titles and legends.
8. You must informatively comment your R code including indicating where each question and task begins.
9. You must not load any add-on packages, e.g. using function library or require.
10. The standard Faculty rules on late submissions apply: for every weekday, or part of a weekday, that
the coursework is late, the mark will be reduced by a factor of 10%. No credit will be given for work
which is more than one week late.
11. All coursework must be carried out independently. You are reminded of the University’s Academic
Integrity Policy, see https://www.southampton.ac.uk/quality/assessment/academic_integrity.page.
12. In the interests of fairness, queries which relate directly to the coursework must be made via the
MATH6173: Coursework 1 Discussion Forum on Blackboard. This ensures that everybody has access
to the same information.
Total marks available: 100
1
Section 1: R Programming Part I
Question 1. Two-sample testing
[22 marks]
The dataset in the file diab.txt is originally from the US National Institute of Diabetes and Digestive and
Kidney Diseases and contains measurements from n = 768 female individuals. The data contains a series of
eight measurements variables, and a response variable indicating whether or not a patient has diabetes. The
variables in the dataset are described in following table.
Variables Description
preg Number of times pregnant
gluc Plasma glucose concentration after 2 hours in an oral
glucose tolerance test
bp Diastolic blood pressure
skin Triceps skin fold thickness
insulin 2-Hour serum insulin
bmi Body mass index
ped Diabetes pedigree function
age Age
out Binary response variable (0 = no diabetes, 1 = diabetes)
(a) [2 marks] Import the dataset from the file diab.txt into your own work space. Calculate the following
quantities:
number of patients that have diabetes, and number of patients that do not have diabetes
median age of patients
mean body mass index for patients that have diabetes, and mean body mass index for patients that do
not have diabetes
(b) [2 marks] Produce a side by side boxplot to show the distributional information of all the eight
measurements after scaling (subtracting the mean and dividing by the standard deviation).
(c) [2 marks] We want to test whether the mean value of body mass index (BMI) are the same between
patients with diabetes and patients without diabetes. Let us use Student’s t-test here and assume the
variances of BMI in both groups are equal. List your null and alternative hypotheses, perform your test
and give the p-value. Do you reject your null hypothesis at the significance level of 0.05
(d) [6 marks] Suppose now we want to write our own code to perform a similar test in (c) without assuming
equal variance. To achieve this, we can use the Welch’s t-test. Suppose we have two independent groups
of random variables x1, . . . , xn1 and y1, . . . , yn2 . We define their sample means as:
xˉ = 1
n1
n1∑
i=1
xi, yˉ =
1
n2
n2∑
i=1
yi,
and their sample variances as:
σ 2x =
1
n1 1
n1∑
i=1
(xi xˉ)2, σ 2y =
1
n2 1
n2∑
i=1
(yi yˉ)2.
Then we can define the unbiased variance estimator as:
σ 2 = σ
2
x
n1
+
σ 2y
n2
.
2
Therefore, we obtain
t = (xˉ yˉ)
σ
,
which is the so-called Welch’s t-test statistic. Under the null hypothesis of mean equality, we have the
distribution of this test statistic can be approximated by a Student’s t-distribution with degrees of
freedom
df =
(σ 2x/n1 + σ 2y/n2)2
(σ 2x/n1)2
n1 1 +
(σ 2y/n2)2
n2 1
.
Write your own code to calculate the Welch’s t-test statistic for the same hypothesis testing problem
in (c). Give the value of this test statistic, the degrees of freedom, and the p-value of your test. Do you
reject the null hypothesis at the significance level of 0.05
(e) [3 marks] The F-distribution is a continuous probability distribution that arises frequently in probability
theory and statistics. It has two parameters, and we write X ～ Fd1,d2 if a random variable X follows
from a F -distribution with parameters d1 and d2. You can use the R functions df, pf, qf and rf to
obtain the probability density function, cumulative distribution function, quantile function and random
generation function of F -distribution. Read the help file of above functions and produce the following
four plots in one page with 2× 2 layout.
plot 1: probability density function of F8,10.
plot 2: cumulative distribution function of F8,10.
plot 3: a probability histgram for 100 independently generated random numbers from F8,10.
plot 4: empirical density function of the same 100 random numbers in plot 3.
(f) [7 marks] Suppose now we want to simultaneously test if the mean values of all the eight measurements
are equal between patients that have diabetes and patients that do not have diabetes. To achieve
this, we can use the Hotelling’s T -squared test, which is a natural generalizations of the Student’s
t-test to multivariate hypothesis testing. Suppose we have two independent groups of random p-vectors
x1, . . . ,xn1 and y1, . . . ,yn2 (all the vectors are p-dimensional). We define the their sample mean vectors
as:
xˉ = 1
n1
n1∑
i=1
xi, yˉ =
1
n2
n2∑
i=1
yi,
and their sample covariance matrices as:
Σ x =
1
n1 1
n1∑
i=1
(xi xˉ)(xi xˉ)T , Σ y = 1
n2 1
n2∑
i=1
(yi yˉ)(yi yˉ)T .
Then we can define the unbiased pooled covariance matrix as:
Σ = (n1 1)Σ x + (n2 1)Σ y
n1 + n2 2 .
Therefore, we obtain
T 2 = n1n2
n1 + n2
(xˉ yˉ)Σ 1(xˉ yˉ)T ,
which is the so-called Hotelling’s two-sample T -squared statistic. If we further assume that both
groups consists of observations that are identically and independently drawn from multivariate normal
distribution N(μ,Σ), then
T 2 ～ T 2(p, n1 + n2 2),
where T 2(p, n1 + n2 2) denotes the so-called Hotelling’s T -squared distribution with parameters p and
n1 + n2 2. If a random variable X has a Hotelling’s T -squared distribution, i.e., X ～ T 2p,m, we have
m p+ 1
pm
X ～ Fp,m p+1.
3
Let group 1 be the eight measurements for all the patients with diabetes, and let group 2 be the eight
measurements for all the patients without diabetes. List your null and alternative hypotheses. Write
your own code to calculate the Hotelling’s two-sample T -squared statistic, derive the p-value of your
test. Do you reject your null hypothesis at the significance level of 0.05
Hints:
(1) In hypothesis testing, the p-value is defined as the probability of obtaining test statistics at least as
extreme as the results actually observed, under the assumption that the null hypothesis is true.
(2) Observe the Hotelling’s two-sample T -squared statistic, if the mean difference between two groups is
small (then xˉ yˉ should be close to 0), the value of T 2 will tend to be small. Otherwise if the mean
difference between two groups is large, the value of T 2 will tend to be large as well.
Question 2: Non-parametric kernel density estimation
[28 marks]
Density estimation is the problem of reconstructing the probability density function (pdf) from a set of given
data points. Namely, we observe X1, . . . , Xn and we want to estimate the underlying unknown pdf, denoted
by f(x), generating this dataset, which is one of the central topics in (non-parametric) statistical research.
The so-called kernel density estimator (KDE) is one of the most famous method for this problem. Here we
give a formal definition of the KDE in the following formula:
f n(x) =
1
nh
n∑
i=1
K
(
Xi x
h
)
,
where f n(x) is the empirical estimation of the true pdf f(x), K(·) is called the kernel function that is generally
a smooth, symmetric function, and h > 0 is called the bandwidth that controls the amount of smoothing in
the estimation.
(a) [5 points] Consider the Gaussian kernel function:
K(y) = 1√
2pi
exp
(
y
2
2
)
,
where exp(·) is the exponential function. First, generate 200 random numbers from N(0, 1) distribution
as your dataset. Next, write your own code to estimate the pdf of N(0, 1) from the dataset, i.e.,
calculate f n(x) for x being a vector of equally spaced 250 points on [ 4, 4] with h = 0.3. Please present
your calculated f n(x) and the true pdf f(x) of N(0, 1) on the same plot, and show this plot on your
report.
(b) [10 points] There are many other forms of kernel functions, for example:
Uniform: K(y) = 12 I( 1 ≤ y ≤ 1),
Epanechnikov: K(y) = 34 max{1 y
2, 0},
where I(·) is the indicator function. Now write a function mysimpleKDE with the following features:
Arguments
(1) data is a numeric vector containing the data that we want to estimate the pdf from.
(2) h is a numeric object specifying the bandwidth of the kernel function.
(3) kernel is a character object specifying the kernel function we want to use, by default it is
“gaussian”, and we can also set it to be “epanechnikov” or “uniform”.
4
(4) range is a two-variate numeric vector specifying the starting point and end point of an interval
of x on which we want to estimate the density.
Computation
– Calculate f n(x) for x being a vector of equally spaced 250 points on the interval specified by
range.
Return
– A list containing the following elements:
(1) density is a numeric vector corresponding to the empirical probability density value for
each data point.
(2) kernel is a character object corresponding to the type of kernel function we used in the
estimation. It should be either “gaussian”, “epanechnikov” or “uniform”.
(c) [5 points] Generate 50 random numbers from N(1, 1). Use your function mysimpleKDE and multiple
for-loops to produce a 3× 3 plots, each on appropriate x-range, showing f n(x) with combinations of
three different kernel functions and different values of h ∈ {0.1, 0.3, 0.5}.
(d) [8 points] In practice, we should carefully select h to minimise the asymptotic mean square error (AMSE)
of density estimation. Let us fix the kernel function to be “gaussian”. Repeatedly generate 200 random
numbers from N(0, 1) distribution for 100 times. In each time j we can calculate the KDE f h,jn (x) at x
for different values of bandwidth h, where h is from the set {0.1, 0.15, 0.2, . . . , 1}. The corresponding
averaged mean squared error for each h can be approximated as:
A MSE(h) =
100∑
j=1
m∑
i=1
|f h,jn (xi) f(xi)|2
100m ,
Here again we fix x = (x1, . . . , xm)T being a vector of equally spaced 250 points on [ 4, 4]. Calculate
A MSE(h) for different values of h and show them in the report. What is the optimal value of h based
on your results
Hints:
(1) You can use the function seq to generate equally spaced points in a certain range.
(2) You may find the element-wise maximum function pmax and element-wise and operator & helpful.
(3) You will need to use for-loops and conditional statements in this question.
5
Section 2: R Programming Part II
Question 1: Multivariate distribution
[13 marks]
If a p-dimensional random vector X is distributed according to a multivariate normal distribution (MVN), then
we write X ～ MVN(μ,Σ), where μ is the p-dimensional mean vector and Σ is the p× p variance-covariance
matrix. The probability density function of an MVN is given by
fX(x) = (2pi)
p
2 det(Σ) 12 exp
[
12(x μ)
TΣ 1(x μ)
]
. (1)
(a) [8 marks] Write a function calcLogMVN that calculates the natural logarithm of the joint probability
density for any number of MVN random variables from the same p-dimensional MVN distribution. The
calcLogMVN function have the follwoing features.
Arguments:
(1) xMat is a numeric matrix, where the rows are independent random variables of a p-dimensional
MVN distribution.
(2) mu is a numeric vector representing μ in equation (1).
(3) Sigma is a numeric matrix representing Σ in equation (1).
Computation:
Without using the %*% operator, the calcLogMVN function calculates the natural logarithm of the
joint density of the random variables in xMat given mu and Sigma according equation (1).
Return:
A numerical value as calculated in Computation.
IMPORTANT: Heavy penalties will be applied if the following requirements are not met for this
question.
You must not use any existing functions that already calculate the density of an MVN.
For matrix multiplications, you must only use loops and operators/functions for scaler addition
and multiplication.
(b) [3 marks] For the calculations to be feasible in part (a), the arguments have certain restrictions.
mu is a numeric vector with a length of p, which is the number of columns of xMat.
Sigma is a squrare matrix.
The numbers of rows and columns of Sigma must both equal to the length of mu.
Add additional R commands at the beginning of your calcLogMVN function in part (a) to check that valid
inputs have been provided. If invalid inputs are passed to this function, then it will stop prematurely
along with an apporpriate error message.
(c) [2 marks] Use the function created in part (a) & (b) to calculate the natural logarithm of the joint density
of the random vectors X1 = ( 1.00, 0.10, 0.80), X2 = ( 0.50, 0.10, 1.25) and X3 = ( 1.20, 0.50, 0.75)
under the distribution MVN(μ,Σ) with
μ =
0.75 0.05
1.00
and Σ =
1.0 0.2 0.50.2 2.0 0.7
0.5 0.7 3.0
.
6
Question 2: Integration by the trapezoidal rule
[17 marks]
Figure 1 shows the cubic polynomial
f(x) = (x+ 3)(x 1)(x 2.5). (2)
3 2 1 0 1 2 3
10
5
0
5
10
15
x
f(x
)
a b
(a)
3 2 1 0 1 2 3
10
5
0
5
10
15
x
f(x
)
a’ b’
(b)
Figure 1: f(x) = (x+ 3)(x 1)(x 2.5). In panel (a), the area shaded in grey is the area between f(x) and
the x-axis within the x-interval [a, b], while the shaded region in panel (b) is the area between f(x) and the
x-axis within the x-interval [a′, b′]
.
When a curve f(x) is above the x-axis for the x-interval [a, b] as in Figure 1 (a), we can calculate the area
between f(x) and the x-axis within [a, b] using the trapezoidal rule for integration:
∫ b
a
f(x) ≈
N∑
i=1
f(xi) + f(xi 1)
2 (xi xi 1) (3)
with subintervals defined by a = x0 < ... < xN = b. (a) [10 marks] Write a function calcArea that calculates the area between the curve in equation (2) and the x-axis for any given x-interval [a, b]. Arguments: (1) xVec is a numeric vector with elements corresponding to x0, ..., xN in equation (3). Note: You can assume that for marking, the x-intercepts will not fall within any of the subintervals; however, the subintervals are allowed to have different lengths. The x-intercepts are the values of x that give f(x) = 0. Computation: This function calculates the area between the x-axis and the cubic polynomial in equation (2) according to the trapezoidal rule with subintervals defined by xVec. Hint: Multiply by -1 for the part(s) of the curve with f(x) < 0. Return: A numerical value which is the area as calculated in Computation. 7 (b) [5 marks] Write a function calcAreaGen that generalises the calcArea function in part (a): The calcAreaGen function have additional numeric arguments s, u, v and w, which corresponds to the parameters of the cubic polynomial f(x) = s(x u)(x v)(x w). The calcAreaGen function calculates and returns the area between the x-axis and f(x) = s(x u)(x v)(x w). (c) [2 marks] Use the function calcAreaGen created in part (b) to calculate the area between the x-axis and f(x) = 0.25(x+5)(x+1.5)(x 0.5), using the sequence a = x0 = 5.5 < 5.4 < ... < 0.9 < 1.0 = xN = b to define the subintervals. In this case the subintervals all have the same length of 0.1. IMPORTANT: You must use loops for this question, otherwise a heavy penalty will be applied. Question 3: Finding a set of unique paths [20 marks] A path in two dimensional space is a line that has a starting point at (x, y)-coordinates of (xo, yo) and an end point at (xe, ye). In this question, you can assume that xo ≤ xe. 0.5 1.0 1.5 2.0 2.1 2.2 2.3 2.4 2.5 2.6 x y (a) 0.5 1.0 1.5 2.0 2.1 2.2 2.3 2.4 2.5 2.6 x y (b) 0.5 1.0 1.5 2.0 2.1 2.2 2.3 2.4 2.5 2.6 x y (c) Figure 2: Examples of paths in two dimensional space. (a) A path with a starting point at (xo, yo) = (0.5, 2) and end point at (xe, ye) = (1.5, 2.5). (b) Two overlapping paths. The red path has a starting point at (0.5, 2) and end point at (1.2, 2.35), while the blue path has starting and end points at (1.0, 2.25) and (1.6, 2.55) respectively. (c) The path formed by merging together the red and blue paths in (b). (a) [5 marks] Write a function getYIntAndSlope, which calculates the y-intercept and the slope of a path. Thus, the function will have the following features. Arguments: xo is a numeric object representing the x-coordinate of the starting point (xo) of a path, yo is a numeric object representing the y-coordinate of the starting point (yo) of a path, xe is a numeric object representing the x-coordinate of the end point (xe) of a path, and ye is a numeric object representing the y-coordinate of the end point (ye) of a path. Computations: This function calculates the y-intercept and the slope of the path given by the starting and end points as specified by the arguments. Returns: 8 A numeric vector, where its first element is the y-intercept and the second is the slope. Note: the y-intercept and slope in the vector must be in that order. (b) [15 marks] Write a function getPathSet that extracts the unique set of paths. Specifically, it would have the following features. Arguments: (1) pathMat is a numeric matrix with four columns. Each row specifies a path in two dimensional space, while the columns specify the following information. column 1: the x-coordinate of the starting point (xo), column 2: the y-coordinate of the starting point (yo), column 3: the x-coordinate of the end point (xe), and column 4: the y-coordinate of the end point (ye). For example, if the path in Figure 2 (a) is in pathMat, then the row corresponding to that path should look like 0.5 2.0 1.5 2.5. Computation: This function extracts the unique paths in pathMat. Overlapping paths are replaced with a single path formed by merging those paths together. For example, in Figure 2 (b) there are two overlapping paths in red and blue. They should be merged into one path (purple) as in panel (c). The purple path should be added to the output matrix, while the red and blue paths are excluded. Hint: Use the getYIntAndSlope function in part (a) as part of the computation. Return: A numeric matrix with four columns. – The rows represent the set of unique paths generated in Computation. – The columns of the vector are the (x,y)-coordinates of the starting and end points of those paths, and the order of the columns are the same as in the argument pathMat. IMPORTANT: You must use loop(s) and conditional statements for part (b) of this question, otherwise a heavy penalty will be applied. You will only obtain marks for part (b) of this question if your function produces a matrix with the correct set of paths. 9