
STAT3006: Statistical Learning
Lecture Notes
Ian A. Wood
School of Mathematics and Physics
University of Queensland, St. Lucia 4072, Australia
July 26, 2021
Contents
1 Multivariate Statistics 3
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Marginal and Conditional Distributions . . . . . . . . . . . . . . . . . . . 4
1.3 Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Multivariate Normal Distribution . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Marginal and Conditional Densities of the Multivariate Normal . . . . . 10
1.6 Samples from the multivariate normal . . . . . . . . . . . . . . . . . . . . 14
1.6.1 James–Stein estimators . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.6.2 Wishart Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.7 Hotelling's T² test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.7.1 Confidence region for mean vector . . . . . . . . . . . . . . . . . . 22
1.7.2 Two sample test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2 Clustering 26
2.1 K-means clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3 Fitting a normal mixture model . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4 Estimating uncertainty about the parameter values . . . . . . . . . . . . 41
2.5 Choosing the number of components . . . . . . . . . . . . . . . . . . . . . 42
2.5.1 Cross-validation of the likelihood . . . . . . . . . . . . . . . . . . 47
2.6 Decompositions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6.1 Spectral Decomposition . . . . . . . . . . . . . . . . . . . . . . . . 49
2.7 Mahalanobis Distances and Transformation . . . . . . . . . . . . . . . . . 50
3 Discriminant Analysis 52
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Statistical Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Optimal Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Quadratic Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Fisher’s Linear Discriminant Analysis . . . . . . . . . . . . . . . . . . . . 60
3.7 Estimation of Error Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.8 Mixture Discriminant Analysis . . . . . . . . . . . . . . . . . . . . . . . . 69
3.9 Kernel Density Discriminant Analysis . . . . . . . . . . . . . . . . . . . . 70
3.10 Nearest Neighbour Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.11 Classification Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 High-dimensional analysis 85
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Principal Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . 88
4.3 Single-variable analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Variable selection and the lasso . . . . . . . . . . . . . . . . . . . . . . . . 96
Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Appendix B: Code for knn example . . . . . . . . . . . . . . . . . . . . . . . . 101
Chapter 1
Multivariate Statistics
Note: this chapter draws heavily on Morrison (2005). Härdle and Simar (2019) cover
much of the same material.
1.1 Introduction
Assume we will observe data as a result of a random experiment, i.e. we will observe n
random variables {Xi, i = 1, . . . , n}, with each Xi being p-dimensional. The values we
record for these random variables are written {xi, i = 1, . . . , n}. Each element can
be continuous or discrete, but we will assume they are all continuous in most cases.
Note that when necessary, we will use the index i = 1, . . . , n to index observations in
the sample, and the index j = 1, . . . , p to index the elements of an observation.
We typically assume that all n random variables are drawn independently from the
same distribution, i.e. that of X . The distribution of X can be described by its joint
distribution function
\[
F(x) = P(X \le x) = P(X_1 \le x_1, \ldots, X_p \le x_p).
\]
When F is absolutely continuous, one can write
\[
F(x) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p)\, du_1 \cdots du_p,
\]
where f(x) is the joint density function of the elements of X.
If the elements of X happen to be independent, then
\[
f(x) = \prod_{j=1}^{p} f(x_j) \quad \text{and} \quad F(x) = \prod_{j=1}^{p} F(x_j).
\]
Conversely, if such a factorisation is possible, it implies independence of the variables. However, we will usually assume that there is some dependency among the p elements of X.
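As a small illustrative sketch (not part of the original notes), the following R code checks this factorisation empirically for two independently generated variables; the distributions, sample size and evaluation point are arbitrary choices for illustration.

# Sketch: for independent X1 ~ N(0,1) and X2 ~ Exp(1), the joint distribution
# function should factorise as F(x1, x2) = F1(x1) * F2(x2).
set.seed(1)
n  <- 1e5
x1 <- rnorm(n)   # first component, independent of the second
x2 <- rexp(n)    # second component
a <- 0.5; b <- 1.0                        # arbitrary evaluation point (a, b)
F_joint <- mean(x1 <= a & x2 <= b)        # empirical joint distribution function
F_prod  <- mean(x1 <= a) * mean(x2 <= b)  # product of empirical marginals
c(joint = F_joint, product = F_prod, exact = pnorm(a) * pexp(b))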
It can be the case that while f(x) is straightforward to write out analytically, F(x) is not. This is true for the multivariate normal distribution in p dimensions. The distribution function is analytically intractable, but the density can be written as
\[
f(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right], \quad x \in \mathbb{R}^p,
\]
where μ is the p-dimensional mean vector and Σ is the p × p covariance matrix.
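As a rough sketch of how this density can be evaluated numerically (my own illustration; it assumes the contributed R package mvtnorm is installed, and the mean vector, covariance matrix and evaluation point are arbitrary choices):

# Sketch: evaluate the multivariate normal density at a point x, both directly
# from the formula above and via mvtnorm::dmvnorm() for comparison.
library(mvtnorm)               # assumed installed
p     <- 3
mu    <- c(0, 1, 2)            # mean vector (illustrative)
Sigma <- diag(3) + 0.5         # 3 x 3 covariance matrix: 1.5 on diagonal, 0.5 off
x     <- c(0.2, 1.1, 1.7)      # point at which to evaluate the density
quad  <- t(x - mu) %*% solve(Sigma) %*% (x - mu)   # (x - mu)' Sigma^{-1} (x - mu)
f_dir <- (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) * exp(-0.5 * quad)
c(direct = as.numeric(f_dir), dmvnorm = dmvnorm(x, mean = mu, sigma = Sigma))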
1.2 Marginal and Conditional Distributions
If we wish to determine the (marginal) joint density of a subset of the random variables,
e.g. the first q < p variables, this can be obtained by integrating out the variables which
are not in the subset. Hence the joint density is
\[
g(x_1, \ldots, x_q) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_p)\, dx_{q+1} \cdots dx_p. \tag{1.1}
\]
Correspondingly, the joint distribution function is given by
\[
G(x_1, \ldots, x_q) = P(X_1 \le x_1, \ldots, X_q \le x_q) = F(x_1, \ldots, x_q, \infty, \ldots, \infty). \tag{1.2}
\]
Sometimes we are instead interested in a conditional distribution, that is, the joint distribution of a subset of the random variables when the remainder are held fixed at certain values. E.g. assume that we want to know the joint density of X1, . . . , Xq, q < p, with the remaining variables fixed at values Xq+1 = xq+1, . . . , Xp = xp. The definition of conditional probability states that, when P(B) > 0, P(A|B) = P(A ∩ B)/P(B). When extended to densities and considered for this conditional density, we let A = {X1 = x1, . . . , Xq = xq} and B = {Xq+1 = xq+1, . . . , Xp = xp}. So
\[
h(x_1, \ldots, x_q \mid x_{q+1}, \ldots, x_p) = \frac{f(x_1, \ldots, x_q, x_{q+1}, \ldots, x_p)}{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, \ldots, x_p)\, dx_1 \cdots dx_q} = \frac{f(x_1, \ldots, x_p)}{g(x_{q+1}, \ldots, x_p)}. \tag{1.3}
\]
Note that g(xq+1, . . . , xp) is used here to mean the relevant marginal density evaluated
at these x values. It is not the same marginal density as in (1.1).
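To make (1.3) concrete, here is a small sketch (my own, again assuming the mvtnorm package) that evaluates the conditional density of X1 given X2 = x2 for a bivariate normal by dividing the joint density by the marginal density of X2, and compares the result with the known normal form of this conditional density (derived in Section 1.5).

# Sketch: conditional density h(x1 | x2) = f(x1, x2) / g(x2) for a bivariate normal.
library(mvtnorm)                              # assumed installed
mu    <- c(1, 2)                              # illustrative mean vector
Sigma <- matrix(c(1, 0.6, 0.6, 2), nrow = 2)  # illustrative covariance matrix
x2    <- 2.5                                  # conditioning value
x1    <- seq(-2, 4, by = 0.5)                 # grid of x1 values
f_joint <- dmvnorm(cbind(x1, x2), mean = mu, sigma = Sigma)   # joint density f(x1, x2)
g_marg  <- dnorm(x2, mean = mu[2], sd = sqrt(Sigma[2, 2]))    # marginal density g(x2)
h_cond  <- f_joint / g_marg                                   # conditional density via (1.3)
# Known result: X1 | X2 = x2 is normal with the following mean and variance
m_cond <- mu[1] + Sigma[1, 2] / Sigma[2, 2] * (x2 - mu[2])
v_cond <- Sigma[1, 1] - Sigma[1, 2]^2 / Sigma[2, 2]
max(abs(h_cond - dnorm(x1, mean = m_cond, sd = sqrt(v_cond))))   # should be ~ 0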
1.3 Moments
The expected value of a p-dimensional random vector X is just the vector of expectations of its elements, i.e.
\[
E(X) = [E(X_1), \ldots, E(X_p)]^T. \tag{1.4}
\]
The covariance of two elements, j and k, of the random vector is
\[
\sigma_{jk} = \mathrm{cov}(X_j, X_k) = E[(X_j - E(X_j))(X_k - E(X_k))] = E(X_j X_k) - E(X_j)E(X_k) \tag{1.5}
\]
\[
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{jk}(x_j, x_k)\, x_j x_k \, dx_j\, dx_k - \int_{-\infty}^{\infty} f_j(x_j)\, x_j\, dx_j \int_{-\infty}^{\infty} f_k(x_k)\, x_k\, dx_k, \tag{1.6}
\]
where fjk(xj, xk) is the (marginal) joint density of Xj and Xk, and fj is the (marginal)
density of Xj.
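As a quick simulation sketch (my own, base R only; the chosen distributions and sample size are arbitrary), the moment form E(XjXk) − E(Xj)E(Xk) in (1.5) can be compared with R's built-in sample covariance:

# Sketch: check sigma_jk = E(Xj Xk) - E(Xj) E(Xk) by simulation.
set.seed(42)
n  <- 1e5
xj <- rnorm(n)
xk <- 0.8 * xj + rnorm(n)                          # correlated variables; true cov(Xj, Xk) = 0.8
cov_moment <- mean(xj * xk) - mean(xj) * mean(xk)  # moment form from (1.5), denominator n
cov_sample <- cov(xj, xk)                          # built-in sample covariance, denominator n - 1
c(moment = cov_moment, sample = cov_sample, true = 0.8)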
The variance of the jth element of the random vector X can be written as σjj or σj², where σj is the standard deviation of the jth element of the random vector. The covariance matrix of X is defined to be
\[
\mathrm{var}(X) = \Sigma = \mathrm{cov}(X, X^T) = E[(X - E(X))(X - E(X))^T] =
\]