SOST 30062: Data Science Modelling
Introduction.
Eduardo Fe
06/03/2020
Introduction
Statistical learning:
- a set of tools
- for understanding data
Introduction
Main text:
Click HERE for a link to the website.
Introduction
Premises
1. Many statistical learning (SL) methods are relevant in a wide range of settings
2. No single SL method will perform well in all possible applications
3. The internal mechanisms of the methods we will cover are complex (and interesting), but we will not need them to apply SL techniques successfully
4. The focus is on real-life applications.
Introduction
Two types of tools:
- Supervised: build a statistical model for predicting or estimating an output based on one or more inputs
- Unsupervised: applied to situations where there are only inputs and one wants to learn relationships and structure from such data
Supervised learning: Overview
Motivating example
[Figure: Wage data. Three panels plot Wage against Age, Year, and Education level (< HS Grad, HS Grad, Some College, College Grad, Advanced Degree).]
Motivating example
In the example:
- Years of education, age and year are the input variables; these are also called predictors, independent variables, features or covariates
- Wage is the output variable, also called the response variable or dependent variable
Despite the variation/noise, the figures suggest there are some overall
relationships between the inputs and the output.
That overall relationship is what interests us.
Some notation
- Inputs are denoted X1, X2, . . . , Xp
- Outputs are denoted Y
- A unit (person, firm, village, etc.) is denoted i, and there are N of these units, so i = 1, 2, . . . , N
- Unit i's value for variable j is xij (where j = 1, 2, . . . , p).
We believe there is some relationship between Y and X = (X1, X2, . . . , Xp), which can be written, in general form, as
Y = f(X1, X2, ..., Xp) + ε ≡ f(X) + ε
- f is some fixed but unknown function
- ε is a random error term
- ε is independent of X1, X2, . . . , Xp
- ε has 0 mean
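To make this setup concrete, here is a minimal simulation sketch in Python (the particular f, the number of inputs and the noise level are illustrative choices, not part of the course material) of data generated according to Y = f(X) + ε:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

n, p = 200, 2                        # number of units and inputs (illustrative choices)
X = rng.uniform(0, 10, size=(n, p))  # inputs X1, X2

def f(X):
    # A fixed but, in practice, unknown function of the inputs (chosen arbitrarily here).
    return 5 + 2 * X[:, 0] - 0.1 * X[:, 1] ** 2

eps = rng.normal(loc=0, scale=3, size=n)  # error term: mean 0, independent of X
Y = f(X) + eps                            # observed output
```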
Why estimate f
Prediction
Values of the inputs are known, but the output is not observed at the time.
If f can be estimated, then we can predict Y given the levels of X1, . . . , Xp:
Ŷ = f̂(X1, . . . , Xp) ≡ f̂(X)
The accuracy of the prediction depends on
- Reducible error: in general f̂ ≠ f, but this error can be reduced by using the most appropriate statistical method
- Irreducible error: recall that Y = f(X) + ε, where ε cannot be predicted. No matter how well we estimate f, we cannot reduce the error introduced by ε
Why estimate f
Prediction
More formally, the average squared error, E(Y − Ŷ)², is a valid measure of the accuracy of the prediction.
It can be shown that
E(Y − Ŷ)² = E[f(X) − f̂(X)]² + V(ε) = Reducible + Irreducible   (1)
Why estimate f
Inference
How do X1, X2, . . . , Xp affect Y?
- Which predictors are associated with the response?
- Can the relationship between inputs and output be approximated using a specific model?
How is f estimated
There are i = 1, . . . , n units, j = 1, . . . , p inputs and one output.
Let yi be unit i's value for the output.
Let xij be unit i's value for input j.
Let's put these values into a vector xi = (xi1, xi2, . . . , xip)′.
The full data set is the set {(x1, y1), (x2, y2), . . . , (xn, yn)}. In SL this is called the training data.
We will use this training data to estimate (learn) the unknown
function f .
How is f estimated
Parametric methods
1. First, assume a functional form (a shape) for f
2. Find a procedure that uses the training data to fit (or train) the model
Example:
1. Assume
f(X) = β0 + β1 · X1 + β2 · X2 + . . . + βp · Xp
The model specifies everything; the only unknowns are the parameters βj, j = 1, . . . , p.
2. Find the βj, j = 1, . . . , p; the most common way of doing this is Ordinary Least Squares.
This method is parametric in the sense that the problem of finding f is reduced to estimating a small set of parameters.
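As an illustrative sketch (not from the slides), the linear model above can be fitted by Ordinary Least Squares in a few lines of Python; the simulated data below stand in for a real training set:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.uniform(0, 10, size=(200, 2))                                  # simulated inputs
Y = 5 + 2 * X[:, 0] - 0.1 * X[:, 1] ** 2 + rng.normal(0, 3, size=200)  # simulated output

# Design matrix with a leading column of ones for the intercept beta_0.
design = np.column_stack([np.ones(len(Y)), X])

# Ordinary Least Squares: the betas minimising the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
print(beta_hat)  # estimated beta_0, beta_1, beta_2

def f_hat(X_new):
    # The fitted (trained) parametric model.
    return np.column_stack([np.ones(len(X_new)), X_new]) @ beta_hat
```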
How is f estimated
Parametric methods
Pros
- Tend to rely on models that can be estimated quickly and easily
- Parametric models are easy to interpret, so they are good for inference
Cons
- The choice of the functional form is subjective
- The model we choose will usually not match the true f, and this will lead to poor inference and prediction if the differences are too big
To address the last point, we can devise flexible parametric models, but this will generally reduce the interpretability of the model and might lead to a problem of overfitting.
Example
The motorcycle data. We have fitted a model of the form
Acceleration = β0 + β1 · Milliseconds + ε
[Figure: Motorcycle data. Acceleration plotted against Milliseconds, with the fitted linear model overlaid.]
How is f estimated
Non-Parametric methods
Do not make explicit assumptions about the functional form of f
Instead, they try to estimate an f that gets as close as possible to the
data points without being too wiggly or rough.
- These methods can potentially fit a wider range of shapes for f
- They avoid the risk of misspecification1
However,
- The results produced by these methods are more difficult to interpret
- Since they do not rely on prior information (in the form of parameters), they need a lot more data (information) to work optimally
- They normally rely on tuning parameters that determine the amount of smoothing; these tuning parameters need to be chosen before estimation.
1 While in a parametric model the proposed shape might be far away from f, this is avoided in nonparametric methods, which do not impose any a priori shape on f.
Example
The motorcycle data. We have fitted a nonparametric model (a local linear regression).
[Figure: Motorcycle data. Acceleration plotted against Milliseconds, with the fitted local linear regression curve.]
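A sketch of a closely related nonparametric fit: statsmodels' LOWESS smoother performs locally weighted (local linear) regression, with the fraction of data used at each point acting as the tuning parameter. The file name and column names for the motorcycle data are assumptions (e.g. R's MASS::mcycle exported to CSV):

```python
import pandas as pd
from statsmodels.nonparametric.smoothers_lowess import lowess

# Hypothetical file: the motorcycle data (columns `times`, `accel`), e.g. MASS::mcycle from R.
mcycle = pd.read_csv("mcycle.csv")

# LOWESS: locally weighted (local linear) regression. `frac` is the tuning parameter
# controlling the amount of smoothing, and must be chosen before estimation.
fitted = lowess(mcycle["accel"], mcycle["times"], frac=0.3)

# `fitted` is an (n, 2) array: column 0 holds the sorted milliseconds,
# column 1 the estimated value of f at each of those points.
print(fitted[:5])
```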
Regression vs Classification.
Two types of variables:
- Quantitative: take on numerical values (age, weight, income, etc.)
- Qualitative or categorical: take values in one of, say, K different classes or categories (e.g. female vs non-female; brand A vs brand B vs brand C; primary education, secondary education, college, university degree)
When the output variable Y in a supervised problem is
- quantitative, we tend to talk about Regression problems;
- qualitative, we tend to talk about Classification problems.
Because in supervised learning problems prediction is critical, different
statistical methods will tend to apply depending on whether we have
quantitative or qualitative outputs2
2 A school of thought in Economics has promoted the use of linear regression when the output is dichotomous; this approach is often justified, but only because economists are primarily concerned with problems of inference; when prediction is at stake, this kind of approximation is troublesome.
Assessing model accuracy.
No statistical model/method strictly dominates any other method for
all problems.
Thus it is important to decide, for a particular type of problem and a given dataset, which method produces the best results.
This is the most challenging aspect of SL.
Assessing model accuracy: Quality of fit.
However:
- In SL, we use a training data set to find a model. . .
- . . . and we use that model to predict the outcome in a test data set (which includes inputs, but not the output of interest)
In general,
- we do NOT care about accuracy in the training data
- we care about accuracy in the test data
Assessing model accuracy: Quality of fit.
For instance, in regression settings, a popular measure is the Mean
Squared Error (MSE):
MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − f̂(xᵢ))²
This measure will be small if predictions are very close to responses, on
average.
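A direct translation of this measure into Python (the function name is mine):

```python
import numpy as np

def mse(y, y_hat):
    # Mean squared error between observed responses and predictions.
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return np.mean((y - y_hat) ** 2)

# Example usage with the fitted linear model from the earlier sketch:
# print(mse(Y, f_hat(X)))   # training MSE
```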
Assessing model accuracy: Quality of fit.
How can we go about trying to select a method that minimises the test3 MSE?
CASE 1
- Often we may have a test data set available (e.g. a set of observations that were not used to train the model).
- In that case we can evaluate the MSE of competing models in the test data set and select the model with the smallest MSE in the test data set.
CASE 2
- Often, however, there is no test data.
- In this case we need to rely on the MSE of the models in the training data set.
- Suppose we then choose a model, say f̂(·), to predict f.
- For this to be useful, we need to assume that future test data will come from a similar, underlying model.
3As opposed to the training MSE.
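For CASE 1, a minimal sketch of comparing two competing models on a held-out test set, using scikit-learn and the simulated data from the earlier sketches (both the models and the data are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(seed=1)
X = rng.uniform(0, 10, size=(200, 2))
Y = 5 + 2 * X[:, 0] - 0.1 * X[:, 1] ** 2 + rng.normal(0, 3, size=200)

# Hold out 30% of the observations as a test set; train only on the remainder.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

for name, model in [("linear regression", LinearRegression()),
                    ("5-nearest neighbours", KNeighborsRegressor(n_neighbors=5))]:
    model.fit(X_train, Y_train)
    test_mse = np.mean((Y_test - model.predict(X_test)) ** 2)
    print(name, round(test_mse, 2))  # prefer the model with the smaller test MSE
```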
Assessing model accuracy: Bias-Variance trade-off
Let x₀ denote a set of inputs in the test data set. It can be shown that the MSE resulting from predicting the output in the test set, using the model estimated with the training set, is
MSE = V(f̂(x₀)) + [Bias(f̂(x₀))]² + V(ε)
This tells us that, to minimise the expected test error, we need to select
a method that reduces both the variance and the bias.
Assessing model accuracy: Bias-Variance trade-off
- V(f̂(x₀)) tells us how our predictions would change under a different training data set.
- If we want to reduce variability across data sets, we need to sacrifice some of the wiggles and detail shown in the data (as the blue model in the previous two figures).
- But this means that we will incur a bigger amount of bias across samples.
Assessing model accuracy: Bias-Variance trade-off
- [Bias(f̂(x₀))]² tells us the error in our predictions incurred by using a model.
- If we want to capture as much detail as possible in the training data set, we will need a very flexible model. This, however, means that small changes in the training data might result in large changes in our predictions (higher variance).
Assessing model accuracy: Bias-Variance trade-off
In general, as we use more flexible methods, the variance will increase and the bias will decrease.
The relative rate of change of these two quantities determines whether
the test MSE increases or decreases
As we increase the flexibility of a class of methods, the bias tends to
decrease faster than the variance increases; this reduces the MSE.
However at some point, further flexibility does not lead to great
improvements in terms of bias, whilst the variance continues to
increase.
At that point the MSE starts to increase.
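This trade-off can be seen in a small simulation (an illustrative sketch, not from the slides): polynomial models of increasing flexibility are fitted to many training sets drawn from the same process, and the squared bias and variance of the prediction at a fixed test point x₀ are estimated:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
f = np.sin                       # the true (in practice unknown) f, chosen for illustration
sigma, n, x0 = 0.3, 50, 2.0      # noise sd, training-set size, fixed test point

for degree in (1, 3, 9):                        # increasing flexibility
    preds = []
    for _ in range(500):                        # many training sets from the same process
        x = rng.uniform(0, 5, n)
        y = f(x) + rng.normal(0, sigma, n)
        coefs = np.polyfit(x, y, degree)        # fit a polynomial of the given degree
        preds.append(np.polyval(coefs, x0))     # its prediction f_hat(x0)
    preds = np.array(preds)
    bias_sq = (preds.mean() - f(x0)) ** 2       # squared bias at x0
    variance = preds.var()                      # variance of f_hat(x0) across training sets
    print(degree, round(bias_sq, 5), round(variance, 5))
```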
Assessing model accuracy: Quality of fit.
As a side note,
MSE = V(f̂(x₀)) + [Bias(f̂(x₀))]² + V(ε)
Since V (ε) is the irreducible error, we see that MSE can never lie below
this irreducible error.
Classification
The ideas just explained apply to classification problems.
However, the most common way of quantifying the accuracy of classification models is to estimate the training error rate, the proportion of mistakes made if we apply the model to the training observations,
(1/n) Σᵢ₌₁ⁿ I(yᵢ ≠ ŷᵢ)
where, now, ŷᵢ is the predicted class for unit i.
Given a test data set, we can similarly define the test error rate,
Average(I(y₀ ≠ ŷ₀))
A good classifier is one for which this latter error is smallest.
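As a sketch, both error rates reduce to one line of Python; the classifier clf in the commented usage is hypothetical:

```python
import numpy as np

def error_rate(y, y_hat):
    # Proportion of observations whose predicted class differs from the observed class.
    return np.mean(np.asarray(y) != np.asarray(y_hat))

# Hypothetical usage, assuming a classifier `clf` fitted on (X_train, y_train):
# training_error = error_rate(y_train, clf.predict(X_train))
# test_error     = error_rate(y_test,  clf.predict(X_test))
```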
Classification
The Bayes classifier
- Minimises Average(I(y₀ ≠ ŷ₀))
- Assigns an observation to the most likely class given its predictor values, i.e. the class j for which
P(Y = j | X = x₀)
is largest.
The problem is, of course, that this probability is not known and must be estimated (we will see how to do so in a future lecture).
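To illustrate the idea, here is a sketch in which the class-conditional distributions are simulated and therefore known, so that the Bayes classifier can be written down exactly (all distributional choices are assumptions made for the example):

```python
import numpy as np
from scipy.stats import norm

# Simulated two-class problem with known class-conditional densities, so the
# posterior probability P(Y = 1 | X = x0) -- and hence the Bayes classifier -- is exact.
p1 = 0.4                                               # P(Y = 1); P(Y = 0) = 0.6
density0 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)   # density of X given Y = 0
density1 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)   # density of X given Y = 1

def bayes_classifier(x0):
    # Assign x0 to the class with the largest posterior probability.
    post1 = p1 * density1(x0) / (p1 * density1(x0) + (1 - p1) * density0(x0))
    return int(post1 > 0.5)

print(bayes_classifier(0.5), bayes_classifier(1.5))    # prints: 0 1
```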