
SOST 30062: Data Science Modelling
Introduction.
Eduardo Fe
06/03/2020
Introduction
Statistical learning:
- a set of tools
- for understanding data
Introduction
Main text:
Click HERE for a link to the website.
Introduction
Premises
1. Many statistical learning (SL) methods are relevant in a wide range
of settings
2. No single SL method will perform well in all possible applications
3. The internal mechanisms of the methods we will cover are complex
(and interesting), but we will not need them to apply SL techniques
successfully
4. The focus is on real-life applications.
Introduction
Two types of tools:
- Supervised: build a statistical model for predicting or estimating an output based on one or more inputs
- Unsupervised: applies to situations with only inputs, where one wants to learn relationships and structure from such data (see the sketch below)
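A minimal sketch contrasting the two settings. This is an illustration only, using Python with scikit-learn and simulated data (neither is prescribed by the course): the supervised model learns a mapping from inputs to an observed output, while the unsupervised method looks for structure in the inputs alone.

```python
# Sketch: supervised vs unsupervised learning (assumes numpy and scikit-learn are installed).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                       # two inputs
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)     # observed output

# Supervised: learn a model for the output y given the inputs X, then predict.
reg = LinearRegression().fit(X, y)
print("Predicted outputs for the first three units:", reg.predict(X[:3]))

# Unsupervised: no output at all; look for structure (here, two clusters) in X alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster labels for the first ten units:", km.labels_[:10])
```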
Supervised learning: Overview
Motivating example
[Figure: Wage plotted against Age, against Year, and against Education level (< HS Grad, HS Grad, Some College, College Grad, Advanced Degree)]
Motivating example

In the example:
- Years of education, age and year are input variables. They are also called predictors, independent variables, features, or covariates.
- Wage is the output variable, also called the response variable or dependent variable.

Despite the variation/noise, the figures suggest there are some overall relationships between the inputs and the output. That overall relationship is what interests us.

Some notation
- Inputs are denoted X1, X2, . . . , Xp
- Outputs are denoted Y
- A unit (person, firm, village, etc.) is denoted i, and there are N of these units, so i = 1, 2, . . . , N
- Unit i's value for variable j is xij (where j = 1, 2, . . . , p)

We believe there is some relationship between Y and X = (X1, X2, . . . , Xp), which can be written in general form as

Y = f(X1, X2, . . . , Xp) + ε ≡ f(X) + ε

- f is some fixed but unknown function
- ε is a random error term
- ε is independent of X1, X2, . . . , Xp
- ε has zero mean

Why estimate f?
Prediction

Values of the inputs are known, but the output is not observed at the time. If f can be estimated, then we can predict Y given the levels of X1, . . . , Xp:

Ŷ = f̂(X1, . . . , Xp) ≡ f̂(X)

The accuracy of the prediction depends on:
- Reducible error: in general f̂ ≠ f, but this error can be reduced by using the most appropriate statistical method
- Irreducible error: recall Y = f(X) + ε, where ε cannot be predicted. No matter how well we estimate f, we cannot reduce the error introduced by ε

Why estimate f?
Prediction

More formally, the average squared error is a valid measure of the accuracy of the prediction, E(Y − Ŷ)². It can be shown that

E(Y − Ŷ)² = E[f(X) − f̂(X)]² + V(ε) = Reducible + Irreducible   (1)

Why estimate f?
Inference

How do X1, X2, . . . , Xp affect Y?
- Which predictors are associated with the response?
- Can the relationship between inputs and output be approximated using a specific model?

How is f estimated?

There are i = 1, . . . , n units, j = 1, . . . , p inputs and one output.
Let yi be unit i's value for the output.
Let xij be unit i's value for input j.
Let's put these values into a vector xi = (xi1, xi2, . . . , xip)′.
The full data set is {(x1, y1), (x2, y2), . . . , (xn, yn)}. In SL this is called the training data. We will use this training data to estimate (learn) the unknown function f.

How is f estimated?
Parametric methods

1. First assume a functional form (a shape) for f.
2. Find a procedure that uses the training data to fit (or train) the model.

Example:
1. Assume f(X) = β0 + β1·X1 + β2·X2 + . . . + βp·Xp. The model specifies everything; the only unknown bits are the parameters β0, β1, . . . , βp.
2. Find the βj; the common way of doing this is Ordinary Least Squares.

This method is parametric in the sense that the problem of finding f is reduced to estimating a small set of parameters.
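As a concrete illustration of the parametric approach, here is a minimal sketch in Python. The data are simulated (the course's own examples use the wage and motorcycle data sets), and the linear shape chosen for f is an assumption of this sketch: we posit f(X) = β0 + β1·X1 + β2·X2 and estimate the β's by Ordinary Least Squares.

```python
# Parametric estimation sketch: assume f(X) = b0 + b1*X1 + b2*X2 and fit by OLS.
# Simulated data; assumes numpy is installed.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))                      # inputs X1, X2
eps = rng.normal(scale=1.0, size=n)              # irreducible error
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + eps    # output: true f(X) plus error

# OLS: minimise the sum of squared residuals, using a design matrix with an intercept column.
design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
print("Estimated (b0, b1, b2):", beta_hat)       # should be close to (1, 2, -0.5)
```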
How is f estimated?
Parametric methods

Pros:
- They tend to rely on models that can be estimated quickly and easily
- Parametric models are easy to interpret, so they are good for inference

Cons:
- The choice of the functional form is subjective
- The model we choose will usually not match the true f, and this will lead to poor inference and prediction if the differences are too big

To address the last point, we can devise more flexible parametric models, but this will generally reduce the interpretability of the model and might lead to a problem of overfitting.

Example

The motorcycle data. We have fitted a model of the form

Acceleration = β0 + β1·Milliseconds + ε

[Figure: Acceleration plotted against Milliseconds, with the fitted straight line]

How is f estimated?
Non-parametric methods

These do not make explicit assumptions about the functional form of f. Instead, they try to estimate an f that gets as close as possible to the data points without being too wiggly or rough.

- These methods can potentially fit a wider range of shapes for f
- They avoid the risk of misspecification¹

However:
- The results produced by these methods are more difficult to interpret
- Since they do not rely on prior information (in the form of parameters), they need a lot more data (information) to work optimally
- They normally rely on tuning parameters that determine the amount of smoothing; these tuning parameters need to be chosen before estimation

¹ While in a parametric model the proposed shape might be far away from f, this is avoided in non-parametric methods, which do not impose any a priori shape on f.

Example

The motorcycle data. We have fitted a non-parametric model (a local linear regression).

[Figure: Acceleration plotted against Milliseconds, with the fitted local linear regression]

Regression vs Classification

Two types of variables:
- Quantitative: take on numerical values (age, weight, income, etc.)
- Qualitative or categorical: take values in one of, say, K different classes or categories (e.g. female vs non-female; brand A vs brand B vs brand C; primary education, secondary education, college, university degree)

When the output variable Y in a supervised problem is
- quantitative, we tend to talk about Regression problems;
- qualitative, we tend to talk about Classification problems.

Because prediction is critical in supervised learning problems, different statistical methods will tend to apply depending on whether we have quantitative or qualitative outputs.²

² A school of thought in Economics has been promoting the use of linear regression when the output is dichotomous; this approach is often justified, but only because Economists are primarily concerned with problems of inference; when prediction is at stake, this kind of approximation is troublesome.

Assessing model accuracy

No statistical model/method strictly dominates any other method for all problems. Thus it is important to decide, for a particular type of problem and a given data set, which method produces the best results. This is the most challenging aspect of SL.

Assessing model accuracy: Quality of fit

However:
- In SL, we use a training data set to find a model . . .
- . . . and we use that model to predict the outcome in a test data set (which includes inputs, but not the output of interest)

In general,
- we do NOT care about accuracy in the training data
- we care about accuracy in the test data (the sketch below makes this concrete)
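A small simulation makes the training/test distinction concrete. This is a sketch only, in Python with scikit-learn and simulated data, using the mean squared error (defined formally on the next slide): a very flexible model can look excellent on the training data while predicting poorly on fresh test data.

```python
# Sketch: training accuracy can be misleading; accuracy on fresh test data is what matters.
# Assumes numpy and scikit-learn are installed; data are simulated.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

def simulate(n):
    x = rng.uniform(0, 6, size=(n, 1))
    y = np.sin(x[:, 0]) + rng.normal(scale=0.3, size=n)   # true f is sin(x)
    return x, y

x_train, y_train = simulate(50)     # training data used to fit the models
x_test, y_test = simulate(1000)     # fresh test data, not used in fitting

for degree in (1, 3, 15):           # increasingly flexible polynomial models
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    mse_train = np.mean((y_train - model.predict(x_train)) ** 2)
    mse_test = np.mean((y_test - model.predict(x_test)) ** 2)
    print(f"degree {degree:2d}: training MSE {mse_train:.3f}, test MSE {mse_test:.3f}")
```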
Assessing model accuracy: Quality of fit

For instance, in regression settings, a popular measure is the Mean Squared Error (MSE):

MSE = (1/n) Σᵢ (yi − f̂(xi))²,   where the sum runs over i = 1, . . . , n.

This measure will be small if predictions are very close to the responses, on average.

Assessing model accuracy: Quality of fit

How can we go about selecting a method that minimises the test³ MSE?

CASE 1
- Often we may have a test data set available (e.g. a set of observations that were not used to train the model).
- In that case we can evaluate the MSE of competing models in the test data set and select the model with the smallest MSE in the test data set.

CASE 2
- Often, however, there is no test data.
- In this case we need to rely on the MSE of the models in the training data set.
- Suppose we then made a choice of model, say f̂(·), to predict f.
- For this to be useful, we need to assume that future test data will come from a similar, underlying model.

³ As opposed to the training MSE.

Assessing model accuracy: Bias-Variance trade-off

Let x0 denote a set of inputs in the test data set. It can be shown that the MSE resulting from predicting the output in the test set, using the model estimated with the training set, is

MSE = V(f̂(x0)) + [Bias(f̂(x0))]² + V(ε)

This tells us that, to minimise the expected test error, we need to select a method that reduces both the variance and the bias.

Assessing model accuracy: Bias-Variance trade-off

- V(f̂(x0)) tells us how much our predictions would change under a different training data set.
- If we want to reduce variability across data sets, we need to sacrifice some of the wiggles and detail shown in the data.
- But this means that we will incur a bigger amount of bias across samples.

Assessing model accuracy: Bias-Variance trade-off

- [Bias(f̂(x0))]² tells us the error in our predictions incurred by using a model.
- If we want to capture as much detail as possible in the training data set, we will need a very flexible model. This, however, might result in large changes in our predictions following small changes in the data (higher variance).

Assessing model accuracy: Bias-Variance trade-off

In general, as we use more flexible methods, the variance will increase and the bias will decrease. The relative rate of change of these two quantities determines whether the test MSE increases or decreases.

As we increase the flexibility of a class of methods, the bias tends to decrease faster than the variance increases; this reduces the MSE. However, at some point further flexibility does not lead to great improvements in terms of bias, whilst the variance continues to increase. At that point the MSE starts to increase.

Assessing model accuracy: Quality of fit

As a side note, since

MSE = V(f̂(x0)) + [Bias(f̂(x0))]² + V(ε)

and V(ε) is the irreducible error, we see that the MSE can never lie below this irreducible error.

Classification

The ideas just explained apply to classification problems. However, the most common way of quantifying the accuracy of classification models is to estimate the training error rate, the proportion of mistakes made if we apply the model to the training observations,

(1/n) Σᵢ I(yi ≠ ŷi)

where, now, ŷi is the predicted class for unit i. Given a test data set, we can similarly define the test error rate,

Average(I(y0 ≠ ŷ0))

A good classifier is one for which this latter error is smallest.
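The two error rates are easy to compute once a classifier has been fitted. Here is a sketch in Python with simulated data; the particular classifier (logistic regression from scikit-learn) is just one possible choice, not something the slides prescribe.

```python
# Sketch: training and test error rates for a classifier.
# Assumes numpy and scikit-learn are installed; data are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

def simulate(n):
    X = rng.normal(size=(n, 2))
    # True class depends on the sign of a linear combination of the inputs, plus noise.
    y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_train, y_train = simulate(200)
X_test, y_test = simulate(2000)

clf = LogisticRegression().fit(X_train, y_train)

# Error rate = proportion of observations whose predicted class differs from the true class.
train_error = np.mean(clf.predict(X_train) != y_train)
test_error = np.mean(clf.predict(X_test) != y_test)
print(f"training error rate: {train_error:.3f}, test error rate: {test_error:.3f}")
```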
Classification: The Bayes classifier

- Minimises the test error rate Average(I(y0 ≠ ŷ0))
- Assigns an observation to the most likely class given its predictor values, i.e. the class j that maximises P(Y = j | X = x0)

The problem is, of course, that this probability is not known and must be estimated (we will see how to do so in a future lecture).
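To illustrate the idea, here is a sketch in Python of a simulated problem in which the conditional class probabilities P(Y = j | X = x0) are known by construction, so the Bayes classifier can be written down exactly; in any real application these probabilities would have to be estimated. The two-Gaussian set-up and the scipy/numpy calls are assumptions of this sketch, not part of the slides.

```python
# Sketch: the Bayes classifier on a problem where P(Y = j | X = x) is known by construction.
# Assumes numpy and scipy are installed; data are simulated.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
n = 5000
prior = 0.5                                 # P(Y = 1)
y = rng.binomial(1, prior, size=n)          # true classes
x = np.where(y == 1, rng.normal(1.0, 1.0, n), rng.normal(-1.0, 1.0, n))  # X given Y

def bayes_classifier(x0):
    """Assign x0 to the class with the highest posterior probability P(Y = j | X = x0)."""
    p1 = prior * norm.pdf(x0, loc=1.0, scale=1.0)          # proportional to P(Y = 1 | X = x0)
    p0 = (1 - prior) * norm.pdf(x0, loc=-1.0, scale=1.0)    # proportional to P(Y = 0 | X = x0)
    return (p1 > p0).astype(int)

# The Bayes classifier attains the lowest achievable test error rate on this problem.
error_rate = np.mean(bayes_classifier(x) != y)
print(f"error rate of the Bayes classifier: {error_rate:.3f}")
```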