STA302H1

STA302H1: Methods of Data Analysis I (Lecture 4)
Mohammad Kaviul Anam Khan
Assistant Professor, Department of Statistical Sciences, University of Toronto

ANOVA

- ANOVA is another way of testing the significance of the regression line.
- It focuses on variance decomposition.
- The total variation of Y is captured by the total sum of squares (SST), $SST = \sum_{i=1}^{n}(y_i - \bar{y})^2$, which is basically the numerator of $S^2_y$.
- The target is to explain some of the variability in SST by the regression line.

The SST can be decomposed as follows:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$$

The third term in the equation becomes
$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n}\big(\hat{y}_i(y_i - \hat{y}_i) - \bar{y}(y_i - \hat{y}_i)\big) = \sum_{i=1}^{n}\hat{y}_i e_i - \bar{y}\sum_{i=1}^{n} e_i = 0$$

We know that $\sum_{i=1}^{n} e_i = 0$. Using the second normal equation (Week 1), we can show that $\sum_{i=1}^{n} x_i e_i = 0$, which implies $\sum_{i=1}^{n} \hat{y}_i e_i = 0$.

Thus, the SST can be divided into two parts:
$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$$

- The first term on the right-hand side, $\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$, is the residual sum of squares, RSS $= (n-2)S^2$.
- The second term measures the variation in the fitted values $\hat{y}_i$ around $\bar{y}$. One can easily show that $\sum_{i=1}^{n}\hat{y}_i / n = \bar{y}$.
- The second term on the right-hand side is called the regression sum of squares (SSreg).
- The total variation in Y has therefore been decomposed into two parts: one part explained by the regression line, and the other attributable to random errors.

What are the degrees of freedom?
- For SST the degrees of freedom is n - 1, since there is one constraint: all the data were used to calculate $\bar{y}$.
- SSreg is determined completely by one parameter estimate, $\hat{\beta}_1$. Thus its degrees of freedom is 1.
- For RSS there are two constraints, since both $\hat{\beta}_0$ and $\hat{\beta}_1$ need to be calculated. Thus its degrees of freedom is n - 2.

We will later show (during the lectures on multiple linear regression) that
$$\frac{SSreg}{\sigma^2} \sim \chi^2_{1} \quad (1) \qquad\qquad \frac{RSS}{\sigma^2} \sim \chi^2_{n-2} \quad (2)$$

Using (1) and (2), we can form
$$F_0 = \frac{SSreg / 1}{RSS / (n-2)}$$

- Under $H_0: \beta_1 = 0$, $F_0$ follows the $F_{1, n-2}$ distribution.
- Ideally we want SSreg to be as close to SST as possible; in real-life data that is not always the case.
- The F test detects how close SSreg is to SST: the closer it is, the bigger the value of $F_0$.
- One can also show that $t^2_{n-2} = F_{1, n-2}$.
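To make the decomposition concrete, here is a minimal R sketch on simulated data (the simulated variables are illustrative and are not the production data from the slides). It checks that SST = RSS + SSreg, computes $F_0$ by hand, and compares it with R's anova() output and with the squared t statistic.

```r
# Minimal sketch of the ANOVA decomposition for simple linear regression,
# using simulated data (illustrative only, not the production data).
set.seed(302)
n <- 50
x <- runif(n, 0, 10)
y <- 2 + 1.5 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)

SST   <- sum((y - mean(y))^2)               # total sum of squares
RSS   <- sum(resid(fit)^2)                  # residual sum of squares
SSreg <- sum((fitted(fit) - mean(y))^2)     # regression sum of squares

all.equal(SST, RSS + SSreg)                 # the decomposition SST = RSS + SSreg

F0   <- (SSreg / 1) / (RSS / (n - 2))       # F statistic with 1 and n - 2 df
pval <- pf(F0, df1 = 1, df2 = n - 2, lower.tail = FALSE)

anova(fit)                                  # R's ANOVA table reports the same F value
summary(fit)$coefficients["x", "t value"]^2 # t^2 equals F for the slope
```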
Under $H_0$, the F test makes the following assumptions about the errors:
- they are independent of each other,
- they have constant variance and mean 0, and
- they are Normally distributed.

The F test conducted through this variance decomposition is called the ANalysis Of VAriance (ANOVA) test.

ANOVA Table

The ANOVA test is traditionally presented with an ANOVA table. For example, for the production data the ANOVA table looks as follows:

| Source of variation | Sum of squares | DF    | Mean squares         | F value           |
|---------------------|----------------|-------|----------------------|-------------------|
| Regression          | SSreg          | 1     | MSreg = SSreg / 1    | F0 = MSreg / MRSS |
| Residuals           | RSS            | n - 2 | MRSS = RSS / (n - 2) |                   |
| Total               | SST            | n - 1 |                      |                   |

The p value is calculated as $P(F_{1, n-2} > F_0)$. Equivalently, we reject $H_0$ if $F_0 > F_{1-\alpha,\, 1,\, n-2}$ at the $\alpha$ level of significance.

The Coefficient of Determination

- Another often-used measure for assessing whether the regression line explains enough of the variability in the response is the coefficient of determination, $R^2$.
- This summary gives the proportion of the total sample variability in the response that has been explained by the regression model, so it is naturally based on the sums of squares. It can be calculated in two ways:
  $$R^2 = \frac{SSreg}{SST} \qquad \text{or} \qquad R^2 = 1 - \frac{RSS}{SST}$$
- The range is $0 \le R^2 \le 1$.
- $R^2 \approx 1$ implies that the model is a good fit and X is an important predictor of Y.
- $R^2 \approx 0$ implies that the model is not a good fit and X is not an important predictor of Y.
- It provides an idea of how much variation the regression line explains, but since it is not a formal test, we cannot say how much is enough.

Categorical Predictors

- So far we have considered the predictor X to be continuous.
- However, the predictor X can often be categorical.
- For example, let the outcome be Y = blood pressure and the predictor X be smoking status.
- Here the predictor is binary (whether or not the person smokes).
- How do we deal with these types of predictors?

Dummy Variables

- Recall the indicator variable defined in the first lecture.
- From a categorical predictor we can create indicator variables; these are called dummy variables.
- Let's consider the simplest form of a categorical predictor, one with only two values.
- A common use of dummy variable regression is comparing a response variable between two different groups.

We will be working with a new dataset on the time it takes a food processing center to change from one type of packaging to another. The data in the file 'changeover times.txt' contain:
- The response: change-over time (in minutes) between two types of packaging.
- The categorical predictor: indicating whether the new proposed method of changing over the packaging type was used, or the old method.
We have 48 change-over times under the new method and 72 change-over times under the old method.

Food Processing and Packaging Dataset

The comparison can be performed with a two-sample t test. However, we may also want to model the relationship between Y and X directly. We can use a simple linear regression
$$E(Y \mid X = x) = \beta_0 + \beta_1 x$$
where Y is the change-over time, x = 1 when the new change-over method is used, and x = 0 when the existing method is used.
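As a hedged illustration of this dummy-variable setup, the sketch below fits the regression on simulated change-over times (only the group sizes match the slides; the values themselves are made up, not the actual 'changeover times.txt' data) and shows that the slope equals the difference in group means.

```r
# Minimal sketch of dummy-variable regression with simulated change-over times
# (48 new-method and 72 existing-method observations, as in the slides; values illustrative).
set.seed(302)
time_new <- rnorm(48, mean = 14.5, sd = 3)   # new method
time_old <- rnorm(72, mean = 17.8, sd = 3)   # existing method

changeover <- data.frame(
  time   = c(time_new, time_old),
  method = c(rep(1, 48), rep(0, 72))         # dummy: 1 = new method, 0 = existing method
)

# Two-sample t test (equal variances) and the equivalent simple linear regression
t.test(time ~ method, data = changeover, var.equal = TRUE)
fit <- lm(time ~ method, data = changeover)
summary(fit)

# The slope is the difference in group means (new minus existing)
coef(fit)["method"]
with(changeover, mean(time[method == 1]) - mean(time[method == 0]))
```

The regression route gives the same comparison as the t test, but it generalizes directly once more predictors are added.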
Based on the test, we see that there is a significant decrease in change-over time from the existing method to the new method.

What is the interpretation of the slope $\beta_1$?
- Since X is no longer continuous, we cannot interpret the slope as the effect of a per-unit change in X.
- Instead, the slope reflects an average reduction in change-over time of 3.2 minutes when switching from the existing method to the new method.
- Because we only have two levels of the variable, it is not the average change in response for a unit change in X, but rather the average difference in response between the two methods.
- The slope provides the magnitude of the difference, while the hypothesis test tells us whether the difference is statistically significant.

Least Squares for Multiple Linear Regression

Again, to obtain the least squares estimates we need to minimize the residual sum of squares $RSS(\beta_0, \ldots, \beta_p) = \sum_{i=1}^{n} e_i^2$, i.e.,
$$RSS(\beta_0, \ldots, \beta_p) = \sum_{i=1}^{n}\Big(y_i - \sum_{j=0}^{p}\beta_j x_{ij}\Big)^2$$
We minimize the RSS with respect to the regression parameters. That is,
$$\frac{\partial RSS(\beta_0, \ldots, \beta_p)}{\partial \beta_0} = -2\sum_{i=1}^{n}\Big(y_i - \sum_{j=0}^{p}\beta_j x_{ij}\Big)$$
$$\frac{\partial RSS(\beta_0, \ldots, \beta_p)}{\partial \beta_j} = -2\sum_{i=1}^{n}\Big(y_i - \sum_{j=0}^{p}\beta_j x_{ij}\Big)x_{ij}$$
There are p + 1 normal equations and p + 1 unknowns. Solving these equations by hand can be tedious. (What should we do?)

Matrix Algebra

- To solve these p + 1 equations we need to apply matrix algebra.
- In the next part of this lecture we are going to focus on how to use matrix (linear) algebra to obtain least squares estimates for multiple linear regression.
- For this, we are going to write the regression in matrix form,
  $$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
  where
  1. $\mathbf{Y}$ is an $n \times 1$ vector,
  2. $\mathbf{X}$ is an $n \times (p+1)$ matrix whose first column is a vector of 1's,
  3. $\boldsymbol{\beta}$ is a $(p+1) \times 1$ vector, and
  4. $\boldsymbol{\varepsilon}$ is an $n \times 1$ vector.
- Are you comfortable with the notation? Let's review some very basic linear/matrix algebra.

Definitions

1. Matrix: An $n \times p$ matrix $\mathbf{A}$ is a rectangular array of elements in n rows and p columns,
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1p} \\ a_{21} & a_{22} & \ldots & a_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{np} \end{pmatrix}$$
When n = p, the matrix is a square matrix.

2. Vector: A matrix with only one row (row vector) or one column (column vector). For example, $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)$ is a row vector of dimension $1 \times n$, and
$$\mathbf{Y} = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$$
is a column vector of dimension $n \times 1$.

3. Transpose of a matrix: Let $\mathbf{A}'$ be the transpose of the matrix $\mathbf{A}$ defined above; then $\mathbf{A}'$ is the $p \times n$ matrix
$$\mathbf{A}' = \begin{pmatrix} a_{11} & a_{21} & \ldots & a_{n1} \\ a_{12} & a_{22} & \ldots & a_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1p} & a_{2p} & \ldots & a_{np} \end{pmatrix}$$
Here we see that the rows of $\mathbf{A}$ are the columns of $\mathbf{A}'$ and vice versa.

4. Symmetric matrix: If $\mathbf{A}$ is a square matrix and $\mathbf{A} = \mathbf{A}'$, then $\mathbf{A}$ is a symmetric matrix.

5. Diagonal matrix: A square matrix whose elements are all zero except those on the main diagonal (top left to bottom right). For example,
$$\mathbf{D} = \begin{pmatrix} d_{11} & 0 & \ldots & 0 \\ 0 & d_{22} & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & d_{nn} \end{pmatrix}$$
The diagonal elements may differ from one another.

6. Identity matrix: A diagonal matrix whose diagonal elements are all equal to 1, denoted by $\mathbf{I}$. For example,
$$\mathbf{I} = \begin{pmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix}$$
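These definitions map directly onto R's matrix facilities. A minimal sketch with illustrative numbers:

```r
# Minimal sketch of the basic definitions in R (the numbers are illustrative).
A <- matrix(1:6, nrow = 3, ncol = 2)   # a 3 x 2 matrix, filled column by column
t(A)                                   # its 2 x 3 transpose

S <- matrix(c(2, 1, 1, 3), nrow = 2)   # a symmetric matrix: S equals t(S)
identical(S, t(S))

D  <- diag(c(4, 9, 16))                # a diagonal matrix
I3 <- diag(3)                          # the 3 x 3 identity matrix
```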
7. Inverse of a matrix: If $\mathbf{A}$ is a square matrix, let $\mathbf{B}$ be another matrix such that $\mathbf{A}\mathbf{B} = \mathbf{I}$. Then $\mathbf{B}$ is the inverse of $\mathbf{A}$ and $\mathbf{B} = \mathbf{A}^{-1}$.
8. Orthogonal vectors: Two vectors $\mathbf{u}$ and $\mathbf{v}$ are orthogonal if their dot product $\mathbf{u} \cdot \mathbf{v} = 0$.
9. Orthogonal matrix: If $\mathbf{A}$ is a square matrix and $\mathbf{A}'\mathbf{A} = \mathbf{I}$, then $\mathbf{A}$ is an orthogonal matrix. That is, $\mathbf{A}^{-1} = \mathbf{A}'$.
10. Idempotent matrix: Let $\mathbf{A}$ be a square matrix. If $\mathbf{A}\mathbf{A} = \mathbf{A}$, then $\mathbf{A}$ is called an idempotent matrix.
11. $\mathbf{1}_n$ vector: A vector is called a $\mathbf{1}_n$ vector if all of its n elements are 1.
12. $\mathbf{J}_n$ matrix: An $n \times n$ square matrix with all elements equal to 1. Basically, $\mathbf{J}_n = \mathbf{1}_n\mathbf{1}_n'$.
13. Rank of a matrix: The rank of a matrix is the number of linearly independent columns, or equivalently the number of linearly independent rows. If all the columns (or rows) are linearly independent, the matrix has full rank.

Matrix Operations

Addition and subtraction: matrix addition/subtraction is element-wise and is only valid when the matrices have the same order (dimensions). For example, let
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix} \quad \text{and} \quad \mathbf{B} = \begin{pmatrix} 3 & 4 \\ 1 & 5 \end{pmatrix}. \quad \text{Then} \quad \mathbf{A} + \mathbf{B} = \begin{pmatrix} 4 & 6 \\ 4 & 9 \end{pmatrix}$$
Addition is commutative, i.e. $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$.

Multiplication: multiply each row of the first matrix with each column of the second matrix, performing element-wise multiplication and then summing the resulting products within each row-column combination. This is only valid if the number of columns of the first matrix equals the number of rows of the second matrix. With $\mathbf{A}$ and $\mathbf{B}$ as above,
$$\mathbf{A}\mathbf{B} = \begin{pmatrix} 5 & 14 \\ 13 & 32 \end{pmatrix}$$
Multiplication is not always commutative, i.e. in general $\mathbf{A}\mathbf{B} \ne \mathbf{B}\mathbf{A}$.

Further properties:
- The transpose of a sum equals the sum of the transposed matrices, i.e. $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$.
- The transpose of a product equals the product of the transposed matrices in reverse order, i.e. $(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$.
- Scalar multiplication multiplies each element by the scalar quantity.

Determinant of a square matrix: let
$$\mathbf{A} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}. \quad \text{Then the determinant of } \mathbf{A} \text{ is } |\mathbf{A}| = a_{11}a_{22} - a_{21}a_{12}.$$
This gets a little complicated for a general $n \times n$ matrix, but R calculates it automatically. If the determinant is zero, the matrix is singular (not of full rank); otherwise it is non-singular. Some properties of determinants:
1. $|\mathbf{I}| = 1$ when $\mathbf{I}$ is an identity matrix. For any diagonal matrix the determinant is just the product of the diagonal elements.
2. $|\mathbf{A}| = |\mathbf{A}'|$ and $|\mathbf{A}^{-1}| = |\mathbf{A}|^{-1}$.
3. $|c\mathbf{A}| = c^n|\mathbf{A}|$.
4. For square matrices $\mathbf{A}$ and $\mathbf{B}$, $|\mathbf{A}\mathbf{B}| = |\mathbf{A}||\mathbf{B}|$.
5. The inverse of a $2 \times 2$ matrix is calculated as
$$\mathbf{A}^{-1} = \frac{1}{|\mathbf{A}|}\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}$$
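The operations above can be checked in R using the same 2 x 2 matrices as on the slides; a minimal sketch:

```r
# Minimal sketch of the matrix operations, with the 2 x 2 examples from the slides.
A <- matrix(c(1, 3, 2, 4), nrow = 2)   # rows: (1, 2) and (3, 4)
B <- matrix(c(3, 1, 4, 5), nrow = 2)   # rows: (3, 4) and (1, 5)

A + B            # element-wise addition
A %*% B          # matrix multiplication: rows of A times columns of B
t(A %*% B)       # equals t(B) %*% t(A)

det(A)           # determinant: 1*4 - 3*2 = -2, so A is non-singular
solve(A)         # inverse of A; solve(A) %*% A returns the identity matrix
```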
For any two column vectors $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)'$, the dot product is $\mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i$.
- If the dot product of two column vectors $\mathbf{x}$ and $\mathbf{y}$ is 0, then $\mathbf{x}$ and $\mathbf{y}$ are orthogonal; that is, $\mathbf{x} \perp \mathbf{y}$.
- If $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ is a column vector, then $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}$ is its $L_2$ or Euclidean norm.
- A vector $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$ is called a unit vector if $\sqrt{\sum_{i=1}^{n} x_i^2} = 1$.
- A projection matrix $\mathbf{P}$ is a square matrix of order p that is both symmetric ($\mathbf{P}' = \mathbf{P}$) and idempotent ($\mathbf{P}^2 = \mathbf{P}$). The linear transformation $\mathbf{y} = \mathbf{P}\mathbf{x}$ means that $\mathbf{y}$ is the projection of $\mathbf{x}$ onto the subspace defined by the columns of $\mathbf{P}$.

Let $\mathbf{A}$ be a $p \times p$ matrix. The trace of the matrix is defined by
$$tr(\mathbf{A}) = \sum_{i=1}^{p} a_{ii}$$
It is a linear mapping: for all square matrices $\mathbf{A}$ and $\mathbf{B}$, and all scalars c,
$$tr(\mathbf{A} + \mathbf{B}) = tr(\mathbf{A}) + tr(\mathbf{B}), \qquad tr(c\mathbf{A}) = c\,tr(\mathbf{A})$$
The trace has some nice properties:
1. $tr(\mathbf{A}\mathbf{B}) = tr(\mathbf{B}\mathbf{A})$
2. $tr(\mathbf{A}\mathbf{B}\mathbf{C}) = tr(\mathbf{C}\mathbf{A}\mathbf{B})$
An important property of an idempotent matrix is that its rank equals its trace.

A few things to remember:
- Addition and subtraction only work for matrices of the same order.
- An inverse can be constructed only for a square, non-singular matrix.
- For multiplication of matrices, the number of columns of the first matrix has to equal the number of rows of the second matrix.
- For large matrices, hand calculation can be very tedious. Thus, we are going to use R for these calculations. Let's do some R demonstration.

Expectations and Variances of Vectors

Let $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)'$ be a random vector. Then
$$E(\mathbf{Y}) = (E(Y_1), E(Y_2), \ldots, E(Y_n))'$$
The variance of a random vector is given by a covariance matrix, which has the variances on the diagonal and the covariances off the diagonal:
$$Var(\mathbf{Y}) = \begin{pmatrix} Var(Y_1) & Cov(Y_1, Y_2) & \ldots & Cov(Y_1, Y_n) \\ Cov(Y_1, Y_2) & Var(Y_2) & \ldots & Cov(Y_2, Y_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(Y_1, Y_n) & Cov(Y_2, Y_n) & \ldots & Var(Y_n) \end{pmatrix}$$
Basically, the matrix is $E\{(\mathbf{Y} - E(\mathbf{Y}))(\mathbf{Y} - E(\mathbf{Y}))'\}$. It is very easy to see that the covariance matrix is by definition symmetric. If $\mathbf{b}$ is a vector of constants and $\mathbf{Y}$ is a random vector, then $Var(\mathbf{b}'\mathbf{Y}) = \mathbf{b}'\,Var(\mathbf{Y})\,\mathbf{b}$.

LS estimation for MLR

The residual sum of squares is given by
$$RSS(\boldsymbol{\beta}) = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})$$
From the orders of the vectors/matrices we can see that even for multiple linear regression the RSS is a scalar. Expanding the RSS we get
$$RSS = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{Y}'\mathbf{Y} - \mathbf{Y}'\mathbf{X}\boldsymbol{\beta} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{Y}'\mathbf{Y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$
We can write $\mathbf{Y}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}'\mathbf{X}'\mathbf{Y}$, since each product is a scalar.

Differentiating the RSS with respect to $\boldsymbol{\beta}$ and setting the derivative to zero, we get
$$\frac{\partial RSS}{\partial \boldsymbol{\beta}} = \frac{\partial}{\partial \boldsymbol{\beta}}\left(\mathbf{Y}'\mathbf{Y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{Y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}\right) = 0$$
$$-2\mathbf{X}'\mathbf{Y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = 0$$
$$\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{Y}$$
$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
which is the least squares estimate of $\boldsymbol{\beta}$.

Matrix Algebra with SLR

For simple linear regression we can write
$$\mathbf{X} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \quad \text{and} \quad \mathbf{Y} = (y_1, y_2, \ldots, y_n)'$$
Then
$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} = n\begin{pmatrix} 1 & \bar{x} \\ \bar{x} & \frac{1}{n}\sum_{i=1}^{n} x_i^2 \end{pmatrix}$$
and
$$|\mathbf{X}'\mathbf{X}| = n^2\Big(\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2\Big) = n\,SXX$$
Thus we have
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} \dfrac{\frac{1}{n}\sum_{i=1}^{n} x_i^2}{SXX} & -\dfrac{\bar{x}}{SXX} \\ -\dfrac{\bar{x}}{SXX} & \dfrac{1}{SXX} \end{pmatrix}$$
You can see that if you multiply this matrix by $\sigma^2$ you get the variance-covariance matrix of the estimates (recall Lecture 2).
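A minimal sketch, on simulated data, of the matrix-algebra estimate $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ and a comparison with lm() (the variables and coefficients below are illustrative assumptions, not course data):

```r
# Minimal sketch: least squares by matrix algebra versus lm(), on simulated data.
set.seed(302)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                            # n x (p + 1) design matrix, first column all 1's
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y     # (X'X)^{-1} X'Y

beta_hat                                         # matrix-algebra estimate
coef(lm(y ~ x1 + x2))                            # lm() gives the same values

# For simple linear regression, sigma^2 * (X'X)^{-1} matches the SLR variance formulas
Xs <- cbind(1, x1)
solve(t(Xs) %*% Xs)
```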
LS estimation for MLR (continued)

Thus the projection of $\mathbf{Y}$ onto the column space of $\mathbf{X}$ is given by
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$$
What are the dimensions of the matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$? This matrix is often called the projection or hat matrix, $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. The hat matrix maps the vector of observed values to the vector of fitted values.

The residuals can be calculated as
$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = (\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y}$$
Again, we can see that the residuals are also a linear combination of $\mathbf{Y}$.

We can see that
$$\mathbf{H}' = (\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{H}$$
Thus $\mathbf{H}$ is symmetric. Also,
$$\mathbf{H}\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{H}$$
since $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{I}$. Thus $\mathbf{H}$ is idempotent. Similarly,
$$(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \mathbf{I} - \mathbf{H} - \mathbf{H} + \mathbf{H}\mathbf{H} = \mathbf{I} - \mathbf{H}$$
so $\mathbf{I} - \mathbf{H}$ is also idempotent.

Partition Matrix

We have seen that the hat matrix is $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. Let the $\mathbf{X}$ matrix be column-partitioned into two matrices, $\mathbf{X}_1$ of size $n \times k$ and $\mathbf{X}_2$ of size $n \times (p + 1 - k)$. We can see that $\mathbf{H}\mathbf{X} = \mathbf{X}$ and $\mathbf{X}'\mathbf{H} = \mathbf{X}'$. This implies that
$$\mathbf{H}\mathbf{X} = [\mathbf{H}\mathbf{X}_1 \;\; \mathbf{H}\mathbf{X}_2] = \mathbf{X} = [\mathbf{X}_1 \;\; \mathbf{X}_2]$$
That is, $\mathbf{H}\mathbf{X}_1 = \mathbf{X}_1$ and $\mathbf{H}\mathbf{X}_2 = \mathbf{X}_2$.

Assumptions of Multiple Linear Regression

Recall $E(\mathbf{Y} \mid \mathbf{X}) = \mathbf{X}\boldsymbol{\beta}$. For multiple linear regression the assumptions are still the same as for simple linear regression:
1. Linearity
2. Homoscedasticity
3. Normality (for testing)
Thus we assume $\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$, where $\mathbf{0} = (0, 0, \ldots, 0)'$ and $\sigma^2$ is a scalar. This implies that $\mathbf{Y} \mid \mathbf{X} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$.
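A minimal sketch, again on simulated data generated under the stated assumptions (the design and coefficients are illustrative), verifying the hat-matrix properties from the preceding slides:

```r
# Minimal sketch: the hat matrix and its properties, on data simulated under
# Y | X ~ N(X beta, sigma^2 I) with an illustrative design and beta.
set.seed(302)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))                 # n x (p + 1) design matrix
Y <- X %*% c(1, 2, -0.5) + rnorm(n)               # errors are N(0, 1), independent

H <- X %*% solve(t(X) %*% X) %*% t(X)             # hat matrix

all.equal(H, t(H))                                # H is symmetric
all.equal(H, H %*% H)                             # H is idempotent
all.equal(diag(n) - H, (diag(n) - H) %*% (diag(n) - H))  # I - H is idempotent too
sum(diag(H))                                      # trace = rank = p + 1 = 3

fitted_vals <- H %*% Y                            # H maps observed to fitted values
resid_vals  <- (diag(n) - H) %*% Y                # residuals as a linear combination of Y
all.equal(H %*% X, X)                             # HX = X, so HX1 = X1 and HX2 = X2
```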