ECA 5304 Homework 1
(Due by Friday February 18th, 11pm, to the LumiNUS Homework 1 submission folder)

Instructions: If working in groups, submit one copy per group and indicate clearly the names of the collaborators. You should use R to answer the computational questions. The submission format is as follows:
1) Merge all your handwritten work and typed-up answers/report into a single PDF file;
2) Append your R code at the end of the same PDF file;
3) Name your file with your NUS-recorded name. E.g., if I were a student registered under “Tkachenko, Denis”, my filename would be “Tkachenko Denis.pdf”.

Therefore, you will submit ONE PDF file per student or group that contains all the answers, with the code appended at the end. Verify that your code runs seamlessly as a whole, including commands to load all the necessary libraries, etc. Where randomness is involved, remember to set the seed(s) of the random number generator for replicability, and verify that your code produces the same answers when run several times. Answers to computational questions should be formatted as a report: type/write up your answers and supplement them with graphs/tables/numbers as necessary. Do not just leave answers as comments between the lines of the R script (do not use R Markdown either – it looks ugly and makes the answers hard to follow), and do not screenshot the whole output when you only need one or two numbers from it. Finally, read the hints carefully, and good luck!

Question 1 (In-sample vs. out-of-sample MSE)

In this question you will establish an important and generally valid result in the simplest possible setting. Consider data generated by the following simple constant-plus-noise process:

y_i = μ + ε_i,   ε_i ~ (0, σ²)
You plan to estimate the equation by OLS. Suppose you have a training dataset D = (y_1, ..., y_N) and a test dataset D′ = (y′_1, ..., y′_N) of the same size, generated by the same process.

Hints: If you have forgotten, re-derive the OLS estimator for the model with only a constant. Also, recall the key properties of OLS estimators – these will be useful in working out the answers. For reference, the estimator and its key properties are restated below.
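These are the standard results the hint points to, stated here without derivation so you can check your own work (the derivation itself is part of the exercise):

```latex
\hat{\mu} \;=\; \arg\min_{m}\, \sum_{i=1}^{N} (y_i - m)^2 \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i \;=\; \bar{y},
\qquad
\mathbb{E}[\hat{\mu}] = \mu,
\qquad
\operatorname{Var}(\hat{\mu}) = \frac{\sigma^{2}}{N}.
```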
1) Derive the in-sample and out-of-sample mean squared error (MSE) expressions.
2) Using the results in (1), argue that the in-sample MSE is always going to be less than or equal to the out-of-sample MSE.
3) Explain what determines the difference between the two MSEs. Do you think this result can be valid more generally? Discuss the significance and any potential usefulness of this result.

Question 2 (Some intuition on ridge regression)

In this question you will establish an important property of ridge estimates in the simplest possible setting. Suppose that N = 2, P = 2, and the design matrix X is such that x_11 = x_12 = x_1 and x_21 = x_22 = x_2 (the indices refer to the row/column positions of the elements of the X matrix). Furthermore, assume that the variables are demeaned, so there is no intercept included in the model and hence there is no constant in the design matrix.
1) Can you estimate the parameters using OLS in this setting? Explain.
2) State the ridge regression optimization problem in this setting.
3) Solve the problem in (2) and argue that β̂_1 and β̂_2 obtained from ridge estimation for a given lambda will be equal in this setting. (Hint: you do not need to solve the problem fully – derive the F.O.C.s and see whether there is something you can note that gives away the answer. You also do not need to use matrix algebra here.)
4) Without using any derivation, what would you intuitively expect to happen to β̂_1 and β̂_2 in this setting if instead of ridge you used the LASSO penalty?

Question 3 (Revisit the Boston housing data with new tools)

1) Load the ‘Boston’ dataset from the MASS package. Use the ‘?Boston’ command to retrieve the description of the dataset and the variables – discuss each variable and provide economic intuition on why it may be a useful predictor of the median house value (medv). Explain which variables you intuitively expect to be the most important predictors of medv (do not run any quantitative analysis yet). Randomly split the dataset into a training set of 400 observations and a test set of 106 observations.
2) Use the first block of code in the Boston_aug.R file to create a mildly expanded set of variables (the original data plus cubic polynomials in all continuous variables, and a dummy for zn > 0). Perform best subset selection with a maximum of 39 variables. Select the best models using AIC and BIC (use 3 methods for each: the variance estimate from the model (take the Cp/BIC computed by regsubsets()), the variance from the largest model, and the iterative variance). Discuss your results.
3) (Hint: it may be useful to start a new script here or run ‘rm(list = ls())’ to clear the previous results and avoid clutter.) Use the second code block in the Boston_aug.R file to create a greatly expanded set of variables – this is what is sometimes called “feature engineering”. Read the comments there to understand which variables are created. How many predictors are there in the augmented dataset? Suppose we wanted to do subset selection – explain which strategies are possible here and which are not.
4) Perform forward stepwise selection, restricting the maximum model size to 200 variables. Use AIC and BIC with iterated variance estimation to select the best model. Contrast your results with the best subset results from the previous section. (A sketch of the commands for parts 1–4 follows this list.)
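The following is a minimal, non-authoritative sketch of parts 1–4. The objects train_aug and train_big are placeholders for the data frames produced by Boston_aug.R (whose exact contents are course-specific), and the seed value is arbitrary:

```r
library(MASS)    # the Boston data
library(leaps)   # regsubsets() for best subset and forward stepwise selection

set.seed(123)                            # any fixed seed, for replicability
train_idx <- sample(nrow(Boston), 400)   # 400 training observations
train <- Boston[train_idx, ]
test  <- Boston[-train_idx, ]            # the remaining 106 observations

# Part 2: best subset selection on the mildly augmented data
# (placeholder: train_aug, from the first block of Boston_aug.R).
# really.big = TRUE permits exhaustive searches over many variables.
fit_best <- regsubsets(medv ~ ., data = train_aug, nvmax = 39, really.big = TRUE)
sm_best  <- summary(fit_best)
sm_best$cp    # Cp, which ranks models like AIC under the model-based variance
sm_best$bic   # BIC for each model size

# Part 4: forward stepwise selection on the greatly augmented data
# (placeholder: train_big, from the second block of Boston_aug.R)
fit_fwd <- regsubsets(medv ~ ., data = train_big, nvmax = 200, method = "forward")
sm_fwd  <- summary(fit_fwd)
# AIC/BIC under the largest-model and iterated variance estimates must be
# computed manually from sm_fwd$rss; regsubsets() reports only Cp and BIC.
```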
5) Fit the Ridge regression model using 10-fold cross-validation and evaluate its performance on the test set. There is a “1-standard-error rule of thumb” in the machine learning literature, stating that it may be desirable to use the most penalized model whose cross-validation MSE is within one standard error of the minimum (the logic is that we opt for an even simpler model that does not seem to be statistically different in cross-validation performance). The glmnet package reports the corresponding lambda value as ‘lambda.1se’ in the results of the cv.glmnet() function. Evaluate the model corresponding to ‘lambda.1se’ – does this rule of thumb look like a good idea?
6) Fit the LASSO model using 10-fold cross-validation and evaluate its performance on the test set for both ‘lambda.min’ and ‘lambda.1se’. Comment on your results briefly.
7) Now fit the LASSO with plug-in lambda using the rlasso() function from the hdm package. Use the default settings (adjusted for heteroskedasticity). Perform post-LASSO estimation as well, using the same package under default settings. Comment on your findings. (A sketch of the commands for parts 5–7 follows this question.)
8) Summarize all the results obtained in one big table for ease of reference. Now it is time to check how much value added the fancy machinery brought to the table: 1) compute the test MSE for the polynomial-in-lstat-only model we chose in Lecture 1 (4th-order polynomial in lstat only); 2) compute the test MSE for the model fit in the original 1978 paper: include all the initial 13 predictors in levels, but square the rm variable (i.e., do not include its level, only its square). A sketch of these two benchmark fits is also given below. Comment on the results here and, more broadly, on the whole exercise. For example, you may want to address the following questions: What have you learned from this application? Were there results that surprised you, or results you expected to obtain? Can you name some plausible reasons for why your methods did as well/badly as they did? Do you have any takeaways that occurred to you in the application that were not apparent from the textbook/lectures? Do you feel you are a more powerful data analyst than the researcher from 1978?
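A minimal sketch of parts 5–7, assuming x_train/x_test are model matrices built from the augmented data (e.g., via model.matrix()) and y_train/y_test are the corresponding medv vectors – these names are placeholders, not part of the assignment:

```r
library(glmnet)  # ridge and LASSO with cross-validation
library(hdm)     # rlasso(): plug-in LASSO and post-LASSO

set.seed(123)    # the CV fold assignment is random, so seed again here

# Part 5: ridge regression (alpha = 0); 10-fold CV is the cv.glmnet() default
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 10)
mse_ridge_min <- mean((y_test - predict(cv_ridge, newx = x_test, s = "lambda.min"))^2)
mse_ridge_1se <- mean((y_test - predict(cv_ridge, newx = x_test, s = "lambda.1se"))^2)

# Part 6: LASSO (alpha = 1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1, nfolds = 10)
mse_lasso_min <- mean((y_test - predict(cv_lasso, newx = x_test, s = "lambda.min"))^2)
mse_lasso_1se <- mean((y_test - predict(cv_lasso, newx = x_test, s = "lambda.1se"))^2)

# Part 7: plug-in LASSO and post-LASSO; the hdm defaults adjust the penalty
# for heteroskedasticity, as the question requires
fit_rlasso <- rlasso(x_train, y_train, post = FALSE)   # plug-in LASSO
fit_post   <- rlasso(x_train, y_train, post = TRUE)    # post-LASSO OLS refit
mse_rlasso <- mean((y_test - predict(fit_rlasso, newdata = x_test))^2)
mse_post   <- mean((y_test - predict(fit_post,   newdata = x_test))^2)
```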
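And a sketch of the two benchmark fits in part 8, reusing the train/test split of the original (non-augmented) Boston data from the earlier sketch:

```r
# Benchmark 1: 4th-order polynomial in lstat only (the Lecture 1 model)
fit_poly <- lm(medv ~ poly(lstat, 4), data = train)
mse_poly <- mean((test$medv - predict(fit_poly, newdata = test))^2)

# Benchmark 2: the 1978-paper specification as described in the question –
# all 13 original predictors in levels, except rm enters only as its square
fit_1978 <- lm(medv ~ . - rm + I(rm^2), data = train)
mse_1978 <- mean((test$medv - predict(fit_1978, newdata = test))^2)

c(poly_lstat = mse_poly, paper_1978 = mse_1978)   # add these to the part-8 table
```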