R-BA810

BA810 individual assignment: exploring the bias-variance trade-off with Lasso
Due 9/26/2021 at 12pm (noon) EST
Georgios Zervas

Assignment description

The purpose of this assignment is to explore the bias-variance trade-off while practicing your data.table, ggplot2, and R Markdown skills.

The deliverable for this assignment is a PDF file that you will produce using R Markdown. The PDF file should contain all the code you used to generate your results: show your work, and explain what your code is doing while being as concise as you can. You should submit your PDF file on Blackboard.

Below I have provided a template to get you started. It shows you how to load and pre-process the data, and provides a skeleton for the other things you need to do. Please do not change the definition of the train/test data; use it as is.

In addition to the assignment, your R Markdown document should include the following two components:

1. Create a new R Markdown file by clicking File -> New File -> R Markdown in RStudio. Enter a title and author name. Keep HTML as the output format. Modify the top of your source code with the following:

    ---
    title: 'BA810 individual assignment'
    author: YOUR NAME AND BU ID HERE, eg, Georgios Zervas (U12312345)
    ---

   Delete the example code that R sticks in this new document (everything below the ## R Markdown section).

   To compile your document, click the “Knit” button. This should open a new window with an HTML version of your homework. If you are not familiar with R Markdown, DataCamp has a great tutorial: https://www.datacamp.com/courses/reporting-with-r-markdown

   Once you are done, to create a submittable PDF, simply click “Open in Browser” and then print the document to PDF using your browser.

2. At the bottom of the document, complete the mandatory statement of collaboration.

To complete this assignment, you will need to use glmnet to fit a number of Lasso regressions to the CA housing dataset. Each Lasso will be fit using a different value of λ.
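
The role of λ can be made concrete by looking at the objective that each fit minimizes. As a sketch, assuming the default Gaussian family with alpha = 1, and writing N for the number of training rows, p for the number of predictors, β_0 for the intercept, and β for the coefficient vector (notation introduced here, not taken from the assignment), the glmnet documentation states the Lasso problem as:

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2N}\sum_{i=1}^{N}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
\;+\; \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert
```

Larger values of λ penalize the coefficients more heavily and shrink more of them to exactly zero, which tends to raise bias and lower variance; smaller values do the reverse.
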
Then, you will produce a figure with the following elements:

1. The x-axis should be λ, and it should be flipped so that larger values of λ appear on the left and smaller values on the right. This is the opposite of how we normally order the x-axis.
2. The y-axis should be MSE.
3. The plot should contain two lines with different colors: one line for the MSE of your train data, and the other for the MSE of your test data.
4. Finally, the plot should be annotated with two points (the big dots in the example figure below) that mark the minimum point on each MSE curve.

The final figure should resemble the example below:

[Example figure: train and test MSE plotted against λ, with the minimum of each curve marked by a large dot.]

Finally, you are asked to show the coefficients for the model with the lowest test MSE and discuss these results (just 1-2 sentences explaining how you interpret this regression).

Assignment template

Setup

    library(data.table)
    library(ggplot2)
    library(ggthemes)
    library(glmnet)
    theme_set(theme_bw())

Load the CA housing dataset

The CA housing dataset is available on Blackboard. Download it to your computer, and load it into R. You will have to edit the path below, depending on where you place the dataset on your computer.

    dd <- fread("data/housing.csv")

The total_bedrooms column has missing data that we are going to impute.

    total_bedrooms_median <- median(dd$total_bedrooms, na.rm = TRUE)
    dd[is.na(total_bedrooms), total_bedrooms := total_bedrooms_median]

Next, we will split our dataset into train and test. For simplicity and consistency, we will use rows 1 to 5,000 as training data, and rows 15,001 to 18,000 as test data. (In practice, you would split your data into train and test randomly.)
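
The parenthetical above mentions random splitting. A minimal sketch of that idea, on a small made-up table rather than the housing data, might look like the following. The names toy, train_idx, and test_idx, the 70/30 proportion, and the seed are all illustrative assumptions; the assignment itself requires the fixed row ranges described above.

```r
library(data.table)

set.seed(810)  # arbitrary seed, used only so the sketch is reproducible

# Toy stand-in for a dataset (made-up values, not the housing data)
toy <- data.table(id = 1:100, value = rnorm(100))

# Draw 70 of the 100 row indices at random for training (an assumed 70/30 split)
n_train   <- 70L
train_idx <- sample(nrow(toy), size = n_train)

# Every row that was not drawn goes to the test set
test_idx <- setdiff(seq_len(nrow(toy)), train_idx)

toy_train <- toy[train_idx]
toy_test  <- toy[test_idx]
```

Fixing the seed makes the split reproducible across runs, which matters when results are compared between people.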

    # use this piece of code as is
    train_offsets <- seq(5000)
    test_offsets <- 15000 + seq(3000)

    x_data <- model.matrix( ~ -1 + total_rooms + total_bedrooms + households +
                              housing_median_age + population + median_income +
                              ocean_proximity, dd)

    # outcome is median house value in millions
    y_data <- dd$median_house_value / 1e6

    x_train <- x_data[train_offsets, ]
    y_train <- y_data[train_offsets]
    x_test <- x_data[test_offsets, ]
    y_test <- y_data[test_offsets]

Run Lasso regressions (10 points)

Next, we will use glmnet to fit a set of Lasso regressions. When you invoke glmnet, by default it fits 100 different Lasso regressions for 100 decreasing values of lambda. (Sometimes fewer than 100 λ’s are returned; this is fine.)

    # this will fit 100 lasso regressions for different values of lambda (chosen automatically)
    est <- glmnet(..., ..., alpha = ..., nlambda = 100)

You can inspect the values of λ used for each Lasso regression using est$lambda, which is a vector containing 100 values.

Predict responses (5 points)

Next, we will use each of these 100 models to create predictions for both train and test.

    y_train_hat <- ...
    y_test_hat <- ...

Compute MSEs (5 points)

Using these predictions, we are now ready to compute MSEs. Specifically, we want to create two MSE vectors (one for train, one for test). Each MSE vector will be of length 100, corresponding to the different values of λ.

    # write code to create a vector that contains 100 MSE estimates for the train data
    mse_train <- ...

    # write code to create a vector that contains 100 MSE estimates for the test data
    mse_test <- ...

    # value of lambda (not the MSE itself) at which the train MSE is smallest
    lambda_min_mse_train <- est$lambda[which.min(mse_train)]
    lambda_min_mse_test <- ...

Aggregate all MSEs in a single dataset (5 points)

Now, we have all the required components for the plot: the vector of lambdas, the two vectors of MSEs, and the two values of lambda that minimize train/test MSE. As a first step, create a new data.table that contains these vectors.
The data frame should have 3 columns: lambda, mse, and dataset, where dataset is either Train or Test. The data frame should contain 200 rows: 100 for Train and 100 for Test.

    # create a data.table of train MSEs and lambdas
    dd_mse <- data.table(
      lambda = ...,
      mse = ...,
      dataset = "Train"
    )

    # use rbind to append a data.table of test MSEs and lambdas,
    # so that dd_mse holds both datasets in a single data.table
    dd_mse <- rbind(dd_mse, data.table(
      lambda = ...,
      mse = ...,
      dataset = "Test"
    ))

Plot the MSEs (10 points)

Now that you have put together a data frame containing λ’s and MSEs, plot the figure described at the beginning of the assignment using ggplot2.

Extract the results of the best fitting model (5 points)

You can inspect the value of λ that minimizes test MSE:

    print(lambda_min_mse_test)

Now, use the coef function with λ = lambda_min_mse_test. You can pass a specific value of λ to the coef function using the argument s = ....

    coef(...)

Collaboration statement (10 points)

Include the names of everyone who helped you with this assignment, and explain how each person helped. While asking for help is OK, do NOT share exact solutions with other students.