R-ST340

ST340 Lab 6: Linear models
2021–22
Distance functions
(a) Write a similar function to calculate a matrix of pairwise ℓ2 distances:
    distances.l2 <- function(X, W) {
      # YOUR CODE HERE
    }

(b) Write a function to calculate the ℓ1 distances between pairs of row vectors in two matrices:

    distances.l1 <- function(X, W) {
      # YOUR CODE HERE
    }

(c) Write a similar function to calculate the Mahalanobis distance between the row vectors, given a D × D covariance matrix S:

    distances.maha <- function(X, W, S) {
      # YOUR CODE HERE
    }

Linear models: Setup
The dataset SmokeCancer.csv shows lung cancer rates by U.S. state in 2010, with a number of covariates such as Federal Year 2010 cigarette sales per 100,000. Read the data file on lung cancer and create a data frame with variables of interest.

    X = read.table("SmokeCancer.csv", header=TRUE, sep=",", row.names=1)
    LungCancer = data.frame(CigSalesRate=100000*X[,"FY2010Sales"]/X[,"Pop2010"],
                            X[,c("CigYouthRate","CigAdultRate","LungCancerRate")])

Linear regression
(a) Fit a linear model for LungCancerRate (type ?lm for a reminder about lm):

    summary(lm(LungCancerRate~CigSalesRate+CigYouthRate+CigAdultRate, data=LungCancer))

Ridge regression
(b) Fit a linear model for LungCancerRate and use the ridge estimator:

    summary(lm(LungCancerRate~CigSalesRate+CigYouthRate+CigAdultRate, data=LungCancer))

Classical statistical methods for model validation
Consider the same dataset SmokeCancer.csv.
(a) The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are statistical measures of model performance (validation). They are two different ways of quantifying the trade-off between model complexity (in terms of, e.g., the number of parameters) and the fit to the training data (in terms of likelihood), defined as follows:

    AIC = 2 × #parameters − 2 × log(likelihood)
    BIC = log(amount of data) × #parameters − 2 × log(likelihood)

Use AIC and BIC to find a good linear model for LungCancerRate in terms of CigSalesRate, CigYouthRate and CigAdultRate.
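As a quick check of the two definitions, they can be evaluated by hand and compared with R's built-in AIC() and BIC(). This is only a sketch: it uses the built-in mtcars data as a stand-in for SmokeCancer.csv, and the names fit, ll, k, n, aic.by.hand and bic.by.hand are chosen here for illustration. Note that for a linear model R counts the error variance as a parameter, so #parameters is the number of coefficients plus one.

```r
# Stand-in data: the built-in mtcars data frame, NOT SmokeCancer.csv (assumption).
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)       # maximised log-likelihood of the fitted model
k  <- attr(ll, "df")    # parameters as counted by R: 3 coefficients + 1 error variance
n  <- nobs(fit)         # amount of data (number of observations)

# The two definitions, written out term by term
aic.by.hand <- 2 * k - 2 * as.numeric(ll)
bic.by.hand <- log(n) * k - 2 * as.numeric(ll)

# Compare with the built-in functions; the two columns should agree
rbind(AIC = c(by.hand = aic.by.hand, built.in = AIC(fit)),
      BIC = c(by.hand = bic.by.hand, built.in = BIC(fit)))
```

Because log(n) exceeds 2 once n is 8 or more, BIC penalises each extra parameter more heavily than AIC on any data set of realistic size, so the two criteria need not select the same model.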
You could also try using transformations of the covariates by adding terms such as I(CigSalesRate^2) and I(CigSalesRate*CigAdultRate) to your formulae. (We say that a model is good whenever it has the smallest BIC or AIC among all the linear models with the same covariates.)

Write a function that takes a formula and then calculates AIC and BIC. Use your function to find a good linear model for LungCancerRate.
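One possible shape for such a comparison is sketched below. It is not the intended solution, only an illustration under stated assumptions: the helper name info.criteria is hypothetical, the built-in mtcars data stands in for the LungCancer data frame, and the candidate formulae are arbitrary examples of how I() terms enter a formula.

```r
# Hypothetical helper (name is an assumption): fit a linear model from a
# formula and return both criteria using R's built-in AIC() and BIC().
info.criteria <- function(formula, data) {
  fit <- lm(formula, data = data)
  c(AIC = AIC(fit), BIC = BIC(fit))
}

# Candidate models on the stand-in data set mtcars (NOT SmokeCancer.csv).
candidates <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ wt + hp + I(wt^2),     # squared term, protected by I()
  mpg ~ wt + hp + I(wt * hp)   # product term, protected by I()
)

# One row per candidate formula, one column per criterion
scores <- t(sapply(candidates, info.criteria, data = mtcars))
rownames(scores) <- sapply(candidates,
                           function(f) paste(deparse(f), collapse = " "))
scores

# Formula with the smallest value of each criterion
rownames(scores)[which.min(scores[, "AIC"])]
rownames(scores)[which.min(scores[, "BIC"])]
```

The I() wrapper matters because, inside a formula, operators such as ^ and * are interpreted as model-building operators rather than arithmetic; I() forces the arithmetic meaning.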