
Department of Statistics
STATS 330: Advanced Statistical Modelling
Assignment 3, Semester 2, 2021
Total: 80 marks
Due: 11:59pm, 30th September, 2021

Notes:
(i) Write your assignment using R Markdown. Knit your report to either a Word or PDF document.
(ii) Create a section for each question. Include all relevant code and output in the final document.
(iii) Five presentation marks are available. Please think of your markers: keep your code and plots neat, and remember to check your spelling. (R Markdown has an inbuilt spellchecker!)
(iv) Please remember to upload your Word or PDF document to CANVAS by the due date.

In this assignment, we will examine modelling from a number of perspectives, all of which should be considered when deciding how to model data: inspecting the data and making allowances for other background events pertinent to these data; looking at how the models fit your data; seeing how well each model predicts; and using all of the above ideas to decide which model you are most comfortable with.

Question 1

In this question, we will revisit the caffeine data from Assignment 2 and investigate ideas about fitting the data adequately and, hopefully, recognising when we may be over-fitting the data.

Caffeine data

Re-read the section describing these data in Assignment 2 if necessary. In the following we will fit five models of increasing complexity: polynomial models from order zero (the null model) up to order three (a cubic polynomial), along with a GAM.

library(mgcv)  # for gam()

## null model (order 0)
mod.0 = glm(cbind(Agrade, n - Agrade) ~ 1, family = binomial, data = Caffeine.df)
## linear (order 1)
mod.1 = glm(cbind(Agrade, n - Agrade) ~ caffeine, family = binomial, data = Caffeine.df)
## quadratic (order 2)
mod.2 = glm(cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2), family = binomial, data = Caffeine.df)
## cubic (order 3)
mod.3 = glm(cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2) + I(caffeine^3), family = binomial, data = Caffeine.df)
## GAM
mod.gam = gam(cbind(Agrade, n - Agrade) ~ s(caffeine), family = binomial, data = Caffeine.df)

# look at the null, order 1 and GAM fits (adapt this below)
plot(I(Agrade/n) ~ caffeine, ylim = c(0, 0.6),
     main = "Proportion of A grades vs Caffeine", data = Caffeine.df)
# add lines
caffs = seq(0, 500, by = 1)
new.df = data.frame(caffeine = caffs)
p.gam = predict(mod.gam, newdata = new.df, type = "response")
lines(caffs, p.gam, col = "blue")
p0 = predict(mod.0, newdata = new.df, type = "response")
lines(caffs, p0, col = "red")
p1 = predict(mod.1, newdata = new.df, type = "response")
lines(caffs, p1, col = "orange")
legend("topright", lty = 1, lwd = 3, col = c("blue", "red", "orange"),
       legend = c("gam", "null", "linear"))

(a) Run the above code and adapt it to include the fitted values of all five models. Which model fits the data best? Comment briefly. [5 marks]

(b) Use the anova function to evaluate the models that can be thought of as being subsets of each other. Comment on how you know they are subsets of each other, and which model best fits these data with the least number of terms used (i.e. apply the principle of Occam's razor). [5 marks]
Hint: read your notes and see which of these models can be thought of as being simplifications (subsets) of other models.

(c) Compute the AIC for all models (including mod.gam) and comment on which model is best. Comment briefly. [5 marks]
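For parts (b) and (c), a minimal sketch (assuming the five models have been fitted as above) might look like this. Note that the GAM is not a simple subset of the polynomial models, so it appears only in the AIC comparison:

# likelihood ratio tests for the nested polynomial models
anova(mod.0, mod.1, mod.2, mod.3, test = "Chisq")

# compare all five models (including the GAM) on AIC
AIC(mod.0, mod.1, mod.2, mod.3, mod.gam)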
(d) The following code can be used to find the best overall model using the dredge function. The object msubset allows us to keep all lower-order terms once we have found a significant higher-order term. Submit and adapt this code to find the best model. Comment briefly on what it recommends. [5 marks]

library(MuMIn)
options(na.action = "na.fail")
msubset <- expression(dc(caffeine, `I(caffeine^2)`, `I(caffeine^3)`))
all.fits <- dredge(mod.3, subset = msubset)

(e) Having done all this heroic work, what model do you propose is the best? Briefly justify how you came to this conclusion. [5 marks]

Nerd alert: the following is some technical background about best practice when doing polynomial regression.

The cubic model above fits the following model: $Y_i \sim \mathrm{Bin}(n_i, p_i)$, where $Y_i$ is the number of people who got A grades out of the $n_i$ people at caffeine level $x_i$, and

$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3.$

It turns out that higher-order fitted models can suffer from a hidden curse known as multicollinearity (MC), which can give very computationally unstable estimates. We will examine this and come up with a better solution. You will recall that MC can come about when some of the variables are explained by other variables (or a combination thereof). A simple way to diagnose this is by looking at the correlation between these variables: if they are highly correlated with each other, this may mean we have a problem with MC. We can extract the design (explanatory) matrix from our fitted model as follows:

model.matrix(mod.3)
##    (Intercept) caffeine I(caffeine^2) I(caffeine^3)
## 1            1        0             0             0
## 2            1       50          2500        125000
## 3            1      100         10000       1000000
## 4            1      150         22500       3375000
## 5            1      200         40000       8000000
## 6            1      250         62500      15625000
## 7            1      300         90000      27000000
## 8            1      350        122500      42875000
## 9            1      400        160000      64000000
## 10           1      450        202500      91125000
## 11           1      500        250000     125000000
## attr(,"assign")
## [1] 0 1 2 3

One way to see if we have MC is by looking at the relationships between variables using s20x's pairs20x function. Note that we ignore the column of 1's, which is used to estimate the $\beta_0$ parameter.

library(s20x)  # for pairs20x()
pairs20x(model.matrix(mod.3)[, -1])

[Figure: pairs20x plot of caffeine, I(caffeine^2) and I(caffeine^3); the upper panels show pairwise correlations of 0.96, 0.91 and 0.99.]

round(cor(model.matrix(mod.3)[,-1]), 3)
##               caffeine I(caffeine^2) I(caffeine^3)
## caffeine         1.000         0.963         0.909
## I(caffeine^2)    0.963         1.000         0.986
## I(caffeine^3)    0.909         0.986         1.000

Here we see high correlations between our explanatory variables, indicating possible instability in our coefficient estimates. Okay, so we have diagnosed a problem; here is the solution: use the R function poly() as follows:

mod.3a = glm(cbind(Agrade, n - Agrade) ~ poly(caffeine, 3), family = binomial, data = Caffeine.df)

(f) Your task is to prove that the predictions of the data we observe using mod.3 and mod.3a are identical. Also, you will need to show that we have eliminated the MC, using the new model matrix produced by the poly function. Comment briefly on your output. [5 marks]
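One possible sketch for part (f): compare the fitted values of the two cubic fits, and re-examine the correlations of the new design matrix. (Using max(abs(...)) is just one convenient way to show the fits agree; any sensible comparison works.)

# the fitted values of the raw and orthogonal cubic fits should agree
# to within numerical precision
max(abs(fitted(mod.3) - fitted(mod.3a)))

# the orthogonal polynomial columns produced by poly() should be
# essentially uncorrelated
round(cor(model.matrix(mod.3a)[, -1]), 3)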
Question 2

Another crazed and ethically challenged scientist was interested in working out the peak amount of caffeine a student needs to consume in order to optimise their A-grade performance. They collected the following data by randomly allocating 300 students each to groups where the caffeine levels were 0, 50, 100, 150 and 200 mg, respectively. They recorded the number of students who obtained A grades on the test in each group: 109, 155, 175, 158 and 103 A-grades, respectively.

(a) Plot these data (along with the fitted values) and show that a quadratic model (call it, say, mod.quad for future reference) in terms of caffeine fits these data best. Comment briefly. [5 marks]
Hints: Ignore 'best practice' (discussed above) and use raw terms, i.e. don't use the recommended poly function for this example; it makes interpretation easier. You may need to create your own data frame object called, say, Caffeine2.df. (Illustrative sketches of the computational steps for Questions 2 and 3 are collected at the end of this handout.)

Nerd alert: the sequel.

One way to find the peak amount of caffeine for our model $Y_i \sim \mathrm{Bin}(n_i, p_i)$, where $Y_i$ is the number of people who got A grades out of $n_i$ people at caffeine level $x_i$ and

$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2,$

is to use some calculus. It turns out that this value is

$x_{\text{peak}} = -\frac{\beta_1}{2\beta_2}.$

This can be estimated from the (random) data using the estimates $\hat\beta^T = (\hat\beta_0, \hat\beta_1, \hat\beta_2)$, and so it needs an estimate of its variance, and therefore its standard error. If we know the variance matrix of $\hat\beta^T = (\hat\beta_0, \hat\beta_1, \hat\beta_2)$, called, say, $\mathrm{Var}(\hat\beta)$, an estimate of the variance of $\hat{x}_{\text{peak}}$ is given by the matrix operation

$(\nabla g)^T \, \mathrm{Var}(\hat\beta) \, \nabla g, \quad \text{where} \quad (\nabla g)^T = \left(0,\; -\frac{1}{2\hat\beta_2},\; \frac{\hat\beta_1}{2\hat\beta_2^2}\right).$

(For my fellow nerdlings: if $f(\beta) = -\frac{\beta_1}{2\beta_2}$ then $(\nabla g)^T = \left(\frac{\partial f(\beta)}{\partial \beta_0}, \frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}\right)$, also known as the gradient vector.)

The variance matrix of $\hat\beta$ for a model object in R can be obtained with the function vcov. To illustrate, for the quadratic model called, say, mod.quad:

vcov(mod.quad)
##                 (Intercept)      caffeine I(caffeine^2)
## (Intercept)    1.272482e-02 -2.226622e-04  8.287286e-07
## caffeine      -2.226622e-04  7.027689e-06 -3.232232e-08
## I(caffeine^2)  8.287286e-07 -3.232232e-08  1.618650e-10

(b) Produce an estimate of $x_{\text{peak}}$ based on your quadratic model. Inspect the value and comment briefly. [5 marks]

(c) Produce a vector in R called Delta.g according to our definition above, using your quadratic model. Inspect the value and comment briefly. [5 marks]

(d) Obtain an estimate of the variance for $\hat{x}_{\text{peak}}$ by computing the code below. Inspect the value and comment briefly. [5 marks]

t(Delta.g) %*% vcov(mod.quad) %*% Delta.g

It also turns out that any function of asymptotically normal random variables is also asymptotically normal, and your model estimates are asymptotically normal.

(e) Use the above to create a 95% confidence interval for $x_{\text{peak}}$. Comment briefly. [5 marks]

Phew! There must be an easier way. There is: simulation. So 'buckle up' and let's do this!

Question 3

(a) Produce the estimates of the proportion of success, called preds, say, from 300 trials apiece, and use this to simulate random binomial observations as follows (an example of a parametric bootstrap):

ns = Caffeine2.df$n
xs = Caffeine2.df$caffeine
# create your preds ( ) in here
# preds =
ys = rbinom(length(ns), size = ns, prob = preds)
ys

Inspect the values you obtain and comment briefly. [5 marks]

(b) Simulate 1000 $x_{\text{peak}}$ values using 1000 random ys vectors generated as above. Also obtain the residual deviance statistics from these 1000 simulations. Name these vectors xpeaks and devs, respectively. [5 marks]

(c) Verify that the xpeak values are approximately Normal and compute the inner 95% quantiles of the xpeaks. Comment briefly on how this agrees with the answer you obtained in Question 2 above. [5 marks]

(d) Verify that the deviances, devs, you obtained seem to come from an appropriate Chi-squared distribution. Comment briefly. [5 marks]
Hint: find out what ' ' is in the code below, then run it.

hist(devs, breaks = 100, prob = TRUE)
dvs = sort(devs)
lines(dvs, dchisq(dvs, ), col = "blue")
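The sketches below pull together the computational steps for Questions 2 and 3. They are illustrative only: the object names (Caffeine2.df, mod.quad, Delta.g, preds, xpeaks, devs) follow the naming suggested in the questions, and the data values are those given in the Question 2 description.

# Question 2: enter the data and fit the quadratic model
Caffeine2.df = data.frame(caffeine = c(0, 50, 100, 150, 200),
                          n = rep(300, 5),
                          Agrade = c(109, 155, 175, 158, 103))
mod.quad = glm(cbind(Agrade, n - Agrade) ~ caffeine + I(caffeine^2),
               family = binomial, data = Caffeine2.df)

# (b) point estimate of the peak, x.peak = -beta1 / (2 * beta2)
b = coef(mod.quad)
x.peak = -b[2] / (2 * b[3])

# (c) the gradient vector from the nerd alert above
Delta.g = c(0, -1 / (2 * b[3]), b[2] / (2 * b[3]^2))

# (d) delta-method variance and standard error
var.peak = drop(t(Delta.g) %*% vcov(mod.quad) %*% Delta.g)
se.peak = sqrt(var.peak)

# (e) approximate 95% confidence interval for x.peak
x.peak + c(-1.96, 1.96) * se.peak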
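A sketch for Question 3 (a) and (b), continuing from the objects above. The simulation loop simply refits the quadratic model to each simulated response and stores the peak estimate and residual deviance:

# (a) fitted success probabilities and one simulated response vector
ns = Caffeine2.df$n
xs = Caffeine2.df$caffeine
preds = fitted(mod.quad)  # estimated P(A grade) at each caffeine level
ys = rbinom(length(ns), size = ns, prob = preds)

# (b) repeat 1000 times
xpeaks = devs = numeric(1000)
for (i in 1:1000) {
  ys = rbinom(length(ns), size = ns, prob = preds)
  sim.fit = glm(cbind(ys, ns - ys) ~ xs + I(xs^2), family = binomial)
  bs = coef(sim.fit)
  xpeaks[i] = -bs[2] / (2 * bs[3])
  devs[i] = deviance(sim.fit)
}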
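And for Question 3 (c), one way to inspect Normality and extract the inner 95% quantiles:

# (c) graphical checks and the inner 95% quantiles
hist(xpeaks, breaks = 50, prob = TRUE)
qqnorm(xpeaks); qqline(xpeaks)
quantile(xpeaks, probs = c(0.025, 0.975))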