R-MATH6169

MATH6169 Flexible Regression
Coursework 2021/22
The coursework for this module accounts for 50% of the total mark. You should hand in your solutions as
a single PDF file via Blackboard by Monday 2 May 2022, 11pm. Please aim to restrict your report to
not more than 20 pages (at most 25 pages) including references, tables, graphs and any code. There are
two parts. Part A covers mixed models and requires the use of the package STATA. Part B is on smoothing
with splines and requires the use of the package R.
Part A [50 marks]
1. [20 marks] Here we use the data taken from Agresti (2013: 219) on graduate school applications to
the 23 departments within the College of Liberal Arts and Sciences at the University of Florida during
the 1997–1998 academic year. The dataset contains the department ID (department), the number of
applications (napplied), and the number of students admitted (nadmitted) cross-classified by gender
(female) and is available as STATA-file admissions.dta. The underlying research question is whether
admission decisions are independent of gender.
a) Describe the logistic regression model (including the associated log-likelihood) precisely for this
case taking the number admitted out of the number applied as response and gender and de-
partment as fixed effects. Provide and discuss the results of your model fit. What are the
disadvantages of this model [5 marks]
b) Now precisely describe the mixed effects logistic regression model (including the associated log-
likelihood) which takes gender as fixed effect and department as random effect. Provide and
discuss the results of your model fit. What are the benefits of the model in comparison to the
model used in a) Is the random department effect required in the mixed model Provide an
evaluation table using AIC and BIC. [5 marks]
c) Now investigate whether there is a gender-department interaction. Clearly state the random
effects distribution. Assuming that a random intercept is required, fit the model with random
slope for gender with and without allowing correlation with the random intercept. Provide and
discuss the results of your model fit. Add this model to your model evaluation table. [8 marks]
d) Provide a discussion of your analysis including a graph of admitted female students and their
model fits versus department and derive some conclusions in relation to the underlying research
question. [2 marks]
2. [30 marks] This part of the coursework will be using a longitudinal dataset from Diggle et al. (2002),
consisting of weight measurements of 48 pigs on 9 successive weeks. Pigs are identified by the variable
id and is available as STATA-file pigs.dta.
a) Explore the data by providing a spaghetti-plot (plot each pig in its weight over time). Give an
interpretation of the plot. [5 marks]
1
b) Describe precisely the model which takes week as a continuous linear effect and includes a random
effect for pig. Provide the results from the fit of the model. You may assume that the response
weight is normally distributed. [5 marks]
c) Now extend the model in b) and include a random slope for the week effect. Give the precise
mathematical formulation and provide the results of your analysis. There are three basic models
here: only random intercept, independent random intercept and slope, correlated random inter-
cept and slope. Provide a model evaluation table using the appropriate likelihood. [10 marks]
d) A different approach uses pig and week as crossed random effects. The formal model is
weightij = β0 + β1weekij + ui + wj + ij
where ui is the random effect for pig i and wj the random effect for week j. We assume that
ui ~ N(0, σ2u), wj ~ N(0, σ2w), ij ~ N(0, σ2)
and all independent. Provide a graphical sketch what this model means. Provide a model analysis
and discuss your results. Does this model provide a better fit than the random coefficient model
[Hint: to do the fit in STATA you need to leave the level field empty and fill in the crossed factor
field the relevant random effect.] [10 marks]
References:
Agresti, A. 2013. Categorical Data Analysis. 3rd ed. Hoboken, NJ: Wiley.
Diggle, P. J., P. J. Heagerty, K.-Y. Liang, and S. L. Zeger. 2002. Analysis of Longitudinal Data. 2nd ed.
Oxford: Oxford University Press.
2
Part B [50 marks]
A botanist has recently discovered a new flowering plant species and is interested in its nectar production
across different regions. One question of interest is the relationship between temperature and the nectar
production per month per flower on average. There are three variables in the nectar.csv file:
temp: the temperature of the region when the nectar production is measured.
weight: the average amount (in mg) of nectar produced per month per plant on average.
group: the grouping information to be later used for cross-validation.
1. [1 mark] Read the data in the nectar.csv file into R and create a scatterplot of temp (x-axis) versus
weight (y-axis). Describe the relationship between these two variables.
2. [8 marks] This question concerns with fitting a regression with a cubic spline using B-spline basis,
where weight is the response and temp is the covariate:
(a) Use the bs() function to create B-splines, including the intercept, with knot positions at temp
values of 7.5, 15, 22.5 and 30. [2 marks]
(b) Fit a cubic spline regression with the basis from part (a) and provide the summary output of this
model. [2 marks]
(c) Create another scatterplot as in Question 1 and add the fitted values (as a red line) of the cubic
spline regression from part (b) to this plot. [1 mark]
(d) Use the predict() function to obtain the 95% confidence intervals for the fitted values. [2
marks]
(e) Add the confidence intervals (as blue lines) to the plot in part (c). [1 mark]
3. [10 marks] Fit a regression with a cubic spline using the truncated power series basis, where weight
is the response and temp is the covariate:
(a) Create the truncated power series basis with knot positions at temp values of 7.5, 15, 22.5 and
30. [5 marks]
(b) Fit a cubic spline regression with the basis from part (a) and provide the summary output of this
model. [2 marks]
(c) How do the uncertainty associated with the coefficient estimates in part (b) compare to those
obtained in Question 2 Explain the differences you have observed. [3 marks]
4. [14 marks] Fit a regression with a natural cubic spline, where weight is the response and temp is the
covariate.
3
(a) If we want to fit a natural cubic spline that has the same degrees of freedom as the cubic spline
model in Question 2, how many internal knots are there [3 marks]
(b) Use the ns() function to create the basis for a natural cubic spline, including the intercept, and
set the boundary knots to 7 and 28 degrees C for temp. The number of internal knots is what
you have calculated for part (a), and the internal knot positions divide the interval between the
boundary knot values of temp equally. Also, assume that all the internal knots are not equal to
the boundary knots. [2 marks]
(c) Fit a natural cubic spline regression with the basis from part (b). [2 marks]
(d) Create a scatterplot as in Question 1 and add the fitted values of part (c) (as a red line) to this
plot. [1 mark]
(e) Use the predict function to obtain the 95% confidence interval for fitted values of the natural
cubic spline obtained in part (d). [2 marks]
(f) Add the confidence intervals from part (e) (as blue lines) to the plot in part (d). [1 mark]
(g) Does the scatterplot in part (f) display any problems Explain your answer. [3 marks]
5. [7 marks] Fit smoothing splines using the smooth.spline() function. As above, weight is the response
and temp is the covariate.
(a) Fit a smoothing spline with 5 effective degrees of freedom (df) and print out the model. [1 mark]
(b) Fit a smoothing spline with effective df = 10 and print out the model. [1 mark]
(c) Fit a smoothing spline with effective df = 20 and print out the model. [1 mark]
(d) For each of the models in parts (a), (b) and (c), create a scatterplot as in Question 1 and add
the fitted values (as a red line). [1 marks]
(e) Examine and compare the fitted values from parts (a), (b) and (c), and discuss any concerns you
have observed. [3 marks]
6. [10 marks] Cross-validation (CV) method can be used for choosing the smoothing parameter of the
smoothing spline model, but it can be more generally applied to other methods. In this question, we
use CV to choose the number of internal knots in a natural cubic spline.
Set the boundary knots to temp values of 1 and 34. The candidate number of internal knots are integer
values from 1 to 10 inclusive. The internal knots positions divide the interval between the boundary
knots of temp into equal subintervals. We assume that all the internal knots are not equal to the
boundary knots. The group variable divides the observations into 10 equal subsets, each containing
25 observations.
Write your own R commands to perform 10-fold CV to calculate the overall CV error when fitting a
natural cubic spline with each of the candidate number of internal knots in turn. You must use the
groupings defined by group for the cross-validation. Present the overall CV errors for the 10 candidate
4
number of internal knots, and also plot the overall CV errors (y-axis) against number of internal knots
(x-axis).
IMPORTANT: The R codes required for this question must be reasonably easy to follow by the marker.
5