MATH38141 Regression Analysis – Coursework
This coursework accounts for 30% of your overall mark for this course and it may take around 10
hours to complete. Please present your solution in the form of a report which you should upload
on Blackboard as a single file before the deadline. You can use R to do your calculations, but
you must show in the text the formulae (not R code) that you used in those calculations. Marks
will be awarded for correct and accurate calculations and their interpretation. Interpretations
should be made in the context of the data, not only in terms of generic symbols. Marks will be
harder to obtain if the results are not presented effectively or if any formulae used in the
calculations are missing from the text.
Submit your solution as a single file to Blackboard by 6 pm on November 21, 2022.
1. The taste of matured cheese is related to the concentration of several chemicals in the final
product. In a study of cheddar cheese from the La Trobe Valley of Victoria, Australia,
samples of cheese were analyzed for their chemical composition and were subjected to taste
tests. The data set consists of 30 samples of mature cheddar cheese. Observations were
made on 4 variables:
Taste – subjective taste test score,
Acetic – natural logarithm of concentration of acetic acid,
H2S – natural logarithm of concentration of hydrogen sulfide,
Lactic – concentration of lactic acid.
An EXCEL spreadsheet containing the above data is available on Blackboard.
(a) Draw scatterplots of Taste against each of the other three variables. Describe any
observable trends in your plots.
(b) Formulate a multiple linear regression model for the dataset, using Taste as the
response and the remaining three variables as regressors.
(c) Calculate LSEs and construct 95% confidence intervals for all regression coefficients.
(d) Give interpretations of the estimated coefficients obtained in (c).
(e) Calculate and interpret the R² statistic for the model.
(f) It is argued that when fitting a multiple linear regression model to the data, using
Taste as the response and the other factors as the explanatory variables, the intercept
term β0 should be set to zero. Is this argument reasonable? Why?
(10 marks)
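As a rough illustration of the calculations behind parts (c) and (e), the sketch below computes the least squares estimate from the normal equations, (XᵀX)β̂ = Xᵀy, and R² = 1 − RSS/TSS. It is written in plain Python rather than R, the helper names `solve` and `fit_ols` are hypothetical, and the toy data are made up (they are not the cheese data). The report itself should still state the formulae in the text.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least squares fit with an intercept; returns (beta_hat, rss, r_squared)."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)  # normal equations (X'X) beta = X'y
    fitted = [sum(b * v for b, v in zip(beta, xi)) for xi in X]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / len(y)
    tss = sum((yi - ybar) ** 2 for yi in y)
    return beta, rss, 1.0 - rss / tss


# Made-up toy data generated exactly from y = 1 + 2*x1 - x2 (not the cheese data).
toy_x = [(0.0, 1.0), (1.0, 0.0), (2.0, 3.0), (3.0, 1.0), (4.0, 2.0)]
toy_y = [1 + 2 * a - b for a, b in toy_x]
beta_hat, rss, r2 = fit_ols(toy_x, toy_y)
```

Because the toy response is an exact linear function of the regressors, the fit should recover the generating coefficients with essentially zero residual sum of squares. Confidence intervals additionally need a t quantile, which the Python standard library does not provide.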
It is well accepted that the chemicals ‘H2S’ and ‘lactic acid’ contribute significantly to the
good taste of cheddar cheese. To investigate whether ‘acetic acid’ also affects the taste
of cheddar cheese, two multiple linear regression models were fitted to the ‘taste’ data,
yielding the following results:
           Explanatory variables
Model 1    H2S, lactic
Model 2    acetic, H2S, lactic
(g) Decide which one is the reduced model. Then fill in the following ANOVA table to
compare the nested models.
Source                              s.s.    d.f.    m.s.    F-ratio
Regression fitting reduced model                    –       –
Extra
Residual fitting full model                                 –
Total                                               –       –
(h) Calculate the p-value associated with the significance of ‘acetic acid’. Do you think
‘acetic acid’ should be included in the multiple linear regression model?
(i) Regress ‘taste’ on ‘acetic acid’ alone and test at the 5% level for the significance
of ‘acetic acid’ under this simple linear regression model. Does your conclusion
contradict that given in (h)? Comment.
(6 marks)
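The nested-model comparison in parts (g) and (h) rests on the extra sum of squares F-ratio. The sketch below shows only the arithmetic. The function name `extra_ss_f_ratio` and the residual sums of squares are hypothetical placeholders, not results from the cheese data. The residual degrees of freedom follow from the 30 samples: 30 − 3 = 27 for Model 1 and 30 − 4 = 26 for Model 2.

```python
def extra_ss_f_ratio(rss_reduced, rss_full, df_reduced, df_full):
    """Return F = [(RSS_reduced - RSS_full) / (df_reduced - df_full)] / [RSS_full / df_full].

    df_reduced and df_full are the residual degrees of freedom of each model.
    """
    extra_ss = rss_reduced - rss_full   # extra sum of squares explained by the added term(s)
    extra_df = df_reduced - df_full     # number of added regressors
    return (extra_ss / extra_df) / (rss_full / df_full)


# Hypothetical residual sums of squares, for illustration only.
f_ratio = extra_ss_f_ratio(rss_reduced=120.0, rss_full=100.0, df_reduced=27, df_full=26)
```

Under the null hypothesis that the added coefficient is zero, the ratio is compared with an F distribution on (df_reduced − df_full, df_full) degrees of freedom. The p-value is the upper tail probability, which in R can be obtained with `pf(..., lower.tail = FALSE)`.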
2. A dataset concerns the price per capita of beef annually from 1925 to 1941 together with
other variables relevant to an economic analysis of the price of beef. It contains the
following variables:
YEAR = Year to which the data refer;
PFO = Retail food price index;
DINC = Disposable income per capita index;
CFO = Food consumption per capita index;
RDINC = Index of real disposable income per capita;
RFP = Retail food price index adjusted by the CPI;
PBE = Price of beef (cents/lb).
An EXCEL spreadsheet containing the above data is available on Blackboard.
A multiple linear regression model Ω is proposed to describe the relationship between the
response variable PBE and the other 6 explanatory variables (YEAR, PFO, DINC, CFO,
RDINC, RFP).
An agriculturalist believes, however, that the variation in PBE can be adequately explained
by the variable CFO alone, and hence proposes a simple linear regression model ω for the
data.
(a) Specify the models Ω and ω, and state the model assumptions clearly.
(b) Calculate the residual sums of squares from fitting Ω and ω, respectively.
(c) Explain why in (b) the residual sum of squares of Ω is not larger than that of ω.
(d) Under model Ω, test at the 10% level whether the regression coefficient of DINC is 2,
and state your conclusion.
(e) Suppose that we predict the explanatory variables to take the following values in the
years 2015 and 2018:
Year    PFO      DINC     CFO      RDINC    RFP
2015    200.0    200.0    200.0    220.0    2000.0
2018    190.0    210.0    210.0    210.0    2100.0
Calculate the prediction interval for the change in PBE from 2015 to 2018.
(f) It is suggested that the relationship between PBE and CFO depends on the year in
which the data were collected, i.e. on the variable YEAR. Answer the following
questions.
(1) Propose a suitable model in which model ω is nested, and explain why it is
suitable.
(2) Denote the model proposed in (1) above by Ω₁. Carry out a hypothesis test to
compare ω against Ω₁ and state your conclusion.
(3) Based on the fitted model Ω₁, plot four fitted regression lines on the same diagram
to display the relationships between PBE and CFO in the years 1925, 1930, 1935
and 1940, respectively.
Comment on the changes in the relationship between PBE and CFO across the
period 1925–1940, i.e. the years leading up to the Second World War.
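For part (f)(3), one possible way to let the PBE–CFO line vary with the year is a model with a CFO × YEAR interaction, PBE = β0 + β1·CFO + β2·YEAR + β3·CFO·YEAR + ε. This is an assumption for illustration, not the only admissible choice. Under that assumption, fixing YEAR gives a straight line in CFO with intercept β0 + β2·YEAR and slope β1 + β3·YEAR. The sketch below computes those two numbers for each year. The function name `line_for_year` and the coefficient values are hypothetical, not fitted from the beef data.

```python
def line_for_year(coefs, year):
    """Return (intercept, slope) of the PBE-versus-CFO line at a fixed year.

    coefs = (b0, b_cfo, b_year, b_inter) for the assumed interaction model
    PBE = b0 + b_cfo*CFO + b_year*YEAR + b_inter*CFO*YEAR.
    """
    b0, b_cfo, b_year, b_inter = coefs
    intercept = b0 + b_year * year      # terms that do not involve CFO
    slope = b_cfo + b_inter * year      # coefficient multiplying CFO
    return intercept, slope


# Hypothetical coefficients, for illustration only (not fitted values).
example_coefs = (10.0, 0.5, 0.01, -0.0002)
lines = {year: line_for_year(example_coefs, year) for year in (1925, 1930, 1935, 1940)}
```

Each (intercept, slope) pair defines one of the four lines to draw over the observed range of CFO. If the interaction coefficient were zero, the four lines would be parallel.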