R-STAT2004 2021-Assignment 4

STAT2004 2021 – Assignment 4

Due date: 1 November 2021 at 16:00

The first four exercises are worth a total of 40 marks, contributing 10% to your final grade. Exercise 5 is a bonus question. A complete and correct solution of this exercise earns you an extra 1% for this assignment.

Reminder: while discussion of the Assignment questions (amongst yourselves, with lecturers and/or tutors) is encouraged, the final write-up must be your own. If you cannot express a solution in your own words, then you must cite your source(s).

Exercise 1 (Comparing ratings across groups) (8 marks)

A recent poll asked social media users to provide their opinions on a decision by a popular photo-sharing app to remove the number of “likes” from their posts. Each respondent was asked to express their opinion on the following five-point scale:

1 = Strongly disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly agree

Of the n = 198 respondents, 98 were “influencers” (with over 10,000 followers each) while the other 100 were regular users. The full dataset can be downloaded as a .csv file from Blackboard > Assessment > Assignment 4 > likes.csv.

(a) (1 mark) Summarise the data using an appropriate set of summary statistics.

(b) (3 marks) Do the two types of users exhibit differing opinions regarding the recent changes to the photo-sharing app? Answer this question by carrying out an appropriate hypothesis test. Clearly state the null and alternative hypotheses, propose a test statistic, compute and interpret a p-value, and write your conclusions in a way that is understandable to a social scientist.

(c) (2 marks) [Audio question:] State and critically assess any assumption(s) you made in answering (b).

(d) (2 marks) A social scientist suggests comparing the two groups using a two-sample t-test applied directly to the five-point responses. Explain to her why this is inappropriate here.
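As an illustration only, one rank-based way to compare two groups of ordinal ratings is the Wilcoxon rank-sum (Mann-Whitney) test. The Python sketch below works from counts per rating category and uses the large-sample normal approximation with a tie correction. The counts and the function name `rank_sum_from_counts` are hypothetical and are not taken from likes.csv; other tests may be equally appropriate for part (b).

```python
import math

def rank_sum_from_counts(counts_a, counts_b):
    """Wilcoxon rank-sum (Mann-Whitney) test from per-category counts.

    counts_a[c], counts_b[c]: number of responses in ordered category c.
    Returns (U, z, p): the U statistic for group A, its z-score, and a
    two-sided p-value from the normal approximation with a tie correction
    (no continuity correction). Assumes at least two non-empty categories.
    """
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    rank_sum_a = 0.0
    tie_term = 0.0
    seen = 0  # number of observations in lower categories
    for a, b in zip(counts_a, counts_b):
        t = a + b                       # size of this tied block
        midrank = seen + (t + 1) / 2    # average rank shared by the block
        rank_sum_a += a * midrank
        tie_term += t**3 - t
        seen += t
    u = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    var_u = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return u, z, p

# Hypothetical counts for ratings 1..5 (made up for illustration, NOT the real data)
group_a = [1, 2, 3, 2, 2]
group_b = [4, 4, 2, 1, 1]
u, z, p = rank_sum_from_counts(group_a, group_b)
print(f"U = {u:.1f}, z = {z:.3f}, two-sided p = {p:.3f}")
```

Working from category counts rather than raw responses keeps the heavy ties of a five-point scale explicit: every response in a category receives the same midrank, and the tie term shrinks the variance accordingly.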
Exercise 2 (Tuberculosis and blood type) (14 marks)

Overfield and Klauber (1980) published the following data on the incidence of tuberculosis in relation to ABO blood groups in a sample of Eskimos. Rows give tuberculosis severity and columns give blood type:

| Tuberculosis severity | O | A | AB | B |
|---|---|---|---|---|
| Moderate or advanced | 7 | 5 | 3 | 13 |
| Minimal | 27 | 32 | 8 | 18 |
| Not present | 55 | 50 | 7 | 24 |

We want to investigate whether tuberculosis incidence is related to blood type. Let $p_{ij}$ denote the underlying proportion of the population with tuberculosis severity $i \in \{\text{moderate/advanced}, \text{minimal}, \text{not present}\}$ and blood type $j \in \{\text{O}, \text{A}, \text{AB}, \text{B}\}$. For convenience, write $p = (p_{ij})$ for the $3 \times 4$ vector of proportions.

(a) (1 mark) Write down the null and alternative hypotheses in words.

(b) (1 mark) Write down the likelihood function for $p$ given the observed counts $x$.

Under the null hypothesis, $p_{ij} = p_{i\cdot} \times p_{\cdot j}$ for each $i$ and $j$, where $p_{i\cdot}$ is the overall proportion with tuberculosis severity level $i$ and $p_{\cdot j}$ is the overall proportion with blood type $j$.

(c) (3 marks) Show that under the null model the MLEs of each $p_{i\cdot}$ and $p_{\cdot j}$ are given, respectively, by

$$\hat{p}_{i\cdot} = x_{i\cdot}/n \quad \text{and} \quad \hat{p}_{\cdot j} = x_{\cdot j}/n,$$

where $x_{i\cdot}$ is the observed number of cases of tuberculosis severity $i$, $x_{\cdot j}$ is the observed number of cases of blood type $j$, and $n$ is the total sample size.

(d) (1 mark) Using the results from part (c), or otherwise, what counts would we expect to see in each cell of the table if the null hypothesis is indeed true?

Under the alternative hypothesis, there are no restrictions on the cell proportions $p_{ij}$ (except that they must all sum to 1).

(e) (1 mark) State the MLEs $\hat{p}_{ij}$ of each cell proportion $p_{ij}$ under the alternative. (You do not have to prove that these are the MLEs.)

(f) (2 marks) Using your results from parts (b), (c) and (e), or otherwise, numerically evaluate the generalized likelihood ratio test statistic,

$$\Lambda = \frac{\sup_{H_0} L(p \mid x)}{\sup_{H_1} L(p \mid x)},$$

for testing the association between tuberculosis and blood type based on the observed counts in the table above.
Also, numerically compute the transformation $-2 \log \Lambda$.

(g) (1 mark) Using your results from part (d), or otherwise, compute Pearson’s $\chi^2$ statistic,

$$\sum_{\text{cells } i,j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}.$$

Is Pearson’s $\chi^2$ statistic numerically close to the $-2 \log \Lambda$ statistic from part (f)?

(h) (2 marks) Carry out the hypothesis test by computing and interpreting a p-value, and state your conclusion in a way that is understandable to a population health scientist.

Notice that one of the cells in the table contains only 3 counts. This may render the asymptotic $\chi^2$ distribution inaccurate for part (h).

(i) (2 marks) Run an alternative analysis of the dataset to test the association between tuberculosis incidence and blood type. Does your conclusion from part (h) change?

Exercise 3 (Analysis of character agglomerations) (8 marks)

There is sometimes more than one way to correctly solve a given maths problem, but most definitely many more ways to incorrectly solve it. Can a textual/symbolic analysis of solutions help characterise the qualitative differences between correct and incorrect solutions?

To investigate this, a sample of 25 incorrect and 15 correct solutions to Assignment 3 Bonus Question 5 was taken from last year’s STAT2004 class. “Incorrect” solutions were defined as attempts which scored at most 2 marks out of 4, whilst “correct” solutions were attempts that scored at least 3 marks out of 4. The frequencies with which the following seven mathematical terms were used are tabulated below:

| Maths term | Incorrect solutions | Correct solutions |
|---|---|---|
| μ | 298 | 165 |
| X | 146 | 264 |
| S or S² | 119 | 16 |
| ≤ or ≥ or < or > | 105 | 328 |
| 1.96 or −1.96 | 29 | 203 |
| √n | 134 | 165 |
| ∞ | 29 | 30 |
| Total | 860 | 1171 |

(a) (3 marks) Are the relative frequencies of usage of these mathematical terms different between the correct and incorrect solutions? Carry out an appropriate hypothesis test to assess this.
Clearly define the parameter(s) of interest, state the null and alternative hypotheses, compute and interpret a p-value, and write a short conclusion.

Of the sample of 25 incorrect solutions, 8 attempts scored 0 marks out of 4. The frequencies with which the seven mathematical terms were used in each of these 8 attempts are tabulated below (columns are the 8 incorrect solutions):

| Maths term | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| μ | 12 | 11 | 12 | 12 | 13 | 11 | 13 | 12 |
| X | 6 | 6 | 7 | 6 | 5 | 8 | 7 | 5 |
| S or S² | 5 | 5 | 5 | 4 | 5 | 3 | 4 | 4 |
| ≤ or ≥ or < or > | 4 | 4 | 6 | 2 | 4 | 2 | 6 | 6 |
| 1.96 or −1.96 | 2 | 0 | 0 | 0 | 2 | 0 | 2 | 2 |
| √n | 5 | 6 | 4 | 5 | 4 | 4 | 6 | 5 |
| ∞ | 2 | 0 | 2 | 2 | 0 | 0 | 0 | 2 |

(b) (3 marks) Are the relative frequencies of usage of these mathematical terms different across the 8 incorrect solutions? Carry out an appropriate hypothesis test to assess this. Clearly define the parameter(s) of interest, state the null and alternative hypotheses, compute and interpret a p-value, and write a short conclusion.

(c) (1 mark) [Audio question:] Is the p-value from part (b) suspicious to you? If so, what does it suggest?

(d) (1 mark) What additional analysis should be performed before we can confirm our suspicions (if any) from part (c)?

Exercise 4 (Weight gain in pigs) (10 marks)

A trial was conducted in Iowa, USA, examining the effects of vitamin B12 dietary supplements and antibiotics on weight gain in pigs. Twelve adult pigs were randomly divided into four groups (one using standard pig chow, one using pig chow with added vitamin B12, one using pig chow with added antibiotics, and one using pig chow with both added vitamin B12 and antibiotics). After one week of feeding, the pigs were weighed and their weight gain (in grams) was recorded.
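The data are plotted below:

[Figure: weight gain (kg, axis from 0 to 500) against Vitamin B12 (No/Yes), with points labelled A or P. Caption: Weight gain in pigs, by Vitamin B12 level and [P]resence or [A]bsence of Antibiotics]

We can model the weight gains $\{Y_{jki}\}$ using a two-way ANOVA with interactions:

$$Y_{jki} = \mu + \alpha_j + \beta_k + \delta_{jk} + \epsilon_{jki},$$

where $j = 1, 2$ denotes the level of factor A (antibiotics), $k = 1, 2$ denotes the level of factor B (vitamin B12), and $i = 1, 2, 3$ indexes the observations in each group. Assume that the errors $\epsilon_{jki} \overset{\text{iid}}{\sim} N(0, \sigma^2)$ across all $j$, $k$ and $i$. The common variance $\sigma^2$ is taken to be unknown.

If we parametrize this model using the sum constraints,

$$\sum_j \alpha_j = 0, \quad \sum_k \beta_k = 0 \quad \text{and} \quad \sum_j \delta_{jk} = \sum_k \delta_{jk} = 0,$$

then $\mu$ can be interpreted as the overall mean, each $\alpha_j$ is the difference between the mean of level $j$ of factor A and the overall mean, each $\beta_k$ is the difference between the mean of level $k$ of factor B and the overall mean, and each interaction $\delta_{jk}$ is the difference $\mu_{jk} - \mu - \alpha_j - \beta_k$ between the mean $\mu_{jk}$ of group $(j, k)$ and the value given by the additive model $\mu + \alpha_j + \beta_k$.

(a) (3 marks) Show that under the sum constraints the MLE of each parameter is given by

$$\hat{\mu} = \bar{Y}_{\cdot\cdot\cdot},$$
$$\hat{\alpha}_j = \bar{Y}_{j\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \quad j = 1, 2,$$
$$\hat{\beta}_k = \bar{Y}_{\cdot k\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \quad k = 1, 2,$$
$$\hat{\delta}_{jk} = \bar{Y}_{jk\cdot} - \bar{Y}_{j\cdot\cdot} - \bar{Y}_{\cdot k\cdot} + \bar{Y}_{\cdot\cdot\cdot}, \quad j = 1, 2, \; k = 1, 2.$$

Here a bar with a dot in place of an index denotes the average over that index.

(b) (2 marks) Show that the following sum-of-squares decomposition holds:

$$SS_{\text{Total}} = SS_A + SS_B + SS_{AB} + SS_{\text{residual}},$$

where

$SS_{\text{Total}} = \sum_{jki} (Y_{jki} - \bar{Y}_{\cdot\cdot\cdot})^2$ is the overall sum-of-squares ignoring groups,
$SS_A = \sum_{jki} (\bar{Y}_{j\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2$ is the sum-of-squares between levels of factor A,
$SS_B = \sum_{jki} (\bar{Y}_{\cdot k\cdot} - \bar{Y}_{\cdot\cdot\cdot})^2$ is the sum-of-squares between levels of factor B,
$SS_{AB} = \sum_{jki} (\bar{Y}_{jk\cdot} - \bar{Y}_{j\cdot\cdot} - \bar{Y}_{\cdot k\cdot} + \bar{Y}_{\cdot\cdot\cdot})^2$ is the interaction sum-of-squares,
$SS_{\text{residual}} = \sum_{jki} (Y_{jki} - \bar{Y}_{jk\cdot})^2$ is the residual sum-of-squares within groups.

Hint: Start with the following identity:

$$Y_{jki} - \bar{Y}_{\cdot\cdot\cdot} = (Y_{jki} - \bar{Y}_{jk\cdot}) + (\bar{Y}_{j\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}) + (\bar{Y}_{\cdot k\cdot} - \bar{Y}_{\cdot\cdot\cdot}) + (\bar{Y}_{jk\cdot} - \bar{Y}_{j\cdot\cdot} - \bar{Y}_{\cdot k\cdot} + \bar{Y}_{\cdot\cdot\cdot}).$$

(c) (1 mark) Briefly explain why the residual sum-of-squares has distribution given by

$$\frac{SS_{\text{residual}}}{\sigma^2} \sim \chi^2_{df_{\text{residual}}},$$

where $df_{\text{residual}} = JK(r - 1) = 8$.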
[Here, $J = 2$ is the number of levels of factor A, $K = 2$ is the number of levels of factor B, and $r = 3$ is the number of replications in each group.]

(d) (1 mark) Briefly argue why the residual sum-of-squares $SS_{\text{residual}}$ is independent of the interaction sum-of-squares $SS_{AB}$.

Using similar calculations to part (c), it can also be shown that under the null hypothesis $H_0$: all interactions $\delta_{jk} = 0$, the interaction sum-of-squares has distribution given by

$$\frac{SS_{AB}}{\sigma^2} \sim \chi^2_{df_{AB}},$$

where $df_{AB} = (J - 1)(K - 1) = 1$.

(e) (1 mark) Using parts (c), (d) and the above result, or otherwise, argue why the null distribution of the so-called F-ratio,

$$F = \frac{MS_{AB}}{MS_{\text{residual}}} := \frac{SS_{AB}/df_{AB}}{SS_{\text{residual}}/df_{\text{residual}}},$$

is an F distribution with numerator degrees-of-freedom $df_{AB}$ and denominator degrees-of-freedom $df_{\text{residual}}$.

A partially-complete two-way ANOVA table for the pigs weight dataset is given below:

| Source | df | SS | MS | F | P |
|---|---|---|---|---|---|
| VitaminB12 | 1 | 218700 | 218700 | 60.33 | < 0.0001 |
| Antibiotics | 1 | 19200 | 19200 | 5.30 | ≈ 0.05 |
| VitaminB12:Antibiotics | 1 | 172800 | 172800 | 47.67 | < 0.0005 |
| Residuals | 8 | 29000 | 3625 | | |
| Total | 11 | 439700 | | | |

(f) (2 marks) Using your results from parts (b) and (e), or otherwise, complete the above ANOVA table. Hence, summarise the main finding(s) of this experiment and write a short conclusion.

Exercise 5 (Bonus question) (4 marks)

Let $\{A_k, k = 1, 2, \ldots, K\}$ be a collection of $K \ge 1$ arbitrary events in some sample space $\Omega$. In particular, the events $A_k$ are not assumed to be mutually exclusive nor mutually independent.

(a) (3 marks) Prove the following version of the Bonferroni inequality:

$$P\left(\bigcap_{k=1}^{K} A_k^c\right) \ge 1 - \sum_{k=1}^{K} P(A_k).$$

(b) (1 mark) Using the result from (a), or otherwise, show that if each of $K \ge 1$ hypothesis tests is carried out at level $\alpha/K$, then the overall chance of a Type I error across all $K$ tests is at most $\alpha$.
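The inequality in part (a) and its use in part (b) can be checked numerically on a small example. The Python sketch below assumes a uniform probability on a finite sample space; the sample space, the events and the function names are made up for illustration, and a numerical check is no substitute for the proof that is asked for.

```python
from fractions import Fraction

def bonferroni_check(omega, events):
    """Return (lhs, rhs) of P(no A_k occurs) >= 1 - sum_k P(A_k),
    computed exactly under a uniform probability on the finite set `omega`.
    Each event must be a subset of `omega`."""
    n = len(omega)
    union = set().union(*events)
    lhs = Fraction(len(omega - union), n)               # P(intersection of complements)
    rhs = 1 - sum(Fraction(len(a), n) for a in events)  # Bonferroni lower bound
    return lhs, rhs

def familywise_bound(alpha, K):
    """Compare two quantities when each of K tests is run at level alpha/K:
    the exact family-wise error rate IF the tests were independent (an
    assumption made only for this comparison), and the Bonferroni bound,
    which needs no such assumption."""
    per_test = alpha / K
    independent_fwer = 1 - (1 - per_test) ** K
    union_bound = K * per_test
    return independent_fwer, union_bound

# Hypothetical example: overlapping events, so neither disjoint nor independent
omega = set(range(20))
events = [{0, 1, 2}, {2, 3}, {3, 4, 5, 6}]
lhs, rhs = bonferroni_check(omega, events)
print(f"P(no event occurs) = {lhs}, lower bound = {rhs}, holds: {lhs >= rhs}")

fwer, bound = familywise_bound(alpha=0.05, K=10)
print(f"FWER under independence = {fwer:.5f}, Bonferroni bound = {bound:.5f}")
```

Exact fractions are used in the first function so that the comparison is not affected by floating-point rounding.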