MAST90044 Thinking and Reasoning with Data
Semester 1 2022
Assignment 2
Due: 17:00, Wednesday 27 April

Student name:______________________________
Student number:____________________________

Please label your assignment with the following information in the appropriate spots at the top of this document:
  o your name
  o your student number

This assignment is worth 15% of the marks in this subject, and covers the work done up to Week 7 (with a focus on Lab Chapters 4–6). The total number of marks for this assignment is 65.

Your assignment should show all working and reasoning, as marks will be given for method as well as for correct answers. Please spellcheck your document.

Each question is followed by an empty box for the answer. Please answer each question in the dedicated box. If you need more space for a question, you can add this at the end of this document BUT clearly state in the box that the answer to the question continues on the additional pages.

Please DO NOT resize/move the boxes, or add additional pages (except for right at the end). The document needs to be in the correct format (boxes on the right pages) for ease of marking.

Paste any R code and output into the boxes along with your answers. Graphics from R can be resized within your document; make them smaller (but still legible) as necessary to ensure they are in the box.

Tutors will not help you directly with assignment questions. However, they may give you some help with R if you ask, e.g. "What does the hist() function do?"

Please note that we may mark only a subset of questions.

Any extensions need to be approved by Julia. Please email both Julia and Tina if you need an extension.

Late assignments are penalised with a 20% reduction per day. Any assignment submitted more than 3 days (72 hours) after the due date without an extension will receive a score of 0.

Assignments are to be saved as a PDF once complete and submitted (uploaded) via GradeScope.
You can resubmit your assignment at any time up to the deadline. We highly recommend that you upload a draft version of your assignment well in advance of the due date/time, as ‘technical issues’ or ‘failure to upload properly’ will not be accepted as a valid excuse for not submitting on time and you will be penalised. Please note that only your final (most recent) submission will be marked.

Question 1 [5+3+4+5+4+5+4+5]

The dataset unescoSample.csv (available on Canvas) contains economic and demographic information from the 1990 UNESCO yearbook on a sample of the world’s countries. Definitions of the variables in the dataset are as follows:

  Birth rate per 1,000 of population
  Death rate per 1,000 of population
  Infant deaths per 1,000 of population
  Life expectancy at birth for males (years)
  Life expectancy at birth for females (years)
  Gross National Product (GNP) per capita
  Geopolitical group:
    1. Eastern Europe
    2. South America and Mexico
    3. Western Europe, North America, Japan
    4. Middle East
    5. Asia
    6. Africa

During the following analysis, each country may be treated as a single, equally weighted observation (despite some countries having larger populations than others).

(a) Use an appropriate graphical tool to explore the relationship between life expectancy at birth for females and geopolitical group. Using the graph, compare female life expectancy across groups and comment on anything else interesting that you see.

(b) Calculate point estimates of average female life expectancy for each geopolitical group. Do this by using the tapply function, including the argument na.rm=TRUE to exclude missing data from the calculation. Find a 95% confidence interval for the mean female life expectancy in the Middle East.

(c) Test whether the mean female life expectancy for Group 2 is significantly higher than 65 years. Use α = 5%. Be sure to state your hypotheses, the p-value (or critical value) and your conclusion in the context of the problem.
Would your conclusion change if α = 1%?

(d) Test whether there is a difference in mean GNP between geopolitical groups 1 and 5 by using the t.test function with var.equal=TRUE. State your hypotheses, the p-value (or critical value) and your conclusion, for α = 5%. What assumptions have been made while doing this test and do you think they are reasonable? Why/why not?

(e) Repeat the test in part (d), this time setting var.equal=FALSE. Compare the 95% CIs generated under the two methods.

(f) Fit a regression model with GNP as a predictor of male life expectancy, for Africa only. Plot male life expectancy against GNP and superimpose the regression line for your model. Identify any unusual data points. Discuss whether or not you believe these points should be removed when fitting the model and why/why not.

(g) You discover that Tina made a data entry error when uploading the dataset. Remove one datapoint that Tina accidentally added. Refit the model from (f) without this datapoint. State the equation of the new fitted model and give estimates for all relevant parameters. Give a measure of how much variability in male life expectancy is explained by variability in GNP.

(h) Present two relevant diagnostic plots for the model fitted in (g). List the model assumptions and comment on whether they hold or not in this case, with reference to your diagnostic plots.

Question 2 [4+4+2+3]

The ‘Black Summer’ Australian bushfires in 2019–2020 burned over 24 million hectares of land, directly caused 33 deaths and blanketed parts of Australia in smoke for weeks. The table below gives a sample of AQI (air quality index) readings for Canberra city centre during December and January, 2018–2019 and 2019–2020, measured at the same date and time across the two years. The higher the reading, the more particles are in the air, with an AQI over 300 rated as ‘hazardous’ to human health.

Date and time   2018–19 AQI   2019–20 AQI
2/12 1:00                20            89
11/12 6:00               24           170
26/12 18:00              42           286
29/12 9:00               58           320
6/1 4:00                 62          2333
9/1 1:00                 34           414
16/1 6:00                67            77
22/1 0:00                43            33
22/1 3:00                35            23
24/1 13:00               45           192

Source: data.act.gov.au/Environment/Air-Quality-Monitoring-Data/94a5-zqnn

Note: air quality is a random variable that can change at different times of the day and year, depending on the weather and pollution levels.

(a) Use a t-test to test whether the average air quality index during December–January was significantly higher in 2019–20, compared with 2018–19, using α = 10%. Be sure to state your hypotheses, p-value and conclusion in the context of the problem.

(b) Use the sign test to test whether the median air quality index during December–January was significantly higher in 2019–20, compared with 2018–19, using α = 10%. Do this step-by-step (i.e. do not use the binom.test or sign.test automated functions in R). Be sure to show all working, state your p-value and conclusion.

(c) Out of the tests used in (a) and (b), which do you believe is more appropriate in this scenario and why? For your preferred test, is it possible that you could have made a Type I or Type II error in this case?

(d) Pretend that it is summer 2022–2023 and that you are the Chief Minister of Canberra. There is another (similarly sized) bushfire nearby but this time, all your air quality measuring devices are malfunctioning. You decide to make an announcement about whether or not the average air quality is worse, based on the decision of your past hypothesis testing. For this scenario, what are the definitions and consequences of making a Type I error or Type II error? Would it be worse to make a Type I error, or a Type II error, in this case?

Question 3 [4+3+5+5]

Let us return to the face recognition example from Assignment 1. On completion of both parts of the test (the ‘face memory test’ and the ‘face sorting test’), you will receive an overall percentage score of how well you did.
We collected the following 20 scores from 90044 students:

(66%, 62%, 61%, 62%, 57%, 53%, 68%, 57%, 73%, 58%, 57%, 62%, 57%, 69%, 65%, 53%, 59%, 58%, 64%, 69%)

According to the UNSW research team, participant scores are distributed as follows:

Score                             ≤60%   [61%,64%]   [65%,68%]   [69%,71%]   ≥72%
Percentage of people with score   50%    25%         15%         5%          5%

(a) Use a chi-squared test to determine whether the scores for the 90044 students are distributed significantly differently to the UNSW distribution. Be sure to state your hypotheses, the value of the test statistic and critical cut-off point (or alternatively, p-value) and conclusion. Use the bins/categories given in the table above and α = 5%.

(b) What are the assumptions of this chi-squared test? Have we violated any of these assumptions while applying the test in part (a)? If yes, which one/s? Justify your answer.

(c) Repeat the test in part (a), but this time combining bins if required, to improve the validity of the test. This time, complete the test without the help of the chisq.test function in R.

(d) Jacob states that a student studying 90044 has a greater than 40% chance of getting a score of 60% or above in the face recognition test. State the hypotheses and calculate both an exact and approximate p-value for testing Jacob’s theory. What conclusion would you make at the α = 10% level?

Extra working space
Question:

Extra working space
Question: