SOST30062 Data Science Modelling – Final report
Semester 2, 2021/22

Administrative arrangements

The final report contributes 60% of your total mark for SOST30062 Data Science Modelling.

Please submit your work using the link to the TurnItIn anti-plagiarism service on the course Blackboard. The submission link will be placed on the “Assessment / Assignments, tasks” page of Blackboard, and you will be notified of its exact location before submission opens. You can also find detailed instructions there on how to upload your assignment to TurnItIn.

The report is due by 2pm (UK time) on Tuesday, 20th of May, 2022. Your report should not be longer than 2,000 words. (Figure/table captions, bibliography, and appendices do NOT count towards the word limit.) You are advised to keep a copy of the work you hand in.

Your attention is drawn to the sections regarding late submissions, mitigating circumstances and plagiarism in the Course Outline (available on the “Essential Information / Course outline” page in Blackboard).

If you have any questions concerning this assignment, you should email András at: andras.voros@manchester.ac.uk.

Description of the task

We provide you with a large social science dataset, which you will need to analyse in your report. You can download the dataset and additional description from the “Assessment / Assignments, tasks” page of the course Blackboard. Using the dataset, you are asked to:

1. select one variable, Y, to be explained by the other variables, the Xs, in the dataset;
2. motivate and formulate a research question about Y and the Xs (we present many examples of possible research questions throughout the course);
3. apply an unsupervised learning technique (lecture week 9) to explore the dataset, such as PCA or clustering techniques (Method I – exploration);
4. choose a supervised learning technique (lecture weeks 3-5 and 8) that is appropriate for Y and answers your research question, such as regression models, splines, LDA, or tree-based methods (Method II – inference);
5. apply the method selected in step 4 within a suitable advanced analytic approach (lecture weeks 6-7), such as subset selection, ridge regression, cross-validation, bootstrap, or LASSO (Method III – the “twist”);
6. write a report on the above steps, producing no more than one descriptive figure or table and one inferential figure or table that sum up your results.

Overview of the task

Structure of the report

We suggest the following section structure for your report (you may choose to structure your report differently):

1. Introduction: briefly describe the empirical context and present the dataset
2. Research question: motivate and formulate your research question
3. Methods: briefly present your method choices (Methods I-III) and the steps of your analysis
4. Results: present and interpret your results, with two key tables/figures
5. Conclusion: briefly discuss what the results imply for your research question, and discuss one or two key limitations of your analysis
6. Appendices: add any important additional figures/tables, and include your R code (so that your analyses can be reproduced)

Marking criteria

20% – Data presentation and research question (sections 1-2 from the above structure)
   Do you present the context of the data (e.g. topic) and the dataset (e.g. types of variables, number of observations you use) clearly?
   Do you formulate your research question clearly?
   Do you explain why you think the research question is interesting and relevant?
   Can you answer your research question using the available variables?

20% – Method choices (section 3)
   Are the chosen methods appropriate for your Y and X variables?
   Do you motivate their use (explain why they are appropriate for Y and the Xs)?
   Do you explain the steps of your analysis clearly?
   Can you answer your research question with your planned analysis?

20% – Data analysis (section 4, appendix R code)
   Is your application of the methods to the data technically correct (e.g. are the variables transformed if needed and used in the correct role in the analysis)?
   Do you use the R packages and functions most appropriate for your analysis?
   Do you provide reproducible R code in the appendix?

20% – Interpretation of results (sections 4-5)
   Are the interpretations of your results correct and clearly explained?
   Do you discuss what the results imply for your research question?
   Is your conclusion about your question correct in light of the results?
   Do you discuss one or two important limitations of your approach in answering the research question?

20% – Visual presentation of results (section 4)
   Are the tables and figures readable and clear (e.g. are they clearly annotated, without overlapping labels)?
   Are they appropriate representations of your findings?
   Do you interpret them correctly in the text?

Additional notes on marking

We will value original and well-motivated research questions. We will also reward thoughtful ideas about the limitations of your analysis (and how one could possibly overcome them). Though your R code will not be assessed, we appreciate it if you provide clear and understandable code in the appendix. As the marking weight of your two main figures/tables is quite high (20%), we will evaluate these critically.
We will provide you with good examples and further guidelines for the different elements of the report throughout the semester (for example, about writing clean R code and making readable figures).