Published 10th January 2022

Coursework Assessment Brief

Module code/name: MSIN0025 Data Analytics II
Module leader name: Xingyi Li
Academic year: 2021/22
Term: 2
Assessment title: Individual Report of DA II
Individual/group assessment: Individual

Return and status of marked assessments: Within 4 weeks from the date of submission as per UCL guidelines. The module team will update you if there are delays through unforeseen circumstances (e.g. ill health). All results when first published are provisional until confirmed by the Examination Board.

Copyright Note to students: Copyright of this assessment brief is with UCL and the module leader(s) named above. If this brief draws upon work by third parties (e.g. Case Study publishers) such third parties also hold copyright. It must not be copied, reproduced, transferred, distributed, leased, licensed or shared with any other individual(s) and/or organisations, including web-based organisations, without permission of the copyright holder(s) at any point in time.

Academic Misconduct: Academic Misconduct is defined as any action or attempted action that may result in a student obtaining an unfair academic advantage. Academic misconduct includes plagiarism, obtaining help from or sharing work with others, be they individuals and/or organisations, or any other form of cheating. Refer to Academic Manual Section 9: Student Academic Misconduct Procedure – 9.2 Definitions.

Referencing: You must reference and provide a full citation for ALL sources used, including articles, textbooks, lecture slides and module materials. This includes any direct quotes and paraphrased text. If in doubt, reference it. If you need further guidance on referencing, please see UCL’s referencing tutorial for students here: https://library-guides.ucl.ac.uk/referencing-plagiarism/welcome. Failure to cite references correctly may result in your work being referred to the Academic Misconduct Panel.
Content of this assessment brief

Section   Content
A         Core information
B         Coursework brief and requirements
C         Module learning outcomes covered in this assessment
D         Groupwork instructions (if applicable)
E         How your work is assessed
F         Additional information

Section A: Core information

Submission date: 03/05/2022
Submission time: 1:00pm UK time
Assessment is marked out of: 100 marks
% weighting of this assessment within total module mark: 50%
Maximum word count/page length/duration: 2000 words (EXCLUDING the code appendix).
Footnotes, appendices, tables, figures, diagrams, charts included in/excluded from word count/page length: Appendices are excluded from the word count. Footnotes, and substantial prose in tables, figures, diagrams, and charts, are included in the word count.
Bibliographies, reference lists included in/excluded from word count/page length: The bibliography is excluded from the word count.
Penalty for exceeding word count/page length: Standard UCL penalties for exceeding the word count apply (deduction of 10 percentage points, capped at 40% for Levels 4, 5, 6, and 50% for Level 7). Refer to Academic Manual Section 3: Module Assessment – 3.13 Word Counts.
Penalty for late submission: Standard UCL penalties apply. Students should refer to https://www.ucl.ac.uk/academic-manual/chapters/chapter-4-assessment-framework-taught-programmes/section-3-module-assessment#3.12
Submitting your assessment: The assignment MUST be submitted to the module submission link located within this module’s Moodle ‘Submissions’ tab by the specified deadline.
Anonymity of identity: Normally, all submissions are anonymous unless the nature of the submission is such that anonymity is not appropriate, for example in presentations or where minutes of group meetings are required as part of a group work submission. The nature of this assessment is such that anonymity is not required.
Section B: Assessment Brief and Requirements

For this final assignment, you will need to identify an important business problem, find one or more relevant datasets, generate insightful visualizations of the data, fit a range of models to the data to produce your best predictions/forecasts, and make and justify recommendations to a decision maker related to this problem. A key goal for this final individual assignment is to demonstrate a wide range of the concepts covered in the module. This assignment is worth 50% of the overall module assessment.

Report Structure

Section 1: The Problem (10%)
– Discuss the problem you are addressing. What are the questions and business/management decisions your analysis is trying to address?
– Describe your problem’s decision maker and what is important for them to know from your data analysis.
– Discuss the source of your data. Questions to consider include:
  – Where did you find this data?
  – How reliable or uncertain is this data?
  – How old is the data?
  – Is the data recorded at given dates or times?
– Identify and justify your choice of target attribute(s) and explain how this/these should be derived, if not already available.

Section 2: Understand the Data (30%)
– Discuss the nature and size of the dataset(s) you are using.
– Discuss the data attributes that are relevant to your problem. Exactly what does the data represent and, if relevant, how was it derived? How is it distributed? What type of data is it?
– Explore and discuss whether any of the data attributes you have focused on are closely correlated with other attributes, either positively or negatively.
– Include at least 3 different types of Tableau visualisations (e.g. map, scatter plot, bar chart, pie chart, box-and-whisker plot) to support your discussions.
– Include at least 3 R-generated plots or aggregation tables to support your discussions.
– Include the R-code you used in the code appendix.
Section 3: Prepare the Data (10%)
– If required, explain how you have derived your chosen target attribute(s) in Tableau and in R.
– Discuss and justify what other steps you may have taken to prepare your data, including, where relevant: removing attributes from consideration, adding further “derived” attributes (e.g. dates), imputing “reasonable” values for missing data, and standardizing data values.
– Prepare suitable separate “Training”, “Validation” (if required) and “Testing” subsets of the dataset.
– Include any R-code you used to prepare your data in the code appendix.

Section 4: Generate and Test Prediction Models (40%)
– Select and justify at least 3 different prediction model types that are likely to best help with your stated problem objectives.
– Configure your models (e.g. select attributes and/or other model parameters) in the way that you expect will best deliver relevant insights and/or provide the lowest error rates, justifying your decisions.
– Run these models, discussing the model outputs and drawing, where possible, insights related to your problem.
– Prepare and discuss at least 1 ensemble model, combining two or more of your prediction models.
– Select a proper scoring rule to measure the accuracy of your models.
– Determine and comment on the best generalised error rate across your 4 prediction models and your ensemble models.
– Discuss what steps you may have taken to improve your individual models.
– Include any R-code you used to prepare your data in the code appendix.

Section 5: Problem Conclusions and Recommendations (10%)
– Combining the results from your various analysis steps, draw conclusions about the particular problem and questions stated at the beginning.
– What recommendations would you now make to your problem’s decision maker and why?
– Which are the most important variables for the decision maker to look at?

Marking Criteria

Marks will be awarded for:
– Using Tableau and R in a way that is relevant and appropriately justified, and that is ideally different from that presented in the lectures and other module materials.
– Discussing meaningful insights after each analysis task. Your analysis should flow, with each step building on the last.
– Structuring your report and analysis so as to follow the standard stages of a data science project.
– The correctness and quality of your code, visualisations and conclusions.
– Employing a wide range of the concepts and methods covered in this module.
– Problem identification: you have found a novel and significant problem.
– Solution/recommendation: you have proposed a compelling solution/recommendation and generated important business or policy insights.
– Writing: your report is well written, clear and compelling.

Submission Requirement

You are required to submit 3 files for this assignment:
1. A PDF file containing your fully completed assignment, including an appendix containing all your R-based analysis.
2. A runnable code file containing all your R-based analysis. This file can be submitted either as an R script file (.R file) or as an R-based Jupyter Notebook file (.ipynb file).
3. A data file, if it is not too large to upload to Moodle. If it is too large, please submit to this part of the assignment drop box a PDF document providing links to the original datasets.

Only the first PDF file will be marked. The additional code file and data file are only provided to ensure your code works as you have claimed it should.
Section C: Module Learning Outcomes covered in this Assessment

This assessment contributes towards the achievement of the following stated module Learning Outcomes as highlighted below:

During the module, students will work with example data sets to experience and understand the stages of the data science process: they will visualise data, propose models that might fit the data, choose a best-fit model, use that model to make predictions, and test those predictions against new realisations.

The module builds on ideas and tools introduced in MSIN0010 Data Analytics I and MSIN0023 Computational Thinking, including R and Tableau, statistical software used by the world’s leading data scientists.

Section D: Groupwork Instructions (where relevant/appropriate)

N/A

Section E: How your work is assessed

Within each section of this assessment you may be assessed on the following aspects, as applicable and appropriate to this assessment, and should thus consider these aspects when fulfilling the requirements of each section:
– The accuracy of any calculations required.
– The strengths and quality of your overall analysis and evaluation.
– Appropriate use of relevant theoretical models, concepts and frameworks.
– The rationale and evidence that you provide in support of your arguments.
– The credibility and viability of the evidenced conclusions/recommendations/plans of action you put forward.
– Structure and coherence of your considerations and reports.
– Appropriate and relevant use of, as and where relevant and appropriate, real world examples, academic materials and referenced sources. Any references should use either the Harvard OR Vancouver referencing system (see References, Citations and Avoiding Plagiarism).
– Academic judgement regarding the blend of scope, thrust and communication of ideas, contentions, evidence, knowledge, arguments, conclusions.
– Each assessment requirement has an allocated mark/weighting.

Student submissions are reviewed/scrutinised by an internal assessor and are available to an External Examiner for further review/scrutiny before consideration by the relevant Examination Board.

It is not uncommon for some students to feel that their submissions deserve higher marks (irrespective of whether they actually deserve higher marks). To help you assess the relative strengths and weaknesses of your submission, please refer to the UCL Assessment Criteria Guidelines, located at https://www.ucl.ac.uk/teaching-learning/sites/teaching-learning/files/migrated-files/UCL_Assessment_Criteria_Guide.pdf

The above is an important link as it specifies the criteria for attaining 85% +, 70% to 84%, 60% to 69%, 50% to 59%, 40% to 49%, below 40%.

You are strongly advised not to compare your mark with the marks of other submissions from your student colleagues. Each submission has its own range of characteristics which differ from others in terms of breadth, scope, depth, insights, and subtleties and nuances. On the surface one submission may appear to be similar to another but invariably, digging beneath the surface reveals a range of differing characteristics.

Section F: Additional information from module leader (as appropriate)

N/A