Python-DATA1002/1902

Page 1 of 9 DATA1002/1902 (Sem2, 2021) Project Stage 2

Due: 11:59pm on Sunday October 17, 2021 (end of week 9)
Value: 10% of the unit

Note: these instructions are long and somewhat complicated, but the work you need to do is not actually very much. It should be easy to fit into the provided three class-weeks of your time, as long as you interact frequently and apply any feedback from the tutors. Don’t wait till near the due date to start! If anything in the instructions is unclear or confusing, please ask about it on Edstem, using the category “Group Report”, and sub-category “Stage 2”.

GROUP RULES

This assignment is done in groups of 3 or 4, and all students in a group must be part of the same lab session. Under exceptional circumstances a group of 5 members may be created by the unit coordinator (for example, the coordinator may be adding someone who had missed allocation to an already formed group of 4). Similarly, a smaller group may be created by the coordinator when dealing with group disputes as described below, or when a group is reduced in size due to a member discontinuing this unit. Note: there is work required from each member separately, but the project is handed in as a combined effort, and it is marked as a whole: all members of the group will get the same mark for the assessment.

GROUP FORMATION PROCEDURE

In the week 7 lab, you should form a group. This stage is usually done with the same group members as you worked with for Stage 1. Membership changes will only be made following the process described below. In particular, if there is any group of 4 where all the members are happy to keep the group unchanged, then it will not be forced to change. Note however that a new Canvas group has been created for this stage of the project (so that any changes made now do not affect the marking of Stage 1). Similarly, any group of 3 members from Stage 1 can choose to stay together; however, they may receive an extra person joining the group for Stage 2.
If for any reason any members in a group want to leave, then they should inform the tutor at the start of the week 7 lab, by not joining the breakout room for their former group. If someone who wants to leave a group will not be at the week 7 lab, they need to urgently email the unit coordinator alan.fekete@sydney.edu.au, naming the lab and group they wish to leave. The lab tutor will endeavor to form groups of the proper size, by combining people who have left groups, and/or by adding such people to existing groups with fewer than 4 members. If several people (from a previous group) all want to leave that group but stay together with one another, then they can let the tutor know; the tutor will try to achieve this, but it is not guaranteed. Similarly, if someone wants to join a specific existing group which has fewer than four members, they should tell the tutor, but again this can’t be certain. Any such request should describe clearly what new membership structure is desired. Note that whenever a move occurs, all members of the former group may continue to make use of any data, code or documents that had been produced in Stage 1 by the group they were part of during that stage.

Sometimes, people ask to have someone else removed from a group (usually, for non-contribution in Stage 1). This is not allowed. Instead, the people who are unhappy with someone can choose to leave the group themselves (as described above), thus leaving the other person in the former group.

DISPUTE RESOLUTION

If, during the course of the assignment work, there is a dispute among group members that you can’t resolve, or that will impact your group’s capacity to complete the task well, you need to inform the unit coordinator, alan.fekete@sydney.edu.au. Make sure that your email names the group, and is explicit about the difficulty; also make sure this email is copied to all the members of the group (including anyone you are complaining about).
We need to know about problems in time to help fix them, so set early deadlines for group members, and deal with non-performance promptly (don’t wait till a few days before the work is due to complain that someone is not delivering on their tasks). If necessary, the coordinator will split a group, and leave anyone who didn’t participate effectively in a group by themselves (they will need to achieve all the outcomes on their own). This option is only available up until Friday October 1, which is the last day with time to resolve the issue before the due date. For any group issues that arise after this time, you will need to try to resolve the problem on your own, and you will continue to be treated as a single group whose members all get the same mark for this Stage, based on whatever is submitted (though you should still let the coordinator know about the issues).

If someone doesn’t provide material required for the report, or their material is not of the agreed standard, you should still have the report show what that person did. Their section of the report may be empty if they don’t produce anything, or it may have material but not enough. In such cases, please put a “Note to marker” on the front page of the report, which describes the circumstances. That way, we can consider how best to apply the marking scheme. Note that it is not expected or sensible for other members to do the work that someone failed to deliver. Groups may be changed after Stage 2 is finished in this sort of case.

THE PROJECT WORK FOR THIS STAGE: SUMMARY

[Done together] Identify a topic or issue about which you will gain understanding, through data; also identify one or more datasets you will use for this purpose.
[Done together] Coordinate in choosing the summaries and charts you will each produce (to avoid duplication between members, and to enable a good conclusion for your report).
[Done separately by each member] Use Python to produce a few tables (showing grouped-aggregate summaries) from parts of the data.
[Done separately by each member] Produce a few charts from parts of the data; evaluate the effectiveness of each chart.
[Done separately by each member] Write your section in Part A of the report, in which you present the work you have done individually.
[Done together] Write Part B of the report, that explains some understanding about the topic for readers who are interested in the topic, and backs it up with some of the summaries and charts.
[Done together] Produce a PDF of the whole report, with all individual sections and the jointly-written Part B, and produce the compressed folder with all the data and code from each member. Submit it all.

IDENTIFY TOPIC AND DATASET(S): The analysis done in this Stage must all be relevant to a single topic or question, which you are investigating because it matters to some stakeholders. You need to then have one or more datasets that you will analyse, to produce results that are relevant to this topic/question. You are allowed to use the same topic as in Stage 1, but you are equally free to change topic. The members of the group are allowed to all work with the same dataset, or some (or all) may choose to work with different datasets. These datasets are allowed to be cleaned data from Stage 1, or integrated data from Stage 1, or you may choose to obtain new/extra data. There are no requirements for particular origin or volume in the datasets for this Stage. We will make available a dataset (on a topic of our choice) and any group can use that data instead, if they prefer. Note that all members of the group must be working on the same topic/question as each other, even if they use different datasets that deal with different facets of the issue.
We realize that the results you produce from analysis may not completely resolve the issue you are targeting, but each result should at least be potentially able to provide some insights. For example, if your topic is “what influences the average level of wealth in a community”, one analysis may calculate the average wealth in communities having different levels of housing density, and a chart may show how wealth relates to the percentage of people living alone in each suburb. Please make sure that your question or issue is not simply a factual matter, but instead looks at relationships where insights might be impactful for some stakeholder groups (for example, it is not a good choice of question to ask just “which country has the highest level of wealth”).

CHOOSE SUMMARIES AND CHARTS TO PRODUCE: Each member needs to calculate one or more grouped-aggregate summaries from the dataset they are using, and they also must produce one or more charts from that dataset. The number of summaries and charts, and some constraints on what sort of attributes are used, depend on the level of score you are seeking. Details are in the mark scheme below. It is required that all the summaries be distinct from one another, and similarly each chart must be distinct. So you need to coordinate among the members: if two members want to do the same calculation, at least one of them will need to change!

INDIVIDUAL WORK: Each member then needs to work with their chosen dataset, to produce the material for their section in Part A of the report. This will involve writing Python code to calculate one or more summaries, and running that code to get the output, which can then be formatted as a Table in the report. Producing charts can be done either by Python code, or else in a spreadsheet such as Excel.
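As a rough illustration of what a grouped-aggregate summary can look like, the sketch below uses only the Python standard library to average a hypothetical `wealth` field, first grouped by a nominal attribute (`density`) and then by a binned quantitative attribute (`pct_living_alone`), with the overall aggregate added under the key `ALL`. The records, field names, bin width and helper names (`grouped_mean`, `bin_label`) are all invented for illustration and are not part of the assignment; your own dataset and summaries will differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, invented purely for illustration (wealth in $1000s).
RECORDS = [
    {"suburb": "A", "density": "low", "pct_living_alone": 12.0, "wealth": 900.0},
    {"suburb": "B", "density": "low", "pct_living_alone": 18.0, "wealth": 700.0},
    {"suburb": "C", "density": "medium", "pct_living_alone": 25.0, "wealth": 600.0},
    {"suburb": "D", "density": "medium", "pct_living_alone": 31.0, "wealth": 500.0},
    {"suburb": "E", "density": "high", "pct_living_alone": 38.0, "wealth": 400.0},
    {"suburb": "F", "density": "high", "pct_living_alone": 44.0, "wealth": 300.0},
]


def grouped_mean(records, key_func, value_field):
    """Return {group: mean of value_field}, plus the overall mean under 'ALL'."""
    groups = defaultdict(list)
    for row in records:
        groups[key_func(row)].append(row[value_field])
    summary = {group: mean(values) for group, values in groups.items()}
    summary["ALL"] = mean(row[value_field] for row in records)
    return summary


def bin_label(value, width=10):
    """Map a quantitative value to a bin label such as '10-20'."""
    lower = int(value // width) * width
    return f"{lower}-{lower + width}"


# Grouping on a nominal attribute.
by_density = grouped_mean(RECORDS, lambda row: row["density"], "wealth")

# Grouping on a binned quantitative attribute.
by_alone_bin = grouped_mean(
    RECORDS, lambda row: bin_label(row["pct_living_alone"]), "wealth"
)

for label, summary in [("density", by_density), ("% living alone", by_alone_bin)]:
    print(f"Average wealth by {label}")
    for group, average in summary.items():
        print(f"  {group:>8}: {average:.1f}")
```

Passing the grouping rule in as a function keeps one helper usable for both kinds of grouping; in a real submission the records would be read from the member's dataset rather than typed in.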
You need to describe how each chart was produced, and also to evaluate its effectiveness following the approach discussed in Lectures 8A and 8B, which is based on knowing how data attributes get encoded by visual ones in your chart.

COMMUNICATE RESULTS FOR INTERESTED READERS: Working together as a group, you need to write up a presentation of what your analysis has revealed about the topic. This needs to be written to communicate with readers whose focus is not on the technical details of data science, but who are interested in the topic itself. Your report should clearly identify relationships or trends which your analysis has shown (or suggested), and back up these statements with some of the tables or charts taken from what members produced in their individual work. We realise that your work is likely to be limited, and indeed it may be that your analysis suggests that some attributes are not related in any simple way (for example, it may be that wealth and housing density seem fairly independent of one another, or at least, that your data doesn’t show any connection!). That’s OK; just be honest in saying what you expected, and what you found.

WRITE A REPORT: Working together as a group, you need to produce a report. The structure of the report is described below in detail, as the report is the main basis for grading in this project. The report has sections for each member’s separate work, as well as a brief combined introduction that explains the topic or issue, and a combined presentation of conclusions.

PRODUCE PDF AND ZIPPED FOLDER, AND SUBMIT: From the combined document, you need to produce a PDF. As well, there needs to be a file which compresses a folder, within which are subfolders for each member; each subfolder contains the dataset the member worked with, and the code or spreadsheet for producing their analysis (both summaries and charts).
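One possible way to build such a compressed folder is sketched below with the standard-library `shutil.make_archive`. The unikeys, the top folder name `stage2_submission` and the placeholder `README.txt` files are assumptions made for illustration only; the sketch works in a temporary directory so that it does not touch any real files.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

# Hypothetical unikeys and folder name, used only for illustration.
MEMBERS = ["abcd1234", "efgh5678", "ijkl9012"]
TOP_FOLDER = "stage2_submission"


def build_archive(workdir, members):
    """Create TOP_FOLDER with one subfolder per member, then zip it."""
    top = Path(workdir) / TOP_FOLDER
    for unikey in members:
        member_dir = top / unikey
        member_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder standing in for the member's real dataset and code.
        (member_dir / "README.txt").write_text(f"Data and code for {unikey}\n")
    # make_archive appends ".zip" to the base name it is given.
    archive = shutil.make_archive(
        str(Path(workdir) / TOP_FOLDER), "zip", root_dir=workdir, base_dir=TOP_FOLDER
    )
    return Path(archive)


with tempfile.TemporaryDirectory() as workdir:
    archive_path = build_archive(workdir, MEMBERS)
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = sorted(zf.namelist())

# List what ended up inside the archive.
for name in archived_names:
    print(name)
```

Archiving with `base_dir` set to the top folder keeps that folder as the single root inside the zip, so the member subfolders stay grouped when the file is unpacked.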
One person submits both the PDF and the zipped folder, to the submission links on Canvas, on behalf of the whole group. Every member of the group will get the marks earned by the combined submission.

GROUP PROCESS

During the project, you need to manage the work among the group members. We insist that every person do each activity, and describe what they did and found in the appropriate section of the report and in the appropriate subfolder of the compressed folder that gets submitted. We intend for the members to compare their work regularly and learn from one another (as well as from tutor feedback during lab sessions). Because any member’s poor work will reduce everyone’s score, make sure to quickly report any difficulty in working together to the unit coordinator as described above.

WHAT TO SUBMIT, AND HOW: There are two deliverables in this Stage of the Project. Both should be submitted by one person, on behalf of the whole group. The marks from this stage will appear in the Canvas gradebook as being associated with the report submission; the other submission has no marks appearing for it in Canvas, but it can be used as evidence in determining the mark for the stage.

SUBMIT A STAGE 2 WRITTEN REPORT ON YOUR WORK, AS A PDF. This should be submitted via the link in the Canvas site. The report should have two Parts. Part A should be targeted at a tutor or lecturer whose goal is to see what you achieved, so they can allocate a mark. Part B is targeted at someone who is interested in the topic or issue you are investigating. The report should have a front page that gives the group name and lists the members involved (giving their SID and unikey, not their name). The body of the report then has the following structure (this corresponds to the marking scheme):

1. In Part A, there is an initial section which briefly states the topic of interest, and the stakeholders who care about this.
   This is not marked as such; it is just so the marker can understand the setting for the rest of the report.
2. Next in Part A, there should be one section for each member (the section should state the SID/unikey of the group member who did the work reported in this section). In this section, there should be the following subsections:
   a. A brief description of the dataset being used by this member, showing at least the schema of the dataset. You do not need here to describe the provenance, or give a detailed data dictionary. This is not marked as such; it is just so the marker can understand the tables and charts that follow.
   b. One or more subsections, each giving a grouped-aggregate summary. In any subsection, you should show the Python code that calculates a summary, followed by a table that presents the output of that summary.
   c. One or more subsections, each giving a chart. In any subsection, you should describe how you produced the chart, followed by a display of the chart, followed by an evaluation of the effectiveness of the chart. If the chart was produced by Python, the description of how you produced it is the relevant Python code; if you used a spreadsheet to produce the chart, you should state in words the actions you took when creating the chart.
3. If the group is seeking full marks for Chart Production, there will be an extra section in Part A, with a chart which shows four attributes and their relationships. This chart may be produced jointly, or by any individual member.
4. There is a single Part B, jointly written by the group. It is written for readers who are interested in the general topic that you have investigated. In it, you describe the specific issue that you have been investigating, and you present some conclusions or insights about this, which you reached from your analysis.
   You should include some tables and charts (derived from the data, and chosen from among those calculated by the members and reported in Part A), to justify or illustrate the conclusions you give.

There is no required minimum or maximum length for the report; write whatever is needed to show the reader that you have earned the marks, and don’t say more than that! In most cases, the code to produce a summary or chart will be fairly short (a few dozen lines at most), and the evaluation of a chart should not take more than half a page.

SUBMIT A COPY OF THE STAGE 2 DATA AND CODE. This should be submitted through the Canvas system, as a single zip or tar.gz file. So you should have a single folder, with subfolders for each member. The subfolder for a member should contain the dataset used, the Python code to calculate some summaries, and either Python code or a spreadsheet for producing the charts. You then compress the top folder (with all these subfolders and their contents), and submit the single compressed file.

MARKING

Here is the mark scheme for this assignment. The score (out of ten) is the sum of separate scores for each of five components. Note that all members of the group receive the same score.

GROUPED AGGREGATE CALCULATIONS [2 POINTS]

This component is assessed based on the corresponding subsections of all the separate member sections in Part A of the report; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.

Full marks: the Distinction criteria hold, and also all the code is well-documented and clear, and each piece of code provides the overall aggregate as well as the aggregate of each group.
Distinction: every member has written Python code that correctly computes some grouped-aggregate where the grouping is based on a nominal attribute of the data, and each member has written Python code that correctly calculates some grouped-aggregate where the grouping is based on a binned quantitative attribute. All the code pieces are distinct from one another.
Pass: every member has written Python code that correctly calculates a grouped-aggregate summary of some data. All the code pieces are distinct from one another.
Flawed: there is a correct calculation of some grouped-aggregate summary of some data.

CHART PRODUCTION [2 POINTS]

This component is assessed based on the corresponding subsections of all the separate member sections in Part A of the report, as well as on the final 4-aspect chart section if that is present; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.

Full marks: the Distinction criteria hold, and also there is at least one chart which illuminates connections that involve at least four aspects or attributes of the data that are relevant to the question, and where there is a reasonable expectation of a relationship where all four attributes interact together [not just that each pair are related, but that the way in which any two relate is impacted by the values of the other two!]. This chart must be compelling in communicating the information to the reader (e.g. it draws the reader to easily gain a deep awareness of the patterns, especially how the relationships of some attributes are impacted by the other attributes) and must make them keen to learn more.
Distinction: each member produces at least two charts that accurately convey the relationship between aspects or attributes from their data that are relevant to the topic. For each member, at least one chart must show information about at least three aspects or attributes.
For each member, at least one of the aspects shown among their charts must be nominal or ordinal and at least one attribute (possibly in a different chart) must be quantitative. All the charts in the report must be distinct from one another, and without serious flaws (such as being distorted or misleading, or missing crucial information such as axis scales).
Pass: each member produces a chart that accurately conveys the relationship between at least two aspects or attributes from their data that are relevant to the topic. [The phrase “convey the relationship” could mean showing whether or not there is a trend that describes how one attribute’s value is influenced by the values of other attributes, or it could mean showing whether the distribution of values of one attribute is different among different subsets of the data, defined by the values of other attributes, etc.] All the charts in the report must be distinct from one another, and without serious flaws (such as being distorted or misleading, or missing crucial information such as axis scales).
Flawed: some charts are produced.

CHART EVALUATIONS [2 POINTS]

This component is assessed based on the corresponding subsections of all the separate member sections in Part A of the report; the uploaded data and code may be checked by the marker as supporting evidence for claims made in the report.

Full marks: the Distinction criteria hold, and also, for each chart, there are good reflections on how well (or not) the chart design would work if much more data were obtained.
Distinction: every member has written an evaluation for each chart in their section, which correctly documents the encoding between data attributes and visual attributes, and also documents other decisions (such as style of chart, scale etc), and also sensibly justifies the decisions in view of the effectiveness of communication. All the charts in the report must be distinct from one another.
Pass: every member has written an evaluation for each chart in their section, which correctly documents the encoding between data attributes and visual attributes. All the charts in the report must be distinct from one another.
Flawed: some reasonable attempts to evaluate the effectiveness of some of the charts.

CONCLUSION - CONTENT [2 POINTS]

This component is assessed based on Part B of the report. Material in Part A, or the submitted data and code, may be checked by the marker as supporting evidence for claims made in the report.

Full marks: the Conclusion section meets all the Distinction criteria, and also discusses honestly and with insight the limitations and uncertainties about the results.
Distinction: the Conclusion section provides some accurate information which provides insight into important issues in the topic, supported by at least four relevant tables and at least four relevant charts; the tables and charts must include something produced by each member of the group in their sections of the report.
Pass: the Conclusion section provides some accurate information about the topic, supported by at least two relevant tables and at least two relevant charts which were each part of the earlier material in the report.
Flawed: the Conclusion section contains at least one relevant table and at least one relevant chart, as well as text about the topic.

CONCLUSION - COMMUNICATION [2 POINTS]

This component is assessed based on Part B of the report.

Full marks: the Conclusion section meets all the Distinction criteria, and also it draws the reader in and engages their attention with vivid and stylish prose.
Distinction: the Conclusion section makes it easy for the intended audience to gain the understanding they seek. It clearly links to the readers’ background and aims. The structure needs to be logical and well-organised (for example, tables, charts and text relate well to one another).
It makes explicit what has been learned and which aspects have not been resolved.
Pass: the Conclusion section allows the intended audience to gain some knowledge of the domain, without excessive effort or confusion.
Flawed: a reasonable attempt to communicate.

LATE WORK

As announced in the unit outline, late work (without approved special consideration or arrangements) suffers a penalty of 5% of the maximum marks, for each calendar day after the due date. That is, since this Stage is scored out of ten, we subtract 0.5 marks per day from what you would otherwise get for the work. No late work will be accepted more than 10 calendar days after the due date.