ALY6000: Data Analysis

Overview and Rationale

Being able to ask appropriate questions of data is an
important part of the work of data analytics. It is also critical to be
able to interpret the results of the analysis. This assignment is
intended to familiarize you with the data sets and to get you thinking
about key business questions you can answer from this data.

Module Outcomes

This assignment is directly linked to the following learning outcomes:

• Investigate impacts of big data on industry
• Describe the evolution of big data
• Analyze data to complete a data-rich and visually appealing report

Assignment Instructions

Find one dataset that is of interest to you. Some places to find datasets include:

• The R Project for Statistical Computing
• Kaggle
• U.S. Government’s Open Data
• Your own data

Your data set should have at least 700, but fewer than
6,000, records and eight (8) attributes, and the data should not be
“clean”. Part of this assignment will require you to clean the data
yourself. If a Data Dictionary accompanies your chosen dataset, please
use it to understand the fields and values. The assignment
has three parts.

Part I

If a Data Dictionary document is provided, please review it as you
review the dataset. In order to understand
the data, we first need to run some descriptive statistics on the data
set. Start by providing the following for each appropriate variable in
the dataset:

1. Summarize the data in a table.
2. Provide graphs that help visualize the data. These can be bar charts,
   histograms, pie charts, etc. Be sure the chosen graph best represents
   the information you want to highlight.
3. Explain the story the data is telling you. What business question do
   your descriptive analyses answer? Provide a brief discussion of the
   findings.

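As a rough illustration of the summary table in step 1, the sketch below computes a few descriptive statistics for one numeric variable using only Python's standard library. The `monthly_sales` values and the `describe` helper are made up for this example and are not part of the assignment; substitute a variable from your own dataset.

```python
import statistics

# Hypothetical numeric variable; replace with a column from your own dataset.
monthly_sales = [1200, 1350, 1280, 1500, 1420, 1610, 1580, 1490]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = describe(monthly_sales)

# Print the statistics as a two-column plain-text table.
print(f"{'Statistic':<10}{'Value':>12}")
for name, value in summary.items():
    print(f"{name:<10}{value:>12.2f}")
```

A table like this is only a starting point; categorical variables are usually better summarized with frequency counts and percentages.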
If there are any unusual values, discuss them. If data values are “out
of range,” clean the data as needed: delete the out-of-range values and
run the analysis again. If you remove out-of-range values for
any of the variables, present both the analysis with the out-of-range
values and the analysis without them.
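One possible way to present both analyses is sketched below. The `ages` data, the valid range of 0 to 120, and the `summarize` helper are all assumptions chosen for illustration; the appropriate range depends on your own variable and its Data Dictionary.

```python
import statistics

# Hypothetical variable containing two impossible entries (-4 and 212).
ages = [34, 29, 41, -4, 38, 52, 212, 45, 31, 27]

# Assumed valid range for this illustrative variable.
LOW, HIGH = 0, 120

# Split the records into those kept and those removed, so both can be reported.
in_range = [a for a in ages if LOW <= a <= HIGH]
removed = [a for a in ages if not (LOW <= a <= HIGH)]


def summarize(values):
    """Return the count, mean, and median of a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
    }


before = summarize(ages)      # analysis with the out-of-range values
after = summarize(in_range)   # analysis without them

print("Removed values:", removed)
for label, stats in (("With out-of-range", before), ("Without out-of-range", after)):
    print(f"{label:<22} n={stats['n']:<3} mean={stats['mean']:.2f} median={stats['median']:.1f}")
```

Keeping the removed values in their own list makes it easy to discuss them in the findings rather than silently dropping them.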
Identify additional questions that the data is leading you to ask. What
new attributes are needed to answer those questions?

Part II

Create new attributes based on the data and the questions you identified
in Part I. For your data set, compute differences between appropriate
variable values and create a new variable. For example, if the data
shows sales by month across several years, calculate the increase
or decrease in sales from month to month. Then, compute the mean and
median for each of the variables you have computed.

Part III

Now
that you have worked with the data, what is the data saying to you? What
have you learned about the attributes? What are some follow-up
questions you would like to have answered? Identify 3-5 observations or
follow-up questions that you have.

What to Submit

A presentation
slide deck (5-8 slides, not including the title and reference list slides)
with your findings. Submit a single file with the following filename: _FinalProject.pptx

Format

Your presentation must:

• Tell the story of your data through the use of descriptive statistics and visualizations.
  o Remember that your visualizations are the primary vehicle you’ll use to convey information in an analytics presentation.
  o Include very concise written information that is highly connected to the points made in the visualizations as a Notes section on each slide.
• Properly cite all sources using APA citation rules.

Appendix

Assignment Part I Section Example

Business Question: What is the distribution of the status of the 2017 GxP Audits?

Analysis:

Descriptives Table

Audit Status
                       Frequency   Percent   Valid Percent
Valid   Closed                19      19.8            19.8
        Completed              4       4.2             4.2
        In Progress           18      18.8            18.8
        Scheduled             11      11.5            11.5
        Pending               14      14.6            14.6
        Not In Scope          26      27.1            27.1
        Cancelled              4       4.2             4.2
        Total                 96     100.0           100.0

[Figure: Audit Status Count]

[Figure: Audit Status Percentages]

Discussion:
The data file includes information on 96 audits in 2017 for GxP areas.
It is unclear if the data file includes all the known GxP audits in 2017
or if it only includes a subset. A large percentage of all GxP Audits
(27.1%) are not in scope. 19.8% of audits are closed and 4.2% of audits
are completed. It is unclear what the difference between “closed” and
“completed” audits is. We should perhaps ask the client: do we really
need two distinct values? 18.8% of the audits are in progress, 11.5% are
scheduled and 14.6% are pending. For the pending audits, the dates of
the audit process have not been established. 4.2% of the audits were
canceled. It may be interesting to have a notes field where the reasons
for cancelation are noted.
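
The percentages in the appendix table follow directly from the frequencies. A minimal sketch of that arithmetic is shown here; the counts are copied from the Audit Status table above, while the variable names are chosen only for this illustration.

```python
# Frequencies copied from the Audit Status table in the appendix.
audit_counts = {
    "Closed": 19,
    "Completed": 4,
    "In Progress": 18,
    "Scheduled": 11,
    "Pending": 14,
    "Not In Scope": 26,
    "Cancelled": 4,
}

total = sum(audit_counts.values())

# Percent of all audits in each status, rounded to one decimal place.
percents = {status: round(100 * n / total, 1) for status, n in audit_counts.items()}

# Print a frequency table with the same columns as the appendix example.
print(f"{'Audit Status':<14}{'Frequency':>10}{'Percent':>9}")
for status, n in audit_counts.items():
    print(f"{status:<14}{n:>10}{percents[status]:>9.1f}")
print(f"{'Total':<14}{total:>10}{100 * total / total:>9.1f}")
```

Because each percentage is rounded separately, the rounded values need not add up to exactly 100 (here they add up to 100.2), so the Total row is computed from the unrounded counts.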