ALY6010 Module 1 Project
Instructor: Dr. Dee Chiluiza, PhD
Discrete probability and normal distributions

Overview and Rationale
This assignment is designed to provide you with hands-on experience in performing descriptive statistical methods on a data set. The data set is provided in an Excel workbook and contains a wide range of data types that you will need to work with.

Assignment Summary
Using the data provided in the attached Excel workbook, apply the methods of graphical and numerical descriptive statistics. Follow the instructions in the project document to analyze the data presented in the Excel workbook. Then complete a report summarizing your data analyses.

Important note on the report: for this project, your report will be an HTML file produced using R Markdown.

Important note 2: I understand that some students are still learning R and R Markdown. If you are in this group, this week's deadline is flexible, and you can present your report up to five days after the deadline.

Files to submit: for this project you must submit two files:
Your R Markdown file.
Your HTML report.

Tasks to complete before starting your project.
1. Install the latest versions of R and R Studio on your computer. (Read file: 01a R Install, create folder and project.ppt)
2. Create a folder on your computer named “ALY6010 R Project” and a subfolder named “DataSets”.
3. From R Studio, create an R Project for this class using the “ALY6010 R Project” folder you created above. (Read file: 01a R Install, create folder and project.ppt)
4. Learn how to import data sets into R using the strategy requested by your instructor. (Read file: 03 R Import data sets.ppt)
5. Learn how to use R Markdown. We will use only basic codes to produce the HTML outcome reports. (Read file: R Markdown Introduction.ppt)
6. Save the file “M1data_carsales.xlsx” inside your DataSets folder.
7. Create an R Markdown file inside your ALY6010 R Project, name this file: Project1_myname.Rmd.
(Read file: R Markdown Introduction.ppt)
8. Import the data set into R Studio using the strategy you learned above and present the code in an initial R chunk.
9. Do not present install.packages() codes in your report. If you need to install any new package, do it directly in the R Studio console.

Create an initial R chunk to activate your libraries and import your data sets. Use the following header on this R chunk: {r message=FALSE, warning=FALSE}

Some libraries to include in your libraries R chunk. If you do not have them, install the packages in the console.
library(readxl)
library(tidyverse)
library(dplyr)
library(DT)
library(RColorBrewer)
library(rio)
library(dbplyr)
library(psych)
library(FSA)

Report starts here

Title. Create a title for your report with the report’s name (Project 1 Report), the name and CRN of the class, your name, your instructor’s name, and the date you submit the report. Here is an example:

Introduction. Create a title for your Introduction section. Here is a code example:

(A) Write some sentences to present general information about the car sales market, both globally and in India. Here are some websites you can read; these are examples, find others if you prefer:
Wagner, I. February 5, 2021. Automotive industry worldwide – statistics & facts. Statista.
Link: https://www.statista.com/topics/1487/automotive-industry/
Thakkar, K. January 11, 2021. Indian car market may post record 30% growth in 2021 on low base. Auto.com.
Link: https://auto.economictimes.indiatimes.com/news/passenger-vehicle/cars/indian-car-market-may-post-record-30-growth-in-2021-on-low-base/80218106
Culver, M. December 17, 2020. Global Auto Sales Expected to Gain Momentum Next Year; 83.4 Million Light Vehicles to Be Sold In 2021, According to IHS Markit. Business Wire.
Link: https://www.businesswire.com/news/home/20201217005798/en/Global-Auto-Sales-Expected-to-Gain-Momentum-Next-Year-83.4-Million-Light-Vehicles-to-Be-Sold-In-2021-According-to-IHS-Markit
(B) Write a paragraph describing and explaining the importance of discrete and continuous probability distributions.
(C) Write a sentence describing the data set you are about to use.

Analysis section.

Task 1
If you are not familiar with dplyr::select() and psych::describe(), this is a good opportunity to learn them.
Create an R chunk. Start with the name of the data set, then, using the pipe %>%, apply dplyr::select() to select only the variables Efficiency, Power_bhp, Seats, Km, and Price.
Using a second pipe, apply psych::describe() with nothing inside the parentheses. Run the code.
Two things should catch your attention: the descriptive statistics are in the columns, not in the rows, and there are too many decimals. Correct these issues.
Using another pipe, enter t() to transpose the values. Run the code and observe.
Using another pipe, enter round(2) to reduce the decimals to 2.
Using another pipe, enter knitr::kable() to improve the table presentation.
Present the table in your report. Write some observations about the code strategy you just learned.

Task 2.
Prepare and present a bar plot to show the frequencies of variable location.
Prepare and present a bar plot to show the frequencies of variable fuel type.
Prepare and present a bar plot to show the frequencies of variable transmission.
Prepare and present a bar plot to show the frequencies of variable owner.
Important: Use code par(mfrow=c(2,2)) to organize your bar plots in a 2×2 matrix. Improve your graphs with clear x- and y-axis labels and colors.

Task 3
Create a table with the variable location in the rows, and present the corresponding frequencies, cumulative frequencies, percentages, and cumulative percentages. If you have decimals, always reduce them to 2 or 3 only.
Follow these steps:
Create a table to present the locations and their frequencies.
Convert the table using as.data.frame().
Rename columns: Var1 to Location and Freq to Frequency.
Use mutate() to create three new columns (these are new calculated fields):
The cumulative frequencies, name column: CumFrequency.
The percentages, name column: Percentage.
The cumulative percentages, name column: CumPercentage.
Present it using a table library of your choice: library(DT) or library(knitr).
Optional: to apply kable, practice these codes to present your table; you will need to install package kableExtra.
knitr::kable(digits = 2, caption = "Task 3 Table") %>% kable_classic(full_width = FALSE, font_size = 12)
Check: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html.

Task 4.
Repeat the codes used for task 3, this time presenting frequencies, cumulative frequencies, percentages, and cumulative percentages for variable owner.

Task 5
Prepare one horizontal box plot and one histogram to display the distribution of the numerical variable kilometers.
Use the code par(mfrow=c(2,1), mai=c(1,1,1,1)) at the beginning of the R chunk. mfrow will present the two figures one on top of the other as a group; in this case, c(2,1) indicates 2 rows and 1 column. mai sets the margins of your figures in inches, in the order bottom, left, top, right. Play with the mai numbers to observe the changes.
Remove the title of your graphs by using main = NA.
Remember to always make observations after each task.

Task 6.
Similar to task 5, this time present the box plot and histogram for variable price.

Task 7
Prepare and present a box plot to display the price distribution per location. Your figure must contain several boxes, one per location. Give your figure a good presentation format.
Remember to always make observations after each task.

Task 8
Similar to task 7, prepare and present a box plot to display the kilometers distribution per owner.
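
As an illustration of the Task 3 steps (table, as.data.frame(), rename, mutate()), here is a minimal sketch on a small made-up data frame. The object name cars_demo and its city values are hypothetical and stand in for the course data set; dplyr is assumed to be installed. It is one possible arrangement of the steps, not the required solution.

```r
# Sketch of a frequency table with cumulative columns (toy data, not the course file).
library(dplyr)

# Hypothetical stand-in for the data set: only a Location column is needed here
cars_demo <- data.frame(
  Location = c("Mumbai", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai")
)

freq_table <- table(cars_demo$Location) %>%   # count each location
  as.data.frame() %>%                         # columns come out as Var1 and Freq
  rename(Location = Var1, Frequency = Freq) %>%
  mutate(
    CumFrequency  = cumsum(Frequency),                                  # running total
    Percentage    = round(100 * Frequency / sum(Frequency), 2),         # share of all rows
    CumPercentage = round(cumsum(100 * Frequency / sum(Frequency)), 2)  # running share
  )

print(freq_table)
```

The same pipeline could be pointed at the owner variable for Task 4, and the resulting data frame can be passed to DT::datatable() or knitr::kable() for presentation.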
Task 9
Apply and present the outcomes of boxplot.stats() for variable kilometers. Explain the information obtained by applying this code.

Task 10
With the information obtained in task 9, prepare and present a dotchart() to display the quartile values ($stats) for variable kilometers.

Create a CONCLUSIONS title.
In the conclusions section, give a global summary of your results and of what you learned from the whole project. Comment on the project as a whole and on what your results mean for its direction, and explain any new skills you gained. Also, imagine you are preparing this report for a company or research institution; you must therefore make meaningful contributions. Think about what recommendations you can provide.

Create a BIBLIOGRAPHY title.
In the bibliography section, list all information sources you used to support your work on this project.
References must be cited in the main body of your report: technically speaking, if you do not mention any references in the main body of your report, it is as if you did not use any, even if you add a list at the end. Cite each reference in the main body at the place where you use it as an information source. Use either the first author’s last name and year, e.g., (Bluman, 2017), and then list the references in the bibliography section in alphabetical order, or a number in order of appearance, and then list them in the bibliography section in that numerical order.

Appendix
Since you are presenting your report as an HTML file but will also submit the original Rmd file, create an appendix section and write a sentence: An R Markdown file has been attached to this report. The name of the file is….

R codes: If you feel comfortable using R, feel free to add additional codes using more complex libraries. Do not present install.packages() codes in your Rmd file.
Install all packages you need using the console or the Packages tab in R Studio.

Grade: 100 Points
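
For Tasks 9 and 10, the sketch below shows what boxplot.stats() returns and how its $stats element could feed dotchart(). The vector km holds made-up readings, not the course data, and the point labels are illustrative choices. Note that R documents $stats as the whisker ends, the hinges, and the median; the hinges are usually close to, but not always equal to, the quartiles returned by quantile().

```r
# Sketch of boxplot.stats() and dotchart() on toy values (base R only).

# Hypothetical odometer readings in kilometers
km <- c(12000, 25000, 31000, 40000, 47000, 52000, 60000, 75000, 210000)

km_stats <- boxplot.stats(km)
# km_stats$stats : lower whisker, lower hinge, median, upper hinge, upper whisker
# km_stats$n     : number of non-missing observations
# km_stats$conf  : lower and upper limits of the notch around the median
# km_stats$out   : observations lying beyond the whiskers
print(km_stats)

# Display the five $stats values as a dot chart
dotchart(
  km_stats$stats,
  labels = c("Lower whisker", "Lower hinge", "Median", "Upper hinge", "Upper whisker"),
  xlab = "Kilometers",
  pch = 19
)
```

In this toy vector the largest reading falls beyond the upper whisker, so it is reported in $out rather than in $stats.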