r-IE6600

IE6600 – hw3
Due on 2/28/2022 11:59pm PT
Zhenyuan Lu
IE6600 Homework Instructions
Complete your homework in R Markdown (.Rmd) and knit it to a .pdf file (do not include any data
or other materials). Carefully check all the references mentioned in class or found in other resources.
Once the homework is completed, compress it into one .zip file
(hw3YourFullName.zip) and submit it to the assignment section on Canvas.
The .zip file should contain the following documents:
hw3YourFullName.rmd
hw3YourFullName.pdf
Please include all your code and results for each problem, and keep them organized and clear. All of
your code should run successfully. Any plots/charts should be generated mainly with ggplot2.
If necessary, handle missing values, overplotting, and axis/legend labels properly.
Solutions may vary.
Section A
This section tests your data transformation, ggplot, and base function skills. If necessary,
handle overplotting and axis/legend labels properly.
Problem 1
We would like to create one simulated (fake) data frame containing employee height (cm) information from
two companies: Alpha and Beta. This data frame should also include the companies' area codes. (A company
may have multiple subsidiaries in different areas.)
Create one column Area Code (e.g. A, B, C, etc.) with 2000 rows containing only the 26 upper-case letters
of the alphabet. The letters should be randomly assigned across the 2000 rows (sampling with replacement).
Create one column Company with 2000 rows containing only two values, “Alpha” and “Beta”. For
convenience, the first 1000 rows should be “Alpha” and the last 1000 rows should be “Beta”.
Create one column Employee Height (cm) with 2000 rows. For convenience, the first 1000 rows and the
last 1000 rows should be randomly generated with mean = 160, sd = 5, and mean = 170, sd = 5,
respectively.
Then create a density plot on the height, mapping company as the fill.
Hint 1: The built-in “LETTERS” contains the 26 upper-case letters.
Hint 2: You can use set.seed(#a number) to make your code reproducible.
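The three columns and the density plot described above could be sketched as follows. This is only one possible approach, not the expected solution: the column names (areaCode, company, height), the seed value, and the use of rnorm() (a normal distribution, which the problem does not state explicitly) are all assumptions.

```r
# Minimal sketch; column names, seed, and the normal distribution are assumptions.
set.seed(6600)  # any fixed number makes the random draws reproducible

n <- 1000  # rows per company
employees <- data.frame(
  areaCode = sample(LETTERS, size = 2 * n, replace = TRUE),  # random letters, with replacement
  company  = rep(c("Alpha", "Beta"), each = n),              # first 1000 Alpha, last 1000 Beta
  height   = c(rnorm(n, mean = 160, sd = 5),                 # Alpha heights
               rnorm(n, mean = 170, sd = 5))                 # Beta heights
)

# Density of height, filled by company (runs only if ggplot2 is installed).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(employees, aes(x = height, fill = company)) +
    geom_density(alpha = 0.5) +  # transparency keeps both densities visible
    labs(x = "Employee Height (cm)", y = "Density", fill = "Company")
  print(p)
}
```

The semi-transparent fill is one way to keep the two overlapping densities readable.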
Problem 2
Continue working with the previous data frame. For each area, summarize the average employee height of each
company. Then plot a dodged bar chart of area code versus average height, mapping
company to fill.
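As a rough illustration of the grouped summary and dodged bars, the sketch below rebuilds a toy data frame and uses base R's aggregate() in place of dplyr's group_by()/summarise(), so that it runs without extra packages. The column names and the zoomed y-range are assumptions, and other solutions are equally valid.

```r
# Minimal sketch on a rebuilt toy data frame; all names are assumptions.
set.seed(1)
employees <- data.frame(
  areaCode = sample(LETTERS, 2000, replace = TRUE),
  company  = rep(c("Alpha", "Beta"), each = 1000),
  height   = c(rnorm(1000, 160, 5), rnorm(1000, 170, 5))
)

# Mean height for every (areaCode, company) pair, using base R.
avgHeight <- aggregate(height ~ areaCode + company, data = employees, FUN = mean)
names(avgHeight)[names(avgHeight) == "height"] <- "avgHeight"

# Dodged bars: one bar per company within each area code.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(avgHeight, aes(x = areaCode, y = avgHeight, fill = company)) +
    geom_col(position = "dodge") +
    coord_cartesian(ylim = c(150, 175)) +  # zoom in without dropping the bars
    labs(x = "Area Code", y = "Average Employee Height (cm)", fill = "Company")
  print(p)
}
```

coord_cartesian() zooms the view rather than filtering data, so the bars are not removed when the y-axis starts above zero.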
Plot Example
## `summarise()` has grouped output by 'areaCode'. You can override using the `.groups` argument.
[Figure: dodged bar chart. x-axis: Area Code (A to Z); y-axis: Average Employee Height (cm), ticks from 150 to 175; fill legend: Company (Alpha, Beta)]
Problem 3
Insert THREE more columns into the previous data frame.
First column Employee Weight (kg) should be generated with 2000 random values (mean = 65,
sd = 10).
Second column “BMI” follows the formula: weight(kg) / [(height(cm)/100)^2]
Third column BMI Categories contains 4 labels, “Underweight”, “Normal weight”, “Overweight”,
and “Obesity”, assigned to each row according to column “BMI”.
– When BMI <= 18.5, “Underweight”
– When 18.5 < BMI <= 25, “Normal weight”
– When 25 < BMI <= 30, “Overweight”
– When BMI > 30, “Obesity”
Then create a scatterplot visualizing Employee Height(cm) versus Employee Weight(kg), mapping BMI
Categories as color, and facet this plot by Company.
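One way to derive the three columns and the faceted scatterplot is sketched below. The column names (weight, bmi, bmiCategory) and the normal distribution for weight are assumptions. cut() with right = TRUE produces intervals closed on the right, which matches the "<=" boundaries listed above.

```r
# Minimal sketch; column names and the normal distribution are assumptions.
set.seed(2)
employees <- data.frame(
  company = rep(c("Alpha", "Beta"), each = 1000),
  height  = c(rnorm(1000, 160, 5), rnorm(1000, 170, 5))
)

employees$weight <- rnorm(2000, mean = 65, sd = 10)
employees$bmi    <- employees$weight / (employees$height / 100)^2

# Intervals: (-Inf, 18.5], (18.5, 25], (25, 30], (30, Inf)
bmiBreaks <- c(-Inf, 18.5, 25, 30, Inf)
bmiLabels <- c("Underweight", "Normal weight", "Overweight", "Obesity")
employees$bmiCategory <- cut(employees$bmi, breaks = bmiBreaks,
                             labels = bmiLabels, right = TRUE)

# Height versus weight, colored by BMI category, one panel per company.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(employees, aes(x = height, y = weight, color = bmiCategory)) +
    geom_point(alpha = 0.5) +  # transparency reduces overplotting
    facet_wrap(~ company) +
    labs(x = "Employee Height (cm)", y = "Employee Weight (kg)",
         color = "BMI Categories")
  print(p)
}
```

Nested ifelse() calls or dplyr::case_when() would work equally well for the category column.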
Section B
Section B uses National Health and Nutrition Examination Survey 2015-2016 Demographics Data
from the Centers for Disease Control and Prevention.
Download NHANES 2015-2016 Demographics data (XPT file) from:
https://wwwn.cdc.gov/nchs/nhanes/Search/DataPage.aspx?Component=Demographics&CycleBeginYear=2015
To read the data manual: https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.htm
For details and an introduction to NHANES, see:
https://www.youtube.com/watch?v=75Ur89rMsSA
Load the “haven” package (one of the “tidyverse” packages) and use read_xpt() to import the dataset into
R.
Problem 1
Create a new data frame with the following columns:
Race information limited to these categories: Mexican American, Other Hispanic, Non-Hispanic White,
Non-Hispanic Black, Non-Hispanic Asian, and Other Race (all the rest).
Ratio/value of family income to the poverty line
The above ratio with its decimals removed (e.g. 2.61 -> 2), treated as categorical data (“Annual
family income value”): 0, 1, 2, 3, 4, and 5
The proportion of families by race among all families: the number of families of that race / the
total number of families.
The proportion of families by race among all families at each annual family income value: 0, 1, 2,
3, 4, and 5
Summarize the data frame created above by Annual Family Income Value, Race, and the two
proportion columns. Then create a bar chart of the annual family income value (x-axis) versus
the proportion of Black families among all families at each annual family income value (y-axis). Include a
horizontal line whose y value equals the proportion of Black families among all families.
Are Black families over- or under-represented in poverty? What else do you notice about the chart?
Hint: an annual family income value of 0 means the family is in poverty.
Hint: you may need many group_by() calls.
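The two kinds of proportion can be illustrated on a small made-up data frame. The values below are toy numbers, not NHANES data, and the names (race, ratio, incomeValue) are hypothetical. The sketch uses base R's table() and prop.table() instead of the group_by() route mentioned in the hint, and it assumes rows with a missing ratio have already been handled.

```r
# Toy stand-in for the imported data; values are made up for illustration only.
demo <- data.frame(
  race  = c("Non-Hispanic Black", "Non-Hispanic White", "Non-Hispanic Black",
            "Mexican American", "Non-Hispanic White", "Non-Hispanic Black",
            "Other Race", "Non-Hispanic White", "Other Hispanic"),
  ratio = c(0.45, 2.61, 0.90, 1.20, 5.00, 3.75, 0.10, 2.05, 4.30)
)

# Drop the decimals (2.61 -> 2) and treat the result as a category.
demo$incomeValue <- factor(floor(demo$ratio), levels = 0:5)

# Share of each race among all families.
propRace <- prop.table(table(demo$race))

# Share of each race among the families at each income value (columns sum to 1).
propRaceAtIncome <- prop.table(table(demo$race, demo$incomeValue), margin = 2)

# Bars: share of Black families per income value; line: their overall share.
blackByIncome <- data.frame(
  incomeValue = colnames(propRaceAtIncome),
  proportion  = as.numeric(propRaceAtIncome["Non-Hispanic Black", ])
)
overallBlack <- as.numeric(propRace["Non-Hispanic Black"])

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(blackByIncome, aes(x = incomeValue, y = proportion)) +
    geom_col() +
    geom_hline(yintercept = overallBlack, linetype = "dashed") +
    labs(x = "Annual Family Income Value", y = "Proportion of Black Families")
  print(p)
}
```

Bars above the dashed line mark income values where the group is over-represented relative to its overall share, and bars below it mark under-representation.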
Problem 2
Continue working with the data frame created in Problem 1. Summarize it by Annual Family
Income Value, Race, and the two proportion columns.
Then create a bar chart of the annual family income value (x-axis) versus the proportion of
Mexican American families among all families at each annual family income value (y-axis). Include a
horizontal line whose y value equals the proportion of Mexican American families among all families.
Are Mexican American families over- or under-represented in poverty? What else do you notice about the
chart?
Problem 3
Continue working with the data frame created in Problem 1. Summarize it by Annual Family
Income Value, Race, and the two proportion columns.
Then create a bar chart of the annual family income value (x-axis) versus the proportion of Other
Hispanic families among all families at each annual family income value (y-axis). Include a horizontal line
whose y value equals the proportion of Other Hispanic families among all families.
Are Other Hispanic families over- or under-represented in poverty? What else do you notice about the chart?