GR5058 Final Practice Problems
Ben Goodrich
Answers will be available December 16, 2021
1 Text
Execute
data("constitution", package = "qss")
Each of the 155 rows of constitution contains the (English translation of the) text of the preamble to a
constitution written by that country in that year.
Create a country_year variable inside the constitution data.frame that combines the country and
year variables so that it uniquely identifies each observation (since some countries rewrite their
constitutions over time).
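One way to build such an identifier is sketched below on a hypothetical toy data.frame that stands in for constitution. The column names country and year come from the text above; the rows and the "_" separator are invented for illustration.

```r
# Toy stand-in for constitution; these rows are made up
toy <- data.frame(
  country = c("atlantis", "atlantis", "utopia"),
  year    = c(1950, 1990, 1975),
  stringsAsFactors = FALSE
)

# Combine country and year into a single identifier
toy$country_year <- paste(toy$country, toy$year, sep = "_")

# TRUE if the new variable uniquely identifies each row
anyDuplicated(toy$country_year) == 0
```

The same paste() call should carry over to the real data, provided its columns are named as assumed here.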
Use the functions in the tidytext R package to make a “tidy” data.frame from the information in
constitution in which the word is the unit of analysis.
Eliminate English “stop words” from the “tidy” data.frame to form a new data.frame.
How many words are left once the “stop words” have been removed from these constitution preambles,
and how many of those are unique?
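The tokenizing, stop-word removal, and counting steps can be sketched as follows. This is a minimal illustration on two invented preambles, not the answer for the real data; the column name preamble is an assumption about how the text is stored in constitution.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for constitution; the `preamble` column name is assumed
toy <- tibble(
  country_year = c("atlantis_1950", "utopia_1975"),
  preamble = c("We the people of Atlantis establish this constitution",
               "The people of Utopia proclaim liberty and justice")
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
tidy_words <- unnest_tokens(toy, output = word, input = preamble)

# Drop English stop words using the stop_words data shipped with tidytext
tidy_no_stop <- anti_join(tidy_words, stop_words, by = "word")

n_total  <- nrow(tidy_no_stop)             # words remaining
n_unique <- n_distinct(tidy_no_stop$word)  # distinct words remaining
```

Because each row of the tidy data.frame is one word occurrence, nrow() counts all remaining words while n_distinct() counts each word only once.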
Which five words have the largest weight under the term frequency-inverse document frequency (tf-idf) metric?
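A hedged sketch of a tf-idf calculation with tidytext::bind_tf_idf() is given below. The documents and words are invented, and the column names country_year and word are assumptions carried over from the earlier steps.

```r
library(dplyr)
library(tidytext)

# Toy word-level data; documents and words are invented
toy_words <- tibble(
  country_year = c("a_1950", "a_1950", "a_1950", "b_1975", "b_1975"),
  word         = c("liberty", "people", "people", "people", "justice")
)

weighted <- toy_words %>%
  count(country_year, word) %>%                              # n = count per document
  bind_tf_idf(term = word, document = country_year, n = n) %>%
  arrange(desc(tf_idf))                                      # largest weights first

head(weighted, 5)
```

A word that appears in every document, such as "people" here, gets an inverse document frequency of zero and therefore a tf-idf weight of zero, which is why the metric favors words that are distinctive to a few preambles.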
Use functions in the dplyr package to create a tibble that counts up all of the times that a word appears
in each constitution’s preamble.
Use the cast_dtm function to create a document-term matrix.
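The counting and casting steps might look like the sketch below, again on invented data with assumed column names. Note that cast_dtm() returns a DocumentTermMatrix, so the tm package needs to be installed.

```r
library(dplyr)
library(tidytext)

# Toy word-level data; documents and words are invented
toy_words <- tibble(
  country_year = c("a_1950", "a_1950", "a_1950", "b_1975", "b_1975"),
  word         = c("liberty", "people", "people", "people", "justice")
)

# One row per (document, word) pair; the count column is named n by default
word_counts <- count(toy_words, country_year, word)

# Rows are documents, columns are terms, cells are counts
dtm <- cast_dtm(word_counts, document = country_year, term = word, value = n)
dim(dtm)
```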
Apply K-means clustering to this document-term matrix for some value of K. Which constitutions
appear in each cluster, and how would you interpret those clusters?
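As a rough illustration of the clustering step, the sketch below runs kmeans() on a small invented count matrix that stands in for as.matrix() applied to the document-term matrix. The choice of K = 2, the seed, and nstart = 20 are arbitrary assumptions.

```r
# Toy document-term counts standing in for as.matrix(dtm); values are invented
X <- rbind(
  doc_a = c(liberty = 5, justice = 4, god = 0, nation = 0),
  doc_b = c(liberty = 4, justice = 5, god = 0, nation = 1),
  doc_c = c(liberty = 0, justice = 0, god = 5, nation = 4),
  doc_d = c(liberty = 0, justice = 1, god = 4, nation = 5)
)

set.seed(123)                               # K-means depends on random starts
fit <- kmeans(X, centers = 2, nstart = 20)  # several starts guard against poor local optima

split(rownames(X), fit$cluster)  # which documents fall in each cluster
round(fit$centers, 2)            # average word counts per cluster
```

Inspecting the cluster centers for words with unusually high averages is one way to interpret what distinguishes the clusters.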
2 Classification
Execute
data("OJ", package = "ISLR2")
Split the data into training and testing sets, stratifying on Purchase, which is the outcome variable
(the brand of orange juice purchased).
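A stratified split could be sketched with the rsample package as below. The use of rsample, the 80/20 proportion, and the seed are assumptions, since the text does not specify them.

```r
library(rsample)

data("OJ", package = "ISLR2")

set.seed(20211216)  # arbitrary seed for reproducibility
oj_split <- initial_split(OJ, prop = 0.8, strata = Purchase)
oj_train <- training(oj_split)
oj_test  <- testing(oj_split)

# Stratifying should keep the brand shares similar in both pieces
prop.table(table(oj_train$Purchase))
prop.table(table(oj_test$Purchase))
```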
Estimate an appropriate but unpenalized logistic regression for this outcome in the training data.
Use Linear Discriminant Analysis to estimate a model with the same or similar set of predictors.
Use elastic net logistic regression to estimate a model with the same or similar set of predictors.
Which of the above models classifies best in the testing data if overall accuracy is the criterion?
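The three models and the accuracy comparison can be sketched as follows. This is only one possible setup: the predictor set, the 80/20 split, the seed, the 0.5 classification threshold, and alpha = 0.5 for the elastic net are all illustrative assumptions, and choosing a better specification is part of the exercise.

```r
library(rsample)
library(MASS)     # lda()
library(glmnet)   # cv.glmnet()

data("OJ", package = "ISLR2")
set.seed(20211216)
oj_split <- initial_split(OJ, prop = 0.8, strata = Purchase)
oj_train <- training(oj_split)
oj_test  <- testing(oj_split)

# Illustrative predictor set, not a recommended specification
f <- Purchase ~ LoyalCH + PriceDiff + SpecialCH + SpecialMM

# 1. Unpenalized logistic regression; glm() models the second factor level, "MM"
fit_glm  <- glm(f, data = oj_train, family = binomial)
prob_glm <- predict(fit_glm, newdata = oj_test, type = "response")
pred_glm <- ifelse(prob_glm > 0.5, "MM", "CH")

# 2. Linear Discriminant Analysis with the same predictors
fit_lda  <- lda(f, data = oj_train)
pred_lda <- as.character(predict(fit_lda, newdata = oj_test)$class)

# 3. Elastic net logistic regression; lambda is chosen by cross-validation
x_train  <- model.matrix(f, data = oj_train)[, -1]  # drop the intercept column
x_test   <- model.matrix(f, data = oj_test)[, -1]
fit_net  <- cv.glmnet(x_train, oj_train$Purchase, family = "binomial", alpha = 0.5)
pred_net <- as.character(predict(fit_net, newx = x_test, s = "lambda.min", type = "class"))

# Share of testing observations classified correctly by each model
accuracy <- sapply(
  list(logit = pred_glm, lda = pred_lda, elastic_net = pred_net),
  function(pred) mean(pred == oj_test$Purchase)
)
accuracy
```

Which model comes out ahead can depend on the seed and the predictors, so the comparison should be read as specific to one split rather than as a general ranking.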
3 Neural Network Regression
In a blog post at