POLS0010 Data Analysis Term II

ESSAY QUESTIONS 2023, Part I
Guidelines for Completing and Submitting POLS0010 Term II Essay
Read the guidelines below to avoid losing marks unnecessarily.
The assessment is due on Tuesday 2nd May 2023, 2.00pm. It has two parts (I and II),
both of which will need to be submitted together. This first part of the essay is worth 60
marks. Part II will be released in week 10 and is worth 40 marks.
Please follow all designated Department of Political Science submission guidelines. These
may be different to those of your home department. You must submit one copy of your
essay via Turnitin.
The datasets for the essay can be found in the ‘Term 2 Assessment’ folder in the ‘Term 2
Assessments’ section of Moodle.
The combined word limit for Parts I and II is 3,000 words, excluding your R script appendix
(see below). You can divide the word limit as you like between the two parts.
This is an assessed piece of coursework for the POLS0010 module; collaboration and/or
discussion with anyone is strictly prohibited. The rules for plagiarism apply and any cases
of suspected plagiarism of published work or the work of classmates will be taken very
seriously.
You may open up the datasets and work on the essay questions anytime up until the
submission date. There is no limit on the number of times you may open the data files. Be
sure to save your data files and R script file.
You should include a copy of your R script as an appendix to your essay. FAILURE TO
INCLUDE THE R SCRIPT WILL INCUR A 10 POINT PENALTY. Note that your
R script file should be neatly presented and easy to follow, including comments indicating
the question being addressed. The essay answers should not contain any code.
All tables or figures must be included within your answers to the essay, not in the code
appendix.
Answers should be written in complete sentences; no bulleting or outlining.
You may assume the methods you have used (e.g. logit regressions) are understood by
the reader and do not need definitions, but you do need to say which techniques you have
used and why.
As this is an assessed piece of work, you may not email/ask the course tutors for help
with the essay questions.
This part of the final essay contains two questions. You must answer both of them. Question
A is worth 30 points and Question B is worth 25 points.
Up to an additional five points will be awarded for clarity of presentation, especially tables and
figures. See “Term 2 Assessment Advice” in the “Assessments” section of the Moodle site for
guidelines on presentation.
Both questions require you to write a brief report. It is up to you how you structure the reports,
but it is advisable to keep introductory material to a minimum, given the word limit. Your
reports should discuss your methods, your results and the conclusions that you draw from them.
You are welcome to use sub-headings to structure your reports.
QUESTION A: The Scottish Independence Referendum [30 points]
In 2014, there was a referendum in Scotland on independence from the United Kingdom. The “No”
side (opposed to independence) won with 55.3% of votes against the “Yes” side’s 44.7%. For this
question, suppose that the referendum is going to be repeated next year, and the pro-independence
campaign asks for your advice. They want to run an advertising campaign targeted at groups who are
most likely to support independence in the new referendum, to persuade them to turn out and vote.
Your job is to tell them which types of people are most supportive of Scottish independence. To help
measure the likely effectiveness of their advertising, they also want to know how much each
characteristic matters in explaining support. To answer these questions you’ll look at data on how
people voted in the first referendum: a survey of 1100 voters from the Scottish Social Attitudes
Survey fielded in 2015 that asked about respondents’ vote in the referendum. You need to:
1. Run at least three logit models containing different sets of variables, and select the one that
you think has the best performance in classifying supporters of independence.
2. Present your chosen model’s findings in ways that clearly explain how much the variables
matter in explaining support for Scottish independence.
3. Use the findings to make recommendations on whom the new campaign should target.
You should present your approach and your findings in the form of a brief report. The dataset is called
“ssa” and is contained in the file “scottishindependence.Rda”. It contains the following variables for
each individual in the survey:
Name          Variable description
voteYes       dependent variable: =1 if respondent voted for independence, 0 otherwise
male          =1 if male, 0 if female
age           in years
housing       =1 if household owns their home, 0 otherwise
highinc       =1 if respondent falls into the highest income quartile (25%), 0 otherwise
religious     =1 if respondent reports belonging to an organized religion, 0 otherwise
satisfiedNHS  level of satisfaction with the National Health Service (NHS), ranging from 1
              (very dissatisfied) to 5 (very satisfied)
leftright     respondent’s self-placement on a left-right ideology scale, ranging from 1
              (left-wing) to 5 (right-wing)
trustUKgov    how much the respondent trusts the British government to work in Scotland’s
              long-term interest, ranging from 1 (almost never) to 4 (just about always)
trustSEgov    how much the respondent trusts the Scottish government to work in Scotland’s
              long-term interest, ranging from 1 (almost never) to 4 (just about always)
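As a rough illustration of the workflow in steps 1 and 2, the sketch below fits two logit models to synthetic data, compares the share of cases each classifies correctly, and converts one coefficient into predicted probabilities. The data, the coefficients used to generate it, the 0.5 cut-off and the helper pct_correct() are all assumptions made for this example; they are not taken from the “ssa” dataset, and in-sample accuracy of this kind may overstate performance on new data.

```r
# Illustrative sketch only: synthetic data standing in for the "ssa" dataset.
set.seed(1)
n <- 500
toy <- data.frame(
  age        = sample(18:90, n, replace = TRUE),
  male       = rbinom(n, 1, 0.5),
  trustSEgov = sample(1:4, n, replace = TRUE)
)
# Assumed data-generating process, invented for this example
p <- plogis(-1 + 0.8 * toy$trustSEgov - 0.02 * toy$age)
toy$voteYes <- rbinom(n, 1, p)

# Two candidate logit models with different sets of predictors
m1 <- glm(voteYes ~ age + male, data = toy,
          family = binomial(link = "logit"))
m2 <- glm(voteYes ~ age + male + trustSEgov, data = toy,
          family = binomial(link = "logit"))

# Hypothetical helper: share of respondents classified correctly at a cut-off
pct_correct <- function(model, y, cutoff = 0.5) {
  predicted <- as.integer(fitted(model) > cutoff)
  mean(predicted == y)
}
acc1 <- pct_correct(m1, toy$voteYes)
acc2 <- pct_correct(m2, toy$voteYes)

# Predicted probability of a Yes vote at the lowest and highest trust level,
# holding the other variables at fixed illustrative values
newdata <- data.frame(age = 45, male = 0, trustSEgov = c(1, 4))
pred <- predict(m2, newdata = newdata, type = "response")
```

Differences in predicted probabilities such as pred[2] - pred[1] are one way to express how much a variable matters on the probability scale, which is usually easier to read than raw logit coefficients.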
QUESTION B: Estimating Support for Redistribution in British Local Authority
Areas [25 points]
How the public feels about redistribution – government action to redistribute income from richer to
poorer citizens using tax and benefits – is important in determining the policies that governments enact.
Recently there have been large geographic changes in support for redistribution. In this question you
will produce geographic estimates of average support for redistribution in every local authority area¹ in
Great Britain using multilevel regression and post-stratification. You also have access to an existing,
authoritative, set of estimates. Your tasks are (i) to produce estimates of average support for
redistribution in every local authority area using multilevel modeling and post-stratification that are as
close as possible to this existing set of estimates, as measured by the Mean Absolute Error (MAE), and
(ii) to use your results to explain the correlates of support for redistribution. You should present and
explain your approach and results in a brief report.
You need to:
i) Estimate an appropriate multilevel model explaining support for redistribution, using the
predictors in the dataset.²
ii) Present the multilevel model results and interpret how the variables affect support for
redistribution (Note: you do not need to discuss statistical significance).
iii) Produce post-stratified estimates of average support for redistribution in 366 local authority
areas in Great Britain.
iv) Compare your results to the existing estimates using the Mean Absolute Error.
Note: if you cannot get close to the existing results, do not worry. Your grade depends on the quality
of your analysis, presentation and interpretation, not how close your results are to the existing estimates.
The survey data is called “surveydata” and is in the file “2023qb_survey.Rda”. It contains 6410
survey responses from British citizens aged 18+, with the following variables:
Variable name      Variable description
support            Dependent variable: the individual’s level of support for redistribution, a
                   scale from 1-5 where higher numbers indicate greater support
female             =1 if respondent is female, 0 otherwise
age2029            =1 if respondent is aged 20-29, 0 otherwise
age3039            =1 if respondent is aged 30-39, 0 otherwise
age4049            =1 if respondent is aged 40-49, 0 otherwise
age5059            =1 if respondent is aged 50-59, 0 otherwise
age6069            =1 if respondent is aged 60-69, 0 otherwise
age7079            =1 if respondent is aged 70-79, 0 otherwise
age80pl            =1 if respondent is aged 80+, 0 otherwise
la_pc65plus_cen    percent of local authority area population aged 65+, standardised
la_pc1824_cen      percent of local authority area population aged 18-24, standardised
la_hhinc_cen       average household income in local authority area, standardised
la_socbensrd_cen   average amount of government benefits received per household in local
                   authority area, mean-centered (a measure of household poverty/need),
                   standardised
la_popdensity_cen  population density of local authority area, standardised
localauthority     local authority area name
¹ Local authority areas are the lowest tier of government in the UK. For example, London boroughs such as
Camden are local authority areas. At the time this dataset was produced, there were 366 such areas.
² The outcome variable is continuous rather than binary, so you should not use a multilevel logistic model.
The post-stratification data for the 366 local authority areas is called “poststratdata” and is
contained in the file “2023qb_poststratdata.Rda”. Each row contains one particular demographic
group in one local authority area. In addition to the variables in “surveydata”, it also contains this
variable:
Variable name      Variable description
value              percent of local authority represented by the demographic group
Finally, the comparison data containing the existing estimates by local authority area is called
“existing_estimates” and is in the file “2023qb_existingestimates.Rda”. In addition to the local
authority name, it contains the existing estimate of average support for redistribution in each area
(called final.est).
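The post-stratification and comparison steps can be sketched on a tiny invented example. The code below assumes that each demographic cell already has a model prediction (the column pred, which in practice would come from your multilevel model) and that the value percentages sum to 100 within each area; the area names, numbers and the column name my.est are made up for illustration.

```r
# Illustrative sketch only: tiny invented data, not the assessment files.
# One row per demographic group per area; `value` is the group's percent of the area.
cells <- data.frame(
  localauthority = c("A", "A", "B", "B"),
  female         = c(0, 1, 0, 1),
  value          = c(40, 60, 50, 50),
  pred           = c(3.0, 3.5, 2.0, 4.0)  # assumed model predictions per cell
)

# Post-stratify: weight each cell prediction by its population share,
# then sum the weighted predictions within each area
cells$weighted <- cells$pred * cells$value / 100
estimates <- aggregate(weighted ~ localauthority, data = cells, FUN = sum)
names(estimates)[2] <- "my.est"

# Invented benchmark estimates, using the column name from the comparison data
existing <- data.frame(localauthority = c("A", "B"),
                       final.est      = c(3.4, 2.9))

# Mean Absolute Error between the two sets of area estimates
both <- merge(estimates, existing, by = "localauthority")
mae  <- mean(abs(both$my.est - both$final.est))
```

If the percentages in a real file did not sum to 100 within an area, dividing the weighted sum by the sum of value for that area would be a reasonable adjustment.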
PART II
This part of the final essay contains one question. It is worth 40 points. Again, 5 points are
reserved for clarity of presentation, especially tables and figures. See Q+A session 5 for
guidelines on presentation.
The question requires you to write a brief report. It is up to you how you structure the report,
but it is advisable to keep introductory material to a minimum, given the word limit. Your
report should discuss your methods, your results and the conclusions that you draw from them.
QUESTION C: Describing and Classifying Reviews [40 points]
Many organisations monitor online reviews in order to gauge how users feel about their services.
For this question, imagine that you have been hired as a consultant by the British National Health
Service to analyse reviews of GP surgeries. They want to find out how people talk about GP
surgeries’ performance, and then build a predictive tool that can classify reviews in future into
‘negative’ or ‘positive’ sentiment toward surgeries, to help them respond better to service users in
real time. They have provided you with a dataset of 1,986 online reviews of GP surgeries that have
been labelled as ‘negative’ or ‘positive’ by their staff. The dataset also records the star rating (from
1 to 5) given to the surgery in each review.
Your task is to prepare a brief report that describes the reviews, and recommends a classification
method for future reviews. You need to:
i) Use appropriate tools to describe the reviews. In particular, what words are associated with
negative or positive sentiment? How does word usage differ between reviews with high
and low star ratings?
ii) Use your analysis from i) to build a short dictionary of negative and positive words
describing surgeries, then use it to classify reviews as ‘negative’ if they contain more
negative than positive language, and ‘positive’ otherwise [code for creating your own
dictionary is provided below].
iii) Use the lasso logit method to classify the reviews into ‘negative’ and ‘positive’.
iv) Compare the performance of your classifiers from ii) and iii), and use this analysis to decide
which one would be the better classifier for the NHS to use for future reviews.
The dataset for this question is called “nhs_reviews” and is contained in the file “nhs_reviews.Rda”.
It contains the following variables:
Variable name      Variable description
review_text        The text of each review
review_positive    Labeled sentiment of each review: 1=positive, 0=negative
surgery            The star rating given to the surgery in the review (from 1 to 5, higher
                   numbers indicate a better assessment of the surgery)
You should first create a corpus of reviews using the following code:
    library(quanteda)  # provides corpus() and dictionary()
    reviewCorpus <- corpus(nhs_reviews$review_text, docvars = nhs_reviews)

Here is some advice for part ii):
Your dictionary should contain a minimum of 5 words and a maximum of 15 words in each category.
You are not expected to exhaustively compare the performance of different dictionaries. Instead,
simply choose one dictionary based on your analysis from i), explaining how you chose the words.

Code for creating a dictionary:
You can create a dictionary called “mydict” in R that contains two categories (‘negative’ and
‘positive’) using the following code:

    neg.words <- c()
    pos.words <- c()
    mydict <- dictionary(list(negative = neg.words, positive = pos.words))

You need to insert your chosen sets of negative and positive words in ‘neg.words’ and ‘pos.words’.
This dictionary can then be used with quanteda in exactly the same way as any of the existing
built-in dictionaries.
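
To make the classification rule in part ii) concrete, here is a minimal sketch that applies it in base R rather than quanteda, so that the counting logic is visible. The four reviews and the word lists are invented for illustration, and the lists are deliberately shorter than the 5 to 15 words required of your own dictionary.

```r
# Illustrative sketch only: invented reviews and word lists, base R instead of quanteda.
reviews <- data.frame(
  review_text = c("Rude receptionist and a long wait",
                  "Helpful doctor, friendly staff",
                  "Friendly nurse but a long wait and rude reply",
                  "Appointment was on time"),
  review_positive = c(0, 1, 0, 1)
)
neg.words <- c("rude", "long", "wait")   # hypothetical choices
pos.words <- c("helpful", "friendly")    # hypothetical choices

# Lower-case each review and split it into word tokens
tokens <- strsplit(tolower(reviews$review_text), "[^a-z]+")

# Count dictionary hits in each review
n_neg <- sapply(tokens, function(w) sum(w %in% neg.words))
n_pos <- sapply(tokens, function(w) sum(w %in% pos.words))

# Rule from part ii): 'negative' if more negative than positive words,
# 'positive' otherwise (ties and reviews with no hits count as positive)
dict_positive <- as.integer(!(n_neg > n_pos))

# Compare the dictionary classification with the hand labels
confusion <- table(predicted = dict_positive, actual = reviews$review_positive)
accuracy  <- mean(dict_positive == reviews$review_positive)
```

The same confusion table and accuracy calculation could be applied to the predictions from another classifier, which gives a like-for-like basis for the comparison in part iv).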