GY476

GY476 – Summative Assessment 2022/23 – MSc GDS

Overview and Instructions: Computational Essay

Due Date: 15th December 2022, by 12 noon

Overview

Here’s the premise. You will take the role of a real-world GIS analyst or spatial data scientist tasked with exploring datasets on the San Francisco Bay Area (often just called the Bay Area) and finding useful insights for a variety of city decision-makers. It does not matter if you have never been to the Bay Area. In fact, this will help you focus on what you can learn about the city through the data, without the influence of prior knowledge. Furthermore, the assessment will not be marked on how much you know about the San Francisco Bay Area but on how much you can show you have learned through analysing data.

You will need to contextualise your project by highlighting the opportunities and limitations of ‘old’ and ‘new’ forms of spatial data and by referencing relevant literature.

Format

A computational essay using R-markdown. The assignment should be carried out fully in R-markdown.

What is a Computational Essay?

A computational essay is an essay whose narrative is supported by code and computational results that are included in the essay itself. This piece of assessment is equivalent to 4,000 words. However, this is the overall weight. Since you will need to create not only narrative but also code and figures, here are the requirements:

- Maximum of 3,000 words of ordinary text (references do not contribute to the word count). You should answer the specified questions within the narrative. The questions should be included within a wider analysis.
- Up to five maps or figures (a figure may include more than one map and will only count as one, but the maps need to be integrated in the same overall output).
- Up to one table.

There are three kinds of elements in a computational essay:

1. Ordinary text (in English)
2. Computer input (R-markdown code)
3. Computer output

These three elements all work together to express what’s being communicated.

Submission

You must submit one electronic copy of your summative assessment via SharePoint by the published deadline. The format of the file must be a .zip (zipped folder), including an HTML AND an R-markdown document AND any additional data or JPGs. Please do not include your name anywhere in the documents. Please name your file as follows: Course_Candidate number (e.g. GY476_34567.zip). Don’t worry if your file gets renamed, and please do not tell the course teacher if it does, as files should remain anonymous.

Please refer to the GY476 Summative Assessment criteria. This document includes the parts you should include in your Computational Essay.

Data

The assignment relies on datasets and has two parts. Each dataset is explained in more detail below.

- Data made available on Murray Cox’s website as part of his “Inside Airbnb” project, which you can download (http://insideairbnb.com/). The website periodically publishes snapshots of Airbnb listings around the world. You should download the San Francisco data, the San Mateo data and the Oakland data. These are all part of the Bay Area. Please note that for best results you will need to drop some of the outliers.
- Socio-economic variables for the Bay Area. Source: American Community Survey (ACS) 2016-2020, US Census Bureau. Observations: 1039; Variables: 472; Years: 2016-2020. A subset of variables from the latest ACS has already been retrieved for you in ACS_2016_2020_vars.csv. However, you have access to ALL variables in the American Community Survey (ACS) 2016-2020 through the R package tidycensus. You are strongly recommended to use the census API in the R package tidycensus to extract your variables of interest instead of the csv.
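As a minimal sketch of the tidycensus route recommended above, the function below wraps a single call to `get_acs()` for median household income (table B19013, listed in the Appendix) at census-tract level. The function name `fetch_bay_area_income` and the default county list are illustrative assumptions, not part of the brief; adjust the counties to the study area you settle on. The call needs a free Census API key, registered once with `tidycensus::census_api_key()`.

```r
# Hypothetical helper: the function name and the default counties are
# assumptions made for illustration; the brief does not prescribe them.
# (Oakland is in Alameda County.)
fetch_bay_area_income <- function(counties = c("San Francisco", "San Mateo", "Alameda")) {
  # Requires the tidycensus package and a Census API key set beforehand with
  # tidycensus::census_api_key("YOUR_KEY").
  tidycensus::get_acs(
    geography = "tract",                      # one row per census tract
    variables = c(hh_income = "B19013_001"),  # median household income
    state     = "CA",
    county    = counties,
    year      = 2020,                         # end year of the 2016-2020 5-year ACS
    survey    = "acs5",
    geometry  = TRUE                          # return tract polygons as an sf object
  )
}

# Usage (needs internet access and an API key):
# income_tracts <- fetch_bay_area_income()
# head(income_tracts)
```

Defining the request as a function keeps the API call in one place, so the same code can be rerun for a different set of counties or extended with further variable codes.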
For more information about the ACS (2016-2020) you can have a look at: https://www.census.gov/data/developers/data-sets/acs-5year.html and https://api.census.gov/data/2020/acs/acs5/variables.html.

If you want to visualise some aspects at different subnational administrative boundaries, you can download USA boundaries from GADM. You can also find other geodata for the Bay Area in the Berkeley Library.

You can use additional datasets, IF YOU SO CHOOSE, for Part 2. If you need some inspiration, have a look at:

- Geodata for the Bay Area in the Berkeley Library.
- San Francisco Open Data Portal: https://datasf.org/opendata/
- Data World: https://data.world/datasets/san-francisco
- NASA Data: https://earthdata.nasa.gov/earth-observation-data/near-real-time/hazards-and-disasters/air-quality

Part 1 – Common

1.1 Collecting and importing the data

1.1.1 Import and explore

1.2 Preparing the data

1.2.1 What CRS are you going to use? Justify your answer.

1.3 Discussion of the data

Present and describe the data sets used for this project.

1.4 Mapping and Data visualisation

1.4.1 Airbnb in the BAY AREA at Neighbourhood Level

Summarise the data using Bay Area zipcodes obtained from the Berkeley Library. This zipcode file is slightly different from the Airbnb neighbourhood file. Obtain a count of listings by neighbourhood.

- Map 1.1: Number of listings per zipcode. Explore the spatial distribution of the data using choropleths. Style the layers using a colour ramp.
- Map 1.2: Average price per zipcode. Explore the spatial distribution of the data using choropleths. Style the layers using a colour ramp.

Justify your data classification methods and visualisation choices. You should include these maps in your assessment submission. The maps should be well-presented and include a short description.
Questions to answer within your analysis:

- How does the Inside Airbnb data compare to other ‘new’ forms of spatial data?
- Discuss the potential insights and biases, as well as the opportunities and limitations, of the Airbnb data.

1.4.2 Socio-economic variables from the ACS data

Select two variables from the American Community Survey data. These could be, but are not limited to, population density, median income, median age, unemployment, percentage of black population, percentage of Hispanic population or education level. See the Appendix in this document for help. If you choose to calculate population percentages, make sure you standardise the table by the population size of each tract.

Map 2: Explore the spatial distribution of your chosen variables using choropleths. Style the variables using a colour ramp.

Justify your data classification methods and visualisation choices. You should include these maps in your assessment submission. The maps should be well-presented and include a short description.

Questions to answer within your analysis:

- Comment on the details of your map and analyse the results.
- What are the main types of neighbourhoods you identify?
- Which characteristics help you delineate this typology?
- What can you say about the spatial distribution of your socio-economic variable of interest?
- If you had to use this classification to evaluate where Airbnbs would cluster, what would your hypothesis be? Why?

For some stylised (not necessarily accurate) facts about the Bay Area see here.

1.4.3 Combining Data sets

Map 3: Plot the natural logarithm of price (ln of price) of Airbnbs in the San Francisco Bay Area (point plot) together with one of your chosen socio-economic variables of interest at zipcode level (polygon plot), using ggplot, tmap or mapsf. There are various ways of doing this. The maps should be well-presented.

Questions to answer within your analysis:

- Comment on the details of your map and analyse the results.
- Does this map tell you more about the relationship between Airbnb location/price and your socio-economic variable of choice? Explain your answer.

Part 2 – Choose your own analysis

Please note: This part of the assignment can be done on the Bay Area as a whole or you can zoom in on one of the counties. For example, you could just focus on San Francisco.

2.1 Discuss which potential raster data set could add to or improve your analysis in a maximum of 300 words. We have looked at various ones in class. You do not need to obtain the data, just discuss it. You can also look for other ideas at earthdata.nasa.gov.

2.2 Query OpenStreetMap data (for example bars, restaurants, subway stations)

2.2.1 Choose an amenity to query in OpenStreetMap (for example bars, restaurants, subway stations). Source your amenity of choice and save the data.

Map 4: Create a heatmap of your amenity of choice and analyse it. The maps should be well-presented.

2.2.2 Create buffers around your chosen amenity. Find out which Airbnbs are 200 metres (or less) from your amenity of choice. How many Airbnbs are within this spatial range? Would this help you decide where to choose an Airbnb if you were going to San Francisco? Justify by referring to the opportunities and limitations of OSM data.

2.3 Descriptive Spatial Analysis

You need to pick one of the following three options. Only one, and make the most of it. You must include one map (Map 5) to support your analysis.

Option 1: Smoothing & Interpolation (IDW, Heatmaps or Point Patterns of Airbnb or OSM data)

- Choose which data to focus on. If you use OSM data it must be different from the amenity chosen in 2.2.1.
- Visualise the dataset appropriately and discuss why you have taken your specific smoothing or interpolation approach.
- How did you define nearest neighbours? Distance and other parameters?
- What do the clusters help you learn about areas of interest in the city?
- How are clusters distributed geographically?
- What are the main characteristics of each cluster?
- Can you identify some groups concentrated on particular areas?
- In what research contexts would your chosen research approach be useful?
- What would you advise city decision-makers based on your findings?

Option 2: Network Analysis or Routing

For this option, you can either choose to calculate routing between different points from data you are already working with in your project by using the R package sfnetworks, or you can download some trip data: https://data.sfgov.org/browse?category=Transportation&page=2

- Visualise the dataset appropriately and discuss why you have taken your specific smoothing or interpolation approach.
- Report your travel time findings and minimum and maximum distances to relevant locations OR create an origin-destination matrix stating the number of trips and average duration and explore the spatial distribution of your trips.
- In what research contexts would calculating travel times or inspecting a travel network be useful?
- What would you advise city decision-makers based on your findings?

Option 3: Plotting relationships between Spatial Variables

For this option you will be using the Inside Airbnb price data together with four socio-economic variables of your choice.

- Visualise the datasets and discuss why you have chosen your four socio-economic variables of interest.
- Create a minimum of four scatter plots between average price and 2-4 socio-economic variables of interest and/or a cross-correlation matrix of your variables.
- What do these relationships help you learn about areas of interest in the city?
- In what research contexts would your chosen research approach be useful?
- What would you advise city decision-makers based on your findings?

Feel free to play around with more variables if you think it can support/enhance your findings.

Resources to help you

See also suggested bibliography in slides throughout the course.

- https://www.r-bloggers.com/2017/11/programming-meh-lets-teach-how-to-write-computational-essays-instead/
- https://rmarkdown.rstudio.com/
- https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf
- https://vizual-statistix.tumblr.com/post/114850050736/i-find-the-spread-of-airbnb-to-be-as-fascinating
- https://carto.com/blog/airbnb-impact/
- https://cran.r-project.org/web/packages/biscale/vignettes/biscale.html

Appendix

American Community Survey (ACS) 2016-2020, US Census Bureau. Observations: 1039; Variables: 472; Years: 2016-2020

| Variable | Description | Notes |
| B19013_001E | Median household income in the past 12 months (in 2020 inflation-adjusted dollars) | Coded as hh_income |
| B02001 (list of vars) | Population by race | See https://api.census.gov/data/2020/acs/acs5/variables.html. I have already recoded black (number of black people) and all_ppl_race (total population by census tract) |
| B23006 (list of vars) | Population by education | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| C15002A (list of vars) | Population by Sex by Education | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| C27012 (list of vars) | Population by Health insurance | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| B08006 (list of vars) | Commuting variable | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| B09010 (list of vars) | Supplementary income variables | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| B09019 (list of vars) | Household type counts | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| B17001 (list of vars) | Poverty Status | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| B28011 (list of vars) | Internet Access | See https://api.census.gov/data/2020/acs/acs5/variables.html |
| B99084 (list of vars) | Work From Home | See https://api.census.gov/data/2020/acs/acs5/variables.html |
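
Section 1.4.2 asks for population percentages to be standardised by the population size of each tract. A minimal sketch of that step, using the recoded columns `black` and `all_ppl_race` named above, might look as follows. The toy values, the `GEOID` column and the output column `pct_black` are assumptions made purely for illustration; they are not taken from ACS_2016_2020_vars.csv.

```r
# Toy tract table: the values are invented for illustration only.
# `black` and `all_ppl_race` follow the Appendix naming; `GEOID` is a
# placeholder identifier column.
tracts <- data.frame(
  GEOID        = c("tract_a", "tract_b", "tract_c"),
  black        = c(120, 45, 0),
  all_ppl_race = c(2400, 3000, 0)  # a zero-population tract shows the guard
)

# Percentage of each tract's population that is black.
# Tracts with no recorded population get NA instead of a 0/0 result (NaN).
tracts$pct_black <- ifelse(
  tracts$all_ppl_race > 0,
  100 * tracts$black / tracts$all_ppl_race,
  NA_real_
)

print(tracts)
```

Dividing by the tract total makes tracts of different sizes comparable on a choropleth, which raw counts are not.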