R and Python: GEOG0125

GEOG0125 Advanced Topics in Social and Geographic Data Science (2021-2022) Coursework

Anwar Musah (1,*) and Stephen Law (1,**)
(1) Department of Geography, University College London, London, UK
* a.musah@ucl.ac.uk
** stephen.law@ucl.ac.uk

1 Coursework

The coursework for GEOG0125 consists of two separate tasks. The first task concerns the use of Bayesian models and the second task concerns the use of a machine learning model.

2 Spatial Bayesian modelling task

For this part of the coursework, we would like you to select an outcome of your choice that follows a Poisson distribution. This can be from any scientific discipline (e.g., public health, quantitative criminology, disaster reduction, social sciences) in which you can perform geospatial analysis of aggregate data within a Bayesian framework. The aim of this task is to introduce an interesting research problem, apply spatial and spatiotemporal Bayesian models to map and quantify the area-level risk of an outcome, and create an interactive dashboard using the "Shiny" package in RStudio.

The final deliverable for this task is an extended abstract of 1,500 words (excluding references) and an RShiny dashboard. Your extended abstract should contain the following sections:

2.1 Overview

2.1.1 Background

In this section, introduce your research problem and the importance of the selected outcome, and then justify why dashboards are an ideal surveillance tool for monitoring the chosen outcome.

2.1.2 Data and Methods

In this section, you should include a description of the data and the selected study area. You are required to use the spatial conditional autoregressive (CAR) model for this exercise, and to provide the model formulation and a statistical description of each model parameter.

2.1.3 Results and Discussion

In this section, you must report the key findings from the spatial CAR model. Your interpretation of the results should touch on the following key points:

- The overall risks associated with the outcome.
- A descriptive interpretation of the geographical patterns of risk for the selected outcome, and whether these risks are statistically significant.
- An interpretation of the exceedance probabilities.

The discussion should relate back to the modelled outputs visualised in the dashboard. You should discuss how the dashboard can be implemented as a tool to help inform an intervention or support policy decision making in the context of your selected problem.

2.1.4 References

The information provided in the background, methods and discussion sections must be supported with relevant references.

2.2 RShiny Dashboard

The RShiny dashboard must contain the following visual map outputs:

- Geographical distribution of the outcome using any measure of frequency (e.g., expressed as a rate per capita, such as per 100,000 population, or per km²)
- Area-specific risk estimates, i.e., relative risk (RR)
- Significant regions determined by the 95% credibility intervals
- Exceedance probabilities

2.3 Data sources

You are free to use data from outside the list provided below. You are welcome to use any data previously used for the GEOG0114 Spatial Analysis Project. However, you will need to make sure that the selected outcome follows a distribution suitable for the spatial CAR model. Some examples of appropriate data sources:

1. London Data Store (https://data.london.gov.uk)
2. Consumer Data Research Centre (http://data.cdrc.ac.uk/)
3. UK Metropolitan Police (https://data.police.uk)
4. Office for National Statistics for all UK population data (https://tinyurl.com/5h8hes72)

For this coursework, you are not allowed to use the road accident data from week 8's practical. We strongly caution against replicating any examples from online tutorials, or any from the books recommended in lecture 8.
In keeping with our key tenets for GEOG0125, all analyses and the generation of the dashboard must be carried out in RStudio.

A key piece of advice when analysing data for your study population: if you are not able to obtain population counts disaggregated by age and sex for computing the expected number of your outcome, you can specify "n.strata = 1" in the R code when computing the expected number. For example:

    expectNumber <- expected(population = d$population, cases = d$outcome, n.strata = 1)

Note that it is not obligatory to use an entire country as a study area. You can use a sub-region of that country, provided it is delineated appropriately for generating the adjacency matrix.

2.4 Submission format

The final extended abstract should be submitted in PDF format, with a font size of 11 or 12 points. The report should have a maximum length of 1,500 words. The word count includes the background, data and methods, and results and discussion sections; it excludes the references at the end. Please note that your interpretation of results should be supported by the outputs from the RShiny dashboard. You may use figures: the maximum allowed is 8 in total (sub-figures are allowed). An example structure of the abstract would be:

1. Background (300 words)
2. Data and Methods (600 words)
3. Results and Discussion (600 words)
4. References

All R scripts (i.e., .Rmd or .R) used for the analyses and for generating the RShiny app in RStudio must be submitted separately as a single ZIP file. For reproducibility, the dataset behind the RShiny application must also be submitted. PLEASE MAKE SURE THE SUBMITTED CODE FOR THE APP & DATA ARE FULLY FUNCTIONAL. Any large datasets (e.g., .SHP and .CSV) that you require for these tasks should be uploaded separately, with a link provided within the codebook. We recommend uploading your dataset to OneDrive, creating a share link, and providing this link in your notebook.
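As a worked note on the "n.strata = 1" advice in Section 2.3, the following sketch shows what the expected number reduces to when there is a single stratum. It assumes that expected() performs indirect standardisation (as the SpatialEpi function of that name does); the symbols y_i, n_i, E_i, N, S and the threshold c are notation introduced here for illustration and do not come from the brief.

```latex
% Assumed notation: y_i = observed count in area i, n_i = population of area i,
% for areas i = 1, ..., N. With one stratum, a single overall rate is applied.
E_i = n_i \cdot \frac{\sum_{j=1}^{N} y_j}{\sum_{j=1}^{N} n_j},
\qquad
\text{observed-to-expected ratio}_i = \frac{y_i}{E_i}

% Exceedance probability for a chosen threshold c (c is an assumption; c = 1 is a
% common choice), estimated from S posterior draws RR_i^{(s)} of the relative risk:
\Pr(\mathrm{RR}_i > c \mid y) \approx \frac{1}{S} \sum_{s=1}^{S} \mathbf{1}\{\mathrm{RR}_i^{(s)} > c\}
```

For example, with 500 cases in a total population of 100,000, an area of 2,000 people has E_i = 2,000 × 500 / 100,000 = 10; if 15 cases were observed there, the observed-to-expected ratio is 1.5.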
3 Machine learning task

For the second part of this coursework, we would like you to identify an image dataset of outdoor scenes (urban or natural) to which you can apply a machine learning research problem. Examples of such a problem are using crowd-sourced imagery to classify urban scenes [1] (e.g., with or without shops or greenery) or using satellite imagery to predict the population density or wealth of an area [2] (e.g., LSOA socio-economic data). The aim of this task is to identify a research problem, describe its related works, set up a research pipeline, construct a machine learning model, and report and discuss the implications of the results.

The final deliverable for this task is an extended abstract of 1,500 words (excluding references). This may seem like a lot, but it really is not when you need to properly describe all the necessary parts of your research. Your extended abstract should contain the following:

3.1 Overview

3.1.1 Background and Related works

In this section, describe the research problem you want to study. It is important to describe why this research problem matters, what dataset you will use to study it, and which particular machine learning approach you plan to use. The justification is important here: explain why the data and methods you have chosen are appropriate for studying the problem. As part of the justification, you should describe and include some related works that address a similar research problem.

3.1.2 Dataset

Please describe the dataset you will be using, its source, how you collected the data, and how you prepared the dataset. For image datasets this would involve, for instance, data cleaning, removal of invalid data, data quality checking, data transformation and exploring (visualising) the data.

3.1.3 Methodology and Research Pipeline

In this section, describe the method of your research.
Here, describe in detail the machine learning model you have chosen and the hyperparameters of the model. Please also describe and draw a research pipeline covering the tasks you will be conducting. It is also important to describe the details of the experiments you will be running and why you made these specific decisions. However, you need to write this at a high level so that it does not become a process report. We strongly recommend looking at how researchers have done this in academic papers that involve similar methods.

3.1.4 Results

In this section, please report the results (training and test sets) of the machine learning model for the image regression or classification task you have identified. Please also interpret the results of the model. It is important to consider the following questions:

- How does the model perform?
- Is your model overfitting?
- Where are the errors coming from?
- What are the implications of the research?
- What are the limitations of the research?
- What are some potential steps for the future?

3.1.5 Conclusion

The final section should briefly conclude your report by answering your research question or explaining how your results relate to your research aim. Make sure that the conclusion links clearly to the research problem you introduced in the introduction. You can also mention limitations and suggestions for future research.

3.2 Submission format

The final extended abstract should be submitted in PDF format, with a font size of 11 or 12 points. The report should have a maximum length of 1,500 words. The total word count includes the title, introduction, related works, data, method, results, conclusion and captions, and excludes the bibliography at the end. The maximum number of figures is 8 in total (sub-figures are allowed). An example structure of the extended abstract could be:

1. Background and Related works
2. Data
3. Method
4. Results
5. Conclusion
6. References

Code should be submitted separately as a single ZIP file.
The code can be submitted as Jupyter notebook(s) or as a set of Python files. Any large datasets (e.g., images) that you require for these tasks should be uploaded separately, with a link provided within the codebook. We recommend uploading your dataset to OneDrive, creating a share link, and providing this link in your notebook.

Some tips:

1. Computation is an important factor to consider when running machine learning models.
2. Hundreds and often thousands of images are sometimes required for a simple image classification task. To reduce the need for training data, consider using pretrained models for feature extraction.
3. If the learning process still takes too long, consider using Google Colab to run the analysis. However, only move to Google Colab once your processes run on your local machine.
4. Use figures with captions when you want to elaborate a point.
5. Use tables when you want to summarise your results.
6. Remember to include in-text citations when you are using a specific model or method.

3.3 Example datasets

An example of an image dataset you could use is the scenicness dataset that was provided in the week 5 lab notebook and week 6 lab seminar. This dataset is a subset of the original Scenic-or-not dataset as used in [3]. You are not allowed to reuse the scenicness ratings. As such, you would need to propose a new research question using these images. Below are some example datasets you could consider using:

- Scenic-or-not images (note: you cannot reuse the scenicness ratings)
- Flickr images
- Google Earth Engine imagery
- Google Street View images

4 Submission details

You should submit both parts of the coursework as a single report through Turnitin on the course Moodle page, under the 'Assessment' tab. Your code should be submitted as a single ZIP file on the course Moodle page for each task. Two submission links will be available for you to upload your code for each of the tasks.
To be clear: this means that you will have to upload two ZIP files in total, one containing all the code for the first task and another containing all the code for the second task. Note: Failure to include your full code will incur a 10-point penalty. The submission deadline is May 3rd, 2022 at noon. Further details on the submission procedures will be available on Moodle.

4.1 Queries

A sub-channel has been created specifically for queries about the coursework. All related queries must be posted in this sub-channel; this is to address the likely overlap in questions that students may have, so that all students benefit from any clarification given.

Questions seeking clarification about, for instance, the wording of the task briefs or the format of submission will be answered. However, as this is an assessed piece of work, you may not ask questions that pertain directly to the coursework itself, e.g. "Is analysis X the best way to answer question 1a?" For the same reason, any collaboration or discussion of the coursework with anyone is strictly prohibited. The rules on plagiarism apply, and any cases of suspected plagiarism of other works, published or not, will be taken very seriously.

The deadline for questions is April 26th, 2022, i.e. 1 week before the submission deadline (May 3rd, 2022).

References

1. Law, S., Seresinhe, C. I., Shen, Y. & Gutierrez-Roig, M. Street-frontage-net: urban image classification using deep convolutional neural networks. Int. J. Geogr. Inf. Sci. 34, 681–707 (2020). https://doi.org/10.1080/13658816.2018.1555832
2. Jean, N. et al. Combining satellite imagery and machine learning to predict poverty. Science 353, 790–794 (2016).
3. Seresinhe, C. I., Preis, T. & Moat, H. S. Using deep learning to quantify the beauty of outdoor places. Royal Soc. open science 4, 170170 (2017).