R – ECON3389 Data Analysis

Case Study and Course Project
Anatoly Arlashin
Boston College

ECON3389 ML in Economics | Fall’20 Lecture 09: Case Study and Course Project

Case Study: Agenda
- Add a real-life story to the theory from lectures and to model estimation in R.
- Broaden your understanding of the benefits of ML in economic/business applications.
- Practice creating summary reports and video presentations for the course project.

Case Study: Tips & Tricks
- A case study is NOT simply a summary of an article or blog post. Instead, it should be a review of the chosen case based on the agenda of our course, i.e. the use of ML in Economics.
- Think of your case as a presentation given in front of you, and try to come up with any and all questions you might ask during such a presentation.
  - You will likely not find answers to those questions in the original article or blog post, but you can try answering them yourself.
- Of course, if your case involves ML models or techniques that we have not yet covered in class, make sure to explain very briefly what they are.

Course Project: Agenda
- The course project is a full-scale research project and requires a substantial amount of work to complete.
- Unlike the case study, the project requires you to do all the work: find a dataset, formulate a research agenda, apply your knowledge of ML methods to build a reliable inference/prediction model, and so on.
- You will be tasked with basic data analysis (summary statistics, visual plots), statistical modeling (estimating a model using R), writing a research paper, and creating a video presentation of your results.
- Important: you should start working on the project as soon as possible, and keep working on it on a regular basis. Rushing everything in the last couple of days will likely produce inferior results.
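
As a rough illustration of the first two tasks listed under Course Project: Agenda (summary statistics, a plot, and estimating a model in R), here is a minimal sketch. It uses R's built-in `mtcars` data purely as a stand-in, and the choice of `mpg` as the outcome is an assumption for illustration; your project dataset, outcome variable, and predictors will differ.

```r
# Stand-in data: the built-in mtcars dataset (an assumption for illustration only)
data(mtcars)
cars <- mtcars

# Treat transmission type as a factor variable (0 = automatic, 1 = manual)
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Basic data analysis: summary statistics and a simple scatter plot
summary(cars[, c("mpg", "wt", "hp")])
plot(cars$wt, cars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Statistical modeling: a linear regression estimated with lm()
fit <- lm(mpg ~ wt + hp + am, data = cars)
summary(fit)
```

The same three steps (summarize, plot, estimate) apply to any dataset once the outcome variable has been chosen.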

Course Project: Data
- Each group can choose any dataset for its project: one of the four datasets I suggest or any other dataset.
- Unlike the case study, there is no restriction on how many groups use the same dataset.
- The four datasets available through Canvas are:
  - Iowa liquor sales.
  - US baseball and basketball salaries.
  - Personal income and socio-demographic attributes.
- You are free to use any other dataset, as long as it has at least 1000 observations across at least 10 variables, but you do need to confirm the chosen dataset with me first.
  - If using Kaggle, make sure not to fall into the trap of repeating someone’s steps from one of Kaggle’s challenges.

Course Project: Research Question
- The specifics of the research question depend on the nature of the data, but in general you are required to do two things: build an inference/causal analysis model and build a pure prediction model.
- For both models you will need to choose the same outcome variable and use the rest of the variables as your predictors (explanatory variables).
- The inference model will likely not be too complicated: linear regression with a few non-linear terms and/or interactions with factor variables.
- The prediction model, however, can be as complex as you like: polynomial regression, random forests, neural nets, etc.

Course Project: Building a Model
- Even for relatively simple linear models you will need to make educated decisions about which variables to include in the model’s equation, and in which form.
  - If you have factor variables, then the most flexible approach will require interacting all factor variables with all non-factor ones, which may lead to hundreds of regressors.
  - On the other hand, for the inference model there may be no meaningful interpretation for including all possible combinations, and thus you will have to balance additional regressors vs ZCM vs interpretability.
- Generally speaking, whenever choosing between competing models, you should always use a train/test split of your data.
  - This is especially important for the pure prediction model, where overfitting could be a major issue.

Course Project: Tips & Tricks
- Start working on the course project ASAP.
- Anywhere between 50% and 80% of all the work will be data management and analysis in R, where it is easy to get stuck on an issue for hours, if not days.
- The earlier you start working, the more opportunities you will have to ask me questions and get feedback.
- Spread the workload across all group members: one person can do the general data summary (tables, charts), another can work on the best inference model, and yet another on the best prediction model.
- Your video presentation should contain the bulk of your findings, but it is also something that I and your classmates will comment on, giving you a chance to fix any spotted issues before submitting the final paper.
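
The train/test comparison recommended under Course Project: Building a Model could look roughly like the sketch below. It is only an illustration under stated assumptions: the built-in `mtcars` data stands in for a project dataset, the 70/30 split and the seed are arbitrary choices, and the two candidate formulas are hypothetical examples of a simple model versus a more flexible one with factor interactions and a quadratic term.

```r
set.seed(3389)  # arbitrary seed so the random split is reproducible

# Stand-in data (assumption): built-in mtcars, with transmission as a factor
data(mtcars)
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Hold out roughly 30% of the rows as a test set
n <- nrow(cars)
test_idx <- sample(n, size = round(0.3 * n))
train <- cars[-test_idx, ]
test  <- cars[test_idx, ]

# Candidate 1: simple additive linear model
fit_simple <- lm(mpg ~ wt + hp + am, data = train)

# Candidate 2: more flexible model, interacting the factor with the
# continuous variables and adding a quadratic term in weight
fit_flex <- lm(mpg ~ (wt + hp) * am + I(wt^2), data = train)

# Root mean squared prediction error on data not used for estimation
rmse <- function(fit, newdata) {
  sqrt(mean((newdata$mpg - predict(fit, newdata = newdata))^2))
}

# Compare the candidates on the held-out test set
c(simple = rmse(fit_simple, test), flexible = rmse(fit_flex, test))
```

The model with the lower test-set error is the better predictor on unseen data. A flexible model that fits the training rows closely but does worse on the test rows is a sign of overfitting.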