STAT 7008-Python-Assignment 3

STAT 7008 – Assignment 3

Note: A3 is 20% of the overall assessment. The 100 points in A3 will be rescaled to 20% of the final score.

Web Scraping

1. (25 points) Crawl information from https://www.sciencedirect.com
(1) (13 points) Crawl key information about all articles published in 2022 from https://www.sciencedirect.com/journal/journal-of-econometrics/issues, including year, volume, article content, title, authors and pages. Crawl volumes 226 to 230 only.
(2) (6 points) Remove "\xa0" (the non-breaking space character) from volume_name and store the crawled data in a pandas DataFrame.
(3) (6 points) Filter out the records whose author is null, then find the top 10 authors who published the most articles.
Hint:
i. Click the button of the targeted item.
ii. Pass the HTML to BeautifulSoup and get all links.
iii. Use requests to get the article content, title, authors and pages for each block.
For this example:
article content: Research article
title: Identification in nonparametric models for dynamic treatment effects
authors: Sukjin Han
pages: Pages 132-147

Scikit-learn

2. (10 points) Handwritten digits dataset loading and preprocessing
(1) (2 points) Load the digits data with load_digits.
(2) (4 points) Use MinMaxScaler to normalize the covariates X.
(3) (4 points) Split the data into training and test sets with test_size=0.2 and random_state=2020.

3. (15 points) Following question 2, fit the model specified below with different hyperparameters and report the performance.
(1) (7 points) Fit the naive Bayes model MultinomialNB on the digits training set with different values of the parameter alpha, α ∈ {1, 2, …, 20}.
(2) (4 points) Record the accuracy score on the test set for each α.
(3) (4 points) Draw a line plot of the accuracy scores versus α.

4. (15 points) Following question 2, apply dimensionality reduction methods to the digits dataset.
(1) (3 points) Fit a Principal Component Analysis (PCA, n_components=2) model to the digits training set for dimensionality reduction.
(2) (3 points) Apply the model from (1) to the training and test sets and compute the 2-dimensional embedded training and test sets.
(3) (3 points) Fit a nearest neighbor classifier (KNN, n_neighbors=3) on the embedded training set. Compute the nearest neighbor accuracy on the embedded test set, plot the projected test set points and show the evaluation score.
(4) (6 points) Use Neighborhood Components Analysis (NCA, n_components=2) for dimensionality reduction and repeat (1), (2) and (3).
Note: output the results in the following image format; no output is needed for (1) and (2).

Computer vision

5. (18 points) Face and Eye Detection
(1) (12 points) Write the code to detect the faces and the eyes in face.jpg. Draw a red rectangle around each face and a green rectangle around each eye.
(2) (6 points) Suppose we want to open the front camera for video capture and perform face and eye detection. How can the code above be modified?
Hint: you may use the auxiliary .xml files and the detection algorithm based on Haar-like features provided by OpenCV.

Natural language processing

6. (17 points) Word embedding (Skip-gram): see the attached Jupyter notebook with partially finished code, wb_partial_code.ipynb
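
Illustration for question 1, parts (2) and (3). The cleaning and counting steps can be tried offline before any crawling is done. The sketch below uses only the standard library and a few made-up rows; the row layout, the author names and the helper clean_volume_name are assumptions for illustration, not the format the site actually returns. In pandas, the same steps would typically map to Series.str.replace and Series.value_counts.

```python
from collections import Counter

# Hypothetical rows standing in for crawled data; real values will differ.
rows = [
    {"volume_name": "Volume\xa0226", "authors": ["Author A", "Author B"]},
    {"volume_name": "Volume\xa0227", "authors": None},  # no author listed
    {"volume_name": "Volume\xa0228", "authors": ["Author A"]},
]


def clean_volume_name(name):
    """Replace the non-breaking space (\\xa0) with an ordinary space."""
    return name.replace("\xa0", " ")


# Step 1: clean the volume_name field of every row.
cleaned = [{**row, "volume_name": clean_volume_name(row["volume_name"])} for row in rows]

# Step 2: drop rows whose authors field is missing or empty.
with_authors = [row for row in cleaned if row["authors"]]

# Step 3: count one publication per author per article and keep the top 10.
counts = Counter(author for row in with_authors for author in row["authors"])
top_authors = counts.most_common(10)

print(cleaned[0]["volume_name"])
print(top_authors)
```

Storing authors as a list per article is one possible choice; if the crawled value is a single string, it would need to be split first.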
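
Illustration for questions 2 and 3. The preprocessing pipeline and a single model fit can be sketched as follows, assuming scikit-learn is installed. The sketch follows the order given in question 2 (scale, then split) and fits MultinomialNB for one value of alpha only; the variable names are illustrative, and the loop over α, the recording of scores and the plot are left to the reader.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Load the 8x8 handwritten digit images as a flat feature matrix.
X, y = load_digits(return_X_y=True)

# Rescale every feature to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X)

# Hold out 20% of the samples as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=2020
)

# Fit a multinomial naive Bayes model for one smoothing value.
model = MultinomialNB(alpha=1.0)
model.fit(X_train, y_train)

# score() returns the mean accuracy on the given data.
accuracy = model.score(X_test, y_test)
print(f"alpha=1.0, test accuracy={accuracy:.3f}")
```

MultinomialNB requires non-negative features, which the [0, 1] scaling preserves.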
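
Illustration for question 6. Skip-gram training data consists of (center word, context word) pairs drawn from a sliding window over the text. The sketch below shows one common way to generate such pairs; the function skipgram_pairs, the window parameter and the sample sentence are assumptions for illustration and are not taken from wb_partial_code.ipynb.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

With window=1, each word is paired only with its immediate neighbors, so the four-word sentence yields six pairs.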