Python-ZHVZ9BVRP

2021/4/28 HW4AnomalyDetectionBaysianNet.ipynb – Colaboratory
https://colab.research.google.com/drive/1ogA8xZHVZ9BVRP-ZxlPlXjYyXk4IjFH1#scrollTo=5bvgIil0dMZW&printMode=true 1/4
Question 1 (40 points)
In this question, you will model traffic counts in Pittsburgh using Gaussian process (GP) regression.
The included dataset, “PittsburghTrafficCounts.csv”, represents the average daily traffic counts
computed by traffic sensors at over 1,100 locations in Allegheny County, PA. The data was collected
from years 2012-2014 and compiled by Carnegie Mellon University’s Traffic21 Institute; we have the
longitude, latitude, and average daily count for each sensor.
Given this dataset, your goal is to learn a model of traffic count as a function of spatial location. To
do so, fit a Gaussian process regression model to the observed data. While you can decide on the
precise kernel specification, you should try to achieve a good model fit, as quantified by a log
marginal likelihood value greater than (i.e., less negative than) -1400. Here are some hints for
getting a good model fit:
- We recommend that you take the logarithm of the traffic counts, and then subtract the mean
  of this vector, before fitting the model.
- Since the data is noisy, don’t forget to include a noise term (WhiteKernel) in your model.
- When fitting a GP with RBF kernel on multidimensional data, you can learn a separate length
  scale for each dimension, e.g., length_scale=(length_scale_x, length_scale_y).
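The three hints above can be sketched together as follows. This is a minimal illustration, not the assignment's solution: it assumes scikit-learn is available, and it uses a small synthetic dataset (the arrays `X` and `counts` are made up here) in place of the real CSV. The initial length scales and noise level are arbitrary starting guesses.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Synthetic stand-in for (longitude, latitude) and traffic counts (assumption:
# the real data would come from PittsburghTrafficCounts.csv instead).
X = rng.uniform([-80.3, 40.3], [-79.8, 40.7], size=(80, 2))
counts = np.exp(8.0 + np.sin(10 * X[:, 0]) + np.cos(10 * X[:, 1])
                + 0.3 * rng.standard_normal(80))

# Hint 1: take the log of the counts, then subtract the mean.
y = np.log(counts)
y = y - y.mean()

# Hints 2 and 3: anisotropic RBF (one length scale per dimension) plus a noise term.
kernel = RBF(length_scale=(0.1, 0.1)) + WhiteKernel(noise_level=0.5)
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=2, random_state=0)
gp.fit(X, y)

# Optimized kernel and the log marginal likelihood at the fitted hyperparameters.
print(gp.kernel_)
lml = gp.log_marginal_likelihood(gp.kernel_.theta)
print(lml)
```

The fitted kernel is available as `gp.kernel_`. The log marginal likelihood printed here refers to the synthetic data only and says nothing about the -1400 target for the real dataset.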
Your Python code should provide the following five outputs:
1) The kernel after parameter optimization and fitting to the observed data. (10 pts)
2) The log marginal likelihood of the training data. (5 pts)
3) Show a 2-D plot of the model’s predictions over a mesh grid of longitude/latitude (with color
corresponding to the model’s predictions) and overlay a 2-D scatter plot of sensor locations (with
color corresponding to the observed values). (10 pts)
4) What percentage of sensors have average traffic counts more than two standard deviations
higher or lower than the model predicts given their spatial location? (5 pts)
5) Show a 2-D scatter plot of the sensor locations, with three colors corresponding to observed
values a) more than two standard deviations higher than predicted, b) more than two standard
deviations lower than predicted, and c) within two standard deviations of the predicted values. (10
pts)
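As a rough sketch of the bookkeeping behind outputs 3 to 5, the helpers below build a longitude/latitude mesh grid and label each sensor as high, low, or normal relative to a two-standard-deviation band. The function names `make_grid` and `flag_anomalies` and the toy arrays are hypothetical; in practice the predicted mean and standard deviation would come from the fitted model (for example, scikit-learn's `gp.predict(X, return_std=True)`).

```python
import numpy as np

def make_grid(lon, lat, n=50):
    """Return mesh arrays and an (n*n, 2) array of grid points covering the data."""
    lon_g, lat_g = np.meshgrid(np.linspace(lon.min(), lon.max(), n),
                               np.linspace(lat.min(), lat.max(), n))
    points = np.column_stack([lon_g.ravel(), lat_g.ravel()])
    return lon_g, lat_g, points

def flag_anomalies(y_obs, y_pred, y_std, n_sigma=2.0):
    """Label each observation and return the percentage outside the band."""
    z = (y_obs - y_pred) / y_std  # standardized residuals
    labels = np.where(z > n_sigma, "high",
                      np.where(z < -n_sigma, "low", "normal"))
    pct_outside = 100.0 * np.mean(labels != "normal")
    return labels, pct_outside

# Toy values (assumption: stand-ins for observed and predicted log counts).
y_obs = np.array([0.0, 2.5, -3.0, 0.5])
y_pred = np.zeros(4)
y_std = np.ones(4)
labels, pct_outside = flag_anomalies(y_obs, y_pred, y_std)
print(labels, pct_outside)

lon = np.array([-80.28, -79.84])
lat = np.array([40.36, 40.62])
lon_g, lat_g, grid_points = make_grid(lon, lat, n=5)
```

The `labels` array can be mapped to three colors for the scatter plot in output 5, and predictions at `grid_points` can be reshaped to `lon_g.shape` for the color mesh in output 3.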
MLC HW 4
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/gdrive')
Data1 = pd.read_csv('gdrive/My Drive/PittsburghTrafficCounts.csv')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount(“/con
# Log-transform the counts, then inspect the mean of the log values
Data1['log'] = np.log(Data1['AvgDailyTrafficCount'])
Data1['log'].mean()
8.408342585887237
       Longitude   Latitude  AvgDailyTrafficCount       log
0     -80.278366  40.468606                  84.0  4.430817
1     -80.162117  40.384598                  95.0  4.553877
2     -80.221205  40.366778                  97.0  4.574711
3     -80.142455  40.622084                 111.0  4.709530
4     -80.131975  40.544915                 125.0  4.828314
…              …          …                     …         …
1110  -79.843684  40.498619               13428.0  9.505097
1111  -79.926842  40.425383               13713.0  9.526100
1112  -80.065730  40.397582               13822.0  9.534017
1113  -79.863848  40.429878               14172.0  9.559023
1114  -79.848609  40.479233               14891.0  9.608512
1115 rows × 4 columns
Data1
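The recommended next step, subtracting the mean from the log counts, might look like the sketch below. It uses a toy DataFrame built from the first five counts shown in the table; the names `df` and `log_centered` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame using the first five counts from the table (assumption: the real
# notebook would operate on the full Data1 frame instead).
df = pd.DataFrame({"AvgDailyTrafficCount": [84.0, 95.0, 97.0, 111.0, 125.0]})

df["log"] = np.log(df["AvgDailyTrafficCount"])
# Center the log counts so the regression target has zero mean.
df["log_centered"] = df["log"] - df["log"].mean()
print(df)
```

Centering matters because a GP regressor with a zero prior mean otherwise has to explain the overall offset (about 8.4 in log units for the full dataset) through the kernel.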