Homework 1 for CSI 531 (Part 2)
Due: Mar 8, 23:59
All homeworks are individual assignments. This means: write your own solutions and do not
copy code/solutions from peers or online. Should academic dishonesty be detected, the proper
reporting protocols will be invoked (see Syllabus for details).
Instructions: Submit two files. One should be a write-up of all solutions and observations, as LastnameFirstnameSolution.pdf. The second should be an archive, LastnameFirstnameCode.zip, containing code and any results files.
1 [20 pts.] Kernel methods
Consider the problem of finding the most dissimilar diametric pair (MDDP): this is a pair of data points that are
dissimilar from the mean and also dissimilar from each other. Below is an algorithm that would find such a pair
given a data matrix D:
Algorithm 1: MDDP(D)
Result: a, b – the most dissimilar diametric points in D
Compute the data mean μ = mean(D);
s = +∞;
for i in (1 ... n) do
    for j in (i+1 ... n) do
        temp = x_i^T μ + x_j^T μ + x_i^T x_j;
        if temp < s then
            s = temp;
            a = x_i;
            b = x_j;
        end
    end
end
The algorithm computes the sum of inner products x_i^T μ + x_j^T μ + x_i^T x_j for each pair of points and returns the pair with the lowest such quantity.
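For reference, a minimal NumPy sketch of the algorithm above (names and structure are illustrative, not a prescribed implementation):

import numpy as np

def mddp(D):
    """Return the most dissimilar diametric pair in D (rows are data points)."""
    mu = D.mean(axis=0)                    # data mean
    n = D.shape[0]
    s, a, b = np.inf, None, None
    for i in range(n):
        for j in range(i + 1, n):
            # sum of inner products with the mean and with each other
            temp = D[i] @ mu + D[j] @ mu + D[i] @ D[j]
            if temp < s:
                s, a, b = temp, D[i], D[j]
    return a, b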
(a)[5 pts] Demonstrate the execution of this algorithm on the following data matrix of 2D instances:

D = [ 0  1
      1  3
      5  0
      2  4 ]
Show the steps and the resulting MDD pair of points.
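To sanity-check a hand trace, the sketch above can be run directly on this matrix (mddp as defined earlier):

D = np.array([[0, 1], [1, 3], [5, 0], [2, 4]])
a, b = mddp(D)
print(a, b)   # the MDD pair found by the algorithm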
(b)[15 pts] As we discussed in class, sometimes we would like to kernelize methods to handle non-linearity in data. Provide pseudo-code for a kernel version of the MDDP algorithm above. The goal is to kernelize the algorithm for an arbitrary kernel. Hint: Assume that you can compute the kernel matrix K corresponding to some mapping φ(), and then use the basic kernel operations we discussed in class and also in the book to derive the steps of MDDP in terms of elements of K.
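As a reminder of those basic operations (standard identities, assuming K_ij = φ(x_i)^T φ(x_j)), inner products with feature-space vectors reduce to entries of K:

\[
\phi(x_i)^T \phi(x_j) = K_{ij},
\qquad
\phi(x_i)^T \mu_\phi = \frac{1}{n}\sum_{k=1}^{n} K_{ik},
\quad \text{where } \mu_\phi = \frac{1}{n}\sum_{k=1}^{n} \phi(x_k).
\]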
2 [10 pts.] Orthogonality of Error in Regression:
Prove that Ŷ^T ε = 0, where Ŷ is the predicted response and ε = Y − Ŷ is the error between the actual and predicted response. Hint: Use the solution for the predicted response as a transformation of Y through the hat matrix.
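For reference, in the standard least-squares setup with design matrix X (notation assumed here, not stated in the problem), the hat matrix transformation is:

\[
\hat{Y} = H Y, \qquad H = X (X^T X)^{-1} X^T,
\]

and H is symmetric (H^T = H) and idempotent (H^2 = H).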
3 [25 pts. + 5 pts extra] Regression to understand Bike sharing:
For this task, you will use data about various factors (e.g., season, day of the week, weather, etc.) to predict the daily number of bike rentals in a city. The data includes the total count of rentals (cnt column), which will be the response variable, and you will use the following columns as predictors:
season, yr, mnth, holiday, weekday, workingday, weathersit, temp, atemp, hum, windspeed
Go over the included README file to get more insight about the meaning of the columns. Feel free to prepare the data in a simpler text format before you use code. The goal will be to train linear regression models for the total count of rentals (cnt column) as a function of the above predictors.
(a) [8pts] Use scikit-learn’s linear regression, linear_model.LinearRegression, to learn an unregularized model. Report (1) the residual error (SSE) of the model and (2) the corresponding coefficient for each predictor variable.
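A minimal sketch of the fitting step, assuming the data has been exported to a CSV file named day.csv with the listed columns (the file name and preprocessing are assumptions to adapt):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

predictors = ["season", "yr", "mnth", "holiday", "weekday", "workingday",
              "weathersit", "temp", "atemp", "hum", "windspeed"]
df = pd.read_csv("day.csv")                  # assumed file name
X, y = df[predictors], df["cnt"]

lr = LinearRegression().fit(X, y)
sse = np.sum((y - lr.predict(X)) ** 2)       # residual sum of squared errors
print("SSE:", sse)
print(dict(zip(predictors, lr.coef_)))       # coefficient per predictor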
(b) [8pts] Use scikit-learn’s ridge regression, linear_model.Ridge, to learn a regularized model with alpha = 1. Report (1) the residual error (SSE) of the model and (2) the corresponding coefficient for each predictor variable. Is ridge better or worse than non-regularized regression in terms of training SSE? Explain why, and discuss the expected advantages of ridge (if any).
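The ridge fit differs only in the estimator (reusing X, y, and predictors from the sketch above):

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0).fit(X, y)
ridge_sse = np.sum((y - ridge.predict(X)) ** 2)
print("Ridge SSE:", ridge_sse)
print(dict(zip(predictors, ridge.coef_)))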
(c) [9pts] Do weather variables help? Train a linear regression using only the weather predictors:
temp, atemp, hum, windspeed
Report (1) the residual error (SSE) of the model and discuss what you can conclude from the results regarding the usefulness of the weather variables.
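Restricting to the weather columns is a small change (again reusing df and y from the earlier sketch):

weather = ["temp", "atemp", "hum", "windspeed"]
lr_w = LinearRegression().fit(df[weather], y)
print("Weather-only SSE:", np.sum((y - lr_w.predict(df[weather])) ** 2))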
(d) [Extra 5pts] Learn about an efficient and effective method to perform non-linear regression: sklearn.tree.DecisionTreeRegressor. Train a DecisionTreeRegressor model using both the population and the Boolean predictors. Discuss how it compares to the other 3 models in terms of its residual error. Investigate the predictor variable importances stored in feature_importances_ after you train the model. Discuss your findings.
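A sketch of the tree fit and the importance inspection (the predictor set here reuses the full list from part (a) as an assumption; adjust to the predictors the problem intends):

from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=0).fit(X, y)
tree_sse = np.sum((y - tree.predict(X)) ** 2)
print("Tree SSE:", tree_sse)
print(dict(zip(predictors, tree.feature_importances_)))  # importance per predictor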