STA 142A: Homework 3
Homework due in Canvas: 02/26 (Friday) at 11:59PM.
Please follow the instructions in Canvas regarding homeworks.
You are encouraged to discuss the problems with your classmates,
but copying homework constitutes a violation of the UC Davis
Code of Academic Conduct and appropriate action will be taken.
1. Smoothing Splines. Question 2 from page 298 and question 5 from page 299. (For any
plots involved, just a hand-drawn plot is sufficient)
2. Trees and Bagging. Question 1 from page 332 and question 5 from page 332. (For any
plots involved, just a hand-drawn plot is sufficient)
3. GAM + Splines. For this question, the pyGAM package will be useful.
In this question, we will do binary classification with multivariate input data. To handle
the multivariate nature, we will use a generalized additive model. Let X ∈ Rp represent the
input random variable and Y represent the output random variable for binary classification
(note we let Y ∈ {0, 1} instead of Y ∈ {−1, 1} as we typically did in class, because the pyGAM
package follows that convention). Let the conditional distributions be as follows:
(a) For even j, the jth-coordinate of X is distributed as
Xj |(Y = 1) is a t distribution with 1 degree of freedom with mean 2
Xj |(Y = 0) is a t distribution with 1 degree of freedom with mean 0.
(b) For odd j, the jth-coordinate of X is distributed as
Xj |(Y = 1) is an exponential distribution with λ = 1.
Xj |(Y = 0) is an exponential distribution with λ = 3.
and let P (Y = 1) = 0.5. Details about the t-distribution and the exponential distribution can be
found in the Wikipedia links here and here, respectively. You could use
np.random.standard_t, numpy.random.exponential, and np.random.binomial for this question.
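As a starting point, the data-generating process above can be sketched with NumPy alone. (The helper name generate_data and the 0-based column indexing are my assumptions, not part of the problem; flip the parity test if you read "even j" as 1-based.)

```python
import numpy as np

def generate_data(n, p):
    """Draw n samples (x_i, y_i) from the model above (a sketch, not official solution code).

    Even columns: t with 1 degree of freedom, shifted by 2 when y = 1.
    Odd columns:  exponential with rate lambda (NumPy's parameter is scale = 1/lambda).
    """
    y = np.random.binomial(1, 0.5, size=n)             # P(Y = 1) = 0.5
    X = np.empty((n, p))
    for j in range(p):
        if j % 2 == 0:                                 # "even j" (0-based here)
            shift = np.where(y == 1, 2.0, 0.0)
            X[:, j] = shift + np.random.standard_t(df=1, size=n)
        else:                                          # "odd j"
            scale = np.where(y == 1, 1.0, 1.0 / 3.0)   # scale = 1/lambda
            X[:, j] = np.random.exponential(scale=scale, size=n)
    return X, y

X, y = generate_data(100, 10)
```

The only subtlety is the exponential parameterization: numpy.random.exponential takes the scale 1/λ, so λ = 3 corresponds to scale = 1/3.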
(a) Let p = 10. Repeat the following procedure for 100 trials: Generate n = 100 training
data samples (x1, y1), . . . , (x100, y100) from the above model. Note that here each
xi ∈ Rp, and for all i, xi,j represents the jth coordinate of the ith training sample,
which follows the above generating process. Train a logistic generalized additive model
classifier on this training data (you could use LogisticGAM from the pyGAM package).
Generate n = 100 testing data samples from the same model. Note that you will know the
true labels in this testing data since you generated it. Plot a box-plot of the test errors.
What are the mean and variance of the test errors?
(Here, for each trial, the test error is defined as the number of misclassified samples on
the testing data. Also, when running the LogisticGAM command, there might be warnings
about non-convergence; please feel free to ignore such warnings. Finally, this experiment
might take some time to run (about 10 minutes on a reasonable laptop).)
(b) Repeat the above procedure with p = 30. Comment on the running time and test error
differences from the previous case.