
DATA7703 Assignment 3
2021 Semester 2, due 23:59 24 Sep
Instructions. Please read the instructions carefully — not following them may result in a
penalty. (a) Submit your solutions as a single PDF file on Blackboard. Go to Assessment,
Assignment 3 to submit. If you don’t know how to convert your file to a PDF, please search
for a guide online. You can submit as many times as you want before the deadline. The last
submission will be graded. (b) Write down your name and student number on the first
page of your solution report, and write down the question numbers for your solutions.
For programming questions, you are welcome to submit your code files or output files in a
separate zip file, but you must include both your code and relevant output in your submitted
PDF file. Excessive code output may be penalised. (c) Follow integrity rules, and provide
citations as needed. You can discuss with your classmates, but you are required to write
your solutions independently, and specify who you have discussed with in your solution. If
you do not know how to solve a problem, you can get 15% of the mark by writing down “I
don’t know”.
You are encouraged to keep your solutions concise — these questions require thought, not
long answers.
1. (15 marks) We consider linear combinations of threshold classifiers for classifying real
numbers in this question. Specifically, a threshold classifier for classifying real numbers is
one of the form h(x) = I(x ° t), where t is a real number, ° can be ≤, ≥, < or >, and the indicator
function I(·) takes value +1 if its argument is true, and -1 otherwise. For example, h(x) = I(x ≤ 5)
can be written as

    h(x) = { +1,  x ≤ 5,
             -1,  x > 5.
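As a quick illustration of this ±1 convention (our own sketch, not part of the assignment; the
function name threshold_classifier is ours), such a classifier can be written in Python as:

    def threshold_classifier(t, op):
        # returns h with h(x) = I(x op t), taking values +1 or -1
        tests = {'<=': lambda x: x <= t, '>=': lambda x: x >= t,
                 '<': lambda x: x < t, '>': lambda x: x > t}
        test = tests[op]
        return lambda x: +1 if test(x) else -1

    h = threshold_classifier(5, '<=')   # h(x) = I(x <= 5)
    print(h(3), h(7))                   # prints: 1 -1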
Given threshold classifiers h1, . . . , hn, we say that a linear combination f = a1h1 + . . . + anhn
of them represents a set R ⊆ ℝ if and only if the classifier g(x) = I(f(x) ≥ 0)
satisfies g(x) = I(x ∈ R).
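As a sanity check of this definition (again our own sketch, with hypothetical helper names), the
induced classifier g can be evaluated numerically for any choice of coefficients:

    def represented_classifier(coefs, classifiers):
        # g(x) = I(f(x) >= 0), where f = coefs[0]*classifiers[0] + ... + coefs[n-1]*classifiers[n-1]
        def g(x):
            f = sum(a * h(x) for a, h in zip(coefs, classifiers))
            return +1 if f >= 0 else -1
        return g

    # For example, with h(x) = I(x <= 5) and f = 2h, g labels exactly the points
    # of (-inf, 5] with +1, so f represents the set (-inf, 5].
    h = lambda x: +1 if x <= 5 else -1
    g = represented_classifier([2.0], [h])
    print(g(4), g(6))   # prints: 1 -1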
(a) (5 marks) For any a < b, consider the threshold classifiers h1(x) = I(x > a), h2(x) = I(x < b),
and h3(x) = I(x < +∞). What is the set represented by f = h1 + h2 + 0.1h3?

(b) (10 marks) Find a linear combination of threshold classifiers to represent the two intervals
(-2, -1) ∪ (1, 2).

2. (20 marks) We consider a variant of the standard bagging algorithm, which we call Wagging
(Weighted Aggregating), in this question. We only consider regression problems here. In the
standard bagging algorithm for regression, we train multiple models f1, . . . , fn and average their
outputs during prediction. In Wagging, we compute a weighted average of the predictions of the
individual models instead. Formally, we assign a weight wi ≥ 0 to fi, with w1 + . . . + wn = 1. If ŷi
is the prediction of fi, then Wagging predicts ŷ = w1ŷ1 + . . . + wnŷn. There are many possible ways
to choose w1, . . . , wn in Wagging, and we consider how their values affect bias and variance below.

In the remainder of this question, assume that ŷ1, . . . , ŷn are identically distributed with
Var(ŷi) = σ² for all i, and cov(ŷi, ŷj) = ρσ² for all 1 ≤ i ≠ j ≤ n.

(a) (5 marks) Show that Wagging has the same bias as each individual model.

(b) (5 marks) Express Var(ŷ) in terms of w1, . . . , wn, ρ and σ². In addition, for n = 2, use your
formula to evaluate Var(ŷ) when (w1, w2) equals (1, 0), (1/2, 1/2) and (0, 1) respectively.

(c) (10 marks) For any n ≥ 2, find weights w1, . . . , wn that minimize the variance of
ŷ = w1ŷ1 + . . . + wnŷn.

3. (30 marks) We consider random forests in this question, and study the effect of m, the size of
the random feature subset used in choosing the splitting point in the decision trees. Recall that
when constructing a decision tree in a random forest, at each node, instead of choosing the best
split from all p given features, we can first choose 1 ≤ m ≤ p features at random, and then choose
the best split among them.

(a) (5 marks) Load the California housing dataset provided in sklearn.datasets, and construct a
random 70/30 train-test split. Set the random seed to a number of your choice to make the split
reproducible. What is the value of p here?

(b) (5 marks) Train a random forest of 100 decision trees using default hyperparameters. Report
the training and test accuracies. What is the value of m used?

(c) (5 marks) Compute all the pairwise correlations between the test set predictions of the 100
trees, and report their average. Correlation refers to the Pearson correlation in this question.

(d) (5 marks) Repeat (b) and (c) for m = 1 to p, and tabulate the training and test accuracies,
and the average correlations, for all m values. In addition, plot the training and test accuracies
against m in a single figure, and plot the average correlation against m in another figure.

(e) (5 marks) Describe how the average correlation changes as m increases. Explain the observed
pattern.

(f) (5 marks) A data scientist claims that we should choose the m for which the average
correlation is smallest, because it gives us the maximum reduction in the variance, and thus the
maximum reduction in the expected prediction error. True or false? Justify your answer.

4. (35 marks) We consider bagging using the OOB error to automatically determine the number of
basis models to use in this question. Specifically, we keep adding one model trained on a
bootstrap sample to the bagging ensemble at a time, and stop only when one of the following
happens: (a) the number i of models in the ensemble reaches a pre-specified maximum number of
models, or (b) i is at least 10 and ẽ_i ≥ ẽ_{i-5}. Here the smoothed OOB error is defined as
ẽ_i = (1/5) ∑_{j=1}^{5} e_{i-j+1}, with e_i being the OOB error for the ensemble consisting of the
first i models.
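To make the indexing of the stopping rule concrete: the check can first be applied at i = 10,
where ẽ_10 = (e_6 + e_7 + e_8 + e_9 + e_10)/5 is compared against ẽ_5 = (e_1 + . . . + e_5)/5.
One possible transcription into Python (our own helper names, assuming the per-ensemble OOB
errors are stored in a list with errors[k-1] holding e_k) is:

    import numpy as np

    def smoothed_oob(errors, i):
        # e_tilde_i = (1/5) * sum_{j=1}^{5} e_{i-j+1}; errors[k-1] holds e_k
        return np.mean([errors[i - j] for j in range(1, 6)])

    def stopping_criterion(errors, i):
        # true once i >= 10 and the smoothed OOB error has stopped decreasing
        return i >= 10 and smoothed_oob(errors, i) >= smoothed_oob(errors, i - 5)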
To help you implement the above enhanced bagging algorithm, a partial implementation with some
comments has been provided below and also in the file bagging.py.

from sklearn.base import clone
import numpy as np


class OOBaggingClassifier:
    def __init__(self, base_estimator, n_estimators=200):
        '''
        Parameters
        ----------
        base_estimator: a probabilistic classifier that implements the
            predict_proba function, such as DecisionTreeClassifier
        n_estimators: the maximum number of estimators allowed.
        '''
        self.base_estimator_ = base_estimator
        self.n_estimators = n_estimators
        self.estimators_ = []
        self.oob_errors_ = []

    def fit(self, X, y, random_state=None):
        if random_state:
            np.random.seed(random_state)
        self.best_n = 0
        probs_oob = None
        for i in range(self.n_estimators):
            estimator = clone(self.base_estimator_)

            # construct a bootstrap sample

            # train on bootstrap sample

            # compute OOB error
            oob_error = ...  # replace ... with your code

            # save the OOB error and the new model
            self.oob_errors_.append(oob_error)
            self.estimators_.append(estimator)

            # stop early if smoothed OOB error increases (for the purpose of
            # this problem, we don't stop training when the criterion is
            # fulfilled, but simply set self.best_n to (i+1)).
            if (self.best_n == 0) and (OOB criterion):  # replace OOB criterion with your code
                self.best_n = (i+1)

    def errors(self, X, y):
        '''
        Parameters
        ----------
        X: an input array of shape (n_sample, n_features)
        y: an array of shape (n_sample,) containing the classes for the
            input examples

        Returns
        -------
        error_rates: an array of shape (n_estimators,), with error_rates[i]
            being the error rate of the ensemble consisting of the first
            (i+1) models.
        '''
        error_rates = []
        # compute all the required error rates
        return error_rates

    def predict(self, X):
        '''
        Parameters
        ----------
        X: an input array of shape (n_sample, n_features)

        Returns
        -------
        y: an array of shape (n_samples,) containing the predicted classes
        '''
        probs = None
        for estimator in self.estimators_:
            p = estimator.predict_proba(X)
            if probs is None:
                probs = p
            else:
                probs += p
        return np.argmax(probs, axis=1)

Note that in this question, we only consider bagging of probabilistic classifiers, and we use a
different prediction method from the majority vote method discussed in lecture: we first use each
individual classifier to predict the class distributions, then we average over all the predicted
distributions, and predict the most likely class. This is implemented in the predict function, and
you should use the same prediction method when computing the OOB prediction. If an example is not
an OOB example for any of the basis models, predict the first class.

(a) (5 marks) Load the digits dataset provided in sklearn.datasets, and construct a random 70/30
train-test split. Set the random seed to a number of your choice to make the split reproducible.

(b) (5 marks) A naive way to compute the OOB error is to store all bootstrap samples, and then use
the definition of OOB error discussed in lecture to compute it. This requires a lot of memory for
large datasets. Describe a method that computes the OOB error e_i using the i-th bootstrap sample,
but not the previous bootstrap samples; this allows you to throw away a bootstrap sample once the
i-th model has been trained and e_i has been computed. Introduce additional variables or data
structures if needed, but the amount of memory used should be at most within a constant factor of
that for the original dataset size (the constant factor needs to be independent of the number of
basis models used).
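For reference, a bootstrap sample is usually drawn with replacement using np.random.choice; the
following standalone sketch (placeholder data and variable names are ours) shows the sample and
the corresponding OOB examples:

    import numpy as np

    n = 100                                    # placeholder training set size
    X_train = np.random.randn(n, 8)            # placeholder features, for illustration only
    y_train = np.random.randint(0, 3, size=n)  # placeholder labels

    idx = np.random.choice(n, size=n, replace=True)  # indices drawn with replacement
    X_boot, y_boot = X_train[idx], y_train[idx]      # the bootstrap sample
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False                            # True exactly for the OOB examples
    print(oob_mask.sum(), "OOB examples")            # roughly n/e of them on average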
(c) (10 marks) Implement the fit function to support the method for using the OOB error to
determine the number of models as described above.

(d) (5 marks) Implement the errors function, which takes in a labeled dataset and returns the
error rates for the ensembles consisting of the first i basis models, for 1 ≤ i ≤ n, where n is
the number of models trained.

(e) (5 marks) Train your OOBaggingClassifier on the digits dataset, using a maximum of 200
decision trees. Plot the OOB error and the test error against the number of models (from 1 to
200). What is the number of basis models chosen by the OOB error method?

(f) (5 marks) Repeat (e) two more times using different random seeds. Comment on the performance
of the OOB error method for selecting the number of models.
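For orientation only, once fit and errors are implemented, a run on the digits data might look
like the following sketch (assuming the completed class lives in bagging.py; the seed and variable
names are ours):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    import matplotlib.pyplot as plt
    import numpy as np
    from bagging import OOBaggingClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=7)   # any fixed seed keeps the split reproducible

    bagger = OOBaggingClassifier(DecisionTreeClassifier(), n_estimators=200)
    bagger.fit(X_train, y_train, random_state=7)
    print("models selected by the OOB method:", bagger.best_n)

    test_errors = bagger.errors(X_test, y_test)
    n_models = np.arange(1, len(test_errors) + 1)
    plt.plot(n_models, bagger.oob_errors_, label="OOB error")
    plt.plot(n_models, test_errors, label="test error")
    plt.xlabel("number of models")
    plt.ylabel("error rate")
    plt.legend()
    plt.show()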