DATA7703 Assignment 3
2021 Semester 2, due 23:59 24 Sep
Instructions. Please read the instructions carefully — not following them may result in a
penalty. (a) Submit your solutions as a single PDF file on Blackboard. Go to Assessment,
Assignment 3 to submit. If you don’t know how to convert your file to a PDF, please search
for a guide online. You can submit as many times as you want before the deadline. The last
submission will be graded. (b) Write down your name and student number on the first
page of your solution report, and write down the question numbers for your solutions.
For programming questions, you are welcome to submit your code files or output files in a
separate zip file, but you must include both your code and relevant output in your submitted
PDF file. Excessive code output may be penalised. (c) Follow integrity rules, and provide
citations as needed. You can discuss with your classmates, but you are required to write
your solutions independently, and specify who you have discussed with in your solution. If
you do not know how to solve a problem, you can get 15% of the mark by writing down “I
don’t know”.
You are encouraged to keep your solutions concise — these questions require thought, not
long answers.
1. (15 marks) We consider linear combinations of threshold classifiers for classifying real
numbers in this question. Specifically, a threshold classifier for classifying real numbers is
one of the form f(x) = I(x ° t), where ° can be ≤, ≥, < or >, and the indicator function
I(·) takes value +1 if its argument is true, and -1 otherwise. For example, f(x) = I(x ≤ 5)
can be written as

    f(x) = +1 if x ≤ 5,   f(x) = -1 if x > 5.

Given threshold classifiers f_1, . . . , f_n, we say that a linear combination f = w_1 f_1 + . . . + w_n f_n
of them represents a set A ⊆ ℝ if and only if the classifier g(x) = I(f(x) ≥ 0)
satisfies g(x) = I(x ∈ A).
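The definitions above can be sketched in code. This is only an illustration of the notation, with our own helper names (I, f1, f, g); the set represented here is A = (−∞, 5], not an answer to any part below.

```python
def I(cond):
    # indicator function: +1 if the argument is true, -1 otherwise
    return 1 if cond else -1

# threshold classifier f1(x) = I(x <= 5)
def f1(x):
    return I(x <= 5)

# trivial linear combination f = 1 * f1
def f(x):
    return 1.0 * f1(x)

# induced classifier g(x) = I(f(x) >= 0)
def g(x):
    return I(f(x) >= 0)

# g represents A = (-inf, 5]: g(x) agrees with I(x in A) everywhere
for x in [-3.0, 0.0, 5.0, 5.1, 10.0]:
    assert g(x) == I(x <= 5)
```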
(a) (5 marks) For any a < b, consider the threshold classifiers f_1(x) = I(x > a), f_2(x) =
I(x < b), and f_3(x) = I(x < +∞). What is the set represented by f = f_1 + f_2 + 0.1 f_3?
(b) (10 marks) Find a linear combination of threshold classifiers to represent the two
intervals (−2, −1) ∪ (1, 2).
2. (20 marks) We consider a variant of the standard bagging algorithm, which we call Wag-
ging (Weighted Aggregating), in this question. We only consider regression problems here.
In the standard bagging algorithm for regression, we train multiple models f_1, . . . , f_n and
average their outputs during prediction. In Wagging, we compute a weighted average of
the predictions of the individual models instead. Formally, we assign a weight w_i ≥ 0 to
f_i with ∑_{i=1}^{n} w_i = 1. If ŷ_i is the prediction of f_i, then Wagging predicts
ŷ = ∑_{i=1}^{n} w_i ŷ_i.
There are many possible ways to choose w_1, . . . , w_n in Wagging, and we consider how
their values affect bias and variance below.
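As a quick numerical illustration of the Wagging prediction rule (toy numbers of our own; this is not part of any answer), the weighted average with equal weights recovers standard bagging's plain average:

```python
import numpy as np

preds = np.array([2.0, 4.0, 6.0])   # predictions y_1, y_2, y_3 of three models
w = np.array([0.5, 0.3, 0.2])       # nonnegative weights summing to 1
assert np.isclose(w.sum(), 1.0) and (w >= 0).all()

y_wag = np.dot(w, preds)            # Wagging output: sum_i w_i * y_i
y_bag = preds.mean()                # equal weights w_i = 1/n give bagging's average
```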
In the remainder of this question, assume that ŷ_1, . . . , ŷ_n are identically distributed with
Var(ŷ_i) = σ² for all i, and cov(ŷ_i, ŷ_j) = ρσ² for all 1 ≤ i ≠ j ≤ n.
(a) (5 marks) Show that Wagging has the same bias as each individual model.
(b) (5 marks) Express Var(ŷ) in terms of w_1, . . . , w_n, ρ and σ². In addition, for n = 2,
use your formula to evaluate Var(ŷ) when (w_1, w_2) equals (1, 0), (1/2, 1/2) and (0, 1)
respectively.
(c) (10 marks) For any n ≥ 2, find weights w_1, . . . , w_n that minimize the variance of
ŷ = ∑_{i=1}^{n} w_i ŷ_i.
3. (30 marks) We consider random forests in this question, and study the effect of m, the size
of the random feature subset used in choosing the splitting point in the decision trees.
Recall that when constructing a decision tree in a random forest, at each node, instead of
choosing the best split from all p given features, we can first choose 1 ≤ m ≤ p features,
and then choose the best split among them.
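In scikit-learn, the subset size m corresponds to the max_features parameter of the forest estimator: at each node, max_features features are sampled and the best split among them is used. A minimal sketch (the value 3 is an arbitrary choice for illustration):

```python
from sklearn.ensemble import RandomForestRegressor

# m = 3 features considered at each split; random_state fixes the randomness
rf = RandomForestRegressor(n_estimators=100, max_features=3, random_state=0)
```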
(a) (5 marks) Load the California housing dataset provided in sklearn.datasets, and
construct a random 70/30 train-test split. Set the random seed to a number of your
choice to make the split reproducible. What is the value of p here?
(b) (5 marks) Train a random forest of 100 decision trees using default hyperparameters.
Report the training and test accuracies. What is the value of m used?
(c) (5 marks) Compute all the pairwise correlations between the test set predictions of
the 100 trees, and report their average. Correlation refers to the Pearson correlation
in this question.
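One way to compute the average pairwise Pearson correlation is via the correlation matrix of the per-tree prediction vectors. A minimal sketch with toy data (in the assignment, row k would hold tree k's predictions on the test set):

```python
import numpy as np

preds = np.array([[1.0, 2.0, 3.0, 4.0],    # toy predictions of 3 "trees"
                  [1.1, 2.1, 2.9, 4.2],    # on 4 test examples
                  [0.9, 2.2, 3.1, 3.8]])

C = np.corrcoef(preds)                     # (n_trees, n_trees) Pearson correlations
iu = np.triu_indices_from(C, k=1)          # indices of the pairs i < j
avg_corr = C[iu].mean()                    # average over all distinct pairs
```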
(d) (5 marks) Repeat (b) and (c) for m = 1 to p, and tabulate the training and test
accuracies, and the average correlations for all values of m. In addition, plot the training
and test accuracies against m in a single figure, and plot the average correlation
against m in another figure.
(e) (5 marks) Describe how the average correlation changes as m increases. Explain the
observed pattern.
(f) (5 marks) A data scientist claims that we should choose m such that the average
correlation is smallest, because it gives us maximum reduction in the variance, thus
maximum reduction in the expected prediction error. True or false? Justify your
answer.
4. (35 marks) We consider bagging using the OOB error to automatically determine the
number of basis models to use in this question. Specifically, we keep adding one model
trained on a bootstrap sample to the bagging ensemble at a time, and stop only when
one of the following happens: (a) the number of models in the ensemble reaches a
pre-specified maximum number of models, or (b) i is at least 10, and ē_i ≥ ē_{i−5}.
Here the smoothed OOB error is defined as ē_i = (1/5) ∑_{j=1}^{5} e_{i−j+1}, with e_i being
the OOB error for the ensemble consisting of the first i models.
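As a sanity check on this definition, the smoothed error at step i is just the mean of the five most recent OOB errors e_{i−4}, . . . , e_i. A minimal sketch with our own names and toy numbers:

```python
def smoothed_oob(errors, i):
    # errors[k-1] holds e_k; valid for i >= 5
    window = errors[i - 5:i]        # the five most recent values e_{i-4}..e_i
    return sum(window) / 5.0

# toy sequence of OOB errors e_1..e_10
e = [0.30, 0.25, 0.22, 0.20, 0.19, 0.18, 0.21, 0.17, 0.16, 0.18]
ebar_5 = smoothed_oob(e, 5)         # mean of e_1..e_5
ebar_10 = smoothed_oob(e, 10)       # mean of e_6..e_10
# the stopping criterion at i = 10 compares ebar_10 against ebar_5
```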
To help you implement the above enhanced bagging algorithm, a partial implementation
with some comments has been provided below and also in the file bagging.py.
from sklearn.base import clone
import numpy as np

class OOBaggingClassifier:
    def __init__(self, base_estimator, n_estimators=200):
        '''
        Parameters
        ----------
        base_estimator: a probabilistic classifier that implements the
            predict_proba function, such as DecisionTreeClassifier
        n_estimators: the maximum number of estimators allowed.
        '''
        self.base_estimator_ = base_estimator
        self.n_estimators = n_estimators
        self.estimators_ = []
        self.oob_errors_ = []

    def fit(self, X, y, random_state=None):
        if random_state:
            np.random.seed(random_state)
        self.best_n = 0
        probs_oob = None
        for i in range(self.n_estimators):
            estimator = clone(self.base_estimator_)
            # construct a bootstrap sample
            # train on bootstrap sample
            # compute OOB error
            oob_error = ...  # replace ... with your code
            # save the OOB error and the new model
            self.oob_errors_.append(oob_error)
            self.estimators_.append(estimator)
            # stop early if smoothed OOB error increases (for the purpose of
            # this problem, we don't stop training when the criterion is
            # fulfilled, but simply set self.best_n to (i+1)).
            if (self.best_n == 0) and (OOB criterion):  # replace OOB criterion with your code
                self.best_n = (i + 1)
    def errors(self, X, y):
        '''
        Parameters
        ----------
        X: an input array of shape (n_sample, n_features)
        y: an array of shape (n_sample,) containing the classes for the
            input examples

        Returns
        -------
        error_rates: an array of shape (n_estimators,), with error_rates[i]
            being the error rate of the ensemble consisting of the first
            (i+1) models.
        '''
        error_rates = []
        # compute all the required error rates
        return error_rates
    def predict(self, X):
        '''
        Parameters
        ----------
        X: an input array of shape (n_sample, n_features)

        Returns
        -------
        y: an array of shape (n_samples,) containing the predicted classes
        '''
        probs = None
        for estimator in self.estimators_:
            p = estimator.predict_proba(X)
            if probs is None:
                probs = p
            else:
                probs += p
        return np.argmax(probs, axis=1)
Note that in this question, we only consider bagging of probabilistic classifiers, and we
use a different prediction method from the majority vote method discussed in lecture: we
first use each individual classifier to predict the class distributions, then we average over
all the predicted distributions, and predict the most likely class. This is implemented in
the predict function, and you should use the same prediction method when computing
the OOB prediction. If an example is not an OOB example for any of the basis models,
predict the first class.
(a) (5 marks) Load the digits dataset provided in sklearn.datasets, and construct a
random 70/30 train-test split. Set the random seed to a number of your choice to
make the split reproducible.
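One way to build such a split is sketched below (the seed 0 is an arbitrary choice; any fixed value makes the split reproducible):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# 70/30 train-test split with a fixed random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```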
(b) (5 marks) A naive way to compute the OOB error is to store all bootstrap samples,
and then use the definition of OOB error discussed in lecture to compute it. This
requires a lot of memory for large datasets. Describe a method that computes the
OOB error e_i using only the i-th bootstrap sample, but not previous bootstrap
samples — this allows you to throw away a bootstrap sample once the i-th model has
been trained and e_i has been computed. Introduce additional variables or data
structures if needed, but the amount of memory used should be at most within a
constant factor of that for the original dataset size (the constant factor needs to be
independent of the number of basis models used).
(c) (10 marks) Implement the fit function to support the method for using the OOB
error to determine the number of models as described above.
(d) (5 marks) Implement the errors function, which takes in a labeled dataset and returns
the error rates for the ensembles consisting of the first i basis models, for 1 ≤ i ≤ n.
(e) (5 marks) Train your OOBaggingClassifier on the digits dataset, using a maximum of
200 decision trees. Plot the OOB error and the test error against the number of
models (from 1 to 200). What is the number of basis models chosen by the OOB error
method?
(f) (5 marks) Repeat (e) two more times using different random seeds. Comment on the
performance of the OOB error method for selecting the number of models.