Worked exam paper: ST3189

ST3189 Machine Learning

Suitable for all candidates.

Instructions to candidates: This paper contains four questions. Answer ALL FOUR. All questions will be given equal weight (25%). The marks in brackets reflect the marks for each part. Time allowed: no reading time, 2 hours writing time. You are supplied with graph paper. You may use no additional materials. Calculators are allowed in this examination.

1. (a) Suppose that $y_i \sim N(\mu, 1)$ for $i = 1, \dots, n$ and that the $y_i$'s are independent.

i. Show that the sample mean estimator $\hat\mu_1 = \frac{1}{n}\sum_{i=1}^n y_i$ is obtained from minimising the least squares criterion

$$\hat\mu_1 = \operatorname{argmin}_{\mu} \sum_{i=1}^n (y_i - \mu)^2,$$

and that $\hat\mu_1$ is an unbiased estimator of $\mu$. Also find the variance of $\hat\mu_1$. [7 marks]

Answer: The derivative of $\sum_{i=1}^n (y_i - \mu)^2$ with respect to $\mu$ is equal to $-2\sum_i y_i + 2n\mu$. Setting it equal to 0 and solving then yields $\hat\mu_1 = \frac{1}{n}\sum_{i=1}^n y_i$. We then get

$$E\Big(\frac{1}{n}\sum_i y_i\Big) = \frac{1}{n}\sum_{i=1}^n E(y_i) = \mu,$$

which implies that the estimator is unbiased. For the variance note that

$$\operatorname{var}\Big(\frac{1}{n}\sum_i y_i\Big) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{var}(y_i) = \frac{1}{n}.$$

ii. Consider adding a penalty term to the least squares criterion, and therefore using the estimator that minimises

$$\hat\mu_2 = \operatorname{argmin}_{\mu} \sum_{i=1}^n (y_i - \mu)^2 + \lambda\mu^2$$

for the mean, where $\lambda$ is a non-negative tuning parameter. Derive $\hat\mu_2$, find its bias and show that its variance is lower than that of $\hat\mu_1$. [7 marks]

Answer: The derivative with respect to $\mu$ is $-2\sum_i (y_i - \mu) + 2\lambda\mu$. Setting it to 0 gives

$$\hat\mu_2 = \frac{\sum_{i=1}^n y_i}{n + \lambda}.$$

Then

$$E(\hat\mu_2) = \frac{n}{n+\lambda}\mu, \qquad \operatorname{Bias}(\hat\mu_2) = E(\hat\mu_2) - \mu = \frac{n}{n+\lambda}\mu - \mu = -\frac{\lambda}{n+\lambda}\mu,$$

$$\operatorname{var}(\hat\mu_2) = \operatorname{var}\Big(\frac{\sum_i y_i}{n+\lambda}\Big) = \frac{1}{(n+\lambda)^2}\sum_{i=1}^n \operatorname{var}(y_i) = \frac{n}{(n+\lambda)^2}.$$

Note that $\operatorname{var}(\hat\mu_2) < \operatorname{var}(\hat\mu_1)$ since $\frac{n}{(n+\lambda)^2} < \frac{1}{n}$ when $\lambda > 0$. (Both estimators are illustrated in the simulation sketch at the end of this question.)

(b) Consider the multiple linear regression model

$$y_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i, \qquad i = 1, \dots, n,$$

where $\beta = (\beta_1, \dots, \beta_p)^T$ and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T \sim N(0, \sigma^2 I_n)$.

i. When $p$ is comparable to $n$, multicollinearity becomes an issue. Describe the effects of multicollinearity on the estimated coefficients, the associated standard errors and the significance of the coefficients under the ordinary maximum likelihood method. [3 marks]

Answer: The estimated coefficients and the associated standard errors can both become very large, making the coefficients non-significant in the end.

ii. The ridge regression estimate of $\beta$ can be obtained by minimising a particular expression with respect to $\beta$. Write down this expression as well as an alternative formulation of it. [4 marks]

Answer: The expression is

$$\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^p \beta_j^2,$$

where $\lambda > 0$ is a tuning parameter. It can be shown, however, that minimising the above expression is equivalent to minimising

$$\sum_{i=1}^n \Big( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \le s,$$

where $s > 0$ is a tuning parameter.

iii. Explain why ridge regression can potentially correct the problems of multicollinearity. [2 marks]

Answer: It is because the magnitude of the estimated coefficients is restricted through the constraint $\sum_{j=1}^p \beta_j^2 \le s$.

iv. Provide an advantage and a disadvantage of ridge regression over standard linear regression. [2 marks]

Answer: Ridge regression estimates are biased (the disadvantage) but have lower variance (the advantage).
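The bias and variance formulas of part (a) can be checked numerically. The following is a minimal R sketch; the sample size n = 50, true mean mu = 2, penalty lambda = 5 and number of replications are arbitrary illustrative choices, not values from the question.

    # Monte Carlo comparison of mu_hat_1 (sample mean) and mu_hat_2 (penalised)
    set.seed(1)
    n <- 50; mu <- 2; lambda <- 5; R <- 10000
    mu1 <- mu2 <- numeric(R)
    for (r in 1:R) {
      y <- rnorm(n, mean = mu, sd = 1)
      mu1[r] <- mean(y)                 # least squares estimator
      mu2[r] <- sum(y) / (n + lambda)   # penalised (shrinkage) estimator
    }
    c(bias1 = mean(mu1) - mu, bias2 = mean(mu2) - mu)  # approx. 0 and -lambda*mu/(n+lambda)
    c(var1 = var(mu1), var2 = var(mu2))                # approx. 1/n and n/(n+lambda)^2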
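For part (b), ridge regression can be run with lm.ridge from the MASS package; a minimal sketch on simulated near-collinear data (the data-generating choices below are illustrative assumptions):

    library(MASS)  # provides lm.ridge

    # Two almost identical predictors mimic severe multicollinearity
    set.seed(1)
    n <- 100
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n, sd = 0.01)
    y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
    dat <- data.frame(y, x1, x2)

    coef(lm(y ~ x1 + x2, data = dat))                     # OLS: unstable, inflated coefficients
    coef(lm.ridge(y ~ x1 + x2, data = dat, lambda = 10))  # ridge: shrunken, stable coefficients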
2. Let $x = (x_1, \dots, x_{100})$, with $\sum_i x_i = 20$, be a random sample from the Exponential($\lambda$) distribution with probability density function given by

$$f(x_i \mid \lambda) = \frac{1}{\lambda} \exp\Big(-\frac{x_i}{\lambda}\Big), \qquad x_i > 0, \ \lambda > 0.$$

Note that $E(x_i) = \lambda$.

(a) Assign the IGamma(0.1, 0.1) prior to $\lambda$ and find the corresponding posterior distribution. [5 marks]

Answer: The likelihood can be written as

$$f(x \mid \lambda) = \prod_{i=1}^{100} \frac{1}{\lambda} \exp\Big(-\frac{x_i}{\lambda}\Big) = \lambda^{-100} \exp\Big(-\frac{\sum_{i=1}^{100} x_i}{\lambda}\Big) = \lambda^{-100} \exp\Big(-\frac{20}{\lambda}\Big),$$

and the prior is

$$\pi(\lambda) \propto \lambda^{-0.1-1} \exp\Big(-\frac{0.1}{\lambda}\Big).$$

Hence the posterior is

$$\pi(\lambda \mid x) \propto \lambda^{-100} \exp\Big(-\frac{20}{\lambda}\Big)\, \lambda^{-0.1-1} \exp\Big(-\frac{0.1}{\lambda}\Big) = \lambda^{-100.1-1} \exp\Big(-\frac{20.1}{\lambda}\Big),$$

which can be recognised as the IGamma(100.1, 20.1) distribution.

(b) Find the Jeffreys' prior for $\lambda$. What is the corresponding posterior distribution? [6 marks]

Answer: We can write

$$l(x \mid \lambda) = \log f(x \mid \lambda) = -100 \log\lambda - \frac{\sum_i x_i}{\lambda},$$

$$\frac{\partial}{\partial\lambda} l(x \mid \lambda) = -\frac{100}{\lambda} + \frac{\sum_i x_i}{\lambda^2}, \qquad \frac{\partial^2}{\partial\lambda^2} l(x \mid \lambda) = \frac{100}{\lambda^2} - \frac{2\sum_i x_i}{\lambda^3},$$

$$I(\lambda) = -E\Big[\frac{\partial^2}{\partial\lambda^2} l(x \mid \lambda)\Big] = -\frac{100}{\lambda^2} + \frac{2\sum_i E(x_i)}{\lambda^3} = -\frac{100}{\lambda^2} + \frac{2 \cdot 100\,\lambda}{\lambda^3} = \frac{100}{\lambda^2}.$$

Hence Jeffreys' prior is $\pi(\lambda) \propto I(\lambda)^{1/2} \propto (\lambda^{-2})^{1/2} = \lambda^{-1}$. The posterior becomes

$$\pi(\lambda \mid x) \propto \lambda^{-100} \exp\Big(-\frac{20}{\lambda}\Big)\, \lambda^{-1} = \lambda^{-100-1} \exp\Big(-\frac{20}{\lambda}\Big),$$

which can be recognised as the IGamma(100, 20) distribution.

(c) Find a Bayes estimator for $\lambda$ based on the priors of parts (a) and (b). [3 marks]

Answer: A standard Bayes estimator is the posterior mean, which (see the appendix for the IGamma mean) is equal to

$$\frac{20.1}{100.1 - 1} = 0.203 \quad \text{or} \quad \frac{20}{100 - 1} = 0.202,$$

depending on the chosen prior.

(d) Let $y$ represent a future observation from the same model. Find the predictive distribution of $y$ based either on the prior of part (a) or (b). [6 marks]

Answer: Using the posterior of part (b),

$$f(y \mid x) = \int_{\Lambda} f(y \mid \lambda)\, \pi(\lambda \mid x)\, d\lambda = \int_0^{\infty} \frac{1}{\lambda} \exp\Big(-\frac{y}{\lambda}\Big) \frac{20^{100}}{\Gamma(100)}\, \lambda^{-100-1} \exp\Big(-\frac{20}{\lambda}\Big)\, d\lambda$$

$$= \frac{20^{100}}{\Gamma(100)} \int_0^{\infty} \lambda^{-101-1} \exp\Big(-\frac{20+y}{\lambda}\Big)\, d\lambda = \frac{20^{100}}{\Gamma(100)} \cdot \frac{\Gamma(101)}{(20+y)^{101}}, \qquad y > 0.$$

(e) Describe how you can calculate the mean of the predictive distribution in software such as R. [5 marks]

Answer: Note that we can write the mean of the predictive distribution as

$$E(y \mid x) = \int_0^{\infty} \int_0^{\infty} y\, f(y \mid \lambda)\, \pi(\lambda \mid x)\, dy\, d\lambda.$$

Hence a Monte Carlo scheme would draw samples $y^{(k)}$, $k = 1, \dots, N$, from the joint density $f(y \mid \lambda)\,\pi(\lambda \mid x)$ for some large $N$ (that is, draw $\lambda^{(k)}$ from the posterior and then $y^{(k)}$ from $f(y \mid \lambda^{(k)})$) and then take

$$\hat E(y \mid x) = \frac{\sum_{k=1}^N y^{(k)}}{N}.$$

To do that in R one can proceed as follows (see the sketch after this list):

i. Draw, say, N = 10,000 samples from the Gamma(100, 20) distribution using nu = rgamma(10000, shape = 100, rate = 20).
ii. Invert those samples to make them samples from the IGamma(100, 20) using lambda = 1/nu.
iii. Using each of the samples in lambda, sample y from the model y ~ Exponential(lambda), whose mean is lambda, using y = rexp(10000, rate = 1/lambda).
iv. Calculate the sample mean of the values in y using mean(y).
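Putting the four steps together, a minimal R sketch (the choice N = 10,000 is arbitrary):

    # Monte Carlo estimate of the predictive mean E(y | x)
    set.seed(1)
    N <- 10000
    nu <- rgamma(N, shape = 100, rate = 20)  # Gamma(100, 20) draws
    lambda <- 1 / nu                         # IGamma(100, 20) posterior draws
    y <- rexp(N, rate = 1 / lambda)          # Exponential draws, each with mean lambda
    mean(y)                                  # approximates E(y | x); close to 20/99 = 0.202

Since $E(y \mid \lambda) = \lambda$, the predictive mean coincides with the posterior mean of $\lambda$, so the Monte Carlo answer should agree with part (c).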
3. (a) i. Suppose a non-linear model that can be written as $Y = f(X) + \epsilon$, where $\epsilon$ has zero mean and variance $\sigma^2$, and is independent of $X$. Show that the expected test error, conditional on $X$, can be decomposed into the following three parts:

$$E\Big[\big(Y - \hat f(X)\big)^2\Big] = \sigma^2 + \operatorname{Bias}\big[\hat f(x)\big]^2 + \operatorname{Var}\big[\hat f(x)\big],$$

where $\hat f(\cdot)$ is estimated from the training data. [7 marks]

Answer: Since $Y = f(X) + \epsilon$, we can write

$$E\Big[\big(Y - \hat f(X)\big)^2 \,\Big|\, X = x\Big] = E\Big[\big(f(x) - \hat f(x) + \epsilon\big)^2\Big]$$

$$= E\Big[\big(\hat f(x) - E(\hat f(x)) + E(\hat f(x)) - f(x)\big)^2\Big] + E(\epsilon^2)$$

$$= E\Big[\big(\hat f(x) - E(\hat f(x))\big)^2\Big] + E\Big[\big(E(\hat f(x)) - f(x)\big)^2\Big] + 2\,E\Big[\big(\hat f(x) - E(\hat f(x))\big)\big(E(\hat f(x)) - f(x)\big)\Big] + \sigma^2$$

$$= \operatorname{Var}\big[\hat f(x)\big] + \operatorname{Bias}\big[\hat f(x)\big]^2 + 0 + \sigma^2.$$

The second equality uses the fact that $\epsilon$ is independent of $\hat f(X)$ and has zero mean, so the cross term involving $\epsilon$ vanishes, together with $\operatorname{var}(\epsilon) = E(\epsilon^2) = \sigma^2$. The last equality uses the definitions of variance and bias, and the fact that the remaining cross-product is equal to 0: $E(\hat f(x)) - f(x)$ is a constant and $E\big[\hat f(x) - E(\hat f(x))\big] = 0$.

ii. To estimate the test error rate, one can use the 10-fold Cross-Validation (CV) approach or an information criterion, e.g. AIC or BIC. What are the main advantage and disadvantage of using the 10-fold CV approach in comparison with AIC or BIC? [3 marks]

Answer: The 10-fold CV is computationally expensive because one needs to fit the model 10 times, whereas only one fit is needed for AIC or BIC. On the other hand, CV approaches provide direct estimates of the test error and make fewer assumptions about the true model. For AIC or BIC it can also be hard to specify the model degrees of freedom. (The two approaches are contrasted in the sketch at the end of this question.)

iii. State which one of AIC and BIC tends to select models of smaller size and explain the reason. [3 marks]

Answer: BIC places a heavier penalty on models with many variables and hence results in the selection of smaller models than AIC.

(b) i. The tree in Figure 1 provides a regression tree based on a dataset of patient visits for upper respiratory infection. The aim is to identify factors associated with a physician's rate of prescribing, which is a continuous variable. The variables appearing in the regression tree are private: the percentage of privately insured patients a physician has, black: the percentage of black patients a physician has, and fam: whether or not the physician specialises in family medicine. Provide an interpretation of this tree. [5 marks]

Figure 1: Regression tree for Question 3 (b) i.

Answer: Among physicians with privately insured patients, those with black patient populations had a 48.72% average rate of prescribing, while the prescription rate for non-black populations was 54.60%. Among those without private insurance, the presence of a family medicine doctor raises the average prescribing rate by approximately 10 percentage points, to 54.48% (vs 44.67%), indicating that family medicine doctors systematically prescribe more antibiotics than non-family medicine doctors.

ii. Consider the regression tree of Figure 2, where the response variable is the log salary of a baseball player, based on the number of years that he has played in the major leagues (Years) and the number of hits that he made in the previous year (Hits). Create a diagram that represents the partition of the predictor space according to this tree.

Figure 2: Regression tree for Question 3 (b) ii (root split Years < 4.5; second split Hits < 117.5; leaf predictions 5.11, 6.00 and 6.74).

Answer: The requested diagram showing the partition of the predictor space according to this tree is provided in Figure 3.

Figure 3: Partition of the predictor space (axes Years and Hits) according to the tree in Figure 2, with regions labelled 5.11, 6.00 and 6.74.
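To illustrate part (a) ii, the sketch below contrasts the two approaches on a linear model: AIC is computed from a single fit, while 10-fold CV refits the model ten times and estimates the test error directly. The mtcars data and the model formula are arbitrary illustrative choices, and cv.glm is assumed to be available from the boot package.

    library(boot)  # provides cv.glm

    # Fit the model as a glm (gaussian family) so that cv.glm can be applied
    fit <- glm(mpg ~ wt + hp, data = mtcars)

    AIC(fit)                           # information criterion: one fit only
    set.seed(1)
    cv.glm(mtcars, fit, K = 10)$delta  # 10-fold CV estimate of the test MSE: ten refits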
4. (a) i. Consider the following data:

10 20 40 80 85 121 160 168 195

Use the k-means algorithm with k = 3 to cluster the data set. Use the Euclidean distance to measure the distance between the data points. Suppose that the points 160, 168, and 195 were selected as the initial cluster means. Work from these initial values to determine the final clustering for the data. Provide results from each iteration. [9 marks]

Answer: The k-means clustering can be performed via the following steps (see also the first sketch at the end of this question). If we work through this, we will see that initially the observations (10, 20, 40, 80, 85, 121, 160) are closest to cluster centre 1, the observation 168 is closest to cluster centre 2, whereas the observation 195 is closest to cluster centre 3. The new cluster centres are then the averages of the observations belonging to each cluster, which are 73.71, 168 and 195 respectively.

Now the observations (10, 20, 40, 80, 85) are closest to cluster centre 1, the observations (121, 160, 168) are closest to cluster centre 2, whereas cluster centre 3 keeps the observation 195. The new centre for cluster 1 is the average of the observations (10, 20, 40, 80, 85), which is 47, and the new centre for cluster 2 is the average of the observations (121, 160, 168), which is 149.67. Finally, the centre of cluster 3 remains unchanged at 195.

As with the previous step, the observations (10, 20, 40, 80, 85) are closest to cluster centre 1, the observations (121, 160, 168) are closest to cluster centre 2, whereas cluster centre 3 keeps the observation 195. Since there has been no change in the clusters, k-means stops at this point with final cluster assignments of (10, 20, 40, 80, 85), (121, 160, 168) and (195), and centres of 47, 149.67 and 195.

ii. What are the main disadvantages of k-means clustering? Why might one want to consider hierarchical clustering as an alternative? [4 marks]

Answer: One drawback of the k-means algorithm is that it can only find a local optimum rather than a global optimum, so the results obtained will depend on the initial cluster assignment of each observation. Second, k-means clustering requires us to pre-specify the number of clusters k. In hierarchical clustering there is no need to set k beforehand. Hierarchical clustering may also be appealing over k-means clustering in that it offers a tree-based depiction of the data, called a dendrogram.

(b) i. Data are available for students taking a BSc degree in Data Science, and in particular the variables X1: average mark on project coursework, X2: average hours studied per course, and Y: whether the student gets a degree with distinction. The estimated coefficients of a logistic regression model were $\hat\beta_0 = -5$, $\hat\beta_1 = 0.02$, $\hat\beta_2 = 0.1$. Estimate the probability that a student who scores 50% on average on project coursework and studies 30 hours on average for each course gets a degree with distinction. How many hours would this student need to study on average to have a 50% chance of getting a degree with distinction? [6 marks]

Answer: We have

$$p(X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)}{1 + \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2)},$$

where $X_1$ = average coursework mark, $X_2$ = average hours studied per course, $\beta_0 = -5$, $\beta_1 = 0.02$ and $\beta_2 = 0.1$. For $X_1 = 50$ and $X_2 = 30$ we get

$$p(X) = \frac{\exp(-5 + 0.02 \cdot 50 + 0.1 \cdot 30)}{1 + \exp(-5 + 0.02 \cdot 50 + 0.1 \cdot 30)} = 26.89\%.$$

For $X_1 = 50$ and $X_2 = x$ we require

$$0.50 = \frac{\exp(-5 + 0.02 \cdot 50 + 0.1 x)}{1 + \exp(-5 + 0.02 \cdot 50 + 0.1 x)} = \frac{\exp(-4 + 0.1 x)}{1 + \exp(-4 + 0.1 x)},$$

which gives $-4 + 0.1 x = 0$, or else $x = 40$ hours.

ii. Suppose that we wish to predict whether a chip produced in a factory will pass the quality control ('Pass' or 'Fail') based on x, the measurement of its diameter. Diameter measurements are available for a large number of chips. After examining them, it turns out that the mean value of x for chips that passed the quality control was 5mm, while the mean for those that did not was 7mm. Moreover, the variance of x for these two sets of chips was $\sigma^2 = 1$. Finally, 70% of the produced chips passed the quality control. Assuming that x follows the normal distribution, predict the probability that a chip with x = 5.8 will pass the quality control. [6 marks]

Answer: For the probability of passing the quality control we get

$$p_{\mathrm{pass}}(x) = \frac{\pi_{\mathrm{pass}} \exp\big(-\frac{1}{2\sigma^2}(x - \mu_{\mathrm{pass}})^2\big)}{\pi_{\mathrm{pass}} \exp\big(-\frac{1}{2\sigma^2}(x - \mu_{\mathrm{pass}})^2\big) + \pi_{\mathrm{fail}} \exp\big(-\frac{1}{2\sigma^2}(x - \mu_{\mathrm{fail}})^2\big)} = \frac{0.70\, \exp\big(-\frac{1}{2}(x - 5)^2\big)}{0.70\, \exp\big(-\frac{1}{2}(x - 5)^2\big) + 0.30\, \exp\big(-\frac{1}{2}(x - 7)^2\big)}.$$

Setting x = 5.8 we get

$$p_{\mathrm{pass}}(5.8) = \frac{0.70\, \exp\big(-\frac{1}{2}(5.8 - 5)^2\big)}{0.70\, \exp\big(-\frac{1}{2}(5.8 - 5)^2\big) + 0.30\, \exp\big(-\frac{1}{2}(5.8 - 7)^2\big)} = 77.68\%.$$
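The hand computation in part (a) i can be checked with R's built-in kmeans function; a minimal sketch, using algorithm = "Lloyd" so that the batch centre updates match the iterations above:

    x <- c(10, 20, 40, 80, 85, 121, 160, 168, 195)

    # Start from the initial centres 160, 168 and 195 given in the question
    init <- matrix(c(160, 168, 195), ncol = 1)
    fit <- kmeans(matrix(x, ncol = 1), centers = init, algorithm = "Lloyd")

    fit$cluster  # final assignments: (10, 20, 40, 80, 85), (121, 160, 168), (195)
    fit$centers  # final centres: 47, 149.67, 195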
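The probabilities in part (b) can likewise be reproduced in R with plogis (the logistic CDF) and dnorm (the normal density; its common factor $(2\pi\sigma^2)^{-1/2}$ cancels in the ratio):

    # (b) i: logistic regression probability at X1 = 50, X2 = 30
    b0 <- -5; b1 <- 0.02; b2 <- 0.1
    plogis(b0 + b1 * 50 + b2 * 30)  # 0.2689, i.e. 26.89%

    # Hours needed for a 50% chance: solve b0 + b1 * 50 + b2 * x = 0
    -(b0 + b1 * 50) / b2            # 40 hours

    # (b) ii: Bayes classifier with normal class densities and common sigma = 1
    x <- 5.8
    num <- 0.70 * dnorm(x, mean = 5, sd = 1)
    den <- num + 0.30 * dnorm(x, mean = 7, sd = 1)
    num / den                       # 0.7768, i.e. 77.68%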
Appendix: Table of Common Distributions

Binomial(n, θ): number of successes in n independent Bernoulli trials with probability of success θ.
$f(x \mid \theta) = P(x \mid \theta) = \frac{n!}{x!\,(n-x)!}\, \theta^x (1-\theta)^{n-x}$ for $x = 0, 1, \dots, n$.
$E(X) = n\theta$, $\operatorname{Var}(X) = n\theta(1-\theta)$.

NegBin(r, θ): number of successes before the rth failure in repeated independent Bernoulli trials.
$f(x \mid \theta) = P(x \mid \theta) = \binom{x+r-1}{x}\, \theta^x (1-\theta)^r$ for $x = 0, 1, \dots$.
$E(X) = \frac{r\theta}{1-\theta}$, $\operatorname{Var}(X) = \frac{r\theta}{(1-\theta)^2}$.

Poisson(λ): often used for the number of events which occur in an interval of time.
$f(x \mid \lambda) = P(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$ for $x = 0, 1, \dots$.
$E(X) = \lambda$, $\operatorname{Var}(X) = \lambda$.

Normal N(μ, σ²): characterized by its first two moments.
$f(x) = (2\pi\sigma^2)^{-1/2} \exp\big(-\frac{(x-\mu)^2}{2\sigma^2}\big)$ for $-\infty < x < \infty$.
$E(X) = \mu$, $\operatorname{Var}(X) = \sigma^2$.

Beta(α, β): characterized by parameters α > 0 and β > 0.
$f(x) = \frac{1}{B(\alpha,\beta)}\, x^{\alpha-1} (1-x)^{\beta-1}$ for $0 \le x \le 1$, where $B(\alpha,\beta) = \int_0^1 y^{\alpha-1}(1-y)^{\beta-1}\,dy = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$.
$E(X) = \frac{\alpha}{\alpha+\beta}$, $\operatorname{Var}(X) = \frac{\alpha\beta}{(\alpha+\beta+1)(\alpha+\beta)^2}$.

Gamma(α, β): characterized by parameters α > 0 and β > 0.
$f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha-1} \exp(-\beta x)$ for $0 \le x < \infty$, where $\Gamma(t) = \int_0^\infty y^{t-1} e^{-y}\,dy$ and $\Gamma(n) = (n-1)!$ for positive integer n.
$E(X) = \frac{\alpha}{\beta}$, $\operatorname{Var}(X) = \frac{\alpha}{\beta^2}$.

IGamma(α, β): characterized by parameters α > 0 and β > 0. If X ~ Gamma(α, β), then 1/X ~ IGamma(α, β).
$f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{-\alpha-1} \exp\big(-\frac{\beta}{x}\big)$ for $0 < x < \infty$.
$E(X) = \frac{\beta}{\alpha-1}$, $\operatorname{Var}(X) = \frac{\beta^2}{(\alpha-1)^2(\alpha-2)}$.

END OF PAPER