
UNIVERSITY OF SOUTHAMPTON
COMP6245W1
SEMESTER 1 FINAL ASSESSMENT 2020 – 2021
MACHINE LEARNING
DURATION 240 MINS (4 Hours)

This paper contains 20 questions. Provide answers to all questions in a single page, neatly numbered in order. You may attach two further pages with any workings where useful, clearly numbered and in order of the questions. These additional pages will be looked at if the question requires any derivation and the answer you provided is incorrect. Note the questions will have at least one correct answer. Where the question has more than one correct answer, you must select all the correct ones. For these questions, partial credit will usually not be given. You should upload a maximum of three pages as a single pdf file. Each question is worth five marks.

Question 1.
On a faraway island, a highly infectious disease is spreading across the population. A third of those infected appear to suffer long term illness, whereas two thirds recover. The precise reasons as to who suffers adverse conditions are unknown. Scientists claim to have discovered two proteins, concentrations of which in blood could be implicated in the adverse conditions. Measurements of concentrations of these proteins (P1 and P2) were carried out in samples of patients who suffered long term conditions (denoted A) and those who made a full recovery (B). Bivariate Gaussian models were fitted to the data ($\mathbf{x} = [P_1\ P_2]^T$), and the estimated means and covariances of A and B were as follows:
$$\mathbf{m}_A = \begin{bmatrix} 1.3 \\ 4.3 \end{bmatrix}, \quad \mathbf{m}_B = \begin{bmatrix} 8.5 \\ 4.7 \end{bmatrix}, \quad \Sigma_A = \begin{bmatrix} 3.0 & 0.001 \\ 0.001 & 1.5 \end{bmatrix} \quad \text{and} \quad \Sigma_B = \begin{bmatrix} 3.0 & 0.001 \\ 0.001 & 1.5 \end{bmatrix}.$$
A multi-national company contracted by the government of the island recommends the use of a linear classifier $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$ to predict adverse outcomes. You are asked to comment on the proposed classifier. Which of the following statements is/are true?
1. The proposed linear classifier is optimal and should be deployed.
2. A distance-to-mean classifier based on Euclidean distance will be the optimal solution.
3. Inspired by the brain, we should train an artificial neural network.
4. A distance-to-mean classifier based on the Mahalanobis distance will be the optimal solution.
5. The implication of protein P1 with this condition is suspect.
6. The implication of protein P2 with this condition is suspect.
[5 marks]

Question 2.
Consider the scenario described in Question 1. The study was repeated with refined measurements of the two proteins, producing the following results:
$$\mathbf{m}_A = \begin{bmatrix} 2.35 \\ 4.76 \end{bmatrix}, \quad \mathbf{m}_B = \begin{bmatrix} 2.42 \\ 4.82 \end{bmatrix}, \quad \Sigma_A = \begin{bmatrix} 2.0 & 1.0 \\ 1.0 & 2.0 \end{bmatrix} \quad \text{and} \quad \Sigma_B = \begin{bmatrix} 2.0 & 0.001 \\ 0.001 & 2.0 \end{bmatrix}.$$
What might you suspect?
1. A third protein might be involved in causing long term illness.
2. We could still consider the use of an artificial neural network.
3. A linear support vector machine that maximizes the margin is a better solution.
4. Measuring protein P1 is sufficient for accurate prediction.
5. The linear classifier recommended by the company contracted by the government is the optimal solution to this problem.
[5 marks]
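As an illustration of the distance-to-mean classifiers referred to in Questions 1 and 2, the following minimal NumPy sketch compares Euclidean and Mahalanobis distances of a single test point to the two class means. The means and covariance are the estimates quoted in Question 1; the test point and helper names are assumptions made here purely for illustration.

```python
import numpy as np

# Estimated class means and (equal) covariance from Question 1
m_A = np.array([1.3, 4.3])
m_B = np.array([8.5, 4.7])
Sigma = np.array([[3.0, 0.001],
                  [0.001, 1.5]])          # shared by classes A and B
Sigma_inv = np.linalg.inv(Sigma)

def euclidean_dist(x, m):
    return np.linalg.norm(x - m)

def mahalanobis_dist(x, m, S_inv):
    d = x - m
    return np.sqrt(d @ S_inv @ d)

x = np.array([4.0, 4.5])                  # arbitrary test measurement [P1, P2]
for name, dist in [("Euclidean", euclidean_dist),
                   ("Mahalanobis", lambda x, m: mahalanobis_dist(x, m, Sigma_inv))]:
    label = "A" if dist(x, m_A) < dist(x, m_B) else "B"
    print(f"{name}: assign to class {label}")
```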
Question 3.
Consider again the scenario described in Question 1. After realising their error, the government of the island decides to consult expert clinicians and data scientists of Dolphin University, who base their study on x-ray imaging data of affected organs. Abnormal regions of the images were annotated by the clinical experts and prediction systems were designed by the data scientists. Because interpretation of the images is time consuming, and the clinical experts are paid much higher salaries than the data scientists, only part of the data (set A) was annotated. We denote the remaining set B. Six features were extracted from each image by the data scientist, formulating a regression problem in $\mathbf{x} \in \mathbb{R}^6$, and predicting how long a seriously infected patient might survive in intensive care conditions. A radial basis function model
$$f(\mathbf{x}) = \sum_{j=1}^{M} \lambda_j\, \phi\left(\alpha \|\mathbf{x} - \mathbf{m}_j\|\right)$$
was proposed by the data scientist, who suggested that the data in set B be clustered using K-means clustering to set $\mathbf{m}_j,\ j = 1, 2, \ldots, M$, and data in set A be used to solve a regression problem to estimate the $\lambda_j,\ j = 1, 2, \ldots, M$. The approach used by the data scientist is best described as:
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Transfer learning
5. Deep learning
6. Online learning
7. Self-supervised learning
[5 marks]

Question 4.
The textbook Pattern Recognition and Machine Learning gives expressions for the Bayesian estimation of the mean of a univariate Gaussian density in Equations 2.141 and 2.142. Please refer to these equations before answering the question below. Which of the following statements is/are true?
1. When $N \to \infty$, the Bayesian and maximum likelihood estimates are the same.
2. A high confidence prior has large $\sigma_0$.
3. With a high confidence prior, the Bayesian and maximum likelihood estimates, using the same amount of data, will be identical.
4. Uncertainty in the Bayesian estimates reduces with sample size.
[5 marks]

Question 5.
Consider the derivation of equation (2.126) for the maximum likelihood estimation of the mean in the textbook Pattern Recognition and Machine Learning:
$$\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\left(x_N - \mu_{ML}^{(N-1)}\right).$$
Which of the following statements is/are true?
1. This formula is useful for accurate estimation of the mean of a multivariate Gaussian distribution.
2. This formula is useful for solving an online learning problem.
3. We might use this formula in a situation where the size of a given dataset is very large.
4. The quantity computed by this formula can sometimes not converge to the true mean unless the learning rate is set very low.
[5 marks]

Question 6.
Consider the univariate function
$$y = \exp\left\{-\frac{1}{2}(x - 0.2)^2\right\}.$$
What is $\int_{-\infty}^{+\infty} y\, dx$?
1. $\sqrt{\pi}$
2. $\sqrt{2\pi}$
3. $0.2\sqrt{2\pi}$
4. $9.8$
5. $6.0 \times 10^{23}$
[5 marks]
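The sequential update quoted in Question 5 can be checked numerically. The sketch below uses synthetic univariate data (the distribution parameters and sample size are illustrative assumptions) and shows that applying the update once per sample reproduces the batch sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic univariate data

mu = 0.0                                        # running (online) estimate
for N, x_N in enumerate(x, start=1):
    mu = mu + (x_N - mu) / N                    # mu^(N) = mu^(N-1) + (1/N)(x_N - mu^(N-1))

print(mu, x.mean())                             # the two estimates agree (to rounding)
```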
Question 7.
You are tasked with predicting the market price of an asset using its past values and several variables relating to the underlying economy within which the business operates. The dataset given to you spans one year of daily trading (252 items) of 180 variables. You are required to split the data into a training set and an evaluation set of equal sizes and use a linear model as predictor. You attempt to solve the problem of estimating regression coefficients by
$$\mathbf{w} = \left(X^T X\right)^{-1} X^T \mathbf{t}.$$
Which of the following is/are true?
1. The attempt above will not work without suitable regularization.
2. I would advocate the use of a Lasso regularizer to solve the problem.
3. If the problem is solved using a regularizer $\min_{\mathbf{w}} \|\mathbf{t} - X\mathbf{w}\| + \gamma\|\mathbf{w}\|^2$, setting $\gamma$ to very small values will produce sparse solutions.
4. We cannot use more data acquired by taking longer windows (say several years of trading instead of just one) because the underlying statistical relationships might have changed over time.
[5 marks]

Question 8.
With usual notation, Fisher Linear Discriminant Analysis (FLDA) maximizes the objective function
$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}$$
to arrive at the discriminant direction $\mathbf{w}_F = \beta\, S_W^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$. Which of the following statements is/are true?
1. If the features are uncorrelated, $\mathbf{w}_F$ is the same as the line joining the means.
2. It is not necessary to compute the term $\beta$ in the solution accurately.
3. Computing $\mathbf{w}_F$ is a necessary step in deriving the Receiver Operating Characteristic (ROC) curve for any pattern classification problem.
4. If the class conditional densities are multi-modal, using FLDA is not recommended.
5. The factor $\beta$ could be tuned to improve regularization.
[5 marks]

Question 9.
In solving a two-class pattern classification problem, it is thought Fisher LDA could be improved by accounting for the prior probabilities of the classes, $p(C_1)$ and $p(C_2)$. The corresponding objective function to maximize is:
$$J(\mathbf{w}) = \frac{(\mu_1 - \mu_2)^2}{p(C_1)\, s_1^2 + p(C_2)\, s_2^2},$$
where $\mu_1$ and $\mu_2$ are the projected means and $s_1$ and $s_2$ are the scatters of the projected data. Derive the direction that maximizes $J(\mathbf{w})$. Your answer is:
1. $\mathbf{w} = \beta\, (\Sigma_1 + \Sigma_2)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$
2. $\mathbf{w} = \beta\, p(C_1)\, p(C_2)\, (\Sigma_1 + \Sigma_2)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$
3. $\mathbf{w} = \beta\, \left(p(C_1)\Sigma_1 + p(C_2)\Sigma_2\right)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$
4. $\mathbf{w} = \sqrt{2\pi}\, \beta\, \left(\exp(p(C_1))\Sigma_1 + \exp(p(C_2))\Sigma_2\right)^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$
Here $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma_1, \Sigma_2$ are the means and covariance matrices of the two classes.
[5 marks]

Question 10.
Dolphin University scientists have developed a novel method to predict coronavirus infection based on traces of mobile phone usage. A continuous valued score is computed from the duration of contact with persons known to have tested positive. A threshold is set and, if the score exceeds this threshold, the person concerned is requested to self-isolate. In the above setting, which of the following is/are true?
1. There is an economic cost associated with False Positive predictions.
2. High False Negatives lead to infection risk in the community.
3. True Positives of the test are caused by the test itself.
4. Inspired by how the brain works, I will input the score to an artificial neural network for accurately predicting coronavirus infection.
[5 marks]

Question 11.
Which of the following statements is/are true about a Receiver Operating Characteristic (ROC) curve?
1. The area under the curve can sometimes be negative.
2. The probability of correct ranking is given by the area under the ROC curve.
3. Every operating point on the ROC curve yields the same misclassification error.
4. It is not advisable to use area under the ROC curve as a performance measure if we can estimate the different costs of misclassification.
5. Increasing the learning rate when training a neural network always increases the area under the corresponding ROC curve.
[5 marks]
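For the setting of Question 7 (126 training samples and 180 variables after an equal split), the Gram matrix $X^T X$ has rank at most 126, so the plain normal-equations estimate cannot be formed without regularization. The sketch below uses random synthetic data with the shapes from the question; the data values and the choice of ridge penalty are illustrative assumptions, not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N_train, p = 126, 180                       # equal split of 252 daily records, 180 variables
X = rng.normal(size=(N_train, p))
t = rng.normal(size=N_train)

G = X.T @ X
print(np.linalg.matrix_rank(G), p)          # rank <= 126 < 180: G is singular

gamma = 1e-2                                # ridge penalty; the value here is arbitrary
w_ridge = np.linalg.solve(G + gamma * np.eye(p), X.T @ t)
print(w_ridge.shape)
```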
Question 12.
In a two-class pattern classification problem involving a positive-valued univariate feature $x$, the class conditional densities are both uniformly distributed as follows:
$$p(x\,|\,C_1) = \begin{cases} \alpha & a \le x \le b \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad p(x\,|\,C_2) = \begin{cases} \beta & c \le x \le d \\ 0 & \text{otherwise,} \end{cases}$$
where $a \le c \le b$. Compute the area under the Receiver Operating Characteristic (ROC) curve for this problem, assuming the prior probabilities of the classes, $p(C_1)$ and $p(C_2)$, are equal. Your answer is:
1. $1 - \dfrac{(b-c)^2}{2(b-a)(d-c)}$
2. $1 - (b-a)(d-c)$
3. $1 - (d-b)\left[1 - \dfrac{1}{2}\dfrac{(d-a)}{(b-c)}\right]$
4. $0.5\left[1 + \dfrac{(c-a)(b-c)}{2(d-b)}\right]$
[5 marks]

Question 13.
A dataset consists of the shopping habits of N = 300 individuals. The number of times any individual purchased any of p = 600 items in the two weeks prior to Christmas has been recorded in the dataset. The data is contained in a matrix X of dimensions N × p. The purchasing power of the individuals was also acquired from their annual tax returns and is contained in an N-dimensional vector y. The following analysis was performed on this data:
$$\min_{W,H} \|X - WH\|^2 \quad \text{subject to } w_{ij} \ge 0,\ h_{ij} \ge 0,$$
where matrices W and H are of dimension N × r and r × p respectively, and $w_{ij}$ and $h_{ij}$ denote their elements. We also chose r such that r < N. A linear prediction of the purchasing power from the items purchased was also attempted. Which of the following statements is/are true?
1. The rank of the reconstruction WH is p.
2. The rank of the reconstruction WH is at most r.
3. W defines a dimensionality reduction of features that preserves the variance in the original features.
4. W defines a dimensionality reduction of features leading to a sparse set of features.
5. Predicting the purchasing power from the features given in W is preferable to predicting it directly from the original features given in X.
[5 marks]

Question 14.
The distributions of two two-dimensional variables x and y are shown as scatter plots in Fig. 1.
FIGURE 1: Distribution of two two-dimensional variables
Which of the following statements is/are true?
1. Variable y could have been derived from variable x by a linear transformation of the form y = Ax + b.
2. Variable y could have been derived from variable x by a linear transformation of the form y = Ax.
3. Variable x is likely to have a covariance matrix $\begin{bmatrix} 2.0 & 0.0 \\ 0.0 & 0.0 \end{bmatrix}$.
4. Variable y is likely to have a covariance matrix $\begin{bmatrix} 2.0 & 1.8 \\ 1.8 & 2.0 \end{bmatrix}$.
5. Normalization has been applied to variable y so that it has zero mean.
[5 marks]

Question 15.
A multi-layer perceptron (MLP) is usually trained using gradient descent, with the gradient computed using the error backpropagation algorithm. Which of the following statements is/are true?
1. Given the data in the form of an N × p matrix X, where N is the number of data items and p the input dimension, and the targets in vector t, the weights w could be solved by the formula $\mathbf{w} = (X^T X)^{-1} X^T \mathbf{t}$.
2. The error function of the MLP is quadratic.
3. The speed of convergence of a gradient descent algorithm of the form $\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla_{\mathbf{w}} E$ could be increased by cross validation.
4. The use of a momentum term usually helps improve the speed of convergence.
[5 marks]
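The factorization in Question 13 is a non-negative matrix factorization (NMF). A minimal sketch using scikit-learn on synthetic count data is given below; the shapes N = 300 and p = 600 mirror the question, while the data itself and the choice r = 10 are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
N, p, r = 300, 600, 10
X = rng.poisson(lam=1.0, size=(N, p)).astype(float)    # non-negative purchase counts

model = NMF(n_components=r, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)          # N x r, non-negative
H = model.components_               # r x p, non-negative

print(W.shape, H.shape, np.linalg.matrix_rank(W @ H))  # rank of W @ H is at most r
```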
Question 16.
Given a classification problem $\{\mathbf{x}_n, t_n\}_{n=1}^{N}$, the perceptron learning algorithm updates weights using the formula
$$\mathbf{w}^{(\tau)} = \mathbf{w}^{(\tau-1)} + \eta\, t_n\, \mathbf{x}_n.$$
Which of the following statements is/are true?
1. $\mathbf{x}_n$ is an item of data correctly classified by the current estimate of weights $\mathbf{w}^{(n-1)}$.
2. $\mathbf{x}_n$ is an item of data misclassified by the current estimate of weights $\mathbf{w}^{(n-1)}$.
3. The solution to which the algorithm converges could be written as $\sum_{n=1}^{N} \alpha_n \mathbf{x}_n$, i.e. a weighted combination over all the data.
4. If the data is linearly separable, the iterative algorithm is guaranteed to terminate.
5. The learning rate $\eta$ should be set by cross validation.
6. The above algorithm minimizes the following cost function: $E(\mathbf{w}) = \sum_{n=1}^{N} \left(t_n - \mathbf{w}^T \mathbf{x}_n\right)^2$.
[5 marks]

Question 17.
Two groups of people are sitting in a park. Group A consists of 4 members, whereas Group B consists of 10 members. The positions of all group members are shown in Fig. 2.
FIGURE 2: Distribution of two two-dimensional variables
The k-means algorithm is applied to the data for clustering. A Gaussian Mixture Model (GMM) is also fitted to the data. The initial centroids of k-means and of the GMM are randomly selected from the samples. Both algorithms are run for up to 100 iterations. Since the algorithms are randomly initialised, k-means and the GMM are evaluated over 50 independent trials. Which of the following statements is/are true?
1. For each trial, the cluster centers and the cluster assignments are identical between k-means and the GMM.
2. For each trial, k-means assigns all points to the same cluster.
3. For some trials, the cluster centers obtained using the GMM may be skewed towards the mean of all samples.
[5 marks]

Question 18.
A researcher is deriving the maximisation of the log-likelihood function for Gaussian Mixture Models. The log-likelihood function is given by:
$$\ln p_{\theta}(X) = \sum_{n=1}^{N} \ln\left[\sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k\right)\right],$$
where $\mathcal{N}(\cdot)$ denotes the probability density function of a Gaussian; the dataset is given by $X = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$ and sample $n \in \{1, \ldots, N\}$ is denoted by $\mathbf{x}_n$; the number of samples is N; the number of Gaussian mixture components is K; $\pi_k$, $\boldsymbol{\mu}_k$ and $\Sigma_k$ correspond, respectively, to the weight, mean, and covariance of component $k \in \{1, \ldots, K\}$; and $\theta = \{(\boldsymbol{\mu}_1, \Sigma_1, \pi_1), \ldots, (\boldsymbol{\mu}_K, \Sigma_K, \pi_K)\}$. Which of the following statements is/are true?
1. $\nabla_{\boldsymbol{\mu}_k} \ln p_{\theta}(X) = \displaystyle\sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)} \cdot \nabla_{\boldsymbol{\mu}_k}\left[\ln \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)\right]$
2. $\nabla_{\boldsymbol{\mu}_k} \ln p_{\theta}(X) = \displaystyle\sum_{n=1}^{N} \frac{\ln \pi_k\, \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)} \cdot \nabla_{\boldsymbol{\mu}_k}\left[\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \Sigma_k)\right]$
3. The partial derivative cannot be solved.
[5 marks]

Question 19.
As a data scientist, you are given the following data matrix:
$$X = \begin{bmatrix} 0.85 & 1.12 & 1.14 \\ 0.67 & 0.46 & 1.06 \\ 1.43 & 0.67 & 0.98 \\ 1.04 & 0.58 & 0.09 \\ 1.12 & 0.38 & 1.08 \\ 0.06 & 1.24 & 0.33 \end{bmatrix},$$
where the rows correspond to features and the columns correspond to samples. The number of samples is N = 3, and the number of features is D = 6. Which of the following statements is/are true?
1. The problem is overdetermined.
2. The problem is underdetermined.
3. P = 3 principal components are required to explain 100% of the variance in the data.
4. P = 5 principal components are required to explain 95% of the variance in the data.
[5 marks]
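For Question 19, the explained-variance proportions can be computed directly from the singular values of the centred data matrix. The sketch below assumes the 6 × 3 layout of X reconstructed above (rows as features, columns as samples); it reports cumulative variance ratios rather than asserting any particular answer.

```python
import numpy as np

# Data matrix from Question 19: rows = features (D = 6), columns = samples (N = 3)
X = np.array([[0.85, 1.12, 1.14],
              [0.67, 0.46, 1.06],
              [1.43, 0.67, 0.98],
              [1.04, 0.58, 0.09],
              [1.12, 0.38, 1.08],
              [0.06, 1.24, 0.33]])

Xc = X - X.mean(axis=1, keepdims=True)     # centre each feature across the samples
s = np.linalg.svd(Xc, compute_uv=False)    # singular values of the centred data
var_ratio = s**2 / np.sum(s**2)
print(np.cumsum(var_ratio))                # cumulative proportion of variance explained
```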
Question 20.
Which of the following statements is/are true? Principal Component Analysis (PCA)...
1. ... is not a supervised learning algorithm.
2. ... can be used for dimensionality reduction.
3. ... can be used for data analysis.
4. ... minimises the variance of the projected data.
[5 marks]

END OF PAPER