
Module Code: MAT00031H
BA, BSc and MMath Examinations 2019/20
Department: Mathematics
Title of Exam: Statistical Pattern Recognition
Time Allowed: 2 hours
Allocation of Marks: The marking scheme shown on each question is indicative only.
Instructions for Candidates: Answer ALL questions. Please write your answers in ink; pencil is acceptable for graphs and diagrams. Do not use red ink.
Materials Supplied: Answer booklet; calculator.
Do not write on this booklet before the exam begins. Do not turn over this page until instructed to do so by an invigilator.

1 (of 3).

(a) Explain what is meant by the term pattern recognition and give 3 real-world examples of pattern recognition applications. You should mention both supervised and unsupervised analyses in your answer. [6]

(b) (i) Explain the terms decision rule and decision region in the context of classification.
    (ii) Give the decision rule for a K-class problem in terms of the discriminant functions, g_k(x), for k = 1, ..., K.
    (iii) Give a decision rule based solely on prior probabilities for a two-class problem. What is the probability of error under this rule? [9]

(c) The Likelihood Ratio Test (LRT) that minimizes the Bayes' risk for two classes, C1 and C2, is given by

    \Lambda_{Bayes}(x) = \frac{p(x|C_1)}{p(x|C_2)} \overset{C_1}{\underset{C_2}{\gtrless}} \frac{(L_{12} - L_{22}) P(C_2)}{(L_{21} - L_{11}) P(C_1)}.

    (i) Show that, with the 0/1 loss function, the LRT reduces to the Maximum A Posteriori (MAP) criterion.
    (ii) If the priors are also equal, show that the MAP criterion reduces to the Maximum Likelihood (ML) criterion. [5]

(d) Explain the reject option in Bayes' classification and how it can be implemented. Give a practical example for the use of this option. [5]

(e) Suppose the likelihoods, p(x|C1) and p(x|C2), for two classes, C1 and C2, are defined as follows:

    p(x|C_1) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2},   for all x,
    p(x|C_2) = \frac{1}{4},   -2 < x < 2.

Find the minimum error classification rule for this two-class problem, assuming P(C1) = P(C2) = 0.5. [10]

(f) Given the following two-dimensional dataset:
    Class 1: (0, 2), (0, 4), (1, 2), (2, 3)
    Class 2: (2, 1), (3, 1), (3, 3), (4, 4)
classify a new point (2, 2) using k-nearest neighbour classification with the Euclidean distance when k = 3 and when k = 5. [5]

2 (of 3).

(a) Learning vector quantization (LVQ) and self-organising maps (SOM) are both competitive learning algorithms.
    (i) Explain the term competitive learning.
    (ii) Describe the architecture of an LVQ in terms of a neural network and explain how such a network is used in classification. [4]

(b) The following equations are used in the context of SOMs:

    d(n) = \exp\left( -\frac{(x - w)^2}{2\sigma^2(n)} \right)   and   \Delta w_{jk} = r(n)\, d(n)\, (x_k - w_{jk}).

State the meaning of each symbol and explain how the equations are used in the self-organising process. [8]

(c) A SOM has three output neurons with current weights:

    N1 = (1, -1, -1)^T;   N2 = (-1, 0, 1)^T;   N3 = (1, 1, 0)^T.

Find the winning neuron for the input feature vector x = (-1, 1, 1)^T and give its updated weights using a learning rate of 0.5. [8]

(d) Given the 2-dimensional training data set
    (4, 3), (5, 4), (6, 3), (6, 4), (8, 5), (8, 6), (9, 4), (9, 5), (9, 6), (11, 5),
and using the Parzen window function

    \phi(u) = 1 if |u_j| \le 1/2 for j = 1, 2, and \phi(u) = 0 otherwise,

    (i) estimate the density p(x, y) at the points (x = 7, y = 5) and (x = 9, y = 5) with a bandwidth of 2;
    (ii) estimate the density p(x, y) at the points (x = 7, y = 5) and (x = 9, y = 5) with a bandwidth of 4;
    (iii) giving your reasons, state which bandwidth you think is best. [10]
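For checking the Parzen-window calculations in part (d) after attempting them by hand, here is a minimal Python sketch of the estimator defined above; the helper name parzen_density is illustrative and not part of the paper.

```python
def parzen_density(point, data, h):
    """2-D Parzen-window estimate using the square window of Question 2(d):
    phi(u) = 1 when |u_j| <= 1/2 for j = 1, 2, so a training point contributes
    whenever it lies within h/2 of `point` in both coordinates."""
    count = sum(1 for (xi, yi) in data
                if abs((point[0] - xi) / h) <= 0.5 and abs((point[1] - yi) / h) <= 0.5)
    return count / (len(data) * h ** 2)

# Training data from Question 2(d).
data = [(4, 3), (5, 4), (6, 3), (6, 4), (8, 5),
        (8, 6), (9, 4), (9, 5), (9, 6), (11, 5)]

for h in (2, 4):
    print(h, parzen_density((7, 5), data, h), parzen_density((9, 5), data, h))
```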
3 (of 3).

(a) Artificial neural networks are widely used machine learning algorithms.
    (i) Give the motivation behind the development of artificial neural networks and explain briefly how this is achieved. [3]
    (ii) Draw a diagram to show a simple perceptron, explaining any notation or symbols that you use. [5]

(b) The table below shows the summary statistics for petal length and petal width measurements (in centimetres) of two different species of iris: 50 of the species versicolor and 50 of the species virginica.

                     versicolor           virginica
    variable         mean     variance    mean     variance    grand mean
    petal length     4.26     0.221       5.552    0.305       4.906
    petal width      1.326    0.039       2.026    0.075       1.676

For G groups, with n_g examples in group g and N = n_1 + ... + n_G examples in total, the between-groups variance is given by

    V_B = \frac{1}{N - G} \sum_{g=1}^{G} n_g (\bar{x}_g - \bar{x})^2

and the within-groups variance by

    V_W = \frac{1}{N - G} \sum_{g=1}^{G} (n_g - 1) s_g^2,

where \bar{x}_g and s_g^2 are the sample mean and variance respectively for group g and \bar{x} is the grand mean. You may assume the following identity without proof:

    Var\left( \sum_{j=1}^{m} \alpha_j X_j \right) = \sum_{j=1}^{m} \alpha_j^2 Var(X_j) + 2 \sum_{j=1}^{m} \sum_{k=1}^{j-1} \alpha_j \alpha_k Cov(X_j, X_k)

for random variables X_1, ..., X_m and constants \alpha_1, ..., \alpha_m.

    (i) Calculate the separation for each variable. [10]
    (ii) Let X1 denote petal length and X2 denote petal width. Given that the within-groups and between-groups covariances of petal length and petal width are Cov_W(X1, X2) = 0.061 and Cov_B(X1, X2) = 0.231 respectively, calculate the separation for the linear discriminant

        D = 0.31 (X_1 - \bar{X}_1) + 0.96 (X_2 - \bar{X}_2). [8]

    (iii) Using the discriminant function in part (ii), classify the new feature vector with measurements of 5.9 cm and 2.3 cm for petal length and petal width respectively. [4]

End of examination.

SOLUTIONS: MAT00031H

1. (a) Pattern recognition involves the automated recognition of patterns or regularities in the features of the objects being analysed, which can be used to identify groups or clusters of the objects (unsupervised analysis) or to classify new objects (supervised analysis). Examples of applications include medical diagnosis, speech recognition, fingerprint identification, number-plate recognition and financial forecasting, amongst many others. 6 Marks

(b) (i) In classification, a decision rule is a rule or function that determines which one of the possible classes a feature vector, x, is to be assigned to. A decision region R_j is the region of feature space over which the classifier assigns class C_j.
    (ii) For discriminant functions g_k(x), k = 1, ..., K, the decision rule is: choose class C_j if g_j(x) > g_k(x) for all k ≠ j.
    (iii) For a two-class problem, a decision rule based solely on prior probabilities is: assign class C1 if P(C1) > P(C2), and class C2 otherwise. The probability of error under this rule is P(error) = min(P(C1), P(C2)). 9 Marks

(c) (i) In the case of the 0/1 loss function we have L_jk = 0 if j = k and L_jk = 1 if j ≠ k, and the LRT becomes

    \Lambda_{MAP}(x) = \frac{p(x|C_1)}{p(x|C_2)} \overset{C_1}{\underset{C_2}{\gtrless}} \frac{P(C_2)}{P(C_1)},

which is the MAP criterion.
    (ii) If the priors are also equal, then

    \Lambda_{MAP}(x) = \frac{p(x|C_1)}{p(x|C_2)} \overset{C_1}{\underset{C_2}{\gtrless}} 1,   i.e.   p(x|C_1) \overset{C_1}{\underset{C_2}{\gtrless}} p(x|C_2),

which is the ML criterion. 5 Marks

(d) The reject option in Bayes' classification allows no classification decision to be made in regions of greatest uncertainty, where the greatest posterior probability is significantly less than 1. This can be achieved by specifying a threshold, θ, and rejecting inputs, x, for which the largest posterior probability P(C_k|x) < θ. In medical diagnosis, for example, it may be better to automate classification of the easy cases but leave the more difficult cases to a human expert. 5 Marks

(e) The minimum error classification rule is given by the Bayes' rule. If -2 < x < 2, the Bayes' rule is: choose C1 if p(x|C1) > p(x|C2), i.e.

    \frac{1}{\sqrt{2\pi}} e^{-x^2/2} > \frac{1}{4},   or equivalently   g(x) > 0, where g(x) = \ln\left( \frac{4}{\sqrt{2\pi}} \right) - \frac{x^2}{2},

giving the rule: choose C1 if g(x) > 0, otherwise choose C2. If x < -2 or x > 2 we should always choose C1, since p(x|C2) = 0 there. Solving g(x) = 0 gives x = ±0.9668, so we obtain:

    x < -2: choose C1
    -2 < x < -0.9668: choose C2
    -0.9668 < x < 0.9668: choose C1
    0.9668 < x < 2: choose C2
    x > 2: choose C1

10 Marks

(f) The Euclidean distances from the point (2, 2) to each point in the dataset are:

    Class 1: 2, 2√2, 1, 1
    Class 2: 1, √2, √2, 2√2.

When k = 3, the nearest neighbours are two from Class 1 and one from Class 2, so the point should be classified as Class 1. However, when k = 5, the nearest neighbours are two from Class 1 and three from Class 2, so the point should be classified as Class 2. 5 Marks

Remarks. Everything in this question should be straightforward and similar to examples that have been seen.

Total: 40 Marks
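As a quick numerical cross-check of the k-nearest-neighbour answer in part (f), here is a minimal Python sketch (the helper name knn_classify is illustrative, not part of the paper) that reproduces the distance ranking and majority vote.

```python
from collections import Counter
import math

# Training data from Question 1(f): (point, class label).
train = [((0, 2), 1), ((0, 4), 1), ((1, 2), 1), ((2, 3), 1),
         ((2, 1), 2), ((3, 1), 2), ((3, 3), 2), ((4, 4), 2)]

def knn_classify(x, data, k):
    """Classify point x by majority vote among its k nearest neighbours."""
    # Sort training points by Euclidean distance to x.
    by_distance = sorted(data, key=lambda item: math.dist(x, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

for k in (3, 5):
    print(k, knn_classify((2, 2), train, k))  # expected: Class 1 for k=3, Class 2 for k=5
```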
2. (a) (i) In competitive learning algorithms, nodes (neurons) compete for the right to respond to the input data and only one node is allowed to respond to each input vector. The specialisation of each node in the network is increased in each iteration.
    (ii) In terms of neural networks, an LVQ is a feed-forward network with one hidden layer of nodes (neurons), fully connected with the input layer, and one output layer. In classification, the output class of the winning node is assigned to the input vector. 4 Marks

(b) The equation for \Delta w_{jk} is used to update the weights of the SOM as

    w_{jk}^{new} = w_{jk}^{old} + \Delta w_{jk},

where n is the number of iterations, and

    d(n) = \exp\left( -\frac{(x - w)^2}{2\sigma^2(n)} \right)

determines how nodes are updated. Only the winning neuron and the nodes in the neighbourhood of the winning node are updated, where the neighbourhood is defined by its radius σ(n); d(n) therefore allows nodes within the neighbourhood to be updated according to how close they are to the winning node, with those closest being updated by a greater amount. Here x denotes the winning node and w the node in the neighbourhood. σ depends on n because the neighbourhood shrinks as the number of iterations increases. r(n) is the learning rate, which is also reduced as the number of iterations increases, using an exponential decay function. In the update equation, x_k denotes the kth component of the input vector and w_{jk} the kth weight of the jth node being updated. 8 Marks

(c) The distances from x to each output node are:

    d_1 = \sqrt{(-1 - 1)^2 + (1 - (-1))^2 + (1 - (-1))^2} = \sqrt{12}
    d_2 = \sqrt{(-1 - (-1))^2 + (1 - 0)^2 + (1 - 1)^2} = 1
    d_3 = \sqrt{(-1 - 1)^2 + (1 - 1)^2 + (1 - 0)^2} = \sqrt{5},

so the winning neuron is N2. Its weights are updated according to

    w_{2k}^{new} = w_{2k}^{old} + 0.5 (x_k - w_{2k}^{old}),

giving

    w_{21}^{new} = (-1) + 0.5 (-1 - (-1)) = -1
    w_{22}^{new} = 0 + 0.5 (1 - 0) = 0.5
    w_{23}^{new} = 1 + 0.5 (1 - 1) = 1,

so that we now have N2 = (-1, 0.5, 1)^T. 8 Marks
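A minimal numerical check of part (c), written as an illustrative Python sketch (array layout and names are mine, not from the paper): it finds the winning neuron by Euclidean distance and applies the update w_new = w_old + r (x - w_old) to the winner only, as in the solution above.

```python
import numpy as np

# Current weights and input vector from Question 2(c).
weights = np.array([[ 1.0, -1.0, -1.0],   # N1
                    [-1.0,  0.0,  1.0],   # N2
                    [ 1.0,  1.0,  0.0]])  # N3
x = np.array([-1.0, 1.0, 1.0])
rate = 0.5

# Winning neuron = smallest Euclidean distance to the input.
distances = np.linalg.norm(weights - x, axis=1)
winner = int(np.argmin(distances))
print("distances:", distances)        # [sqrt(12), 1, sqrt(5)]
print("winner: N%d" % (winner + 1))   # N2

# Update only the winner, pulling its weights towards the input.
weights[winner] += rate * (x - weights[winner])
print("updated weights:", weights[winner])  # [-1.0, 0.5, 1.0]
```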
(d) (i) When h = 2, we have

    p(x = 7, y = 5) = \frac{1}{10 \cdot 2^2} \sum_{i=1}^{10} \phi\left( \frac{7 - x_i}{2} \right) \phi\left( \frac{5 - y_i}{2} \right) = \frac{1}{40} (0 + 0 + 0 + 1 + 1 + 1 + 0 + 0 + 0 + 0) = 0.075,

as only data points within 1 unit of the point in both coordinates contribute to the sum. Similarly,

    p(x = 9, y = 5) = \frac{1}{10 \cdot 2^2} \sum_{i=1}^{10} \phi\left( \frac{9 - x_i}{2} \right) \phi\left( \frac{5 - y_i}{2} \right) = \frac{1}{40} (0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 0) = 0.125.

(ii) When h = 4, we have

    p(x = 7, y = 5) = \frac{1}{10 \cdot 4^2} \sum_{i=1}^{10} \phi\left( \frac{7 - x_i}{4} \right) \phi\left( \frac{5 - y_i}{4} \right) = \frac{1}{160} (0 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 0) = 0.05,

as data points within 2 units of the point in both coordinates contribute to the sum. Similarly,

    p(x = 9, y = 5) = \frac{1}{10 \cdot 4^2} \sum_{i=1}^{10} \phi\left( \frac{9 - x_i}{4} \right) \phi\left( \frac{5 - y_i}{4} \right) = \frac{1}{160} (0 + 0 + 0 + 0 + 1 + 1 + 1 + 1 + 1 + 1) = 0.0375.

(iii) Plotting the data shows that it roughly forms two clusters, with the point (x = 9, y = 5) inside one cluster and the point (x = 7, y = 5) between the two clusters, so the density should be greater at (x = 9, y = 5) than at (x = 7, y = 5). This is true when h = 2; however, when h = 4 the kernel function smooths the density too much, so that both clusters contribute to the density at (x = 7, y = 5), making it greater than that at (x = 9, y = 5). Therefore a bandwidth of h = 2 is more appropriate than a bandwidth of h = 4. 10 Marks

Remarks. Theory has been covered in lectures.

Total: 30 Marks

3. (a) (i) Artificial neural networks aim to mimic the processes in the brain. The inputs act like synapses and are multiplied by weights to provide the strength of each signal, from which some function determines the activation of the neuron and hence the output. Just as real neurons can send signals to other synapses, artificial networks can have multiple layers to process more complex problems. 3 Marks

(ii) [Figure: diagram of a simple perceptron.] The figure shows a simple perceptron with inputs x_1, ..., x_n. A bias term, b, is added to the weighted sum of the inputs (denoted by Σ), where each input x_i is multiplied by a weight w_i, to give the function f. The step function, defined by step(f) = 1 if f > 0 and 0 otherwise, determines the output, a. 5 Marks

(b) (i) Using the formulae given, the between- and within-groups variances for petal length are

    V_B(X_1) = \frac{50 (4.26 - 4.906)^2 + 50 (5.552 - 4.906)^2}{98} = 0.426   and   V_W(X_1) = \frac{49 (0.221 + 0.305)}{98} = 0.263.

Similarly, for petal width we have

    V_B(X_2) = \frac{50 (1.326 - 1.676)^2 + 50 (2.026 - 1.676)^2}{98} = 0.125   and   V_W(X_2) = \frac{49 (0.039 + 0.075)}{98} = 0.057.

Thus, the separation for petal length is V_B(X_1) / V_W(X_1) = 1.62 and for petal width the separation is V_B(X_2) / V_W(X_2) = 2.19. 10 Marks

(ii) Using the identity given, we have

    Var_B(D) = (0.31)^2 \cdot 0.426 + (0.96)^2 \cdot 0.125 + 2 \cdot 0.31 \cdot 0.96 \cdot 0.231 = 0.294

and

    Var_W(D) = (0.31)^2 \cdot 0.263 + (0.96)^2 \cdot 0.057 + 2 \cdot 0.31 \cdot 0.96 \cdot 0.061 = 0.114,

so that the separation is given by V_B(D) / V_W(D) = 0.294 / 0.114 = 2.58. 8 Marks

(iii) The decision boundary is given by D = 0, and since the class means for both variables are greatest for the species virginica, we classify examples with D > 0 as virginica and those with D < 0 as versicolor. When petal length = 5.9 cm and petal width = 2.3 cm, we have

    D = 0.31 (5.9 - 4.906) + 0.96 (2.3 - 1.676) = 0.91 > 0,

so this example would be classified as virginica. 4 Marks

Remarks. The theory has been covered in lectures and similar problems given on problem sheets and discussed in seminars.

Total: 30 Marks
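The separations and the classification in part (b) can be reproduced with a short Python sketch; the variable names below are illustrative, and the commented values are the ones derived in the solution above.

```python
# Summary statistics from Question 3(b): 50 examples per species, two groups.
n_g, N, G = 50, 100, 2
means     = {"length": (4.26, 5.552), "width": (1.326, 2.026)}
variances = {"length": (0.221, 0.305), "width": (0.039, 0.075)}
grand     = {"length": 4.906, "width": 1.676}

def separation(var):
    """Between-groups variance, within-groups variance and their ratio."""
    vb = sum(n_g * (m - grand[var]) ** 2 for m in means[var]) / (N - G)
    vw = sum((n_g - 1) * s for s in variances[var]) / (N - G)
    return vb, vw, vb / vw

vb1, vw1, sep1 = separation("length")   # ~0.426, 0.263, 1.62
vb2, vw2, sep2 = separation("width")    # ~0.125, 0.057, 2.19

# Discriminant D = 0.31 (X1 - mean1) + 0.96 (X2 - mean2), using the given
# between- and within-groups covariances of petal length and petal width.
a1, a2 = 0.31, 0.96
cov_b, cov_w = 0.231, 0.061
vb_d = a1**2 * vb1 + a2**2 * vb2 + 2 * a1 * a2 * cov_b   # ~0.294
vw_d = a1**2 * vw1 + a2**2 * vw2 + 2 * a1 * a2 * cov_w   # ~0.114
print("separation of D:", vb_d / vw_d)                   # ~2.58

# Classify the new flower (petal length 5.9 cm, petal width 2.3 cm).
D = a1 * (5.9 - grand["length"]) + a2 * (2.3 - grand["width"])
print("D =", D, "->", "virginica" if D > 0 else "versicolor")  # ~0.91 -> virginica
```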