A33648 Specified calculator Page 1 of 8 TURN OVER Calculators may be used in this examination but must not be used to store text. Calculators with the ability to store text should have their memories deleted prior to the start of the examination. Special Requirements: None School of Engineering Department of Electronic, Electrical and Systems Engineering Level M 04 30058 Data Mining and Machine Learning January Examinations 2020 Time Allowed: 2 hours Answer THREE questions The allocation of marks within each question is stated in the right-hand margin. Specified calculator A33648 Page 2 of 8 TURN OVER 1. (a) Consider the following six short texts 1, 2, 3, 4, 5, 6: 1: He famously said that his priorities were education, education and education. 2: The education secretary was Mr Gove. 3: Some parents teach their children at home. 4: Home schooling can be a challenge for parents. 5: The school education system needs reform. 6: Education is a priority. After stop word removal and stemming these become the following documents: 1: famous priority educate educate educate 2: educate secretary gove 3: parent teach child home 4: home school challenge parent 5: school educate system reform 6: educate priority (i) Calculate the Inverse Document Frequency (IDF) for each term in the documents and the TF-IDF weights for each term relative to documents 1 and 5. [6] (ii) Now consider the query “the education system”, which after stop word removal and stemming, has the form: : educate system Calculate the lengths of the documents 1 and 5 and the query . [4] (iii) Calculate the similarities between the query and the documents 1 and 5. [4] (iv) How many times would the term “educate” need to appear in to make 1 more similar than 5 to [4] (v) If the terms “teach” in 3 and “school” in 4 are both replaced by the term “educate”, how many times would the term “educate” need to appear in to make 1 more similar than 5 to [2] Specified calculator A33648 Page 3 of 8 TURN OVER 2. (a) Let = {1, … , 8} be the following set of points in the plane: 1 = [ 3 3 ] , 2 = [ 1 3 ] , 3 = [ 1 2 ] , 4 = [ 2 4 ], 5 = [ 2 0 ], 6 = [ 3 4 ] , 7 = [ 1 3 ] , 8 = [ 2 2 ], and let = {1, 2} be two centroids: 1 = [ 1 3.5 ] , 2 = [ 2 3 ] , that represent . (i) Calculate the new values of 1 and 2 after one iteration of -means clustering ( = 2). Use the Euclidean (2) metric for distance calculations. You must show the steps in your calculations. [5] (ii) Carefully draw a scatter plot of the points , the original centroids and the new centroids. Based on this plot, would the centroids change or remain unchanged after a further iteration of -means clustering Explain your answer. [4] (b) Consider 2-dimensional data being modelled using a Gaussian Mixture Model (GMM) with two mixture components and diagonal covariance matrices. The parameters of the GMM components are: Gaussian 1: mean 1 = [ 4 6 ], variance 1 2 = [ 1 1 ], and weight 1 = 0.6 Gaussian 2: mean 2 = [ 8 3 ], variance 2 2 = [ 1 4 ], and weight 2 = 0.4 Calculate the probability of the sequence of data = ([ 4 5 ] , [ 8 4 ]) being generated by the GMM. Show your working. [4] (c) Latent Semantic Analysis (LSA) is applied to a set of documents with a vocabulary of different words. The Word-Document matrix is decomposed into = using Singular Value Decomposition. (i) In this context, explain what is meant by the Word-Document matrix. [2] (ii) What are the dimensions and properties of the matrices , and [3] (iii) Explain how the latent semantic classes that emerge from LSA can be interpreted. [2] Specified calculator A33648 Page 4 of 8 TURN OVER 3. (a) Consider two sequences = {1, 2, 3} and = {1, 2, 3, 4, 5}, where: 1 = [ 1 2 1 ] , 2 = [ 2 3.5 2.2 ] , 3 = [ 6 4 4 ] and 1 = [ 0.5 2 1 ] , 2 = [ 1 2.5 1 ] , 3 = [ 5 3.5 4 ] 4 = [ 4.5 3 4.2 ] , 5 = [ 4.5 2 2 ] . Use Dynamic Programming with the city-block (1) metric to calculate the accumulated distance and the optimal alignment path between and . You should assume that insertion and deletion penalties are zero. Your answer must include all of the steps in the calculation, in particular: (i) The distance matrix between and . [2] (ii) The accumulated distance matrix between and . [3] (iii) The path matrix for and . [2] (iv) The optimal alignment path for and . [1] (v) The accumulated distance between and . [1] (b) Figure 3.1 shows a Multi-Layer Perceptron (MLP) with 3 layers: an input layer with 3 units and a fixed ‘bias’ unit that always has input 1, a hidden layer with 2 units, and an output layer with 2 units. Figure 3.1 The weight matrices W1 (between the Input Layer and the Hidden Layer) and W2 (between the Hidden Layer and the Output Layer) are given by: 1 = [ 0.8 0.3 0.4 0.5 0.4 0.8 0.7 4 ] , 2 = [ 0.6 0.9 0.2 0.4 ]. Input Layer Hidden Layer Output Layer 1 Specified calculator A33648 Page 5 of 8 TURN OVER The Input Layer has linear activation functions, the Hidden Layer has sigmoid activation functions, and the Output Layer has sigmoid activation functions. The sigmoid activation function is given by: () = (1 + ) 1. The vector = [ 5 3 2 ] is applied to the input layer. (i) Calculate the output from each unit in the Hidden Layer. [2] (ii) Calculate the output from each unit in the Output Layer. [2] (c) Consider the MLP from part (b). Now suppose that the input is a training pattern with target output = [ 1 0 ]. (i) Use Error Back-Propagation to calculate the values of derivatives of the error with respect to the weights , , = 1,2, where is the weight associated with the link between the unit in the Hidden Layer and the unit in the Output Layer. [4] (ii) Explain why the initial values of the matrices 1 and 2 may be important in determining their final values after convergence of the Error Back- Propagation training process. [3] Specified calculator A33648 Page 6 of 8 END OF PAPER 4. (a) Figure 4.1 depicts a hidden Markov model (HMM) with three emitting states. Each of the HMM state can emit the symbol , , or . Figure 4.1 The state output probability distributions for these states are given by: (|1) = 0.5, (|1) = 0.3, (|1) = 0.2 (|2) = 0.2, (|2) = 0.3, (|2) = (|3) = 0.1, (|3) = 0.2, (|3) = 0.7 The initial state probability vector is given by P0=(1,0,0). (i) Calculate the state transition probabilities that are missing in Figure 4.1 and write down the state transition probability matrix. Calculate the missing value for the state output probability distribution (|2). Justify your answers. [2] (ii) Let = (1, 1, 2, 3, 3) be a sequence of states of M. Let be the observation sequence (, , , , ). Calculate the probability (, |) of the observation sequence and the state sequence . Calculate the probability (|, ). [4] (iii) Give the definition of the quantity that is computed using the Viterbi algorithm and the recursive equation used by the algorithm. [3] (b) Consider a sequence of data = (1, … , ) to be modelled by a HMM whose state output probability density functions (pdfs) are single Gaussian with a diagonal covariance matrix. Describe how you would estimate the parameters of the state output pdfs of such a HMM using the maximum- likelihood procedure. [5] (c) What is the precise definition of a phoneme How is the concept of phoneme exploited in an HMM-based large vocabulary speech recognition system [3] (d) What is a ‘triphone HMM’ and what problem does it attempt to solve compared with a ‘monophone HMM’ [3] s3 s2 s1 0.7 0.1 0.5 Specified calculator A33648 Page 7 of 8 This page is intentionally blank A33648 LM Data Mining and Machine Learning Specified calculator Page 8 of 8 Important Reminders Coats/outwear should be placed in the designated area. Unauthorised materials (e.g. notes or Tippex) must be placed in the designated area. Check that you do not have any unauthorised materials with you (e.g. in your pockets, pencil case). Mobile phones and smart watches must be switched off and placed in the designated area or under your desk. They must not be left on your person or in your pockets. You are not permitted to use a mobile phone as a clock. If you have difficulty seeing a clock, please alert an Invigilator. You are not permitted to have writing on your hand, arm or other body part. Check that you do not have writing on your hand, arm or other body part – if you do, you must inform an Invigilator immediately Alert an Invigilator immediately if you find any unauthorised item upon you during the examination. Any students found with non-permitted items upon their person during the examination, or who fail to comply with Examination rules may be subject to Student Conduct procedures. Do not complete the attendance slip, fill in the front of the answer book or turn over the question paper until you are told to do so