
CMPT 413/713 Natural Language Processing
Homework 2, Fall 2021
Due: October 12th, 2021

Instructions

Answers are to be submitted by 11:59pm of the due date as a PDF file through Canvas. Go to Canvas, select the HW2-C activity, and submit your answer as answer.pdf. This assignment is to be done individually. Please make sure that your name and student ID are clearly visible.

Collaboration Policy

You may form study groups and discuss the problems in the homework with others, but each person must write up their answers from scratch by themselves. Copying is not permitted and will be treated as a violation of the SFU honour code and academic integrity. If you do discuss the homework with others, please indicate who you discussed the homework with in your writeup.

1 Vector Semantics (8 pts)

(a) The terms distributional and distributed are often used to describe word vectors such as word2vec and GloVe. Explain the difference between the two terms: what is meant by a distributional representation vs. a distributed representation? (2 pts)

(b) Assume we are building a text classifier using a feed-forward neural network with word embeddings as input. When used in our text classifier, the word embeddings can be randomly initialized or initialized to pre-trained embeddings (such as word2vec or GloVe). Explain the advantage of using pre-trained embeddings and when it would be useful to initialize the word embeddings to pre-trained values. Also give an example of when randomly initialized embeddings are sufficient. (2 pts)

(c) Explain the relationship of word2vec skip-gram with negative sampling (SGNS) to the PPMI matrix. (2 pts)

(d) Explain the two components of TF-IDF. Why is tf-idf preferred over the raw count? (2 pts)

2 PPMI (10 pts)

The following table shows a word-word co-occurrence matrix for a text corpus. Each cell contains the number of times the (row) word occurs in the context of the (column) word.

            toast   tasty   bone   muscle
Alberta        40       9      0        0
Montreal       50      35      0        0
Kamloops        0       0     30       25
Hamilton        0       0     45       15

(a) Compute the PPMI weights for the given co-occurrence matrix directly from the raw co-occurrence counts. (4 pts)

(b) Apply Add-1 (Laplace) smoothing to the raw co-occurrence counts and again compute the PPMI weights. (4 pts)

(c) Did you notice any changes between (a) and (b)? Explain your answer. (2 pts)
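As a reference for parts (a) and (b), here is a minimal NumPy sketch of how PPMI weights can be computed from a raw count matrix. The helper `ppmi` and its `smoothing` argument are illustrative choices rather than anything specified in the assignment, and PMI is taken in base 2.

```python
import numpy as np

def ppmi(counts, smoothing=0):
    """PPMI matrix for a word-word co-occurrence count matrix.

    counts[i, j] = number of times row word i occurs in the context of
    column word j. `smoothing` adds a constant to every cell (e.g. 1 for
    Add-1 smoothing) before probabilities are estimated.
    """
    counts = np.asarray(counts, dtype=float) + smoothing
    total = counts.sum()
    p_ij = counts / total                              # joint p(w, c)
    p_i = counts.sum(axis=1, keepdims=True) / total    # p(w), row marginals
    p_j = counts.sum(axis=0, keepdims=True) / total    # p(c), column marginals
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_ij / (p_i * p_j))
    return np.maximum(pmi, 0)                          # PPMI = max(PMI, 0)

# Counts from the table above (rows: Alberta, Montreal, Kamloops, Hamilton;
# columns: toast, tasty, bone, muscle).
C = np.array([[40,  9,  0,  0],
              [50, 35,  0,  0],
              [ 0,  0, 30, 25],
              [ 0,  0, 45, 15]])

print(ppmi(C))               # part (a): raw counts
print(ppmi(C, smoothing=1))  # part (b): Add-1 smoothed counts
```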
3 Learning Word Vectors (12 pts)

In this problem, we will look at how the word2vec model can be implemented using a neural network. Consider the following neural network architecture. As a simplification, assume that we only consider contexts consisting of one word, which means that the network predicts one target word given one context word as input.

In this problem, the vocabulary size is V and the hidden layer size is N. Units in adjacent layers are fully connected. The input to the neural network x is a one-hot vector: out of the V units, only one is 1 and all other units are 0 (only one of the $x_i$ is set to 1).

The first layer of the network transforms the input $x \in \mathbb{R}^V$ using the weight matrix $W \in \mathbb{R}^{V \times N}$ into $h = W^\top x$. The second layer of the network uses a second weight matrix $W' \in \mathbb{R}^{N \times V}$ to obtain a vector of scores for all words in the vocabulary: $u = W'^\top h$. The score for each word can be written as

$u_j = {w'_j}^\top h$

where $w'_j$ is the $j$th column of $W'$. We can then get a multinomial distribution over the words by applying a softmax:

$p(\mathrm{word}_j \mid x) = y_j = \mathrm{softmax}(u)_j$

To train the model, we want to maximize the log-likelihood $\log P(t \mid c)$, or equivalently minimize the cross-entropy error $-\log P(t \mid c)$. Expanding out the cross-entropy error, we get:

$J(\theta) = -\log P(t \mid c) = -\log \left( \frac{\exp({w'_t}^\top h)}{\sum_{m \in V} \exp({w'_m}^\top h)} \right) = -\log \left( \exp({w'_t}^\top h) \right) + \log \left( \sum_{m \in V} \exp({w'_m}^\top h) \right)$

Note that $h = W^\top x$, where $x$ is the input one-hot vector with the $c$th element set to 1. Thus $h = w_c$, where $w_c$ is the $c$th row of $W$:

$J(\theta) = -{w'_t}^\top h + \log \left( \sum_{m \in V} \exp({w'_m}^\top h) \right), \quad \text{where } h = w_c \text{ is the } c\text{th row of } W.$

Please use the above objective function for the remainder of the questions.

(a) To minimize the loss function, we will need to compute the derivative of the loss function with respect to the weights. What is the derivative of the loss function with respect to $w'_{ij}$? (4 pts)

(b) Give the update equation for $w'_{ij}$. Assume that we use stochastic gradient descent to optimize the loss function. (2 pts)

(c) How would you obtain word embeddings using this model (after you have trained it)? (2 pts)

(d) Can you extend this model to obtain the word2vec CBOW model? Explain. Can it be extended to obtain the word2vec skip-gram model? (4 pts)
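For concreteness, here is a minimal NumPy sketch of the forward pass and of the objective $J(\theta)$ defined above. The sizes V and N, the indices c and t, and the random initialization are made-up values for illustration only.

```python
import numpy as np

# Illustrative sizes and indices (not specified in the assignment).
V, N = 10, 4                  # vocabulary size, hidden layer size
c, t = 3, 7                   # context word index, target word index
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))        # input-to-hidden weights
W_prime = rng.normal(scale=0.1, size=(N, V))  # hidden-to-output weights

# Forward pass, following the definitions above.
x = np.zeros(V)
x[c] = 1.0                    # one-hot input for the context word
h = W.T @ x                   # h = W^T x, equal to the c-th row of W
u = W_prime.T @ h             # scores u_j = w'_j^T h for every word j
y = np.exp(u - u.max())
y /= y.sum()                  # softmax: y_j = p(word_j | x)

# Objective J(theta) = -w'_t^T h + log sum_m exp(w'_m^T h)
J = -u[t] + np.log(np.sum(np.exp(u)))
assert np.isclose(J, -np.log(y[t]))  # identical to -log p(word_t | x)
```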
4 Neural Text Classification (8 pts)

Suppose you want to design a neural text classifier using a feed-forward network. We are interested in classifying documents as 'news', 'entertainment', 'sports', and 'science'.

(a) How many output neurons will you have in your output layer? (1 pt)

(b) What will you use as your loss function? (1 pt)

(c) Explain how you can take word embeddings for each word of your document and create a fixed-length embedding of size d for your feed-forward network. Note that your documents can have potentially different lengths. (2 pts)

(d) Now you have an input vector of length d for each document. If your network has two hidden layers with 10 neurons each, how many parameters will your network have? Write out your network parameters and their dimensions. (2 pts)

(e) Assume that you test your neural network on a previously unseen test set. Your network performs poorly on this test set. List two ways that you can improve the generalization of your network (without changing the architecture of the network). (2 pts)

5 Error Backpropagation (12 pts)

Suppose you have modeled the problem of comment sentiment classification using the following neural network. The output of the network can be either +1 (offensive comment) or -1 (neutral comment). Given the network, we will derive error derivatives using backpropagation. For your answers, please follow the notation given in the network: we use $a^{(j)}_i$ to refer to the input to the activation function of the $i$th unit of the $j$th layer, $z^{(j)}_i$ for the output of the $i$th unit of the $j$th layer, and $w^{(j)}_{ik}$ for the weight connecting the output of the $k$th unit of the $j$th layer to the $i$th unit of the $(j+1)$th layer.

As an example, the red node has as input to its activation function

$a^{(2)}_2 = w^{(1)}_{21} x_1 + w^{(1)}_{22} x_2 + w^{(1)}_{23} x_3$

and its output would be

$z^{(2)}_2 = h(a^{(2)}_2)$

Assume that the activation function for the hidden layers is the ELU. For the final output node, assume the activation function is a linear function.

$\mathrm{ELU}(x) = \begin{cases} x & \text{for } x > 0 \\ e^x - 1 & \text{for } x \le 0 \end{cases}$   (1)

$f(x) = cx$   (2)

Also assume this network is trained to minimize the hinge loss. Let $H_n(w)$ be the hinge error for the $n$th training sample $(x_n, y_n)$:

$E_n = H_n(w) = \max(0, 1 - y_n \hat{y}_n) = \max(0, 1 - y_n \, g(x_n, w))$   (3)

where $\hat{y}_n = g(x_n, w)$ is the predicted output of the network and $y_n \in \{+1, -1\}$ is the ground truth. Note that $g(x_n, w)$ is the function represented by the neural network with input $x_n$ and parameters $w$.

Let's use the backpropagation algorithm to compute the derivative of $E_n$. As shorthand, let $\delta^{(j)}_k = \partial E_n / \partial a^{(j)}_k$ be the partial derivative of the hinge loss with respect to $a^{(j)}_k$.

(a) Consider the output layer (layer 4). Calculate $\partial E_n / \partial a^{(4)}_1$. Note that $a^{(4)}_1$ is the input to the activation function of the output node. Using our shorthand, $\partial E_n / \partial a^{(4)}_1 = \delta^{(4)}_1$. (2 pts)

(b) Use the result from (a) to calculate $\partial E_n / \partial w^{(3)}_{12}$. (2 pts)

(c) Consider the second-to-last layer of nodes (layer 3). Write an expression for $\partial E_n / \partial a^{(3)}_1$. Use $\delta^{(4)}_1$ in this expression. (2 pts)

(d) Use the result from (c) to calculate $\partial E_n / \partial w^{(2)}_{11}$. (2 pts)

(e) Consider the weights connecting from the inputs. Write an expression for $\partial E_n / \partial a^{(2)}_1$. Use the set of $\delta^{(3)}_k$ in this expression. (2 pts)

(f) Use the result from (e) to calculate $\partial E_n / \partial w^{(1)}_{11}$. (2 pts)
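The derivatives above are to be derived symbolically, but the building blocks can be written down concretely. Below is a minimal Python sketch of the activation functions and hinge loss from equations (1)-(3), along with their derivatives, since these are the pieces that appear inside the $\delta^{(j)}_k$ terms. The layer sizes of the actual network come from the figure, not from this code, and the example numbers at the end are made up.

```python
import numpy as np

def elu(x):
    """ELU activation, equation (1)."""
    return np.where(x > 0, x, np.exp(x) - 1)

def elu_grad(x):
    """Derivative of the ELU with respect to its input."""
    return np.where(x > 0, 1.0, np.exp(x))

def linear(x, c=1.0):
    """Linear output activation f(x) = c x, equation (2)."""
    return c * x

def hinge_loss(y_true, y_pred):
    """Hinge error E_n = max(0, 1 - y_n * y_hat_n), equation (3)."""
    return max(0.0, 1.0 - y_true * y_pred)

def hinge_loss_grad(y_true, y_pred):
    """(Sub)gradient of the hinge error with respect to y_hat_n."""
    return -y_true if y_true * y_pred < 1 else 0.0

# Made-up example: a neutral comment (y = -1) that the network scores
# as y_hat = 0.4.
print(hinge_loss(-1, 0.4))       # 1.4
print(hinge_loss_grad(-1, 0.4))  # 1: decreasing the score lowers the loss
```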