Programming Example: PM2022

6/5/22, 10:47 PM
2022-summer-assignment-marciayyl/Text_classification.ipynb at a2-submit · datasci-w266/2022-summer-assignment-marciayyl
https://github.com/datasci-w266/2022-summer-assignment-marciayyl/blob/a2-submit/assignment/a2/Text_classification.ipynb
Latest commit: 3a1408e (Mark H Butler, "Release a2", 15 days ago)

Assignment 2: Text Classification with Various Neural Networks

Description: This assignment covers various neural network architectures and components, largely used in the context of classification. You will compare Deep Averaging Networks, Deep Weighted Averaging Networks using Attention, and BERT-based models. You should also be able to develop an intuition for:

- The effects of fine-tuning word vectors or starting with random word vectors
- How various networks behave when the training set size changes
- The effect of shuffling your training data
- The benefits of Attention calculations
- Working with BERT

The assignment notebook closely follows the lesson notebooks. We will use the IMDB dataset and will leverage some of the models, or part of the code, for our current investigation.

The initial part of the notebook is purely setup. We will then evaluate how Attention can make Deep Averaging Networks better.

Do not try to run this entire notebook on your GCP instance, as training the models requires a GPU to finish in a timely fashion. Run this notebook on Google Colab with a GPU. By default, when you open the notebook in Colab it will try to use a GPU. Total runtime of the entire notebook (with solutions and a Colab GPU) should be about 1 hour.

The overall assignment structure is as follows:

1. Setup
   1.1 Libraries, Embeddings, & Helper Functions
   1.2 Data Acquisition
   1.3 Data Preparation
       1.3.1 Training/Test Sets using Word2Vec
       1.3.2 Training/Test Sets for BERT-based models
2. Classification with various Word2Vec-based Models
   2.1 The Role of Shuffling of the Training Set
   2.2 DAN vs Weighted Averaging Models using Attention
       2.2.1 Warm-Up
       2.2.2 The WAN Model
   2.3 Approaches for Training of Embeddings
3. Classification with BERT
   3.1 BERT Basics
   3.2 CLS-Token-based Classification
   3.3 Averaging of BERT Outputs
   3.4 Adding a CNN on top of BERT
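Since the training sections depend on a GPU, it may be worth confirming that one is visible before running them. The following is a minimal sketch, not part of the assignment: the helper name `gpu_available` is hypothetical, and it assumes TensorFlow 2.x, where `tf.config.list_physical_devices` is available. If TensorFlow is not installed, the helper simply reports `False`.

```python
import importlib.util


def gpu_available():
    """Return True if TensorFlow is installed and can see at least one GPU."""
    if importlib.util.find_spec("tensorflow") is None:
        return False  # TensorFlow is not installed in this environment
    import tensorflow as tf
    return len(tf.config.list_physical_devices("GPU")) > 0


print("GPU visible to TensorFlow:", gpu_available())
```

In Colab, if this prints `False`, the runtime type can usually be switched to a GPU from the runtime settings before re-running the notebook.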
INSTRUCTIONS:

- Questions are always indicated as QUESTION, so you can search for this string to make sure you answered all of the questions. You are expected to fill out, run, and submit this notebook, as well as to answer the questions in the answers file as you did in a1.
- ### YOUR CODE HERE indicates that you are supposed to write code.
- If you want to, you can run all of the cells in section 1 in bulk. This is setup work and no questions are in there. At the end of section 1 we will state all of the relevant variables that were defined and created in section 1.

1. Setup

1.1. Libraries and Helper Functions

This notebook requires the TensorFlow dataset and other prerequisites that you must download.

In [1]:
    #@title Imports
    !pip install pydot --quiet
    !pip install gensim==3.8.3 --quiet
    !pip install tensorflow-datasets --quiet
    !pip install -U tensorflow-text==2.8.2 --quiet
    !pip install transformers --quiet

Now we are ready to do the imports.

In [2]:
    #@title Imports
    import numpy as np
    import tensorflow as tf
    from tensorflow import keras

    from tensorflow.keras.layers import Embedding, Input, Dense, Lambda
    from tensorflow.keras.models import Model
    import tensorflow.keras.backend as K
    import tensorflow_datasets as tfds
    import tensorflow_text as tf_text

    from transformers import BertTokenizer, TFBertModel

    import sklearn as sk
    import os
    import nltk
    from nltk.corpus import reuters
    from nltk.data import find

    import matplotlib.pyplot as plt

    import re

    # This continues to work with gensim 3.8.3. It doesn't yet work with 4.x.
    # Make sure your pip install command specifies gensim==3.8.3
    import gensim

Below is a helper function to plot histories.

In [3]:
    #@title Plotting Function
    # 4-window plot. Small modification from matplotlib examples.
    def make_plot(axs, history1, history2,
                  y_lim_loss_lower=0.4, y_lim_loss_upper=0.6,
                  y_lim_accuracy_lower=0.7, y_lim_accuracy_upper=0.8,
                  model_1_name='model 1', model_2_name='model 2',
                  ):
        box = dict(facecolor='yellow', pad=5, alpha=0.2)

        # top left: loss of model 1
        ax1 = axs[0, 0]
        ax1.plot(history1.history['loss'])
        ax1.plot(history1.history['val_loss'])
        ax1.set_title('loss - ' + model_1_name)
        ax1.set_ylabel('loss', bbox=box)
        ax1.set_ylim(y_lim_loss_lower, y_lim_loss_upper)

        # bottom left: accuracy of model 1
        ax3 = axs[1, 0]
        ax3.set_title('accuracy - ' + model_1_name)
        ax3.plot(history1.history['accuracy'])
        ax3.plot(history1.history['val_accuracy'])
        ax3.set_ylabel('accuracy', bbox=box)
        ax3.set_ylim(y_lim_accuracy_lower, y_lim_accuracy_upper)

        # top right: loss of model 2
        ax2 = axs[0, 1]
        ax2.set_title('loss - ' + model_2_name)
        ax2.plot(history2.history['loss'])
        ax2.plot(history2.history['val_loss'])
        ax2.set_ylim(y_lim_loss_lower, y_lim_loss_upper)

        # bottom right: accuracy of model 2
        ax4 = axs[1, 1]
        ax4.set_title('accuracy - ' + model_2_name)

        # small adjustment to account for the 2 accuracy measures in the
        # Weighted Averaging Model with Attention
        if 'classification_accuracy' in history2.history.keys():
            ax4.plot(history2.history['classification_accuracy'])
        else:
            ax4.plot(history2.history['accuracy'])

        if 'val_classification_accuracy' in history2.history.keys():
            ax4.plot(history2.history['val_classification_accuracy'])
        else:
            ax4.plot(history2.history['val_accuracy'])
        ax4.set_ylim(y_lim_accuracy_lower, y_lim_accuracy_upper)

Next, we get the word2vec model from nltk.

In [4]:
    #@title NLTK & Word2Vec
    nltk.download('word2vec_sample')
    word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))
    model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample)

Now that we have the embedding model defined, let's see how many words are in the vocabulary:

In [5]:
    len(model.vocab)

What do the word vectors look like? As expected:

In [6]:
    model['great'][:20]

We can now build the embedding matrix and a vocabulary dictionary:

In [7]:
    #@title Embedding Matrix Creation
    EMBEDDING_DIM = len(model['university'])  # we know... it's 300

    # initialize embedding matrix and word-to-id map:
    embedding_matrix = np.zeros((len(model.vocab.keys()) + 1, EMBEDDING_DIM))
    vocab_dict = {}

    # build the embedding matrix and the word-to-id map:
    for i, word in enumerate(model.vocab.keys()):
        embedding_vector = model[word]
        if embedding_vector is not None:
            # words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
            vocab_dict[word] = i

In [8]:
    embedding_matrix

The last row consists of all zeros. We will use that for the UNK token, the placeholder token for unknown words.

1.2 Data Acquisition

We will use the IMDB dataset delivered as part of the TensorFlow-datasets library, and split it into training and test sets. For expedience, we will limit ourselves in terms of train and test examples.

In [9]:
    train_data, test_data = tfds.load(
        name="imdb_reviews",
        split=('train[:80%]', 'test[80%:]'),
        as_supervised=True)

    train_examples_batch, train_labels_batch = next(iter(train_data.batch…
    test_examples_batch, test_labels_batch = next(iter(test_data.batch…

It is always highly recommended to look at the data.

In [10]:
    train_examples_batch[2:4]

In [11]:
    train_labels_batch[2:4]

For convenience, in this assignment we will define a maximum length and only keep the examples that are at least that long:

In [12]:
    MAX_SEQUENCE_LENGTH = 100
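To make the setup above concrete, here is a minimal sketch on toy data of the two ideas introduced so far: an embedding matrix whose extra, all-zero last row stands in for unknown words, and a length filter that keeps only sufficiently long examples and truncates them. It is not the assignment's code. The names `toy_vectors`, `toy_matrix`, `toy_vocab`, `UNK_ID`, `encode`, and `keep_and_truncate` are hypothetical, and `MAX_LEN = 4` is a stand-in for `MAX_SEQUENCE_LENGTH`.

```python
import numpy as np

MAX_LEN = 4  # toy stand-in for MAX_SEQUENCE_LENGTH

# Hypothetical 3-word vocabulary with 2-dimensional vectors.
toy_vectors = {"good": [0.1, 0.2], "bad": [0.3, 0.4], "movie": [0.5, 0.6]}

# One extra all-zero row at the end is reserved for unknown words.
toy_matrix = np.zeros((len(toy_vectors) + 1, 2))
toy_vocab = {}
for i, (word, vec) in enumerate(toy_vectors.items()):
    toy_matrix[i] = vec
    toy_vocab[word] = i
UNK_ID = len(toy_vectors)  # index of the all-zero row


def encode(tokens):
    """Map tokens to ids, sending unknown tokens to UNK_ID."""
    return [toy_vocab.get(tok, UNK_ID) for tok in tokens]


def keep_and_truncate(examples, max_len=MAX_LEN):
    """Keep examples with at least max_len tokens and cut them to max_len."""
    encoded = (encode(tokens) for tokens in examples)
    return [ids[:max_len] for ids in encoded if len(ids) >= max_len]


examples = [
    "good movie".split(),                   # too short, dropped
    "a good movie not bad at all".split(),  # kept, truncated to 4 ids
]
toy_ids = np.array(keep_and_truncate(examples))

# Looking up ids in the matrix yields one vector per token;
# unknown tokens map to the zero vector.
toy_embedded = toy_matrix[toy_ids]
```

Because every retained example has exactly `MAX_LEN` ids, the result is a rectangular array that can be indexed into the embedding matrix directly, which is the property the fixed maximum length is meant to provide.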
For simplicity, we will also limit ourselves to examples that actually have at least MAX_SEQUENCE_LENGTH tokens.

1.3. Data Preparation

1.3.1. Training/Test Sets for Word2Vec-based Models

First, we tokenize the data:

In [13]:
    tokenizer = tf_text.WhitespaceTokenizer()
    train_tokens = tokenizer.tokenize(train_examples_batch)
    test_tokens = tokenizer.tokenize(test_examples_batch)

Does this look right?

In [14]:
    train_tokens[0]

Yup... looks right. Of course we will need to take care of the encoding later.

Next, we define a simple function that converts the tokens above into the appropriate word2vec index values.

In [15]:
    #@title Definition of sents_to_ids function
    def sents_to_ids(token_list_list, label_list, num_examples=100000000…
        """
        converting a list of strings to a list of lists of word ids
        """
        text_ids = []
        text_labels = []
        valid_example_list = []
        example_count = 0

        use_token_list_list = token_list_list[:num_examples]

        for i, token_list in enumerate(use_token_list_list):
            if i < num_examples:
                try:
                    example = []
                    for token in list(token_list.numpy()):
                        decoded = token.decode('utf-8').replace('.','' …
                        try:
                            example.append(vocab_dict[decoded])
                        except:
                            # unknown word: use the id of the all-zero UNK row
                            example.append(43981)
                    # keep only examples with enough tokens, truncated to the maximum length
                    if len(example) >= MAX_SEQUENCE_LENGTH:
                        text_ids.append(example[:MAX_SEQUENCE_LENGTH])
                        text_labels.append(label_list[i])
                        if example_count % 5000 == 0:
                            print('Examples processed: ', example_count)
                        valid_example_list.append(i)
                        example_count += 1
                    else:
                        pass
                except:
                    pass

        print('Number of examples retained: ', example_count)
        return (np.array(text_ids), np.array(text_labels), valid_example_list)

Now we can create training and test data that can be fed into the models of interest.

In [16]:
    train_input_ids, train_input_labels, train_valid_example_list = sents_to_ids…
    test_input_ids, test_input_labels, test_valid_example_list = sents_to_ids…

The variable 'train_valid_example_list' contains the list of chosen examples that we can use later for the construction of the BERT training and test sets.

In [17]:
    train_valid_example_list[:5]

Examples 3 and 4 were apparently shorter than our target length.

We will also create a reduced training dataset with only 1000 examples to study the effect of the dataset size.

In [18]:
    REDUCED_TRAINING_SIZE = 1000

    train_input_ids_reduced = train_input_ids[:REDUCED_TRAINING_SIZE]
    train_input_labels_reduced = train_input_labels[:REDUCED_TRAINING_SIZE]

Let's convince ourselves that the data looks correct:

In [19]:
    train_input_ids[:2]

1.3.2. Training/Test Sets for BERT-based models

We already imported the BERT model and the Tokenizer libraries. Now, we create the tokenizer:
In [20]:
    bert_tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

Since the BERT tokenizer is not a whitespace tokenizer, each sentence will almost certainly result in more BERT tokens than whitespace tokens. Since we don't want to cheat by showing BERT more examples than the other models, we should restrict ourselves to the data that will also be seen by the other models:

In [21]:
    #@title Limit BERT data to the set used with word2vec
    all_train_examples = [x.decode('utf-8') for x in train_examples_batch…
    all_test_examples = [x.decode('utf-8') for x in test_examples_batch…

    bert_valid_train_examples_text = []
    bert_valid_train_examples_labels = []
    bert_valid_test_examples_text = []
    bert_valid_test_examples_labels = []

    for valid_example in train_valid_example_list:
        bert_valid_train_examples_text.append(all_train_examples[valid_example])
        bert_valid_train_examples_labels.append(train_labels_batch[valid_example])

    for valid_example in test_valid_example_list:
        bert_valid_test_examples_text.append(all_test_examples[valid_example])
        bert_valid_test_examples_labels.append(test_labels_batch[valid_example])

Next, we will create our training and test sets for BERT models.

In [22]:
    #@title BERT Tokenization of training and test data
    num_train_examples = 2500000
    num_test_examples = 500000

    max_length = MAX_SEQUENCE_LENGTH

    x_train = bert_tokenizer(bert_valid_train_examples_text[:num_train_examples],
                             max_length=max_length,
                             truncation=True,
                             padding='max_length',
                             return_tensors='tf')
    y_train = bert_valid_train_examples_labels[:num_train_examples]

    x_test = bert_tokenizer(bert_valid_test_examples_text[:num_test_examples],
                            max_length=max_length,
                            truncation=True,
                            padding='max_length',