python-COMP42415

Durham University
Department of Computer Science
Master of Data Science

COMP42415 Text Mining and Language Analytics
Workshops

Author: Dr Stamos Katsigiannis
2021-22

Contents

1 Workshop 1: Text query with Regular Expressions
  1.1 Regular Expressions Definition
  1.2 The “re” Python package
    1.2.1 Example
    1.2.2 RegEx functions in “re”
    1.2.3 Regular expression metacharacters
    1.2.4 Regular expression special sequences
    1.2.5 Repeating regular expressions
  1.3 Using Regular Expressions to match strings
    1.3.1 Matching example
    1.3.2 Matching to validate strings
    1.3.3 Credit card number validation
    1.3.4 Validation of United Kingdom’s National Insurance numbers (NINO)
    1.3.5 Validation of hexadecimal numbers
  1.4 Using Regular Expressions to search elements in files
    1.4.1 Parsing XML files
    1.4.2 Parsing HTML files
    1.4.3 Parsing raw text
  1.5 Using Regular Expressions to substitute strings
    1.5.1 Substitution example
    1.5.2 Email domain substitution
  1.6 Exercises

2 Workshop 2: Text pre-processing
  2.1 NLTK corpora
    2.1.1 Import NLTK
    2.1.2 Corpus file IDs
    2.1.3 Corpus file categories
    2.1.4 Corpus words
    2.1.5 Selecting files from specific category
    2.1.6 Corpus sentences
    2.1.7 Accessing sentences from specific file
    2.1.8 Number of documents in each category
    2.1.9 Corpus raw text
  2.2 Input text from text file
  2.3 Text pre-processing
    2.3.1 Sentence Tokenisation
    2.3.2 Word Tokenisation
    2.3.3 Lowercasing
    2.3.4 Stop words removal
    2.3.5 Punctuation removal
    2.3.6 Stemming
    2.3.7 Lemmatisation
    2.3.8 Part of Speech (POS) tagging
  2.4 Exercises

3 Workshop 3: Text representation
  3.1 Load corpus
  3.2 Vocabulary
    3.2.1 Words in corpus
    3.2.2 Unique words in corpus
    3.2.3 Vocabulary of multiple documents
  3.3 One-hot encoding
    3.3.1 One-hot encoding of words in vocabulary
    3.3.2 One-hot encoding of text
  3.4 Term Frequency (TF) representation
    3.4.1 Compute TF of words in text
    3.4.2 TF representation of documents
  3.5 Term Frequency – Inverse Document Frequency (TF-IDF)
    3.5.1 Document Frequency (DF)
    3.5.2 Inverse Document Frequency (IDF)
    3.5.3 Term Frequency – Inverse Document Frequency (TF-IDF)
  3.6 Exercises

4 Workshop 4: N-Grams
  4.1 N-grams
  4.2 Unigrams (1-Grams)
    4.2.1 Compute unigrams
    4.2.2 Unigram probability
    4.2.3 Sentence probability
    4.2.4 Smoothing
  4.3 Bigrams (2-Grams)
    4.3.1 Compute bigrams
    4.3.2 Bigram probability
    4.3.3 Sentence probability
    4.3.4 Smoothing
  4.4 Trigrams (3-Grams)
    4.4.1 Compute trigrams
    4.4.2 Trigram probability
    4.4.3 Sentence probability
    4.4.4 Smoothing
  4.5 The number underflow issue
  4.6 Exercises

5 Workshop 5: Word embeddings
  5.1 Word context
    5.1.1 Load text
    5.1.2 Compute context words
    5.1.3 Other contexts
  5.2 Word-word co-occurrence matrix
    5.2.1 Word-word co-occurrence matrix (Context size = 1)
    5.2.2 Word-word co-occurrence matrix (Context size = 2)
    5.2.3 Compute word-word co-occurrence matrix as numpy array
    5.2.4 Word-word co-occurrence matrix visualisation
  5.3 Word embeddings
    5.3.1 Word embeddings computation
    5.3.2 Word embeddings visualisation
    5.3.3 Word embeddings distance
  5.4 Exercises

6 Workshop 6: Document embeddings for machine learning
  6.1 Loading data with pandas
    6.1.1 Load dataset file
    6.1.2 Available words in dataset
    6.1.3 Access specific row in pandas dataframe
    6.1.4 Access specific columns in pandas dataframe
    6.1.5 Convert dataframe to numpy array
  6.2 Word embeddings
    6.2.1 Load word embeddings
    6.2.2 Distance of word embeddings
  6.3 Document embeddings
    6.3.1 Compute document embedding (Mean word embedding)
    6.3.2 Cosine distance of document embeddings
  6.4 Classification using document embeddings
    6.4.1 Classification with the k Nearest Neighbour algorithm (kNN)
    6.4.2 Classification with Linear Support Vector Machines (SVM)
    6.4.3 Classification using Feed-Forward Neural Networks
  6.5 Exercises

7 Workshop 7: Text Classification Using Traditional Classifiers
  7.1 Introduction to Python classes
    7.1.1 A simple Python class
    7.1.2 Definition of class methods
    7.1.3 Class initialisation and method overloading
    7.1.4 Definition of a custom class for text documents
  7.2 Preparation of spam detection dataset
    7.2.1 Dataset loading
    7.2.2 Word tokenisation
    7.2.3 Pre-processing
    7.2.4 Join email words list
  7.3 Text classification for spam detection
    7.3.1 Splitting of dataset into training and test sets
    7.3.2 Text classification using Naive Bayes
    7.3.3 Computation and plotting of Naive Bayes’s classification performance
    7.3.4 Text classification using kNN
    7.3.5 Computation and plotting of kNN’s classification performance
  7.4 Saving and loading a trained machine learning model
    7.4.1 Save trained model in a file for future use
    7.4.2 Loading and use of saved trained model
  7.5 Exercises
8 Workshop 8: Text classification using Recurrent Neural Networks (RNNs)
  8.1 Dataset preparation
    8.1.1 Load fake news dataset
    8.1.2 Dataset pre-processing
    8.1.3 Create PyTorch dataset
    8.1.4 Divide dataset into training and test
    8.1.5 Create vocabulary using the training set
    8.1.6 Create iterators for the training and test data
  8.2 Create LSTM architecture
    8.2.1 Define network architecture
    8.2.2 Define hyperparameters and initialise model
    8.2.3 Define the optimiser, loss function and performance metric
    8.2.4 Define training function
    8.2.5 Define evaluation function
  8.3 Train LSTM model
  8.4 Classify text using trained model
  8.5 Exercises

A Appendix: Test files
  A.1 movies.xml
  A.2 links.html
  A.3 emails.txt
  A.4 cds.xml
  A.5 alice.txt
  A.6 dune.txt

Workshop 1: Text query with Regular Expressions

1.1 Regular Expressions Definition

A regular expression (shortened as “regex” or “regexp”) is a sequence of characters that defines a search pattern. Usually such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation.

1.2 The “re” Python package

Python has a built-in package called “re”, which can be used to work with regular expressions. Let’s import this package and use a regular expression to check whether the sentence “I started studying this year at Durham University” ends with the word “University” or with the word “school”.
1.2.1 Example

import re  # Import the re package

txt = "I started studying this year at Durham University"
x1 = re.search("University$", txt)  # Returns a Match object if there is a match anywhere in the string with the regex
x2 = re.search("school$", txt)
print("x1:", x1)
print("x2:", x2)
if(x1):
    print("The text ends with 'University'")
else:
    print("The text does not end with 'University'")
if(x2):
    print("The text ends with 'school'")
else:
    print("The text does not end with 'school'")

The output will look like:

x1: <_sre.SRE_Match object; span=(39, 49), match='University'>
x2: None
The text ends with 'University'
The text does not end with 'school'

1.2.2 RegEx functions in “re”

The re module offers a set of functions that allow us to search a string for a match:

Function        Description
findall(args)   Returns a list containing all matches
search(args)    Returns a Match object if there is a match anywhere in the string
split(args)     Returns a list where the string has been split at each match
sub(args)       Replaces one or many matches with a string

1.2.3 Regular expression metacharacters

As you can see in our first example, we used the character “$” in order to indicate that a text matching the regular expression should end with the string preceding the character “$”. For example, the regular expression “car$” indicates that the text should end with the string “car”. In this case, “$” is considered a metacharacter, i.e. a character with a special meaning. Below are the metacharacters supported by the “re” package:

Character   Description                                                                  Example
[ ]         A set of characters                                                          “[a-f]”
\           Signals a special sequence (can also be used to escape special characters)   “\s”
.           Any character (except newline character)                                     “uni..rsity”
^           Starts with                                                                  “^She”
$           Ends with                                                                    “John$”
*           Zero or more occurrences                                                     “o*”
+           One or more occurrences                                                      “l+”
?           Matches 0 or 1 repetitions of the preceding regex                            “ab?” will match either “a” or “ab”
{}          Exactly the specified number of occurrences                                  “o{2}”
|           Either or                                                                    “he|she”
()          Capture and group

The first metacharacters we’ll look at are [ and ]. They’re used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a “-”. For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c], which uses a range to express the same set of characters. If you wanted to match only lowercase letters, your regex would be [a-z].

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters “a”, “k”, “m”, or “$”. “$” is usually a metacharacter, but inside a character class it is stripped of its special nature.

You can match the characters not listed within the class by complementing the set. This is indicated by including a “^” as the first character of the class. For example, [^5] will match any character except “5”. If the caret appears elsewhere in a character class, it does not have special meaning. For example, [5^] will match either a “5” or a “^”.

Perhaps the most important metacharacter is the backslash, \. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It is also used to escape all the metacharacters so you can still match them in patterns. For example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: \[ or \\.
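To see these rules in action, here is a small illustrative sketch (the patterns and test strings are our own, not from the workshop exercises):

import re

# Character class: any one of "a", "b", or "c" (same as "[a-c]")
print(re.findall("[abc]", "a big cat"))    # ['a', 'b', 'c', 'a']

# Complemented class: any character that is NOT a digit
print(re.findall("[^0-9]", "a1b2c3"))      # ['a', 'b', 'c']

# Escaped metacharacter: a literal "$" followed by one or more digits
# (a raw string, r"...", avoids doubling the backslash)
print(re.findall(r"\$[0-9]+", "It costs $15 now"))  # ['$15']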
Some of the special sequences beginning with “\” represent predefined sets of characters that are often useful, such as the set of digits, the set of letters, or the set of anything that isn’t a white space.

1.2.4 Regular expression special sequences

Let’s see some of the main regular expression special sequences. For a more detailed list, please refer to https://docs.python.org/3/library/re.html#re-syntax. These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any white space character, or “,” or “.”.

Sequence   Description
\d         Matches any decimal digit. Equivalent to the class [0-9].
\D         Matches any non-digit character. Equivalent to the class [^0-9].
\s         Matches any white space character. Equivalent to the class [ \t\n\r\f\v].
\S         Matches any non-white space character. Equivalent to the class [^ \t\n\r\f\v].
\w         Matches any alphanumeric character. Equivalent to the class [a-zA-Z0-9_].
\W         Matches any non-alphanumeric character. Equivalent to the class [^a-zA-Z0-9_].

1.2.5 Repeating regular expressions

Being able to match varying sets of characters is the first thing regular expressions can do that isn’t already possible with the methods available on strings. However, if that was the only additional capability of regexes, they wouldn’t be much of an advance. Another capability is that you can specify that portions of the regular expression must be repeated a certain number of times.

The first metacharacter for repeating things that we’ll look at is “*”. “*” does not match the literal character “*”, but it specifies that the previous character can be matched zero or more times, instead of exactly once.

Example: “do*g” will match “dg” (zero “o” characters), “dog” (one “o” character), “doooog” (four “o” characters), and so on.

Repetitions such as “*”, “+”, and “?” are greedy. When repeating a regular expression, the matching engine will try to repeat it as many times as possible. If later portions of the pattern don’t match, the matching engine will then back up and try again with fewer repetitions. If this behaviour is undesirable, you can add “?” after the qualifier (“*?”, “+?”, “??”) to make it perform the match in non-greedy or minimal fashion, i.e. as few characters as possible will be matched.

Step-by-step example

Let’s consider the expression a[bcd]*b. This matches the letter “a”, zero or more letters from the class [bcd], and finally ends with a “b”. Now imagine matching this regular expression against the string “abcbd”.

Step   Matched   Explanation
1      a         The “a” in the regex matches.
2      abcbd     The engine matches “[bcd]*”, going as far as it can, which is to the end of the string.
3      FAILED    The engine tries to match “b”, but the current position is at the end of the string, so it fails.
4      abcb      Back up, so that “[bcd]*” matches one less character.
5      FAILED    Try “b” again, but the current position is at the last character, which is a “d”.
6      abc       Back up again, so that “[bcd]*” is only matching “bc”.
7      abcb      Try “b” again. This time the character at the current position is “b”, so it succeeds.

The end of the regular expression has now been reached, and it has matched “abcb”. This demonstrates how the matching engine goes as far as it can at first, and if no match is found it will then progressively back up and retry the rest of the regular expression again and again. It will back up until it has tried zero matches for “[bcd]*”, and if that subsequently fails, the engine will conclude that the string does not match the regex at all.
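We can confirm this behaviour directly in Python. The following short sketch (our own illustration, with made-up test strings) checks the a[bcd]*b example above and contrasts greedy with non-greedy matching:

import re

# The step-by-step example: a[bcd]*b matched against "abcbd"
x = re.search("a[bcd]*b", "abcbd")
print(x.group())  # abcb (the engine backtracked until the final "b" could match)

# Greedy vs non-greedy repetition
print(re.search("o+", "stoooop").group())   # oooo ("+" grabs as many "o"s as possible)
print(re.search("o+?", "stoooop").group())  # o ("+?" stops after the first "o")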
Another repeating metacharacter is “+”, which matches one or more times. Pay careful attention to the difference between “*” and “+”. “*” matches zero or more times, so whatever’s being repeated may not be present at all, while “+” requires at least one occurrence.

Example: “do+g” will match “dog” (one “o” character), “dooog” (three “o” characters), and so on, but will not match “dg” (zero “o” characters).

There are two more repeating qualifiers. The question mark character “?” matches either once or zero times. Think of it as marking something as being optional.

Example: “pre-?processing” matches either “preprocessing” or “pre-processing”.

The most complicated repeated qualifier is “{m,n}”, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. For example, “a/{1,3}b” will match “a/b”, “a//b”, and “a///b”, but it will not match “ab”, which has no slashes, or “a////b”, which has four slashes. You can omit either m or n. In this case, default values for m or n are used. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity.

Note: Some qualifiers are interchangeable. For example, “{0,}” is the same as “*”, “{1,}” is the same as “+”, and “{0,1}” is the same as “?”. “*”, “+”, and “?” make the regular expression easier to read, so try to use them if possible.

1.3 Using Regular Expressions to match strings

1.3.1 Matching example

Let’s use the text “I started studying this year at Durham University” again and find out whether the string “at” or the string “in” exists in the text.

txt = "I started studying this year at Durham University"
x = re.search("at|in", txt)
print(x.string)  # Returns the string passed into the function
print(x.span())  # Returns a tuple containing the start and end positions of the match
print(x.group())  # Returns the part of the string where there was a match
print(txt[x.span()[0]:x.span()[1]])  # Print the content of the string at the positions of the match

The output will look like:

I started studying this year at Durham University
(15, 17)
in
in

As you can see, there was a match to our regex at the character with index 15 (counting starts from 0), ending at the character with index 17. Indeed, the string “in” was found within the word “studying”. However, if you read the input text, there should have been a second match for the word “at”, but only the first match was returned. Note that if there is more than one match, only the first occurrence of the match will be returned by the search() function! We can use the findall() function to get a list of all matches in the order they are found.

x = re.findall("at|in", txt)
for match in x:
    print(match)

The output will look like:

in
at

As expected, the findall() function returned two matches, “in” and “at”.

Consider the string “stp stop stoop stoooooop stooooooooooop”. Let’s find matches where the character “o” appears one or more times, and matches where the character “o” appears exactly two times.
x = re.findall("o+", "stp stop stoop stoooooop stooooooooooop")
print("One or more occurrences of 'o':")
for match in x:
    print(match)

x = re.findall("o{2}", "stp stop stoop stoooooop stooooooooooop")
print("\nTwo occurrences of 'o':")
for match in x:
    print(match)

The output will look like:

One or more occurrences of 'o':
o
oo
oooooo
ooooooooooo

Two occurrences of 'o':
oo
oo
oo
oo
oo
oo
oo
oo
oo

Note that when looking for two occurrences of “o”, the findall() function returned 9 matches that correspond to the 9 non-overlapping matches from the start (left) to the end (right) of the input text: “stp stop st(oo)p st(oo)(oo)(oo)p st(oo)(oo)(oo)(oo)(oo)op”

1.3.2 Matching to validate strings

Let’s use regular expressions to check the validity of various strings. Consider an identification number that should consist of exactly 10 digits (from 0 to 9), and the strings: “0123456789”, “12345”, “0000a00005”, “+000001111”, “00000011115”, “2030405060”. How can we check whether these strings are valid identification numbers?

text = list()
text.append("0123456789")
text.append("12345")
text.append("0000a00005")
text.append("+000001111")
text.append("00000011115")
text.append("2030405060")

regex = "[0-9]{10}"
result = list()
for t in text:
    x = re.match(regex, t)
    if(x != None):
        print(t, "->", x.group())
    else:
        print(t, "-> No match")
    result.append(x)

for i in range(len(text)):
    if(result[i] != None and result[i].group() == text[i]):
        print(text[i], "is a valid identification number")
    else:
        print(text[i], "is NOT a valid identification number")

The output will look like:

0123456789 -> 0123456789
12345 -> No match
0000a00005 -> No match
+000001111 -> No match
00000011115 -> 0000001111
2030405060 -> 2030405060
0123456789 is a valid identification number
12345 is NOT a valid identification number
0000a00005 is NOT a valid identification number
+000001111 is NOT a valid identification number
00000011115 is NOT a valid identification number
2030405060 is a valid identification number

Notice that the string “00000011115” consists of 11 numerical digits, so the regular expression matches the subset “0000001111”. However, this is not a valid identification number according to the specification above. When validating input, remember to check whether the matched string is equal to the query string.

1.3.3 Credit card number validation

Let’s try to validate whether the following strings are valid VISA or Mastercard credit card numbers: “1000000000000”, “4000000000000”, “5000000000000”, “5000000000000000”, “50000a0000000c000”, “40123456789”. VISA credit card numbers should start with a 4 and have 13 or 16 digits. Mastercard credit card numbers start with a 5 and have 16 digits.
text = list()
text.append("1000000000000")      # 13 digits – Not valid
text.append("4000000000000")      # 13 digits – Valid VISA
text.append("5000000000000")      # 13 digits – Not valid
text.append("5000000000000000")   # 16 digits – Valid Mastercard
text.append("50000a0000000c000")  # Not valid, contains letters
text.append("40123456789")        # 11 digits – Not valid

regex = "(5[0-9]{15})|(4([0-9]{12}|[0-9]{15}))"  # Number 5 followed by 15 digits OR number 4 followed by either 12 or 15 digits
result = list()
for t in text:
    x = re.match(regex, t)
    if(x != None):
        print(t, "->", x.group())
    else:
        print(t, "-> No match")
    result.append(x)
print("")

for i in range(len(text)):
    if(result[i] != None and result[i].group() == text[i]):
        print(text[i], "is a valid VISA or Mastercard number")
    else:
        print(text[i], "is NOT a valid VISA or Mastercard number")

The output will look like:

1000000000000 -> No match
4000000000000 -> 4000000000000
5000000000000 -> No match
5000000000000000 -> 5000000000000000
50000a0000000c000 -> No match
40123456789 -> No match

1000000000000 is NOT a valid VISA or Mastercard number
4000000000000 is a valid VISA or Mastercard number
5000000000000 is NOT a valid VISA or Mastercard number
5000000000000000 is a valid VISA or Mastercard number
50000a0000000c000 is NOT a valid VISA or Mastercard number
40123456789 is NOT a valid VISA or Mastercard number

Let’s analyse the regex that we used. Both VISA and Mastercard numbers start with a specific digit, but Mastercard has exactly 16 digits in total, while VISA can have either 13 or 16 digits. Let’s first create a regex for each case separately. For Mastercard, it should be “5[0-9]{15}”: the digit 5 followed by exactly 15 digits (0-9), for a total of 16 digits. For VISA, it should be “4([0-9]{12}|[0-9]{15})”: the digit 4 followed by either exactly 12 digits, for a total of 13 digits, or exactly 15 digits, for a total of 16 digits. Then, to include both the VISA and the Mastercard cases in our final regex, we can enclose each regex within parentheses and combine them with the OR (“|”) operator.

Note that the expression [0-9] could be switched to [\d]:

regex = "(5[\d]{15})|(4([\d]{12}|[\d]{15}))"  # Number 5 followed by 15 digits OR number 4 followed by either 12 or 15 digits
result = list()
for t in text:
    x = re.match(regex, t)
    if(x != None):
        print(t, "->", x.group())
    else:
        print(t, "-> No match")
    result.append(x)
print("")

for i in range(len(text)):
    if(result[i] != None and result[i].group() == text[i]):
        print(text[i], "is a valid VISA or Mastercard number")
    else:
        print(text[i], "is NOT a valid VISA or Mastercard number")
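As an aside (not part of the original workshop code), Python’s re module also provides re.fullmatch(), which succeeds only if the whole string matches the pattern. This removes the need for the manual comparison of group() against the query string used above; a minimal sketch with the same regex:

import re

regex = "(5[0-9]{15})|(4([0-9]{12}|[0-9]{15}))"
for t in ["4000000000000", "00000011115", "5000000000000000"]:
    # fullmatch() returns a Match object only if the entire string matches
    if re.fullmatch(regex, t):
        print(t, "is a valid VISA or Mastercard number")
    else:
        print(t, "is NOT a valid VISA or Mastercard number")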