Scala | search for phrases in documents in parallel CS671

Programming Assignment #4 CS671
1 st submission due 7/6/2018
2 nd submission due 7/12/2018
The system in this assignment should search for phrases in documents in parallel. There are many ways
to implement this parallelism:
directly in terms of threads;
using higher-level abstractions from java.util.concurrent;
using higher-level abstractions from scala.util.concurrent;
In this assignment, the system is implemented using the first approach only.
Document Search
Given a target phrase, we’ll look in a document for as many consecutive words from the target as
possible. For a match to be valid, the words have to be consecutive in both the document and the
target. Non-letter characters will be ignored and treated as whitespace. For simplicity, both the source
and the target will be converted to lowercase. For example, given the following source document:
How I want a drink, alcoholic of course,
after the heavy lectures involving quantum mechanics!
and the target: “I will take the course after all”, the longest match that can be found is two words:
“course after”. (“I” and “the” are other matches, but they are shorter.)
Technically, this is an instance of the longest common substring problem, for which efficient (but non-
trivial) algorithms exist. Instead of solving it as a general instance of a longest common substring
problem, this assignment adopts a different approach. Since the search phrase is assumed to be much
smaller than the document, we can filter out all the document words that do not appear in the target
phrase and group the remaining words accordingly. For instance, given search above, the document can
be formatted as two groups: “I” and “course after the”. (Non-target words like “how”, “want”, “a”, etc.
are treated as whitespace and ignored.) We can then search each group for a longest common
substring. Since the groups are small, a naive algorithm will be good enough.
Specifically, the details of the implementation are as follows:
Finder.words: produces all the words from a source, as a stream. Non-letter characters are
treated like whitespace and other characters are converted to lowercase. Words are consecutive
sequences of letters separated by whitespace (or non-letters). On the example document above,
this produces the strings: ” how “, ” i “, ” want “, ” a “, ” drink “, ” alcoholic “, ” of “, ” course “,
” after “, ” the “, ” heavy “, ” lectures “, ” involving “, ” quantum “, and ” mechanics “.
Finder.groupWords: builds groups of words that belong to the target. All other words are
discarded. Given the target and document above, this produces the lists: List(“i”) and
List(“course”, “after”, “the”) .
Finder.longestCommonPrefix: finds the longest common prefix of two lists. Note that this is
much easier than to find the longest common substring as there is no search involved. The longest
common prefix of list a,b,c,d,e and list a,b,x,c,d,e is a,b ; the longest common prefix of list
a,b,c,d,e and list x,a,b,c,d,e is empty.
longestSublist: solves the longest common sublist problem in a naive way. The longest
common sublist of two lists L and L’ is calculated by enumerating all the suffixes of L and L’