Assignment Example - SYSC 5207

CARLETON UNIVERSITY
Department of Systems and Computer Engineering
SYSC 5207 Distributed Systems Engineering (Fall 2021)
Course Project

[Please indicate your name and student no. on the cover page and each page of the document. Start Part II on a new page.]

PART I

1. In this question you will do a literature survey based on the two papers (attached) that focus on processing of documents concerning large volumes of text data:

1. Chanda, B., and Majumdar, S., "Filtering and Storing User-preferred Data: an Apache Spark Based Approach," IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress, Online, August 2020, pp. 679-685.
2. Chanda, B., and Majumdar, S., "A Technique for Extracting User Specified Information from Streaming Data," Proc. International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS), Online, July 2021.

Prepare a report (at most 2 pages long) that outlines the important issues discussed in the papers. Your literature survey should include: (a) a critical appreciation of the research reported in the papers; (b) a comparison of the different important resource management techniques proposed by the authors that are described in these papers. Also, report any insights gained into system behaviour and performance from the papers.

NOTE: The report must be typewritten. The length of this report is strictly limited to 2 pages (double spacing, a font size of 12 and a 1-inch border on all sides). Material beyond this page limit will not be read.

PART II

2. Consider four identical parallel applications, each of which comprises four independent processes with execution times T, 2T, 3T and 4T. Consider two cases. In case 1, the four applications are run in sequence: the first application is run first; when it completes, it is followed by the second application, and so on. Preemption of processes is not allowed, and processor allocation is done in such a way that the minimum completion time is achieved for each application. In case 2, the four applications are run simultaneously and the Processor Sharing scheduling discipline (that was discussed in class) is used. Note that in this case the processors are shared equally among all the ready processes in all the applications.

Consider a system with two processors (CPUs):
(a) What is the mean job turnaround time for case 1?
(b) What is the mean job turnaround time for case 2?
Which of these two cases (case 1 or case 2) will give rise to the smaller mean job turnaround time? Note that the turnaround time of a job includes both its waiting and execution times. Justify your answer. Show all your work.

NOTE: Please start Part II on a separate page. Please submit a PDF file. Due: December 6 (6:00 PM).

Filtering and Storing User Preferred Data: an Apache Spark Based Approach

Bannya Chanda
Dept of Systems and Computer Engineering, Carleton University, Ottawa, Canada
e-mail: bannya.chanda@carleton.ca

Shikharesh Majumdar
Dept of Systems and Computer Engineering, Carleton University, Ottawa, Canada
e-mail: majumdar@sce.carleton.ca

Abstract— This work-in-progress paper focuses on a filtering technique based on user preferences. It uses parallel processing and machine learning to effectively filter out user preferred data from a large raw data set. Although large volumes of data are generated, a user is often interested in only a select type (classes) of such data.
The motivation behind this research is to devise an effective and efficient filtering technique for extracting user preferred data from large data sets. Storing only filtered data and discarding the remaining data can decrease the latency of searching for specific information within a data set. It can also decrease the size of the storage required for these data. Such a filtering method that uses data classification techniques can, however, give rise to high processing latencies. An algorithm and system that use both parallel processing and machine learning are presented. A proof-of-concept prototype is built on the Apache Spark parallel processing platform. Analysis of the results of preliminary experiments demonstrates the viability of the investigated technique.

Keywords— Apache Spark, parallel processing, text classification, preference-based filtering

I. INTRODUCTION

Text-based data are continuously increasing in importance. From education to business to personal use, text-based data have become an essential and integral part of modern society. Users rely on text data such as articles in newspapers or journals, meeting minutes, and tweets. Furthermore, actions in their daily lives often depend on the accuracy of extracting information that is useful to them from this text data set. With this comes the challenge of searching this data set for the required information. The challenge becomes even more difficult when the data set is large, because the required information becomes more time consuming to find.

This paper focuses on a technique for filtering raw data sets and storing only user "preferred" data, i.e., filtered data specified by a user. The user can then search these filtered data. This reduces both the search latency and the volume of data that needs to be stored for a given user. A user is often interested in specific topics and keywords. These may include specific names, such as names of persons, places, and medicines, or generic classes such as dates and sporting events. Consider an example use case in which the raw data comprise all meeting conversations (converted into text) or meeting minutes recorded directly as text. An employee (user) may be interested in specific names of colleagues, clients, or products, or in all keywords that fit a general "name" class, and in dates. If one can store only such keywords and discard the rest of the raw data, it will not only save space but also speed up searches of the data based on specific user queries.

When extracting preferred information from a raw data set, filtering operations can consume a significant amount of time and can benefit from the speedup provided by a parallel processing platform. Hadoop/MapReduce is a popular platform for processing large volumes of data [1]. One drawback of this system is that it stores intermediate results on disk at each processing iteration [2]. Apache Spark alleviates this problem by avoiding disks and performing in-memory computation [2]. Spark eliminates expensive intermediate disk writes via a data model called resilient distributed datasets (RDDs) that can be stored in memory and partitioned across nodes for parallel computation [3]. Thus, by avoiding disk operations, Spark enhances the performance of applications. This work-in-progress paper proposes a technique for filtering and storing user preferred information from raw data on the Apache Spark platform.
This method leverages the parallel processing framework and the machine learning library provided by Spark to filter the raw data set based on user preferences and stores the filtered data to make them available to the user. A user can query the filtered data, which contain only the user preferred information. This information may, for example, include data related to dates, names of persons, and products. If, for example, the user queries the filtered data set to retrieve the name of a person, the search method will return only the data related to that person's name. This short paper reports on an initial investigation of this filtering technique. To the best of our knowledge, little work exists on the use of parallel processing and machine learning for filtering out user preferred data from a large raw data set. The key contributions of this paper include the following:

• An effective filtering technique using Apache Spark.
• A preliminary proof-of-concept prototype for the technique.
• Demonstration of the viability of the technique and the speedup in computation through initial experiments performed on the prototype.

The rest of this paper is organized as follows: Section II presents a key set of related works. Section III briefly explains the methodology and preliminary implementation of the proposed solution. Preliminary experiments and results are described in Section IV. Section V concludes the paper and outlines the next steps for our future work.

II. RELATED WORK

A representative set of related works that concern the use of machine learning and parallel systems in text-based processing is presented. The use of machine learning in text classification has started receiving attention from researchers. In [6], the authors used a filtered classifier on text documents that filtered the text documents and processed the data using a decision tree classifier to classify the documents. The use of a single-layer multisize filters convolutional neural network for the classification of Urdu documents is discussed in [7]. In [8], a model is proposed to perform deep sentiment analysis on documents based on a convolutional neural network and a long short-term memory network.

Some research works are directed at using parallel algorithms for text classification and learning user preferences. Examples include [9], which used the MapReduce-based Rocchio relevance feedback algorithm for document classification. In [10], a MapReduce-based vector space model is used for user profiling based on the news items read by the user. Several studies are based on preference-based search. In [11], the authors used an efficient Bayesian filter based on a group of users' past searching behaviors to refine search results by filtering out items for the user. In [12], the authors presented an integrated technique of user-filtering and query-refinement based on user preferences to retrieve user-preferred music.

Research has also been conducted using the machine learning library of Apache Spark. In [19], the authors evaluated the performance of logistic regression, decision tree and Support Vector Machine (SVM) classifiers in the design of an Apache Spark-based Intrusion Detection System (IDS).
In [20], the authors developed five models using distributed machine learning based on Apache Spark for predicting diabetes; among those models, the Logistic Regression classifier achieved the highest accuracy. In [21], the authors evaluated four machine learning libraries of Apache Spark to provide an effective solution for the prediction of heart diseases. To the best of our knowledge, none of the existing works have used parallel processing and machine learning for filtering a raw data set based on user preferences.

III. PROPOSED TECHNIQUE

A. Methodology

The proposed technique is explained with the help of fig. 1. Each component of the system presented in the figure is described in the following:

• Raw Data: The raw data may be comprised of one or multiple text files. The proposed filtering technique filters the raw text data set based on the user preferences.
• User Preferences: The user provides preferences for the filtering technique. These preferences are comprised of a set of keywords that describe a topic the user is interested in (e.g. dates, names, and products). The filtering technique uses the user preferences to filter the raw data and stores only the output of this module for user queries.
• Auto Generation of User Preferences: This optional module uses machine learning techniques to generate user preferences using a history of previous queries sent by the user.
• Filtered Data: These are the data obtained after running the filtering algorithm on the raw data set. The data related to the user-specified keywords are retained in a file.
• User Queries: A user can query the filtered data module to get related data by providing keywords or a sentence.
• Application: A program that processes the filtered data for a specific purpose. For example, an application for meeting dates is being devised to generate a meeting schedule for the user based on the data related to dates recorded in the filtered data module for a period specified by the user.

Figure 1. Proposed technique

B. Apache Spark Architecture

Apache Spark uses a master/slave architecture containing one master node and multiple worker nodes [3] (see fig. 2). The master node is called the driver. The driver communicates with the worker nodes, which are referred to as executors. The user application code is written in the driver, which is responsible for creating a SparkContext and resilient distributed datasets (RDDs), and for executing all transformations and actions [3]. A Spark application is launched on a set of clusters with the help of a resource manager. The Standalone cluster manager is the default built-in resource manager of Spark [3]. With this cluster manager, an application can get all the cores available in the cluster by default [3]. In this work-in-progress paper, the standalone cluster manager has been used.

SparkContext is the main entry point for Apache Spark [3]. It allows the application to access the Spark cluster with the help of the resource manager and to invoke functions. An RDD is the main data structure of Apache Spark and consists of a distributed collection of objects [3]. An RDD derives new RDDs from existing RDDs when the application executes Spark transformation functions [3]. Two basic transformations are map() and filter(). The Spark map() transformation takes in any function and applies that function to every element of an RDD. The Spark filter() transformation returns a new RDD that contains only the data satisfying the filter conditions. Spark transformations are lazy.
Spark does not start executing the transformations immediately; transformations are executed upon an action. An action is one of the ways of sending data from the executors to the driver. Examples of Spark actions include collect(), reduce(), take(n) and foreach(). The action collect() is a common operation that returns the content of the new RDD to the driver. The action reduce() aggregates the RDD contents using a function. When any of the actions are triggered, the Spark driver creates an execution plan, or jobs, for the application [3]. A Spark job consists of multiple tasks that get initiated in response to Spark actions. The driver schedules these tasks to run on the executors. The task scheduler resides in the driver and distributes tasks across the cluster to be executed by executors on their partitions of the RDD. The results are then sent back to the driver program for compilation.

Figure 2. Apache Spark cluster

The other key subsidiary components included in the Spark core architecture are the Spark SQL, Spark Streaming, and Spark MLlib libraries [3]. Spark SQL provides a structured data-processing abstraction called DataFrames, which are essentially distributed collections of data organized into columns. Spark Streaming allows the processing of stream data in real time. Spark MLlib contains common machine learning algorithms implemented as Spark operations performed on RDDs.

Apache Spark's MLlib library contains different algorithms for classification. In this paper, three different types of classification algorithms from Spark MLlib (Multinomial Naïve Bayes, Multinomial Logistic Regression, and the Multilayer Perceptron Classifier) have been trained with the text classification dataset. Multinomial Naïve Bayes is a multiclass classification algorithm that assumes a multinomial distribution for each feature, computes the conditional probability distribution of the label for given features by applying Bayes' theorem, and uses it for prediction [22]. Multinomial Logistic Regression is a supervised classification algorithm that predicts a multiclass outcome by modelling the relationship between the class and the features extracted from the input [14]. The Multilayer Perceptron Classifier is a feedforward artificial neural network composed of an input layer for the input data, an output layer that provides the prediction about the input data, and an arbitrary number of hidden layers in between for computation [23]. Among them, Multinomial Logistic Regression demonstrates the highest accuracy (discussed further in section IV-C) and is used in this paper.

Apache Spark MLlib also offers a set of multi-language APIs for machine learning algorithms. Among them, Pipeline [3] and K-folds Cross-Validation [3] are used in this research. Pipeline is a high-level API that consists of a sequence of algorithms to be run in a specific order to process data. K-folds Cross-Validation is a re-sampling procedure used to assess and test the efficiency of a machine learning model when the input data size is small [16]. It splits the training dataset into K folds that are used as separate training and test datasets to fit and validate the machine learning model. It finds the best model by running the machine learning algorithms with different combinations of parameters and training and testing datasets.
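To make the Pipeline and K-folds Cross-Validation description above concrete, the following is a minimal PySpark sketch of an MLlib pipeline combined with 5-fold cross-validation for multinomial logistic regression. The column names, placeholder rows, feature size and parameter grid are illustrative assumptions and not the authors' actual configuration.

# A minimal, illustrative MLlib Pipeline with 5-fold cross-validation (PySpark).
# Column names, placeholder rows, feature size and parameter grid are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, StringIndexer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("TextClassifierSketch").getOrCreate()

# Placeholder rows standing in for the labeled text classification dataset [4][5];
# "text" is the input sentence and "category" is its class label.
train_df = spark.createDataFrame(
    [("don deis is representing the faculty senate", "Person"),
     ("the meeting is scheduled for the first monday of march", "Date")],
    ["text", "category"])

# Pipeline stages: tokenization, stop-word removal, feature vectors, label indexing
# and a multinomial logistic regression classifier.
lr = LogisticRegression(featuresCol="features", labelCol="label", family="multinomial")
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="tokens"),
    StopWordsRemover(inputCol="tokens", outputCol="clean_tokens"),
    HashingTF(inputCol="clean_tokens", outputCol="features", numFeatures=1 << 14),
    StringIndexer(inputCol="category", outputCol="label"),
    lr,
])

# 5-fold cross-validation over a small, illustrative parameter grid.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01, 0.1])
              .addGrid(lr.maxIter, [50, 100])
              .build())
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label",
                                                                metricName="accuracy"),
                    numFolds=5)

# On the full labeled training set (not the two placeholder rows above), the best
# fitted PipelineModel would be obtained with:
#     best_model = cv.fit(train_df).bestModel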
C. Preliminary Prototype Implementation

A preliminary proof-of-concept prototype is implemented on an Apache Spark cluster. The program is written in Python and the PySpark API [3] is used. The prototype incorporates two methods: 1) the filter method, which filters the raw data based on user preferences, and 2) the search method, which uses the keywords or sentences specified in a user query to retrieve searched data from the filtered data.

1) Filter method: The filter method categorizes the raw data using the machine learning model and named entity recognition. Named entity recognition is used to extract specific entities such as person names, locations, dates, numbers, and money. A machine learning model is used to classify the raw text data according to its content. The categorized data are then filtered based on user preferences. Consider, for example, raw data that contain the following text lines:

Line 1: "Don Deis and Mike Starek are representing the Faculty Senate."
Line 2: "Faculty Senate Speakers are invited to participate."

After the filter method is applied to the raw data, the named entity recognition function extracts the person names "Don Deis" and "Mike Starek" from line 1 and classifies line 1 as "Person" (class), and the machine learning model also classifies it as "Person". Line 2, however, contains no relevant entity (e.g. a person name, location, or date), so the named entity recognition function cannot extract any entities or classify this line. In the case of line 2, therefore, only the machine learning model classifies it as "Person". If the user preferences include "Person", the filter method stores these two lines.

Each line of filtered data is comma-separated and contains the filtered text, the entity class recognized by the named entity recognition function, the extracted named entities, and the class predicted by the machine learning model. Therefore, the filtered data (shown in fig. 1) will contain the following comma-separated lines:

Line 1: "Don Deis and Mike Starek are representing the Faculty Senate.", "Person", "Don Deis, Mike Starek", "Person"
Line 2: "Faculty Senate Speakers are invited to participate.", "Person"

Fig. 3 describes the filter method component of the prototype. First, the texts from the raw data set are preprocessed in the "Text preprocessing" function by eliminating punctuation marks and converting uppercase letters to lower case. The preprocessed data are passed as input to the named entity recognition function and the machine learning model. A pre-trained Python library, spaCy [13], is used to extract named entities. A Multinomial Logistic Regression classifier (discussed in section III-B) is used to classify the text according to its content. This classifier has been trained using the training dataset obtained from the Cognitive Computation Group of the University of Illinois [4][5] (discussed in section IV). The classifier runs a sequence of pipeline stages (see fig. 3) including tokenization and stop word removal, converting each word into feature vectors, encoding labels to label indices, and learning a logistic regression model from the feature vectors and labels. Using Spark's 5-folds Cross-Validation [3], these pipeline stages have been run with different combinations of logistic regression parameters to find the best model. The best logistic regression model is used in the filter method to classify the text data. After classifying the raw data, the filter method filters the data whose class matches the user preferences and stores the filtered data (see fig. 3). Pseudocode 1 provides the algorithm of the proposed filter method.
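Pseudocode 1 itself is not reproduced in this text. As a hedged illustration only, the following PySpark sketch shows how such a filter method could be structured, loosely following the line-by-line description in the next paragraph; the helper names, the spaCy model, the naive sentence splitting and the "label" column produced by the classifier are assumptions, not the authors' actual code.

# A hedged PySpark sketch of a filter method in the spirit of Pseudocode 1.
# Comments point to the pseudocode lines described in the following paragraph.
import re
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("PreferenceFilterSketch").getOrCreate()
sc = spark.sparkContext


def preprocess(sentence):
    # Lower-case the sentence and strip punctuation (pseudocode line 4).
    return re.sub(r"[^\w\s]", "", sentence).lower().strip()


def filter_method(raw_data_dir, user_preferences, classifier_model):
    """user_preferences: set of class names, e.g. {"Person", "Date"};
    classifier_model: a fitted MLlib PipelineModel (see the training sketch above)."""
    files = sc.wholeTextFiles(raw_data_dir)          # line 1: one RDD record per raw file

    def process(file_pair):                          # lines 2-7: process each file's text
        import spacy                                 # runs on the worker; assumes the
        nlp = spacy.load("en_core_web_sm")           # en_core_web_sm model is installed
        _, text = file_pair
        rows = []
        for sentence in text.split("."):             # line 3: naive sentence splitting
            clean = preprocess(sentence)             # line 4
            if not clean:
                continue
            ents = nlp(clean).ents                   # line 5: named entity recognition
            rows.append(Row(Text=clean,              # line 6: three columns per row
                            NERTokens=[e.label_.title() for e in ents],  # e.g. PERSON -> Person
                            NERData=[e.text for e in ents]))
        return rows

    collected = files.flatMap(process).collect()     # line 7: gather rows on the driver
    df = spark.createDataFrame(collected)            # line 8: DataFrame with three columns
    predicted = classifier_model.transform(df)       # lines 9-10: adds the predicted class,
                                                     # referred to as "label" in the paper
    filtered = []                                    # lines 11-14: keep user-preferred rows
    for row in predicted.collect():
        if row["label"] in user_preferences or \
           any(cls in user_preferences for cls in row["NERTokens"]):
            filtered.append(row)
    return filtered                                  # these rows would be stored as filtered data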
The input of the filter method is the user preferences and the raw data directory where all the raw data files are located. For example, the raw data files can be composed of multiple meeting-minutes files. The filter method reads the files from the local file system directory using a Spark function that creates RDDs in line 1 of the pseudocode. In lines 2–7, the "map" function processes the text data from each raw data file. Line 3 divides the text data into a list of sentences. Line 4 preprocesses each sentence by converting uppercase letters to lower case and eliminating punctuation marks. In line 5, the "NER" function extracts named entities such as person names, locations, and dates based on the user preferences from each sentence, while line 6 creates row instances in which each row has three columns, "Text", "NERTokens", and "NERData", for the sentence, the named entity classes, and the extracted named entities, respectively. Line 7 performs the "collect" action to gather all the data after the map function has been applied. Line 8 creates a DataFrame with the three columns. To classify the sentences according to their content, this DataFrame is passed through the stages of the machine learning model, and the model returns each row with an additional column, "label", for the predicted class (see lines 9–10). Lines 11–14 store sentences based on the user preference input by iterating through each row; the predicted class from the machine learning model and the named entity classes are compared with the user preferences.

Figure 3. Prototype of the filter method

2) Search Method: The search method looks within the filtered data module for the data that match the user-provided keywords or sentence. Consider the example filtered data presented earlier. Recall the contents of the two lines:

Line 1: "Don Deis and Mike Starek are representing the Faculty Senate.", "Person", "Don Deis, Mike Starek", "Person"
Line 2: "Faculty Senate Speakers are invited to participate.", "Person"

If the user queries "Is the name Don Deis mentioned in the data files?", the search method preprocesses the query, classifies it as "Person" and extracts the named entity "Don Deis". The search method then looks for that class and named entity in the filtered data and displays line 1, "Don Deis and Mike Starek are representing the Faculty Senate.", as the result to the user. If the user searches for the keyword "Senate", the text contained in the sentences from both lines 1 and 2 will be displayed.

Fig. 4 shows the components of the proof-of-concept prototype that implement the search method. There are two ways to retrieve data from the filtered data. In the approach shown in fig. 4a, the user queries the filtered data by specifying a keyword. The "Text processing" block in the search method processes the comma-separated lines of the filtered data and returns lists of sentences from the filtered data. The search method then searches for the keyword in those sentences and displays the relevant search results to the user. In the second approach (see fig. 4b), the user queries the filtered data by specifying a sentence. The search method preprocesses the sentence by eliminating punctuation marks and converting uppercase letters into lower case. The processed query is passed as input to the machine learning model and the named entity recognition function.
The machine learning model discussed in the filter method (see section III-C) is used to classify the query, and the named entity recognition function extracts the named entities. The "Text processing" block in the search method processes the comma-separated lines of the filtered data and returns lists of sentences and their corresponding classes and named entities from the filtered data. Then, the class of the user query, the extracted named entities of the user query, and the processed filtered data are passed to a function. This function of the search method returns the relevant sentences from the filtered data whose classes and named entities match the class and named entities of the user query. Fig. 4 presents a high-level description of the two search methods. Further details in terms of pseudocode could not be provided due to space limitations.

Figure 4. Prototype of user query to filtered data: a) search by a keyword, b) search by a sentence

IV. PRELIMINARY EXPERIMENTS AND PERFORMANCE ANALYSIS

A set of preliminary experiments are performed on two Apache Spark clusters. The first cluster is set up on a local computer and is representative of a use case where the user uses her/his own resources to filter meeting files (further details are provided in section IV-C). This cluster has been created using a 4-core Intel i5 processor, 4 GB of RAM and the Ubuntu 18.04 operating system. This Spark cluster consists of one master node and three worker nodes. Some modifications are applied to the default Apache Spark configuration [3]: one core for each worker node, 1 GB of executor memory, and 1 GB of driver memory. The second cluster is comprised of six cores and 32 GB of RAM running the Ubuntu 20.04 LTS operating system. This Spark cluster consists of one master node and six worker nodes. Each worker node has one core and 4 GB of RAM, and the driver memory is set to 30 GB. Such a bigger system is used when a large number (hundreds or thousands) of text documents are to be filtered and a local computer does not have the capability of performing the filtering operations in a reasonable period. The performance of the proposed filtering method is evaluated, and sample performance results are presented for both clusters. Before these results are discussed, the datasets used in the experiments are presented.

A. Training Dataset

The training and testing dataset used for the machine learning algorithm is a labeled text classification dataset from the Cognitive Computation Group of the University of Illinois [4][5]. This labeled dataset has two attributes, text and label, where the label attribute specifies the class of the text attribute. In this dataset, there are 50 hierarchical classes that identify the text semantically. In this paper, this dataset is utilized with the class label attribute values modified to just 10 labels. Since the primary focus of this research is on filtering and storing user-preferred data from a raw data set using Apache Spark, the classification dataset used in this paper can easily be replaced by any other relevant text classification dataset.

B. Raw Data Set

Preliminary experiments were performed on two types of synthetic text datasets. They are generated from a few council meeting minutes that have been collected from a city council website [15]. Each set of meeting minutes is comprised of over 5000 words. The smaller raw data set is obtained by replicating the meeting minutes, resulting in 50 raw data files that are stored in a local file system. Each file contains the minutes of a meeting, and the file size is approximately 125 KB.
The raw data set is thus comprised of 50 files, and the total raw data size is 6.2 MB. The second type of data set concerns larger data obtained by copying the meeting minutes data several times into a single file. This single file is then replicated from 6 to 57 times, and the resulting total size of the raw data set ranges from 117 MB to 1.1 GB. The first type of data set is used for the use case in which the user uses a local computer, while the second type of data set is processed on the cluster in the cloud. The system performance achieved with each type of data set and a variable number of worker nodes is presented next.

C. Preliminary Results

1) Performance: The number of resources in a Spark cluster and the size of the data being processed are observed to have a significant impact on performance [18]. This section focuses on analyzing the impact of the number of Spark worker nodes and the data size on the performance of the proposed filter method. The prototype for the proposed filtering technique has been run on the Spark cluster while changing the number of worker nodes, with each worker node running on an independent core. The computation time of the filtering technique has been compared for different numbers of worker nodes and data sizes. The computation time is the total time required by the filtering method (application) to complete its processing of the raw data, i.e., the difference between the time at which the application completes and the time at which it starts execution. The filtering application has been run in a Spark cluster. As discussed in section IV, two types of systems are experimented with, using different numbers of Spark worker nodes and raw data sets.

Fig. 5 shows the computation time required by the proposed filter method on the first Spark cluster, residing on a local computer. In this experiment, the filter method is run with different numbers of worker nodes (varying from 1 to 3) on the small raw data set. As expected, the computation time of the filter method is observed to decrease as more worker nodes are added to the cluster. For example, the computation time is 210 seconds for one worker node, whereas it decreases to 84 seconds after two more worker nodes are added to the cluster.

Fig. 6 displays the computation time achieved with different numbers of worker nodes and different sizes of the raw data set on the second Spark cluster, residing in the cloud. In this experiment, the filter method is run with different numbers of worker nodes, using one of the large raw data sets at a time. The number of worker nodes is varied from 1 to 6, and the raw data size is varied from 117 MB to 1.1 GB at each run. For a given data size, the computation time is observed to decrease with an increase in the number of worker nodes. This decrease is more pronounced for larger raw data sets. For example, for a raw data size of 1.1 GB, the completion time decreases from 10 min (600 seconds) to 4.1 min (246 seconds), i.e. by 59%, as the number of worker nodes changes from 1 to 6. In both experiments, adding more worker nodes to the cluster increases the parallelism, so more tasks of the Spark jobs are processed in parallel, thus decreasing the computation time of the filter method.
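As a point of reference, the speedup and efficiency implied by the reported computation times can be computed with the standard textbook definitions; the short Python snippet below applies those formulas to the numbers quoted above and is not part of the authors' reported evaluation.

# Standard definitions: speedup S(n) = T(1) / T(n); efficiency E(n) = S(n) / n,
# applied to the computation times reported above.
def speedup(t1, tn):
    return t1 / tn

def efficiency(t1, tn, n):
    return speedup(t1, tn) / n

# Local cluster, small data set: 210 s on 1 worker vs 84 s on 3 workers.
print(speedup(210, 84), efficiency(210, 84, 3))    # 2.5, ~0.83

# Cloud cluster, 1.1 GB data set: 600 s on 1 worker vs 246 s on 6 workers.
print(speedup(600, 246), efficiency(600, 246, 6))  # ~2.44, ~0.41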
This paper has included preliminary experimental results that establish the viability of the proposed technique. Significant work that includes using a higher number of processing elements and multiple larger data sets is being planned. The investigation will include the impact of system and workload parameters on various additional performance metrics, including speedup and efficiency. The impact of different machine learning algorithms on performance and accuracy will also be investigated. This paper has focused on the filtering method; a detailed performance analysis of the searching method is also part of the plans for future research. The results of this research will be presented in our subsequent full research papers.

2) Accuracy: Accuracy is used to measure the performance of the machine learning model. In this paper, three different class