COMP9313 2022T2 Final Exam

The deadline for the final exam is: Tuesday 23rd August, 5:00 pm (AEDT). Please submit your answers through Moodle. Do not wait until the last minute to submit, and double-check that your submission is successful.

Question 1. Concepts (4 marks)

(a) (2 marks) Explain the data flow in MapReduce using the word count problem as an example.

(b) (2 marks) Explain the data flow in Spark using the word count problem as an example.

Question 2. MapReduce Programming (14 marks)

Requirement: You should explain how the input is mapped into (key, value) pairs by the map stage, i.e., specify what the key is and what the associated value is in each pair, and how the key(s) and value(s) are computed. Then you should explain how the (key, value) pairs produced by the map stage are processed by the reduce stage to obtain the final answer(s). You only need to provide the pseudo code for the classes, including the Mapper and Reducer (optionally a Combiner etc., and self-defined classes if necessary; the efficiency of your method will be considered).

(a) (7 marks) Problem: Given a large text dataset, find the top-k frequent co-occurring term pairs. The co-occurrence of (w, u) is defined as: u and w appear in the same sentence (i.e., (w, u) and (u, w) are treated equally). Sentences are separated by a full stop followed by a space character ". ", while terms are separated by a space character. Note that one sentence could be split across multiple lines. For the two terms in a pair, sort them in alphabetical order. For pairs having the same frequency, sort them in alphabetical order as well.

(b) (7 marks) Problem: Assume a dataset crawled from a location-based social network application, in which each record has the format (userID, locID, check_in_time), where check_in_time is the timestamp of the check-in. We denote the number of check-ins at location l by user u as c(u, l), and the number of check-ins from u as C(u). Thus, C(u) = Σ_{l ∈ L(u)} c(u, l), where L(u) is the set of locations visited by u.
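The counting definitions above can be sketched in plain Python as follows. This is only an illustration, not an answer to the question: the record values and the names `checkins`, `count_ul`, and `total_u` are hypothetical, not part of the exam.

```python
from collections import Counter

# Hypothetical records in (userID, locID, check_in_time) format
checkins = [(1, 1, 1), (1, 1, 2), (1, 2, 3)]

# c(u, l): number of check-ins at location l by user u
count_ul = Counter((u, l) for (u, l, _) in checkins)

# C(u): total number of check-ins from user u, summed over all locations
total_u = Counter(u for (u, _, _) in checkins)

# User 1 checked in twice at location 1, out of three check-ins in total
assert count_ul[(1, 1)] == 2 and total_u[1] == 3
```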
The probability that u checked in at l is computed as P(u, l) = c(u, l) / C(u). Your task is to compute P(u, l) for each user at each location, and then output the results in the format "locID: userID,probability". The results are first sorted by location IDs in ascending order, and then by the users' check-in probabilities in descending order. If two users have the same probability, sort them by user IDs in ascending order. Sample input and output are as below:

Input:
1: (1, 1, 1)
2: (1, 1, 2)
3: (1, 2, 3)
4: (2, 1, 4)
5: (2, 3, 5)
6: (3, 2, 6)
7: (3, 2, 7)
8: (3, 3, 8)

Output:
1: 1,0.6667
1: 2,0.3333
2: 3,0.6667
2: 1,0.3333
3: 2,0.5
3: 3,0.5

Question 3. Spark Programming (15 marks)

Provide the Spark code for the given problems (minor grammar errors are acceptable).

(a) (6 marks) RDD programming: Given a set of documents in the format of key-value pairs <docID, Text>, where docID is the ID of a document and Text is the content of a document, the following code will construct the Boolean inverted index for this document set. That is, the output is in the format <term, (docID1, docID2, ..., docIDn)> if there are n documents containing this term. In the pair RDD InvList, the keys (terms) are sorted in alphabetical order, and the docIDs in each list are sorted in ascending order according to their IDs (numerical values).

val docs = sc.parallelize(List((1, "hello world"), (2, "this scala program"), (3, "creates a pair RDD"), (4, "in spark") … …))
// fill your code here, and store the result in a pair RDD InvList
InvList.saveAsTextFile("hdfs://…")

(b) (3 marks) DataFrame programming: Given data in the format of key-value pairs <key, value>, the following code will find all keys such that the gap between the minimum and maximum values for each key is larger than 100. Finally, sort the keys in descending order.
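As a plain-Python illustration of the min-max gap condition (not Spark, and with hypothetical data; the names `groups` and `res` are illustrative):

```python
# Hypothetical (key, value) pairs; only key 3 has a gap larger than 100
pairs = [(1, 2), (3, 4), (3, 150), (5, 0), (5, 50)]

# Group values by key
groups = {}
for k, v in pairs:
    groups.setdefault(k, []).append(v)

# Keep keys whose max-min gap exceeds 100, sorted in descending order
res = sorted((k for k, vs in groups.items() if max(vs) - min(vs) > 100),
             reverse=True)
# here res == [3], since 150 - 4 = 146 > 100
```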
val pairs = List((1, 2), (3, 4), (3, 5), … …)
// fill your code here, and store the result in a DataFrame resMinMax
resMinMax.write.format("csv").save("hdfs://…")

(c) (6 marks) Given a directed graph in the format "sourceNodeID, targetNodeID, Distance", a node "s", and a distance threshold D, find the nodes whose distance to s in the graph is smaller than D (i.e., there exists a path from the node to s and the length of this path is smaller than D). Use the GraphX pregel operator for this task. The efficiency of your method will be considered.

val pairs = sc.parallelize(List((1, 2, 5), (1, 4, 6), (3, 4, 2), (3, 6, 8) … …))
val sourceNode = s.toInt
val edgeList = pairs.map(x => Edge(x._1.toLong, x._2.toLong, x._3.toDouble))
val graph = Graph.fromEdges[Int, Double](edgeList, 0)
// fill your code here, and store the nodes in a variable NeighborNodes
println(NeighborNodes)

Question 4. Finding Similar Items (6 marks)

Given three documents D1: "abcbacb", D2: "cbaaabc", and D3: "acbaaac", using the characters as tokens to compute k-shingles, we can obtain a Boolean matrix for Min-Hashing.

(i) (2 marks) What is the smallest value of k such that the total number of distinct k-shingles is 9? List all the k-shingles generated from the three documents.

(ii) (1 mark) Compute the Jaccard similarity of the three document pairs based on the k-shingles in the above question (i.e., sim(D1, D2), sim(D1, D3), sim(D2, D3)).

(iii) (3 marks) Complete the steps of the Min-Hashing algorithm using the three permutations below. The original row index of the k-shingles is obtained according to the order of their occurrence in the three documents.

Permutation:  1  2  3
              4  6  8
              2  2  1
              0  7  2
              7  3  3
              5  8  4
              3  4  5
              1  0  6
              8  5  7
              6  1  0

Question 5.
Mining Data Streams (6 marks)

(a) (3 marks) Bloom Filter

Consider a Bloom filter of size m = 7 (i.e., 7 bits) and 2 hash functions that both take a lowercase string as input:

h1(str) = (Σ_{c ∈ str} (c − 'a')) mod 7
h2(str) = str.length mod 7

Here, c − 'a' is used to compute the position of the letter c in the 26 alphabetical letters, e.g., h1("bd") = (1 + 3) mod 7 = 4.

(i) (1.5 marks) Given a set of strings S = {"hi", "big", "data"}, show the update of the Bloom filter.

(ii) (0.5 marks) Given the string "spark", use the Bloom filter to check whether it is contained in S.

(iii) (1 mark) Given S in (i) and the Bloom filter with 7 bits, what is the percentage of the false positive probability? (A correct expression is sufficient: you need not give the actual number.)

(b) (3 marks) CM-Sketch

Assume that we have 5 buckets and three hash functions:

h0(str) = str.length * 2 mod 5
h1(str) = str.length mod 5
h2(str) = (str[0] − 'a') mod 5

Given a stream of terms: "big", "data", "data", "set", "data", "analytics", show the steps of building the CM-Sketch. Then, use the built CM-Sketch to get the count for the word "data".

Question 6. Recommender Systems (5 marks)

Consider four users u1, u2, u3 and u4, and five movies m1, m2, m3, m4, and m5. The ratings of the movies from the users are as below:

    m1 m2 m3 m4 m5
u1  3  5  2  2
u2  2  4  4  1
u3  4  5  2
u4  3  2  5  4

(i) (2 marks) Estimate the missing ratings using the user-user collaborative filtering method (round to 2 decimal places).

(ii) (2 marks) Estimate the missing ratings using the item-item collaborative filtering method (round to 2 decimal places).

(iii) (1 mark) If the rating from u2 to m5 is 3, and the rating from u3 to m2 is 1, which method has better performance using the Root-mean-square error (RMSE) measure? Justify your answer by showing the calculation.
Note:
(a) Adopt the cosine similarity measure to compute the similarities.
(b) Use the weighted average to perform the rating prediction.
(c) When computing similarities, you need to ignore the ratings on the item/from the user to be predicted in the other columns/rows (for example, when predicting the rating on m5 from u2 using the user-user CF, you need to ignore the ratings on m5 from the other users when computing the user similarities).

End Of Paper