
DATA3404 Data Science Platforms
Example Examination, Semester 1, 2019 (EXAMPLE SOLUTION)
Paper code: EXAMPLE. This examination paper consists of 8 pages.

INSTRUCTIONS TO CANDIDATES
1. Please ensure you first complete the unit's USS survey at http://sydney.edu.au/itl/surveys/complete/
2. The exam is restricted-open book: you are allowed to bring a single A4 sheet (double-sided) of your own notes to the exam, but no textbook.
3. A simple calculator (programmable versions, smartphones and PDAs not allowed) may be taken into the exam room.
4. Take care to write legibly. Use blue or black ink. Please do not use a pencil or red ink.
5. This examination paper consists of a total of 6 questions with multiple parts.
6. Answer all questions within the space provided on this question paper.
7. The points for each question are shown at the start of the question. The total for the exam paper is 80 points.

Office use only: Question 1 / 10, Question 2 / 5, Question 3 / 15, Question 4 / 15, Question 5 / 23, Question 6 / 12, Total / 80.

Question 1: Storage Cost Estimations [10 points]

This question has three (3) parts, (a)-(c). Given the following relational schema of a 'City' table:

CREATE TABLE city (
    id          INTEGER PRIMARY KEY,
    name        VARCHAR(80) NOT NULL,
    countrycode CHAR(3) NOT NULL,
    population  INTEGER NULL);

(a) [4 points] Assume a database iStore that uses the following internal storage structure: each record has a fixed-size header of 10 bytes, followed by an attribute directory with an offset pointer of 2 bytes per attribute; at the end, all non-null attributes are stored in sequence. An INTEGER attribute requires 4 bytes, a CHAR(n) attribute n × 1 bytes, and a VARCHAR attribute requires between 0 bytes (for an empty string) and its maximum length in bytes.
iStore represents null attributes with an offset of 0; the offset pointer is still needed, but no further storage is required for attributes with a NULL value. What is the minimum and the maximum size of a city record in iStore?

Solution: The table has four attributes, three of fixed size and one of variable size (the VARCHAR). The last attribute can be NULL, hence we do not always need to store it.

The minimum size occurs for an empty name string and a NULL population:
size_min = header + 4 × offset + id + name + countrycode = 10 + 4 × 2 + 4 + 0 + 3 = 25 bytes

The maximum size occurs for a full-length name string and no NULL attributes:
size_max = header + 4 × offset + id + name + countrycode + population = 10 + 4 × 2 + 4 + 80 + 3 + 4 = 109 bytes

(b) [3 points] iStore uses a slotted page architecture. Suppose that each page in iStore is 8192 bytes and has a page header of 20 bytes, and that each slot pointer in iStore needs 2 bytes. What are the minimum storage requirements to store 1000 City records in iStore?

Solution: The usable free space per page in iStore is 8192 - 20 = 8172 bytes. The smallest City record is 25 bytes long and each slot needs 2 bytes, i.e. ⌊8172 / (25 + 2)⌋ = 302 records fit on one page. Hence, for 1000 records we need at least ⌈1000 / 302⌉ = 4 pages.

(c) [3 points] Database access costs are traditionally measured in disk I/O operations. The underlying assumption is that databases are too large to fit into main memory and hence have to be stored on spinning hard disk drives (HDDs) as secondary storage, which are however much slower to access than main memory. Does this assumption still hold for solid state drives (SSDs)? Which access characteristics change when using an SSD as compared to a HDD?

Solution: Yes, because SSDs may be 1 to 2 orders of magnitude faster than HDDs, but there is still an access gap to main memory. Random access has the same cost characteristics as sequential access.
OR: Deletes are much more expensive than on HDDs; append-only works best, not update-in-place.

Question 2: External Sorting Costs [5 points]

This question has three (3) parts, (a)-(c). (See Week 5 tutorial solutions.) You have to sort a large table with 250,000 database pages and only 250 sort buffers available, using a k-way external sort algorithm.

(a) [1 point] How many initial runs will be generated in the initial sort phase?

Solution: ⌈N/B⌉ = 250,000 / 250 = 1000

(b) [2 points] How many merge passes will the external merge sort algorithm need?

Solution: Can be worked out iteratively or by using the equation: ⌈log_(B-1)(N/B)⌉ = ⌈log_249 1000⌉ = ⌈1.25⌉ = 2

(c) [2 points] How much total I/O will the sort perform, including the cost of writing the output?

Solution: 2N × (⌈log_(B-1)(N/B)⌉ + 1) = 2 × 250,000 × (2 + 1) = 1,500,000

End of Question 2

Question 3: Index Choices [15 points]

This question has four (4) parts, (a)-(d). (See Week 5 tutorial solutions, Homework Ex. 6.) Consider the following relations:

Emp(eid: integer, ename: varchar, sal: integer, age: integer, did: integer)
Dept(did: integer, budget: integer, floor: integer, mgr_eid: integer)

Salaries start from $50,100 and go up in $100 increments up to $150,000; ages vary from 20 to 80; each department has about five employees on average; there are 10 floors; and budgets vary from $10,000 to $1 million.

(a) [2 points] Assume the Emp relation is stored as an unordered heap file for which the only index is an unclustered B+-tree index on the field sal. If you want to retrieve all records with sal > 100,000, is using the index always the best alternative? Briefly explain.

Solution: No. As the index is unclustered, each qualifying data entry could point to a distinct data page, leading to as many data-page I/Os as there are entries matching the range query. In this situation, using the index is actually worse than a file scan.
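The trade-off in the previous answer can be made concrete with a short sketch. The figures below are illustrative assumptions in the spirit of this question: 100,000 tuples in 2,000 pages, 3 index-page reads to reach the first matching leaf, and 100 (key, pointer) pairs per leaf page.

```python
# Sketch: why an unclustered index can lose to a file scan on a range query.
# All constants are illustrative assumptions, not part of the exam text.

NUM_TUPLES = 100_000
NUM_PAGES = 2_000
TREE_IO = 3             # index-page reads from root down to the first leaf
ENTRIES_PER_LEAF = 100  # (key, pointer) pairs per leaf page

def file_scan_cost():
    # A full file scan always reads every data page exactly once.
    return NUM_PAGES

def unclustered_index_cost(selectivity):
    # Worst case for an unclustered index: every matching entry triggers
    # one data-page I/O, plus the leaf pages holding the matching entries.
    matching = int(selectivity * NUM_TUPLES)
    leaf_io = -(-matching // ENTRIES_PER_LEAF)  # ceiling division
    return TREE_IO + leaf_io + matching

print(file_scan_cost())              # 2000
print(unclustered_index_cost(0.10))  # 3 + 100 + 10000 = 10103
```

Even at only 10% selectivity the unclustered index costs roughly five times the file scan, which is exactly the effect the solution above describes.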
(b) [4 points] Assume the relation Emp has 100,000 tuples in 2,000 pages and that the values of sal are uniformly distributed. The B+-tree has height 2 (excluding the root), and each leaf-level index page holds an average of 100 (key, pointer) pairs. Compute the smallest value of v such that the most efficient way of executing the range query sal > v is to use the B+-tree index.

Solution: With 100 tuples per distinct salary value, retrieving x distinct values via the unclustered index costs about 3 + 100x page I/Os, which must beat the 2,000-page file scan: 3 + 100x < 2000; hence x = 19, which means v = 150,000 - 19 × 100 = 148,100.

(c) [5 points] Two common queries on the above relations are:

SELECT name, age, sal FROM Emp;
SELECT did FROM Dept WHERE floor = 10 AND budget < 15000;

Your DBMS supports hash and B+-tree secondary indexes. Indexes can be clustered, but index-only scans are not yet supported. What indexes would you recommend for the above queries? Justify your choices.

Solution: First query: no index, since all records are visited. Second (range) query: a B+-tree index on the (floor, budget) fields of Dept, clustered to co-locate the results.

(d) [4 points] Your DBMS is upgraded to a new version that supports index-only queries. How would your responses to the previous question change?

Solution: First query: could use a covering index on Emp(name, age, sal), but this does not save much over a file scan. Second query: would need to extend the index to Dept(floor, budget, did). One could also discuss further, e.g. whether a hash index would be more space-efficient than a B+-tree, or whether clustering is appropriate for the first query.

Question 4: Query Optimization [15 points]

This question has five (5) parts, (a)-(e). For the query:

SELECT e.name, b.building
FROM   Employee e, Dept d, Building b
WHERE  e.dept_id = d.id AND d.bldg_id = b.id AND e.sal > 10000

(a) [3 points] Write a corresponding relational algebra expression E1, made up of only cross-products (×), selections (σ) and projections (π), that contains the fewest operations.
Solution: We use the rename operator (ρ_{new name/old name}) to uniquely rename the different id attributes so that the join conditions become unambiguous:

E1 = π_{name,building}( σ_{dept_id=did ∧ bldg_id=bid ∧ sal>10K}( Employee × ρ_{did/id}(Dept) × ρ_{bid/id}(Building) ) )

(b) [3 points] Transform E1 into an equivalent relational expression E2 that uses joins instead of cross-products.

Solution: Equivalent to the previous expression:

E2 = π_{name,building}( σ_{sal>10K}( Employee ⋈_{dept_id=did} ( ρ_{did/id}(Dept) ⋈_{bldg_id=bid} ρ_{bid/id}(Building) ) ) )

(c) [2 points] Transform E2 into an equivalent expression E3 so that selections are done as early as possible.

Solution:

E3 = π_{name,building}( ( σ_{sal>10K}(Employee) ⋈_{dept_id=did} ρ_{did/id}(Dept) ) ⋈_{bldg_id=bid} ρ_{bid/id}(Building) )

(d) [4 points] Show one possible left-deep query execution tree for the query that is likely to be considered by a query optimizer.

Solution: A left-deep tree equivalent to E3, with Employee as the leftmost input:

π_{name,building}
 └─ ⋈_{bldg_id=bid}
     ├─ ⋈_{dept_id=did}
     │   ├─ σ_{sal>10K}(Employee)
     │   └─ Dept
     └─ Building

(e) [3 points] Consider an execution plan that uses your given execution tree. What indexes could be used in such a plan, if any? Which statistics would the optimizer need to estimate the costs of your execution plan?

Solution: An index on Employee(sal) for the selection, if used in an INDEX SCAN; indexes on Dept(did) and/or Building(bid) for the joins, if used in an INDEX NESTED LOOPS JOIN. The statistics needed are the selectivity of the salary predicate and the sizes of all three relations.

Question 5: Distributed Data Management and NoSQL [23 points]

This question has five (5) parts, (a)-(e).

(a) [10 points] Give examples of two different NoSQL systems, explaining how they differ from each other and describing a specific example of a system for which one is more suitable than the other.
Solution: Suggested marking scheme:
- Valid NoSQL systems given: 2 × 1 pt
- Description of the category of each system: 2 × 1 pt
- Explanation of key differences: 2 pt
- Use case given: 2 pt
- Reasons for the different suitability of each system: 2 × 1 pt

(b) [4 points] Explain how the CAP theorem applies to the two systems you described in Part (a).

Solution: The answer depends on the chosen systems and needs to identify how they balance consistency, availability and partition tolerance, typically in favour of two of these with some compromise on the third.

(c) [4 points] Describe two differences between a (distributed) database system and Hadoop MapReduce.

Solution: Three possible differences (giving two is enough):
a) Distributed databases push the execution of sub-queries to the remote databases as much as possible, and different nodes can execute different queries; with MapReduce, typically all nodes execute the same map or reduce function on a part of the data.
b) Distributed databases support local indexing per node; MapReduce always scans its input files.
c) Distributed databases can support pipelining between nodes; MapReduce always materializes intermediate results.

(d) [2 points] What is the relationship between Apache Flink data transformations and query execution operators in a DBMS?

Solution: A Flink data transformation typically corresponds to one logical query operator, without specifying which exact algorithm is used (e.g. join). A DBMS query execution operator typically already prescribes the algorithm (e.g. merge-join(X, Y)).

(e) [3 points] You are considering the possibility of horizontally partitioning the relation Employee(tfn, name, salary, title, location) by location and storing each partition in the database at that location, with the possibility of replicating some partitions at different sites.
Discuss the types of queries and updates (and their frequencies) that might influence your decision.

Solution: In the presence of range queries, e.g. on the salary, one should consider range partitioning on the attribute(s) used in the range queries. If the workload solely uses point queries, e.g. on tfn or location, either hash or range partitioning is possible. If there are also a lot of updates, the replication factor should be low. If those updates can be restricted to some partitions, we could consider range partitioning with selective replication of those partitions that are not updated very often. Example: mainly point queries, with updates occurring mainly in some locations. In this case, range partitioning by location with partial replication of the seldom-updated location partitions would be suitable.

Question 6: Review Questions [12 points]

This question has six (6) parts, (a)-(f).

(a) [2 points] What is cost-based query optimization?
[x] A query optimization approach that chooses between query execution alternatives based on the lowest estimated execution costs.
[ ] A query optimization approach that is based on an economical model for query execution.
[ ] An approach to choose the cheapest query optimizer for a database implementation.
[ ] A query optimizer that computes the most effective query plan by applying certain heuristics.

(b) [2 points] Query optimizers include some basic optimization heuristics to improve the runtime of a given query. Which of the following approaches is NOT a valid optimization heuristic?
[ ] Perform selections early (reduces the number of tuples).
[ ] Perform projections early (reduces the number of attributes).
[x] Always sort after table scans to produce sorted input for later query operators.
[ ] Perform the most restrictive selection and join operations before similar operations.

(c) [2 points] Which of the following statements describes a correct characteristic of "synchronous replication"?
[x] It is good for consistency.
[ ] It is good for performance.
[ ] It sends updates to replicas using separate transactions.
[ ] It allows a modifying transaction to commit before all copies have been changed.

(d) [2 points] What is the correct definition of distributed data independence?
[ ] Users should be able to know where data is stored.
[x] Users should not have to know where data is located.
[ ] Applications should be able to query multiple nodes as if they were local nodes.
[ ] Users should be able to write transactions accessing multiple sites just like local transactions.

(e) [2 points] Which approach to data partitioning and data replication does the Hadoop File System (HDFS) use? Select all options that apply.
[x] Eager, Primary-Copy Replication.
[ ] Lazy, Update-Anywhere Replication.
[ ] Hash Partitioning.
[x] Range Partitioning.

(f) [2 points] What are two differences between data stream processing systems (DSPS) and database management systems (DBMS)?
[ ] DBMS: window queries; DSPS: aggregation queries.
[x] DBMS: exact answers over bounded data; DSPS: approximate answers over unbounded data.
[x] DBMS: transient querying; DSPS: continuous querying.
[ ] DBMS: one-pass query evaluation; DSPS: arbitrary query evaluation.

End of Question 6
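As a study aid, the external merge-sort cost formulas used in Question 2 can be sanity-checked with a few lines of Python. This is a sketch only; the figures N = 250,000 pages and B = 250 buffers are taken from the question text.

```python
import math

# External merge-sort cost formulas from Question 2:
#   initial runs:  ceil(N / B)
#   merge passes:  ceil(log_{B-1}(N / B))
#   total I/O:     2N * (merge passes + 1), counting read + write per pass

def initial_runs(n_pages, n_buffers):
    # Pass 0 sorts B pages at a time, producing ceil(N/B) runs.
    return math.ceil(n_pages / n_buffers)

def merge_passes(n_pages, n_buffers):
    # Each merge pass combines up to B-1 runs (one buffer holds the output).
    return math.ceil(math.log(initial_runs(n_pages, n_buffers), n_buffers - 1))

def total_io(n_pages, n_buffers):
    # Every pass (initial sort pass plus each merge pass) reads and
    # writes every page once.
    return 2 * n_pages * (merge_passes(n_pages, n_buffers) + 1)

N, B = 250_000, 250
print(initial_runs(N, B))  # 1000
print(merge_passes(N, B))  # 2
print(total_io(N, B))      # 1500000
```

The outputs match the worked answers: 1000 initial runs, 2 merge passes, and 1,500,000 total I/Os.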