CS44: Database Management Systems

Homework 3

Due by 11:59 p.m., Tuesday, December 11, 2018.

Addendum to the partnered policy below: while I would prefer you to work with a partner, you are permitted to work alone on this lab if it is easier for your schedule (or if you prefer to use late days that your partner does not have). You must select alone on Teammaker to create your repo.

This is a partnered lab. You are to complete this lab with one other person, who must attend the same lab as you. You may discuss the lab concepts with other classmates. Please use Teammaker to set your lab partner preferences. You can choose “Random” if you do not have a partner. Remember the Academic Integrity Policy: do not show your code/solution to anyone outside of the course staff and your lab partner. Do not look at any other group’s code/solution for this lab (current or past). If you need help, please post on Piazza.

Overview

This assignment is a problem set that will engage the last few modules of the semester. The goals of his lab are to:

While you will work with a partner, you are not allowed to divide and conquer this assignment with your lab partner. Each individual must work on each of the problems below - this is in your best interest as these problem sets closely mimic questions you will see on exams.

Getting Started

Both you and your partner will share a repo named Homework3-userID1-userID2. Note that the lab will not release until you have both marked your partner preferences on Teammaker. You should find your repo on the GitHub server for our class: CPSC44-F18

Clone your Homework 1 git repo containing starting point files into your labs directory that you created last week:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Homework3-userID1-userID2

If you need help, see the instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).

If all was successful, you should see the following files (highlighted files require modification):

In addition, you should add the following file:

You should submit your solution as a PDF and by using LaTeX. If you are not familiar with LaTeX, here is a primer. You may ask your classmates any questions about LaTeX without worrying about violating the academic integrity policy. To create a pdf, you can use pdflatex and open the pdf with your preferred viewer (e.g., acroread):

pdflatex hw3.tex
acroread hw3.pdf

Problem Set

  1. Duplicate elimination

    Consider the following query:

    SELECT DISTINCT E.title, E.ename FROM Executives E
    

    You have the following data stored in the database about the table :

    • The schema:

    • All fields are strings and use the exact same amount of space

    • The relation contains 10,000 pages

    • There are 10 buffer pages

    Using the optimized version of sort-based duplicate elimination (the initial pass creates sorted runs of tuples containing only and ; subsequent passes simultaneously merge runs while eliminating duplicates), answer the questions below:

    1. How many sorted runs are produced in the initial pass? How big is each run? What is the total I/O cost of this phase?

    2. How many merging passes are required to fully sort/eliminate duplicates? What is the I/O cost of each of these passes? You may assume the amount of duplicates is negligible.

    3. How many I/Os are needed if we instead used a clustered index on ? Assume that we are using a secondary B+-tree index, with 5000 leaf pages.

    4. Would an unclustered index on the composite key be superior to sorting for eliminating duplicates? Explain your answer.

  2. Join Algorithms

    Consider the join with the following information:

    • Relation contains 10,000 tuples with 10 tuples per page

    • Relation contains 2,000 tuples with 10 tuples per page

    • Attribute is the primary key of

    • Both tables are organized as heap files, with no indices available

    • Your buffer is 52 pages

    Answer the questions below, defining the precise page I/O cost of the algorithm and ignoring the size of the output in your analysis. Note: for each algorithms, you should choose your which is your inner and outer relation so that it minimizes costs:

    1. What is the cost of a block-nested loop join?

    2. What is the cost of a sort-merge join (use the optimized version from lecture)?

    3. What is the cost of a hash join?

    4. For each of the previous three algorithms (block-nested loop, sort-merge, hash), calculate the smallest value of buffer pages for which your cost does not change. That is, how small can we make the buffer and still have the same I/O cost? Be precise (i.e., don’t simply use the rule of thumb equations from lecture). Your answer may vary for each algorithm.

  3. Exercise 16.3, Ramakrishnan and Gehrke.

  4. Read the first few pages of the article BASE: An ACID Alternative (through the first four pages). And briefly answer the following prompts to demonstrate understanding (just a sentence or two):

    1. What is the main implication of the CAP theorem for distributed applications (e.g., web apps)?

    2. In comparing ACID to BASE, which properties of the CAP theorem are emphasized? That is, what trade-off are we making in choosing ACID vs BASE?

    3. How do we use the relative importance of different consistency requirements to create functional partitions?

Submitting your lab

Before the due date, push your solution to github from one of your local repos to the GitHub remote repo. Only one of you or your partner needs to do this.

From your local repo (in your ~cs44/labs/Homework3-userID1-userID2 subdirectory)

git add hw3.tex hw3.pdf README.md
git commit -m "our solution"
git push

If that doesn’t work, take a look at the “Troubleshooting” section of the Using Git page. Also, be sure to complete the README.md file, add and push it.