CS44: Database Management Systems

Homework 2

Due by 11:59 p.m., Tuesday, November 27, 2018.

This is a partnered lab. You are to complete this lab with one other person, who must attend the same lab as you. You may discuss the lab concepts with other classmates. Please use Teammaker to set your lab partner preferences. You can choose “Random” if you do not have a partner. Remember the Academic Integrity Policy: do not show your code/solution to anyone outside of the course staff and your lab partner. Do not look at any other group’s code/solution for this lab (current or past). If you need help, please post on Piazza.

Overview

This assignment is a problem set that will engage the modules since the midterm exam. The goals of his lab are to:

While you will work with a partner, you are not allowed to divide and conquer this assignment with your lab partner. Each individual must work on each of the problems below - this is in your best interest as these problem sets closely mimic questions you will see on exams.

Getting Started

Both you and your partner will share a repo named Homework2-userID1-userID2. Note that the lab will not release until you have both marked your partner preferences on Teammaker. You should find your repo on the GitHub server for our class: CPSC44-F18

Clone your Homework 1 git repo containing starting point files into your labs directory that you created last week:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Homework2-userID1-userID2

If you need help, see the instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).

If all was successful, you should see the following files (highlighted files require modification):

In addition, you should add the following file:

You should submit your solution as a PDF and by using LaTeX. If you are not familiar with LaTeX, here is a primer. You may ask your classmates any questions about LaTeX without worrying about violating the academic integrity policy. To create a pdf, you can use pdflatex and open the pdf with your preferred viewer (e.g., acroread):

$$ pdflatex hw2.tex
$$ acroread hw2.pdf

Hash Indexing

  1. Exercise 11.1, subproblems 4-7 only, Ramakrishnan and Gehrke.

  2. After an insert that causes the directory to double, how many buckets have exactly one directory entry pointing to them (i.e., are not shared)? Briefly explain.

  3. Exercise 11.7, Ramakrishnan and Gehrke.

Sorting

  1. Assume you have a file with 2,000,000 pages and 17 available buffer pages. Answer the following questions. Your answers should be exact, not rough estimates. Read Section 13.3 for the exact formulas for the number of passes and the number of sets in a run.

    A) How many runs does Pass 0 produce?

    B) How many passes will it take to sort the entire file completely?

    C) What is the total I/O cost of sorting the entire file?

    D) How many buffer pages are needed to sort the entire file with only one merge pass (i.e., two passes in total)?

  2. In class, we discussed an extension to external merge sort called double buffering. Consider the possibility of extending this idea to triple buffering (i.e., 3 buffer pages per input as opposed to 2). Describe one potential benefit to this approach as well as one key disadvantage.

Relational Algebra

  1. Assume you have two relations and , where contains tuples and contains tuples, and (i.e., has more rows). For each expression below, give:

    i) the minimum number of tuples possible in the resulting relation

    ii) the maximum number of tuples possible in the resulting relation

    iii) the restrictions on the schemas of and , if any. That is, will the expression only work on certain schemas for and ?

    A.

    B.

    C.

    D.

    E.

    F.

  2. You are given the following schema:

    Underlined fields form the primary key for the relation. Write each of the following queries as a relational algebra expression:

    A. Find the names of suppliers who supply some red part.

    B. Find the sids of suppliers who supply some red part or are at 500 College Avenue.

    C. Find the sids of suppliers who supply some red part and some green part.

    D. Find the sids of suppliers who supply every part.

    E. Find the sids of suppliers who supply every red part.

    F. Find the pids of parts supplied by at least two different suppliers.

    Note: one benefit to the renaming operator is that it makes a copy of a relation. So, you could use it to store an intermediate result if you want to break up an expression into pieces. For example, to simplify we could produce:

  3. Using the same schema as above, state the query that the following expressions compute. If the query is illegal, please state why:

    A.

    B.

    C.

SQL

  1. What is the purpose of the CHECK command?

  2. For this question, you will define a SQL query to translate the relational operator of division using the EXISTS/NOT EXISTS commands in lieu of an SQL division operator. Define a query for the following example where you have you have two relations for enrollments:

    And you want to find the students enrolled in all courses; i.e.,

Submitting your lab

Before the due date, push your solution to github from one of your local repos to the GitHub remote repo. Only one of you or your partner needs to do this.

From your local repo (in your ~cs44/labs/Homework2-userID1-userID2 subdirectory)

git add hw2.tex hw2.pdf README.md
git commit -m "our solution"
git push

If that doesn’t work, take a look at the “Troubleshooting” section of the Using Git page. Also, be sure to complete the README.md file, add and push it.