CS21 Lab 09: Sorting

Due WEDNESDAY, November 29, by 11:59pm


Please read through the entire lab before starting!

In this lab you will continue using the USDA data set from Lab 08. In Lab 08, you implemented three ways to search the data. In this lab, you will implement two ways to filter and sort the data instead of just searching.

All of the standard requirements from previous labs such as using good variable names, including function comments, etc. still apply.

Programming Tips

As you write programs, use good programming practices:

  • Use a comment at the top of the file to describe the purpose of the program (see example).

  • All programs should have a main() function (see example).

  • Use variable names that describe the contents of the variables.

  • Write your programs incrementally and test them as you go. This is really crucial to success: don’t write lots of code and then test it all at once! Write a little code, make sure it works, then add some more and test it again.

  • Don’t assume that if your program passes the sample tests we provide that it is completely correct. Come up with your own test cases and verify that the program is producing the right output on them.

  • Avoid writing any lines of code that exceed 80 columns.

    • Always work in a terminal window that is 80 characters wide (resize it to be this wide)

    • In vscode, at the bottom right in the window, there is an indication of both the line and the column of the cursor.

Function Comments

All functions should have a top-level comment! Please see our function example page if you are confused about writing function comments.

Goals

  • Write a program that filters data by different criteria

  • Write a program that sorts data by different fields

  • Continue practicing Top-Down Design (TDD)

  • Connect CS topics with real-world data

1. Filtering and sorting the USDA data

1.1. Re-using your Lab 08 code

In this lab, you will implement a program called sort_usda.py, which will allow the user to filter the USDA data set and sort by different fields. This will be an extension of work done for the previous lab, so you are encouraged to use portions of your previous program (you don’t need to don’t re-implement functions you’ve already written).

Two major pieces from Lab 08 that you’ll need to reuse are the function that reads the data from the USDA dataset into a list of Usda objects, and the function that displays the table of results. If your versions of these functions aren’t working, we’ve provided working implementations for you. There are other functions you wrote (such as the functions for displaying the menu or functions for getting valid input) that you can likely use in this lab with very little modification and you are encouraged to copy over the versions from your previous lab if they are working well.

1.2. Program Requirements

  • Reads in stored data from a file, storing a list of records. This is the same as your solution to Lab 08 and you may copy over the code to read this data over to your Lab 09 solution.

    As before, the data is stored in the file:

    /data/cs21/usda/usda.csv

    Please don’t copy this file to your labs/09 directory, just use "/data/cs21/usda/usda.csv" as the file name in your python program.

    As you read in the data, each line of the file corresponds to one record, which your program should represent using a Usda object and store in a list.

    Recall that there are two versions of the data file; the complete one is: /data/cs21/usda/usda.csv but there’s also a small copy that’s had most of the lines removed, which may be easier to use for testing your program, particularly in the early stages: /data/cs21/usda/usda_small.csv

    If your version of this function from Lab 08 is broken, you can use a version we have provided. At the top of your file, include the line from usda_lab9 import read_usda_file. The function definition and block comment for read_usda_file are shown below:

    def read_usda_file(filename):
        """
        Read the contents of a CSV file containing information from
        the USDA. Returns a list of Usda objects where each object
        represents one line of the file.
    
        Args:
           filename (str): the name of the file to read
    
        Returns:
           list: a list of Usda objects representing the data in the file
        """
  • Repeatedly presents a menu to the user with six choices shown below:

    Please select one of the following choices:
    1. Filter records by county name
    2. Filter records by median income
    3. Sort by population
    4. Sort by poverty rate
    5. Reset list to all records
    0. Quit
    
    Enter selection:

    Depending on the user’s selection, prompt them for additional information and then do one of:

    1. Filter the data set by county name. Use linear search, not binary search

    2. Filter the data set by median income by providing a minimum and maximum value (using linear search)

    3. Sort the current set of records by population

    4. Sort the current set of records by poverty rate

    5. Reset the current set of records to the original data set

    6. Exit the program

  • When sorting, your program should perform the sort in place, meaning it should mutate the list passed in to the sorting function. It should not create a new list.

  • You must use different sorting algorithms for the two sort operations (e.g. Selection Sort for one and Bubble Sort for the other)

  • Your filtering functions should return a new list of records that match only the filter criteria.

  • Ensure that the user provides valid input for the menu choices and the minimum and maximum income values.

  • After a filter or sort step, print all current records.

  • Use formatted printing to show search results in a table format that fits in an 80-character wide terminal window exactly like you did for lab 08. If your code from Lab 08 to display the records isn’t working, you can use a version we have provided. At the top of your file, include the line from usda_lab9 import display_usda_data. The function definition and block comment for display_usda_data are shown below:

    def display_usda_data(usda_list):
        """
        Given a list of Usda objects, display the data in a nice format
        as shown in the Lab 8 and Lab 9 writeups. If the usda_list is
        empty, prints "No records found".
    
        Args:
           usda_list (list): a list of Usda objects to print
    
        Returns:
           No return value.
        """

1.4. USDA Set

Refer to Lab 08 for an overview of the USDA data set.

Recall that each line of the file contains several features, which can be easily accessed using the provided library.

1.5. Tips

1.5.1. Continue to use TDD and incremental development

We strongly recommend that you create a top-down design with a fully fleshed out main and stubs for all other functions. Likewise, you should continue to implement and test one function at a time, making sure it works correctly before moving on to the next one.

1.5.2. Re-use code from Lab 08

Since this lab works with the same data as we used for Lab 08 and performs a number of similar operations, you should try to re-use functions you wrote for Lab 08 when completing this lab, rather than re-implementing them from scratch. Simply copy/paste these functions into your new python file, and then modify them if necessary.

It’s fine to copy your own code; however, be sure that you’re only re-using code that you personally wrote. Using code from other students, the internet, etc. is still the same violation of academic misconduct that it has been all semester.

1.5.3. Create modular, reusable functions

Avoid duplicating similar code in multiple places throughout the program. Think about how you could re-use the same function. For example, we are doing two types of filters (by state and by population), but the results of those searches are displayed in exactly the same way. Consider having a function that takes in a list of results and displays them. Then you could call this function for either type of filter or sort.

Note also that you may be able to re-use functions you wrote for the perevious lab; you should always re-use code that you wrote in the past to make your current life easier!

Likewise, sorting by state and sorting by population can re-use the same swap function. You may need to copy/modify the inclass examples to sort objects by different fields (population/state name). Make sure when doing a comparison to determine which item comes first, you only compare the appropriate field/subpart of each object.

1.5.4. Applying filters

The initial set of records should be the every record in the data file. However, anytime you apply a filter operation, the filter operation replaces the current set of records with the records matching the filter. Thus if you filter by the county prefix "Del" and then sort by population, you should only display the records from counties that start with "Del", and those records should be sorted by population. The reset option restores the current set of records to be all records. For this reason, you will therefore need separate variables to keep track of the "original" data set and the "filtered" data set.

Note that filtering is algorithmically the same as the linear searches you did in Lab 08. When you filter, you will use linear search to find all matching Usda records and you will return them in a list. Because the data might be sorted by fields other than the county, you can’t use binary search in this lab. That is, when searching for matching counties in order to apply a filter, you do not know how the records are sorted (they could change after a sort operation). For this reason, you should use linear search for all your filters.

A sample non-trivial workflow might be to filter by a county (e.g., "Del"), sort by population, then filter by median income (e.g. 50000-60000). The list of records displayed after this sequence would be counties that start "Del" with a median income between 50000-60000, all sorted by their population.

When getting minimum and maximum values to filter by median income, you just need to check if the value entered by the user is a non-negative integer. Median incomes of 5, or 20000000000 are fine inputs, even if there are no records in the data set that have values this extreme. You do not need to verify that the minimum value is less than the maximum value.

1.5.5. Copying a list

Normally, assigning two variables to refer to the same list is what you want to happen, since copying a list may be computationally expensive. However, in this program you’ll need to modify a list while still maintaining a copy of the original list (so you can implement the "reset" functionality).

In python, the easiest way to do this is to use the .copy() method of the list class. To see how it works, try running code like the following:

$ python3
>>> a = [3, 7, 2, 4]
>>> b = a
>>> c = a.copy()
>>> a[1] = 5
>>> a
[3, 5, 2, 4]
>>> b
[3, 5, 2, 4]
>>> c
[3, 7, 2, 4]

Notice that when we change list a, list b also changes. That’s because on the stack, both a and b are pointing to the same list. Changing a changes b. But c is an exact copy of a from before we modified list a. Therefore, changing a doesn’t impact the value of c. If we wanted to restore a back to its original value, we could say a = c.copy().

1.5.6. Adapting sort

You should use two of the quadratic sorts discussed in class. Just like you did in Lab 08, you can start by using the sorting algorithms we provided you. Then, you will need to modify them slightly so that they sort by population and by poverty rate.

2. Provided Library

Remember that this lab comes with a usda_lab library that gives you several helpful tools intended to make this lab easier. See the appropriate section of Lab 08 for full details.

3. Extra Challenges

Here are some optional extensions! These are not required, but they might be fun and rewarding.

If you implement extensions, please make sure the original menu operations remain the same. That is, you are welcome to add a menu item 6, but don’t change menu item 3 to be menu item 4.

3.1. Optional: Add to the working set

As the lab shows, you can filter on different criteria. Consider adding additional filters of your choosing, or extending your records by adding an e.g., 'extend by county' option that queries all records and expands the current set of records by the given county. For example, a filter by "Adams" followed by an extend by "Brown" would result in a list containing the records from both "Adams" and "Brown" counties.

3.2. Optional: Sort or filter on other fields

The lab asks you to sort and filter on only a few fields. However, you can apply the same process to sort any of the fields accessible through the object. Choose one (or more) additional fields to add to the set of things the program can deal with.

3.3. Optional: Add features based on your own interests!

Do whatever you’d like that we haven’t told you to do! Go crazy!

Answer the Questionnaire

Each lab will have a short questionnaire at the end. Please edit the Questions-09.txt file in your cs21/labs/09 directory and answer the questions in that file.

Once you’re done with that, you should run handin21 again.

Submitting lab assignments

Remember to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.

Logging out

When you’re done working in the lab, you should log out of the computer you’re using.

First quit any applications you are running, like the browser and the terminal. Then click on the logout icon (logout icon or other logout icon) and choose "log out".

If you plan to leave the lab for just a few minutes, you do not need to log out. It is, however, a good idea to lock your machine while you are gone. You can lock your screen by clicking on the lock xlock icon. PLEASE do not leave a session locked for a long period of time. Power may go out, someone might reboot the machine, etc. You don’t want to lose any work!