CS21 Lab 8: Searching through a Dataset!

Due Saturday, April 1 by 11.59PM


Please read through the entire lab before starting!

This is a one week lab.

  • For the previous lab, you used a full week to practice the TDD component and a second week to implement your solution. Since the design process should be similar to the previous lab, you will be doing the design and implementation for this lab in a single week.

  • If you found the open-ended nature of the last lab challenging, you will want to start this lab early.

  • Additionally, all the "standard" requirements from previous labs such as programming tips, function comments, etc. still apply; feel free to refer to previous lab pages as a reminder.

Goals

  • Write a program that uses linear search

  • Write a program that uses binary search

  • Continue practicing Top-Down Design (TDD)

  • Connect CS topics with real-world data

1. Searching a real-world data set

For this this assignment, we will be working with a real-world dataset from CORGIS (which is an acronym for: The Collection of Really Great, Interesting, Situated Datasets).

  • We will be using a slightly modified version of this county demographics dataset that contains information about race, education, housing and income broken down by county. This data can be used to identify communities with significant economic and education inequalities but it also contains additional relevant information such as population, housing and income levels.

  • In this lab we will focus on searching for records by county name, state, and population. A more interesting real-world application would likely allow the user to cross-reference the data and search for and/or display particular patterns or relationships in the data.

You will implement a program called search-demographics.py, which will allow the user to search through this dataset and find particular information.

1.1. Trigger Warning and notes about real-world data

While this data is provided as demographics (i.e. percentage counts) and does not provide information on specific individuals, it is actual data collected about the real-world relating to income inequality and access to education, housing, and other measures of economic success.

If you are not comfortable working with this data, please reach out to one of the instructors about the possibility of completing an alternate version of this lab.

Biases in real-world data

  • It is worth being aware of what this data represents, and the power that programs and coding can give you.

  • Technology is not value-neutral; even if the search algorithms and code you implement are mathematically indifferent to the meaning of the data, the data you are searching are produced by a world which is not. Thus, the results of the search will not be bias-free even if the search algorithm is "fair".

  • In addition to the unfairness explicitly being represented and highlighted in this data set (i.e. the unequal distribution of higher education, income, home ownership across race and geographic boundaries), keep in mind that data collected about the world is not a perfect representation of that world.

  • The process of collecting and curating data is done by humans (with limited resources, and limited time), and as a result the data we see typically have embedded biases that are difficult or impossible to identify simply from looking at the numbers.

In this lab, we’ll explore the use of algorithms that might help us use data to shed light on areas of inequality and opportunities that the power of data analysis gives us to summarize the data, and present it to improve living conditions. It is important however to keep in mind not only the strengths but also the limitations of your data and algorithms.

1.2. Program Requirements

  • Reads in stored data from a file, storing a list of records. The data is stored in the file:

    /usr/local/doc/county_data.csv

    DO NOT copy this file to your labs/08 directory, just use "/usr/local/doc/county_data.csv" as the file name in your python program.

    As you read in the data, each line of the file corresponds to one record, which your program should represent using a CountyRecord object and store in a list.

We’ve also created a test data-set to help in the early-stages of writing your code which just has the first 20 lines of the file: /usr/local/doc/county_data_small.csv
  • Repeatedly presents a menu to the user with three choices: to search by county name, to search by state, or to quit. For example:

    Please select one of the following choices:
    (1) Find by County
    (2) Find by State
    (3) Find by Population
    (0) Quit
    
    Choice?

    Depending on the user’s selection, prompt them for additional information and then do one of:

    1. Search the dataset by county name, using binary search.

    2. Search the dataset by state, using linear search.

    3. Search the dataset by population threshold, using linear search.

    4. Exit the program.

  • When searching for name or state, compares data in lowercase format (so that user input is case insensitive) but displays results with their original case.

  • Searches by prefix, meaning that if the name in the database is longer than the string typed, it counts as a match as long as they’re the same up to the point the typed string ends.

    • For example, if the user enters Ada, it should be a match for the database entry Ada County.

    • If the user just types A, it should match all the county names that start with the letter A.

      This means if you search for an empty string, it should match the entire database)_ TODO: add examples and links to examples
  • When searching for population, compares data by numeric threshold, such that any records with a population larger than the specified value are returned.

  • Ensures that the user provides valid input (i.e. the program should not crash if the user enters something unexpected).

  • Uses formatted printing to show search results in a table format that fits in an 80-character wide terminal window.

    • Note that this will require using methods to retrieve different pieces of information about each record; see the CountyRecord object for a list of methods you can use.

  • Uses linear search to find matches by state and by population.

  • Uses binary search to find matches by county name.

1.4. Tips

1.4.1. Continue to use TDD

We strongly recommend that you create a top-down design with a fully fleshed out main and stubs for all other functions. However, you may not work with a partner on this lab. Also, you don’t need to get approval from an instructor for your design, though we highly recommend asking for our review via a private message on EdStem or in lab, office hours, or ninja sessions.

You should only begin implementing the other functions after you have a clear plan in place for the structure of main.

1.4.2. Create modular, reusable functions

Avoid duplicating similar code in multiple places throughout the program. Think about how you could re-use the same function.

  • For example, we are doing several types of searches (by county name, by state, and by population), but the results of those searches are displayed in exactly the same way.

  • Consider having a function that takes in a list of search results and displays them. Then you could call this function for either type of search.

1.4.3. Displaying matches

We want the results of our searches to be nicely displayed in the terminal window. However, some fields in the data set are quite long.

  • We can use slicing to truncate these strings to a fixed maximum length, allowing us to create an easy to read table format. This section of the online textbook explains how to slice a string. For instance, the following example shows how to slice out the first 10 characters from a string:

    >>> alphaString = "abcdefghijklmnopqrstuvwxyz"
    >>> alphaString[0:10]
    'abcdefghij'

    Notice how this nicely works, even if we provide an upper bound that is longer than the string we’re trying to slice (unlike when dealing with lists):

    >>> shortString = "xyz"
    >>> shortString[0:10]
    'xyz'
  • Recall also that you can specify the width of a field when printing, e.g. print("%10s" % ("hi")) will print "hi" with 8 spaces in front of it.

  • You can also use the - flag to get left-justified printing (the default is right-justified), e.g. print("%-10s %10s" % ("hi", "there")) will print "hi" followed by 8 spaces, and then "there" with 5 spaces in front of it:

    >>> print("|%-10s|=|%10s|" % ("hi", "there"))
    |hi        |=|     there|

    This type of formatting works with numbers the same way it does with strings, but remember that %s → strings, %d → integers, and %f → floats. Here’s an example using numeric types:

    >>> print("|%-10d|=|%10.3f|" % (837, 45.345986))
    |837        |=|     45.346|
  • By combining these methods, you can make your output look like a nice table where everything is aligned (as demonstrated in the example output). For instance, to get the header shown in the example output, try:

    print("================================================================================")
    print("| %15s | %6s |%8s| %8s| %8s | %8s | %8s|" % ( "County", "State", \
    "Ed-HS", "H-Own%", "# House", "Pop-2020", "Pop. PSQM" ) )
    print("--------------------------------------------------------------------------------")

    Note that we haven’t shown all the columns that exist in the table. Several of the column labels shown here are abbreviations so it fits in 80 characters:

  • Ed-HS represents percentage of people with a high-school education,

  • H-Own%` represents percentage of home ownership,

  • # House represents the number of households,

  • Pop-2020 is the total population of the county in 2020 and,

  • Pop. PSQM is the population per square mile

See here for further details on how to get these numbers and what they represent.

To print the values in each column as shown in the examples you will have to print using the correct format specifier. I.e., if you are printing the percentage of people who have a high school education (the third column shown above), you might want to use something like %6.1f to represent the percentages as floating point values. This number (%6.1f) is just an example). You will need to figure out what the appropriate spacing is to align these columns.

The version of linear search that we discussed in class stops as soon as it finds the item it is looking for, and returns the index of that item or -1 if not found.

  • For this problem, your linear search will need to accumulate a list of matching results and return that.

  • In addition, the basic version of linear search assumes that you are searching a flat list. We will be searching a list of objects, so will need to adapt linear search to use method syntax to get at the data within each object.

  • Finally, when searching for state names, make sure your search function uses the .startswith method (see here) rather than == so it doesn’t miss any partial matches.

    • E.g. using New should match both New York and New Mexico). You’ll also want to use the .lower method (see here) to convert both the target and the candidate match to lower case.

Because the data is sorted by county name, we can use binary search to very quickly locate a particular county.

  • However, names can be somewhat long, so we’ll want to be able to search by prefix rather than complete string. Additionally, the data may contain multiple county names with the same name prefix.

    For example, there are six county names that start with the prefix Allen:

    Allegheny County     :this has a different prefix
    Allen County
    Allen County
    Allen County
    Allen County         :suppose binary search finds this one
    Allen Parish
    Allendale County
    Alpena County        :this has a different prefix

1.6.1. Finding the first match

First lets write a binary search function that searches for a partial match to a county name and returns the index of the first match that it finds (or -1 otherwise).

  • Once again, make sure your search function uses the .startswith method (see here) rather than == so it doesn’t miss any partial matches (e.g. using "Ada" to match on "Ada County").

  • You’ll also want to use the .lower method (see here) to convert both the target and the candidate match to lower case.

1.6.2. Finding additional matches

There might be more matches that are just before or just after that initial index you locate. So next you’ll need to search in the vicinity of that location to find all of the other potential matches.

Fortunately, you can use the provided library function for this.

1.7. County Demographics Data Set

For this lab, we’ll be focusing on just a few of these features. Much of the work of parsing the file will be done for you by objects by the provided CountyRecord object type, but it’s worth understanding how the data file is set up anyway.

  • The data is stored in CSV (Comma Separated Value) format and is located on our systems at:

    /usr/local/doc/demographics/county_data.csv
  • The data file has one "record" per line (where each record contains the data for a particular county). Here are the first three lines from the file:

    County,State,Education.Bachelor-or-Higher,Education.HighSchool-or-Higher,Ethnicities.American Indian-and-Alaska-Native,Ethnicities.Asian,Ethnicities.Black,Ethnicities.Hispanic-or-Latino,Ethnicities.NativeHawaiian-and-Pacific-Islander,Ethnicities.White,Housing.Homeownership-Rate,Housing.Households,Income.Median-Houseold-Income,Population.2020-Population,Population.2010-Population,Population.Population-per-Square-Mile
    Abbeville County,SC,15.6,81.7,0.3,0.4,27.6,1.6,0,70.2,75,9660,38741,24295,25417,51.8
    Acadia Parish,LA,13.3,79,0.4,0.3,18,2.8,0,79.7,71.1,22236,43396,57576,61773,94.3

    And the last three lines of the file

    Zapata County,TX,11.6,61.9,0.5,0.2,0.6,94.7,0,98.4,74.8,4503,33952,13889,14018,14
    Zavala County,TX,10.9,66.9,1.1,0.3,1.3,94,0.2,96.5,72,3571,34459,9670,11677,9
    Ziebach County,SD,16.4,84.1,71.1,0.3,0.5,4,0,24.5,57.2,754,37400,2413,2801,1.4

Notice that the file is sorted alphabetically by county name.

Each line contains 16 fields separated by commas:

Table 1. Features of County Demographics Data Set

Feature index

Type

Description

0

string

County Name

1

string

State Name

3, 4

float

Highest Educational Qualification (% of Population)

5 - 10

float

Ethinicity (% of Population)

11

float

Home Ownership Rate (% of Population)

12

int

Number of Households

13

int

Median Household Income

14, 15

int

Total Population (2020 and 2010)

16

float

Population per square mile

A full description of the file format can be found here: County Demographics data et description link

2. Provided Library

The lab comes with a county_demographics library that gives you several helpful tools intended to make this lab easier:

2.1. CountyRecord Class

The library provides a CountyRecord class, which encapsulates all the features needed to store one record.

  • You can create an object of this class by calling the CountyRecord() constructor, which takes as its input a string containing a record (i.e. one line of the file).

  • In order to create the objects, you’ll need to read in the file, split it into lines, and then create an object from each line. For example, to make a record from a line of the file, you could do something like this:

    myFile = open("/usr/local/doc/county_data_small.csv", "r")
    header_line = myFile.readline() # skip the header line
    first_line = myFile.readline() # read the first line in the list
    record = CountyRecord(first_line)

    Note: You might want to use a loop after reading the header_line, to read the rest of the rows of the file.

  • After this, the variable record will refer to the newly created object. For your program, you’ll need to create an object for each line of the file, then assemble them into a list of records which you can pass around.

  • Once you have a CountyRecord object, you can use various getter methods to retrieve information about it:

Table 2. Features of County Demographics Data with Getter Methods

Getter Method

Parameters

Return Type

Description

get_county()

none

string

County name (i.e. the county the rest of the data pertains to)

get_state()

none

string

State the county is located in

get_population(year)

int: 2010, 2020

int

Total population (of the county)

get_population_square_mile()

none

int

Total population (of the county)

get_num_households()

none

int

Total housing units

get_home_ownership()

none

float

Percentage of home ownership

get_education(degree)

string: "Bachelors", "HighSchool"

float

Education population percentages

get_ethnicity(ethnicity)

string: values listed below

float

Ethnicity population percentages

List of ethnicity strings that can be passed as input to get_ethnicity: "Am-Indian", "Asian", "Black", "Hispanic", "Hawaiian", "White".

You should test your understanding of each of these getter methods on a single record. E.g., record.get_county(), record.get_state() and so on.

2.2. Finding nearby matches

We have conveniently provided you with a function to help with finding nearby matches based on search index (often returned by your binary search algorithm). This is called find_neighbor_matches() and you can import it from the county_demographics library.

This function takes four parameters and returns a list of matching items that it found looking either forward or backwards. Those parameters are:

  • A list of record objects to search over

  • An integer where the search should start from (this is the index that your "binary search" function will return)

  • The partial match (string) that the function should be searching for (e.g. "Ada" in the above example)

  • A search direction. This is an integer which should have the value 1 (to search forward) or -1 (to search backwards)

A call to the find_neighbor_matches() function might look something like this:

backwardMatches = find_neighbor_matches(currData, currIndex, name, -1)
  • You will need to import the function at the top of the program (from county_demographics import *) and call it twice for each binary search (once to get the forward matches and again to get the backward matches).

  • You’ll then need to return a single list containing the match found as well as the forward and backward matches in the correct order.

3. Extra Challenges

Here are some optional extensions! These are NOT required, but they might be fun and rewarding, and allow you to get closer to how we might like to use this data in the real world.

If you implement extensions, please do so in a copy of your program called search-demographics-ext.py

3.1. Optional: Find counties with …​.

Rather than importing the provided find_neighbor_matches function, try writing your own. This should work the same way as the provided version: it will look linearly forwards or linearly backwards based on an initial seed index that it takes as a parameter. The function will also need to take the search string as an input parameter (as well as potentially other information depending on your design). This function should return a list of all the partial matches based on that initial seed index.

The program can then add that list together with the entry at the initial seed to compute the total list of partial matches which are then displayed to the user.

Answer the Questionnaire

Each lab will have a short questionnaire at the end. Please edit the Questions-08.txt file in your cs21/labs/08 directory and answer the questions in that file.

Once you’re done with that, you should run handin21 again.

Submitting lab assignments

Remember to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.

Logging out

When you’re done working in the lab, you should log out of the computer you’re using.

First quit any applications you are running, like the browser and the terminal. Then click on the logout icon (logout icon or other logout icon) and choose "log out".

If you plan to leave the lab for just a few minutes, you do not need to log out. It is, however, a good idea to lock your machine while you are gone. You can lock your screen by clicking on the lock xlock icon. PLEASE do not leave a session locked for a long period of time. Power may go out, someone might reboot the machine, etc. You don’t want to lose any work!