Lab 8: Searching

Due Saturday, November 15, by 11:59pm


Please read through the entire lab before starting!

This is a one week lab. For the previous lab, you used a full week to practice the TDD component and a second week to implement your solution. Since the design process should be similar to the previous lab, you will be doing the design and implementation for this lab in a single week. If you found the open-ended nature of the last lab challenging, you will want to start this lab early.

Additionally, all the "standard" requirements from previous labs such as programming tips, function comments, etc. still apply; feel free to refer to previous lab pages as a reminder.

Goals

  • Write a program that uses linear search

  • Write a program that uses binary search

  • Continue practicing Top-Down Design (TDD)

  • Connect CS topics with real-world data

1. Searching a real-world data set

For this this assignment, we will be working with a real-world dataset from CORGIS (which is an acronym for: The Collection of Really Great, Interesting, Situated Datasets). We will be using a slightly modified version of this dataset that contains information about access to stores selling food (grocery stores, etc.) broken down by county. This data can be used to identify communities with significant "food deserts," but it also contains additional relevant information such as population, access to vehicles, income levels, etc.

1.1. Trigger Warning and notes about real-world data

While this data is provided as demographics (i.e. population numbers) and does not provide information on specific individuals, it is actual data collected about the real world relating to income inequality and hunger.

If you are not comfortable working with this data, please reach out to one of the instructors about the possibility of completing an alternate version of this lab.

Biases in real-world data

  • It is worth being aware of what your data represents, where it came from, and who collected it.

  • This data demonstrates inequality (i.e., the unequal distribution of access to food). Keep in mind that data collected about the world is not a perfect representation of that world.

  • The process of collecting and curating data is done by humans, and as a result the data we see can have embedded biases that are difficult or impossible to identify simply from looking at the numbers. Even if our algorithms are not inherently biased, the output of our programs might be biased if the input data set is biased.

In this lab, we’ll explore the use of algorithms that might help us use data to identify unfairness in the world. Often, identifying unfairness is a first step towards remedying systemic issues. Even with good intentions, however, it is important to keep in mind not only the strengths but also the limitations of your data and algorithms.

1.2. Using the food_access data

The data we will be wroking with are stored in a file as comma-separated values and is located at: /data/cs21/food_access/food_access.csv

There are actually two versions of the data file; the complete one is: /data/cs21/food_access/food_access.csv but there’s also a copy that’s had most of the lines removed, which may be easier to use for testing your program, particularly in the early stages: /data/cs21/food_access/food_access_small.csv

Each line contains 25 fields separated by commas:

Table 1. Features of Food Access Data Set

Feature index

Type

Description

0

str

County name

1

int

Population of the county

2

str

State name

3

int

Residents in group quarters

4

int

Total housing units

5…​24

int

Access to food markets for different groups

After the first 5 lines, the remainder of the file is made up of five parts, each containing four numbers. Each part describes food access based on a particular demographic group: households lacking vehicles, children, low-income individuals, all individuals, and seniors. Within each part, the four numbers refer to different distances (1/2 mile, 1 mile, 10 miles, 20 miles), and are counts of how many individuals in that demographic live further than the specified distance from the nearest supermarket.

A full description of the file format can be found here: Data set description link

Here are the first three lines and the last three lines from the file. Each column is separated by a comma.

Abbeville County,25417,South Carolina,901,9990,559,912,116,0,4249,5262,601,0,7585,9784,1567,0,18958,23281,2647,0,3068,3836,456,0
Acadia Parish,61773,Louisiana,1050,22841,480,1002,6,0,8615,12008,15,0,10500,16733,28,0,31316,43547,70,0,3466,5190,16,0
Accomack County,33164,Virginia,428,13798,1274,1397,226,0,5731,6482,135,0,11970,13236,298,0,27105,30794,727,0,5089,5812,156,0
...
Zapata County,14018,Texas,30,4297,157,234,40,0,1931,3354,447,1,3569,5989,962,2,5828,9704,1548,3,688,994,200,1
Zavala County,11677,Texas,409,3573,308,442,229,101,1539,2819,1017,310,3389,5531,2311,787,5053,8811,3139,1000,527,986,335,101
Ziebach County,2801,South Dakota,0,836,67,75,45,32,626,739,387,281,1110,1272,701,509,1706,1949,1050,735,148,153,94,58

Note that the lines are sorted alphabetically by county name (the first column).

In this lab we will focus on searching for records by county name, state, and population. A more interesting real-world application would likely allow the user to cross-reference the data and search for and/or display particular patterns or relationships in the data.

You will implement a program called search-food.py, which will allow the user to search through this dataset and find particular information.

2. Implementation

At a high level, your program will do the following (each of these steps is described in more detail below):

  • Read in the data file (/data/cs21/food_access/food_access.csv)

  • Present the user with a menu of options

  • Repeatedly get an option from the user, and search the data based on the selected option

  • Exit the program once the user selects the "Quit" option

2.1. Reading in the data

Your program should read in data from a file, storing the information in the file about food access. The data you should read in is in the file: /data/cs21/food_access/food_access.csv. Recall that there is a smaller version of this file that may be useful for testing, called /data/cs21/food_access/food_access_small.csv.

Note: don’t copy this file into your cs21/labs/08 directory, just use /data/cs21/food_access.csv as the file name in your program.

As you read in the data, each line of the file corresponds to one record, which your program should represent using a FoodAccessRecord object, and store it in a list.

2.2. Present a menu

Your program should repeatedly present a menu to the user with three options for searching the data — by county, by state, or by population — and an option to quit. For example:

Please select one of the following choices:
(1) Find by County
(2) Find by State
(3) Find by Population
(0) Quit

Choice?

2.3. Search the data set

Depending on the user’s selection from the menu, your program should prompt the user for additional information, and do one of:

  1. Search the dataset by population threshold, using linear search.

    • The user will enter a number (the population threshold), and your program should return any records with a population larger than the number entered by the user. See below for more details.

  2. Search the dataset by state, using linear search.

  3. Search the dataset by county name, using binary search.

  4. Exit the program

When searching by county name or by state, your program should compare data in lowercase format (so that user input is case insensitive), but display results with their original case.

County and state searches should also be done by prefix, meaning that as long as the name (either county or state, depending on the search) in the database starts with the string the user entered, it should be considered a match.

Details about how to adapt the search algorithms from class to achieve this are described below, in Adapting the search algorithms.

2.4. Validate inputs

You should ensure that the user provides valid input at all prompts.

  • For validating menu options, the user can only type in options that you provide them.

  • For the population threshold, make sure the user enters a positive int.

2.5. Display the results

Use formatted printing to show the search results in a nice table format that fits an 80-character wide terminal window. This will require using methods from the FoodAccessRecord object to get information about each record. See the FoodAccessRecord object below for a list of methods you can use.

2.6. Advice for the lab

2.6.1. Continue to use TDD

We strongly recommend that you create a top-down design with a fully fleshed out main and stubs for all other functions. However, you may not work with a partner on this lab. You don’t need to get approval from an instructor for your design, though we’d be glad to review it with you in lab, office hours, or ninja sessions. You should only begin implementing the other functions after you have a clear plan in place for the structure of main.

2.6.2. Create modular, reusable functions

Avoid duplicating similar code in multiple places throughout the program. Think about how you could re-use the same function. For example, we are doing several types of searches (by county name, by state, and by population), but the results of those searches are displayed in exactly the same way. Consider having a function that takes in a list of search results and displays them. Then you could call this function for either type of search.

3. FoodAccessRecord Class

The food_access_lab library provides a FoodAccessRecord class, which encapsulates all the features needed to store one "record." In order to use this library, we’ve included the following line at the top of your file:

from food_access_lab import *

You can create an object of this class by calling the FoodAccessRecord constructor, which takes as its input a string containing a record (i.e. one line of the file).

In order to create the objects, you’ll need to read in the file line by line, and then create an object from each line. For example, if one_line is a string containing a single line from the file, then writing record = FoodAccessRecord(one_line) will create an object called record containing the food access data for that line. For your program, you’ll need to create an object like this for each line of the file and assemble them into a list of records which you can pass around.

Once you have a FoodAccessRecord object, you can use various getter methods to retrieve information about it:

Table 2. Features of Food Access Data with Getter Methods

Getter Method

Return Type

Description

get_county()

string

County name (i.e. the county the rest of the data pertains to)

get_state()

string

State the county is located in

get_population()

int

Total population (of the county)

get_housing_total()

int

Total housing units

get_housing_group()

int

Group housing units

get_lfa_people(distance)

int

Low Food Access population: Total number of people living beyond the given destance from nearest supermarket

get_lfa_children(distance)

int

Low Food Access children: Children living beyond the given destance from nearest supermarket

get_lfa_seniors(distance)

int

Low Food Access seniors: Seniors living beyond the given destance from nearest supermarket

get_lfa_low_income(distance)

int

Low Food Access low income population: Low income people living beyond the given destance from nearest supermarket

get_vehicle_access(distance)

int

Low Food Access no-vehicle households: Households with no vehicle (car/truck/etc.) beyond the given destance from nearest supermarket

Note that many of these methods take a distance parameter; this is the radius at which the number is recorded. For instance, .get_lfa_low_income(0.5) would return the number of low-income individuals for who live more than half a mile from the nearest supermarket; .get_lfa_children(10) would return the number of children who live more than 10 miles from the nearest supermarket. The data set contains entries for the distance values 0.5, 1, 10, and 20.

For your program, it’s fine to just use the value 1 all the time unless you’re implementing the related optional extension.

4. Adapting the search algorithms

4.1. Adapting linear search for population (threshold)

For population threshold searches, your program should ask the user to enter the minimum population they want to searh for, and your search should find any records where the total population (using the .get_population() method) is greater than the number entered by the user.

See Example 3 for the results of this search.

4.2. Adapting linear search for state name

The version of linear search that we discussed in class stops as soon as it finds the item it is looking for, and returns the index of that item or -1 if not found. For this problem, your linear search will need to accumulate a list of matching results and return that.

In addition, the basic version of linear search assumes that you are searching a flat list. We will be searching a list of objects, so will need to adapt linear search to use method syntax to get at the data within each object.

When searching for state names, a food access record should be considered a match whenever the state name starts with the search string (e.g. using "New" should match both "New York" and "New Mexico"). You’ll want to use the .startswith method (see here) rather than == so it doesn’t miss any partial matches (such as matching "New York" with the search string "New").

To do case-insensitive search, you’ll also want to use the .lower method (see here) to convert both the target and the candidate match to lower case.

Because the data is sorted by county name, we can use binary search to very quickly locate a particular county. However, names can be somewhat long, so we’ll want to be able to search by prefix rather than complete string. Additionally, the data may contain multiple county names with the same name prefix, and our program should find and display all of them.

For example, there are six county names that start with the prefix "Allen":

...
Allegheny County     <- this has a different prefix
Allen County
Allen County
Allen County
Allen County         <- suppose binary search finds this one
Allen Parish
Allendale County
Alpena County        <- this has a different prefix
...

4.3.1. Finding the first match

First write a binary search function that searches for a partial match to a county name and returns the index of the first match that it finds (or -1 otherwise).

  • Once again, make sure your search function uses the .startswith method (see here) rather than == so it doesn’t miss any partial matches (e.g. using "Ada" to match on "Ada County").

  • You’ll also want to use the .lower method (see here) to convert both the target and the candidate match to lower case.

4.3.2. Finding additional matches

There might be more matches that are just before or just after that initial index you locate. So next you’ll need to search in the vicinity of that location to find all of the other potential matches.

We have provided you with a function to help with finding nearby matches based on an initial seed. This function is called look_forward_backward and it is also imported from the food_access_lab library. This function takes three parameters and returns a list of matching items found both before and after the initial seed index. Here is the function definition and the header comment explaining the parameters and return:

def look_forward_backward(data, idx, name):
  """
  This function takes three parameters and returns a list of matching items that
  it found in data, looking both forward and backward from index idx.
  Params:
    data: list of records to search over
    idx: integer index indicating where the search should start from
    name: string (prefix) that the function should be searching for
  Returns:
    list: a list of FoodAccessRecord objects that match the prefix search
          on the county name
  """

A call to the look_forward_backward() function might look something like this:

all_matches = look_forward_backward(food_data, index, search_name)

In the example search above, the returned list of all_matches would include all of the matching records both before and after the "Allen County" result found by binary search, including the other "Allen County" results as well as "Allen Parish" and "Allendale County".

5. Displaying Matches

We want the results of our searches to be displayed nicely in the terminal window. For example, the table below shows the results in a county search for "Ada":

================================================================================
|          County |        State |Total Pop| LFA Pop| Income | Senior | Vehicle|
--------------------------------------------------------------------------------
|      Ada County |        Idaho |   392365|  134683|  28151 |  12200 |    1122|
|    Adair County |         Iowa |     7682|    4917|   1448 |    973 |      82|
|    Adair County |     Kentucky |    18656|   14058|   6890 |   2071 |     233|
|    Adair County |     Missouri |    25607|   13036|   5095 |   1916 |     340|
|    Adair County |     Oklahoma |    22683|   17318|   9400 |   2231 |     331|
|    Adams County |     Colorado |   441603|  101040|  25118 |   7087 |     968|
|    Adams County |        Idaho |     3976|    3014|   1097 |    618 |      39|
|    Adams County |     Illinois |    67103|   29120|   6096 |   4702 |     307|
|    Adams County |      Indiana |    34387|   25684|   8656 |   3248 |    1093|
|    Adams County |         Iowa |     4029|    2371|    715 |    474 |      15|
|    Adams County |  Mississippi |    32297|   21509|  11395 |   3115 |     672|
|    Adams County |     Nebraska |    31364|   15052|   4459 |   2757 |     287|
|    Adams County | North Dakota |     2343|    1042|    255 |    238 |       7|
|    Adams County |         Ohio |    28550|   23829|  11290 |   3387 |     811|
|    Adams County | Pennsylvania |   101407|   74543|  16482 |  11683 |    1025|
|    Adams County |   Washington |    18728|    8640|   4630 |    849 |     138|
|    Adams County |    Wisconsin |    20875|   18218|   6918 |   4258 |     252|
--------------------------------------------------------------------------------

Notice that each column has a fixed width even if the data in that column have different widths. For example, the titles shown take up different amount of space, but the column remains well formatted regardless of the title being displayed. More details on the information displayed, and tips on using string formatting to achieve this output are given in Displaying Matches below.

5.1. Truncating fields

Some fields in the data set are quite long. We can use slicing to truncate these strings to a fixed maximum length, allowing us to create an easy to read table format. This section of the online textbook explains how to slice a string. For instance, the following example using the REPL shows how to slice out the first 10 characters from a string:

>>> alphaString = "abcdefghijklmnopqrstuvwxyz"
>>> alphaString[0:10]
'abcdefghij'

Notice how this nicely works, even if we provide an upper bound that is longer than the string we’re trying to slice (unlike when dealing with lists):

>>> shortString = "xyz"
>>> shortString[0:10]
'xyz'

5.2. Formatting column width

Recall that with the % string formatting, you can specify the width of a field when printing, e.g. print("%10s" % ("hi")) will print "hi" with 8 spaces in front of it, so that it takes up a total of 10 characters in the display.

If it’s helpful, you can also use the - flag to get left-justified printing (the default is right-justified). For example:

>>> print("|%-10s|%10s|" % ("hi", "there"))
|hi        |     there|

Notice that the first format specifier, %-10s, produced "hi" followed by 8 spaces (so it takes up a total of 10 characters), and the second format specifier, %10s, produced "there" with 5 spaces in front of it (so that it also takes up a total of 10 characters).

Specifying the width also works with numbers. When displaying floats, we typically also specify the number of digits that should follow the decimal point. For example:

>>> print("|%-10d|=|%10.3f|" % (837, 45.345986))
|837        |=|     45.346|

By combining these ideas, you can make your output look like a nice table where everything is aligned. For instance, to get the header shown in the example output, try:

print("=" * 80)
print("| %15s | %12s |%9s| %7s| %6s | %6s | %7s|" % ( "County", "State", \
      "Total Pop", "LFA Pop", "Income", "Senior", "Vehicle" ) )
print("-" * 80)

Notice that several of the column labels shown here are abbreviations so it fits in 80 characters; "Total Pop" is the total population of the county, "LFA Pop" is the low food access population (i.e. the number of people too far from the nearest supermarket, via the .get_lfa_people method). Likewise, "Income", "Senior", and "Vehicle" are the counts for the various different groups that are too far from a market (see here for further details on how to get these numbers and what they represent). As mentioned above, for the statistics that have fields corresponding to multiple distances, you can use a distance of 1 mile (that is, for the getters that take a destance parameter, you can use 1).

6. Requirements

  1. Your program should read data from the food_access.csv data set and store the data in a list of FoodAccessRecord objects.

  2. Your program must validate all inputs as described above. In particular, when entering a population threshold, the program should validate that the user enters a positive integer.

  3. Your program should repeatedly present the user with a menu of options, including: search by county name, search by state, search by population threshold, and quit. After each search, the program should return to the main menu and allow the user to either perform another search or quit the program.

  4. When the user selects to search by population threshold, the program should search the data using linear search as described above.

  5. When the user selects to search by state, the program should search by using linear search, as described above.

  6. When the user selects to search by county name, the program should search the data by using binary search, as described above.

  7. After performing the search, your program should display the results in a nicely formatted table that fits in an 80-character wide terminal window, using string formatting to align the data in a format similar to what is shown in the Displaying matches section. If no results are found, your program should display a message indicating that no results were found.

7. Extra Challenges

Here are some optional extensions! These are NOT required, but they might be fun and rewarding, and allow you to get closer to how we might like to use this data in the real world.

If you implement extensions, please do so in a copy of your program called search-food-ext.py

7.1. Optional: Adjustable range

Right now, the distance parameter is always set to 1 mile. Add a menu option to let the user set a different distance threshold (in miles) which you will then pass to the getter functions.

7.2. Optional: Find counties with "food deserts"

Prompt the user for a number of individuals with poor access to food, then use linear search to find all counties where the "distant population" is greater than the number given.

For even more challenge, try searching for counties that pass a specified ratio threshold (i.e. find counties where more than a given percentage of the population has poor access to food).

Note that you can do this for other numeric columns if you want.

Rather than importing the provided lookForwardBackward function, try writing your own. This should work the same way as the provided version: it will look linearly forwards or linearly backwards based on an initial seed index that it takes as a parameter. The function will also need to take the search string as an input parameter (as well as potentially other information depending on your design). This function should return a list of all the partial matches based on that initial seed index.

The program can then add that list together with the entry at the initial seed to compute the total list of partial matches which are then displayed to the user.

Answer the Questionnaire

After each lab, please complete the short Google Forms questionnaire. Please select the right lab number (Lab 08) from the dropdown menu on the first question.

Once you’re done with that, you should run handin21 again.

Submitting lab assignments

Remember to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.

Logging out

When you’re done working in the lab, you should log out of the computer you’re using.

First quit any applications you are running, including your vscode editor, the browser and the terminal. Then click on the logout icon (logout icon or other logout icon) and choose "log out".

If you plan to leave the lab for just a few minutes, you do not need to log out. It is, however, a good idea to lock your machine while you are gone. You can lock your screen by clicking on the lock xlock icon. PLEASE do not leave a session locked for a long period of time. Power may go out, someone might reboot the machine, etc. You don’t want to lose any work!