CS21 Lab 9: Filtering and Sorting

Due Saturday, November 23, before midnight

Goals

  • Extend sorting beyond a list of integers

  • Learn to filter and reduce dataset sizes before sorting

  • Continue practicing top-down design (TDD)

  • Manage real world data

Please read through the entire lab before starting!

Federal Election Commission Data

The Federal Election Commission (FEC) collects fundraising data on candidates running for federal elections. You can interactively browse much of the data online, but we can also use our CS skills to explore the data interactively in python. As candidates must itemize individual campaign contributions in excess of $200, there is a lot of public data to sift through and explore.

We provide a number of cleaned up and occasionally aggregated data files that each have records in the following format.

P00010520|HICKENLOOPER, JOHN W.|DENVER|CO|01|03|2019|32800

Fields are separated by a vertical pipe |, and contain the following information:

  1. A unique ID for a particular 2020 presidential campaign. P00010520 in this case is the campaign for John Hickenlooper, whose 2020 campaign ended in August 2019.

  2. The name of the donor, e.g., HICKENLOOPER, JOHN W., who donated to his own campaign.

  3. The city of the donor, e.g., DENVER.

  4. The two letter state abbreviation of the donor, e.g., CO

  5. The two digit month of the donation, e.g., 01

  6. The two digit day of the donation, e.g., 03

  7. The two digit year of the donation, e.g., 2019

  8. The amount of the donation in US dollars.

In the directory /usr/local/doc/ we placed the file 2020all.txt which has itemized individual campaign contributions for every 2020 presidential candidate. Contributions may span from 2016 to the present, and there are over 1.6 million total records in this file. Working with this file may be challenging so we also provide some smaller summary files in the same /usr/local/doc/ folder.

  • 2020candidates.txt (105 records) This file has one entry per candidate per calendar year containing all the contributions for that year. The donor name, city and location are replaced with a dummy |Summary|Summary|US, and the month and day are replaced with 01|01.

    P00011866|Summary|Summary|US|01|01|2019|2015652
    P00011312|Summary|Summary|US|01|01|2019|869686
    P00010793|Summary|Summary|US|01|01|2019|12020868
    P00006486|Summary|Summary|US|01|01|2019|2383887
    P00011239|Summary|Summary|US|01|01|2019|740304
    P00011999|Summary|Summary|US|01|01|2019|3475776
    P00011338|Summary|Summary|US|01|01|2019|1318045
    P00012716|Summary|Summary|US|01|01|2019|48199335
    P00006213|Summary|Summary|US|01|01|2019|1461829
    P00009910|Summary|Summary|US|01|01|2019|2781635
  • 2020months.txt (497 records) This file has one entry per candidate per calendar year per month containing all the contributions for that month. It is similar in format to 2020candidates.txt but we do not replace the month with 01

  • 2020states.txt (9,951 records) Similar to 2020months.txt, but monthly aggregates are further subdivided by state, so the state column is unmodified and not replaced by US as it was in the 2020months.txt file

  • 2020donors.txt (308,816 records) This file has one entry per candidate per donor per year. Occasionally a donor will make multiple campaign contributions to the same candidate. These are reported as individual transactions in the 2020all.txt file, but are aggregated into a single yearly transaction with a calendar date of 12/31 for each donor in the 2020donors.txt file

Searching the Dataset

Your task is to write a program that:

  • Prompts the user with a menu of six choices:

    Please select one of the following choices:
    0. Quit
    1. Load file
    2. Print top donations
    3. Filter by campaign ID
    4. Filter by state
    5. Filter by city
    
    Number of current records: 0
    
    Choice?
  • Repeatedly asks the user for a valid choice

  • Completes the selected task until the user quits (option 0)

Each option aside from 0 is described in more detail below.

Load file

For option 1, load file, you must:

  • Ask the user to select one of the valid filenames mentioned above

  • add the path /usr/local/doc/ to the provided name

  • read in the appropriate file of donation records from a file

  • return a new list of Donation records

  • print the number of records returned

You should print a small sub-menu of the valid filenames and have the user select the filename by number,

$ python3 campaign.py
0. Quit
1. Load file
2. Print top donations
3. Filter by campaign ID
4. Filter by state
5. Filter by city

Number of current records: 0

Your choice? 1

Data files:
1. Grouped by candidates (2020candidates.txt)
2. Grouped by month (2020months.txt)
3. Grouped by state (2020states.txt)
4. Grouped by donor (2020donors.txt)
5. All records (2020all.txt)

Your choice? 1

0. Quit
1. Load file
2. Print top donations
3. Filter by campaign ID
4. Filter by state
5. Filter by city

Number of current records: 105

The electiondata module provides a small Donation class that you can use to store donor info.

Donation(campaign, name, city, state, month, day, year, amt)

Given a line in the donor file, you could create a Donation object as follows:

info = line.strip().split("|")
d = Donation(info[0],info[1],info[2],info[3],
       int(info[4]), int(info[5]),  int(info[6]),  int(info[7]))

Print top donations

For option 2, print top donations, you must:

  • prompt the user for the number of records to view, n.

  • sort the current list of donations by the donation amount.

  • print the top n donations in decreasing order with the highest amount first.

  • you should print the name of the campaign associated with each donation.

If there are fewer than n records in the current list of donations, you should print all the donations. The hints below will show you how you use the get_candidates() function in the electiondata library to map the campaign ID to a candidate name.

You must implement your own sorting algorithm to sort the records. It can be a variant of any of the quadratic runtime sort algorithms we discussed in class. You will need to adapt the algorithm to sort Donation objects by comparing the amount of two separate donations and swapping their positions in the list if necessary.

A sample run on the candidate summary data file is shown below.

$ python3 campaign.py
0. Quit
1. Load file
2. Print top donations
3. Filter by campaign ID
4. Filter by state
5. Filter by city

Number of current records: 0

Your choice? 1

Data files:
1. Grouped by candidates (2020candidates.txt)
2. Grouped by month (2020months.txt)
3. Grouped by state (2020states.txt)
4. Grouped by donor (2020donors.txt)
5. All records (2020all.txt)

Your choice? 1
0. Quit
1. Load file
2. Print top donations
3. Filter by campaign ID
4. Filter by state
5. Filter by city

Number of current records: 105

Your choice? 2
How many?: 8
STEYER, TOM:                 48199335
		P00012716|Summary|Summary|US|01|01|2019|$    48199335

BUTTIGIEG, PETE:             33943835
		P00010298|Summary|Summary|US|01|01|2019|$    33943835

SANDERS, BERNARD:            30523241
		P60007168|Summary|Summary|US|01|01|2019|$    30523241

HARRIS, KAMALA D.:           22208134
		P00009423|Summary|Summary|US|01|01|2019|$    22208134

BIDEN, JOSEPH R JR:          20317082
		P80000722|Summary|Summary|US|01|01|2019|$    20317082

WARREN, ELIZABETH:           15329719
		P00009621|Summary|Summary|US|01|01|2019|$    15329719

KLOBUCHAR, AMY J.:           13165462
		P80006117|Summary|Summary|US|01|01|2019|$    13165462

TRUMP, DONALD J.:            12945866
		P80001571|Summary|Summary|US|01|01|2019|$    12945866

Filter by campaign ID

For option 3, filter by campaign ID, you must:

  • Prompt the user for a campaign ID, e.g., P00009621.

  • Filter the current list of donation records by creating a new list containing only the records whose campaign ID matches the given ID and replacing the current list of records with this new list.

  • Print the number of records returned.

Note it is perfectly fine to return an empty list if the user specifies an invalid campaign ID.

You will probably want to write a function that takes the list of current donors and returns a new list.

donations = filter_by_campaign(donations, id)

Below is list of campaign ID for some of the top fundraising campaigns. You can find more by using your program and looking at the 2020candidates.txt file.

P00012716|STEYER, TOM
P00010298|BUTTIGIEG, PETE
P60007168|SANDERS, BERNARD
P00009423|HARRIS, KAMALA D.
P80001571|TRUMP, DONALD J.
P80000722|BIDEN, JOSEPH R JR
P00009621|WARREN, ELIZABETH
P80006117|KLOBUCHAR, AMY J.
P00009795|BOOKER, CORY A.
P00010793|O'ROURKE, ROBERT BETO
P00009290|GILLIBRAND, KIRSTEN
P00011833|BENNET, MICHAEL F.
P00010454|INSLEE, JAY R
P00010520|HICKENLOOPER, JOHN W.
P00009092|CASTRO, JULIAN
P00011999|BULLOCK, STEVE
P00006213|DELANEY, JOHN K.
P00006486|YANG, ANDREW MR.
P00009183|GABBARD, TULSI
P00009910|WILLIAMSON, MARIANNE
P00012054|DE BLASIO, BILL

Filter by state

For option 4, filter by state, you must :

  • Prompt the user for a two letter state abbreviation, e.g., PA.

  • Filter the current list of donation records by creating a new list containing only the records whose state matches the state and replacing the current list of records with this new list.

  • Print the number of records returned.

This is very similar to filter by campaign ID, but just filtering by a different criteria. Note since each filter replaces the current list of donations, applying successive filters can further reduce the list size. For example, filtering first by campaign P00009621 and then by state PA should result in a list of donations to the P00009621 (Elizabeth Warren) campaign from PA.

Filter by city

For option 5, filter by city, you must :

  • Prompt the user for a city name, e.g., Springfield.

  • Filter the current list of donation records by creating a new list containing only the records whose city name matches the city and replacing the current list of records with this new list.

  • Print the number of records returned.

The city match only matches by city name, so for city names that exist in multiple states, you would need to apply a city filter followed by a state filter to narrow down your results. This would be the user’s responsibility, not the programmers responsibility, so to search for Springfied, PA the user would filter by city==Springfield then state==PA. Also note that the data in the data files are in all caps. You may want to use the .upper() method on the user input to convert the user’s response to something that is easier to match in the list of records.

Sample output

See the sample output page for examples of how the program might run.

Provided Library

The electiondata library has some helpful tools for managing donation records in your program.

Donation Class

The electiondata library provides a Donation class, which encapsulates all the information you’ll need to keep track of one donation record. You can create an instance object of this class by calling the Donation() constructor with each piece of information from one row of your data file passed in as a separate parameter. Each number is an integer, and all other fields are strings.

You can call the following methods on a Donation object to retrieve information about it:

  • get_candidate(): Retrieve a unique campaign identifier. This isn’t easy for humans to remember, but we can match it to a real candidate with the Candidate class described below.

  • get_name(): Retrieve the name of the donor.

  • get_city(): Retrieve the city string.

  • get_state(): Retrieve the state string (abbreviated, e.g., PA).

  • get_month(): Retrieve the donation month as an integer.

  • get_day(): Retrieve the donation day as an integer.

  • get_year(): Retrieve the donation year as an integer.

  • get_amt(): Retrieve the donation amount as an integer.

For example:

>>> from electiondata import *
>>> record = Donation("P00010520", "HICKENLOOPER, JOHN W.", "DENVER", "CO", 01, 03, 2019, 32800)
>>> print(record.get_year())
2019
>>> print(record.get_state())
CO
>>> print(record)
P00010520|HICKENLOOPER, JOHN W.|DENVER|CO|01|03|2019|$     32800

Candidate class

Donations are listed by a unique campaign identifier, but this ID does not easily describe a candidate. A separate Candidate class stores information about candidates, including their campaign ID

  • get_candidate_id(): Retrieve the campaign ID associated with this candidate as it appears in the donor files.

  • get_name(): Retrieve real name of candidate.

  • get_committee(): Retrieve name of fundraising committee for candidate.

  • get_party(): Retrieve political party abbreviation of candidate.

  • get_state(): Retrieve state of registration for this candidate.

A text file containing info about the 2020 presidential candidates can be found in /usr/local/doc/2020presinfo.txt, but the function get_candidates() in the electiondata library will read this file for you and return a list of Candidate objects, so you should not need to parse this file on your own.

>>> from electiondata import *
>>> cand = get_candidates()
>>> print(cand[1])
P00010298|C00697441|BUTTIGIEG, PETE|PETE FOR AMERICA, INC.|DEM|IN
>>> print(cand[4])
P80001571|C00580100|TRUMP, DONALD J.|DONALD J. TRUMP FOR PRESIDENT, INC.|REP|NY
>>> print(cand[5])
P80000722|C00703975|BIDEN, JOSEPH R JR|BIDEN FOR PRESIDENT|DEM|DC
>>> print(cand[7])
P80006117|C00696419|KLOBUCHAR, AMY J.|AMY FOR AMERICA|DEM|MN

When printing your result in option 2, you can pass in a list of candidates in addition to your donor info when displaying the results. You can use this list of candidates to search for a matching candidate_id and lookup the candidate name. Since there are only 74 candidates in the 2020 candidates list, a basic linear search will be fast enough for most reporting.

Requirements, Hints, and Tips

  • You should practice good top-down design, incrementally implement and test your solution, and document your code with comments. Start with the main menu and get a skeleton of your code working first. Then add each option from the menu one at a time (start with quit as it is easy and will allow you to exit your program when you are testing it). While much of the design is up to you, the requirements below are designed to avoid some headaches in the initial design.

  • You should only need to read a donation file when the user selects option 1. For all other options, you should be using a list that was previously loaded from a file.

  • To print an instance object of the Donation class, you can call print() directly on the object:

    donation = alldonations[0] #alldonations is list of Donation objects
    print(donation)

    The class knows how to format itself into a string, so you can avoid writing a nasty formatting string.

  • Validate menu selections. If the user doesn’t enter a valid number (1-6), let the user know it wasn’t a valid selection and prompt them again for a new menu item.

  • When reading file input, split() only generates a list of strings. Some of the strings may need to be converted to integers.

Extra Challenges

This will not affect your grade or gain extra points. It is included just as an extra challenge, if you are interested (and have already finished the entire assignment as described above).

There are numerous ways to extend this lab to make it even more useful. Here are some challenges you may want to try. If you decide to try them, add any extra menu options to the end of the list so that you do not change the default order.

  • Allow users to filter by city and state in a single step instead of applying two filters.

  • Allow an undo option, where selecting undo will undo the most recent filter and replace the current list of records with the previous list. This could be handy if you wanted to search for one state, undo, and search for a second, without reloading the entire file. For an extra, extra challenge, allow multiple levels of undo, e.g., allow users to undo up to the previous three filters.

  • Add an exclude filter option where you can return a list of records that do not match a query. For example, if you wanted to remove candidates that had dropped out of the race already, you could apply a filter on campaign ID that returns all records except a given campaign ID to remove a candidate.

  • Allow a default menu option so that users can e.g. load a default filename simply by pressing enter instead of typing a number. You can choose the default name. Your program should still allow users to select a file other than the default.

Answer the Questionnaire

Please edit the Questions-09.txt file in your cs21/labs/09 directory and answer the questions in that file.

Once you’re done with that, run handin21 again.

Turning in your labs…​.

Remember to run handin21 to turn in your lab files! You may run handin21 as many times as you want. Each time it will turn in any new work. We recommend running handin21 after you complete each program or after you complete significant work on any one program.