CS21 Lab 9: Filtering and Sorting
Due Saturday, November 23, before midnight
Extend sorting beyond a list of integers
Learn to filter and reduce dataset sizes before sorting
Continue practicing top-down design (TDD)
Manage real world data
Please read through the entire lab before starting!
Federal Election Commission Data
The Federal Election Commission (FEC) collects fundraising data on candidates running for federal elections. You can interactively browse much of the data online, but we can also use our CS skills to explore the data interactively in python. As candidates must itemize individual campaign contributions in excess of $200, there is a lot of public data to sift through and explore.
We provide a number of cleaned up and occasionally aggregated data files that each have records in the following format.
P00010520|HICKENLOOPER, JOHN W.|DENVER|CO|01|03|2019|32800
Fields are separated by a vertical pipe
|, and contain the following information:
A unique ID for a particular 2020 presidential campaign.
P00010520in this case is the campaign for John Hickenlooper, whose 2020 campaign ended in August 2019.
The name of the donor, e.g., HICKENLOOPER, JOHN W., who donated to his own campaign.
The city of the donor, e.g., DENVER.
The two letter state abbreviation of the donor, e.g., CO
The two digit month of the donation, e.g., 01
The two digit day of the donation, e.g., 03
The two digit year of the donation, e.g., 2019
The amount of the donation in US dollars.
In the directory
/usr/local/doc/ we placed the file
2020all.txt which has itemized individual campaign contributions for every 2020 presidential candidate. Contributions may span from 2016 to the present, and there are over 1.6 million total records in this file. Working with this file may be challenging so we also provide some smaller summary files in the same
2020candidates.txt(105 records) This file has one entry per candidate per calendar year containing all the contributions for that year. The donor name, city and location are replaced with a dummy
|Summary|Summary|US, and the month and day are replaced with
P00011866|Summary|Summary|US|01|01|2019|2015652 P00011312|Summary|Summary|US|01|01|2019|869686 P00010793|Summary|Summary|US|01|01|2019|12020868 P00006486|Summary|Summary|US|01|01|2019|2383887 P00011239|Summary|Summary|US|01|01|2019|740304 P00011999|Summary|Summary|US|01|01|2019|3475776 P00011338|Summary|Summary|US|01|01|2019|1318045 P00012716|Summary|Summary|US|01|01|2019|48199335 P00006213|Summary|Summary|US|01|01|2019|1461829 P00009910|Summary|Summary|US|01|01|2019|2781635
2020months.txt(497 records) This file has one entry per candidate per calendar year per month containing all the contributions for that month. It is similar in format to
2020candidates.txtbut we do not replace the month with
2020states.txt(9,951 records) Similar to
2020months.txt, but monthly aggregates are further subdivided by state, so the state column is unmodified and not replaced by
USas it was in the
2020donors.txt(308,816 records) This file has one entry per candidate per donor per year. Occasionally a donor will make multiple campaign contributions to the same candidate. These are reported as individual transactions in the
2020all.txtfile, but are aggregated into a single yearly transaction with a calendar date of 12/31 for each donor in the
Searching the Dataset
Your task is to write a program that:
Prompts the user with a menu of six choices:
Please select one of the following choices: 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 0 Choice?
Repeatedly asks the user for a valid choice
Completes the selected task until the user quits (option 0)
Each option aside from 0 is described in more detail below.
For option 1, load file, you must:
Ask the user to select one of the valid filenames mentioned above
add the path
/usr/local/doc/to the provided name
read in the appropriate file of donation records from a file
return a new list of
print the number of records returned
You should print a small sub-menu of the valid filenames and have the user select the filename by number,
$ python3 campaign.py 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 0 Your choice? 1 Data files: 1. Grouped by candidates (2020candidates.txt) 2. Grouped by month (2020months.txt) 3. Grouped by state (2020states.txt) 4. Grouped by donor (2020donors.txt) 5. All records (2020all.txt) Your choice? 1 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 105
electiondata module provides a small
Donation class that you can use to store donor info.
Donation(campaign, name, city, state, month, day, year, amt)
Given a line in the donor file, you could create a
Donation object as follows:
info = line.strip().split("|") d = Donation(info,info,info,info, int(info), int(info), int(info), int(info))
Print top donations
For option 2, print top donations, you must:
prompt the user for the number of records to view,
sort the current list of donations by the donation amount.
print the top
ndonations in decreasing order with the highest amount first.
you should print the name of the campaign associated with each donation.
If there are fewer than
n records in the current list of donations, you should print all the donations. The hints below will show you how you use the
get_candidates() function in the
electiondata library to map the campaign ID to a candidate name.
You must implement your own sorting algorithm to sort the records. It can be a variant of any of the quadratic runtime sort algorithms we discussed in class. You will need to adapt the algorithm to sort
Donation objects by comparing the amount of two separate donations and swapping their positions in the list if necessary.
A sample run on the candidate summary data file is shown below.
$ python3 campaign.py 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 0 Your choice? 1 Data files: 1. Grouped by candidates (2020candidates.txt) 2. Grouped by month (2020months.txt) 3. Grouped by state (2020states.txt) 4. Grouped by donor (2020donors.txt) 5. All records (2020all.txt) Your choice? 1 0. Quit 1. Load file 2. Print top donations 3. Filter by campaign ID 4. Filter by state 5. Filter by city Number of current records: 105 Your choice? 2 How many?: 8 STEYER, TOM: 48199335 P00012716|Summary|Summary|US|01|01|2019|$ 48199335 BUTTIGIEG, PETE: 33943835 P00010298|Summary|Summary|US|01|01|2019|$ 33943835 SANDERS, BERNARD: 30523241 P60007168|Summary|Summary|US|01|01|2019|$ 30523241 HARRIS, KAMALA D.: 22208134 P00009423|Summary|Summary|US|01|01|2019|$ 22208134 BIDEN, JOSEPH R JR: 20317082 P80000722|Summary|Summary|US|01|01|2019|$ 20317082 WARREN, ELIZABETH: 15329719 P00009621|Summary|Summary|US|01|01|2019|$ 15329719 KLOBUCHAR, AMY J.: 13165462 P80006117|Summary|Summary|US|01|01|2019|$ 13165462 TRUMP, DONALD J.: 12945866 P80001571|Summary|Summary|US|01|01|2019|$ 12945866
Filter by campaign ID
For option 3, filter by campaign ID, you must:
Prompt the user for a campaign ID, e.g.,
Filter the current list of donation records by creating a new list containing only the records whose campaign ID matches the given ID and replacing the current list of records with this new list.
Print the number of records returned.
Note it is perfectly fine to return an empty list if the user specifies an invalid campaign ID.
You will probably want to write a function that takes the list of current donors and returns a new list.
donations = filter_by_campaign(donations, id)
Below is list of campaign ID for some of the top fundraising campaigns. You can find more by using your program and looking at the
P00012716|STEYER, TOM P00010298|BUTTIGIEG, PETE P60007168|SANDERS, BERNARD P00009423|HARRIS, KAMALA D. P80001571|TRUMP, DONALD J. P80000722|BIDEN, JOSEPH R JR P00009621|WARREN, ELIZABETH P80006117|KLOBUCHAR, AMY J. P00009795|BOOKER, CORY A. P00010793|O'ROURKE, ROBERT BETO P00009290|GILLIBRAND, KIRSTEN P00011833|BENNET, MICHAEL F. P00010454|INSLEE, JAY R P00010520|HICKENLOOPER, JOHN W. P00009092|CASTRO, JULIAN P00011999|BULLOCK, STEVE P00006213|DELANEY, JOHN K. P00006486|YANG, ANDREW MR. P00009183|GABBARD, TULSI P00009910|WILLIAMSON, MARIANNE P00012054|DE BLASIO, BILL
Filter by state
For option 4, filter by state, you must :
Prompt the user for a two letter state abbreviation, e.g.,
Filter the current list of donation records by creating a new list containing only the records whose state matches the state and replacing the current list of records with this new list.
Print the number of records returned.
This is very similar to filter by campaign ID, but just filtering by a different criteria. Note since each filter replaces the current list of donations, applying successive filters can further reduce the list size. For example, filtering first by campaign
P00009621 and then by state
PA should result in a list of donations to the
P00009621 (Elizabeth Warren) campaign from
Filter by city
For option 5, filter by city, you must :
Prompt the user for a city name, e.g.,
Filter the current list of donation records by creating a new list containing only the records whose city name matches the city and replacing the current list of records with this new list.
Print the number of records returned.
The city match only matches by city name, so for city names that exist in multiple states, you would need to apply a city filter followed by a state filter to narrow down your results. This would be the user’s responsibility, not the programmers responsibility, so to search for
Springfied, PA the user would filter by
state==PA. Also note that the data in the data files are in all caps. You may want to use the
.upper() method on the user input to convert the user’s response to something that is easier to match in the list of records.
See the sample output page for examples of how the program might run.
electiondata library has some helpful tools for managing donation
records in your program.
electiondata library provides a
Donation class, which encapsulates all the information you’ll
need to keep track of one donation record. You can create an instance object of this
class by calling the
Donation() constructor with each piece of information
from one row of your data file passed in as a separate parameter.
Each number is an integer, and all other fields are strings.
You can call the following methods on a
Donation object to retrieve information about it:
get_candidate(): Retrieve a unique campaign identifier. This isn’t easy for humans to remember, but we can match it to a real candidate with the
Candidateclass described below.
get_name(): Retrieve the name of the donor.
get_city(): Retrieve the city string.
get_state(): Retrieve the state string (abbreviated, e.g.,
get_month(): Retrieve the donation month as an integer.
get_day(): Retrieve the donation day as an integer.
get_year(): Retrieve the donation year as an integer.
get_amt(): Retrieve the donation amount as an integer.
>>> from electiondata import * >>> record = Donation("P00010520", "HICKENLOOPER, JOHN W.", "DENVER", "CO", 01, 03, 2019, 32800) >>> print(record.get_year()) 2019 >>> print(record.get_state()) CO >>> print(record) P00010520|HICKENLOOPER, JOHN W.|DENVER|CO|01|03|2019|$ 32800
Donations are listed by a unique campaign identifier, but this ID does not easily describe a candidate. A separate
Candidate class stores information about candidates, including their campaign ID
get_candidate_id(): Retrieve the campaign ID associated with this candidate as it appears in the donor files.
get_name(): Retrieve real name of candidate.
get_committee(): Retrieve name of fundraising committee for candidate.
get_party(): Retrieve political party abbreviation of candidate.
get_state(): Retrieve state of registration for this candidate.
A text file containing info about the 2020 presidential candidates can be found in
/usr/local/doc/2020presinfo.txt, but the function
get_candidates() in the
electiondata library will read this file for you and return a list of
Candidate objects, so you should not need to parse this file on your own.
>>> from electiondata import * >>> cand = get_candidates() >>> print(cand) P00010298|C00697441|BUTTIGIEG, PETE|PETE FOR AMERICA, INC.|DEM|IN >>> print(cand) P80001571|C00580100|TRUMP, DONALD J.|DONALD J. TRUMP FOR PRESIDENT, INC.|REP|NY >>> print(cand) P80000722|C00703975|BIDEN, JOSEPH R JR|BIDEN FOR PRESIDENT|DEM|DC >>> print(cand) P80006117|C00696419|KLOBUCHAR, AMY J.|AMY FOR AMERICA|DEM|MN
When printing your result in option 2, you can pass in a list of candidates in addition to your donor info when displaying the results. You can use this list of candidates to search for a matching
candidate_id and lookup the candidate name. Since there are only 74 candidates in the 2020 candidates list, a basic linear search will be fast enough for most reporting.
Requirements, Hints, and Tips
You should practice good top-down design, incrementally implement and test your solution, and document your code with comments. Start with the main menu and get a skeleton of your code working first. Then add each option from the menu one at a time (start with quit as it is easy and will allow you to exit your program when you are testing it). While much of the design is up to you, the requirements below are designed to avoid some headaches in the initial design.
You should only need to read a donation file when the user selects option 1. For all other options, you should be using a list that was previously loaded from a file.
To print an instance object of the
Donationclass, you can call
print()directly on the object:
donation = alldonations #alldonations is list of Donation objects print(donation)
The class knows how to format itself into a string, so you can avoid writing a nasty formatting string.
Validate menu selections. If the user doesn’t enter a valid number (1-6), let the user know it wasn’t a valid selection and prompt them again for a new menu item.
When reading file input,
split()only generates a list of strings. Some of the strings may need to be converted to integers.
This will not affect your grade or gain extra points. It is included just as an extra challenge, if you are interested (and have already finished the entire assignment as described above).
There are numerous ways to extend this lab to make it even more useful. Here are some challenges you may want to try. If you decide to try them, add any extra menu options to the end of the list so that you do not change the default order.
Allow users to filter by city and state in a single step instead of applying two filters.
Allow an undo option, where selecting undo will undo the most recent filter and replace the current list of records with the previous list. This could be handy if you wanted to search for one state, undo, and search for a second, without reloading the entire file. For an extra, extra challenge, allow multiple levels of undo, e.g., allow users to undo up to the previous three filters.
Add an exclude filter option where you can return a list of records that do not match a query. For example, if you wanted to remove candidates that had dropped out of the race already, you could apply a filter on campaign ID that returns all records except a given campaign ID to remove a candidate.
Allow a default menu option so that users can e.g. load a default filename simply by pressing enter instead of typing a number. You can choose the default name. Your program should still allow users to select a file other than the default.
Answer the Questionnaire
Questions-09.txt file in your
and answer the questions in that file.
Once you’re done with that, run
Turning in your labs….
Remember to run
handin21 to turn in your lab files!
You may run
handin21 as many times as you want. Each time it will
turn in any new work. We recommend running
you complete each program or after you complete significant
work on any one program.