Week 7: using files and top-down design

Using files

Conceptually, a file is a sequence of data that is stored in the secondary memory of the computer (for example, on its disk). You can think of a file as a very long string of text (even if it contains numbers).

A file is typically made up of many lines. A special newline character, written "\n" in Python, is stored at the end of each line in a file. You do not see it as a character when you look at the file in an editor; it appears only as the break between lines.

In order to experiment with files, we will be using a file named students.txt. When you do update21 this file will be copied to your cs21/inclass/w07 directory. I downloaded this file from the Registrar’s roster for this class.

Open students.txt in Atom. Notice that each line contains a last name, a first name, and a class year. Each piece of data on a line is separated by a comma. A file with this structure is known as a CSV file, which stands for Comma-Separated Values.

We’d like to be able to read in this file and store the data it contains in a program. Then we could use this data to compute things such as the number of first year students in the class.

Opening and Closing files

In order to use a file in a program you must first open it. This allows the computer to set up a link between the file on disk and the program. When you are done with a file you should close it.

To open a file do:

<fileVariable> = open(<filename>, <mode>)

The <filename> is a string. If the file is stored in the same directory as the program, then you just need to provide the name of the file. If the file is stored in some other directory, then you need to give the path as well as the name.

The <mode> is also a string. It indicates how you plan to use the file. Some possible modes are "r" for reading and "w" for writing.

To close a file do:

<fileVariable>.close()

Notice that files must be objects because we are using the dot notation to do operations on them.
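As a minimal sketch of the open, use, close pattern, the snippet below writes a tiny file and then reads it back. The file name demo.txt and its contents are made up for illustration, and the file is placed in the system's temporary directory (an assumption, so that nothing in your own directory is touched). It also uses the file methods write() and read(), which are not covered in these notes.

```python
import os
import tempfile

# hypothetical file name, stored in the system's temporary directory
path = os.path.join(tempfile.gettempdir(), "demo.txt")

# mode "w" opens the file for writing (creating it if needed)
fp = open(path, "w")
fp.write("hello\n")
fp.write("world\n")
fp.close()

# mode "r" opens the same file for reading
fp = open(path, "r")
contents = fp.read()   # read the whole file as one long string
fp.close()

print(repr(contents))  # repr() makes the newline characters visible

# clean up the demo file
os.remove(path)
```

Notice that the whole file comes back as a single string, with a "\n" marking the end of each line.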

Reading from a file

There are a number of different ways that you can read the data from a file. One of the most straightforward ways is to use a for loop to read in one line at a time until you reach the end of the file.

fp = open("students.txt", "r")
for line in fp:
   print(line)
fp.close()

When you run this snippet of code, you’ll see that each line of the file is printed with a blank line in between. This is because each line read from the file still ends with the special newline character, and the print function adds a second newline character to the output.
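One way to convince yourself that the newline really is there is to look at a line with repr(), which shows special characters explicitly. This is only a sketch: the string below is a made-up stand-in for one line read from the file.

```python
# a made-up line, as it would look when read from the file
line = "Meeden,Lisa,1985\n"

# repr() shows the string with special characters made visible
print(repr(line))

# the last character of the line is the newline
endsWithNewline = (line[-1] == "\n")
print(endsWithNewline)

# end="" tells print not to add a newline of its own,
# so only the newline stored in the line is printed
print(line, end="")
```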

Stripping off white space

Typically, when reading data from a file we want to remove the newline characters from the end of each line. Each line is of type string, and strings have some useful methods we can use. One is called strip(). This method will remove any whitespace (tabs, spaces, and newlines) from both the front and end of a given string. For example:

>>> s = "    this is a test   \n\n\n  "
>>> s.strip()
'this is a test'

We can update our previous code to incorporate this:

fp = open("students.txt", "r")
for line in fp:
   line = line.strip()
   print(line)
fp.close()

Now when we run this version of the code, no extra blank lines will be printed.
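One detail in the code above is easy to miss: the result of strip() is assigned back to line. Strings cannot be changed in place, so strip() returns a new string and leaves the original alone. A short sketch, using a made-up string:

```python
s = "  some text \n"

# strip() returns a new string; s itself is not modified
t = s.strip()

print(repr(s))   # the original still has its whitespace
print(repr(t))   # the stripped copy does not
```

If you call s.strip() without saving the result, the stripped string is simply thrown away.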

Splitting data

Each line of data is made up of three pieces of information: last name, first name, and class year. We’d like to be able to work with these individual pieces of data rather than a single string. We can use the string method split() to break the long string up into these pieces. The split() method splits on whitespace (spaces, tabs, and newlines) unless you give it a different string to split on. We want to split on a comma. For example:

>>> line="Meeden,Lisa,1985"
>>> line.split(",")
['Meeden', 'Lisa', '1985']

Notice that split() returns a list of the items that were split apart from the given string.

We can update our code to incorporate this:

fp = open("students.txt", "r")
for line in fp:
   line = line.strip()
   studentData = line.split(",")
   print(studentData)
fp.close()
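To see the difference between calling split() with and without an argument, here is a short sketch using the same sample line as above. The variable names are chosen just for this illustration.

```python
line = "Meeden,Lisa,1985"

# with no argument, split() splits on whitespace; this line
# contains none, so the result is a list with a single item
noArg = line.split()

# splitting on a comma gives the three separate fields
fields = line.split(",")

# each field can be pulled out by its index
lastName = fields[0]
year = fields[2]

print(noArg)
print(fields)
print(lastName, year, type(year))
```

Every item in the list is still a string, including the class year.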

Casting data

When we read data from a file it always starts out as type string, but often the data represents numbers. In our student roster example, the class year is an integer, and we would like to store it as an int. After you have split the data apart, you can cast the individual pieces within the list to the correct type.

fp = open("students.txt", "r")
for line in fp:
   line = line.strip()
   studentData = line.split(",")
   # cast the class year to be of type int
   studentData[2] = int(studentData[2])
   print(studentData)
fp.close()
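Here is a small sketch of why the cast matters, using one made-up student list. Before the cast, the year is the string "1985", which is never equal to the integer 1985 and cannot be used in arithmetic with integers.

```python
studentData = ["Meeden", "Lisa", "1985"]

# a string is never equal to an int, even if it looks the same
beforeCast = (studentData[2] == 1985)

# cast the class year to be of type int
studentData[2] = int(studentData[2])
afterCast = (studentData[2] == 1985)

# arithmetic on the year now works
yearsBefore2000 = 2000 - studentData[2]

print(beforeCast, afterCast, yearsBefore2000)
```

Keep in mind that int() raises a ValueError if the string does not look like an integer, so this approach assumes every line of the file is well formed.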

Storing data into a list of lists

We would like to gather all of the student data together into a single data structure. We will use a list of lists. Each inner list will represent a single student, and the entire list will represent all of the students in the class.

The structure of the list will be:

roster =
[ [student0_lastName, student0_firstName, student0_year],
  [student1_lastName, student1_firstName, student1_year],
  ...
]

We know how to use indexing to access the items within a list. So to access items in a list of lists like this we will need to use double indexing. For example, roster[0] gives us the list representing student0, and roster[0][2] gives us the class year of student0.
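A quick sketch of double indexing, using a tiny roster typed in by hand. The second student is invented for this example.

```python
# a made-up roster with two students
roster = [["Meeden", "Lisa", 1985],
          ["Doe", "Jane", 2026]]

student0 = roster[0]       # the inner list for student 0
year0 = roster[0][2]       # class year of student 0
first1 = roster[1][1]      # first name of student 1

print(student0)
print(year0)
print(first1)

# a for loop visits one inner list (one student) at a time
for student in roster:
    print(student[1], student[0], student[2])
```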

We want to accumulate a list. So we will need to modify our previous code to initialize an empty list before the start of the for loop, and to append() to that list each time through the loop like this:

fp = open("students.txt", "r")
# create an empty list to accumulate with
roster = []
for line in fp:
   line = line.strip()
   studentData = line.split(",")
   studentData[2] = int(studentData[2])
   # add the current student onto the list
   roster.append(studentData)
fp.close()
# all students are now gathered together in a single list
print(roster)

Function to read in file and return a list

Now we can put all of this together into a single function that, when given the filename of a file containing student data, will return a list of all of the students in that class.

def readFile(filename):
    """
    Parameters:
        filename: a string representing the name of a roster file to read
        from, where each line is in CSV format (lastName,firstName,classYear)
    Returns:
        list of lists containing the data within the file
    """
    fp = open(filename, "r")
    roster = []
    for line in fp:
        line = line.strip()
        studentData = line.split(",")
        studentData[2] = int(studentData[2])
        roster.append(studentData)
    fp.close()
    return roster
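For comparison, here is a sketch of a variation that uses Python's with statement, which closes the file automatically when the indented block ends. This is an alternative, not the version used in these notes; the name readFileWith, the sample file name, and the second student are all invented for the example. The sample file is written to the system's temporary directory so the function can be tried without students.txt.

```python
import os
import tempfile

def readFileWith(filename):
    """
    Does the same job as readFile, but uses a with statement so the
    file is closed automatically (no explicit close() call).
    """
    roster = []
    with open(filename, "r") as fp:
        for line in fp:
            studentData = line.strip().split(",")
            studentData[2] = int(studentData[2])
            roster.append(studentData)
    return roster

# build a tiny made-up roster file to try the function on
path = os.path.join(tempfile.gettempdir(), "sample_roster.txt")
fp = open(path, "w")
fp.write("Meeden,Lisa,1985\n")
fp.write("Doe,Jane,2026\n")
fp.close()

roster = readFileWith(path)
print(roster)

# clean up the sample file
os.remove(path)
```

Either style works. The explicit open() and close() calls make each step visible, while with makes it harder to forget to close the file.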

Exercise 1

Write a function called countYear that takes in the roster of students and a class year and returns the number of students who will graduate in that year.

Exercise 2

Write a function called findStudent that takes in the roster of students and a first name and prints out all of the students with that name in the roster. If there are no students with that name, a message stating this should be printed.

Top-Down Design

For every lab so far, we have told you exactly which functions to write and what their interface should be (what parameters they take in and what they return). It is now time for you to start figuring this out on your own. But trying to solve large problems can be very overwhelming, so we need a strategy to help us approach this in a systematic way. The strategy we will use is called top-down design.

The basic idea is to start with the problem and try to express the solution in terms of smaller problems. Then each smaller problem is handled in the same way until eventually the problems get so small that they are easy to solve.

Our goal is to express the solution in terms of functions.

We will begin by developing the interface for each function:

  • Determine the parameters

  • Determine the return value (some functions may not have one)

  • Stub out the function

    • Explain how it works in the triple quoted comment

    • The function should print a message saying it has been called

    • The function should return a dummy value of the right type

    • Do not implement the function!

Next write the main() program. This should be fully fleshed out. It should call the stubbed out functions in the proper sequence in order to solve the problem. For example, if the main() program needs a for loop with an if statement inside of it, then you should fill in all of this structure.

The stubbed out program should be syntactically correct and executable. It won’t actually do anything yet, but it should be clear how the solution is structured.

This is similar to the process you might go through when you are developing the outline of a paper that you need to write.
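To make the idea of a stub concrete, here is a small sketch. The function averageYear is hypothetical (it is not part of any lab); the point is only to show the three pieces of a stub: a triple-quoted comment, a message saying it was called, and a dummy return value of the right type.

```python
def averageYear(roster):
    """
    Purpose:
        computes the average class year of the students in the roster
    Parameters:
        roster: list of lists, one inner list per student
    Returns:
        float representing the average class year
    """
    print("Calling averageYear...")
    return 0.0   # dummy value of the right type, not a real answer

def main():
    # made-up data, just so the stub has something to be called with
    roster = [["Meeden", "Lisa", 1985]]
    avg = averageYear(roster)
    print("Average class year:", avg)

main()
```

The program runs without errors even though averageYear does no real work yet, which is what lets you check the overall structure before filling in the details.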

Flashcards example

Flashcards are a common way for students to practice a foreign language. Let’s create a program that reads in a file (like the one called french.txt shown below) and produces a quiz for the user.

dire,to say
aller,to go
savoir,to know
trouver,to find
donner,to give
comprendre,to understand
parler,to speak
penser,to think
entendre,to listen
vivre,to live

Here’s what a sample run of the program might look like:

% python3 flashcards.py

Flashcard file? french.txt
==============================
comprendre: to learn
Nope... comprendre means to understand
------------------------------
savoir: to know
Correct!
------------------------------
dire: to say
Correct!
------------------------------
donner: to give
Correct!
------------------------------
penser: to think
Correct!
------------------------------
entendre: to listen
Correct!
------------------------------
aller: to go
Correct!
------------------------------
vivre: to live
Correct!
------------------------------
trouver: to look
Nope... trouver means to find
------------------------------
parler: to speak
Correct!
------------------------------
You got 8 out of 10
Well done.
Go again (y/n)? n
Bye!

Notice that the program tracks how many answers the user got correct. It should also give the user the chance to go through the cards again. Each time through, the cards should be reshuffled.

Let’s develop a top-down design for solving this problem.

Flashcards design

Here is one possible design for solving the Flashcards problem. Notice that the main program is fully fleshed out, but that every other function is just a stub. However, this program is syntactically correct and will run without errors, though it doesn’t do much of anything yet.

from random import shuffle

def main():
    filename = input("Flashcard file? ")
    cards = readCardData(filename)
    done = False
    while not done:
        shuffle(cards)
        correct = testUser(cards)
        reportResults(correct, len(cards))
        if not goAgain():
            done = True
    print("Bye!")

def readCardData(filename):
    """
    Purpose:
        reads in the flashcard data from the given filename
    Parameters:
        filename: a string representing the name of a file containing the
        flashcard data
    Returns:
        list of lists storing a series of prompts and answers to test on
    """
    print("Calling readCardData...")
    return [[]]

def testUser(listOfCards):
    """
    Purpose:
        presents all of the flashcards to the user one time and keeps track
        of how many correct answers they provide
    Parameters:
        listOfCards: list of lists storing the prompts and answers
    Returns:
        int representing the number of correct answers
    """
    print("Calling testUser...")
    return 0

def reportResults(correct, totalPossible):
    """
    Purpose:
        summarizes the user's results on one run through the flashcards
    Parameters:
        correct: int representing the number of correct answers
        totalPossible: int representing the number of questions
    Returns:
        None, just prints a message about the results
    """
    print("Calling reportResults...")
    print("results", correct, totalPossible)

def goAgain():
    """
    Purpose:
    Prompts the user to find out if they want to try again
    Returns:
    Either True or False
    """
    print("Calling goAgain...")
    return False

main()
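Once the design is settled, each stub can be filled in one at a time. As a hedged sketch of what that might look like, here is one possible way to implement readCardData, following the same pattern as the roster-reading function from earlier in these notes. It assumes every line has exactly one comma separating the prompt from the answer. The sample file name is made up, and the file is written to the system's temporary directory just for this demonstration.

```python
import os
import tempfile

def readCardData(filename):
    """
    Purpose:
        reads in the flashcard data from the given filename
    Returns:
        list of lists, where each inner list is [prompt, answer]
    """
    cards = []
    fp = open(filename, "r")
    for line in fp:
        line = line.strip()
        # assumes one comma per line: prompt,answer
        cards.append(line.split(","))
    fp.close()
    return cards

# build a tiny sample file using two lines from french.txt
path = os.path.join(tempfile.gettempdir(), "sample_cards.txt")
fp = open(path, "w")
fp.write("dire,to say\n")
fp.write("aller,to go\n")
fp.close()

cards = readCardData(path)
print(cards)

# clean up the sample file
os.remove(path)
```

Because main() only depends on the interface (a filename goes in, a list of lists comes out), this function could replace the stub without changing anything else in the program.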