Week 10: Sorting

Announcements

  • Quiz 4 on Friday.

  • Lab 8 is available now.

Week 10 Topics

  • Continue analyzing / implementing binary search.

  • Sorting.

Monday

To explore the complexity of search, we will narrow the problem to searching for an item x in a Python list of items. Python already has two ways of doing this. The first is the Boolean in operator: x in ls returns True if x is in the list ls and False otherwise. Python also supports the index() method, which returns the position of the first occurrence of x, provided x appears in the list. So ls.index(x) returns an integer position if x is in ls. If x is not in the list, Python raises an error called an exception (here, a ValueError), which will likely crash your program since we have not talked much about how to handle exceptions in this class.

But how do these methods actually work? At some point, a computer scientist and Python programmer designed and wrote the code for these built-in features. We will discuss the algorithm for searching a collection of items. Together, we’ll write functions that do something familiar: search through a list for an item without using the in operator. Our functions will be called contains and position_of.
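As a starting point, here is a minimal sketch of what contains and position_of might look like using linear search. The function names come from the notes above; returning -1 when the item is missing is an assumption made here (it matches the binary search pseudocode later on), not necessarily what we will write in class.

```python
def contains(ls, x):
    """Return True if x appears in ls, False otherwise (linear search)."""
    for item in ls:
        if item == x:
            return True
    return False


def position_of(ls, x):
    """Return the index of the first occurrence of x in ls.

    Assumption: return -1 (instead of raising an exception) if x is absent.
    """
    for i in range(len(ls)):
        if ls[i] == x:
            return i
    return -1
```

Both functions may have to look at every item before giving an answer, so in the worst case they take a number of steps proportional to the length of the list.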

Examples

To help illustrate the behavior of binary search, I’ve added a worksheet, searchWorksheet.txt, to the inclass/w10 directory. Before we finish our implementations, let’s look at a couple of examples.

A key algorithmic question is: can we do better? In some cases, we can’t (and we can prove this!). In the case of linear search, we cannot do better in the general case. However, if all items in the collection are in sorted order, we can perform a faster algorithm known as binary search.

Binary search works using a divide-and-conquer approach. Each step of the algorithm cuts the number of items left to search in half, until it either finds the value or has no items left to search. Here is some pseudocode for the algorithm:

set low = lowest possible index
set high = highest possible index
LOOP while low <= high:
    calculate middle index = (low + high) // 2
    if item is at middle index, we're done (found it! return matching index)
    elif item is < middle item,
      set high to middle - 1
    elif item is > middle item,
      set low to middle + 1
if the loop ends because low is greater than high, item not here (done, return -1)
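The pseudocode above translates fairly directly into Python. The sketch below is one possible version; the function name binary_search is chosen here for illustration, and it assumes ls is already sorted in increasing order.

```python
def binary_search(ls, x):
    """Return an index of x in the sorted list ls, or -1 if x is not present.

    Assumes ls is sorted in increasing order.
    """
    low = 0
    high = len(ls) - 1
    while low <= high:
        mid = (low + high) // 2
        if ls[mid] == x:
            return mid          # found it
        elif x < ls[mid]:
            high = mid - 1      # x can only be in the left half
        else:
            low = mid + 1       # x can only be in the right half
    return -1                   # low passed high: x is not in the list
```

Because each pass through the loop discards half of the remaining items, a list of N items needs only about log2(N) passes in the worst case.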

Search Timing Comparison

I’ve implemented both linear search and binary search in a small program that measures their performance (in terms of wall-clock time). I’ll share these files with you (timeSearch.py and timeBinarySearch.py), and we’ll run them to see how long searching takes as we vary the size of the input list.
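If you want to try a comparison of your own, here is a rough sketch of how such a timing experiment could be set up. It is not the contents of timeSearch.py or timeBinarySearch.py; the helper average_time and the list sizes are made up for illustration.

```python
import time


def linear_search(ls, x):
    """Return the index of x in ls, or -1 if absent (checks every item)."""
    for i in range(len(ls)):
        if ls[i] == x:
            return i
    return -1


def binary_search(ls, x):
    """Return an index of x in the sorted list ls, or -1 if absent."""
    low, high = 0, len(ls) - 1
    while low <= high:
        mid = (low + high) // 2
        if ls[mid] == x:
            return mid
        elif x < ls[mid]:
            high = mid - 1
        else:
            low = mid + 1
    return -1


def average_time(search_fn, ls, x, trials=20):
    """Hypothetical helper: average wall-clock seconds per call."""
    start = time.perf_counter()
    for _ in range(trials):
        search_fn(ls, x)
    return (time.perf_counter() - start) / trials


for n in [1000, 10000, 100000]:
    ls = list(range(n))   # already sorted, so binary search is allowed
    target = -1           # not in the list: the worst case for both searches
    t_lin = average_time(linear_search, ls, target)
    t_bin = average_time(binary_search, ls, target)
    print(f"n={n:>7}  linear: {t_lin:.7f}s  binary: {t_bin:.7f}s")
```

The exact times depend on your machine, but the linear search time should grow roughly in proportion to n while the binary search time barely changes.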

Wednesday

Sorting

Sorted input is a prerequisite for binary search. While Python does have built-in support for sorting, let’s pretend it doesn’t in an attempt to better understand how sorting algorithms work.

Suppose you have an unsorted pile of exams, and you need to sort them alphabetically by name before you can enter them in a grading spreadsheet. For example, let’s say your exams are in this order:

Index  Name
0      Lisa
1      Andy
2      Tia
3      Rich
4      Vasanta
5      Ameet
6      Kevin
7      Lila
8      Joshua
9      Zach
10     Xiaodong

Imagine these names are in a list, where "Lisa" is at index 0, "Andy" is at index 1, etc.

  • Can you come up with an algorithm to sort them? Note: your algorithm can’t "just look at the name" to determine where an exam goes, it must systematically compare exams until every exam is in the right place.

  • What is the worst-case number of comparisons your algorithm would take?

Swap

When humans look at this problem, we may be able to look at multiple items at once and move large groups of items that are already in sorted order to the proper location. This idea is hard to express algorithmically in a programming language, however, so instead we focus on a simpler operation that swaps only two elements at a time.

The swap(lst, i, j) function will swap two items at positions i and j in a list lst. Once we have this small helper function working, we can use it to implement some sorting routines by comparing two elements at a time using Boolean relational operators, and calling swap() if the elements are out of order. For example:

#assume i < j and we want to sort items in increasing order

if lst[i] > lst[j]:    #if items are out of order
    swap(lst, i, j)

In general, we will need to perform several swaps to sort an arbitrary list.
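Here is a minimal sketch of the swap helper described above, followed by a small usage example. The temporary variable name temp and the short names list are choices made here for illustration.

```python
def swap(lst, i, j):
    """Swap the items at positions i and j of lst, in place."""
    temp = lst[i]      # save one item so it is not overwritten
    lst[i] = lst[j]
    lst[j] = temp


names = ["Lisa", "Andy", "Tia"]
if names[0] > names[1]:    # "Lisa" > "Andy", so these are out of order
    swap(names, 0, 1)
print(names)               # ['Andy', 'Lisa', 'Tia']
```

Note that swap does not return anything: lists are mutable, so the caller sees the change made inside the function.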

Selection Sort

The key idea of selection sort is that it repeatedly selects the minimum item remaining in the unsorted portion of the list and then swaps it into its final sorted location.

Consider how selection sort would operate on the following list:

[5, 3, 8, 0, 4, 1, 7]
 0  1  2  3  4  5  6

When it begins, the entire list is unsorted. Therefore, the goal is to get the minimum item in the list into position 0. To do this it keeps track of the indexOfMin. It always starts this variable at the first index of the unsorted portion of the list, and updates it whenever it finds a smaller item. In this way, we can do a linear search to find the index of the smallest item in the unsorted portion of the list.

The indexOfMin starts at 0. Since 3 < 5, indexOfMin is updated to location 1. Since 8 is not < 3, indexOfMin is unchanged. Since 0 < 3, indexOfMin is updated to location 3. All other locations in the list will be checked to see if their contents are < 0, and they aren’t.

After the entire unsorted portion of the list has been checked, one swap is done between the item at location indexOfMin and the item at the beginning of the unsorted portion. The ordering in the list then becomes:

[0, 3, 8, 5, 4, 1, 7]
 0  1  2  3  4  5  6

I will redraw this to make it clear where the division between the sorted and unsorted portions of the list occurs.

[0,     ||| 3, 8, 5, 4, 1, 7]
 0          1  2  3  4  5  6
 sorted     unsorted

Next, selection sort will repeat the process described above. It will find that the indexOfMin is now 5, and it will swap the items at locations 1 and 5, giving us this new list ordering:

[0, 1, ||| 8, 5, 4, 3, 7]
 0  1      2  3  4  5  6
 sorted    unsorted

Notice that the sorted portion of the list is growing by one more item each iteration through selection sort.

What would the list look like after the next iteration?

[0, 1, 3, ||| 5, 4, 8, 7]
 0  1  2      3  4  5  6
 sorted       unsorted

Selection sort can be implemented as a loop within a loop, where the inner loop is used to find indexOfMin, and each iteration of the outer loop identifies the smallest element in the unsorted portion of the list and swaps it into place.
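The loop-within-a-loop structure described above can be sketched as follows. This is one possible version, not necessarily the one we will write together; the function name selection_sort and the loop variable names are choices made here, while indexOfMin and swap follow the notes.

```python
def swap(lst, i, j):
    """Swap the items at positions i and j of lst, in place."""
    lst[i], lst[j] = lst[j], lst[i]


def selection_sort(lst):
    """Sort lst in place in increasing order using selection sort."""
    n = len(lst)
    # start marks the first index of the unsorted portion
    for start in range(n - 1):
        indexOfMin = start
        # inner loop: linear search for the smallest unsorted item
        for i in range(start + 1, n):
            if lst[i] < lst[indexOfMin]:
                indexOfMin = i
        # one swap per pass moves the minimum into its final location
        swap(lst, start, indexOfMin)


values = [5, 3, 8, 0, 4, 1, 7]
selection_sort(values)
print(values)   # [0, 1, 3, 4, 5, 7, 8]
```

The outer loop stops one position before the end because once every other item is in place, the last item must be too.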

How many steps does it take to run selection sort?

We need to first find the indexOfMin in a list of N elements. Finding the indexOfMin is a linear search, so it takes N steps. Then we need to find indexOfMin in the unsorted portion of the list — a list of (N-1) elements. This takes (N-1) steps. Finding the next minimum element takes (N-2) steps, and so on.

The total number of steps that selection sort takes is N + (N-1) + (N-2) + ... + 2 + 1 = N(N+1)/2, which is O(N^2) steps.
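To get a feel for how quickly this sum grows, the short sketch below adds it up directly and compares it with the closed form N(N+1)/2. The function name selection_sort_steps is made up for this example, and it counts steps the same way the paragraph above does.

```python
def selection_sort_steps(n):
    """Add up n + (n-1) + ... + 2 + 1, one term per pass of selection sort."""
    total = 0
    for remaining in range(n, 0, -1):
        total += remaining
    return total


for n in [10, 100, 1000]:
    print(n, selection_sort_steps(n), n * (n + 1) // 2)
# 10 55 55
# 100 5050 5050
# 1000 500500 500500
```

Making the list 10 times longer makes the step count roughly 100 times larger, which is the signature of quadratic growth.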

Friday

Bubble Sort

Another approach to sorting a list is to scan over the list and compare only adjacent elements (the elements at index i and i+1). If they are out of order, we swap them. One scan over the list may not be enough, so we perform multiple scans until the list is sorted. We know the list is sorted when a scan over it performs zero swaps.

keepGoing = True
while keepGoing == True:
    #Optimistically assume we are done
    keepGoing = False
    for each adjacent pair of items:
        if out of order:
            swap
            keepGoing = True   #another scan is needed
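The pseudocode above can be turned into Python along these lines. This is a minimal sketch: the function name bubble_sort is chosen here, and the swap is written inline rather than calling a helper.

```python
def bubble_sort(lst):
    """Sort lst in place in increasing order using bubble sort."""
    keepGoing = True
    while keepGoing:
        # Optimistically assume we are done
        keepGoing = False
        # compare each adjacent pair: (0, 1), (1, 2), ..., (n-2, n-1)
        for i in range(len(lst) - 1):
            if lst[i] > lst[i + 1]:
                lst[i], lst[i + 1] = lst[i + 1], lst[i]
                keepGoing = True   # a swap happened, so another scan is needed


values = [5, 3, 8, 0, 4, 1, 7]
bubble_sort(values)
print(values)   # [0, 1, 3, 4, 5, 7, 8]
```

This version rescans the whole list every time. A common refinement is to stop each scan one position earlier than the last, since every scan leaves the largest remaining item at the end of the portion it scanned.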

Like selection sort, bubble sort is implemented as a loop within a loop. If each scan skips the items that earlier scans have already moved into place at the end of the list, the worst-case amount of work can be described by the summation (n-1) + (n-2) + ... + 1, which is O(n^2). So unfortunately, bubble sort does not improve on the worst-case running time of selection sort.