Week 8, Wednesday: Searching

binary search

Here is an example of the binary search:

****** looking for g in:

['a', 'b', 'b', 'f', 'g', 'g', 'h', 'j', 'm', 'p', 'q', 'r', 'u', 'u', 'v', 'v']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
  L                                  M                                       H  

 Low:  0
High: 15
 Mid: (low+high)/2 = 15/2 =  7

g < j, so high = mid - 1 = 6



['a', 'b', 'b', 'f', 'g', 'g', 'h', 'j', 'm', 'p', 'q', 'r', 'u', 'u', 'v', 'v']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
  L              M              H                                               

 Low:  0
High:  6
 Mid: (low+high)/2 = 6/2 =  3

g > f, so low = mid + 1 = 4



['a', 'b', 'b', 'f', 'g', 'g', 'h', 'j', 'm', 'p', 'q', 'r', 'u', 'u', 'v', 'v']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
                      L    M    H                                               

 Low:  4
High:  6
 Mid: (low+high)/2 = 10/2 =  5

g found at index 5.

And here is an example of what happens when the item we are looking for is NOT in the list:

****** looking for x in:

['c', 'e', 'g', 'i', 'k', 'k', 'l', 'o', 'o', 'r', 'r', 's', 's', 'z', 'z', 'z']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
  L                                  M                                       H  

 Low:  0
High: 15
 Mid: (low+high)/2 = 15/2 =  7

x > o, so low = mid + 1 = 8



['c', 'e', 'g', 'i', 'k', 'k', 'l', 'o', 'o', 'r', 'r', 's', 's', 'z', 'z', 'z']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
                                          L              M                   H  

 Low:  8
High: 15
 Mid: (low+high)/2 = 23/2 = 11

x > s, so low = mid + 1 = 12



['c', 'e', 'g', 'i', 'k', 'k', 'l', 'o', 'o', 'r', 'r', 's', 's', 'z', 'z', 'z']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
                                                              L    M         H  

 Low: 12
High: 15
 Mid: (low+high)/2 = 27/2 = 13

x < z, so high = mid - 1 = 12



['c', 'e', 'g', 'i', 'k', 'k', 'l', 'o', 'o', 'r', 'r', 's', 's', 'z', 'z', 'z']
  0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15  
                                                              LMH               

 Low: 12
High: 12
 Mid: (low+high)/2 = 24/2 = 12

x > s, so low = mid + 1 = 13



 ------ Low (13) > High (12) ------- 

x is not in the list.

Notice that the algorithm continues until either the item is found, or the low and high indecies go past each other (low > high).

Here is the pseudo-code for binary search:

set low = lowest-possible index
set high = highest possible index
LOOP:
    calculate middle index = (low + high)/2
    if item is at middle index, we are done (found it! return True)
    elif item is < middle item,
      set high to middle index - 1
    elif item is > middle item,
      set low to middle index + 1
    if low is greater than high, item not here (done, return False)

Write the binarySearch(x, L) function in searches.py and add some test code to make sure it works.

testing with assert

The assert statement is one way to add test code to your modules and libraries. Here is an example of adding assert statements to our searches.py file:

if __name__ == "__main__":

  N = 100
  L = range(N)
  assert linearSearch(N/2, L) == True
  assert binarySearch(N/2, L) == True
  assert linearSearch(N*2, L) == False
  assert binarySearch(N*2, L) == False
  assert linearSearch(2, []) == False
  assert binarySearch(2, []) == False

When you run that, there will be no output if all tests pass. If one of the tests fails, you will see an AssertionError message.

analysis of algorithms

One simple way to compare algorithms is to count how many steps each requires, or measure how long each takes to run on the same computer with the same data. In addition to this, it is often useful to know how the number of steps (or the runtime) will change as the size of the data increases. For instance, if we are searching for an item in a list, will the runtime double if we double the size of the list? If so, we call this a linear algorithm, since the time needed to search the list is directly proportional to the size of the list.

How many steps will linearSearch take, for a list with N items in it? How about binarySearch? Let's look at each and compare the algorithms.

For a linear search, we need to, at worst, check each item in the list. There's just a simple compare (if item == x) done for each item in the list. So, for the worst case, there will be N compares or N steps needed. If the item we are looking for is the first item in the list, then only 1 compare is needed, but that is the best possible case. A better measure would be the average case, requiring N/2 compares. Still, this clearly depends on the size of the list, N. So, if we double the size of the list, linear search will require, on average, or in the worst case, twice as many steps.

Since binary search splits the list in half each time, there are far fewer compares or steps. Given a list of size N=16, it is not hard to see there will at worst be 5 compares (just think about splitting 16 in half five times: 16-8-4-2-1). Mathematically, this can be computed with log-base-2 of N. If the size of the list is now doubled to N=32 items, there is only one additional step required! (log-base-2 of 32 is 6, log-base-2 of 64 is 7, etc).

In python, you can explore the base-2 logs with the following:

>>> from math import *
>>> log(16,2)
4.0
>>> log(32,2)
5.0
>>> log(64,2)
6.0
>>> log(1000000,2)
19.931568569324174
>>> log(2000000,2)
20.931568569324174

So, for one-million items in a sorted python list, binary search will require at most, about 20 steps. If we double the size of the list to two-million, binary search now requires only about 21 steps. Notice how this is very different from the linear search! A linear search would require, at worst, two-million steps, compared to binary searches 21 steps.

To summarize, we say linear search is an O(N) algorithm, and binary search is an O(logN) algorithm. The big-O notation just shows how each algorithm behaves versus the size of the data (N). Here are some other examples we will see in this class:

O(logN)     one extra step, each time size of data doubles
O(N)        number of steps doubles, each time size of data doubles
O(NlogN)    
O(N*N)      number of steps quadruples, each time size of data doubles

And here is a simple plot of each, comparing the number of steps or runtime versus the size of the data:

python tutor image

your turn

How many steps for each of the following? And what happens to the number of steps when n doubles to 2n?

#############################################
# 1
n = int(raw_input("n: "))
for i in range(n):
  print i
#############################################
# 2
n = int(raw_input("n: "))
for i in range(100):
  print i*n
#############################################
# 3
n = int(raw_input("n: "))
for i in range(n):
  print i
for j in range(n):
  print j
#############################################
# 4
n = int(raw_input("n: "))
for i in range(n):
  for j in range(n):
    print i, j
#############################################
# 5
n = int(raw_input("n: "))
for i in range(n):
  for j in range(i,n):
    print i, j
#############################################
# 6
n = int(raw_input("n: "))
for i in range(n):
  for j in range(10):
    print i, j
#############################################
# 7
n = int(raw_input("n: "))
while n > 1:
  print n
  n = n/2
#############################################
# 8
L = [1,2,5,7,13,21,24,25,26,33,34,38,50,57,58,63]
n = len(L)
mid = n/2
print L[mid]
#############################################
# 9
n = int(raw_input("n: "))
for i in range(n):
  k = n
  while k > 1:
    print i, k
    k = k/2
#############################################