CS21B Lab 7: Protein Alignment and Similarity

Top-down design due 11:59pm Tuesday, March 20
Implmentation due 11:59pm Tuesday, March 27

Over the next two weeks you will write a program that aligns protein sequences. For the first week , you will focus on using top-down design to create theoverall structure of the program. Once your proposed structure has been reviewed by your professor, you will use bottom-up implementation and unit testing to incrementally implement and test the full program as part of the final submission.

Your program should be divided into a modular set of functions and the main function should be clear and concise. You should test each function individually as you write it. Remember that any data you need to persist throughout the program should be saved in main and then passed into functions that need access to it.

A skeleton version of the program alignment.py and a file of test data called fakeData.txt will appear in cs21/labs/07 when you run update21. The handin21 program will only submit files in this directory. Please note that you will need to email me your top-down design as soon as it is finished. See details below.

Read through the entire lab to get a basic understanding of what you are trying to accomplish.

Overview

A central theme in biology is the study of the shared molecular mechanims across a variety of species. Evolutionary theory infers that all organisms have a common ancestory. Furthermore, closely related species share a more recent common ancestory that encapsulated many of their shared features. Thus, any set of species is related, and this relationship is termed phylogeny. One way to infer phylogeny is to study sequences of these species under the hypothesis that more closely related species will share a common blueprint for their respective architectures. Closely related species have fewer mutations, or differences, when you compare their protein sequences than distantly related species. We can represent this complicated phylogeny of a set of species using a phylogentic tree where the common ancestory for two species is represented by the merging of their two respective brances in the evolutionary tree. In this assignment, you will compare the amino acid sequence of the Cytochrome c protein as expressed in a wide range of organisms. Cytochrome c is a highly conserved protein, which means that the amino acid sequence of the protein is very similar across different organisms. To compare the amino acid sequence in one organism to the amino acid sequence in another organism, you will align the protein sequences and use a substitution matrix to score the alignment. Sequences with higher scores are judged to be more similar to one another than sequences with lower scores.

Once you are able to determine the similarity between two protein sequences, you will create a similarity matrix, which will describe the pair-wise similarities between many organisms. From this similarity matrix, you can cluster the most similar organisms together into a binary tree that provides a way of determining the evolutionary similarity between the organisms you clustered.

Here are the major steps you'll need to accomplish:

  1. Read the names and amino acid sequences of various species from a file and save them into separate lists in your program.
  2. Initialize a list of lists with zeros to store the pair-wise alignment scores of all of the species.
  3. Given two aligned amino acid sequences, determine the alignment score.
  4. Consider all possible alignments between two amino acid sequences and find the best score.
  5. Compute the scores of the best pairwise alginments for all species.
  6. Draw the tree to reveal the evolutionary similarity between species.

0. Top-down design

Before you begin implementing your solution, you must submit a top-down design to me via email. To do so, perform the following in the unix prompt:

         $ cd cs21/labs/07
         $ mail soni < alignment.py 

Once you email me your design, I will try to get comments back to you within 24-36 hours. However, if everyone emails me their design at the same time, it may take longer. You are encouraged to start your design early, so you have enough time to get our helpful comments and still implement the full program. You can email me your design any time before Tuesday, March 20 at midnight. Your design should meet the following guidelines:

Here is an example of a stubbed out program. What you email to me as your top-down design should be similarly structured.

Unit Testing

You should do incremental, bottom-up, implementation and unit testing of functions. One way to do unit testing is within the python interpreter (another is to add calls to the function you are testing from main). To do unit testing in the interpreter, import your program in the python interpreter and then calls your functions from the interpreter prompt, passing in different values to test them out. To import your program in the python interpreter without having python run it, you need to have the following at the bottom of your file:
if __name__ == '__main__': main()
Then do:
$ python
>>> import alignment
Now you can call functions directly that are in alignment.py. For example, if I have a function named printInstructions, I can call it from the python interpreter prompts like this:
>>> alignment.printInstructions()


1. Reading in the data

The file /usr/local/lib/genetics/cytochrome_c.txt contains the full data set that you will eventually need for the lab. The file /usr/local/lib/genetics/cytochrome_c-10.txt contains a subset of the full data set of just 10 species that you can use as an intermediary test of your program. The file fakeData.txt contains an even smaller set of simplified data that you should use as you initially develop your program.

Each line of these files consists of three pieces of comma separated information: a binomial name (e.g. "homo sapiens"), a common name (e.g. "human"), and the amino acid sequence for the Cytochrome c protein for that organism. Here is a sample of a few (partial) lines from the actual data:

homo sapiens,human,GDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSY....
camelus dromedarius,dromedary camel,GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPN....
tarsius bancanus,western tarsier,GDVEKGKKIFVQKCAQCHTVEKGGKHKTGXNLHG....
You will need to read each line of the file, extract each field, and accumulate the data into three lists: one storing the binomial name, one storing the common name, and one storing the amino acid sequences. For example, the list of common names from the above file would be ['human', 'dromedary camel', 'western tarsier'].

The binomial name, common name, and amino acid sequence for a particular organism should all appear at the same position. In other words, if the third item in the binomial name list is 'tarsius bancanus', then the third item in the common name list should be 'western tarsier', and the third item in the sequence list should be the amino acid sequence for the western tarsier. HINT: to return three lists from a function, you can always construct a larger list of lists e.g.:

list1 = [...]
list2 = [...]
listOfLists = [list1, list2]
return listOfLists


2. Creating and initializing the scores matrix
If you have n organisms, you will need to begin by creating an empty two-dimensional list of scores that has n rows and n columns. A two-dimensional list is another term for lists of lists. There are two levels of indexing needed to access an item. You can create such a list by doing the following:
scores = [0] * n
for i in range(n):
    scores[i] = [0] * n
This will create the two-dimensional list and initialize all of the values to 0.

Once you have created this two-dimensional list, you can store the value x in row r, column c by writing scores[r][c] = x.

3. Scoring one alignment of two amino acid sequences

Suppose that there is a very small imaginary protein protA produced by three animals: a mouse, a fly and a bee. You would like to compare protA's amino acid sequence as found in each of these animals to figure out which two animals have the most similar sequence. Here are the amino acid sequences of each of these animal's protA:

mouse : AGDVEK
fly   : AGWVEK
bee   : AGFVEK
You can see that the mouse, fly and bee all have five amino acids in common and one that is different. However, aspartic acid (Asp/D) has a small, polar side chain and tryptophan (Trp/W) has a large, non-polar side chain. Because of the structural differences of these two amino acids, it is likely that substituting an Asp for a Trp would have an impact on the functionality of the protein. On the other hand, both Trp and Phenylalanine (Phe/F) have large, non-polar side chains terminating in a carbon ring. Therefore, it is likely that substituting a Phe for a Trp would have less of an impact on the protein's function than substituting an Asp for a Trp or Phe would have.

By studying the variances in proteins across species, biochemists have developed metrics that assign a numerical value to particular amino acid substitutions. In this lab, we will use the BLOSUM80 substitution matrix to quantify the difference between two proteins. BLOSUM80 assigns a positive score to likely differences and a negative score to unlikely differences. The most likely "difference" occurs when an amino acid in one sequence is identical to the amino acid in the other sequence. Here is a representation of the BLOSUM80 matrix. Notice that substituting a D for a W (find where the row labeled D intersects the column labeled W) has a score of -6, substituting a D for an F has a score of -4, and substituting an F for a W has a score of 0.

To find out the score for substituting 'D' for 'W' in your program, be sure you have imported the blosum80 function from the genetics library, then type blosum80('D','W').

With this matrix, we can now score the differences between any two sequences. We do this by computing the sum of the scores of all the substitutions necessary to transform one sequence into the other sequence.

To determine which two animals have the most similar sequences, we need to score each pair of animals. First, let's score the fly and mouse sequence alignment:

fly   :  A  G  W  V  E  K
mouse :  A  G  D  V  E  K
score :  5  6 -6  4  6  5  --> total = 5+6-6+4+6+5 = 20
Then, let's score the fly and bee sequence alignment:
fly   :  A  G  W  V  E  K
bee   :  A  G  F  V  E  K
score :  5  6  0  4  6  5  --> total = 5+6+0+4+6+5 = 26
And finally, let's score the bee and mouse sequence alignment:
mouse :  A  G  D  V  E  K
bee   :  A  G  F  V  E  K
score :  5  6 -4  4  6  5  --> total = 5+6+0+4+6+5 = 22
From this, we see that the fly and bee are the most similar (with a score of 26). We also see that the mouse is more similar to the bee (with a score of 22) than the fly (with a score of 20).

4. Consider all possible alignments of two sequences and find the best score

The above mouse, fly, and bee example showed how to score two aligned amino acid sequences. Now we must figure out how to align two amino acid sequences. Let's look at another hypothetical (and very short) protein produced by three animals from the fakeData.txt file:

horse :  GDVAK
cow   :  PAGDAER
pig   :  AGDVA
Notice that unlike in the previous example, the amino acid sequences in this example are not all the same length. The horse and pig sequences both have 5 amino acids, but the cow sequence has 7 amino acids. How can we score the alignment between the horse and cow sequences if they have different lengths?

Let's begin by introducing the notion of a gap. Whenever one sequence has an amino acid and the other does not, we will incur a gap penalty. The BLOSUM80 matrix does not define the gap penalty. For this lab, we use a gap penalty of -6. Gaps will be represented below by a dash. When you align a gap with an amino acid, you incur the gap penalty. If you align a dash with another dash, it incurs no penalty and should be scored as 0.

For example:

horse :  G  D  V  A  K  -  -
cow   :  P  A  G  D  A  E  R
score : -3 -2 -4 -2 -1 -6 -6  --> total = -3-2-4-2-1-6-6 = -24
A score of -24 is quite bad. Notice that this is not the only possible way to align the horse and cow sequences. Consider the following alternative:
horse :  -  -  G  D  V  A  K 
cow   :  P  A  G  D  A  E  R
score : -6 -6  6  6  0 -1  2  --> total = -6-6+6+6+0-1+2 = 1
This alignment yields a much better score than the first alignment we tried.

In this lab, we will determine the similarity score between two sequences by choosing the maximum score of all possible alignments between one sequence and another. Note that we will make the simplifying assumption that gaps can only appear at the start and/or end of an amino acid sequence.

The only way to determine which alignment is the best is to consider every possible aligment. For the cow and horse, one way to view all of these alignments is shown below:

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : G  D  V  A  K  -  -  -  -  -  -  -  -  -  -  
score :-6 -6 -6 -6 -1 -6 -6 -6 -6 -6 -6  0  0  0  0  => -61  

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  G  D  V  A  K  -  -  -  -  -  -  -  -  -  
score : 0 -6 -6 -6 -1 -1 -6 -6 -6 -6 -6  0  0  0  0  => -50 

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  G  D  V  A  K  -  -  -  -  -  -  -  -  
score : 0  0 -6 -6 -3  5 -2 -6 -6 -6 -6  0  0  0  0  => -36  

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  G  D  V  A  K  -  -  -  -  -  -  -  
score : 0  0  0 -6 -2  0  0 -1 -6 -6 -6  0  0  0  0  => -27  

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  G  D  V  A  K  -  -  -  -  -  -  
score : 0  0  0  0 -3 -2 -4 -2 -1 -6 -6  0  0  0  0  => -24

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  -  G  D  V  A  K  -  -  -  -  -  
score : 0  0  0  0 -6  0 -2 -4  5  1 -6  0  0  0  0  => -12

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  -  -  G  D  V  A  K  -  -  -  -  
score : 0  0  0  0 -6 -6  6  6  0 -1  2  0  0  0  0  => 1    

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  -  -  -  G  D  V  A  K  -  -  -  
score : 0  0  0  0 -6 -6 -6 -2 -2 -3 -2 -6  0  0  0  => -33

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  -  -  -  -  G  D  V  A  K  -  -  
score : 0  0  0  0 -6 -6 -6 -6  0  1 -3 -6 -6  0  0  => -38   

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  -  -  -  -  -  G  D  V  A  K  -  
score : 0  0  0  0 -6 -6 -6 -6 -6 -3 -2 -6 -6 -6  0  => -53

cow   : -  -  -  -  P  A  G  D  A  E  R  -  -  -  -  
horse : -  -  -  -  -  -  -  -  -  -  G  D  V  A  K  
score : 0  0  0  0 -6 -6 -6 -6 -6 -6 -3 -6 -6 -6 -6  => -63

5. Computing all pairwise alignments

In order to create the binary tree to visualize the similarity between all of the organisms, you need to compute the similarity between each pair of organisms. The similarity score is symmetric, which means that similarity score between a bee and a fly is the same as the similarity score between a fly and a bee, so you don't need to compute both.

Let's first see how to store similarity scores in your scores list, by looking at an example. Assume you are comparing the dromedary camel (whose sequence should be in position 0 in your list of sequences) against the elderberry (whose sequence should be in position 10 in your list of sequences). Remember that the score of comparing one organism to another is the maximum score of all possible alignments of the amino acid sequences representing the two organisms. Once you have determined the maximum alignment score between the camel and the elderberry, you should store the score in scores[0][10], since that represents the score between organism 0 and organism 10. Also, remember that similarity scores are symmetric, so you can also store this same score in scores[10][0].

Now that you can store scores in your list, you need to do the pairwise comparisons. For each organism that you read in, you need to compare it to every other organism and store the score in the scores list. Note that this will take a while to complete, so you should start by first testing your code with the file fakeData.txt. You should obtain the following scores matrix using this data:

[[0, 1, 9], [1, 0, 4], [9, 4, 0]]

You may want to write a function to print the scores more nicely like this:

  0   1   9
  1   0   4
  9   4   0
Once your code is working on the fake data, next try using a small subset of the real data provided in the file /usr/local/lib/genetics/cytochrome_c-10.txt.

Once your program successfully handles these smaller test cases, you can try it on the full data set.

6. Drawing the binary tree

Once you have created the pairwise alignments, you can display a binary tree representing the similarity between the amino acid sequences for Cytochrome c in these organisms. You should find that organisms with more similar Cytochrome c sequences are more related than they are to organisms with more dissimilar sequences.

To draw the tree, you will use the drawTree function that you imported from the genetics library. The drawTree function takes two parameters: the two-dimensional list of similarity scores, and the list of common names (or binomial names, your preference) your read in earlier. For example:

drawTree(pairwiseScores, names)

For the data containing only 10 species your tree should look similar to the following:

Submit

Once you are satisfied with your program, hand it in by typing handin21 at the unix prompt.