CS68: Lab 7

Overview

The goal of this week's lab is to obtain experience implementing probabilistic sequence models. Specifically, you will implement a profile HMM to obtain a multiple sequence alignment for a family of proteins. As with the previous lab, there is a good deal of complexity to this lab. I strongly recommend that you work with a partner on this lab and that you spend significant time thinking about the design of your implementation. When your lab is complete, you will hand it in using handin68.

Your program will need to be well-designed, so read this document thoroughly before beginning to program. Your program, at a high-level, will have the following phases:

Reading Input Files- your program will need to load in the list of protein sequences from file assuming FASTA format.
Construct a profile HMM - before you training your parameters, you need to create a profile HMM with the appropriate number of states/transitions.
Parameter Estimation/Training - your program will use the list of genes as training data to estimate the parameters of your HMM using the Baum-Welch algorithm
Computing an MSA - next, you will use the trained model to produce the optimal multiple sequence alignment using the Viterbi algorithm.
Output - you will report the resulting alignment ~~the probability of that alignment, as well as an evaluation of that alignment using both entropy and sum of pairs.~~

Program Requirements and Input Files

Your main program should be called pHMM-MSA (or pHMM-MSA.py). You should also create supplementary files for any classes or functions you define - for example, a ProfileHMM class to represent your profile hidden Markov models

Your program will take the following two arguments on the command line:

The name of the FASTA file containing all sequences
The number of iterations to use for training

The training data for your HMM is provided in a FASTA file with the following format:

>sequence 1 name
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>sequence 2 name
YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY

where the repeating X and Y is replaced with the actual protein sequence.

Output

Your outputs should occur in two sections: 1) Forward and Backward values for each sequence for each iteration of Baum-Welch and 2) the final results in the format of a multiple sequence alignment.

The first outputs allows me to see your intermediate calculations. You should format as follows. First, at the beginning of each iteration of Baum-Welch, output the current iteration number. Second, after completing the calculation of forward and backwards calculations for the E step, output the P(x) according to both algorithms (they should be the same). For example, your output will look similar to:

Iteration 1:
P(x_1) =  = 
P(x_2) =  = 
...

Iteration 2:
P(x_1) =  = 
P(x_2) =  = 
...

Once Baum-Welch is complete, your program should output the resulting MSA of the input sequences. Each line should begin with sequence name, followed by the alignment of that sequence. To keep things nicely formatted, you can use the following string formatting print state for Python:

for i in range(len(sequences)):
        print names[i]
        print "\t\t" + viterbi[i]

where names is a list of all sequence names and viterbi is a list of all Viterbi results for each sequence. Details on creating the Viterbi trace are below.

Constructing and training a profile HMM

Once you have loaded the input files, you should construct your HMM. Assume the number of match state is equal to the number of amino acids in the first sequence. Using the profile HMM construction from class, you should also add the appropriate insert state, delete states, and begin/end state.

Your HMM will be a sparse graph; that is, you should only train transitions specified in the standard profile HMM diagram. For example, you should not have a connection from the begin state to match state 4.

You should initialize your transition parameters as follows:

Transitions from match states (and begin): 0.8 to next match state; 0.1 to next insert; 0.1 to next delete
The transition from the last match state: 0.9 to the end state; 0.1 to the last insert state
Transitions from insert states: 0.2 to the next match state/end state; 0.8 for self transitions
Transitions from delete states (except the last one): 0.8 to next match state; 0.2 to next delete state. For simplicity, we will omit transitions from delete states to insert states and vice versa with the understanding that two gaps are more costly than a substitution. This is different than what we discussed in lab but is an engineering choice that many implementations (including this lab) will employ
Transition with probability 1 from last delete state to end state

For emission probabilities:

All insert states will share the same insert probabilities. You should initialize these to a uniform (0.05) probability
Each match state has its own set of emission probabilities. You should initialize these two uniform probabilities as well.

You will use Baum-Welch to train your model for the specified number of iterations provided in the input. You should use Laplace estimates estimates. Be careful to not allow transitions that are not in the graph

Compute an MSA

After training the parameters of your model, your program should compute a multiple sequence alignment for the given protein sequence. You should implement Viterbi to do this and use sum of log probabilities instead of multiple probabilities.

When creating the Viterbi trace, you should follow the points from the END state at time step N until you reach the BEGIN state at time step 0. Along the way, you will accumulate a string with the following pattern depending on what state you are in:

If you are in a DELETE state, prepend a "-"

If you are in an INSERT state at time step i, prepend a lower case version of the ith character
(e.g., sequence[i-1].lower())

If you are in a MATCH state at time step i, prepend the upper case version of the ith character
(e.g., sequence[i-1].upper().

For END or BEGIN states, accumulate nothing.

This following paragraph is now optional:
In addition, to obtain a proper format, if a sequence uses an insert state at position i, all sequences that did not also use an insert at that position should insert a gap.

Sample Output

For the short test example (short.fasta) run using 2 iterations:

$ ./pHMM-MSA.py short.fasta 2

There is a larger example in globin.fasta which is taken from the Pfam list of globin proteins. Pfam utilizes a profile HMM similar to the one you constructed to identify protein families. I have taken 34 proteins identified as being characteristic of the globin family. The results after 10 iterations of Baum-Welch is available here. Note that I have implemented the optional alignment fix for inserts so your final alignment may have fewer gap characters.

Submitting your work

When you are finished with your lab, submit your code using handin68. If you know your lab is not correct, also submit a README file describing what parts of your algorithm do not work and what you attempted to try to solve the problem.

CS68 Lab 7: Profile HMMs for Multiple Sequence Alignment

Output