CS68: Lab 5

Overview

The goal of this week's lab is to explore the use of Markov models for the task of finding genes in a large genome. Specifically, you will implement several variations of Markov models discussed in class and then use your models to test for genes in an E. coli genome. This lab is more complex from a design perspective than previous assignments. As such, I strongly recommend you work with a partner on this lab. When your lab is complete, you will hand it in using handin68.

Your program will construct the following models:

A second-order, homogenous Markov chain model to represent non-coding DNA (i.e., our null/background model).
A second-order, inhomogenous Markov chain model of coding regions of DNA. There should be three sets of states to model each position in the codon separately
A fifth-order inhomogenous Markov chain model of coding regions of DNA

The user will specify which of the latter two models to use for modeling coding regions when they run the program. I recommend first getting your program to work for the first two models and then extending the program to use fifth-order models.

Your program will need to be well-designed, so read this document thoroughly before beginning to program. Your program, at a high-level, will have the following phases:

Reading Input Files- your program will need to load in the E. coli genome, a list of known genes, and a list of potential genes
Parameter Estimation - your program will use the list of genes as training data to estimate the parameters of each of the Markov chain models discussed above
Classification - next, you will use the trained models to determine whether the hypothetical gene locations proposed are more likely to be non-coding (i.e., background) or coding sequences.
Output - you will report the result of classification including the accuracy of predictions made.

Program Requirements and Input Files

Your main program should be called geneFinder (or geneFinder.py). You should also create supplementary files for any classes or functions you define - for example, a Markov class to represent your Markov chain models

Your program will take the following four arguments on the command line:

An integer specifying the order of the coding Markov model (e.g., 2 or 5)
A sequence file name (e.g., ecoliStrainM54.txt)
A file name for the list of known genes (e.g., ecoliGeneIndex.txt)
A file name for the list of test sequences (e.g., ecoliTestIndex.txt)

The training data for your two Markov models are provided in ecoliStrainM54.txt and ecoliGeneIndex.txt. The first file contains the complete E. coli genome sequence (M54 is a popular laboratory strain of E. coli). If using Python, be careful what methods you use when loading the sequence (and reversing it). Since strings are immutable, certain operators can be very inefficient. It should not take more than a few seconds to run your whole program. See the class email for details.

The gene index file contains the start and stop index of known genes. Each line represent one gene. It is not a complete list, and only covers the first 223,048 base pairs in the sequence. The first column of each line is an identifier for the gene, and the second column lists the type of gene. Both of these can be ignored. The last two indices represent the start and stop index of the sequence so that you can extract genes from the original sequence. The last character is the orientation and represents which strand the gene is found. A > indicates that the gene is the sequence in the file. A < means the gene is on the complementary strand, so you will need to take the reverse-complement of the sequence. Some things to keep in mind when processing the genes:

The indices use 1-based indexing (i.e., the first character in the sequence is 1).
The gene is inclusive of both values. For example, 190-255 means the 190th character is part of the gene as is the 255th. The array slice in 0-based indexing would be [189:255].
All index boundaries refer to the indexing of the original sequence. That is, the indices (1-5) on the reverse strand are found by taking the first 5 characters in the original sequence and reverse complementing them.
The file is in sorted order by the left index. So, when you are training your model, you do not need to backtrack in the sequence. You can assume the next non-coding and coding region will be further than the current gene.
The gene-index only covers the training portion of the sequence. Anything to the right of the last identified gene (i.e., past 223408) is a potential test case for your models and therefore should not be used for training either model.

The test file contains a list of sequence locations in the original strain that you want to test for genes. Each line describes one subsequence to test. The first column is the start/left index of the sequence, and the second column is the stop/right index of the sequence and follows the specification from above. The last column is the correct identification of whether the subsequence is a gene (+) or non-coding (-). You will use this evaluate the accuracy of your models. Note that all of these are on the main strand template, none are on the complementary strand. Also, you can assume that the provided file will be sorted in order by the left index.

Output

Your program should output the result of classification on each of the test sequences in the test file. Specifically, it should output the following 6 columns: start index, stop index, actual classification (these three are already in the file), your model's prediction, and the log probabilities for the non-coding model and the coding model. You do not need to worry about formatting these columns but there must be some whitespace between each.

In addition, your program should output at the end the accuracy of your program in terms of precision and recall. (See the Classification section for details).

Parameter Estimation

Once you have loaded the input files, you will need to construct two Markov models - one for non-coding regions (a second-order homogenous model) and for coding regions (either a second or fifth-order inhomogenous model). As discussed in class, you will need train your model to detect these two specific positions by using parameter estimation. For this lab, you will use Laplace estimates for estimating the parameters. Be sure to use double precision to represent probabilities (python does this for you, C++ requires you to use double instead of float for the type). Your models do not need to model the probability of beginning or ending with certain characters. You should estimate your 0th order probabilities using all characters in the appropriate subsequence.

To train your gene model, extract all genes from the sequence using the genes identified in the gene-index input file. Be careful to use the correct strand (main strand or complement) as discussed above.

To train your non-coding mode, use all non-gene sequences on both strands. You should not use any sequence past the last gene. You should also not use any sequence if there is a gene in the region on the opposite strand. As example, assume we have a sequence of length 1000, with the first 100 designated as training. Assume the only gene boundary is (5,10) on the original strand. Bases in position 5 through 10 should be used to estimate the coding model, while the sequences (1,4), (11,100) and the reverse complement of these two should all be used to estimate parameters in the non-coding model.

This process is fairly simple if designed well. An example design is to store all genes in a list and store the right-most index of the previous gene as a cursor. When you load a gene, first train on the non-coding model on the sequence from the cursor to the left index on both the original and complementary strand. Then, train your coding model on the sequence from the left index through right index only on the appropriate strand. Be sure to update the cursor only if it increases in value (some genes overlap, so you don't want to backtrack the cursor.

Classification

You will use the test file provided to evaluate your models. For each subsequence defined in the test file, you will:

Run the sequence through your non-coding model to obtain the log probability of the sequence being produced by the non-coding model
Do the same for the coding model
Decided if the subsequence is more likely to be a gene or not

Note that you should use log probabilities as discussed in class and in the textbook. Multiplying small values repeatedly can lead to underflow errors on the CPU, so instead we will take the sum of log probabilities. E.g.

P(x) = P(x1)P(x2|x1)....P(xn|xn-1)

log P(X) = log P(x1) + log P(x2|x1) + .... + P(xn|xn-1)

When outputting your classifications, you will output the information from the file in the first three columns and then either a + or - depending on your programs decision of whether the sequence is a gene or not. You should also output the log probabilities for each model (non-coding first).

In addition, you will report the cumulative statistics of precision and recall. To determine those value, you will keep track of sum of accuracy of your individual predictions using four metrics. If you predict + and the correct answer from the file is +, that counts as a true positive. If it was supposed to be negative, that is a false positive. Conversely, if you predict a - and it is -, that is a true negative while an incorrect prediction is a false negative.

Precision is the ratio of correctly identified genes versus total genes predicted to be true. Recall is the number of genes correctly found versus the total number of genes in the test file:

  precision =          true positives
              --------------------------------
              true positives + false positives
  
  recall =             true positives
              --------------------------------
              true positives + false negatives

For example, let us say you have a test file of 200 test sequences containing 50 actual genes and 150 non-coding sequences. If your program correctly identifies 25 of those genes, your recall is 50% (25 true positives, 25 false negatives for the genes not found). In addition, if you incorrectly predicted 10 non-coding regions as genes, your precision is 25/(25+10) = 71.4%

In the labs directory, I have added a sample toy set. A sample run would like as follows:

$ python geneFinder.py 2 smallSequence.txt smallGeneIndext.txt smallTestIndex.txt

15 18 - - -5.48063892334 -5.48063892334
23 27 - + -7.33693691371 -6.64378973315
25 31 + + -10.1095256359 -5.83731386728
Precision: 0.5
Recall: 1.0
Overall Accuracy: 0.666666666667

More detailed outputs of the counts after training (including Laplace estimates) can be found here.

Update: I have posted the classification results for the first 10 test cases for the 2nd order coding Markov model and 5th order model. You are responsible for coming up with test cases to fully evaluate your methods.

Submitting your work

When you are finished with your lab, submit your code using handin68. You should submit, at a minimum:

geneFinder source code and executable (if applicable)
The output of your program on the given test files for a second-order inhomogenous Markov model as the file test2.out
The output for a fifth order Markov model as the file test5.out

CS68 Lab 5: Gene Finding with Markov Models

Output