CS68 Lab 1: Dynamic Programming and Pairwise Sequence Alignment

Due by 11:59 p.m., Friday February 10, 2017. If handing in a physically, your problem set is due by 5pm.

The goals of this week's lab:

This week's lab is broken into two segments: a problem set, followed by a programming assignment to implement the local pairwise alignment algorithm from class. Please see instructions below for submitting solutions for each portion. You may work with one partner for this lab.


Problem Set

Problem set questions are available here, and your solutions should be handed in on paper.


Programming Assignment: Local alignment with linear gap penalty

Implement a program that uses dynamic programming to compute a pairwise local alignment using a linear gap penalty function. The program should work on protein sequences (you can generalize it to DNA sequences as well, but this is not required).

Getting Started

Find your git repo for this lab assignment with name format of Lab1-id1_id2 (id1 being replaced by the user id of you and your partner): CS68-S17

Clone your git repo with the lab 1 starting point files into your labs directory:

$ cd cs68/labs
$ git clone [the ssh url to your your repo)
Then cd into your Lab1-id1_id2 subdirectory. You will have the following files (those in blue require your modification): You are allowed to add additional python files as for importing into localAlign.py.

Usage

Your program should take in 5 command-line arguments in this specific order:
  1. The name of a file containing the first sequence to align
  2. The name of a file containing the second sequence to align
  3. The name of a file containing the substitution matrix
  4. the gap penalty (g)
  5. The name of the output file to save the alignment
For example:
python localAlign.py input/sample1.fasta input/sample2.fasta blosum-62 -2 sampleAlign.txt
If the user specifies an incorrect number of arguments, print a usage statement and exit the program. You should also error check all parameters to ensure that the files exist and the gap penalty is negative. If anything is amiss, you can simply exit the program. For example:
$ python localAlign.py
Error, incorrect number of arguments
Usage: python localAlign.py seq1 seq2 matrixFile g output
  seq1, seq2 - fasta files containing two sequences to align (1 per file)
  matrixFile - file containing substitution matrix
  g - integer specifying penalty for a gap, must be negative
  output - name of output file for raw alignment

Breaking Ties

To be consistent, and obtain the same output as I have below, break ties using the high road protocol:
  1. End alignment (if 0)
  2. Insert gap in y (i.e., output character in first sequence, gap in second)
  3. Match the characters in x and y
  4. Insert gap in x
If there are multiple substrings with the same maximum score, output them all. You should worry about this requirement after you get your program working for just one maximum.

Input

I have provided three sample files in your directory: Input files will follow a similar format to lab 0: the FASTA format where the first line is a comment, and then each line after that is part of the sequence.

Output

At a minimum, your output should contain:

While these are the minimum, part of your final grade will involve going beyond the minimum to present visually appealing results. My sample output shows an example - you do not need to conform to the exact formatting if you have diffferent preferences for visualizing results. Take a look at tools online, such as BLAST and EMBOSS to get ideas on visualizing alignments. One suggestion is to break the alignment into smaller segments (50 in my sample output). Another is to use the matplotlib library to visualize the scoring matrix of optimal scores (see images below).

Sample output

Below our the outputs from the provided examples. My standard output of the alignment includes outputting 50 characters per line for each sequence, with the start and ending character indices indicated as well as (optional) markers to indicate perfect alignments ('|') and substitutions ('.'). Below are the outputs for several of the examples:

Additional requirements


Tips and Hints

Command-line arguments

To use command-line arguments in Python, first, import the sys library.
from sys import *
You can now access a list argv containing all of the arguments. Recall that argv[0] is the name of the program, while argv[1] is the first argument provided by the user.

Substitution Matrix

Reading in the substitution matrix will require some careful thought and testing. One method is to simply store the values as a 20 by 20 two-dimensional array of integers and then have a separate function that maps amino acid characters to an index.

Another technique is to use a two-dimensional dictionary. Each row in the file will translate into a dictionary. They key is the character at the beginning of the line. The value it maps to is itself a dictionary containing all of the (column,value) pairs for that row. If I do this in python for just the first two rows and columns and print the resulting dictionary of dictionaries, I get the following value:

{'A': {'A': 4, 'R': -1}, 'R': {'A': -1, 'R': 5}}
The first pair in the dictionary has a key 'A' and value which itself is a dictionary {'A': 4, 'R': -1}. If I want the substitution value for replacing an A with an R, I simple call:
>>> print s['A']['R']
-1
where s is the dictionary structure I stored the values in. It's a bit more complicated to think about, but easier to use since you will not need to map the amino acids to integers.

A third approach is to stick with the two-dimensional array, but place it in a class SubstitutionMatrix such that you can create an object and call a getScore method that does all of the mapping for you. Either technique is acceptable, but your code should be readable and well designed. If using C++, either of the two- dimensional array options are probably easier to implement.

Submitting your work

The problem set and paper response should be handed in on paper. You can hand them in class or place them in the mailbox outside my door by the due date.

For the programming portion, be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded.