CS68 Lab 2: Multiple Sequence Alignment

Due by noon, Monday October 6, 2014

The goals of this week's lab:

To accomplish these goals, the lab is broken into three parts:

  1. A tutorial on using ClustalX (a GUI version of ClustalW).
  2. An analysis and short 1-page response to a research paper on the multiple sequence alignment algorithm, MUSCLE.
  3. A problem set to practice and analyze algorithms we have discussed in class.
You may work closely in discussing these problems with one other person, but your answers must be submitted individually. Please write the name of your co-worker at the top of the assignment. You may hand in your solution either (neatly) written or typed or electronically as a PDF. In both cases, be sure to fill and submit the README file in this week's directory. You should also submit any supplementary materials (e.g., code) using handin68.


ClustalX

To begin, read this tutorial on using ClustalX. I have placed two sets of sequences you can use to test out ClustalX in your labs directory: DnaB.txt which is a small set of medium-length protein sequences, and sequences12.txt - a large set of myosin heavy chain proteins. You will run ClustalX by entering the following command:

$ /usr/swat/bin/clustalx-2.1/clustalx
Note: you will probably want to increase the font size after starting the program so the sequences/results are easier to read.

You can begin by using the quick start guide on your DnaB.txt file. Then, read the rest of the document testing out different parameters to see how they alter your results.

Once you complete the document and analyze both sequences, answer the following questions:

  1. Briefly describe the trade-offs between the Fast and Slow approaches. Which option is suitable for your DnaB.txt example? How about the sequences12.txt file? Thinking of the intent of a user interested in asking biological questions; why might you want to combine both options? That is - attempt a fast run followed by a slow run.

  2. After producing an alignment, one must assess the quality. After one run, can you determine that you have the best alignment possible? What strategies do the authors suggest for a) assessing quality and b) finding the best alignment (HINT: read the section "Assessing the quality of alignment")?

  3. Attempt to interpret your results. As a biologist, how does the color coding aid you in analysis? Specifically, think of a column which has varied content (i.e., medium to high entropy) yet consistently stays the same color. How might this alter your assessment of the "conservation" of a column.

  4. After running ClustalX on one of the input sets, head on over to the WebLogo page for producing representations of the entropy of a sequence. Load the alignment from one of your examples and choose a range of approximately 25 amino acids. Your range should include example of highly conserved regions and low-conserved regions. Save the output file and either included it in your written solutions or add it as a file named logo.png in this week's lab directory and submit using handin68.


Further Reading and Response Paper

As noted in class, ClustalW is one of the most widely used technique, but is not necessarily the most advanced method available. Another popular program is MUSCLE. Read the brief introduction of the paper, available here. Discuss the paper with your classmates (at least one other person) and reflect on the following aspects:

With your homework, submit a short paper on one interesting aspect of the paper. The paper should be short - less than 1 page single-spaced. Rather than addressing all of the questions above, your paper should provide a detailed analysis on a specific aspect of the paper . This should stem from your discussion above (e.g., the use of distance functions; the relationship to phylogenetic trees; the choice of statistical methods for evaluation, etc.) or can explore some other aspect you find interesting. You should a) summarize the aspect of the MUSCLE paper you are discussing, b) provide a detailed analysis of this aspect (e.g., compare and contrast to in-class methods; explain the relationship between the biology and the algorithm, etc.) and c) pose questions of further inquiry that the paper did not cover. The provided paper is a shortened version of the full detailed algorithm. You may want to examine this longer version to get more details for your written response.


Problem Set
  1. You are given the following sequences:
    1. CTATTAATAC
    2. TATTAATAC
    3. CTATTAT
    4. CTATTAATC
    5. ATTAATAC
    Use the Star algorithm to identify a multiple sequence alignment. Pick the string that has the maximal average alignment score to select the center. To perform your alignments, use a global alignment algorithm with linear gap penalty. Matches should have a score of +1, substitutions -1, and gaps -3. You can do this by hand or write a short program. Show your work if done by hand, or submit your code as star.py. At a minimum, you should show the average/sum of alignment scores for each sequence and the full multiple sequence alignment.

  2. Do the same for the Feng-Doolittle algorithm. (HINT: You should be able to use your intermediate results from the Star Alignment problem above). Be sure to submit your initial distance matrix (use the equation in the book to convert your alignment scores to distance scores) and also your guide tree. Use the following values for a random alignment:
    Srand(0,1) = -6
    Srand(0,2) = -7
    Srand(0,3) = -4
    Srand(0,4) = -7
    Srand(1,2) = -8
    Srand(1,3) = -6
    Srand(1,4) = -3
    Srand(2,3) = -8
    Srand(2,4) = -6
    Srand(3,4) = -5
    
    The S_max for the 5 sequences are respectively 10, 9, 7, 9, 8.

    When updating the distance matrix, it is acceptable to either take the average distance of all pairs or to pick the minimum. So, for example, if you merge sequences x and y into one cluster, the distance between {x,y} and z is the either the average or minimum of the individual pairwise distances d_xz and d_yz. Use the minimum distance between the two groups. Also, please use natural log when calculating distances to make grading a bit easier.


  3. In class, we discussed two scoring metrics: sum of pairs (SP) and minimum entropy. Consider a column m_i that contains the following characters: A,A,A,A and another that contains A,A,A,D. Using the BLOSUM62 matrix, compute the sum of pairs and minimum entropy scores for each column. Note that I have given columns and not sequences. s(A,A) = 4 and s(A,D)=-2.

  4. Do the same for column A,A,A,A,A versus column A,A,A,A,D. Compare your results to the previous problem. How does this support our discussion on the relative advantages/disadvantages to the scoring metrics.

  5. In class, we discussed DNA as a linear sequence of nucleotides. Not all DNA exists in this double helix form, however. The genomes of many bacteria and even some eukaryotes (e.g., mitochondrial DNA in humans) exist in ciruclar form. That is, there is no "start" or "end" to the sequence. Devise an efficient algorithm to perform optimal local alignment on two circular DNA sequences. This problem is conceptually difficult because there is not a 0th character in the sequence. A O(nm) algorithm exists, but if you cannot think of it, describe your best attempt and its O() run time. As a hint, your solution should build off of your existing local/global alignment algorithms from class.


Submitting your work

You must hand in an individual solution by the due date. You must also write the name of the individual you discussed problems in detail with. You may talk about high-level concepts with as many individuals as you like (for example, the paper on MUSCLE). But you are only allowed to work on the specific problem solutions with one person in the class. Supplementary material should be handed in using handin68. Do not forgot to submit your README for this week, as well.