CS68 Lab 3: Inferring the Tree of Life

Due by 11:59pm, Thursday October 23, 2014

The goal of this week's lab is to:

You may work with a partner on this lab. Obtain starting point files using update68 and submit your solutions using handin68.


Practice Problems

We will be doing the following problems in-lab. Given the matrix in your labs/3/input directory titled nj_inclass.dist:

  1. Show whether the matrix is ultrametric
  2. Show whether the matrix is additive
  3. Run the UPGMA algorithm on the distance matrix
  4. Run the neighbor joining algorithm
Neighbor Joining

Neighbor Joining is the most popular distance-based method for inferring tree structures given its ability to handle non-ultrametric data. You will write a program NJTree.py with the following objective:

An example run for your program would be:

$ ./NJTree.py distances
Where distances is the name of the file containing the distance matrix.

Input File Format

Your distance file will contain one line per pairwise distance in the format:

A B d_A,B
where A and B are two different taxa (e.g., genes, organisms) and d_A,B is a positive valued float.

Naming Parent Clusters

To name your clusters, simply merge the names of the two sub-clusters with a hyphen in between, e.g. merging A and B yields a parent cluster A-B. Merging parent cluster A-B with C-D would yield A-B-C-D (the ordering of letters does not matter).

Output Text Representation of Tree

Your program should output a list of edges in your tree in a similar format to the input, but using distances from the resulting tree as well as including inferred parent clusters:

A A-B d_A,A-B
B A-B d_B,A-B
You should only write one line per edge; that is, there should be no distance between A and B since they are not directly connected by an edge.

Output Image File

Your program should output a visualization of your tree using the GraphViz package. This is a popular tool used throughout engineering and science disciplines to draw graphs. Python has a library that makes integrating GraphViz into your program fairly easy.

To use in your program, first import the Python GraphViz library:

import pygraphviz as pgv
NOTE: recall that a tree is a special type of graph, so you should use the graph object in GraphViz to represent your tree. To create a graph T and add nodes and edges:
T = pgv.AGraph() #Constructs a graph object
T.add_node("A") #The string is both the label and hash key
T.add_node("B") 
T.add_edge("A","B",label="1.0", len=1.0) #Takes in two node keys, an edge label, and the length of the edge

Your tree should have exactly one node for each sequence in the set, and then add one additional node for every parent cluster (i.e., on every merge). The program should have no edges initially, but will add two edges for every merge operation (an edge from each child to the new parent cluster). The edge label and len should be the distance between the child and parent; i.e., d_ik between the two clusters i and k. I suggest setting the len field to a minimum value of 1.0 to prevent overlapping nodes.

Lastly, you will output your graph to a file by using the draw command:

T.draw("tree.png",prog="neato")  #neato is the algorithm for drawing the graph
You can use your favorite image viewer to view your results.

Sample Output

Using the in-lab example:

$ ./NJTree.py input/nj_inclass.dist
A A-B 6.0
B A-B 1.0
A-B A-B-C 3.0
C A-B-C 2.0
D A-B-C 5.0
The resulting image: In-class Example

Analysis

You should see an input file primate.dist in your input directory. Run your program on this matrix and submit the resulting png as primate.png in your lab directory These distances are scaled distances of mitochondrial DNA found in several primate organisms. I have abbreviated the names to improve readability of the graph image (the nodes become very large with long labels). The mapping is as follows:

In the README file, answer the following questions:
  1. Which two species were joined first by the neighbor-joining algorithm? Are these two species the closest pair of species in your data set according to the original distance matrix?
  2. If so, explain why the algorithm guarantees this result. If not, explain why this pair was chosen first
  3. For the next set of questions, take a look at the following image produced by the neighbor joining algorithm for studying HIV. The numbers in the nodes map to various strains of HIV (human immunodeficiency virus) and SIV (simian immunodeficiency virus). You can take a look at the mapping and original distance matrix.

    AIDS (Acquired Immune Deficiency Syndrome) is caused by two variants of HIV in humans - HIV-1 and HIV-2. You will use the tree to answer questions about the evolution of these two variants, in particular focusing on their relationship with various strains of SIV.


  4. Taking a look at the graph and provided mapping, which virus strain is most likely the outgroup?
  5. Draw the corresponding rooted tree using your answer to the previous question. Do not worry about drawing branch lengths to scale. You can use some ASCII art in your README, or drop a pdf or image of your tree with the file name rootedHIV.pdf/png
  6. Given the tree you have produced, draw some rough conclusions about the evolution of HIV-1 and HIV-2. Does it appear that they evolved from one common simian source or independently from different simian groups (i.e., are they clustered together and separate from strains of SIV? Or are the two HIV variants clustered more closely with SIV than each other)?

Weighted Parsimony

For the second part of your lab, you will approach tree inference using parimony. In a file called weightedParsimony.py, write a program with the following objective:

An example run of your program:
$ ./weightedParsimony.py tree substitution_matrix leaf_sequences

Input File Format

Leaf Sequences: Contains a list of all of the K sequences representing the leaves of the tree. Each line will be in the format:

taxa_id: sequence
where taxa_id is the name of the organism or gene, and shoulod match the a leaf in the tree input file, and sequence is a DNA sequence. Using the example we have from class, I have labeled each leaf sequence as L1 through L4 in input/wp_inclass.seq

Tree File: The tree file will define the topology of the phylogenetic tree you are attempting to score. The first line specifies the root. Each line after that is in the format:

child parent
For example, the simple tree from class (input/wp_inclass.tree) would be:
N1
N2 N1
N3 N1
L1 N2
L2 N2
L3 N3
L4 N3
Note that the root has been labeled N1. It has two children N2,N3. The leaves connect to either N2 or N3.

Weighted Cost Matrix: Specifies the substitution value for each pair of states possible at each position in the sequence. This has the exact same format as the distance matrix in the Neighbor Joining algorithm; that is, each line contains a pair of characters followed by their substitution score. You can find the example from class in input/wp_inclass.cost. The diagnol is not defined, but you should set it to 0 when you create your substitution function.

Output

Your program should print the total tree score as well as the inferred state of all common ancestors after running the weighted parimony algorithm. Each line should be specified as:

identifier: sequence
similar to the input sequence file. The in-lecture example would appear as follows:
$./weightedParsimony.py wp_inclass.tree wp_inclass.cost wp_inclass.seq 
Score: 9.0

Inferred States: 
L4: G
L2: C
L3: T
L1: A
N1: T
N2: T
N3: T
For debugging purposes, you should output intermediate calculations. You can see my intermediate outputs here. When submitting, please remove this intermediate prints or ensure the required output appears at the end of the file.

Tip: Representing Trees

While you can construct a tree data structure to help implement your program, it is probably much easier to just use a dictionary instead. For example you can have a dictionary where the key is the identifier (e.g., "N1") and the value is a list of the children (e.g., ["N2","N3"]). A leaf node simply has an empty list for its value. If you need to represent parents, you can have a companion dictionary where the keys and values are reversed.

Multi-Character Sequences

You should first get your program to work on single-character sequences as practiced in class. Once that is complete, you will want to improve your algorithm to calculate the cumulative score and trace for a set of sequences of length N. You can assume that each leaf sequence is of the same length. Also, since we assume independence of positions, you should be able to just place a loop around the function you already created (you are practicing modularity, right?). I have given a second set of example files beginning with the prefix mammals. A sample run will output:

$./weightedParsimony.py mammals.tree mammals.cost mammals.seq 
Score: 11.0

Inferred States: 
Aardvark: CAGGTA
Chimp: CGGGTA
Dog: TGCACT
Bison: CAGACA
Elephant: TGCGTA
C3: CGGGTA
C2: TGCGTA
C1: CAGGTA
C4: CGCGTA
An output of intermediate values can be found here.

Submitting your work

When you are finished with your lab, submit your code using handin68. You should submit, at a minimum:

  1. NJTree.py and associated source code
  2. primate.png showing your GraphViz solution to neighbor joiningon the primate mitochondrial DNA
  3. The answer to the analysis questions in your README file
  4. weightedParsimony.py and associated source code.