CS68: Lab 0

Overview

The goal of this week's lab is to reinforce basic concepts of biology and bioinformatics. Specifically, Part 1 will have you explore existing bioinformatic databases, discovering what information is available for a particular gene of interest and answering questions along the way. Part 2 will have you implement a short program that will simulate many aspects of the central dogma and explore the mechanisms in more detail.

This lab can be done in pairs, but is simple enough that you may also do it individually if you prefer. You can obtain your starting directory structure and any starting files by running update68. You may discuss concepts with a fellow classmate, especially if you are having difficulty with the details of transcription or translation. You may not share code, however.

You will be responsible for handing in your solutions to both parts, due Thursday, September 11 before midnight.

Part 1: Working with Databases should be handed in as a text file, README.
Part 2: Central Dogma requires you to hand in two files. The first, sequences.py, will contain class definitions for DNA, RNA, and Protein objects. The second, dogma.py, will contain a main program that allows a user to read in a DNA sequence and convert it to a protein using transcription and translation functionality. The user will be able to search a large genome for potential proteins of interest.

The programming for this week's lab may seem basic upon first read, but that is partially because we haven't covered any algorithms in class yet! It is designed to get you back in the practice of using Python and to get you to see how transcription and translation work.

Part 1: Working with Databases

One of the most well-studied proteins in molecular biology is the green fluorescent protein (GFP). Its discovery was awarded the 2008 Nobel Prize in Chemistry for redefining how fluorescence microscopy is used in biology. It's also being used to create a breed of glow-in-the-dark pets that may give you nightmares.

In this portion of the lab, you will learn about GFP using three well known databases for genomics: GenBank (for nucleotide sequences), UniProt (for protein sequences), and the Protein Data Bank (PDB; for protein structures). Along the way, you will answer questions that you will submit in your README file.

GenBank:

GenBank is a database of nucleotide sequences. It can be accessed at the NCBI website (National Center for Biotechnology Information) at http://www.ncbi.nlm.nih.gov/genbank. In the search pull-down menu at the top, make sure Nucleotide is selected. In the search text box at the top of the screen, type "GFP" and hit the Go button.
This search will bring up over 1000 results. To narrow the search, click on Advanced just below the search box. Type gfp in box one, selecting Gene Name from the Field Menu. In box two, type Aequorea victoria and select Organism. Click Search and then sort results by date using the Display Settings.
Two entries, M62653 and M62654, are from a seminal 1992 paper and appear at the bottom of the list. Click on M62653. Look over the GenBank record of M62653, and answer the questions below in README.

Before continuing, answer these questions in your README file:

How long is the nucleotide sequence? Amino acid sequence? (HINT: the field CDS details the encoding, or protein sequence. Either do the math or click on the protein_id for the number of amino acids.)
How many variations of the gene are found in the species population? (HINT: read the abstract of the original paper. You should find a link about 12 lines down labeled PUBMED.)
What is the Latin name of the organism whose DNA was sequenced for this GFP?
How many bases at the beginning of the sequence are not involved in encoding the protein? At the end of the sequence? (HINT: the FEATURES table describes the segments of the gene, including the CDS tag which states which portions are encoded. The Graphics view at the top is also helpful.)
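
The "do the math" hint can be sketched as follows. This is only an illustration: the function name cds_lengths and the coordinates are made up (they are not the values for M62653), and it assumes the CDS range is 1-based, inclusive, and ends with the stop codon, which does not code for an amino acid.

```python
def cds_lengths(cds_start, cds_end, record_length):
    """Return (coding bases, amino acids, bases before CDS, bases after CDS).

    Assumes a 1-based, inclusive CDS range whose last codon is a stop codon.
    """
    coding = cds_end - cds_start + 1          # inclusive range
    if coding % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    amino_acids = coding // 3 - 1             # the stop codon adds no residue
    upstream = cds_start - 1                  # bases before the CDS
    downstream = record_length - cds_end      # bases after the CDS
    return coding, amino_acids, upstream, downstream


# Hypothetical record: 100 bases long with a CDS annotated as 11..40
print(cds_lengths(11, 40, 100))   # (30, 9, 10, 60)
```

Clicking on the protein_id in the record is a good way to double-check the amino acid count you compute by hand.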

UniProt:

UniProt is a database of amino acid sequences that can be accessed at UniProt. At the UniProt homepage, type gene:GFP in the search box and click the Search button. The first link should be GFP_AEQVI P42212. Click on the link.
Examine the web page for this protein, noting the wide range of information made available. You will need this information to answer the questions below.

Before continuing, answer these questions in your README file:

This UniProt record has links to other databases. Pfam (Protein Families) is a database of multiple alignments that help research proteins with related structure/function. Pfam accession numbers begin with the letters PF, followed by five digits (e.g. PF12345). What is the Pfam accession number for GFP_AEQVI?
Ontologies are categorizations, or labelings, for functions of a protein. In particular, Gene Ontologies (GO) are a standardized set of functions that can be assigned to a protein to help organize the large amounts of data about genes. What is the primary GO term on the listing for GFP? Who is the first author of the paper cited for detailing this ontology?
GFP is commonly used by biomedical engineers for various purposes. If the fluorescence is not strong enough, what site on the amino acid sequence can be mutated to increase fluorescence?

Protein Data Bank:

The PDB (Protein Data Bank) is a database of protein structures at http://www.rcsb.org/pdb. Type GFP into the search text box and click the Search button.
Note that GFP was once the molecule of the month! Click on the story and read it; it contains a nice history of the protein.
Back to the search results: sort by release date (increasing) and click on the result 1EMA (it should be first or second).
Notice that a lot of the sequence and annotations in the other databases are also accessible here. This is a recent modification to the PDB, making it a great resource for known structures (genes with no known structures will not be here).
If you have Java applets enabled, you can view the molecule. For now, you can at least see a static image of the molecule. This is known as a ribbon representation where instead of atoms, the shape of the protein indicates the type of secondary structure.
On the far right at the top, click on Display File, then click on the link to display the structure file in PDB file format.
In this file the majority of lines are ATOM lines. Scroll down until you see those lines and note how the atoms are numbered (in this case, 1 to 1771). In many cases, the full structure is not known so some atoms will be missing. Answer the questions for this section.

Before continuing, answer these questions in your README file:

On the original page (before clicking on the PDB file format), there is a section titled Structure Validation. Why is this information important? Pick one of the metrics provided and describe what it means.
For atom #16 (in the PDB file), what type of atom is it (HINT: look at column three. This is an abbreviation, e.g., O is oxygen, N is nitrogen, anything beginning with a C is a carbon)? What type of amino acid is it (column 4 has a three letter abbreviation; look up the full name)? What are the (X,Y,Z) coordinates for this atom (columns 7,8,9)?
Note where this amino acid is in the sequence. On the original page for GFP, click the Sequence tab at the top. What type of secondary structure is the amino acid from question #2 in?
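
If you would rather pull these fields out with a few lines of Python than read them by eye, a minimal sketch is below. The function name parse_atom_line and the sample line are illustrative assumptions (the line is not taken from 1EMA). It splits on whitespace to match the column numbering used in the hints above; the PDB format is officially fixed-width, so whitespace splitting can fail on lines where fields run together.

```python
def parse_atom_line(line):
    """Extract a few fields from one whitespace-separated ATOM line."""
    fields = line.split()
    if not fields or fields[0] != "ATOM":
        raise ValueError("not an ATOM record")
    return {
        "serial": int(fields[1]),                       # column 2: atom number
        "atom": fields[2],                              # column 3: atom name
        "residue": fields[3],                           # column 4: amino acid
        "xyz": tuple(float(v) for v in fields[6:9]),    # columns 7, 8, 9
    }


# Made-up ATOM line for illustration only
sample = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
info = parse_atom_line(sample)
print(info["atom"], info["residue"], info["xyz"])
```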

Part 2: Central Dogma

In this portion of the lab, you will create a Python library and main program to simulate operations described in the central dogma in order to better understand the link between a DNA sequence and resulting protein sequence(s).

First, you will construct three class definitions, one each for DNA, RNA, and Protein. I will describe the main functionality that is expected; feel free to add additional information/methods. All three should be defined in a file sequences.py.

Sequence classes

First, define a DNA class. Your class should have, at a minimum, the following functionality:

A constructor that takes in a string, strand. It should create an instance variable to store the strand and initialize any other data members you want to maintain.
A __str__ method for converting the object to a string. It should return a strand summary, including directionality. That is, the string should start with "5' " and end with " 3'". If the strand is longer than 30 bases, print the first 15 bases, a series of dots, and then the last 15 bases, e.g., "5' TTTGAGCAAGTCAAA...TTTTATTCGTGTGTA 3'".
A __len__ method to get the length of the sequence
An invert() method to replace the current strand with its reverse complement. That is, the other half of the DNA double strand. You should always think of sequences as 5' to 3', so you will not only need to find the complement of each base, but also reverse the sequence. For example, AAGG should become CCTT.
A getStrand() method to return the raw sequence
A getSubStrand() method to retrieve a portion of the sequence. This should take in a start and stop index and return a string containing all bases from start up to the stop index.
A transcription() method that returns a list of RNA objects. Each RNA object will represent the sequence between one pair of start/stop codons in the same reading frame; that is, the distance between them is evenly divisible by three. A naive way to implement this method is to search for all possible start codons (ATG). For each start codon, search the rest of the strand, incrementing by three, for a stop codon (TAG, TGA, or TAA). If there is no stop codon, do not add the encoding to the list. There may be overlaps in encodings (ATG can code for a regular methionine or a start one). Be sure to substitute U's for T's when constructing the RNA object. You should pass the index of the first nucleotide after the start codon and the index of the last nucleotide before the stop codon to each RNA object's constructor.
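
The string manipulation behind __str__, invert(), and transcription() can be sketched with plain functions, as below. This is one possible reading of the description, not the required design: the names reverse_complement, summarize, and find_encodings are hypothetical, indices are assumed to be 0-based and inclusive (consistent with the sample run), and you would still need to wrap the logic in your own classes.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
STOP_CODONS = {"TAG", "TGA", "TAA"}


def reverse_complement(strand):
    """Complement every base and reverse, so the result also reads 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def summarize(strand):
    """Return the strand with direction labels, abbreviated if over 30 bases."""
    if len(strand) > 30:
        strand = strand[:15] + "..." + strand[-15:]
    return "5' " + strand + " 3'"


def find_encodings(strand):
    """Return (start, stop, rna) for every ATG that has an in-frame stop codon.

    start is the index of the first base after ATG; stop is the index of the
    last base before the stop codon.
    """
    found = []
    start = strand.find("ATG")
    while start != -1:
        # Step through codons in the same reading frame as this ATG.
        for i in range(start + 3, len(strand) - 2, 3):
            if strand[i:i + 3] in STOP_CODONS:
                coding = strand[start + 3:i]
                found.append((start + 3, i - 1, coding.replace("T", "U")))
                break
        # Keep searching from the next position so overlapping ATGs are found.
        start = strand.find("ATG", start + 1)
    return found


print(reverse_complement("AAGG"))        # CCTT
print(find_encodings("CCATGAAATAGGG"))   # [(5, 7, 'AAA')]
```

Note that the TGA at index 3 of the last example is ignored because it is not in the same reading frame as the ATG at index 2.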

Next, define an RNA class. Your class should have the following methods:

A constructor that takes in a strand, as well as start and stop indices for where the encoding can be found in the original DNA sequence. You should store these three items as well as any other data members you see fit.
A __str__ method similar to above, but it should also print out the indices, e.g., "16-21: 5' AUGCCA 3'"
A __len__ method to return the length of the sequence
A getStrand() as above
A translate() method that returns a Protein object containing the translation of the mRNA sequence. This method should take in a codon table as input and use this to produce the translation. You should pass the start/stop index in to the Protein constructor.
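
The core of translate() is reading the strand three bases at a time and looking each codon up in the table. A minimal sketch as a stand-alone function follows; the function signature and the three-entry table are assumptions for illustration (your method will live in the RNA class and use the full table loaded from codon.txt).

```python
def translate(rna, codon_table):
    """Translate an RNA string codon by codon using the given lookup table."""
    amino_acids = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        # A codon missing from the table raises KeyError, which flags bad input.
        amino_acids.append(codon_table[codon])
    return "".join(amino_acids)


# Hand-written table covering only the codons used in this example
tiny_table = {"AUU": "I", "AAC": "N", "ACU": "T"}
print(translate("AUUAACACU", tiny_table))   # INT
```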

Lastly, you should create a Protein class. This class will look exactly the same as the RNA class minus the translate method. The constructor will take in an amino acid sequence and a start and stop index for finding the original encoding region in the DNA sequence. You do not need to print out directions for a protein sequence (i.e., there is no 5' to 3' designation).

Main program

You will define your main program in dogma.py. At a high level, your program should:

Greet the user
Prompt the user for a sequence file; load the sequence as a DNA object and print the sequence's summary
Prompt the user for the codon table file; load the table into a dictionary
Go into the program's main loop for allowing a user to interact with the sequence. The loop should exit when the user selects option "0"

The main loop can be as creative as you like. At a minimum, you should define behavior for the following options:

Print the raw DNA sequence (entire sequence, no 5' or 3' labels using getStrand())
Display a subsequence of the DNA strand. This should prompt the user for a start and stop location and display just the nucleotides between these two indices.
Allow the user to invert the DNA sequence, and then print the sequence summary (i.e., use its str() method). Any RNA or Protein sequences that have been stored should be cleared as they no longer apply.
Transcribe the DNA sequence. As described above, this should produce a list of all mRNA strands that could be produced from the sequence. You should print the number of mRNA molecules produced and their summary.
Print all of the raw mRNA sequences (entire sequence, no directionality)
Translate each mRNA sequence to a protein (make sure you clear out any previous proteins from your list); print a summary of each protein
Print raw protein sequences to an output file, one protein per line
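
One way to organize the main loop is a dictionary that maps menu choices to handler functions. The sketch below is only an illustration under stated assumptions: run_menu, the state dictionary, and the two handlers are hypothetical, strings stand in for your DNA/RNA/Protein objects, and the choices come from a list rather than input() so the example runs on its own.

```python
def run_menu(choices, handlers):
    """Dispatch each choice to its handler until "0" is seen.

    Returns the handler results (or "invalid") so the flow is easy to inspect.
    """
    results = []
    for choice in choices:
        if choice == "0":            # option 0 exits the loop
            break
        handler = handlers.get(choice)
        if handler is None:          # unknown option: report it and keep looping
            results.append("invalid")
            continue
        results.append(handler())
    return results


# Hypothetical program state; a real program would hold objects, not strings.
state = {"dna": "AAGG", "rnas": ["stale"], "proteins": ["stale"]}


def print_raw():
    return state["dna"]


def invert():
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    state["dna"] = "".join(pairs[b] for b in reversed(state["dna"]))
    state["rnas"] = []               # stored results no longer match the strand
    state["proteins"] = []
    return state["dna"]


handlers = {"1": print_raw, "2": invert}
print(run_menu(["1", "9", "2", "0", "1"], handlers))   # ['AAGG', 'invalid', 'CCTT']
```

Keeping each option in its own small function tends to keep dogma.py short and makes each option easy to test separately.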

Hints and Tips

Reading FASTA file

FASTA is a standardized format used across the field to represent DNA and/or protein sequences. You can read in detail about the format at the NCBI manual page. For this lab, you only need to know that there are two types of lines in the file: description lines and sequence lines. For example:

>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP

The first line describes the gene and can be ignored for this lab. The next four lines are the gene's protein sequence. When loading your file, you can ignore description lines. The first character on a description line will be the greater than symbol ">". Each line below the description line is part of the sequence, usually with at most 80 characters per line (70 in the example above). Simply finish reading the file line-by-line, concatenating the lines together to create one large string for the sequence.
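
The loading procedure just described can be sketched as below. The function names parse_fasta_lines and read_fasta are hypothetical, and the sketch assumes the file holds a single sequence, as the lab's files are described.

```python
def parse_fasta_lines(lines):
    """Join all sequence lines, skipping description lines and blank lines."""
    parts = []
    for line in lines:
        line = line.strip()                      # drop the trailing newline
        if not line or line.startswith(">"):     # description or empty line
            continue
        parts.append(line)
    return "".join(parts)


def read_fasta(path):
    """Read a single-sequence FASTA file into one string."""
    with open(path) as handle:
        return parse_fasta_lines(handle)


# Made-up lines standing in for a small file
example = [">toy sequence (hypothetical)\n", "TTAATAGC\n", "GTGGAATG\n"]
print(parse_fasta_lines(example))   # TTAATAGCGTGGAATG
```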

Reading Codon Table

A codon table maps each three-letter RNA codon to the single-letter amino acid that it produces. Look at the codon.txt file and note that each line contains the amino-acid abbreviation first, and then a list of all codons that map to that amino acid. You should load this file into a dictionary data structure (see the Python documentation to read up on using the built-in dictionary class). You should map codons to their amino acid equivalent, e.g., codonTable["AUG"] = 'M'
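
A minimal sketch of building that dictionary is below. The exact layout of codon.txt is not shown here, so the sketch assumes each line looks like "F UUU UUC" with fields separated by whitespace and/or commas; check the real file and adjust the splitting if it differs. The function name parse_codon_table is hypothetical.

```python
def parse_codon_table(lines):
    """Build a codon -> amino acid dictionary from lines like "F UUU UUC"."""
    table = {}
    for line in lines:
        # Assumption: separators are whitespace and/or commas.
        fields = line.replace(",", " ").split()
        if len(fields) < 2:          # skip blank or malformed lines
            continue
        amino_acid, codons = fields[0], fields[1:]
        for codon in codons:
            table[codon.upper()] = amino_acid
    return table


# Made-up lines in the assumed layout
codonTable = parse_codon_table(["M AUG\n", "F UUU UUC\n", "\n"])
print(codonTable["AUG"])   # M
```

With a real file, the same function can be fed an open file handle, e.g. by calling parse_codon_table(handle) inside a with open(...) block.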

Program Requirements

In addition to the requirements listed above, you should ensure your code satisfies these general guidelines:

Use good design principles. In fact, your solution should be very short (about 125-150 lines in dogma.py) if you design your solution well.
Make sure to practice defensive programming. Make sure the user enters in valid file names and numeric choices for the menu.
Be sure to comment non-trivial sections of your code
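
As an illustration of the defensive-programming guideline, the sketch below validates a menu choice and a file name before they are used. The helper names parse_choice and is_readable_file are hypothetical; in your program you would call them in a loop that re-prompts until the input is valid.

```python
import os


def parse_choice(text, max_option):
    """Return the menu choice as an int, or None if it is not a valid option."""
    try:
        value = int(text.strip())
    except ValueError:               # not a whole number at all
        return None
    if 0 <= value <= max_option:
        return value
    return None                      # a number, but not on the menu


def is_readable_file(path):
    """Return True only if path names an existing file we are allowed to read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


print(parse_choice("3", 6))      # 3
print(parse_choice("seven", 6))  # None
```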

Sample Runs

In your labs directory, I have placed two sample sequence files, test.fasta and gfp.fasta. The latter is the sequence for the green fluorescent protein, while the former is a toy example for which I have results below. Try your code on the test file first, and then see what happens with your GFP gene (can you recover the protein sequence you found in Part 1?). If you want to try a large example, try running your code on the E. coli UTI89 genome in ecoli_uti89.fasta. It is located at /home/soni/public/cs68/ecoli_uti89.fasta. DO NOT COPY this file; it is quite large. Note that your program will take a while to run for certain operations since it is a large sequence.

Welcome to the gene translator

Enter FASTA file name: test.fasta
Enter Codon Table file name: codon.txt

DNA sequence of length 126 successfully loaded: 
5' TTAATAGCGTGGAAT...CATTTTATTTTAAAA 3'

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 1

Entire DNA sequence: 
TTAATAGCGTGGAATGATCCTTATTAAAGAGTGTCACGAAGAGTCGGAATAGAATATGGAGGCGACAGTCGAGGGTGGGATAGAGTCCTAAAGATAACATTAAGTGTTAATCATTTTATTTTAAAA

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 3

2 Resulting mRNA sequences: 
16-48: 5' AUCCUUAUUAAA...CACGAAGAGUCGGAA 3'
58-87: 5' GAGGCGACAGUCGAGGGUGGGAUAGAGUCC 3'

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 4

mRNA Sequence 0
AUCCUUAUUAAAGAGUGUCACGAAGAGUCGGAA
mRNA Sequence 1
GAGGCGACAGUCGAGGGUGGGAUAGAGUCC

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 5

2 Resulting protein sequences: 
16-48: ILIKECHEESE 
58-87: EATVEGGIES 

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 6
Enter output filename: test.pro

File output complete

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 2

DNA sequence successfully inverted:
5' TTTTAAAATAAAATG...ATTCCACGCTATTAA 3'

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 3

1 Resulting mRNA sequences: 
15-23: 5' AUUAACACU 3'

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 5
1 Resulting protein sequences: 
15-23: INT 

Options: 
0) Exit
1) Print raw DNA sequence
2) Invert DNA sequence
3) Transcribe DNA sequence and print summary
4) Print raw RNA sequences
5) Translate RNA sequences and print summary
6) Print raw sequences to file

Enter choice: 0

The output protein files for the other test cases are available as well.

Submitting your work

Once you are satisfied with your program, hand it in by typing handin68 at the unix prompt. If you work with a partner, be sure to select option p in the menu.

You may run handin68 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.

About the Data

Thanks to Mark Goadrich for sharing his test example sequence for part 2.
