CS21B Lab3: Transcription and Open Reading Frames

Due 11:59pm Tuesday, February 7

Run update21 to get the starting point file for this week's lab, which will appear in cs21b/labs/03/. The program handin21 will only submit files from this directory.

Introduction

Biologists often get a piece of DNA sequence and want to know what's in it. One of the most obvious questions to ask is, does it contain a gene? Because the genomes of organisms contain many non-coding regions, a random piece of DNA is not guaranteed to contain a gene. And if there is a gene, where does it begin and end? A simple strategy for finding genes is to look for open reading frames. An open reading frame is the section of a sequence between a start codon and a stop codon.

Gene expression involves the processes of transcription and translation. Last week we focused on implementing translation. This week we will implement transcription, and then search for an open reading frame in the resulting mRNA. To simplify the program we will only search for an open reading frame at offset 0. If an open reading frame is found, we will translate it into the appropriate amino acid sequence.

You will write a single program called transcribeAndTranslate.py to perform all of the necessary steps. You should create your solution incrementally by following the instructions given below. Test each step of your partial solution. Do not go on to the next step until the previous step is working correctly.

Below is a sample run of the program in which an open reading frame was found. The start and stop codons have been highlighted in red for clarity, but your program need not do this.

This program simulates both transcription and translation.

Enter length of random DNA string as a multiple of 3: 300

antisense strand of DNA:
CGTCCGAGGCTGTGGCCAATTACTGTCAACTGCAGAGTGTACGTGATATGAATGATTTAGATG
GGGCCTCCTTGGACGTCGTGCGGTGAGAGAGGCGAGCACAGATAGGTACTGGAAATGTACCGT
TCTAGTTCGTGATTTACGCACGTGGAGATTGCCGCGTCGTCCACAATAGTGGCACGAGAATGT
TCGGAGTTAAGACTTAATATCAAGAACAAGATGTCGCGGAGGACAGCGGGTCTGAAAGATGCG
CTTTATCAGACACCTGCATGACCCTATGATACCTGAAACCTACTGGGA

mRNA:
GCAGGCUCCGACACCGGUUAAUGACAGUUGACGUCUCACAUGCACUAUACUUACUAAAUCUAC
CCCGGAGGAACCUGCAGCACGCCACUCUCUCCGCUCGUGUCUAUCCAUGACCUUUACAUGGCA
AGAUCAAGCACUAAAUGCGUGCACCUCUAACGGCGCAGCAGGUGUUAUCACCGUGCUCUUACA
AGCCUCAAUUCUGAAUUAUAGUUCUUGUUCUACAGCGCCUCCUGUCGCCCAGACUUUCUACGC
GAAAUAGUCUGUGGACGUACUGGGAUACUAUGGACUUUGGAUGACCCU

Found an open reading frame starting at 39 and ending at 54
MetHisTyrThrTyrSTOP

Here is a sample run of the program in which an open reading frame was not found:

This program simulates both transcription and translation.

Enter length of random DNA string as a multiple of 3: 75

antisense strand of DNA:
GACAAGCCTCCGCTTAGTCTTTTTCCGTGTTGCGTGGAGTTACTTGACTATTATAAAAGGCGT
TATCCGTTACAG

mRNA:
CUGUUCGGAGGCGAAUCAGAAAAAGGCACAACGCACCUCAAUGAACUGAUAAUAUUUUCCGCA
AUAGGCAAUGUC

No open reading frame found

1. Generate a random antisense strand of DNA

DNA is composed of two strands, termed sense and antisense. The antisense strand is a complement of the sense strand as shown in the small example here:

AGAATGGCCTGGTAAGGC  sense strand of DNA
TCTTACCGGACCATTCCG  antisense strand of DNA

Generate a random antisense strand of DNA represented as a string. Use the choice function from the random library to randomly select from the bases T, A, G, and C. Please note: the sense strand of DNA is provided above for illustration only. You do not need to generate the sense strand. Your program begins by generating the antisense strand of DNA.
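One way to build the random strand is sketched below. The function name random_dna is only an illustrative choice, not something the lab requires; the sketch assumes the length has already been read from the user.

```python
from random import choice

def random_dna(length):
    """Return a random DNA string with the given number of bases."""
    dna = ""
    for _ in range(length):
        # choice picks one character at random from the string "TAGC"
        dna = dna + choice("TAGC")
    return dna

antisense = random_dna(12)
print(antisense)
```

Because the strand is random, the printed string will differ from run to run.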

2. Transcribe the antisense strand of DNA into mRNA

When transcribing the antisense strand of DNA, T becomes A, A becomes U, G becomes C, and C becomes G. The resulting mRNA is identical to the original sense strand except that every T is replaced with a U. For instance, if we continue with the previous example:

TCTTACCGGACCATTCCG  antisense strand of DNA
AGAAUGGCCUGGUAAGGC  transcription into mRNA

Create an mRNA string representing the transcription of the antisense strand from the previous step.
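The four substitution rules can be sketched as a loop that builds the mRNA one base at a time. The function name transcribe is an illustrative assumption; the test string is the antisense example from this step.

```python
def transcribe(antisense):
    """Return the mRNA transcribed from an antisense DNA string."""
    mrna = ""
    for base in antisense:
        if base == "T":
            mrna = mrna + "A"
        elif base == "A":
            mrna = mrna + "U"
        elif base == "G":
            mrna = mrna + "C"
        elif base == "C":
            mrna = mrna + "G"
    return mrna

print(transcribe("TCTTACCGGACCATTCCG"))  # prints AGAAUGGCCUGGUAAGGC
```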

3. Search for an open reading frame in the mRNA

Only particular sections of the mRNA can code for proteins and these are termed open reading frames. An open reading frame begins with a particular start codon: AUG. The open reading frame can end with several different stop codons: UAA, UGA, or UAG. For example, the following mRNA contains a short open reading frame that begins at position 3 (starting from 0) and ends at position 12 (the first letter of the stop codon).

AGAAUGGCCUGGUAAGGC  open reading frame found in mRNA

Write a for loop to find the position of the first start codon in the mRNA string, if one exists. Because we only search at offset 0, examine the codons that begin at positions 0, 3, 6, and so on. Use a break statement to exit from the loop as soon as a start codon is found.

Write another for loop to look for the first stop codon beginning from the location of a found start codon, if one exists, again moving three letters at a time. Use a break statement to exit from the loop as soon as a stop codon is found.

If both a start and stop codon were found, report the locations of the open reading frame.
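The two loops described above might look like the following sketch. The function name find_orf, the use of -1 to mean "not found", and returning None when there is no open reading frame are all illustrative assumptions rather than requirements of the lab.

```python
def find_orf(mrna):
    """Return (start, stop) for the first offset-0 open reading frame,
    or None if no start codon or no following stop codon exists."""
    start = -1
    # Step through codons at positions 0, 3, 6, ...
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i+3] == "AUG":
            start = i
            break
    if start == -1:
        return None

    stop = -1
    # Continue codon by codon from just after the start codon
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i+3] in ["UAA", "UGA", "UAG"]:
            stop = i
            break
    if stop == -1:
        return None
    return (start, stop)

print(find_orf("AGAAUGGCCUGGUAAGGC"))  # prints (3, 12)
```

The stop position is the index of the first letter of the stop codon, matching the example above.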

4. When an open reading frame is found, translate it

If an open reading frame was found, use the translateName function from the genetics library to translate the open reading frame into amino acids. For example the open reading frame found in the previous step translates to:

MetAlaTrpSTOP

If no open reading frame was found, report this.
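The codon-by-codon walk over the open reading frame can be sketched as below. The course's genetics library is not reproduced here, so this sketch does not call translateName; instead it uses a hypothetical stand-in dictionary, EXAMPLE_CODONS, that covers only the four codons in the example. In your program, the lookup would be replaced by the library function.

```python
# Hypothetical stand-in for the genetics library: only the codons
# that appear in the example open reading frame are listed.
EXAMPLE_CODONS = {"AUG": "Met", "GCC": "Ala", "UGG": "Trp", "UAA": "STOP"}

def translate_orf(mrna, start, stop):
    """Translate the codons from start through the stop codon at stop."""
    protein = ""
    # stop is the first letter of the stop codon, so go up to stop + 3
    for i in range(start, stop + 3, 3):
        codon = mrna[i:i+3]
        protein = protein + EXAMPLE_CODONS[codon]
    return protein

print(translate_orf("AGAAUGGCCUGGUAAGGC", 3, 12))  # prints MetAlaTrpSTOP
```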

Hints and Tips
  1. Be sure to attack the problem as 4 subproblems. You should begin with step 1 (generating random DNA) before moving to step 2. Be sure to thoroughly test each subproblem before moving on.
  2. Remember to import the random library to use choice, which takes in a sequence (or list) of items and randomly selects among those options.
  3. Testing your solution with a random string can be difficult. You can always test it with a fixed, small DNA sequence first. For example, use a variable testDNA that is set to the example above:
    testDNA = "TCTTACCGGACCATTCCG"

    and then test your program to see if you get the same output as in the example. Once that is working, you can remove testDNA and use the randomly generated sequence.

Optional extensions

These suggestions are not required.

If you're interested in making your program more complete, you can enhance it so that it searches for open reading frames at offset 0, 1, and 2 and reports the first one found.

A further enhancement is to report the longest open reading frame found at any of the possible offsets.
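Both extensions can be sketched by parameterizing the search on the offset. The names find_orf_at_offset and longest_orf are illustrative assumptions. Note that this sketch only compares the first open reading frame found in each of the three frames; it does not look for additional open reading frames later in the same frame.

```python
STOP_CODONS = ["UAA", "UGA", "UAG"]

def find_orf_at_offset(mrna, offset):
    """Return (start, stop) for the first ORF in the given frame, or None."""
    start = -1
    for i in range(offset, len(mrna) - 2, 3):
        if mrna[i:i+3] == "AUG":
            start = i
            break
    if start == -1:
        return None
    for i in range(start + 3, len(mrna) - 2, 3):
        if mrna[i:i+3] in STOP_CODONS:
            return (start, i)
    return None

def longest_orf(mrna):
    """Return the longest of the first ORFs found at offsets 0, 1, and 2."""
    best = None
    for offset in range(3):
        orf = find_orf_at_offset(mrna, offset)
        if orf is not None:
            # Compare by distance from start codon to stop codon
            if best is None or orf[1] - orf[0] > best[1] - best[0]:
                best = orf
    return best

print(longest_orf("AGAAUGGCCUGGUAAGGC"))  # prints (3, 12)
```

To report the first open reading frame found instead of the longest, the loop over offsets could stop as soon as find_orf_at_offset returns a result.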

Submit

Once you are satisfied with your program, hand it in by typing handin21 at the Unix prompt. Recall that you may run handin21 as many times as you like, and only the most recent submission will be recorded.