12/16/2014
Issaquah High School
Project 2: DNA Analysis
Due Dates:
Checkpoint 1: 1/7/14
Final Due Date: 1/12/14
10%
Students will write a program that uses arrays and files to analyze DNA sequences and determine if they represent proteins. Special thanks to Stuart Reges and Marty Stepp of UW for use of this assignment.
I. Background
Deoxyribonucleic acid (DNA) is a complex biochemical macromolecule that carries genetic information for cellular life forms and some viruses. DNA is also the mechanism through which genetic information from parents is passed on during reproduction. DNA consists of long chains of chemical compounds called nucleotides. Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). Certain regions of the DNA are called genes. Most genes encode instructions for building proteins (they're called "protein-coding" genes). These proteins are responsible for carrying out most of the life processes of the organism. Nucleotides in a gene are organized into codons. Codons are groups of three nucleotides and are written as the first letters of their nucleotides (e.g., TAC or GGA). Each codon uniquely encodes a single amino acid, a building block of proteins.
The sequences of DNA that encode proteins occur between a start codon (which we will assume to be ATG) and a stop codon (which is any of TAA, TAG, or TGA). Not all regions of DNA are genes; large portions that do not lie between a valid start and stop codon are called intergenic DNA and have other (possibly unknown) function. Computational biologists examine large DNA data files to find patterns and important information, such as which regions are genes. Sometimes they are interested in the percentages of mass accounted for by each of the four nucleotide types. Often high percentages of Cytosine (C) and Guanine (G) are indicators of important genetic data.
In this assignment, you will write a program that reads named nucleotide sequences from an input file and performs analysis on the sequences. You will perform several calculations and analyses with the end goal of determining whether or not the given nucleotide sequence represents a protein. The results will be output to a file, not to the console.
II. Details
Behavior
i. Program Operation
Your program should begin by welcoming the user and providing a brief description of the computations and analysis the program will perform. You will then prompt the user for an input file and an output file (see below for required file formats). For each nucleotide sequence in the input file, your program will compute and output the following:
● the number of each nucleotide (A, C, G, T) in the sequence
● the percentage of the sequence’s total mass accounted for by each nucleotide
AP Computer Science, Mr. Brett Wortzman
● the list of codons present in the sequence
● whether or not this sequence represents a protein (according to our rules)
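The first and third items in the list above can be sketched in Java as follows. This is only an illustration, not a required design: the class name DnaCounts, the method names, and the "ACGT" index ordering are all hypothetical, and the codon split assumes the sequence is uppercase, contains only nucleotide letters, and has a length that is a multiple of 3.

```java
import java.util.Arrays;

class DnaCounts {
    // Assumed ordering: index 0 = A, 1 = C, 2 = G, 3 = T.
    static final String NUCLEOTIDES = "ACGT";

    // Counts occurrences of A, C, G, and T; any other character is ignored.
    static int[] countNucleotides(String sequence) {
        int[] counts = new int[NUCLEOTIDES.length()];
        for (int i = 0; i < sequence.length(); i++) {
            int index = NUCLEOTIDES.indexOf(sequence.charAt(i));
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    // Splits a sequence into codons of three characters each.
    // Assumption: the length is a multiple of 3; leftover characters are dropped.
    static String[] toCodons(String sequence) {
        String[] codons = new String[sequence.length() / 3];
        for (int i = 0; i < codons.length; i++) {
            codons[i] = sequence.substring(3 * i, 3 * i + 3);
        }
        return codons;
    }

    public static void main(String[] args) {
        String sequence = "ATGCCACTATGGTAG";
        System.out.println(Arrays.toString(countNucleotides(sequence))); // [4, 3, 4, 4]
        System.out.println(Arrays.toString(toCodons(sequence)));         // [ATG, CCA, CTA, TGG, TAG]
    }
}
```

Keeping the counts in an array indexed by position in "ACGT" means the same index can later be reused for per-nucleotide masses.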
For our purposes, a nucleotide sequence is a protein gene if:
● it begins with a valid start codon (ATG),
● it ends with a valid stop codon (TAA, TAG, or TGA),
● it contains at least 5 codons total (including the start and stop codons), and
● Cytosine (C) and Guanine (G), combined, account for at least 30% of the sequence’s mass
Note that these are not the actual constraints used by computational biologists to identify proteins; they are approximations for our assignment.
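The four rules above translate fairly directly into a boolean check. The Java sketch below is one possible reading, not the required solution: the class ProteinCheck, the method isProtein, and its parameters are hypothetical names, and it assumes the codons and the mass percentages for C and G (on a 0 to 100 scale) have already been computed elsewhere.

```java
class ProteinCheck {
    static final int MIN_CODONS = 5;             // minimum codons, including start and stop
    static final double MIN_CG_PERCENT = 30.0;   // minimum combined mass percentage of C and G

    // Returns true if the sequence satisfies all four rules from the handout.
    // Assumption: percentC and percentG are mass percentages on a 0-100 scale.
    static boolean isProtein(String[] codons, double percentC, double percentG) {
        // Checking the length first also protects the array accesses below.
        if (codons.length < MIN_CODONS) {
            return false;
        }
        String first = codons[0];
        String last = codons[codons.length - 1];
        boolean validStart = first.equals("ATG");
        boolean validStop = last.equals("TAA") || last.equals("TAG") || last.equals("TGA");
        boolean enoughCG = percentC + percentG >= MIN_CG_PERCENT;
        return validStart && validStop && enoughCG;
    }

    public static void main(String[] args) {
        String[] codons = {"ATG", "CCA", "CTA", "TGG", "TAG"};
        System.out.println(isProtein(codons, 20.0, 25.0)); // true
    }
}
```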
The masses for each nucleotide, used for calculating the mass percentages, are as follows:
● Adenine (A) – 135.128 g/mol
● Cytosine (C) – 111.103 g/mol
● Guanine (G) – 151.128 g/mol
● Thymine (T) – 125.107 g/mol
● Junk (-) – 100.000 g/mol
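Using the masses listed above, the mass percentage for a nucleotide is its total mass in the sequence divided by the total mass of the whole sequence, times 100. The Java sketch below shows one way this might be computed. The names MassPercentages and massPercentages are hypothetical, and it assumes that junk characters (-) add to the total mass but are not reported as their own percentage, and that any other character is ignored. No rounding is applied here.

```java
import java.util.Arrays;

class MassPercentages {
    // Assumed ordering: A, C, G, T, then junk (-).
    static final String SYMBOLS = "ACGT-";
    static final double[] MASSES = {135.128, 111.103, 151.128, 125.107, 100.000};

    // Returns the percentage of total mass for A, C, G, and T, in that order.
    // Assumption: junk counts toward the total mass but has no entry in the result.
    static double[] massPercentages(String sequence) {
        double[] massByNucleotide = new double[4];
        double totalMass = 0.0;
        for (int i = 0; i < sequence.length(); i++) {
            int index = SYMBOLS.indexOf(sequence.charAt(i));
            if (index < 0) {
                continue; // not a nucleotide or junk symbol; skipped in this sketch
            }
            totalMass += MASSES[index];
            if (index < massByNucleotide.length) {
                massByNucleotide[index] += MASSES[index];
            }
        }
        double[] percentages = new double[4];
        if (totalMass == 0.0) {
            return percentages; // empty sequence: all zeros, avoids dividing by zero
        }
        for (int i = 0; i < percentages.length; i++) {
            percentages[i] = 100.0 * massByNucleotide[i] / totalMass;
        }
        return percentages;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(massPercentages("ATGCCACTATGGTAG")));
    }
}
```

When a sequence contains junk, the four percentages returned here add up to less than 100, because the junk mass is part of the denominator only.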
ii. Input File Format
Input files for your DNA program will consist of a series of pairs of lines. The first line in each pair will be a name, and the second will be a nucleotide sequence. You can assume that all input files will contain an even