NCSI Talks

   

DNA Sequence Analysis Learning Scenario


Shodor > NCSI Talks > Excel > DNA Sequence Analysis Learning Scenario

Learning Scenario - DNA Sequence Analysis (Excel)

Basic Model:

Description

This is an excel worksheet that can be used to analyze strands of DNA based on their relative quantities of base pairs. Given a string of DNA, the worksheet will iterate over each value in the string and record the number of each base pair. Parameters in this model - Target and Target2 - determine which base pairs are being counted. The user can also set the inputs in the form of sequences of DNA base pairs to be analyzed. Multiple strings can be analyzed at once, allowing for easy comparison.

Background Information

Understanding the human genome is one the most important projects in medical and biological science today. Although the human genome was first sequenced in its entirety back in 2003, scientists today still struggle to understand the function of the vast majority of genes. The benefits of achieving such understanding would be legion. A completely mapped human genome would allow doctors to quickly identify and cure genetic disorders, produce medicines specifically tailored to the individual genome of the patient, and even modify human genes to select for desirable qualities.

One of the most basic methods of understanding this intricate double helix is to look for patterns in base pairs. The patterns that are found can then be correlated with other observable characteristics of individuals to get an idea of what function those specific genes might have. Information about the function of genes can be used to help patients make informed decisions about their health. For instance, many people are carriers of a single gene for Sickle-Cell disease. On its own, this gene is harmless and even beneficial, as it confers some resistance to malaria. However, if two carriers of the sickle cell gene have a child, there is a chance that that child will inherit both genes and suffer from Sickle-Cell disease.

Science/Math

The fundamental principle behind this model is that the odds of an event occurring are equal to the number of different ways that the event could occur relative to the number of ways it could not occur. If the arrangement of DNA base pairs were completely random, then we could easily calculate the expected frequency of any given pattern or patterns of base pairs. When the actual frequency of such a pattern differs significantly from its predicted frequency that is evidence that the pattern is a gene with some specific function.

The excel worksheet works on very simple properties. During every iteration, the following things happen:

  1. The current position increases by 1
  2. For each string, if the value at the current position is equal to one of the values that the user is searching the string for, then the counter for that value is increased by 1
  3. If the position is higher than the length of a string, then the counter for the string is not updated

Teaching Strategies

An effective way of introducing this model is to start with "pure" statistics first to ensure that students understand the mathematics behind it. Ask students to enumerate all potential results that can occur in the following random simulations (order matters):

  1. Flip 1 coin
  2. Flip 2 coins
  3. Flip 3 coins
  4. Flip 4coins
  5. Roll one 4-sided number cube
  6. Roll two 4-sided number cubes
  7. Roll three 4-sided number cubes
  8. Roll one 6-sided number cube
  9. Roll two 6-side number cubes

Next, ask the following questions.

  1. How many different pairs of results did you get for each simulation above?
  2. Is there a relationship between the number of possible outcomes for running a random simulation once, twice, or more times? If so, what might that relationship be?
  3. If I asked you to tell me the number of possible pairs of results of rolling 5 4-sided number cubes, what would that be? You shouldn't need to enumerate them one by one.
  4. If I increased the number of rolls to 6, what would happen to the number of possible pairs? How do you know?

Implementation

How to use the Model

There are three main parameters that can be manipulated to change the way this model runs:

  1. The Target and Target2 parameters determine which base pairs to search for as the worksheet iterates through the DNA strands
  2. Users can also input an unlimited number and length of custom DNA strands to be analyzed

All of these parameters can be manipulated simply by typing the new value into the relevant box. Once the parameters are set to your satisfaction the model can be run via the following procedure:

  1. Ensure that Excel is set to iterate circular functions only once (Preferences->Calculation->Set to Calculate "Manually" and to check the box "Limit Iterations" and set maximum iterations to 1)
  2. Change the value of "Start" to 1. The position value should automatically update to 1 as well
  3. Press [Ctrl]+[=] to step forward one iteration. The position value should now be 2
  4. Continue iterating until the position value is larger than any of the lengths of the DNA strands

For more information on Excel, reference the Excel tutorial at: http://shodor.org/tutorials/excel/IntroToExcel.

Learning Objectives

  1. Modify the excel spreadsheet to search for strings of any length, rather than just single letters
  2. Understand how to calculate the probability of different types of sequences occurring in a string of a fixed length

Objective

To accomplish this objective, have students try to understand how the "C" and "G" counters work by looking at the equation in the cell. Students should pay close attention to the cell references in this function, as they provide an easy way to trace back the chain of events that determines how the counters move. Ask the following questions:

  1. What cells do the base counters depend on? How do you know?
  2. What is the meaning of each of the constants in this function? How do you know?
  3. For your target, in place of a "C" type in "CA" and run the spreadsheet again. What happens? Why?
  4. The counters only ever look at one base at a time, so they aren't ever detecting a two-base sequence like "CA". How can you modify the counters to look for two-base sequences?
  5. 5) Changing the constant in the MID formula will change how the counters analyze the DNA, but it's a lot of effort to change it each and every time you search for a new string. How can you change the constant to a variable that automatically updates with the length of the string?

Objective

To accomplish this objective, have students look back at the analysis they did with the rolling dice earlier, and then ask the following questions:

  1. When you roll two number cubes, how many different ways are there to have the two number cubes add up to seven? How many total pairs are possible? Based on that, what is the probability of getting a seven?
  2. How many ways are there to get exactly 3 heads and 1 tail when flipping four coins? What is the probability of that occurring?
  3. If you assume that DNA base pairs have a random distribution, what is the probability of getting the string GATC? On average, how many different groups of 4 base pairs would you need to look at to find that string once?
  4. There are 44 different groups of base pairs in the DNA strings provided in the Excel worksheet. How many times would you expect to find GATC in these strings? How many times do you actually find GATC in those strings when you run the simulation? What might this imply?

Extensions

  1. Use statistical tools to analyze observed v. expected frequencies
  2. Expand the spreadsheet to be able to analyze other types of strings

Extension 1

Have students choose several different combinations of base pairs to search for and run the simulation on each, recording the total number of that combination that they received. Then, have them analyze the results to see if they have significant evidence that DNA is non-random. Ask the following questions

  1. How many of each string did you find in the DNA strand? How many did you expect to find, on average? The average can be calculated by taking the probability of getting the string and multiplying it by the number of substrings of that size (59 for size 1, 54 for size 2, 49 for size 3, 44 for size 4).
  2. 2) The standard deviation of your results is equal to the square root of the number of substrings, times the probability of getting that string, times one minus the probability of getting that string, or sqrt(np(1 - p)) where n is the number of substrings and p is the probability that an arbitrary substring will match your requirements. What were the standard deviations for each of your tests?
  3. 3) If an experimental result is more than 2 standard deviations away from the average, it is called a significant result because there is less than a 5% probability that it occurred by chance alone. Are any of your results significant? How do you know? What does that imply about these DNA strands?

Related Models

Pseudo Random Numbers

This applet introduces the concept of numbers that seem random but are actually generated by determinate processes. In this model, various types of complicated mathematical functions generate "random" numbers starting from a certain "seed", which helps to ensure that you get different random numbers each time you run the model. Oddly enough, despite the fact that random numbers are inherently unpredictable, it is actually possible to analyze and rate the "randomness" of different functions. One of the most common ways to do so is through frequency analysis: in true random numbers, the quantity of each digit and string of digits should be exactly what is predicted by statistics. This model is a great way to introduce the fact that, while DNA can seem random at first glance, deeper analysis clearly reveals how it differs from "random" data.

Problematic Pattern in Random Noise

This applet challenges users to separate the signal from the noise in partially random data. Although the data has a component that is completely random and unpredictable, it also has a determinate component that can be isolated and modeled through regression equations. There are a number of important parallels between this model and DNA sequence analysis. Like the regression data, DNA contains random or partially random components due to mutations, measurement errors, and other environmental disturbances. However, it is important not to let the noise obscure the signal beneath, which is highly meaningful and significant. Though it is not possible to use a regression equation directly on DNA, many of the same principles apply, so this model is a great way to encourage students to start thinking about separating partially random data into a random and a non-random part.