Learning Scenario - Problematic Patterns in Random Noise (Excel)

Basic Model

Description

This system model calculates a series of random numbers and plots them on a graph with a line of best fit. The y-coordinate of the regression line, also known as y-hat or ŷ, and the deviation from the observed value are also calculated and displayed on the spreadsheet. The model attempts to teach students the idea of error in any experiment and how to calculate how far detached the actual results are from hypothetical values. Students will learn how to interpret a residuals graph and apply these ideas to different experiments. In addition, they will learn when to use and when not to use a linear fit graph and how to infer from results whether linear regression was the right choice.

Background Information

A regression line explains how close a fit experimental values reflect a model. The initial values are sent through a series of tests and equations to calculate the line of best fit. A linear regression is based off of the equation ŷ=a+bx, where b is the slope of the line ( ) and a is the y intercept ( ). The random data points in the model are run through a regression and calculated for its similarity to a linear model. The r2 value explains the variance of the data points from the linear model, while the residual graph will show the difference in the actual data points from the model. Students are able to run multiple regressions on random data points in order to understand the process that the model uses to calculate regressions.

Science/Math

The fundamental principle behind this model is HAVE = HAD + CHANGE. The random data points undergo a change through regression. The following processes and steps occur:

Data points are plotted on the graph
An equation for the best fit line in the form ŷ=a+bx is calculated with and
The r2 value is calculated
The residuals are calculated (y - ŷ) and plotted on a deviations graph

In this model, the CHANGE is represented by the regressions that the data points undergo. Initially, the user has a set of random data points, but through regression, one might find how close a relationship the data points have to a linear model. Statisticians and economists use this idea in the real world in order to predict future outcomes.

Teaching Strategies

An effective way of introducing this model to students is by first displaying a graph with data points plotted on it. The data points should resemble a linear trend. Ask students the following questions:

What are the overall trends in the data points?
Can you draw a line that is as close as possible to every data point on the line?
How close is the line to actually fitting the data set? Would you say that this line is a good predictor for what the actual trend is? Why or why not?

Following this introduction, it may be helpful to open the model and use some of the random data points to construct a line of best fit by finding the slope from the two points and solving for the y-intercept. Compare this equation with the calculated equation after the regression is run.

Implementation

How to use the model

While there are no parameters that can be changed, this model allows the user to refresh the random variables and run the regression again. In order to do both of these, simply press [Ctrl] + [=] (for Macs) or [F9] (for PC). Immediately, new data points will be calculated, run through the linear regression, and plotted on the graph.

A second sheet is also available on Sheet 2 with quadratic data that may be run through a linear regression. To get to the second sheet, click the "Quadratic" tab in the bottom left hand corner of Excel. The regression is initiated the same way as the first page.

**Note: Make sure that under Excel > preferences > calculation: be sure to select calculate sheets "Manually", to check the box marked "Limit Iteration" and set "Maximum Iterations" = 1. For more information on Excel, reference the Excel tutorial at: http://shodor.org/tutorials/excel/IntroToExcel.

Learning Objectives

Learn how a regression works and what its purpose is
Understand the coefficient of determination and how it helps to tell how close a fit the linear model is for the data
Understand residuals and how they help to tell how close a fit the linear model is for the data

Objective 1

For this objective, students should begin by studying the output on the graph. After running the regression, have students look at the equation of the best-fit line. Compare this to the class calculations from before the regression. Explain how the linear model for the graph is the set of predicted values for each value of x. The "Quadratic" sheet should provide contrast with a bad linear fit. Ask the following questions to guide the students:

How does the initial class calculation for the line of best fit compare to the equation of the line output by the computer? How does the computer calculate this equation? Why is it more accurate?
What does the line represent in general? What would a specific point on the line represent?
Interpolation is using the value on the line to predict an actual outcome within the domain of the model. Extrapolation is using the value on the line to predict an actual outcome outside the domain of the model. Which should be more accurate? Why?
Compare the fit of the line in the first graph to that of the "Quadratic" data set. Is the linear regression appropriate in both cases? Why or why not? What regressions might be used instead?

Objective 2

The coefficient of determination is calculated and displayed in the top right hand portion of the graph. To accomplish this objective, the coefficient of determination and its square root, the correlation coefficient, should be studied in relation to the visual representation of how close a fit the model is for the data. Ask the following questions:

How closely does the line appear to represent the data? How far removed are the data points from the line itself? Note the R2 value.
Run the regression again. Compare the two graphs: Which graph had data points that were further from the line? What do the R2 values seem to represent?
The name for the R2 value is the "coefficient of determination". Explain how this name is appropriate.
Take the square root of the coefficient of determination and mark down the number. Run the regression again and compare the square roots of the Coefficients of Variation. How closely do the data points of each graph seem resemble the same slope as the line?
The name for the R-value is the Correlation Coefficient. Explain how this name is appropriate.
Compare some correlation coefficients on the first sheet with those on the second sheet. How do the numbers change? Why is that?

Objective 3

This objective is focused on the deviations of the data points from the linear model. Residuals are the vertical distances between the data point and the line. This would be calculated by subtracting the y value on the line (ŷ) from the actual value (Y) at the same x point. These distances are then plotted on the residual graph with the residuals on the y-axis and the x-axis the x-value of the residual. The residual graph is most useful when comparing to the quadratic sheet. Ask the following questions:

What points are most removed from the line in the Linear Fit graph? Calculate the linear distance between the point and the line. Find point at the same x-value on the Deviations graph. What is the distance between the point and the x-axis?
What is the x-axis analogous to in this situation?
Switch to the "Quadratic" sheet and run a regression. Are there any patterns in the Deviation graph? What would this imply?
Are there any trends in the Deviations graph or do the points appear random? What would this imply?

Extensions:

Understand the concept of regressions as applied to other models, such as Quadratic regressions
Apply the idea of linear regressions to word problems
Understand the application of linear regressions to economics and the stock market

Extension 1

Ask students to think about what type of functions could fit the data in Sheet 2 (Quadratic). Review the basic formula for a quadratic equation (y=ax2+bx+c). Extend the idea of linear regressions to other models. A graphing calculator will be necessary for basic regressions. Ask students to plug in the data from Sheet 2 into their calculators and run a quadratic regression. Ask the following questions when they are appropriate:

What is the overall shape of the data points' distribution? What type of equation do you know with a shape could fit the data?
Plug the data into your graphing calculator and run a quadratic regression. What does the difference in R2 values represent?
Can you think of any real-life situations that would require a quadratic regression?

Supplemental Materials: Prentice Hall Guide to Quadratic Regressions on the TI-83 - http://www.prenhall.com/esm/app/calculator/medialib/Technology/Documents/TI-83/desc_pages/quad-reg.html

Extension 2

Studying how to interpret and manipulate regressions leads to only a partial understanding of their purpose and usefulness. Ask the students to work on word problems, just as economists and stockbrokers do. Ask the following questions after the students have attempted the word problems.

What is something you learned about the application of regressions to real life situations?
Can you think of other jobs besides those mentioned that might require the use of regressions to solve a problem?

Supplemental Materials: Multiple Linear Regression Exploration Questions - http://www.shodor.org/interactivate/media/worksheets/Multiple_Linear_Regression_Exploration_Questions

Extension 3

Ask students to research the application of regressions to everyday life. This can include fields such as economics, business, law, etc. Ask the following questions to guide the students in their explorations:

What are some fields that require regressions? What are their purposes?
Are regressions used in some part of your life that you were not aware of until now? Explain?

Related Models

Histogram

This model displays iterative data in a histogram. Students who have an understanding of regressions and their purpose may be able to use that knowledge in understanding the trends in the histogram. The model shows similarities to the normal curve. Those who have completed the Problematic Patterns in Random Noise model will be able to explain the normal curve's application to the histogram better.

Diffusion in a Box (NetLogo) - Linear

The Diffusion in a Box model simulates randomly moving gas particles and their eventual diffusion into other containers within the box. When graphing the number of particles and their movement into other boxes, there is a linear trend involved, which students who have completed the Problematic Patterns in Random Noise model will understand better than those who have not. The model presents a chemistry application for regressions and an opportunity for comparison of expected and observed results.

NCSI Talks

Problematic Patterns in Random Noise Learning Scenario (Web)