Linear Regression
In this lab you will write a program that performs a linear regression analysis on a given dataset and plots the result as an image.
Linear regression is an idea borrowed from statistics. Roughly speaking, it is a method for going from this:
[Image: a scatter plot of a dataset]
to this:
[Image: the same scatter plot with the best-fit line drawn through it]
where the line it draws is the best-fit line, the line that passes as close as possible to all of the points. It also gives you a number called the correlation coefficient, usually written r, that tells you how well the best-fit line actually fits the data.
Part 0 Load the image.ss teachpack.
Part 1 Write data definitions for the data that will be useful in this lab: some representation of a point that has x and y coordinates (I'll call it point in the rest of this lab), some representation of a list of points (which I will call a dataset), and a representation of a line in slope-intercept form (i.e., a line is defined by a slope m and an intercept b in the equation y = mx + b), which I will call an equation.
Also write examples and templates for each of these kinds of data.
(Note: you do not necessarily need to define structures for each of these. All you need to do is write down how they will be represented in your program, which might be a structure you define, a pre-existing structure, or something else entirely.)
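As an illustration of the parenthetical note above, here is one possible set of data definitions. It is only a sketch: it assumes the built-in posn structure for points and a newly defined equation structure for lines, and the example names (origin, p1, no-data, two-points, diagonal) are made up for this example.

```scheme
;; A point is a posn: (make-posn x y), where x and y are numbers.
;; (posn is built into the HtDP teaching languages.)
(define origin (make-posn 0 0))
(define p1 (make-posn 1 1.5))

;; A dataset is one of:
;;   - empty
;;   - (cons point dataset)
(define no-data empty)
(define two-points (list origin p1))

;; An equation is (make-equation m b), where m is the slope and
;; b is the intercept of the line y = mx + b.
(define-struct equation (m b))
(define diagonal (make-equation 1 0))

;; Template for functions that consume a dataset:
;; (define (dataset-fn ds)
;;   (cond [(empty? ds) ...]
;;         [else ... (posn-x (first ds)) ... (posn-y (first ds)) ...
;;               ... (dataset-fn (rest ds)) ...]))
```

Other representations (for instance, a posn reused as an equation) would work just as well, as long as you write down which one you chose.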
Part 2 Write the function best-fit-line : dataset -> equation.
The slope of the best-fit line for a particular dataset is given by the formula
$$m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$
The intercept of the best-fit line for a particular dataset is given by the formula
$$b = \frac{\sum y - m \sum x}{n}$$
In these formulas, n is the number of points in the dataset, and x and y refer to the x and y-coordinates of the points in the dataset. For instance the notation Σ x means "the sum of the x-coordinates of each point in the dataset."
To make things a little simpler, you may assume that all denominators are always non-zero.
(Note: you may write this function any way you see fit, but be warned that it might require lots of small helper functions. For instance, my version has 7 helper functions and none of them has a body larger than three lines long.)
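To give a feel for what "lots of small helper functions" might look like, here is a rough sketch written for the Intermediate Student language. It assumes points are posns, assumes the standard least-squares formulas for slope and intercept (written out in the comments), and uses hypothetical helper names (sum-of, x*y, x-squared, slope, intercept) that are not part of the lab. It is one possible decomposition, not the intended one.

```scheme
;; sum-of : (point -> number) dataset -> number
;; Adds up (f p) for every point p in the dataset.
(define (sum-of f ds)
  (cond [(empty? ds) 0]
        [else (+ (f (first ds)) (sum-of f (rest ds)))]))

;; x*y : point -> number
;; The product of a point's coordinates.
(define (x*y p) (* (posn-x p) (posn-y p)))

;; x-squared : point -> number
;; The square of a point's x-coordinate.
(define (x-squared p) (sqr (posn-x p)))

;; slope : dataset -> number
;; Assumed formula: m = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2)
(define (slope ds)
  (/ (- (* (length ds) (sum-of x*y ds))
        (* (sum-of posn-x ds) (sum-of posn-y ds)))
     (- (* (length ds) (sum-of x-squared ds))
        (sqr (sum-of posn-x ds)))))

;; intercept : dataset -> number
;; Assumed formula: b = (Sy - m*Sx) / n
(define (intercept ds)
  (/ (- (sum-of posn-y ds) (* (slope ds) (sum-of posn-x ds)))
     (length ds)))

;; An equation is (make-equation m b), representing y = mx + b.
(define-struct equation (m b))

;; best-fit-line : dataset -> equation
(define (best-fit-line ds)
  (make-equation (slope ds) (intercept ds)))
```

For the three collinear points (0,1), (1,3) and (2,5), this sketch produces a slope of 2 and an intercept of 1, which matches the line y = 2x + 1 that passes through all of them.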
Part 3 Write the function plot-dataset : dataset Nat Nat Nat Nat -> image that makes an image of the given dataset on a graph that has the given minimum and maximum x and y coordinates. Then write the function plot-regression : dataset Nat Nat Nat Nat -> image that does the same thing as plot-dataset but also draws the best-fit line on the graph. For this second task you may find the function add-line provided by the image.ss teachpack useful (look it up in Help Desk).
The outputs should look something like the images at the top of this lab. I used this dataset to generate them:
(list
 (make-posn 0 0)
 (make-posn 1 1.5)
 (make-posn 1.5 0.5)
 (make-posn 2 2.5)
 (make-posn 2.9 1)
 (make-posn 3 3.5)
 (make-posn 4 3.5)
 (make-posn 4.5 2)
 (make-posn 4.5 4))
There is a subtlety here: when plotting data it's traditional for the x axis to point to the right as x coordinates get larger and for the y axis to point up as y coordinates get larger. But in computer graphics generally and in the image.ss teachpack specifically, the y axis points down as y coordinates get larger. Furthermore, if you want your graph to be big enough to see, you'll probably want the point (1,1) to be more than one pixel away from the point (1,2). Just keep that in mind as you write this function.
Oh, and one more thing: a good way to make this problem easier for yourself is to define a function that computes a background image for the given minimum and maximum x and y coordinates, and puts the pinhole in the position that corresponds to (0,0).
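The coordinate issues described above can be handled with a pair of small conversion functions. Here is a minimal sketch; the constant SCALE and the helpers x->pixel and y->pixel are hypothetical names chosen for this example, and the value 50 is an arbitrary choice. It deliberately uses no teachpack functions, so it only covers the arithmetic, not the drawing.

```scheme
;; SCALE : the number of pixels per unit of data (an arbitrary choice)
(define SCALE 50)

;; x->pixel : number number -> number
;; The horizontal pixel offset of data coordinate x, measured
;; rightward from the left edge of a graph that starts at x-min.
(define (x->pixel x x-min)
  (* SCALE (- x x-min)))

;; y->pixel : number number -> number
;; The vertical pixel offset of data coordinate y, measured
;; downward from the top edge of a graph that ends at y-max.
;; Subtracting from y-max flips the axis so larger y is higher up.
(define (y->pixel y y-max)
  (* SCALE (- y-max y)))
```

On a graph with x-min 0 and y-max 5, the point (1,1) lands at pixel offset (50, 200) and the point (1,2) lands at (50, 150), so the two points are 50 pixels apart and the one with the larger y-coordinate is nearer the top of the image.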
Part 4 (Optional) Write the function correlation-coefficient : dataset -> number that computes how well the best-fit line explains the given dataset.
The correlation coefficient is determined by the following formula:
$$r = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\sigma_x \sigma_y}$$
where the notation $\bar{j}$ means "the average of all the values of j" (so, "the average of all the values of x," "the average of all the values of y," or "the average of all the values of x times y" as appropriate), and $\sigma_j$ means "the standard deviation of the values of j" (defined below).
The standard deviation of a set of numbers is the square root (in Scheme: sqrt) of the variance of that set. The variance of a set of numbers is the average of the squares minus the square of the averages; that is:
$$\sigma_j^2 = \overline{j^2} - \left(\bar{j}\right)^2$$
In general, data that are perfectly linear will have a correlation coefficient of 1 (or -1 if the best-fit line slopes downward), and data with no linear relationship at all will have a correlation coefficient of 0. The dataset given above has r = .746. ($r^2$ is often more useful than r itself, since you can interpret $r^2$ as the fraction of the variability in the dataset that can be explained by the best-fit line. In the above dataset, $r^2$ = .557, so a little more than half of the variation in the data can be explained by the linear hypothesis.)
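The definitions in this part translate fairly directly into code. The following is a sketch for the Intermediate Student language, assuming points are posns and assuming that r is the average of x times y, minus the product of the averages of x and y, divided by the product of the two standard deviations. The helper names average, variance and std-dev are made up for this example.

```scheme
;; average : (listof number) -> number
;; The mean of a non-empty list of numbers.
(define (average nums)
  (/ (foldr + 0 nums) (length nums)))

;; variance : (listof number) -> number
;; The average of the squares minus the square of the average.
(define (variance nums)
  (- (average (map sqr nums)) (sqr (average nums))))

;; std-dev : (listof number) -> number
;; The standard deviation: the square root of the variance.
(define (std-dev nums)
  (sqrt (variance nums)))

;; correlation-coefficient : dataset -> number
;; Assumes the dataset is non-empty and neither standard deviation is zero.
(define (correlation-coefficient ds)
  (local [(define xs (map posn-x ds))
          (define ys (map posn-y ds))]
    (/ (- (average (map * xs ys))
          (* (average xs) (average ys)))
       (* (std-dev xs) (std-dev ys)))))
```

Because sqrt may return an inexact number, results from this sketch are best compared with check-within rather than check-expect. Applied to the dataset given earlier in the lab, it should come out close to the r = .746 quoted above.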
Jacob Matthews
jacobm at cs.uchicago.edu
Hinds 026