The augmented data includes the following variables for 506 census tracts in Boston and near-by towns:
Linear regression fits a line to a set of data. The figure below, for example, shows a scatter plot of the average number of rooms per dwelling versus median price, along with a regression line for the same data (roughly, f(x) = 9.10*x - 34.66). We refer to the pair (9.10, -34.66), the slope and the intercept, as the model created by the regression. We can use this model to predict the value of y, given x. How accurate is it? That is, how close does f(xi), for some xi in the data set, come to the actual value of yi?
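As a small illustration of applying such a model, the sketch below evaluates the example line at one point and measures how far the prediction is from the actual value. The function name predict and the data point (6 rooms, median price 24.0) are hypothetical and are not part of the provided code; the model is written as (slope, intercept) with the intercept's sign included.

```python
def predict(model, x):
    """Apply a single-variable linear model (a, b): f(x) = a*x + b."""
    a, b = model
    return a * x + b


# The example line from the text: slope 9.10, intercept -34.66.
model = (9.10, -34.66)

# Hypothetical data point (not taken from the real data set).
x_i, y_i = 6.0, 24.0

prediction = predict(model, x_i)
error = prediction - y_i  # predicted value minus actual value

print("predicted:", round(prediction, 2))
print("error:", round(error, 2))
```

A negative error here means the line under-predicts the price for this made-up point.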
Getting started
We have seeded your repository with a directory named hw7. It contains a file named housing.py that you will modify for this assignment and a directory named data that contains two files: boston-training.csv and boston-test.csv. Use the data in boston-training for Tasks 1-4. The test data will only be used in Task 5.
To simplify your tasks, we have provided code for performing a linear regression and for applying the resulting model. See this API for details.
Task 1: Building and Visualizing a Model
Write a function singleVariable(data, x, y) that takes the training data and two column numbers (for example, column 0 holds the town number for each entry and column 16 holds the median price of a house) and performs three tasks: (1) produces a scatter plot of the (xi, yi) pairs, (2) computes a linear regression for predicting y from x, and (3) plots the resulting line.
Use your function to compare three different variables of your choice to the median value of a home (MEDV).
Task 2: Evaluating Several Models Using R2 Values
The line shown in our example above is not a particularly accurate model. What does it mean for a model to be accurate? For each xi we now have the true yi and a computed y*i = a*xi + b. The error for any particular data point is the difference between the predicted value for y and the actual value: ei = y*i - yi. To compute the total error over all i we could sum up the ei, but then positive and negative errors might cancel, and a potentially poor model might have a small total error. Instead, we sum the squares of the errors: Σi ei^2. This total, however, grows as we add more points, even if the model fits them equally well. We can normalize it by the sum of the squares of the deviations of the yi from their mean to achieve a good, normalized evaluation of how well a model reflects the underlying data.
This concept is called the coefficient of determination, or more commonly the R2 value, and is a measure of the "goodness" of the fit. R2 is a value between 0 and 1, where 1 indicates a perfect fit (zero error) and 0 represents no correlation between the data and the model. For a more precise discussion (and an exact formula) see the appropriate Wikipedia article. The R2 value for our sample line is 0.51 (corrected from the original version), so it is not a particularly good model for the data.
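The description above can be sketched in code using the common definition R2 = 1 - (sum of squared errors) / (sum of squared deviations of y from its mean). This is only a sketch for a single-variable model given as (slope, intercept); the name r_squared and the sample numbers are hypothetical, and the assignment may expect a different interface.

```python
def r_squared(model, xs, ys):
    """Coefficient of determination for a single-variable model (a, b)."""
    a, b = model
    mean_y = sum(ys) / len(ys)
    # Sum of squared errors between predictions and actual values.
    ss_res = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
    # Sum of squared deviations of the actual values from their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

perfect = r_squared((2, 1), xs, ys)    # the line passes through every point
mean_only = r_squared((0, 6), xs, ys)  # always predicts the mean of ys

print(perfect, mean_only)
```

A model that always predicts the mean of y scores 0, and a line through every point scores 1, which matches the two ends of the scale described above.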
Write a function computeR2(model, xs, ys) that computes and returns the R2 value for the given model. Use this function to generate a table of R2 values for how well each variable (on its own) predicts the median value (MEDV).
From this table, you will see that none of the variables individually are particularly good predictors for median housing value. In subsequent tasks, we will construct models using multiple variables. For example, we could predict median housing price using a linear combination of the number of rooms and the crime rate. That is, we want to find values for a, b, c in the equation f(x0,x1) = a*x0 + b*x1 + c that will allow us to predict median price (f(x0, x1)) from the average number of rooms (x0) and the crime rate (x1) accurately.
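To make the equation f(x0, x1) = a*x0 + b*x1 + c concrete, here is a dependency-free sketch that finds a, b, and c by least squares using the normal equations. It stands in for the regression code provided with the assignment, whose interface is not shown here; the names solve, fit_linear, and predict_multi, and the data, are all hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear(rows, ys):
    """Least-squares coefficients for the predictors in rows; intercept last."""
    X = [list(r) + [1.0] for r in rows]  # extra column of ones for c
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


def predict_multi(coeffs, row):
    """Evaluate a*x0 + b*x1 + ... + c for one row of predictor values."""
    return sum(w * x for w, x in zip(coeffs, row)) + coeffs[-1]


# Made-up (x0, x1) pairs with y generated from y = 2*x0 - 3*x1 + 5.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
ys = [2 * x0 - 3 * x1 + 5 for x0, x1 in rows]

coeffs = fit_linear(rows, ys)
print([round(c, 6) for c in coeffs])
```

Because the made-up data has no noise, the fitted coefficients should recover the generating values up to floating-point error.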
Task 3: Building and Selecting Bivariate Models
In Task 2 we found that none of our single-variable models predicted housing prices very accurately (they all had low R2 values). In this and future sections we're going to build more complex models that use multiple variables to predict median housing prices.
Discover the best 2-variable model to predict median housing prices. Use two nested for loops to go through all pairs of variables, create a model for each pair, and store the models in a list. Evaluate each of the models and find which one works best. How does it compare to the single-variable models in Task 2? Which pair of variables performs best together? Does this make sense given the results from Task 2?
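The nested-loop search over pairs might be organized as in the sketch below. To keep it self-contained, the model building and R2 evaluation are hidden behind a score function passed in as a parameter; best_pair, toy_score, and the scores in the lookup table are hypothetical placeholders, not results from the housing data.

```python
def best_pair(columns, score):
    """Try every unordered pair of columns; return (best, all_results)."""
    results = []
    for i in range(len(columns)):
        # Start at i + 1 so each pair is visited once and no column
        # is paired with itself.
        for j in range(i + 1, len(columns)):
            pair = (columns[i], columns[j])
            results.append((score(pair), pair))
    return max(results), results


# Placeholder scores standing in for "fit a model, compute its R2".
made_up_scores = {(5, 9): 0.55, (5, 10): 0.30, (9, 10): 0.52}


def toy_score(pair):
    return made_up_scores[pair]


best, all_results = best_pair([5, 9, 10], toy_score)
print(best)
print(len(all_results), "pairs evaluated")
```

Starting the inner loop at i + 1 halves the work compared with looping over all ordered pairs, since the pair (x, y) yields the same model as (y, x).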
To evaluate these models you may have to enhance your computeR2 function from Task 2 to handle multivariable models. Make sure that it still works on the single-variable models from Task 2.
Before we move on, make sure that your computeR2 function is sufficiently general by building a model that uses all of the variables (except median housing values of course). What is the complete model's R2 value? How about the model that uses the columns ZN (5), RM (9), and AGE (10)?
Task 4: Building Models of Arbitrary Complexity
How do we determine how many and which variables will generate the best model? It turns out that this question is unsolved and is of interest to both computer scientists and statisticians. To find the best model of three variables we could repeat the process in Task 3 with three nested for loops instead of two. For k variables we would have to use k nested for loops. Nesting for loops this way quickly becomes computationally expensive, and infeasible for even the most powerful computers. How then do we find optimal models of several variables? Keep in mind that models in genetics research easily reach into the thousands of variables.
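To get a feel for how quickly exhaustive search grows, the sketch below counts how many distinct sets of k variables can be chosen from a pool of candidates. The pool sizes are illustrative only, and count_models is a hypothetical helper name.

```python
import math


def count_models(num_variables, k):
    """Number of distinct k-variable subsets of num_variables candidates."""
    return math.comb(num_variables, k)


# Illustrative pool sizes: a small data set and a genetics-scale one.
for n in (10, 1000):
    for k in (2, 3, 10):
        print(n, "variables, k =", k, "->", count_models(n, k), "models")
```

With 10 candidates there are 45 pairs and 120 triples, which is easy to enumerate. With 1000 candidates there are already 166,167,000 triples, and the count for k = 10 is far beyond what can be enumerated in practice.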
We'll get around this open question by using heuristics. We are going to split this task into two pieces: (1) determine the best K variables to use for a fixed K and (2) determine the best value for K.
How do we determine which K variables will yield the best model for a fixed K? The naive approach, testing all possible combinations as in Task 3, is infeasible for large K. A simpler alternative is to choose the K variables with the highest R2 values in the table from Task 2. Another approach is to start with a set that contains the variable with the highest R2 value and then repeatedly add variables until the set contains K variables: at each step, pick the unused variable that, when added to the set of variables in the model, yields the model with the best R2 value, and add it to the set. What other method could you use to pick the variables?
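The step-by-step approach described above is often called greedy forward selection. Below is a minimal sketch of it, again with model building and R2 evaluation abstracted into a score function; greedy_select, toy_score, and the weights are hypothetical, and the toy scoring rule is invented purely to show how the greedy choice can differ from simply taking the K best individual variables.

```python
def greedy_select(candidates, k, score):
    """Grow a variable set one at a time, always adding the best next one."""
    chosen = []
    remaining = list(candidates)
    while len(chosen) < k and remaining:
        # Score each unused variable together with the ones already chosen.
        best = max(remaining, key=lambda v: score(chosen + [v]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Invented scoring rule: each variable has an individual weight, but
# variables 1 and 2 carry overlapping information, so using both
# together is penalized.
weights = {1: 0.5, 2: 0.45, 3: 0.2}


def toy_score(variables):
    total = sum(weights[v] for v in variables)
    if 1 in variables and 2 in variables:
        total -= 0.4
    return total


print(greedy_select([1, 2, 3], 2, toy_score))
```

In this toy setup the two best individual variables are 1 and 2, but the greedy search picks 1 and then 3, because 2 adds little once 1 is already in the set. Greedy selection evaluates far fewer models than exhaustive search, at the cost of possibly missing the best set.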
Write a function, discoverModel(data, K), that generates a model for predicting median house value from the training data using the specified number of variables, K. Your function should return a tuple containing the model, its R2 value, and a list of the variables (column numbers) used to build it. You can use any method you like for choosing the variables. You must explain your method and your rationale for choosing it in comments. We will assign extra credit for creative approaches.
Write a function that calls discoverModel(data,K) with different values of K (1 < K < M-1) to determine how many and which variables yield the best regression for predicting median housing value given the input data. Plot the optimum R2 values as a function of K. How does R2 behave as we increase the complexity of our model (increase K)? This function should return the tuple for the model with the best R2 value.
Task 5: Training vs. Test Data
Until now, you have evaluated models using the training data that was used to create them. The resulting model may be quite good for that particular training dataset, but it may not be particularly good at predicting novel data. We call this "over-fitting". The file boston-test.csv contains housing data from the same source that was not included in boston-training.csv. Evaluate your model using this test data as in Task 4. How do the R2 values compare? Try plotting the R2 values from the training and test data on the same plot.
The more complex you make your model, the more it is able to fit itself to the training data. This is why we see the R2 values increase with K on the training set. While some complexity is good, it is often best to find the simplest possible model that explains most of the data. Adding several variables to your model for only a slight increase in accuracy (R2) is usually a poor idea.
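The comparison between training and test performance can be sketched with a single-variable model: fit on one set of points, then compute R2 on both that set and a held-out set. All numbers below are made up, and fit_line and r_squared are hypothetical helpers rather than the provided regression code.

```python
def fit_line(xs, ys):
    """Closed-form least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def r_squared(model, xs, ys):
    """R2 = 1 - (sum of squared errors) / (sum of squared deviations)."""
    a, b = model
    mean_y = sum(ys) / len(ys)
    ss_res = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up training and test points.
train_xs, train_ys = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
test_xs, test_ys = [5, 6, 7], [9.0, 13.0, 12.5]

model = fit_line(train_xs, train_ys)  # the model only ever sees training data
train_r2 = r_squared(model, train_xs, train_ys)
test_r2 = r_squared(model, test_xs, test_ys)

print("train R2:", round(train_r2, 3))
print("test R2:", round(test_r2, 3))
```

In this toy example the model scores noticeably lower on the held-out points than on the points it was fitted to. Note that with this definition, R2 on data the model was not fitted to can even be negative, which means the model does worse than always predicting the mean.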
Go:
Each task asked you to write a collection of functions and compute various quantities. Your solution should include a function called go(trainingFn, testingFn) that takes two filenames, one for the training data and the other for the testing data, and contains calls to the appropriate functions to generate the requested values.
Your program should read each file once and only once.
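One way to satisfy the read-once requirement is to load each file into memory at the start of go and pass the resulting lists to every function that needs them. The sketch below assumes, without confirmation from the assignment, that each CSV file starts with a header row and that every remaining field is numeric; parse_rows and load_rows are hypothetical helpers, and having go return the loaded data is only for illustration.

```python
import csv


def parse_rows(lines):
    """Parse CSV text lines into (header, rows of floats).

    Assumes a header row followed by all-numeric rows (an assumption,
    not something stated in the assignment).
    """
    reader = csv.reader(lines)
    header = next(reader)
    rows = [[float(v) for v in row] for row in reader if row]
    return header, rows


def load_rows(filename):
    """Open a file exactly once and parse its contents."""
    with open(filename, newline="") as f:
        return parse_rows(f)


def go(trainingFn, testingFn):
    # Read each file a single time, up front.
    _, training = load_rows(trainingFn)
    _, testing = load_rows(testingFn)
    # Pass `training` and `testing` to the task functions from here on
    # instead of re-opening the files.
    return training, testing


# In-memory demonstration with made-up lines instead of a real file.
header, rows = parse_rows(["A,B", "1,2.5", "3,4"])
print(header, rows)
```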
Make sure your names are at the top of housing.py and then check your code into PhoenixForge.
We strongly recommend that you check your code into PhoenixForge at regular intervals!