Regression

COMP4670/8600 - Statistical Machine Learning - Tutorial

In this lab, we will use linear regression to predict the value of a home and explore the impact of regularisation.

Assumed knowledge

  • Maximum likelihood solution to a linear regression problem, with and without regularisation (lectures)
  • Matrix calculations in numpy (lab and precourse material)
  • Theory behind regularisation (lectures)

After this lab, you should be comfortable with:

  • Practical linear regression problems
  • Picking an appropriate regularisation parameter for a given problem

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up the environment

In [ ]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

The data set

We will use a dataset on the price of housing in Boston (see description). We aim to predict the value of a home from other factors. In this dataset, each row is one house. The first entry is the value of the house, and we will predict it from the remaining values, which have been normalised to be in the range $[-1, 1]$. The column labels are:

'medv', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat'

Download the dataset. Read in the data using np.loadtxt with the optional argument delimiter=',', as our data is comma separated rather than space separated. Remove the column containing the binary variable 'chas'.
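As a rough illustration of the loading step, the sketch below reads comma-separated text with np.loadtxt and drops the 'chas' column by index. The three rows of numbers are made up and held in an in-memory buffer, since the real filename depends on where you saved the download; with the real file you would pass its path to np.loadtxt instead.

```python
import io
import numpy as np

names = ['medv', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
         'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']

# Made-up stand-in for the downloaded file: 3 rows of 14 comma-separated values.
rng = np.random.default_rng(0)
fake_rows = rng.uniform(-1, 1, size=(3, len(names)))
csv_text = "\n".join(",".join(f"{v:.6f}" for v in row) for row in fake_rows)

# np.loadtxt accepts either a filename or a file-like object.
raw = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# Drop the 'chas' column by index and keep the labels in step with the data.
chas_idx = names.index('chas')
data = np.delete(raw, chas_idx, axis=1)
kept_names = [n for n in names if n != 'chas']

print(raw.shape, '->', data.shape)
```

Looking up the index with names.index('chas') rather than hard-coding it keeps the column labels and the array aligned if the column order ever changes.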

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Inspect the data using print() and check that it is as expected: it should have 506 rows (examples) and 13 columns (1 label and 12 features).

Hint: use assert.
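One possible shape check is sketched below. The helper name check_housing_shape is hypothetical, and the array of zeros is only a placeholder standing in for the data you loaded.

```python
import numpy as np

def check_housing_shape(data, n_rows=506, n_cols=13):
    """Raise AssertionError if the array does not look as expected."""
    assert data.ndim == 2, f"expected a 2-D array, got {data.ndim} dimensions"
    assert data.shape == (n_rows, n_cols), f"unexpected shape {data.shape}"
    assert np.all(np.isfinite(data)), "data contains NaN or infinite values"

# Placeholder with the expected shape, standing in for the loaded data.
placeholder = np.zeros((506, 13))
check_housing_shape(placeholder)
print(placeholder.shape)
```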

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Regression without regularisation

Implement a function to find the maximum likelihood solution $w_{ML}$ assuming Gaussian noise for this linear regression problem. Remember from the lectures that this is equivalent to a linear regression problem with the cost function set as the sum of squares error.
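A minimal sketch of one way to compute the sum-of-squares minimiser is given below. It uses np.linalg.lstsq rather than forming the inverse of $\Phi^T \Phi$ explicitly, which is a design choice made here and not something the lab prescribes. The function name w_ml and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def w_ml(Phi, t):
    """Weights minimising the sum of squares error ||Phi @ w - t||^2."""
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Synthetic check: with noise-free targets the true weights are recovered.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
t = Phi @ w_true

w_hat = w_ml(Phi, t)
print(w_hat)
```

Note that this sketch fits no bias term. If you want one, a common approach is to append a column of ones to the design matrix before solving.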

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Training and testing

Use a fifth of the available data for training the model using maximum likelihood. The rest of the data is allocated to the test set. Report the root mean squared error (RMSE) for the training set and the test set. In this case, use the identity map as the basis function, $\phi(x)=x$.

Note that the data may be sorted or ordered in some way we cannot predict. How will you account for this?
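One way to guard against ordered data is to shuffle the row indices before splitting. The sketch below does this with a seeded random permutation and also defines an RMSE helper. The function names, the seed and the toy array are all assumptions for illustration.

```python
import numpy as np

def split_one_fifth(data, seed=0):
    """Shuffle the rows, then return (train, test) with a fifth used for training."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(data.shape[0])
    n_train = data.shape[0] // 5
    return data[perm[:n_train]], data[perm[n_train:]]

def rmse(Phi, t, w):
    """Root mean squared error of the predictions Phi @ w against targets t."""
    residual = Phi @ w - t
    return np.sqrt(np.mean(residual ** 2))

# Toy data that is deliberately sorted, to show that the split ignores the order.
toy = np.arange(100, dtype=float).reshape(50, 2)
train, test = split_one_fifth(toy)
print(train.shape, test.shape)
```

Fixing the seed makes the split reproducible, so that the errors you report do not change between runs.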

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Interpreting the model

Find the feature with the biggest magnitude of weight. Using matplotlib (docs for matplotlib.pyplot.plot), create a plot of this feature against the label for the datapoints in the training set. In a different colour, plot this feature against the predicted label. Create a similar plot for the test data.
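The plot described above could be sketched as follows. The weights and data here are synthetic placeholders, not results from the housing data. The index of the largest-magnitude weight is found with np.argmax on the absolute values.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the training features, weights and labels.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(40, 3))
w = np.array([0.3, -2.0, 1.1])  # made-up weights
t_train = X_train @ w + 0.1 * rng.normal(size=40)

# Feature whose weight has the largest magnitude.
k = int(np.argmax(np.abs(w)))
pred_train = X_train @ w

fig, ax = plt.subplots()
ax.plot(X_train[:, k], t_train, 'b.', label='label')
ax.plot(X_train[:, k], pred_train, 'r.', label='predicted label')
ax.set_xlabel(f'feature {k}')
ax.set_ylabel('target')
ax.legend()
```

Marker-only format strings such as 'b.' avoid drawing lines between points, which would be misleading because the rows are not sorted by feature value.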

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Regression with regularisation

Implement a function to find the maximum likelihood solution $w_{reg}$ for some regularisation parameter $\lambda > 0$ assuming Gaussian noise for this linear regression problem.
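A possible closed-form sketch is shown below, assuming the regularised error is written as $\frac{1}{2}\Norm{\Phi w - t}^2 + \frac{\lambda}{2}\Norm{w}^2$, so that $w_{reg}$ solves $(\lambda I + \Phi^T \Phi) w = \Phi^T t$. If your lecture notes scale the penalty differently, $\lambda$ changes by a constant factor. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def w_reg(Phi, t, lam):
    """Solve (lam * I + Phi^T Phi) w = Phi^T t for w."""
    n_features = Phi.shape[1]
    A = lam * np.eye(n_features) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

# Synthetic data: the weight norm shrinks as lam grows.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 4))
t = Phi @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=30)

norms = [np.linalg.norm(w_reg(Phi, t, lam)) for lam in (0.01, 1.1, 100.0)]
print(norms)
```

Using np.linalg.solve avoids computing an explicit matrix inverse, which is generally the more numerically stable option.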

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

By calculating the RMSE on the training and test sets, evaluate the impact of regularisation for $\lambda = 1.1$.

What is the effect of regularisation?

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Picking a regularisation parameter

You will now explore picking a good regularisation parameter.

What would you expect to see if you were under-regularising (so the parameter was too small)? Over-regularising? Discuss with a partner.

Plot the RMSE on the training and test sets against the regularisation parameter $\lambda$ for a range of values of $\lambda$. What is a good range of values of $\lambda$ to check? What do you think is the best value?

Hint: You may find you want to plot against $\log(\lambda)$. The functions np.arange and np.linspace could be useful here (use whichever you think is more applicable).
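As a sketch of such a sweep, the code below builds a logarithmically spaced grid by exponentiating an np.linspace range and records the training and test RMSE at each value. The data are synthetic, the grid limits are arbitrary, and the helpers repeat the closed form assumed for the regularised solution, so the numbers say nothing about the housing data.

```python
import numpy as np

def w_reg(Phi, t, lam):
    """Regularised solution of (lam * I + Phi^T Phi) w = Phi^T t."""
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ t)

def rmse(Phi, t, w):
    """Root mean squared error of the predictions Phi @ w."""
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

# Synthetic problem with few training points, so regularisation matters.
rng = np.random.default_rng(0)
n_features = 8
w_true = rng.normal(size=n_features)
X_train = rng.uniform(-1, 1, size=(15, n_features))
X_test = rng.uniform(-1, 1, size=(200, n_features))
t_train = X_train @ w_true + 0.5 * rng.normal(size=15)
t_test = X_test @ w_true + 0.5 * rng.normal(size=200)

# Evenly spaced exponents give a grid that is uniform in log(lambda).
lambdas = 10.0 ** np.linspace(-4, 4, 17)
train_err = np.array([rmse(X_train, t_train, w_reg(X_train, t_train, lam))
                      for lam in lambdas])
test_err = np.array([rmse(X_test, t_test, w_reg(X_train, t_train, lam))
                     for lam in lambdas])

best_lambda = lambdas[np.argmin(test_err)]
for lam, tr, te in zip(lambdas, train_err, test_err):
    print(f"{lam:10.4f}  train={tr:.3f}  test={te:.3f}")
```

To plot the curves, pass np.log10(lambdas) as the horizontal axis so that small and large values of $\lambda$ are both visible.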

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Basis Functions

We want to use basis functions to improve our performance. Implement subroutines for a polynomial basis function of degree 2 (see the feature map based on the binomial formula).
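One possible degree-2 feature map is sketched below: a constant term, the original features, and all products $x_i x_j$ with $i \le j$. This is an assumption about the intended map. In particular, the version derived from the binomial formula may scale some terms (for example by $\sqrt{2}$), which this sketch omits. The function name poly2_features is hypothetical.

```python
import numpy as np

def poly2_features(X):
    """Map each row x to (1, x_1..x_d, x_i * x_j for all i <= j)."""
    X = np.atleast_2d(X)
    n, d = X.shape
    i, j = np.triu_indices(d)      # index pairs with i <= j
    ones = np.ones((n, 1))
    quad = X[:, i] * X[:, j]       # all degree-2 monomials
    return np.hstack([ones, X, quad])

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(poly2_features(X))
```

For $d$ input features this gives $1 + d + d(d+1)/2$ columns, so the 12 housing features would expand to 91.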

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Apply this to your train and test sets, and repeat the above exercise with these new features. Report what differences you see.

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate