Kernel Methods

COMP4670/8600 - Statistical Machine Learning - Tutorial

Textbook Questions

These questions are hand-picked both to be of reasonable difficulty and to demonstrate what you are expected to be able to solve. Bishop labels each question $\star$, $\star\star$, or $\star\star\star$ to indicate its difficulty.

  • Question 6.1: Understand dual formulation (Difficulty $\star\star$, simple algebraic derivation)
  • Question 6.2: Dual formulation for perceptron learning algorithm (Difficulty $\star\star$, simple algebraic derivation)
  • Question 6.11: You may want to use a Taylor expansion to represent the term $\exp(\mathbf{x}^T\mathbf{x}'/\sigma^2)$ (Difficulty $\star$)
  • Question 6.12: Prove that $k\left(A_{1}, A_{2}\right)=2^{\left|A_{1} \cap A_{2}\right|}=\phi\left(A_{1}\right)^{T} \phi\left(A_{2}\right)$. (Difficulty $\star\star$)
  • Question 6.13: Use the chain rule to represent $g(\varphi(\theta), x)$ (Difficulty $\star$)
  • Question 7.6: (Difficulty $\star$, simple algebraic derivation)
  • Question 7.7: (Difficulty $\star$, simple algebraic derivation)

In this lab, we will predict the median value of houses in Boston using a quadratic basis and an equivalent kernel method.

Assumed knowledge

  • Linear regression (week 2)
  • Kernel methods (week 7 lectures)

After this lab, you should be comfortable with:

  • Applying machine learning techniques with a non-linear basis
  • Using kernel methods instead of a new basis
  • Evaluating when using a kernel method would or would not be sensible

$\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\dotprod}[2]{\langle #1, #2 \rangle}$

Setting up the environment

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

%matplotlib inline

Loading the dataset

In this lab, we will use the same dataset as in week 2, which describes the price of housing in Boston (see description). We aim to predict the value of a home from other factors. In this dataset, each row represents one house. The first column is the value of the house, which is the target to predict; the remaining columns are features, which have been normalised to lie in the range $[-1, 1]$. The corresponding column labels are

'medv', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat'

Download the dataset. Read in the data using np.loadtxt with the optional argument delimiter=',', as our data is comma-separated rather than space-separated. Remove the column containing the binary variable 'chas'.

Check that the data is as expected using print(): it should have 506 rows (examples) and 13 columns (1 label and 12 features).

Hint: use assert.

In [ ]:
names = ['medv', 'crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat']
In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
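
If you get stuck, here is a minimal sketch, assuming the downloaded file is saved as housing_scaled.csv (the filename is an assumption; adjust it to match your download):

In [ ]:
# Load the comma-separated data (the filename housing_scaled.csv is an assumption)
data = np.loadtxt('housing_scaled.csv', delimiter=',')
# Drop the binary 'chas' column, using `names` to find its index
data = np.delete(data, names.index('chas'), axis=1)
print(data.shape)
assert data.shape == (506, 13)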

Before coding: Explain kernel methods, dual representations

Please spend 5 minutes explaining to your neighbours what kernel methods and dual representations are.

Then, on a piece of paper, derive the dual representation: starting from the linear regression model with regularised sum-of-squares error $$J(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^N(\mathbf{w}^T \phi(\mathbf{x_n}) - t_n)^2 + \frac{\lambda}{2} \mathbf{w}^T\mathbf{w},$$ show that it can be rewritten as $$J(\mathbf{a}) = \frac{1}{2} \mathbf{a}^T \mathbf{K}\mathbf{K}\mathbf{a} - \mathbf{a}^T \mathbf{K}\mathbf{t} + \frac{1}{2}\mathbf{t}^T \mathbf{t} + \frac{\lambda}{2} \mathbf{a}^T\mathbf{K}\mathbf{a},$$ where $\mathbf{K} = \mathbf{\Phi}\mathbf{\Phi}^T$ with elements $K_{nm} = \phi(\mathbf{x_n})^T\phi(\mathbf{x_m})$.
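
If you get stuck, the key step (following Bishop §6.1) is to set the gradient of $J(\mathbf{w})$ to zero and express the solution as a linear combination of the feature vectors: $$\mathbf{w} = -\frac{1}{\lambda}\sum_{n=1}^N \left(\mathbf{w}^T \phi(\mathbf{x_n}) - t_n\right)\phi(\mathbf{x_n}) = \mathbf{\Phi}^T\mathbf{a}, \qquad a_n = -\frac{1}{\lambda}\left(\mathbf{w}^T \phi(\mathbf{x_n}) - t_n\right).$$ Substituting $\mathbf{w} = \mathbf{\Phi}^T\mathbf{a}$ back into $J(\mathbf{w})$ and using $\mathbf{K} = \mathbf{\Phi}\mathbf{\Phi}^T$ gives $J(\mathbf{a})$.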

Overview

This notebook contains two parts. Part one implements linear regression with a basis function, for example a polynomial basis of degree 2 as mentioned in week 2, which we call the quadratic basis function below. The basis function can be understood as a feature mapping from the raw input space into a new feature space. Part two is kernel regression, i.e. linear regression with the kernel function $k(\mathbf{x,y}) = (\dotprod{\mathbf{x}}{\mathbf{y}} + 1)^2$. We'll see that the kernel function implementation is equivalent to the quadratic basis function implementation.

Refresh: Linear regression

First remind yourself of what linear regression is, and use your code from week 2 to implement linear regression with regularisation on the Boston dataset. Use 80% of the available data to train the model by maximum likelihood with regularisation (assuming Gaussian noise); the rest of the data is allocated to the test set. Alternatively, you can use the train_test_split function from sklearn.

Report the root mean squared error (RMSE) for the training set and the test set. We'll compare these results with linear regression with a basis function, and with a kernel function, later on.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
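
A minimal sketch, assuming the target is in column 0 of data as loaded above; lam is a hand-picked regularisation strength, and, as in week 2, a bias column of ones is appended:

In [ ]:
# 80/20 train/test split of features and target
X, t = data[:, 1:], data[:, 0]
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.2, random_state=0)

def ridge_fit(Phi, t, lam):
    # Regularised least squares: w = (Phi^T Phi + lambda I)^{-1} Phi^T t
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

def rmse(Phi, t, w):
    # Root mean squared error of the predictions Phi @ w against targets t
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

def add_bias(X):
    # Append a column of ones so the model has an intercept
    return np.hstack([X, np.ones((X.shape[0], 1))])

lam = 0.1  # hand-picked; try a few values
w = ridge_fit(add_bias(X_train), t_train, lam)
print(rmse(add_bias(X_train), t_train, w), rmse(add_bias(X_test), t_test, w))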

Part 1: Linear regression with quadratic basis function

Let $X \in \RR^{n \times d}$ be the data matrix containing $n$ datapoints $\mathbf{x_i} \in \RR^{1 \times d}$. We can choose to train and test using the raw data $X$ as the input to our model, as in the method above. Alternatively, we could use $$\Phi= [\mathbf{\phi(x_1)}, \mathbf{\phi(x_2)}, ..., \mathbf{\phi(x_n)}]^T \in \RR^{n \times m},$$ where $\mathbf{\phi(x_i)}$ is some transformation of $\mathbf{x_i}$ and $m$ is the dimension of $\mathbf{\phi(x_i)}$.

For this lab, write $\mathbf{x_i} = (x_{i,1},\dots,x_{i,d})$. Let $$ \phi(\mathbf{x_i}) = (x_{i,1}^2, x_{i,2}^2, \ldots, x_{i,d}^2, \sqrt{2}x_{i,1} x_{i,2}, \ldots, \sqrt{2}x_{i,d-1}x_{i,d}, \sqrt{2}x_{i,1}, \ldots, \sqrt{2}x_{i,d}, 1). $$ Note that $\phi(\mathbf{x_i})$ consists of all quadratic functions of the elements of $\mathbf{x_i}$, together with 1 (the $\sqrt{2}$ coefficients are for normalisation, used later in the lab). We say that we are using a quadratic basis function.

Then your task is:

  1. Write a function called phi_quadratic that takes a single datapoint $\mathbf{x_i}$ as input and returns the quadratic basis function $\phi(\mathbf{x_i})$.
  2. Write a function called feature_map that takes the raw data matrix $X$ as input and returns $\Phi$, by applying phi_quadratic to each datapoint (if you get stuck, a sketch of both functions appears after the code cell below).
  3. Write a function called quadratic_lr that takes the raw data matrix $X$ as input and returns the root mean squared error (RMSE) for the training set and the test set. Inside the function, use feature_map to map the raw data into the new feature space. All other settings of the linear regression remain the same as above.
  4. Train and test with quadratic_lr, and report the RMSE for the training set and the test set.
In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
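
Here is a minimal sketch of phi_quadratic and feature_map, assuming each datapoint x is a 1-D NumPy array; the ordering of terms follows the definition of $\phi$ above:

In [ ]:
def phi_quadratic(x):
    # Quadratic basis: squares, sqrt(2)*cross terms, sqrt(2)*linear terms, constant 1
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([x ** 2, cross, np.sqrt(2) * x, [1.0]])

def feature_map(X):
    # Stack phi_quadratic of each row of X to form Phi
    return np.array([phi_quadratic(x) for x in X])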

Part 2: Linear regression with a kernel

Computing a basis transformation as an inner product

Define $k(\mathbf{x,y}) = (\dotprod{\mathbf{x}}{\mathbf{y}} + 1)^2$, where $\mathbf{x,y} \in \mathbb{R}^2$. One way to verify that $k(\mathbf{x,y})$ is a kernel function is to write it as an inner product of a vector-valued function evaluated at $\mathbf{x}$ and $\mathbf{y}$. That is, show that $k(\mathbf{x}, \mathbf{y}) = \dotprod{\phi(\mathbf{x})}{\phi(\mathbf{y})}$ and specify $\phi(\mathbf{x})$ and $\phi(\mathbf{y})$.

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---
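
As a check on your expansion (writing $\mathbf{x} = (x_1, x_2)$ and $\mathbf{y} = (y_1, y_2)$): $$(x_1 y_1 + x_2 y_2 + 1)^2 = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 x_2 y_1 y_2 + 2 x_1 y_1 + 2 x_2 y_2 + 1,$$ which is $\dotprod{\phi(\mathbf{x})}{\phi(\mathbf{y})}$ for $\phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2} x_1 x_2, \sqrt{2} x_1, \sqrt{2} x_2, 1)$, i.e. the $d = 2$ case of the quadratic basis from Part 1.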

Then convince yourself that, for $X \in \RR^{N \times D}$ and $Y \in \RR^{M \times D}$, the matrix $K = (X Y^T + 1)^2 \in \RR^{N \times M}$ (where the addition and the square are element-wise) contains the kernel function evaluated between each pair of datapoints of $X$ and $Y$: $$K_{nm} = \phi(\mathbf{x_n})^T \phi(\mathbf{y_m}) = k(\mathbf{x_n}, \mathbf{y_m})$$

Note that here, to stay consistent with the slides, $\phi(\mathbf{x_n})$ is a column vector, whereas it was a row vector in the previous section's notation.

Kernelised Regression

Write a function called kernel_quadratic which takes $X$ and $Y$ as input and returns $K$.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
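
Given the identity above, this can be a one-liner (a sketch, assuming X and Y are 2-D NumPy arrays with the same number of columns):

In [ ]:
def kernel_quadratic(X, Y):
    # (x_n . y_m + 1)^2 for every pair of rows of X and Y, computed element-wise
    return (X @ Y.T + 1) ** 2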

Write a function called kernelised_lr with the raw data matrix $X$ as input and the root mean squared error (RMSE) for the training set and the test set as output. Inside the function, use kernel_quadratic to apply the dual representation. All other settings of the linear regression remain the same as above.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
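
A minimal sketch of the dual solution, reusing the train/test split and lam from earlier (the signature here differs slightly from the task description, which asks for the raw data matrix as input); minimising $J(\mathbf{a})$ gives $\mathbf{a} = (\mathbf{K} + \lambda \mathbf{I}_N)^{-1}\mathbf{t}$:

In [ ]:
def kernelised_lr(X_train, t_train, X_test, t_test, lam):
    # Dual solution: a = (K + lambda I)^{-1} t, with K the Gram matrix of the training set
    K = kernel_quadratic(X_train, X_train)
    a = np.linalg.solve(K + lam * np.eye(len(t_train)), t_train)
    # Predictions: y(x) = sum_n a_n k(x, x_n)
    pred_train = K @ a
    pred_test = kernel_quadratic(X_test, X_train) @ a
    def rmse(pred, t):
        return np.sqrt(np.mean((pred - t) ** 2))
    return rmse(pred_train, t_train), rmse(pred_test, t_test)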

Time complexity

Compare the two methods above (quadratic basis function and kernel method) in terms of time complexity.

In terms of time complexity, is using a kernel method suitable in this case? What are the potential advantages of using a kernel method? Disadvantages?
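
As a starting point (a hint, not a complete answer): with $n$ training points and $d$ raw features, the quadratic basis has $m = \binom{d}{2} + 2d + 1 = O(d^2)$ dimensions, so solving the primal normal equations costs roughly $O(nm^2 + m^3)$, while the dual requires building and inverting an $n \times n$ Gram matrix at roughly $O(n^2 d + n^3)$. Consider which of $n$ and $m$ is larger for this dataset.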

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]: