Lagrange multipliers: PCA and SVM¶

COMP4670/8600 - Statistical Machine Learning - Tutorial¶

In this lab we will apply Lagrange multipliers to solve and implement both principal component analysis (PCA) and support vector machines (SVM).

Textbook Questions¶

These questions are hand picked to both be of reasonable difficulty and demonstrate what you are expected to be able to solve. The questions are labelled in Bishop as either $\star$, $\star\star$, or $\star\star\star$ to rate its difficulty.

Question 11.1: (Difficulty $\star$, simple algebraic derivation)
Question 11.2: (Difficulty $\star$, simple algebraic derivation)
Question 11.3: (Difficulty $\star$, simple algebraic derivation)
Question 11.4: This question is easier if you've completed the earlier 3 questions. Do not attempt first.(Difficulty $\star \star$, simple algebraic derivation)
Question 11.5: This question will give you a good understanding of how the co-variance matrix works. (Difficulty $\star$, simple algebraic derivation)
Question 11.7: This is basically just another version of Q11.3 (Difficulty $\star$, simple algebraic derivation)
Question 11.9: This is a fun question to verify your understanding. Do this last. Basically, you can invent a dataset and apply your python program to it (Difficulty $\star \star$, but in reality this is easy if you understand rejection sampling enough to implement it)
Question 11.10: (Difficulty $\star$, simple algebraic derivation)

Assumed knowledge:¶

Optimisation in Python (lab)
PCA (lecture)
SVM (lecture)

After this lab, you should be comfortable with:¶

Applying Lagrange multipliers to optimisation problems
Implementing solutions with Lagrange multipliers

In [ ]:

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt

%matplotlib inline

Data loading¶

For this lab we will use a spam email dataset. This dataset contains 57 features which are word and character frequencies, as well as whether an email is spam or not. The last column of the dataset is the label, which indicates whether or not something is spam.

Load and shuffle the data. Currently, the labels are an element of {0, 1}. Convert the labels into {-1 ,1} (as preparation for SVM). Then normalise the data to have zero mean and unit variance.

In [ ]:

import csv

# Load the data into features and labels.
data = np.genfromtxt('spambase.csv', delimiter=',',skip_header=1)
indices = np.arange(len(data))
np.random.shuffle(indices)
features = np.nan_to_num(data[indices, :-1])
# Convert binary labels into {-1, 1} (for the SVM later in the lab).
labels = data[indices, -1].astype(bool) * 2 - 1

# Normalise the data.
features -= features.mean(axis=0, keepdims=True)
features /= features.std(axis=0, keepdims=True)

N = len(features)

features.shape

Lagrange multipliers¶

One way to solve optimisation problems with equality constraints is to use Lagrange multipliers. We can write such optimisation problems in the following form:

\begin{align*} \underset{x}{\mathrm{maximise}}&\ f(x)\\ \mathrm{such\ that}&\ h(x) = 0. \end{align*}

We have an objective function $f : \mathbb{R}^n \to \mathbb{R}$, a set of (vectorised) equality constraints $h : \mathbb{R}^n \to \mathbb{R}^m$, and an optimisation parameter $x \in \mathbb{R}^n$.

Lagrange multipliers are additional parameters $\lambda \in \mathbb{R}^m$, which we introduce by defining the Lagrange function $\mathcal L$ of this problem. The Lagrange function is

$$ \mathcal L(x, \lambda) = f(x) - \lambda \cdot h(x). $$

Once we have a Lagrange function, we then need to find its stationary points, i.e. $\nabla \mathcal L_{x, \lambda}(x, \lambda) = 0$.

For an inequality constraint, we require our multipliers to be $\lambda \geq 0$.

Spend a few minutes dicuss with your neighbour why solving the lagrangian is equivalent to solving the original optimisation problem.

1. Principal component analysis¶

In PCA, we have $D$-dimensional observations $x_1, \dots, x_N \in \mathbb{R}^D$. We want to find a 1-dimensional subspace with maximum variance. We can represent this subspace by a normal unit vector $u \in \mathbb R^D$.

We can also write the set of observations in matrix form: $X \in \mathbb R^{N \times D}$.

Each observation is projected onto the subspace: for an observation $x$, the projected observation is $u^Tx$. The variance of the projected data is $u^TSu$ with covariance matrix $S = \frac{1}{N} \sum_{i = 1}^N \left(x_i - \bar x\right) \left(x_i - \bar x\right)^T$ where $\bar x$ is the mean observation.

We want to maximise the variance of observations projected onto $u$.

(If necessary, refer to week 6 lab for more detail)

1a.¶

Write down the optimisation problem in the form shown above, clearly showing the objective function and equality constraints.

Answer¶

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

1b.¶

Write down the Lagrange function for this problem.

Answer¶

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

1c.¶

By differentiating the Lagrange function with respect to $u$ and $\lambda$ and equating the derivatives to zero, show that the stationary points are unit eigenvectors of $S$.

Answer¶

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

1d.¶

Find the stationary point that maximises the variance.

Answer¶

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

1e.¶

In general, we can have a $k$-dimensional subspace. The $k$ principal components are the $k$ eigenvectors associated with the $k$ largest eigenvalues.

Using np.linalg, find the first two principal components of the spam email features, project the data onto these components, then scatter plot these components with matplotlib. Colour the points in the plot by their label.

In [ ]:

# replace this with your solution, add and remove code and markdown cells as appropriate

2. Support vector machines¶

Support vector machines (SVM) are a maximum-margin classifier. From the lectures (7B slides 'Maximum Margin Classifiers'), we can write an SVM as a quadratic problem:

\begin{align*} &\underset{w}{\mathrm{minimise}}\ \frac{1}{2} \langle w, w \rangle\\ &\mathrm{such\ that}\ t_i \langle w, x_i \rangle - 1 \geq 0, i = 1 \dots N \end{align*}

where we have data points $x_1, \dots, x_N$ with associated labels $t_1, \dots, t_N \in \{-1, 1\}$.

What assumption are we making about the feature map? Note bias is ignored as it does not affect the dual representation.

2a.¶

Write down the Lagrange function for this problem. Note the minimise. (7B slide 'Lagrange Multipliers for General Case')

Answer¶

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

2b.¶

By differentiating, find the stationary points of the Lagrange function. Use these as constraints to eliminate $w$ in the Lagrange function and hence show that the dual representation of the SVM is:

\begin{align*} &\underset{\lambda}{\mathrm{maximise}}\ -\frac{1}{2}\sum_{i = 1}^N \sum_{j = 1}^N \lambda_i \lambda_j t_i t_j \langle x_i, x_j \rangle + \sum_{i = 1}^N \lambda_i\\ &\mathrm{such\ that}\ \lambda_i \geq 0, i = 1 \dots N \end{align*}

Answer¶

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

2c.¶

The dual representation for this problem in the non-separable case, which we won't derive here, is:

for box parameter $C > 0$. We can vectorise this by introducing the matrix $Q$ such that

$$ Q_{nm} = t_n t_m \langle x_n, x_m \rangle. $$

The dual representation becomes

\begin{align*} &\underset{\lambda}{\mathrm{maximise}}\ -\frac{1}{2} \lambda^T Q \lambda + 1^T \lambda\\ &\mathrm{such\ that}\ 0 \leq \lambda_i \leq C, i = 1 \dots N \end{align*}

The gradient of the objective function is

$$ -\lambda^T Q + 1 $$

and the Hessian is $-Q$.

Implement a function make_Q(features, labels) that generates the matrix Q as defined above. Then implement the function svm_objective(l, Q) which calculates the value of the dual objective function for $Q$. Make sure to use the vector form of the objective function so that your function is efficient.

In [ ]:

# replace this with your solution, add and remove code and markdown cells as appropriate

In [ ]:

Q = make_Q(features, labels)
svm_objective(np.random.normal(size=N), Q)

2d.¶

Implement a function train(Q, C) which maximises the dual objective. Use opt.fmin_l_bfgs_b. Assume $C = 1$. Dont forget to supply the gradient to the optimisation function. Also note that the optimisation function finds fmin.

In [ ]:

# replace this with your solution, add and remove code and markdown cells as appropriate

In [ ]:

lambdas = train(Q, C=1)

2e.¶

The support vectors are all $x_n$ such that $\lambda_n > 0$. Repeat your PCA scatter plot, but highlight the support vectors.

In [ ]:

# replace this with your solution, add and remove code and markdown cells as appropriate

2f. Prediction¶

$\mathcal S$ is the set of support vectors and $N_{\mathcal S}$ is the number of support vectors.

We make predictions with an SVM using the equation

$$ f(x) = w^Tx +b $$

Hint what should this equation be when x is matrix?

Where the weight is computed by

$$ w = \sum_{n \in \mathcal S} \lambda_n t_n \phi (x_n) $$

$b$ is the bias, which is given by

$$ b = \frac{1}{N_{\mathcal S}} \sum_{n \in \mathcal S} \left(t_n - \sum_{m \in \mathcal S} \lambda_m t_m \langle x_m, x_n \rangle\right). $$

Write a function predict(lambdas, features, labels) which returns predictions.

Plot and colour-code the points by their predicted label.

In [ ]:

# replace this with your solution, add and remove code and markdown cells as appropriate

2g. Kernels¶

The data in our SVM only enters through the inner product $\langle x_m, x_n \rangle$. By replacing this with a kernel, we can kernalise the SVM.

Write a function make_Q_gaussian(features, labels, sigma=1) which generates the matrix $Q$ using a Gaussian kernel with bandwidth $\sigma$.

Gaussian Kernel is defined as:

$$k(x,x')=exp(-||x-x'||^2/2\sigma^2)$$

We now make predictions with an SVM using the equation $$ f(x) = \sum_{n = 1}^N \lambda_n t_n k (x, x_n) + b. $$

Write a new prediction fuction predict_k(lambdas, kernel, labels) that make use of the kernel in prediction and make prediction on the dataset using this new kernel.

Note that the feature space is infinite dimensional so the weight is generally intractable. However, the classifier can still be obtain using the kernel.

In [ ]:

# replace this with your solution, add and remove code and markdown cells as appropriate

In [ ]: