Linear Algebra and Optimisation¶

COMP4670/8600 - Statistical Machine Learning - Tutorial¶

In this lab we will practice minimising a cost function with gradient descent.

Assumed knowledge¶

• Linear algebra (see Sam Roweis' notes, linked below, for matrix calculus tips)
• Python programming
• Preferably: Using numpy for matrix calculations (precourse material)

After this lab, you should be comfortable with:¶

• Using numpy ndarrays for matrix calculations
• Using scipy.optimise routines to minimise a cost function, with and without a gradient
• Randomly generating input values for testing

Pre-lab notes¶

In this lab, you will apply linear algebra to to minimise a cost function in three steps: implementing the cost function, implementing a gradient function, and applying gradient descent. We will be doing this to solve problems throughout the course.

As in all labs, feel free to skip questions if you get stuck, and ask your tutor if you have any questions!

A note on style: in this course we emphasise functional decomposition in code style. Avoid using global variables, and remember that often splitting code off into separate functions can make it more readable and testable. (Jupyter notebooks let you call functions defined in previous cells.)


Setting up python environment (this cell contains Latex macros).

In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.optimize as opt
import time

%matplotlib inline


A cost function or loss function is the function we want to minimize in a given problem. For example, it might measure the error between the indicator values our models predicts and the true values for the training data. In this lab, we consider a toy example. We will define a cost function $f(X)$ where $X$ is a $n\times p$ matrix.

If $A$ is a square matrix, then we write $\trace{A}$ for its trace. Let $\Norm{A}_F = \sqrt{\trace{A^T A}}$, the Frobenius norm of a matrix.

Let our cost function $f(X)$ be defined for $n\times p$ matrices $X$ as follows. Let $C$ be a fixed symmetric $n\times n$ matrix (so $C = C^T$). Let $\mu$ be a scalar that is larger than the $p^{th}$ smallest eigenvalue of $C$. Let $N$ be a diagonal $p\times p$ matrix with distinct positive entries on the diagonal.

The cost function is defined as $$f(X) = \frac{1}{2} \trace{X^T C X N} + \mu \frac{1}{4} \Norm{N - X^T X}^2_F$$ where $X \in \RR^{n \times p}$, $n \ge p$.

Frobenious Norm¶

Implement a Python function frobenius_norm which accepts an arbitrary matrix $A$ and returns $\Norm{A}_F$ using the formula given. (Use numpy.trace and numpy.sqrt.) We represent matrices and vectors as numpy ndarrays.

1. Given a matrix $A \in \RR^{n \times p}$, what is the complexity of your implementation of frobenius_norm using the formula above?
2. Can you come up with a faster implementation, if you were additionally told that $p \ge n$ ?

Extension: Can you find an even faster implementation than in 1. and 2.?

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Implementing the cost function¶

Write a Python function, cost_function_for_matrix, which implements the function $f(X)$ defined above.

Hint: What should the arguments to this function be?

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Cost function with vector argument¶

The standard optimisation functions we will be using work only for cost functions that take a vector as the varying argument. Write a new function, cost_function_for_vector, that takes $X$ represented as a vector of length $np$ rather than a matrix of dimensions $n\times p$. What arguments will this function take?

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Minimising the cost function¶

At this point, we have two main choices in how we minimise the cost function using gradient descent functions from scipy.optimize. First, we can use fmin, which takes a function to minimize and an initial value. Second, we can use fmin_bfgs, which takes an additional argument: the gradient of the function. As a result, (we would expect to find that) fmin_bfgs has substantially faster convergence.

Minimizing with fmin¶

Implement a function minimise_f_using_fmin that, for given values of $C$, $N$ and $\mu$, finds the matrix $X$ that minimizes $f(X)$ using fmin. You will likely need the docs for fmin. Check if your function converges for some (randomly generated) values of $C$, $N$ and $\mu$.

Summary of the docs: if you have a cost function $g(x, y)$ with a fixed value of $y$ and wish to find the value of $x$ that minimizes it, the syntax for calling fmin would be fmin(g, x0, args=(y)) where x0 is an initial guess for the value of $x$, and args=(y) gives fmin the rest of the values to pass to the cost function. Note that it is necessary that the variable that can change is the first argument to the cost function.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Calculating the gradient of the cost function¶

To use fmin_bfgs, which is substantially more time efficient, we need to compute the gradient of $f(X)$ with respect to $X$. Calculate this gradient, then implement a function to calculate it. You may want to use Sam Roweis' Matrix Identities and/or the Matrix Cookbook as a reference for matrix calculus. As our cost function uses its main argument $X$ represented as a vector, also implement a function gradient_for_vector which returns the gradient represented as a vector.

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Minimizing the cost function using the gradient¶

Write a function minimise_f_using_fmin_bfgs to minimise $f(X)$ using fmin_bfgs. Have a look at the docs to find the correct syntax. Again, have a try of your function to check that it converges.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Time for convergence¶

We wish to check whether fmin_bfgs is actually faster than fmin.

First, we need a way of randomly generating input parameters for our cost function.

Construction of a random matrix $C$ with given eigenvalues¶

A diagonal matrix has the nice property that the eigenvalues can be directly read off the diagonal. Given a diagonal matrix $C \in \RR^{n \times n}$ with distinct eigenvalues, how many different diagonal matrices have the same set of eigenvalues?

Given a diagonal matrix $C \in \RR^{n \times n}$ with distinct eigenvalues, how many different matrices have the same set of eigenvalues?

Recall that if $\Lambda$ is a diagonal matrix with eigenvalues $\mathcal{E} = \{e_1, \dots, e_n \}$ then (for any invertible matrix $B$) the matrix $B \Lambda B^{-1}$ also has eigenvalues $\mathcal{E} = \{e_1, \dots, e_n \}$. We call this matrix the conjugation of $\Lambda$ by $B$. Show that if $B$ is an orthogonal matrix then $B \Lambda B^{-1}$ is symmetric.

Use the result of the previous question to write a Python function random_matrix_from_eigenvalues which takes a list of eigenvalues $E$ and returns a random symmetric matrix $C$ having the same eigenvalues. Hint: one way to find a random orthogonal matrix is using the QR-decomposition of a random matrix.

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Checking convergence time¶

Is fmin_bfgs actually faster than fmin? Write some code to find out, using time.clock() (or time.process_time() on newer versions of Python).

Make sure to check this for relatively small and relatively large values of $n$ and $p$. Use random_matrix_from_eigenvalues to generate your $C$ parameter.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate


Minima of $f(X)$¶

Compare the columns $x_1,\dots, x_p$ of the matrix $X^\star$ which minimises $f(X)$ $$X^\star = \argmin_{X \in \RR^{n \times p}} f(X)$$

with the eigenvectors related to the smallest eigenvalues of $C$.

What do you believe this means about $f(X)$?

# replace this with your solution, add and remove code and markdown cells as appropriate