In this lab we will practice minimising a cost function with gradient descent.

- Linear algebra (see Sam Roweis' notes, linked below, for matrix calculus tips)
- Python programming
- Preferably: Using numpy for matrix calculations (precourse material)

- Using numpy ndarrays for matrix calculations
- Using scipy.optimise routines to minimise a cost function, with and without a gradient
- Randomly generating input values for testing

In this lab, you will apply linear algebra to to minimise a cost function in three steps: implementing the cost function, implementing a gradient function, and applying gradient descent. We will be doing this to solve problems throughout the course.

As in all labs, feel free to skip questions if you get stuck, and ask your tutor if you have any questions!

A note on style: in this course we emphasise *functional decomposition* in code style. Avoid using global variables, and remember that often splitting code off into separate functions can make it more readable and testable. (Jupyter notebooks let you call functions defined in previous cells.)

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$

Setting up python environment (this cell contains Latex macros).

In [ ]:

```
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.optimize as opt
import time
%matplotlib inline
```

A *cost function* or *loss function* is the function we want to minimize in a given problem. For example, it might measure the error between the indicator values our models predicts and the true values for the training data. In this lab, we consider a toy example. We will define a cost function $f(X)$ where $X$ is a $n\times p$ matrix.

If $A$ is a square matrix, then we write $\trace{A}$ for its trace. Let $ \Norm{A}_F = \sqrt{\trace{A^T A}} $, the *Frobenius norm* of a matrix.

Let our cost function $f(X)$ be defined for $n\times p$ matrices $X$ as follows. Let $C$ be a fixed symmetric $n\times n$ matrix (so $C = C^T$). Let $\mu$ be a scalar that is larger than the $p^{th}$ smallest eigenvalue of $C$. Let $N$ be a diagonal $p\times p$ matrix with distinct positive entries on the diagonal.

The cost function is defined as \begin{equation} f(X) = \frac{1}{2} \trace{X^T C X N} + \mu \frac{1}{4} \Norm{N - X^T X}^2_F \end{equation} where $ X \in \RR^{n \times p} $, $ n \ge p $.

Implement a Python function `frobenius_norm`

which accepts an arbitrary matrix $ A $ and returns
$ \Norm{A}_F $ using the formula given. (Use `numpy.trace`

and `numpy.sqrt`

.) We represent matrices and vectors as numpy ndarrays.

- Given a matrix $ A \in \RR^{n \times p} $, what is the complexity of your implementation of
`frobenius_norm`

using the formula above? - Can you come up with a faster implementation, if you were additionally told that $ p \ge n $ ?

Extension: Can you find an even faster implementation than in 1. and 2.?

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Write a Python function, `cost_function_for_matrix`

, which implements the function $f(X)$ defined above.

Hint: What should the arguments to this function be?

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

The standard optimisation functions we will be using work only for cost functions that take a vector as the varying argument. Write a new function, `cost_function_for_vector`

, that takes $X$ represented as a vector of length $np$ rather than a matrix of dimensions $n\times p$. What arguments will this function take?

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

At this point, we have two main choices in how we minimise the cost function using gradient descent functions from `scipy.optimize`

. First, we can use `fmin`

, which takes a function to minimize and an initial value. Second, we can use `fmin_bfgs`

, which takes an additional argument: the gradient of the function. As a result, (we would expect to find that) `fmin_bfgs`

has substantially faster convergence.

`fmin`

¶Implement a function `minimise_f_using_fmin`

that, for given values of $C$, $N$ and $\mu$, finds the matrix $X$ that minimizes $f(X)$ using `fmin`

. You will likely need the docs for `fmin`

. Check if your function converges for some (randomly generated) values of $C$, $N$ and $\mu$.

Summary of the docs: if you have a cost function $g(x, y)$ with a fixed value of $y$ and wish to find the value of $x$ that minimizes it, the syntax for calling `fmin`

would be `fmin(g, x0, args=(y))`

where `x0`

is an initial guess for the value of $x$, and `args=(y)`

gives `fmin`

the rest of the values to pass to the cost function. Note that it is necessary that the variable that can change is the first argument to the cost function.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

To use `fmin_bfgs`

, which is substantially more time efficient, we need to compute the gradient of $f(X)$ with respect to $X$. Calculate this gradient, then implement a function to calculate it. You may want to use Sam Roweis' Matrix Identities and/or the Matrix Cookbook as a reference for matrix calculus. As our cost function uses its main argument $X$ represented as a vector, also implement a function `gradient_for_vector`

which returns the gradient represented as a vector.

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Write a function `minimise_f_using_fmin_bfgs`

to minimise $f(X)$ using `fmin_bfgs`

. Have a look at the docs to find the correct syntax. Again, have a try of your function to check that it converges.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

We wish to check whether `fmin_bfgs`

is actually faster than `fmin`

.

First, we need a way of randomly generating input parameters for our cost function.

A diagonal matrix has the nice property that the eigenvalues can be directly read off the diagonal. Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, how many different diagonal matrices have the same set of eigenvalues?

Given a diagonal matrix $ C \in \RR^{n \times n} $ with distinct eigenvalues, how many different matrices have the same set of eigenvalues?

Recall that if $ \Lambda $ is a diagonal matrix with eigenvalues $ \mathcal{E} = \{e_1, \dots, e_n \} $ then (for any invertible matrix $B$) the matrix $B \Lambda B^{-1}$ also has eigenvalues $ \mathcal{E} = \{e_1, \dots, e_n \} $. We call this matrix the conjugation of $\Lambda$ by $B$. Show that if $ B $ is an orthogonal matrix then $ B \Lambda B^{-1} $ is symmetric.

Use the result of the previous question to write a Python function `random_matrix_from_eigenvalues`

which takes a list of eigenvalues $ E $ and returns a random symmetric matrix $ C $ having the
same eigenvalues. Hint: one way to find a random orthogonal matrix is using the QR-decomposition
of a random matrix.

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Is `fmin_bfgs`

actually faster than `fmin`

? Write some code to find out, using `time.clock()`

(or `time.process_time()`

on newer versions of Python).

Make sure to check this for relatively small and relatively large values of $n$ and $p$. Use `random_matrix_from_eigenvalues`

to generate your $C$ parameter.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Compare the columns $x_1,\dots, x_p$ of the matrix $X^\star$ which minimises $ f(X) $ \begin{equation} X^\star = \argmin_{X \in \RR^{n \times p}} f(X) \end{equation}

with the eigenvectors related to the smallest eigenvalues of $ C $.

What do you believe this means about $f(X)$?

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```