Setting up the environment

In [ ]:

```
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
import pickle
%matplotlib inline
```

For a dataset $X$ with $N$ examples and $D$ features, we can represent it as a matrix. What is the dimensions of this matrix $X$? The covariance matrix $C$ is the matrix representing the variance and covariance between each pair of features. What is the size of this matrix $C$?

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

Generate a data matrix $X$ using `gen_data`

from Tutorial 1a. Compute the covariance matrix $C$ and its eigenvalue decomposition using `np.linalg.eigh`

.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Covariance matrices are an example of a class of matrices which are *positive semidefinite*. A matrix $A$ is called symmetric if $A_{ij}=A_{ji}$. Another way to say this is that $A=A^\top$. A matrix $A\in\mathbb{R}^{n\times n}$ is called **positive semidefinite**, if for all vectors $x\in\mathbb{R}^n$,
$$
x^\top A x \geqslant 0.
$$
Show that the eigenvalues of a positive semidefinite matrix are non-negative.

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

You will see more about principal component analysis later in the course. For now, we will treat PCA as an exercise in matrix manipulation.

The **Singular Values** of a square matrix $A$ is defined as the square root of the eigenvalues of $A^T A$. Given a matrix $X$, the singular value decomposition (SVD) is given by
$$
X = U S V^T
$$
where $U$ and $V$ are orthogonal matrices containing the left and right singular vectors respectively. And $S$ is a matrix with the singular values along the diagonal.

Using the definition of the covariance matrix $C$:

- Substitute the singular value decomposition of $X$
- Simplify the resulting expression. You should have an expression of $C$ in terms of $U$ and $S$.
- Recall from Tutorial 2, the definition of an eigenvalue decomposition.
- What is the matrix that contains the eigenvectors corresponding to the $k$ largest eigenvalues of $C$?

Recall that PCA considers the covariance matrix of a data matrix $X$. Using the definition of SVD above, derive expressions for:

- the eigenvectors
- the projection of $X$ onto the $k$ largest eigenvalues

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

Implement the principal component analysis method, using `numpy.linalg.svd`

. Your function should take the data matrix and return two matrices:

- The projection of the data onto the principal components
- The actual components (eigenvectors) themselves.

*Hint: do not forget to center the data by removing the mean*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Use the code from Tutorial 1 to generate toy data with 100 samples in 5 dimensions. Recall that the data is from two Gaussians with unit variance, centered at $\mathbf{1}$ and $-\mathbf{1}$ respectively.

Obtain the projection of the toy data to its first two principal components. Plot the results. You should be able to see that the first principal component already gives you the axis of discrimination. Revisit the question of the effect of dimension on two Gaussians with unit variance.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

The toy dataset is generated from spherical Gaussians. Explore the following effects on PCA:

- Multiply each feature by a different scaling factor.
- Write a new function for generating data from more than 2 Gaussians, placed at different locations.
- Write a new function for generating data that generates Gaussians which are not spherical.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

The aim of this section of the tutorial is to see that in some cases, the principal components can be human interpretable.

The images below are of Colin Powell, resized to a smaller image, from LFW. Download the images from the course website.

In [ ]:

```
# Visualising images
def plot_gallery(images, titles, h, w, n_row=2, n_col=6):
"""Helper function to plot a gallery of portraits"""
plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
for i in range(n_row * n_col):
plt.subplot(n_row, n_col, i + 1)
plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
plt.title(titles[i], size=12)
plt.xticks(())
plt.yticks(())
```

In [ ]:

```
lfw_colin = pickle.load(open('lfw_colin.pkl', 'rb'))
# introspect the images array to find the shapes (for plotting)
n_samples, h, w = lfw_colin['images'].shape
plot_gallery(lfw_colin['images'], range(n_samples), h, w)
```

Use the `pca`

function you wrote above to find the first 15 principal components. Visualise them. Discuss what the components potentially capture, for example lighting from the right.

*Hint: Images need to be converted into a vector for PCA, and the results need to be converted back*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

In [ ]:

```
```