In this lab, we will use dimensionality reduction techniques to explore a dataset of pictures.

**Assumed knowledge:**

- Neural net implementation (lab)
- Autoencoders and PCA (lectures)

**After this lab, you should be comfortable with:**

- Implementing PCA
- Implementing an autoencoder
- Visualising features derived from dimensionality reduction

**12.1:** You'll need to use both the orthogonality of all $u_i$ and the property that all $u_i$ are eigenvectors of $S$ to simplify the derivative of the Lagrangian. (Difficulty: ★★; an extension of the textbook proof, with a minor trick necessary.)

**12.28:** $f(x)$ will need to be *strictly* monotonic (i.e. $0 < f'(x)$). From this, substitute into 1.27 and change variables to solve for $|f'(x)|$. (Difficulty: ★★; a simple property, but requires some rigour.)

There are fewer textbook questions than usual this week. You should use this time to work through the proofs in this lab, as they serve a similar purpose for your learning.

In [ ]:

```
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
%matplotlib inline
```

For this lab, we will use a dataset of images of Pokemon sprites.

Load the dataset from the file `06-dataset.csv` using `np.loadtxt`. The datafile represents a 2D array where each row is a 64 by 64 pixel greyscale picture. The entries are floats between 0 and 1, where 0 is white and 1 is black.

Note that while the images are 64 by 64 entries, the dataset you load has rows of size 4096 (which is $64\times 64$) to allow the data to be saved as a 2D array.
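As a quick sanity check on this layout (a sketch using a synthetic row in place of `06-dataset.csv`), a flat row of length 4096 reshapes losslessly to a 64 by 64 image and back:

```python
import numpy as np

# Stand-in for one row of the dataset: 4096 grey values in [0, 1].
row = np.linspace(0.0, 1.0, 64 * 64)

# Reshape the flat row into a 64x64 image, e.g. for plt.imshow.
img = row.reshape(64, 64)
print(img.shape)                          # (64, 64)

# Flattening the image recovers the original row exactly.
print(np.array_equal(img.ravel(), row))   # True
```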

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

For debugging, it is useful to also have a simple dataset that we know is one-dimensional with some noise. You can use this to test that your functions produce sensible output.

Write a function that generates data from two Gaussians in $\mathbb{R}^n$ with unit variance, centered at $\mathbf{1}$ and $-\mathbf{1}$ respectively. (Note: $\mathbf{1}$ is the vector $(1, 1, 1, ..., 1)$ in $\mathbb{R}^n$.)

Use the function to generate 100 samples from each Gaussian, with $n=5$.
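One possible sketch of such a generator (the function name and seeded `default_rng` interface are my own choices, not requirements):

```python
import numpy as np

def two_gaussians(n_samples, n, seed=None):
    """Draw n_samples points from each of two unit-variance Gaussians
    in R^n, centred at +1 and -1 (the all-ones vectors)."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_samples, n)) + 1.0   # centred at  (1,...,1)
    b = rng.standard_normal((n_samples, n)) - 1.0   # centred at (-1,...,-1)
    X = np.vstack([a, b])
    labels = np.repeat([0, 1], n_samples)           # which Gaussian each row came from
    return X, labels

X, y = two_gaussians(100, 5, seed=0)
print(X.shape, y.shape)   # (200, 5) (200,)
```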

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Remember from lectures that the goal of PCA is to linearly project data points onto a lower dimensional subspace such that the variance of the projected data is maximised.

Let the data be the set of data points $\{\mathbf{x}_n\}_{n=1}^N$, $\mathbf{x}_n\in\mathbb{R}^d$, with mean $\bar{\mathbf{x}}=\frac{1}{N}\sum_{n=1}^N\mathbf{x}_n$ and covariance matrix $\mathbf{S}=\frac{1}{N}\sum_{n=1}^N(\mathbf{x}_n-\bar{\mathbf{x}})(\mathbf{x}_n-\bar{\mathbf{x}})^T.$

From lectures, we derived that if we are linearly projecting onto a subspace of dimension $m<d$, then the $m$ directions to project onto are given by the $m$ eigenvectors of $\mathbf{S}$ whose eigenvalues are the $m$ largest, and the variance along each direction is equal to that eigenvalue.

Let us assume that $\bar{\mathbf{x}}=\mathbf{0}$. Then $\mathbf{S}=\frac{1}{N}\sum_{n=1}^N\mathbf{x}_n\mathbf{x}_n^T$. However, it turns out that $$\sum_{n=1}^N\mathbf{x}_n\mathbf{x}_n^T=X^TX$$ where $X\in\mathbb{R}^{N\times d}$ is the data matrix. Thus to find the eigenvalues and vectors of the covariance matrix, we need to find the eigenvalues and vectors of $\frac{1}{N}X^TX$.

It turns out that if the SVD of $X$ is $X=U\Sigma V^T$, then the eigenvectors of $\mathbf{S}$ that correspond to its $k$ largest eigenvalues are the column vectors of $V$ that correspond to the $k$ largest singular values of $X$.
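The claim can be checked numerically on random data (a sketch; note that `np.linalg.eigh` returns eigenvalues in ascending order while `np.linalg.svd` returns singular values in descending order):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
X -= X.mean(axis=0)                       # centre so that S = X^T X / N

# Eigendecomposition of the covariance matrix...
S = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(S)      # ascending eigenvalues

# ...versus the SVD of the data matrix.
U, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# The eigenvalues of S are sigma^2 / N, and the top eigenvector of S
# matches the first row of V^T (first column of V) up to sign.
print(np.allclose(eigvals[::-1], sigma**2 / len(X)))      # True
print(np.isclose(abs(Vt[0] @ eigvecs[:, -1]), 1.0))       # True
```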

Show the left out steps (the two parts where it says "it turns out").

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

Implement principal component analysis. Your function should take the data matrix and the number of components you wish to calculate and return two matrices:

- The projection of the data onto the principal components
- The actual components (eigenvectors) themselves.

Hint: Do not forget to centre the data by removing the mean so that you can use the above method. You may find `np.linalg.svd` useful.
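One possible sketch of such a function (the interface below is only one reasonable choice, not the required one):

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the top-k principal components.
    Returns the N x k projection and the k x d component matrix
    (eigenvectors of the covariance matrix as rows)."""
    X_centred = X - X.mean(axis=0)        # don't forget to centre!
    U, sigma, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]                   # top-k right singular vectors
    projection = X_centred @ components.T
    return projection, components

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
proj, comps = pca(X, 3)
print(proj.shape, comps.shape)            # (100, 3) (3, 10)
```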

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Calculate the projection of the toy data onto the first two principal components. Plot the results, with the datapoints coloured according to which Gaussian they were drawn from.

- Does PCA pick up the two Gaussians?

- What are the eigenvalues associated to these principal components? What do they tell you about how much variance these components explain?

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

If we perform PCA on a dataset, we expect the principal components to lie in the neighbourhood of our datapoints. In particular, if we do this on a dataset of images, we can interpret the principal components as images.

The following function plots a gallery of images.

In [ ]:

```
# Visualising images
def plot_gallery(images, titles, h, w, n_row=2, n_col=6):
    """Helper function to plot a gallery of portraits.
    Arguments:
        images: a matrix where each row is an image.
        titles: an array of labels for each image.
        h: the height in pixels of each image.
        w: the width in pixels of each image.
        n_row: the number of rows of images to print.
        n_col: the number of columns of images to print."""
    assert len(images) >= n_row * n_col
    assert len(titles) >= n_row * n_col
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
    plt.show()
```

Use `plot_gallery` to plot the first 30 Pokemon images.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Perform PCA on the Pokemon dataset to find the first 200 principal components. Visualise the first 100 using `plot_gallery`.

What do you notice about the first few principal components? What are they detecting? Plot the associated eigenvalues. How can you interpret these?

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Plot the reconstructions of the first 30 images using 200 principal components, and using the first 15 principal components. Don't forget to add the mean back in. How good are these reconstructions?
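The reconstruction step can be sketched as follows (synthetic data; `projection` is the N x k projected data and `components` holds the k principal directions as rows). Keeping all components makes the round trip exact; truncating to fewer components gives the lossy reconstructions asked for here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 6))
mean = X.mean(axis=0)
Xc = X - mean

U, sigma, Vt = np.linalg.svd(Xc, full_matrices=False)
components = Vt[:6]                # keep all 6 components here
projection = Xc @ components.T

# Reconstruction: map back from component space and add the mean back in.
X_hat = projection @ components + mean
print(np.allclose(X_hat, X))       # True, since no components were dropped
```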

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Another method of dimensionality reduction is by using an autoencoder. We will be implementing a **stacked undercomplete autoencoder**. We will be working off the neural network implementation from last week, but will make some changes.

We are also going to try to get a much better result than PCA, in particular try to encode the Pokemon into **a subspace of dimension 4**. This means that the bottleneck hidden layer, or *code*, is of length 4. In order to get such good results, we are going to have to use some of the tricks from the lectures, in particular **Xavier Initialisation**, **ReLU activations** and **Mini-batch training**.

We are going to make two explicit network classes, an encoder and a decoder. Each will have 3 fully connected layers: the encoder goes from 4096 to 512 to 64 and then to 4, and vice-versa for the decoder.

The first part we will implement is the `FullyConnectedLayer`. The only part we are going to change is the initialisation method, from normally distributed weights with standard deviation 1 to **Xavier initialisation**. As explained in lectures, this helps avoid saturation (see the follow-up question below). In lectures, the standard deviation for Xavier initialisation of normally distributed weights is
$$\sqrt{\frac{1}{n_{in}}}$$
but we are going to change this to the more commonly used variant
$$\sqrt{\frac{2}{n_{in}+n_{out}}}$$
which also makes sure the backpropagated signal is not saturated.

Here we have denoted $n_{in}$ as the dimension of the input to the layer, and $n_{out}$ the dimension of the output of the layer.

We will also consider the bias as another input, and thus use the same initialisation (with $n_{in}$ increased by 1).
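A minimal sketch of this initialisation, assuming the variant standard deviation $\sqrt{2/(n_{in}+n_{out})}$ described above (the standalone function name is my own; in the lab this logic belongs inside `init_weights`):

```python
import numpy as np

def xavier_init(n_in, n_out, seed=None):
    """Weights ~ N(0, 2/(n_in + n_out)). The bias is treated as an
    extra input, so its fan-in is n_in + 1."""
    rng = np.random.default_rng(seed)
    weight = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))
    bias = rng.normal(0.0, np.sqrt(2.0 / (n_in + 1 + n_out)), size=(n_out, 1))
    return weight, bias

w, b = xavier_init(4096, 512, seed=0)
print(w.shape, b.shape)    # (4096, 512) (512, 1)
# The empirical std of w should be close to sqrt(2/4608) ~ 0.0208.
```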

What is saturation, and why is it especially important in this network compared to the one we implemented last week?

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
class FullyConnectedLayer():
    """
    This is a class skeleton provided.
    It should perform y = Xw + b and its corresponding gradient.
    If you have never defined any classes in python before, you probably want to read other tutorials.
    """
    def __init__(self, in_features, out_features):
        """
        This is the init function where all the attributes needed are defined.
        You don't have to modify anything in this function, but you should read it carefully.
        What each attribute represents is explained in the next few functions.
        """
        self.in_features = in_features
        self.out_features = out_features
        self.weight = np.zeros((in_features, out_features))
        self.bias = np.zeros((out_features, 1))
        self.g_weight = np.zeros((in_features, out_features))
        self.g_bias = np.zeros((out_features, 1))
        self.input = None

    def init_weights(self):
        """
        Currently, the weight and bias of this layer are initialised to zero, which is terrible.
        You want to re-initialise the weight and bias with Xavier initialisation as explained above.
        After you finish, comment out raise NotImplementedError.
        No return value is needed.
        """
        ################
        #YOUR CODE HERE#
        ################
        raise NotImplementedError

    def forward_pass(self, X):
        """
        Take the output of the last layer as X and return the result.
        Don't forget to save the input X to self.input; you will need it for gradient calculation.
        If you are new to python/numpy, you probably want to figure out what broadcasting is
        (see http://cs231n.github.io/python-numpy-tutorial/#numpy-broadcasting).
        """
        self.input = X
        out = np.dot(X, self.weight) + self.bias.T
        return out

    def backward_pass(self, g_next_layer):
        """
        g_next_layer is the gradient passed from the next layer (the layer after the current layer in the forward pass).
        You need to calculate 3 things.
        First, the gradient with respect to the bias, self.g_bias.
        Second, the gradient with respect to the weights, self.g_weight.
        Third, the gradient with respect to the last layer (the input to this layer), g_last_layer.
        Save the gradients with respect to bias and weight, and return the gradient with respect to the last layer.
        """
        self.g_weight = np.dot(self.input.T, g_next_layer)
        self.g_bias = np.sum(g_next_layer, axis=0, keepdims=True)
        g_last_layer = np.dot(g_next_layer, self.weight.T)
        return g_last_layer

    def update(self, learning_rate):
        """
        Update the weight and bias using the gradients that you just calculated.
        No return is needed.
        """
        self.weight -= learning_rate * self.g_weight
        self.bias -= learning_rate * self.g_bias.T
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Next we are going to implement our activation functions. We already have our Sigmoid layer from last week. We need to use this at the end of our network, as our output (the resulting image) needs to be within $[0,1]$. However, for the inner activation functions, we are going to use the **ReLU** function, which works better in practice.
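As a standalone illustration (separate from the class interface used below), ReLU and its gradient can be written as:

```python
import numpy as np

def relu(X):
    """Elementwise max(0, x)."""
    return np.maximum(0.0, X)

def relu_grad(X, g_next):
    """Gradient of ReLU at input X, times the incoming gradient.
    The derivative is 1 where X > 0 and 0 elsewhere (the kink at 0
    is conventionally given gradient 0)."""
    return (X > 0) * g_next

X = np.array([[-2.0, 0.0, 3.0]])
print(relu(X))                          # [[0. 0. 3.]]
print(relu_grad(X, np.ones_like(X)))    # [[0. 0. 1.]]
```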

What is the benefit of ReLU as an activation function? What is its gradient?

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
def sigmoid(X):
    """
    Make sure that your function works with X being a matrix.
    Use functions in numpy instead of functions in math.
    """
    return 1 / (1 + np.exp(-X))

class Sigmoid():
    def __init__(self):
        """
        This is the init function where all the attributes needed are defined.
        You don't have to modify anything in this function, but you should read it carefully.
        """
        self.input = None

    def forward_pass(self, X):
        """
        Apply the sigmoid function to the input and save the input for later.
        """
        self.input = X
        return sigmoid(X)

    def backward_pass(self, g_next_layer):
        """
        Calculate the gradient with respect to the input.
        g_next_layer is the gradient passed from the next layer (the layer after the current layer in the forward pass).
        Return the gradient with respect to the output of the last layer.
        """
        g_last_layer = sigmoid(self.input) * (1 - sigmoid(self.input)) * g_next_layer
        return g_last_layer

    def update(self, learning_rate):
        """
        There is no parameter to update for this layer, but we still define this function to maintain a uniform interface.
        """
        pass

class ReLU():
    def __init__(self):
        """
        This is the init function where all the attributes needed are defined.
        You don't have to modify anything in this function, but you should read it carefully.
        """
        self.input = None

    def forward_pass(self, X):
        """
        Apply the ReLU function to the input and save the input for later.
        """
        self.input = X
        ################
        #YOUR CODE HERE#
        ################
        raise NotImplementedError
        return None

    def backward_pass(self, g_next_layer):
        """
        Calculate the gradient with respect to the input.
        g_next_layer is the gradient passed from the next layer (the layer after the current layer in the forward pass).
        Return the gradient with respect to the output of the last layer.
        """
        ################
        #YOUR CODE HERE#
        ################
        raise NotImplementedError
        g_last_layer = None
        return g_last_layer

    def update(self, learning_rate):
        """
        There is no parameter to update for this layer, but we still define this function to maintain a uniform interface.
        """
        pass
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

We are also going to change our loss function: we care about the actual values of the resulting sigmoid output, rather than thresholding to classify into classes. Thus we are going to use the Mean Squared Error (MSE) loss.
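As a standalone illustration of the loss and its gradient (separate from the `MSELoss` class interface used below):

```python
import numpy as np

def mse(y, t):
    """Mean squared error over all entries of y and t."""
    assert y.shape == t.shape        # (N,) vs (N,1) bugs are common!
    return np.mean((y - t) ** 2)

def mse_grad(y, t):
    """Gradient of mean((y - t)^2) with respect to y."""
    return 2.0 * (y - t) / y.size

y = np.array([[0.5], [1.0]])
t = np.array([[0.0], [1.0]])
print(mse(y, t))            # 0.125
print(mse_grad(y, t))       # gradient: 0.5 for the first entry, 0 for the second
```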

In [ ]:

```
class MSELoss():
    def __init__(self):
        """
        This is the init function where all the attributes needed are defined.
        You don't have to modify anything in this function, but you should read it carefully.
        """
        self.input_y = None
        self.input_t = None
        self.input_N = None

    def forward_pass(self, y, t):
        """
        y: batch_size * 1, 0 <= y <= 1, the predictions
        t: batch_size * 1, the targets
        (make sure y and t have the same shape; (N,) and (N,1) are different!)
        Save the inputs y and t and the batch size N, and calculate the loss.
        Return the mean of the loss (a scalar).
        """
        ################
        #YOUR CODE HERE#
        ################
        raise NotImplementedError
        return None

    def backward_pass(self, g_next_layer=1):
        """
        Normally, the loss layer is the last layer in a neural network, so we set g_next_layer to 1.
        Calculate the gradient with respect to the input y.
        """
        raise NotImplementedError
        g_last_layer = None
        return g_last_layer

    def update(self, learning_rate):
        """
        There is no parameter to update for this layer, but we still define this function to maintain a uniform interface.
        """
        pass
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Finally we are going to implement the two neural networks, encoder and decoder. To do this, we will extend the Network class, and then just add what we need to `self.sequential`. Note that we have also added a return value in the backward pass, as we want to pass the gradient from the decoder to the encoder.

The **encoder** should have 3 fully connected layers, going from 4096 to 512 to 64 and then to 4. The activation functions should all be ReLU.

The **decoder** should have 3 fully connected layers, going from 4 to 64 to 512 and then to 4096. The activation functions should all be ReLU, except for the last layer's activation which should be Sigmoid to get the correct range.

If you have time, experiment with changing these. Does changing to Sigmoid reduce performance? Can you get better performance by changing the number of layers and hidden unit sizes?

In [ ]:

```
class Network():
    def __init__(self):
        self.sequential = []

    def forward_pass(self, X):
        for l in self.sequential:
            X = l.forward_pass(X)
        return X

    def backward_pass(self, grad):
        for l in reversed(self.sequential):
            grad = l.backward_pass(grad)
        return grad

    def update(self, learning_rate):
        for l in self.sequential:
            l.update(learning_rate)

class Encoder(Network):
    def __init__(self):
        Network.__init__(self)
        """
        Initialise all the layers that you need and append them to self.sequential in the correct order.
        Don't forget to initialise the weights for the fully connected layers.
        """
        ################
        #YOUR CODE HERE#
        ################
        raise NotImplementedError

class Decoder(Network):
    def __init__(self):
        Network.__init__(self)
        """
        Initialise all the layers that you need and append them to self.sequential in the correct order.
        Don't forget to initialise the weights for the fully connected layers.
        """
        ################
        #YOUR CODE HERE#
        ################
        raise NotImplementedError
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

We now initialise the classes.

In [ ]:

```
try:
    np.random.seed(1)
    encoder = Encoder()
    decoder = Decoder()
    mse = MSELoss()
    implemented = True
except NotImplementedError:
    implemented = False
    print('you need to complete the implementation')
```

As we are doing dimensionality reduction, there is no need for a train-test split. We have set some hyper-parameters for you, e.g. epochs, batch_size and learning rate. In particular the batch size is much smaller than the training set size.

What is the benefit of mini-batch training (training with small batch size, not the whole training set)?

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*

In [ ]:

```
if not implemented:
    print('you need to complete the implementation')
else:
    training_epoch = 50
    # We have set a mini-batch size of 30, having found that it works well in practice. Try setting it equal
    # to the size of the training set and see how well it performs. You may have to change the learning rate in this case.
    batch_size = 30
    # batch_size = len(images)
    for i in range(training_epoch):
        # We want to randomise the batches that we get for each epoch.
        # This also helps when the batch size does not divide nicely into the size of the dataset.
        random_idxes = np.random.permutation(len(images))
        batches = [images[random_idxes[i*batch_size:(i+1)*batch_size]] for i in range(len(images)//batch_size)]
        loss_vals = []
        for (j, batch) in enumerate(batches):
            """
            This is the main training loop.
            You need to first run a forward pass on the encoder to get a code, then run the decoder on the code to get X_hat.
            Put these into the loss function, calculate the loss and do a backward pass.
            Then update the network.
            Append the loss to the list loss_vals so we can see the average loss of each epoch.
            If you have used our hyperparameters, the learning rate we set should work for both the encoder and decoder.
            If you have changed the structure, experiment to see if you can find a good learning rate.
            """
            learning_rate = 5e-3
            X = batch
            ################
            #YOUR CODE HERE#
            ################
            X_hat = X * 0  # your code here
            train_loss = 0  # your code here
            loss_vals.append(train_loss)
        print("Epoch %d: Train_loss %f" % (i+1, np.mean(loss_vals)))
        show_every = 10
        if i % show_every == show_every - 1:
            plot_gallery(X_hat, np.arange(batch_size), 64, 64, min(2, int(np.ceil(batch_size/6))), 5)
            plot_gallery(X, np.arange(batch_size), 64, 64, min(2, int(np.ceil(batch_size/6))), 5)
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Visualise the autoencoder reconstructions for 30 images. Were they what you expected? Compare them to the PCA reconstruction using 15 components. If you have time/are interested, try training for longer and/or an autoencoder with a higher bottleneck dimension.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

You can find the features the autoencoder is picking up by decoding a hidden unit vector of the form $(0,\dots,0,1,0,\dots,0)$. How do these features compare to those detected by PCA? Also try visualising scaled versions of the features.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Training an autoencoder with linear activation functions should result in projecting the dataset onto the same subspace as PCA. Check this for 4 principal components.

To do this, make a new encoder and decoder with a bottleneck of dimension 4, and use no activation functions (so you only need one FullyConnectedLayer per network). Train the network again, and again look at what the basis vectors for the bottleneck dimension return (i.e. repeat the "Visualising autoencoder features" question with this new model).

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

*--- replace this with your solution, add and remove code and markdown cells as appropriate ---*
