These questions are hand picked to both be of reasonable difficulty and demonstrate what you are expected to be able to solve. The questions are labelled in Bishop as either $\star$, $\star\star$, or $\star\star\star$ to rate its difficulty.

**Question 9.2**: ($\star$) This requires the Robbins-Monro sequential estimation described in 2.3.5. The derivative of w.r.t. $\boldsymbol{\mu}$ is the M-step to compute $\boldsymbol{\mu}$. Note that $\eta$ in (9.5) represents a constant.**Question 9.6**: ($\star$) There is a change from $\boldsymbol{\Sigma}_k$ to $\boldsymbol{\Sigma}$ for every $k$.**Question 9.10**: ($\star\star$) Hint: state the property of conditional probability density function $p({x}_a \mid {x}_b)$.**Question 9.12**: ($\star$) Hint: $\mathbb{E}[\boldsymbol{x}] = \int \boldsymbol{x} p(\boldsymbol{x}) dx$.**Question 9.15**: ($\star$) Show that M-step equation (9.59) can be derived from (9.55) w.r.t. $\boldsymbol{\mu}_k$. Please revisit Bernoulli distribution in 9.3.3.**Question 9.16**: ($\star$) Show that M-step equation (9.60) can be derived from (9.55) w.r.t. $\pi_k$. Lagrange multipliers for $\pi_k$ is required. Please revisit Bernoulli distribution in 9.3.3.**Question 9.13**: ($\star\star$) This question requires (9.48) - (9.60).**Question 9.23**: ($\star$) It is restricted to stationary points of the objective function. The optimization has converged in this case.**Question 9.25**: ($\star\star$) (9.70), (9.71), and (9.72) can be used to show the lower bound.

In this lab, we will use expectation-maximisation to model the eruptions of the Old Faithful geyser with a two-component Gaussian mixture model.

- Gaussian mixture models and expectation-maximisation (lectures)
- Normalising data

- Modelling data with a Gaussian mixture model
- Understanding why and how this model is an instance of EM
- Visualising the convergence of the Gaussian mixture model algorithm## Textbook Questions These questions are hand picked to both be of reasonable difficulty and demonstrate what you are expected to be able to solve. The questions are labelled in Bishop as either $\star$, $\star\star$, or $\star\star\star$ to rate its difficulty.

$\newcommand{\trace}[1]{\operatorname{tr}\left\{#1\right\}}$ $\newcommand{\Norm}[1]{\lVert#1\rVert}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $\newcommand{\DD}{\mathscr{D}}$ $\newcommand{\grad}[1]{\operatorname{grad}#1}$ $\DeclareMathOperator*{\argmin}{arg\,min}$ Run this cell to set up $\LaTeX$.

In [ ]:

```
import numpy as np
import matplotlib.pyplot as plt
```

Given a collection of data $\{\mathbf{x}_1, \ldots, \mathbf{x}_n \}$, where $\mathbf{x}_i \in \mathbb{R}^D$ we assume that the data was drawn from a probability distrubtion $p$, modelled as a mixture of $K$ many Gaussians. $$ p(\mathbf{x}) := \sum_{k=1}^K \pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k ) $$

We assume that the data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n \} \in \mathbb{R}^{N \times D}$ is known, the number of gaussians $K$ is known, and we would like to find the parameters $\pi_k, \mu_k, \Sigma_k$.

Note that the mixture weights must satisfy $$ \sum_{k=1}^K \pi_k = 1 $$

In practise, when we want to sample a point from $p(\mathbf{x})$, we sample a number $p$ uniformly from the interval $[0,1)$, and then if $p$ satisfies $\sum_{i=1}^{k-1} \pi_i \leq p < \sum_{i=1}^{k} \pi_k$, we then sample $\mathbf{x}$ from the corresponding $K^\text{th}$ gaussian $\mathcal{N}(x \mid \mu_k, \Sigma_k )$.

For example, given three gaussians, and the mixture coefficients $\pi_1,\pi_2, \pi_3 = 0.3, 0.3, 0.4$, we draw $p$ from the half-open unit interval, see which of the three intervals $$[0,0.3), [0.3, 0.6), [0.6, 1)$$ $p$ lies in, and then sample $\mathbf{x}$ from the corresponding gaussian.

For a data point $\mathbf{x}$, we define a latent variable $\mathbf{z}$, and denote the $k^\text{th}$ term in vector $\mathbf{z}$ as $z_k$. The latent varaible $\mathbf{z}$ is a 1-of-$K$ (or a one-hot encoding) indicating which gaussian the point $\mathbf{x}$ was sampled from. (A one-hot encoding means that one of the $z_k$'s will be one, and the rest will be zero.) By doing this, we can define the conditional probablity $$ p(\mathbf{x} \mid z_k = 1) := \mathcal{N}(X \mid \mu_k, \Sigma_k) $$

In practise, for each data point $\mathbf{x}_n$, we introduce a corresponding latent variable $\mathbf{z}_n$ that represents which gaussian that $\mathbf{x}_n$ came from. Of course, we don't know what the $\mathbf{z}_n$'s are, otherwise the problem would already be solved. We let $(z_n)_k$, or simply $z_{nk}$, to denote the $k^\text{th}$ term in the $n^\text{th}$ latent variable, corresponding to the data point $\mathbf{x}_n$. So, we should write the above as $$ p(\mathbf{x}_n \mid z_{nk} = 1) := \mathcal{N}(X \mid \mu_k, \Sigma_k) $$

We define $$ \gamma (z_{nk}) := p(z_{nk} = 1 \mid \mathbf{x}_n ) $$ which denotes how "responsible" the component $k$ of is for explaining the observation $\mathbf{x}_n$. By Bayes' Law, we can write \begin{align} p(z_{nk} = 1 \mid \mathbf{x}_n ) & = \frac{p(\mathbf{x}_n \mid z_{nk} = 1) p(z_{nk} = 1)}{p(\mathbf{x}_n)} \\ & = \frac{p(z_{nk} = 1) p(\mathbf{x}_n \mid z_{nk} = 1) }{ \sum_{k=1}^K p(z_{nk} = 1) p(\mathbf{x}_n \mid z_{nk} = 1)} \\ & = \frac{\pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k ) } { \sum_{k=1}^K \pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k)} \end{align} So, we have $$ \gamma (z_{nk}) = \frac{\pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k ) } { \sum_{k=1}^K \pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k)} $$

Now, the log likelihood is given by $$ \log p(\mathbf{x} \mid \pi, \mu, \Sigma) = \sum_{n=1}^N \log \left\{ \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k) \right\} $$ and we want to choose all the parameters $\mu, \Sigma, \pi$ to maximise the log likelihood.

Checking where the derivative of $\log p(\mathbf{x} \mid \pi, \mu, \Sigma)$ with respect to $\mu_k$ is zero,
we obtain
$$
0 = \sum_{n=1}^N \frac{\pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k ) }
{ \sum_{k=1}^K \pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k)} \Sigma_k^{-1}
(\mathbf{x}_n - \mu_k)
$$
and rearranging, we obtain
$$
\mu_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) \mathbf{x}_n
$$
where
$$
N_k := \sum_{n=1}^N \gamma (z_{nk})
$$
is called the effective number of points assigned to the $k^\text{th}$ gaussian.
(**Exercise: verify this**)

Similarly, taking the derivative with respect to $\Sigma_k$, we obtain $$ \Sigma_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n - \mu_k) (\mathbf{x}_n - \mu_k)^T $$

and taking the derivative with respect to $\pi_k$ (using the method of Lagrange multipliers, we can include
the constraint $\sum_{k=1}^K \pi_k = 1$,
$$
\pi_k = N_k / N
$$
(**Exercise: verify this**)

Now, the problem is that $\gamma(z_{nk})$ is dependant on the parameters $\mu, \Sigma, \pi$, but those parameters in turn depend on the responsibilities $\gamma(z_{nk})$. This means that we don't have our paramters in closed form.

We use the EM algorithm to numerically find good values for the parameters $\mu, \Sigma, \pi$, that maximise the log likelihood.

Initialise the parameters $\mu_k, \Sigma_k, \pi_k$, and evaluate the log-likelihood.

Compute the responsibilities (which depend on the parameters) $$ \gamma(z_{nk}) := \frac{\pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k ) } { \sum_{k=1}^K \pi_{nk} \mathcal{N}( \mathbf{x}_n \mid \mu_k, \Sigma_k)} $$

Compute the parameters (which depend on the responsibilities)

\begin{align} N_k & := \sum_{n=1}^N \gamma(z_{nk}) \\ \pi_k^{\text{new}} & := \frac{N_k^\text{new}}{N} \\ \mu_k^\text{new} & := \frac{1}{N_k^\text{new}} \sum_{k=1}^K \gamma(z_{nk}) \mathbf{x}_n \\ \Sigma_k^\text{new} & := \frac{1}{N_k^\text{new}} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n - \mu_k^\text{new}) (\mathbf{x}_n - \mu_k^\text{new})^T \end{align}Compute the log-likelihood, if it has no converges, goto 2. $$ \log p(\mathbf{x} \mid \pi, \mu, \Sigma) = \sum_{n=1}^N \log \left\{ \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k) \right\} $$

Variable | $\qquad \qquad$ Type | Definition | Known? |
---|---|---|---|

$\mu$ | $\mathbb{R}^{K \times D}$ | A matrix of all the gaussian means $\mu_1, \ldots, \mu_K$. | No |

$\Sigma$ | $\mathbb{R}^{K \times D \times D}$ | A 3-tensor of all the gaussian covariances $\Sigma_1, \ldots, \Sigma_K$. | No |

$\pi$ | $\mathbb{R}^{K}$ | A vector of all the gaussian mixture weightings $\pi_1, \ldots, \pi_K$. | No |

$\mu_k$ | $\mathbb{R}^D$ | The mean of the $k^\text{th}$ gaussian. | No |

$\Sigma_k$ | $\mathbb{R}^{D \times D}$ | The covariance of the $k^\text{th}$ gaussian. | No |

$\pi_k$ | $\mathbb{R}$ | The mixture weighting on the $k^\text{th}$ gaussian. | No |

$\mathbf{z}_n$ | $\{0,1\}^K$ | Latent variable, a one-hot encoding of which gaussian $\mathbf{x}_n$ was sampled from. | No |

$(z_n)_k, z_{nk}$ | $\{0,1\}$ | $k^\text{th}$ term in the $n^\text{th}$ latent variable $\mathbf{z}_n$. | No |

$\gamma_{z_{nk}}$ | $\mathbb{R}$ | The responsibility for the $k$ component to explain the observation $\mathbb{x}_n$. | No |

$N_k$ | $\mathbb{N}$ | The effective number of points assigned to the $k^\text{th}$ gaussian. | No |

$X$ | $\mathbb{R}^{N \times D}$ | Data matrix | Yes |

$\mathbf{x}_n$ | $\mathbb{R}^D$ | A data point | Yes |

$N$ | $\mathbb{N}$ | Number of data points | Yes |

$D$ | $\mathbb{N}$ | Dimension of data points | Yes |

$K$ | $\mathbb{N}$ | Number of gaussians | Yes |

Download the famous *old faithful* dataset from here:

https://machlearn.gitlab.io/sml2020/tutorials/07-dataset.csv

Ensure you have the dataset for this week. The data is a collection of two dimensional data points, and describes eruptions of the Old Faithful geyser in Yellowstone National Park, Wyoming, US. The two features are the waiting time between eruptions and the duration of the eruption. Load the data and normalise the features to have zero mean and unit variance.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Create variables `mu_0`

, `pi_0`

and `Sigma_0`

to store the initial parameters of the Gaussian mixture model for the data. Our mixture model will be made up of two gaussians, and will be initialised with means
$$
\mu_1 =
\begin{bmatrix}
-1\\+1
\end{bmatrix}
\qquad
\mathrm{and}
\qquad
\mu_2 =
\begin{bmatrix}
+1\\-1
\end{bmatrix}
$$
respectively.

The covariance matricies should be initialised to the identity matrix, and the mixture should start off with equal probability.

As a reminder, `pi_0`

is a vector of all the mixture weights initially,
`mu_0`

is a matrix of all the means initially, and `Sigma_0`

is a 3-tensor of the covariances initially.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

To help visualise the EM process, we will plot the data and ellipses representing our two Gaussian components.

The next cell provides some helper functions for this plotting.

In [ ]:

```
# plot_cov_ellipse was taken from here:
# http://www.nhsilbert.net/source/2014/06/bivariate-normal-ellipse-plotting-in-python/
def plot_cov_ellipse(cov, pos, volume=.5, ax=None, fc='none', ec=[0,0,0], a=1, lw=2):
"""
Plots an ellipse enclosing *volume* based on the specified covariance
matrix (*cov*) and location (*pos*). Additional keyword arguments are passed on to the
ellipse patch artist.
Parameters
----------
cov : The 2x2 covariance matrix to base the ellipse on
pos : The location of the center of the ellipse. Expects a 2-element
sequence of [x0, y0].
volume : The volume inside the ellipse; defaults to 0.5
ax : The axis that the ellipse will be plotted on. Defaults to the
current axis.
"""
import numpy as np
from scipy.stats import chi2
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
def eigsorted(cov):
vals, vecs = np.linalg.eigh(cov)
order = vals.argsort()[::-1]
return vals[order], vecs[:,order]
if ax is None:
ax = plt.gca()
vals, vecs = eigsorted(cov)
theta = np.degrees(np.arctan2(*vecs[:,0][::-1]))
kwrg = {'facecolor':fc, 'edgecolor':ec, 'alpha':a, 'linewidth':lw}
# Width and height are "full" widths, not radius
width, height = 2 * np.sqrt(chi2.ppf(volume,2)) * np.sqrt(vals)
ellip = Ellipse(xy=pos, width=width, height=height, angle=theta, **kwrg)
ax.add_artist(ellip)
def plot_components(mu, Sigma, colours, *args, **kwargs):
'''
Plot ellipses for the bivariate normals with mean mu[:,i] and covariance Sigma[:,:,i]
'''
assert mu.shape[1] == Sigma.shape[2]
assert mu.shape[0] == 2
assert Sigma.shape[0] == 2
assert Sigma.shape[1] == 2
for i in range(mu.shape[1]):
kwargs['ec'] = colours[i]
plot_cov_ellipse(Sigma[i], mu[i], *args, **kwargs)
import matplotlib.colors as mcol
br_cmap = mcol.LinearSegmentedColormap.from_list("MyCmapName",["b","r"])
def plot_data(redness=None):
if redness is not None:
assert len(redness) == data.shape[1]
assert all(_ >= 0 and _ <= 1 for _ in redness)
c = redness
else:
c = 'g'
plt.figure(figsize=(8,8))
plt.scatter(data[:,0],data[:,1], marker='.', s=8, linewidths=2, c=c, cmap=br_cmap)
plt.xlabel(data_labels[0])
plt.ylabel(data_labels[1])
plt.axis([-2,2,-2,2], 'equal')
```

The following cell is an example of using these functions. It should plot the data and two ellipses representing your initial Gaussian clusters.

In [ ]:

```
if 'mu_0' in locals() and 'Sigma_0' in locals() and 'data' in locals(): # check the student has defined these vars
plot_data()
plot_components(mu_0, Sigma_0, ['b','r'], 0.2)
plt.show()
```

As discussed in Lecture 6b, we can optimise the Gaussian mixture model with expectation-maximisation, and perform maximum likelihood estimation for $\mathbf{\mu}$, $\mathbf{\Sigma}$ and $\mathbf{\pi}$.

First implement functions that perform the E-step and the M-step. We suggest the following function signatures:

`def e_step(X, mu, Sigma, pi):`

and

`def m_step(X, gamma):`

Hint: You can use

`from scipy.stats import multivariate_normal as mvnorm`

`mvnorm.pdf(x, mu, sigma)`

to compute $$\mathcal{N}(\mathbf{x}_N \mid \mu_k, \Sigma_k)$$.

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Write a function to calculate the log-likehood of your model for given parameters for the Gaussian distribution

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Plot the log-likehood of your model against the number of updates made to it. Does it converge?

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Use `plot_data`

and `plot_components`

to plot the data and your current Gaussian distributions for a range of number of updates. Does the model change as you would expect?

In [ ]:

```
# replace this with your solution, add and remove code and markdown cells as appropriate
```

Refer to Bishop and/or the lecture slides, and explain with equations:

- the assumed generative data generating process for the mixture of Gaussians,

- what the parameters of the model are,
- what the latent (unobserved) variables are and why they make maximum likelihood parameter estimation non-trivial,
- how EM provides a solution to the maximum likelihood estimation problem.

Verify that maximization of the complete-data log likelihood \begin{align} \ln p(\mathbf{X},\mathbf{Z}|\mathbf{\mu},\mathbf{\Sigma},\mathbf{\pi}) & = \sum_{n=1}^N \sum_{k=1}^K z_{nk} \{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k, \mathbf{\Sigma}_k \}, \end{align} for a Gaussian mixture model leads to the result that the means and covariances of each component are fitted independently to the corresponding group of data points, and the mixing coefficients are given by the fractions of points in each group.

Show that if we maximize \begin{align} \mathbb{E}_\mathbf{Z} \left[ \ln p(\mathbf{X},\mathbf{Z}|\mathbf{\mu},\mathbf{\Sigma},\mathbf{\pi}) \right] & = \sum_{n=1}^N \sum_{k=1}^K \gamma(z_{nk}) \{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n | \mathbf{\mu}_k \mathbf{\Sigma}_k \}, \end{align} with respect to $\mathbf{\mu}_k$ while keeping the responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed form solution \begin{align} \mathbf{\mu}_k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}). \end{align}

Show that if we instead maximize with respect to $\mathbf{\Sigma}_k$ and $\mathbf{\pi}$ while keeping the responsibilities $\gamma(z_{nk})$ fixed, we obtain the closed form solution \begin{align} \mathbf{\Sigma}_k & = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n-\mathbf{\mu}_k)(\mathbf{x}_n-\mathbf{\mu}_k)^\top, \\ \pi_k &= \frac{N_k}{N}. \end{align}