Classification

COMP4670/8600 - Statistical Machine Learning - Tutorial

Textbook Questions

These questions are hand-picked both to be of reasonable difficulty and to demonstrate what you are expected to be able to solve. The questions are labelled in Bishop as $\star$, $\star\star$, or $\star\star\star$ to indicate their difficulty.

  • Question 4.4: If you are unfamiliar with Lagrange multipliers, look at Appendix E of the textbook. (Difficulty $\star$, simple algebraic derivation)
  • Question 4.5: (Difficulty $\star$, simple algebraic derivation)
  • Question 1.24: Note that in the equation $L_{kj}=1-I_{kj}$, $I$ is the identity matrix, so if $k=j$ then $I_{kj}=1$ and $L_{kj}=1-1=0$. (Difficulty $\star\star$, requires good understanding of the formulation of how to minimise expected loss)
  • Question 1.25: This requires calculus of variations (used much more later in the course), which is in Appendix D, specifically the Euler-Lagrange result. Assume that everything is continuous and continuously differentiable so that you can bring the differentiation inside the integral sign. (Difficulty $\star$, simple extension of proof in textbook to multiple target variables)
  • Question 4.9: First state the likelihood. When maximising this, what constraints need to be enforced? Given such constraints, use Lagrange multipliers to derive the results. (Difficulty $\star$, simple algebraic derivation)
  • Question 4.10: For the covariance matrix, you should only need identities from Sam Roweis' Matrix Identities to derive the result. Note that since $(x_n-\mu_k)^T\Sigma^{-1}(x_n-\mu_k)$ is a $1 \times 1$ matrix (a scalar), it equals its own trace, so you can apply the cyclic property of the trace to it. (Difficulty $\star\star$, the covariance matrix derivation requires uncommon identities)
  • Question 4.11: (Difficulty $\star\star$, short derivation but requires understanding what the question setup allows you to apply)
  • Question 4.12: (Difficulty $\star$, simple algebraic derivation)

In this lab we will build, train, and test a logistic regression classifier.

Assumed knowledge:

  • Optimisation in Python (lab)
  • Regression (lab)
  • Binary classification with logistic regression (lectures)

After this lab, you should be comfortable with:

  • Implementing logistic regression
  • Practical binary classification problems
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt

%matplotlib inline

The data set

We will be working with the census-income dataset, which records income levels for people in the 1994 US Census. We will predict whether a person earns $\leq \$50{,}000$ or $> \$50{,}000$ per year.

The data are included with this notebook as 04-dataset.tsv, a text file in which the entries in each row are delimited by tab characters. Download the data from the course website. Load the data into a NumPy array called data using numpy.genfromtxt:

numpy.genfromtxt(filename)

The column names are given in the variable columns below. The income column contains the targets, and the other columns form the data we will use to predict the income.

In [ ]:
columns = ['income', 'age', 'education', 'private-work', 'married', 'capital-gain', 'capital-loss', 'hours-per-week']
In [ ]:
data_raw = np.genfromtxt("04-dataset.tsv")

Recap - Binary classification

The idea behind this lab is that for each person, we want to predict whether their income is above the threshold of $\$50,000$ or not, based on the other information we have about them: age, education, and so on.

As per usual, for the $n^\text{th}$ row, the first entry is the target $t_n$, and the rest forms the data vector $\mathbf{x}_n$.

We have two classes: $C_1$, representing income $> \$50{,}000$, which corresponds to a target of $t_n = 1$, and $C_2$, representing income $\leq \$50{,}000$, which corresponds to a target of $t_n = 0$ (this matches the encoding used in the likelihood below). Our objective is to learn a discriminative function $f_{\mathbf{w}}(\mathbf{x})$, parametrised by a weight vector $\mathbf{w}$, that predicts which income class a person is in, based on the data given.

We assume that the pairs $(t_n, \mathbf{x}_n)$ are drawn i.i.d. from some hidden probability distribution. We will construct a likelihood function that answers the question: "how likely is this particular weight vector $\mathbf{w}$ to have generated the observed training data $\left\{(t_n, \mathbf{x}_n)\right\}_{n=1}^N$?"

Recap - Feature map, basis function

Some classes are not linearly separable in the input space (we cannot draw a line such that all of one class is on one side and all of the other class is on the other side). But by first applying a fixed non-linear transformation $\phi$ to the inputs $\mathbf{x}_n$, for a suitable choice of $\phi$ the transformed data will often be linearly separable (see week 3, pg 342 of the lecture slides).

We let $$ \mathbf{\phi}_n := \phi(\mathbf{x}_n) $$

and work in this feature space rather than the input space. For the case of two classes, we could guess that the target is a linear combination of the features, $$ \hat{t}_n = \mathbf{w}^T \mathbf{\phi}_n $$ but $\mathbf{w}^T \mathbf{\phi}_n$ is a real number, and we want $\hat{t}_n \in \{0,1\}$. We could threshold the result, $$ \hat{t}_n = \begin{cases} 1 & \mathbf{w}^T \mathbf{\phi}_n \geq 0 \\ 0 & \mathbf{w}^T \mathbf{\phi}_n < 0 \end{cases} $$ but the discontinuity makes it impossible to define a sensible gradient.
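As a concrete illustration (a minimal sketch, not part of the lab solution), the hard-threshold rule above could be written as follows, where Phi is assumed to be an $N \times D$ matrix whose rows are the feature vectors $\phi_n$:

In [ ]:
def hard_threshold_predict(w, Phi):
    """Assign class 1 where w^T phi_n >= 0 and class 0 otherwise."""
    return (Phi @ w >= 0).astype(int)

Because the output jumps between 0 and 1, this rule provides no useful gradient information for learning $\mathbf{w}$.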

Recap - Logistic Regression

(We assume that the classes are already linearly separable, and we use the input space as our feature space. We also assume the data are i.i.d.)

Instead of using a hard threshold as above, in logistic regression we use the sigmoid function $\sigma(a)$ $$ \sigma(a) := \frac{1}{1 + e^{-a}} $$ which has the intended effect of "squishing" the real line into the interval $(0,1)$. This gives a smooth version of the threshold function above, which we can differentiate. The number it returns can be interpreted as the probability of the target belonging to a class given the element $\phi$ of feature space. In the case of two classes, we define

\begin{align} p(C_1 | \phi ) &:= \sigma (\mathbf{w}^T \phi) \\ p(C_2 | \phi ) &:= 1 - p(C_1 | \phi) \end{align}
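As a minimal sketch (relying on the numpy import from earlier in the notebook), the sigmoid can be implemented directly from its definition:

In [ ]:
def sigmoid(a):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

sigmoid(0.0)  # 0.5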

The likelihood function $p(\mathbf{t} | \mathbf{w}, \mathbf{x})$ is what we want to maximise as a function of $\mathbf{w}$. Since $\mathbf{x}$ is fixed, we usually write the likelihood function as $p(\mathbf{t} | \mathbf{w})$.

$$ \begin{align} p(\mathbf{t} | \mathbf{w}) &= \prod_{n=1}^N p(t_n | \mathbf{w}) \\ &= \prod_{n=1}^N \begin{cases} p(C_1 | \phi_n) & t_n = 1 \\ p(C_2 | \phi_n) & t_n = 0 \end{cases} \end{align} $$

Note that $$ \begin{cases} y_n & t_n = 1 \\ 1 - y_n & t_n = 0 \end{cases} = y_n^{t_n} (1-y_n)^{1-t_n} $$ since if $t_n = 1$ then $y_n^1 (1-y_n)^{1-1} = y_n$, and if $t_n = 0$ then $y_n^0 (1-y_n)^{1-0} = 1-y_n$. This is why we use the encoding in which $t_n = 1$ corresponds to $C_1$ and $t_n = 0$ corresponds to $C_2$. Hence, our likelihood function is $$ p(\mathbf{t} | \mathbf{w}) = \prod_{n=1}^N y_n^{t_n} (1-y_n)^{1-t_n}, \quad y_n = \sigma(\mathbf{w}^T \phi_n) $$

This function is unpleasant to differentiate directly, but note that $p(\mathbf{t} | \mathbf{w})$ is maximised exactly when $\log p(\mathbf{t} | \mathbf{w})$ is maximised: \begin{align} \log p(\mathbf{t} | \mathbf{w}) &= \log \prod_{n=1}^N y_n^{t_n} (1-y_n)^{1-t_n} \\ &= \sum_{n=1}^N \log \left( y_n^{t_n} (1-y_n)^{1-t_n} \right) \\ &= \sum_{n=1}^N \left( t_n \log y_n + (1-t_n) \log (1-y_n) \right) \end{align}

This is maximised when $- \log p(\mathbf{t} | \mathbf{w})$ is minimised, which gives us our error function $$ E(\mathbf{w}) := - \sum_{n=1}^N \left( t_n \log y_n + (1-t_n) \log (1-y_n) \right) $$ Taking the derivative of this gives $$ \nabla_\mathbf{w} E(\mathbf{w}) = \sum_{n=1}^N (y_n - t_n) \phi_n $$

(Note: we also usually divide the error by the number of data points to obtain the average error. The error shouldn't become 10 times as large just because 10 times as much data is available, so dividing by the number of data points reflects that.)
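For illustration only (a sketch, not the lab solution): given a vector y of predictions $y_n = \sigma(\mathbf{w}^T \phi_n)$ and a vector t of 0/1 targets (these variable names are assumptions), the averaged error could be computed as:

In [ ]:
def cross_entropy_error(y, t):
    """Average negative log-likelihood for predictions y in (0, 1) and binary targets t."""
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))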

Exercise

Take the derivative of $E(\mathbf{w})$, and show that it is equal to the above. Note that the final expression does not contain any derivatives of the sigmoid function. (Hint: use the identity $\sigma'(a) = \sigma(a) \left( 1- \sigma(a) \right)$ to simplify.)

Recap - $L_2$ regularisation, Gaussian prior

To help avoid overfitting, we can add a penalty term of the form $\frac{\lambda}{2} ||\mathbf{w}||^2$ to the cost function. By tweaking the value of $\lambda$, we control how strongly large entries in the weight vector $\mathbf{w}$ are penalised. Don't forget to take the regularisation term into account when you compute the corresponding gradient $\nabla_\mathbf{w} E(\mathbf{w})$.
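Concretely, the regularised error function is $$ E_\text{reg}(\mathbf{w}) := - \sum_{n=1}^N \left( t_n \log y_n + (1-t_n) \log (1-y_n) \right) + \frac{\lambda}{2} ||\mathbf{w}||^2 $$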

Exercise

Take the derivative of $E(\mathbf{w})$ again, accounting for the added regularisation term.

Explain logistic regression (10 minutes)

Find a partner in your lab (or groups of 3). Take turns to explain the topics above to each other, without referring to the lab sheet. Be as precise as possible, by writing down the relevant equations.

Classification with logistic regression

Implement binary classification using logistic regression and $L_2$ regularisation. Make sure you write good quality code with comments and docstrings where appropriate.

Use scipy.optimize.fmin_bfgs to optimise your cost function. fmin_bfgs takes as arguments the cost function to be minimised, an initial guess for the parameters, and a tuple of extra arguments passed to the cost function:

scipy.optimize.fmin_bfgs(cost_function, initial_guess, args=())
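For example, here is a minimal sketch of the calling convention on a toy quadratic (the function f and the constant passed via args are purely illustrative):

In [ ]:
def f(x, c):
    """Toy cost: squared distance of x from the constant c."""
    return np.sum((x - c) ** 2)

w_opt = opt.fmin_bfgs(f, np.zeros(2), args=(3.0,))  # minimise f(x, 3.0) starting from the origin
w_opt  # approximately [3., 3.]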

By following equations in lectures, implement three functions:

  • grad(w, X, t, a), which calculates the gradient of the cost function,
  • train(X, t, a), which returns the learned (regularised maximum likelihood) weight vector, and
  • test(w, X), which returns predicted class probabilities,

where

  • $w$ is a weight vector,
  • $X$ is a matrix of examples,
  • $t$ is a vector of labels/targets,
  • $a$ is the regularisation weight.

(We would use $\lambda$ for the regularisation weight, but a is easier to type than lambda, and lambda is a reserved keyword in Python, used for lambda functions.)

See below for expected usage.
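For instance, usage along the following lines might be expected (a hypothetical sketch; the variable names X_train, t_train, and X_test are assumptions, coming from the train/test split described in the notes below):

w = train(X_train, t_train, a=0.1)
probabilities = test(w, X_test)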

We add an extra column of ones to represent the bias term.

Note

  • You should use 80% of the data as your training set, and 20% of the data as your test set.
  • You may also want to normalise the data beforehand. If the magnitude of $\mathbf{w}^T \phi_n$ is very large, the gradient of $\sigma(\mathbf{w}^T \phi_n)$ will be very close to zero, which can cause convergence issues during numerical minimisation. Multiplying every element of a particular column by a scalar (say, all entries of the age column) essentially just stretches the space in which the data lives; the model is stretched proportionally but its behaviour does not fundamentally change. So by normalising each column we can avoid numerical convergence issues. (A minimal sketch of the split and normalisation follows the cell below.)
In [ ]:
assert data_raw.shape[1] == len(columns)
data = np.concatenate([data_raw, np.ones((data_raw.shape[0], 1))], axis=1)  # add a column of ones
data.shape
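One possible sketch of the 80/20 split and column normalisation mentioned above (assumptions: the target is column 0 of data, and the bias column added above is the last column):

In [ ]:
rng = np.random.default_rng(0)              # fixed seed so the split is reproducible
perm = rng.permutation(data.shape[0])       # shuffle the rows before splitting
n_train = int(0.8 * data.shape[0])
train_data, test_data = data[perm[:n_train]], data[perm[n_train:]]

t_train, X_train = train_data[:, 0], train_data[:, 1:].copy()
t_test, X_test = test_data[:, 0], test_data[:, 1:].copy()

# Normalise the feature columns (all but the final bias column) using training-set statistics.
mean, std = X_train[:, :-1].mean(axis=0), X_train[:, :-1].std(axis=0)
X_train[:, :-1] = (X_train[:, :-1] - mean) / std
X_test[:, :-1] = (X_test[:, :-1] - mean) / std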
In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Performance measure

There are many ways to compute the performance of a binary classifier. The key concept is the idea of a confusion matrix:

                Label 0   Label 1
Prediction 0      TN        FN
Prediction 1      FP        TP

where

  • TP - true positive
  • FP - false positive
  • FN - false negative
  • TN - true negative

Implement three functions:

  • confusion_matrix(y_true, y_pred), which returns the confusion matrix as a list of lists given a list of true labels and a list of predicted labels;
  • accuracy(cm), which takes a confusion matrix and returns the accuracy; and
  • balanced_accuracy(cm), which takes a confusion matrix and returns the balanced accuracy.

The accuracy is defined as $\frac{TP + TN}{n}$, where $n$ is the total number of examples. The balanced accuracy is defined as $\frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right)$, where $P$ and $N$ are the total number of positive and negative examples respectively.
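For example, suppose a test set contains 90 negative and 10 positive examples, and a classifier predicts everything as negative except for 2 positives that it gets right. Then $TN = 90$, $FP = 0$, $TP = 2$, $FN = 8$, so the accuracy is $(2 + 90)/100 = 0.92$, while the balanced accuracy is only $\frac{1}{2}\left(\frac{2}{10} + \frac{90}{90}\right) = 0.6$.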

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Accuracy vs balanced accuracy

What is the purpose of balanced accuracy? When might you prefer it to accuracy?

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

Putting everything together

Consider the following code, which trains on all the examples, predicts on the training set, and then computes the accuracy and balanced accuracy. Discuss the results.

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate

Answer

--- replace this with your solution, add and remove code and markdown cells as appropriate ---

Looking back at the prediction task

Based on your results, what feature of the dataset is most useful for determining the income level? What feature is least useful? Why?

In [ ]:
# replace this with your solution, add and remove code and markdown cells as appropriate
In [ ]: