## Classification : Logistic Regression

There is a fair bit of real math and science behind the audacious projects seen in HBO's hit TV show Silicon Valley. Of these projects, probably the most realistic, if still somewhat audacious, was the one in the 4th episode of season 4. Its goal was to build an AI app that can identify food from pictures: a "Shazam for food", if you will. But things break down hilariously when the app ends up identifying only hot dogs; everything else is "not hot dog". In fact, Tim Anglade, an engineer who works for the show, made a real-life Not Hotdog app, which you can find on Android and iOS. He also wrote a blog post on Medium explaining how he did it, and it involves some serious ingenuity. Fun fact: he used Paperspace's P5000 instance on Ubuntu to speed up some of the experiments.

So, what is the ML behind the Not Hotdog app? Let's start from the basics.

## Classification

Classification is the task of assigning a label to an observation. A classification algorithm takes as input a set of labeled observations and outputs a classifier. This is a supervised learning technique as it requires supervision in the form of training data to learn a classifier.

The Not Hotdog app classifies images into one of two categories, hence it is an instance of a Binary Classifier. It was trained with a dataset consisting of several thousand images of hot dogs and other foods, in order to recognize hot dogs.

Getting into a little bit of math,

let $D_n = \{(x_i,y_i): i \in [n]\}$ be a dataset of $n$ labeled observations.

Here $x_i \in \mathbb{R}^k$ are features and $y_i \in \{0,1\}$ are the labels.

A classifier $l:\mathbb{R}^k \to \{0,1\}$ is a function which predicts a label $\hat{y}$ for a new observation $x$ such that $l(x) = \hat{y}$.

A binary classification algorithm $A:\mathcal{D}\to\mathcal{L}$ is a function from $\mathcal{D}$, the set of datasets, to $\mathcal{L}$, the set of binary classifiers.

Simply put, given a $D_n \in \mathcal{D}$, the algorithm $A$ outputs a binary classifier $l \in \mathcal{L}$ such that $A(D_n) = l$. But what criteria does it use to choose $l$?

Usually, $A$ searches within a set of parameterized classifiers $\mathcal{L}_w$ for one which minimizes a loss function $L(D_n,l)$. By parameterized classifiers, we mean $\mathcal{L}_{w} = \{l(x,w): l \in \mathcal{L} \text{ and } w \in \mathcal{C}\}$, where $\mathcal{C}$ is a set of parameters.

$$A(D_n) = \arg\min_{l \in \mathcal{L}_w}L(D_n,l)$$
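
As a rough illustration of this argmin, the sketch below searches a tiny, hand-picked family of threshold classifiers $l(x,w)$ that predict 1 when $x \ge w$, and keeps the one with the lowest misclassification rate. The dataset, the candidate thresholds and the zero-one loss are all made up for this example; they are assumptions, not what the Not Hotdog app uses.

```python
# Toy dataset D_n: one feature per observation, labels in {0, 1} (made-up values).
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.5, 1), (3.0, 1), (3.5, 1)]

def classifier(x, w):
    """Parameterized classifier l(x, w): predict 1 when x is at least the threshold w."""
    return 1 if x >= w else 0

def zero_one_loss(data, w):
    """Loss L(D_n, l): fraction of observations that l(., w) labels incorrectly."""
    mistakes = sum(1 for x, y in data if classifier(x, w) != y)
    return mistakes / len(data)

# The parameter set C, here just a handful of candidate thresholds.
candidate_thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]

# A(D_n): pick the parameter whose classifier minimizes the loss.
best_w = min(candidate_thresholds, key=lambda w: zero_one_loss(data, w))
print(best_w, zero_one_loss(data, best_w))
```

Real algorithms rarely enumerate parameters like this; the rest of the post uses a differentiable loss so that the search can follow gradients instead.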

To understand how this works, let's study one such binary classification algorithm called Logistic Regression.

## Logistic Regression

Logistic regression is probably one of the simplest binary classification algorithms out there. It consists of a single logistic unit, a neuron if you will. In fact, put several of these logistic units together, and you have a layer of neurons. Stack these layers of neurons on top of one another and you have a neural network, which is what deep learning is all about.

Coming back, the logistic unit uses the logistic function which is defined as:

$$\sigma(z) = \frac{1}{1+e^{-z}}$$

The graph of $\sigma(z)$ looks like an elongated and squashed "S". Since the value of $\sigma(z)$ is always between 0 and 1, it can be interpreted as a probability.
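
A quick numerical check of these properties, as a minimal sketch using only the standard library (the function name `sigmoid` is just a label chosen here):

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Evaluate on a few points: values stay strictly between 0 and 1
# and grow from near 0 to near 1 as z increases.
points = [-6.0, -2.0, 0.0, 2.0, 6.0]
for z in points:
    print(z, round(sigmoid(z), 4))
```

The curve passes through 0.5 at $z = 0$ and is symmetric in the sense that $\sigma(-z) = 1 - \sigma(z)$.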

If the observations $x$ are $k$-dimensional, then classifiers have $k+1$ real-valued parameters, which consist of a vector $w\in \mathbb{R}^{k}$ and a scalar $w_0 \in \mathbb{R}$. The set of classifiers considered here is:

$$\mathcal{L}_{w,w_0}=\{\sigma(w\cdot x + w_0) : w\in \mathbb{R}^{k}, w_0 \in \mathbb{R}\}$$

Here $\cdot$ is the vector dot product. These classifiers do not offer hard labels (either 0 or 1) for $x$. Instead they offer probabilities which are interpreted as follows.

$$\sigma(w \cdot x + w_0 ) = \Pr(y=1|x)$$

$\sigma(w \cdot x + w_0 )$ is the probability of $x$ belonging to class 1. The probability of $x$ belonging to class 0 would naturally be $1-\sigma(w \cdot x + w_0 )$.

The loss function typically used for classification is Cross Entropy. In the binary classification case, if the true labels are $y_i$ and the predictions are $\sigma(w \cdot x_i + w_0 ) = \hat y_i$, the cross entropy loss is defined as:

$$L(D_n, w,w_0) = \frac{-1}{n}\sum_{i=1}^n (y_i \log(\hat y_i) + (1-y_i)\log(1-\hat y_i))$$
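
To get a feel for this loss, here is a small sketch that evaluates it on hand-written predictions. The labels and probabilities are made-up values for illustration, and the function name `cross_entropy` is chosen here.

```python
import math

def cross_entropy(y_true, y_hat):
    """Mean binary cross entropy between labels in {0, 1} and probabilities in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_hat):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 0]
mostly_right = [0.9, 0.1, 0.8, 0.2]   # high probability on the true class
mostly_wrong = [0.1, 0.9, 0.2, 0.8]   # high probability on the wrong class

print(cross_entropy(y_true, mostly_right))
print(cross_entropy(y_true, mostly_wrong))
```

Predictions that put high probability on the correct class give a small loss, while confident mistakes are penalized heavily because $\log$ diverges near 0.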

Let $w^*$ and $w_0^*$ be the parameters which minimize $L(D_n, w,w_0)$. The output of Logistic regression must be the classifier $\sigma(w^*\cdot x+w_0^*)$. But how does Logistic regression find $w^*$ and $w_0^*$?

## Gradient Descent

Gradient Descent is a simple iterative procedure for finding a local minimum of a function. At each iteration, it takes a step in the direction of the negative gradient. For a convex function, every local minimum is also a global minimum. In that case, gradient descent, with a suitably chosen step size, converges to a global minimum.

If the function $L(D_n, w,w_0)$ is convex in $w$ and $w_0$, we can use Gradient Descent to find $w^*$ and $w_0^*$, the parameters which minimize $L$.

Observe that $w \cdot x + w_0 = [w_0,w]\cdot [1,x]$. Here $[w_0,w]$ and $[1,x]$ are $k+1$ dimensional vectors obtained by prepending $w_0$ to $w$ and $1$ to $x$ respectively. To simplify the math, from now on let $x$ be $[1,x]$ and $w$ be $[w_0,w]$.
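
The identity is easy to confirm numerically. The sketch below uses arbitrary made-up values for $w$, $w_0$ and $x$; the names `w_aug` and `x_aug` are introduced here for the augmented vectors.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # weights (made-up values)
w0 = 0.25                        # bias term
x = np.array([1.0, 2.0, 3.0])    # one observation

# Prepend w0 to w and the constant 1 to x.
w_aug = np.concatenate(([w0], w))
x_aug = np.concatenate(([1.0], x))

# Both expressions compute the same scalar.
print(np.dot(w, x) + w0, np.dot(w_aug, x_aug))
```

This is why the implementation later in the post adds a column of ones to the data matrix instead of tracking a separate bias.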

One way we could prove that $L(D_n, w)$ is convex is by showing that its Hessian is positive semidefinite (PSD) at every point, i.e. $\nabla^2_w L(D_n,w)\succeq 0$ for all $w\in \mathbb{R}^{k+1}$.

Let's start differentiating $L$.

$$\nabla L = \frac{-1}{n}\sum_{i=1}^n \left(\frac{y_i}{\hat y_i} - \frac{1-y_i}{1-\hat y_i}\right) \nabla \hat y_i$$

Here $\hat y_i = \sigma(w\cdot x_i)$. Let $z_i = w\cdot x_i$. By the chain rule, $\nabla \hat y_i = \frac{d \hat y_i}{dz_i}\nabla z_i$. First let's find $\frac{d \hat y_i}{dz_i}$.

$$\hat y_i = \frac{1}{1+e^{-z_i}} = \frac{e^{z_i}}{1+e^{z_i}}$$

Rearranging the terms, we get

$$\hat y_i + \hat y_i e^{z_i} = e^{z_i}$$

Differentiating wrt $z_i$,

$$\begin{align} \frac{d\hat y_i}{dz_i} + \frac{d\hat y_i}{dz_i}e^{z_i} + \hat y_i e^{z_i} &= e^{z_i}\\ \frac{d\hat y_i}{dz_i}(1+e^{z_i}) &= e^{z_i}(1-\hat y_i)\\ \frac{d\hat y_i}{dz_i} &= \frac{e^{z_i}}{(1+e^{z_i})}(1-\hat y_i) = \hat y_i (1-\hat y_i) \end{align}$$
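
The closed form $\hat y_i(1-\hat y_i)$ can be sanity-checked against a numerical derivative. This is a minimal sketch; the step size `h` and the test points are arbitrary choices made here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    """Closed form from the derivation: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, h=1e-6):
    """Numerical derivative (f(z + h) - f(z - h)) / (2h)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

points = [-3.0, -1.0, 0.0, 1.0, 3.0]
for z in points:
    print(z, sigmoid_derivative(z), central_difference(sigmoid, z))
```

The two columns should agree to many decimal places, and the derivative peaks at $z = 0$ where it equals $0.25$.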

Now, $\nabla z_i = \nabla_w(w\cdot x_i) = x_i$.

Substituting back in the original equation, we get:

$$\nabla L = \frac{-1}{n}\sum_{i=1}^n \frac{y_i-\hat y_i}{\hat y_i (1-\hat y_i)} \hat y_i (1-\hat y_i) x_i=\frac{1}{n}\sum_{i=1}^n(\hat y_i - y_i)x_i$$
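
As a hedged sanity check of the final expression $\frac{1}{n}\sum_i(\hat y_i - y_i)x_i$, the sketch below compares it with a finite-difference gradient of the loss on random data. The data, the seed and the step size `h` are arbitrary assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, w):
    """Mean binary cross entropy for parameters w (bias folded into X)."""
    y_hat = sigmoid(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradient(X, y, w):
    """Closed-form gradient: (1/n) * sum_i (y_hat_i - y_i) * x_i."""
    y_hat = sigmoid(X @ w)
    return X.T @ (y_hat - y) / len(y)

rng = np.random.default_rng(0)
# 20 random observations with 3 features; the first column is the constant 1.
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 3))])
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=4)

# Central differences along each coordinate direction.
h = 1e-6
numeric = np.array([
    (loss(X, y, w + h * e) - loss(X, y, w - h * e)) / (2 * h)
    for e in np.eye(4)
])
print(np.max(np.abs(numeric - gradient(X, y, w))))
```

The printed difference should be tiny, which suggests the algebra above is consistent with the loss definition.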

Differentiating once more,

$$\nabla^2 L = \frac{1}{n}\sum_{i=1}^n x_i (\nabla \hat y_i)^T = \frac{1}{n}\sum_{i=1}^n \hat y_i(1-\hat y_i)\, x_i x_i^T$$

$\hat y_i(1-\hat y_i) > 0$, since $\hat y_i \in (0,1)$. Each matrix $x_i x_i^T$ is PSD. Hence $\nabla^2 L\succeq 0$, so $L$ is a convex function and Gradient Descent can be used to find $w^*$.
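
The PSD claim can also be checked numerically. The sketch below assumes the Hessian has the form $\frac{1}{n}\sum_i \hat y_i(1-\hat y_i)\,x_i x_i^T$, builds it for random data and an arbitrary $w$, and inspects its eigenvalues. The data and seed are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(X, w):
    """(1/n) * sum_i y_hat_i * (1 - y_hat_i) * x_i x_i^T, written in matrix form."""
    y_hat = sigmoid(X @ w)
    weights = y_hat * (1.0 - y_hat)          # one positive weight per observation
    return (X * weights[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(1)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
w = rng.normal(size=4)

H = hessian(X, w)
eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh is meant for symmetric matrices
print(eigenvalues)
```

Note that the labels $y_i$ do not appear in the Hessian at all, so the curvature of the loss depends only on the inputs and the current parameters.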

Gradient Descent$(L, D_n,\alpha,\epsilon):$

Initialize $w$ to a random vector.

While $\|\nabla L(D_n,w)\|>\epsilon$:

$w = w -\alpha\nabla L(D_n,w)$
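
The same loop can be sketched generically in Python. To keep the example small it is applied to a simple convex quadratic rather than to the logistic loss; the function names, the `max_iter` safeguard and the test function are assumptions added here, not part of the procedure above.

```python
import numpy as np

def gradient_descent(grad, w_init, alpha, epsilon, max_iter=10000):
    """Repeat w = w - alpha * grad(w) until the gradient norm drops to epsilon."""
    w = np.asarray(w_init, dtype=float)
    for _ in range(max_iter):          # max_iter guards against a poorly chosen alpha
        g = grad(w)
        if np.linalg.norm(g) <= epsilon:
            break
        w = w - alpha * g
    return w

# Convex test function f(w) = ||w - target||^2 with gradient 2 * (w - target).
target = np.array([3.0, -1.0])

def grad_f(w):
    return 2.0 * (w - target)

w_min = gradient_descent(grad_f, w_init=[0.0, 0.0], alpha=0.1, epsilon=1e-8)
print(w_min)
```

With this step size the distance to the minimizer shrinks by a constant factor at every iteration, so the loop stops well before `max_iter`.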

Here $\alpha$ is a constant called the learning rate or the step size, and $\epsilon$ is the tolerance on the gradient norm at which the loop stops. Gradient Descent is a first-order method as it uses only the first derivative. Second-order methods like Newton's method could also be used. In Newton's method, $\alpha$ is replaced with the inverse of the Hessian: $(\nabla^2_w L(D_n,w))^{-1}$. Although second-order methods require fewer iterations to converge, each iteration becomes costlier as it involves matrix inversion.
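
For comparison, here is a minimal sketch of Newton's method for logistic regression on a tiny, deliberately overlapping 1-D dataset. The data, the zero initialization, the tolerance and the iteration cap are all assumptions made for this example. It also solves a linear system with `np.linalg.solve` instead of forming the inverse Hessian explicitly, which computes the same update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(X, y, w):
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def hessian(X, w):
    y_hat = sigmoid(X @ w)
    return (X * (y_hat * (1.0 - y_hat))[:, None]).T @ X / X.shape[0]

def newton(X, y, epsilon=1e-10, max_iter=50):
    """Newton's method: w = w - H^{-1} g, stopping once the gradient norm is tiny."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        g = gradient(X, y, w)
        if np.linalg.norm(g) <= epsilon:
            break
        # Solving H d = g gives d = H^{-1} g without computing the inverse.
        w = w - np.linalg.solve(hessian(X, w), g)
    return w

# Labels overlap along x, so the classes are not linearly separable
# and the minimizer is finite.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0, -0.2, 0.3])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
X = np.column_stack([np.ones_like(x), x])  # prepend the constant 1

w_star = newton(X, y)
print(w_star, np.linalg.norm(gradient(X, y, w_star)))
```

On a problem this small the extra cost per iteration is negligible; for MNIST-sized inputs the Hessian would already be a 785 by 785 matrix.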

In the case of neural networks, the gradients needed by gradient descent are computed with the Backpropagation algorithm.

## Toy Not Hotdog in Python

The real Not Hotdog app uses a state-of-the-art CNN architecture for running neural networks on mobile devices. We would not be able to do anything meaningful with just simple Logistic regression. Nevertheless, we can come close by using the MNIST dataset in a clever way.

The MNIST dataset consists of 70,000 28x28 images of handwritten digits. The digit "1" is the one which resembles a hot dog the most. So for this toy problem, let's say "1"s are hot dogs and the remaining digits are not hot dogs. This also somewhat resembles the imbalance between hot dog and not-hot-dog foods, as "1"s account for only about one-tenth of the digits (assuming each digit occurs with equal probability).

First let's load the MNIST dataset.

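```
import numpy as np
# fetch_mldata was removed from scikit-learn; fetch_openml replaces it
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
# OpenML returns the labels as strings, so convert them to floats
mnist.target = mnist.target.astype(np.float64)
```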

Let's use the first 60,000 images for training and test on the remaining 10,000. Since pixel values range between $[0,255]$, we divide by 255 to scale them to $[0,1]$. We modify the labels such that "1" is labeled 1 and the other digits are labeled 0.

```
X_train = mnist.data[:60000]/255.0
Y_train = mnist.target[:60000]
X_test = mnist.data[60000:]/255.0
Y_test = mnist.target[60000:]
# Relabel: "1" stays 1, every other digit becomes 0
Y_train[Y_train > 1.0] = 0.0
Y_test[Y_test > 1.0] = 0.0
```

Let's do logistic regression using scikit-learn.

```
from sklearn import linear_model
clf = linear_model.LogisticRegression()
clf.fit(X_train,Y_train)
Y_pred = clf.predict(X_test)
```

Now let's do it by implementing gradient descent with some help from NumPy.

```
def logistic(x):
    return 1.0/(1.0+np.exp(-x))

# The loss function
def cross_entropy_loss(X,Y,w,N):
    Z = np.dot(X,w)
    # Clip to avoid log(0) when the logistic saturates
    Y_hat = np.clip(logistic(Z), 1e-12, 1-1e-12)
    L = (Y*np.log(Y_hat)+(1-Y)*np.log(1-Y_hat))
    return (-1.0*np.sum(L))/N

# Gradient of the loss function
def D_cross_entropy_loss(X,Y,w,N):
    Z = np.dot(X,w)
    Y_hat = logistic(Z)
    DL = X*((Y_hat-Y).reshape((N,1)))
    DL = np.sum(DL,0)/N
    return DL

def gradient_descent(X_train,Y_train,alpha,epsilon):
    # Append "1" before the vectors
    N,K = X_train.shape
    X = np.ones((N,K+1))
    X[:,1:] = X_train
    Y = Y_train
    w = np.random.randn(K+1)
    DL = D_cross_entropy_loss(X,Y,w,N)
    while np.linalg.norm(DL)>epsilon:
        L = cross_entropy_loss(X,Y,w,N)
        # Gradient Descent step
        w = w - alpha*DL
        print("Loss:",L,"\t Gradient norm:", np.linalg.norm(DL))
        DL = D_cross_entropy_loss(X,Y,w,N)
    L = cross_entropy_loss(X,Y,w,N)
    DL = D_cross_entropy_loss(X,Y,w,N)
    print("Loss:",L,"\t Gradient norm:", np.linalg.norm(DL))
    return w

# After playing around with different values, I found these to be satisfactory
alpha = 1
epsilon = 0.01
w_star = gradient_descent(X_train,Y_train,alpha,epsilon)

# Predict on the test set: prepend the constant 1, then threshold at 0.5
N,K = X_test.shape
X = np.ones((N,K+1))
X[:,1:] = X_test
Y = Y_test
Z = np.dot(X,w_star)
Y_pred = logistic(Z)
Y_pred[Y_pred>=0.5] = 1.0
Y_pred[Y_pred<0.5] = 0.0
```

In the Not Hotdog example and also in our toy example, there is severe class imbalance. The ratio of 1s to not-1s is about 1:9. That means we get about 90% accuracy by just predicting not-1 all the time. Thus accuracy is not a robust measure of the classifier's performance. The F1 score of the smaller class is a better indicator of performance.
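
The point can be made concrete with a made-up set of 100 labels at a 1:9 ratio and a "classifier" that always predicts 0. The helper names below are chosen for this sketch, and returning an F1 of 0 when there are no true positives is a convention assumed here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_for_class(y_true, y_pred, positive=1):
    """F1 score of one class: harmonic mean of its precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # assumed convention when nothing is correctly detected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# 10 positives and 90 negatives, and a predictor that always says "not 1".
y_true = [1] * 10 + [0] * 90
always_zero = [0] * 100

print(accuracy(y_true, always_zero), f1_for_class(y_true, always_zero))
```

The trivial predictor scores 90% accuracy while its F1 for the "1" class is zero, which is why the reports below are read per class.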

```
from sklearn.metrics import classification_report
print(classification_report(Y_test,Y_pred))
```

For scikit-learn's Logistic regression:

```
             precision    recall  f1-score   support
        0.0       1.00      1.00      1.00      8865
        1.0       0.97      0.98      0.97      1135
avg / total       0.99      0.99      0.99     10000
```

For our implementation:

```
             precision    recall  f1-score   support
        0.0       0.99      0.99      0.99      8865
        1.0       0.94      0.93      0.94      1135
avg / total       0.99      0.99      0.99     10000
```

Both classifiers have the same average precision, recall and F1 score, but scikit-learn's version has a better F1 score for 1s.

Side Note: The goal of the original "Shazam for food" app would have been to build a multi-class classifier (albeit with a very large number of classes), but it ended up doing binary classification. I'm not sure how this would be possible, as the training procedures and loss functions for the two differ significantly. The real-life Not Hotdog app, however, was trained to be a binary classifier.