How to Use Maximum Likelihood Estimation for Parametric Classification Methods

In some previous tutorials that discussed how Bayes' rule works, a decision was made based on some probabilities (e.g. the likelihood and prior). Either these probabilities were given explicitly or calculated based on some given information. In this tutorial, the probabilities will be estimated based on training data.

This tutorial considers parametric classification methods in which the distribution of the data sample follows a known distribution (e.g. a Gaussian distribution). The known distribution is defined by a set of parameters. For the Gaussian distribution, the parameters are mean $\mu$ and variance $\sigma^2$. If the parameters of the sample's distribution are estimated, then the sample's distribution can be formed. As a result, we can make predictions for new instances that follow the same distribution.

Given a new unknown data sample, a decision can be made based on whether it follows the old samples' distribution or not. If it follows the old distribution, then the new sample is treated similarly to the old samples (e.g. classified according to the old samples' class).

Throughout this tutorial, parameters are estimated using the maximum likelihood estimation (MLE).

The outline of the tutorial is as follows:

• Steps to Estimate the Sample Distribution
• Maximum Likelihood Estimation (MLE)
• Bernoulli Distribution
• Multinomial Distribution
• Gaussian (Normal) Distribution

Let's get started.

Steps to Estimate the Sample Distribution

Based on Bayes' rule, the posterior probability is calculated according to the next equation:
$$P(C_i|x)=\frac{P(x|C_i)P(C_i)}{P(x)}$$
The evidence in the denominator is a normalization term that is the same for every class, so it can be dropped when comparing classes. Thus, the posterior is proportional to the product of the likelihood and the prior:
$$P(C_i|x) \propto P(x|C_i)P(C_i)$$
To calculate the posterior probability $P(C_i|x)$, first the likelihood probability $P(x|C_i)$ and prior probability $P(C_i)$ must be estimated.
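
As a minimal sketch of this relationship, the snippet below combines likelihoods and priors for a two-class problem. The numeric values are hypothetical and chosen only for illustration; they do not come from any dataset in this tutorial.

```python
import numpy as np

# Hypothetical values for a two-class problem and one observed sample x.
likelihoods = np.array([0.30, 0.10])   # P(x|C_1), P(x|C_2)
priors = np.array([0.40, 0.60])        # P(C_1), P(C_2)

# Unnormalized posterior: likelihood times prior.
unnormalized = likelihoods * priors

# The evidence P(x) is the sum of the unnormalized scores over all classes.
evidence = unnormalized.sum()
posteriors = unnormalized / evidence

# Dividing by the evidence rescales the scores but does not change their order.
predicted_class = int(np.argmax(unnormalized))
print(posteriors, predicted_class)
```

Because the evidence is shared by all classes, the class with the largest unnormalized score is also the class with the largest posterior.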

In parametric methods, these probabilities are calculated based on a probability function that follows a known distribution. The distributions discussed in this tutorial are Bernoulli, Multinomial, and Gaussian.

Each of these distributions has its parameters. For example, the Gaussian distribution has two parameters: mean $\mu$ and variance $\sigma^2$. If these parameters are estimated, then the distribution will be estimated. As a result, the likelihood and prior probabilities can be estimated. Based on these estimated probabilities, the posterior probability is calculated and thus we can make predictions for new, unknown samples.

Here is a summary of the steps followed in this tutorial to estimate the parameters of a distribution based on a given sample:

1. The first step is to claim that the sample follows a certain distribution. Based on the formula of this distribution, find its parameters.
2. The parameters of the distribution are estimated using the maximum likelihood estimation (MLE).
3. The estimated parameters are plugged into the claimed distribution, which results in the estimated sample's distribution.
4. Finally, the estimated sample's distribution is used to make decisions.

The next section discusses how the maximum likelihood estimation (MLE) works.

Maximum Likelihood Estimation (MLE)

MLE is a way of estimating the parameters of known distributions. Note that there are other ways to do the estimation as well, like the Bayesian estimation.

To start, there are two assumptions to consider:

1. The first assumption is that there is a training sample $\mathcal{X}=\{x^t\}_{t=1}^N$, where the instances $x^t$ are independent and identically distributed (iid).
2. The second assumption is that the instances $x^t$ are taken from a previously known probability density function (PDF) $p(x|\theta)$, where $\theta$ is the set of parameters that defines the distribution. In other words, the instances $x^t$ follow the distribution $p(x|\theta)$, given that the distribution is defined by the set of parameters $\theta$.

Note that $p(x|\theta)$ is the probability (density) of the instance $x$ under the distribution defined by the set of parameters $\theta$. By finding the proper set of parameters $\theta$, we can sample new instances that follow the same distribution as the instances $x^t$. How do we find $\theta$? That's where MLE comes into the picture.

According to Wikipedia:

For any set of independent random variables, the probability density function of their joint distribution is the product of their density functions.

Because the samples are iid (independent and identically distributed), the likelihood that the sample $\mathcal{X}$ follows the distribution defined by the set of parameters $\theta$ equals the product of the likelihoods of the individual instances $x^t$.

$$L(\theta|\mathcal{X}) \equiv p(\mathcal{X}|\theta)=\prod_{t=1}^N{p(x^t|\theta)}$$

The goal is to find the set of parameters $\theta$ that maximizes the likelihood $L(\theta|\mathcal{X})$. In other words, find the set of parameters $\theta$ that maximizes the chance of getting the samples $x^t$ drawn from the distribution defined by $\theta$. This is called the maximum likelihood estimation (MLE). This is formulated as follows:

$$\theta^* = arg \space max_\theta \space L{(\theta|\mathcal{X})}$$

The representation of the likelihood $L(\theta|\mathcal{X})$ can be simplified. Currently, it calculates the product between the likelihoods of the individual samples $p(x^t|\theta)$. Rather than calculating the likelihood, the log-likelihood leads to simplifications in doing the calculations, as it converts the product into a summation.

$$\mathcal{L}{(\theta|\mathcal{X})} \equiv log \space L(\theta|\mathcal{X})\equiv log \space p(\mathcal{X}|\theta)=log \space \prod_{t=1}^N{p(x^t|\theta)}$$

$$\mathcal{L}{(\theta|\mathcal{X})}=\sum_{t=1}^N{log \space p(x^t|\theta)}$$

The goal of the MLE is to find the set of parameters $\theta$ that maximizes the log-likelihood. This is formulated as follows:

$$\theta^* = arg \space max_\theta \space \mathcal{L}{(\theta|\mathcal{X})}$$
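
Besides simplifying the math, the log-likelihood is also easier to compute numerically. The following sketch, which assumes a hypothetical sample of 2000 Gaussian draws and a helper named `gaussian_pdf` introduced only for this illustration, shows the raw product of densities underflowing while the sum of logs stays finite.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: 2000 draws from a Gaussian with mean 0 and variance 1.
x = rng.normal(loc=0.0, scale=1.0, size=2000)

def gaussian_pdf(x, mu, var):
    # Gaussian density evaluated element-wise.
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

densities = gaussian_pdf(x, mu=0.0, var=1.0)

likelihood = np.prod(densities)             # product of 2000 numbers below 1
log_likelihood = np.sum(np.log(densities))  # sum of their logs

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # a finite negative number
```

Since the logarithm is monotonically increasing, the parameters that maximize the log-likelihood are the same ones that maximize the likelihood.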

In the Gaussian distribution, for example, the set of parameters $\theta$ is simply the mean and variance, $\theta=\{\mu,\sigma^2\}$. Once estimated, these parameters define a distribution from which new samples resembling the original samples $\mathcal{X}$ can be drawn.

The previous discussion prepared a general formula that estimates the set of parameters $\theta$. Next is to discuss how this works for the following distributions:

1. Bernoulli distribution
2. Multinomial distribution
3. Gaussian (normal) distribution

The steps to follow for each distribution are:

1. Probability Function: Find the probability function that makes a prediction.
2. Likelihood: Based on the probability function, derive the likelihood of the distribution.
3. Log-Likelihood: Based on the likelihood, derive the log-likelihood.
4. Maximum Likelihood Estimation: Find the maximum likelihood estimation of the parameters that form the distribution.
5. Estimated Distribution: Plug the estimated parameters into the probability function of the distribution.

Bernoulli Distribution

The Bernoulli distribution works with binary outcomes 1 and 0. It assumes that the outcome 1 occurs with a probability $p$. Because the probabilities of the two outcomes must sum to $1$, the probability that the outcome 0 occurs is thus $1-p$.

$$(p)+(1-p)=1$$

Given that $x$ is a Bernoulli random variable, the possible outcomes are 0 and 1.

A problem that can be solved using the Bernoulli distribution is tossing a coin, as there are just two outcomes.

Probability Function

The Bernoulli distribution is formulated mathematically as follows:

$$p(x)=p^x(1-p)^{1-x}, \space where \space x \in \{0,1\}$$

According to the above equation, there is only a single parameter which is $p$. In order to derive a Bernoulli distribution of the data samples $x$, the parameter $p$ must be estimated.

Likelihood

Remember the generic likelihood estimation formula given below?

$$L(\theta|\mathcal{X})=\prod_{t=1}^N{p(x^t|\theta)}$$

For the Bernoulli distribution, there is only a single parameter $p$. Thus, $\theta$ should be replaced by $p$. As a result, the probability function looks like this, where $p_0$ is the parameter:

$$p(x^t|\theta)=p(x^t|p_0)$$

Based on this probability function, the likelihood for the Bernoulli distribution is:

$$L(p_0|\mathcal{X})=\prod_{t=1}^N{p(x^t|p_0)}$$

The probability function can be written out as follows:

$$p(x^t|p_0)=p_0^{x^t}(1-p_0)^{1-x^t}$$

As a result, the likelihood is as follows:

$$L(p_0|\mathcal{X})=\prod_{t=1}^N{p_0^{x^t}(1-p_0)^{1-x^t}}$$

Log-Likelihood

After deriving the formula for the probability distribution, next is to calculate the log-likelihood. This is done by introducing $log$ into the previous equation.

$$\mathcal{L}(p_0|\mathcal{X}) \equiv log \space L(p_0|\mathcal{X})=log \space \prod_{t=1}^N{p_0^{x^t}(1-p_0)^{1-x^t}}$$

The $log$ of a product is the sum of the logs. This turns the product over the samples into a summation and also separates the two factors $p_0^{x^t}$ and $(1-p_0)^{1-x^t}$:

$$\mathcal{L}(p_0|\mathcal{X}) \equiv log \space L(p_0|\mathcal{X})= \sum_{t=1}^N{log \space p_0^{x^t}}+\sum_{t=1}^N{log \space (1-p_0)^{1-x^t}}$$

Using the log power rule, the log-likelihood is:

$$\mathcal{L}(p_0|\mathcal{X}) = log \space p_0\sum_{t=1}^N{x^t} + log \space (1-p_0) \sum_{t=1}^N{({1-x^t})}$$

The last summation term can be simplified as follows:

$$\sum_{t=1}^N{({1-x^t})}=\sum_{t=1}^N{1}-\sum_{t=1}^N{x^t}=N-\sum_{t=1}^N{x^t}$$

Going back to the log-likelihood function, here is its last form:

$$\mathcal{L}(p_0|\mathcal{X})=log(p_0)\sum_{t=1}^N{x^t} + log(1-p_0) (N-\sum_{t=1}^N{x^t})$$

After the log-likelihood is derived, next we'll consider the maximum likelihood estimation. How do we find the maximum value of the previous equation?

Maximum Likelihood Estimation

When the derivative of a function equals 0, the function is at a stationary point: it neither increases nor decreases there. The Bernoulli log-likelihood is concave in $p_0$, so its stationary point is its maximum. Thus, the maximum of the previous log-likelihood can be found by setting its derivative with respect to $p_0$ to 0.

$$\frac{d \space \mathcal{L}(p_0|\mathcal{X})}{d \space p_0}=0$$

Here $log$ denotes the natural logarithm, whose derivative is given below. With a different base, the derivative only gains a constant factor, which does not change where it equals 0.

$$\frac{d \space log(x)}{dx}=\frac{1}{x}$$

For the previous log-likelihood equation, here is its derivative:

$$\frac{d \space \mathcal{L}(p_0|\mathcal{X})}{d \space p_0}=\frac{\sum_{t=1}^N{x^t}}{p_0} - \frac{(N-\sum_{t=1}^N{x^t})}{(1-p_0)}=0$$

Note that $p_0 (1-p_0)$ can be used as a common denominator. As a result, the derivative becomes as given below:

$$\frac{d \space \mathcal{L}(p_0|\mathcal{X})}{d \space p_0}=\frac{(1-p_0)\sum_{t=1}^N{x^t}-p_0(N-\sum_{t=1}^N{x^t})}{p_0 (1-p_0)}=0$$

Because the derivative equals 0, there is no need for the denominator. The derivative is now as follows:

$$\frac{d \space \mathcal{L}(p_0|\mathcal{X})}{d \space p_0}=(1-p_0)\sum_{t=1}^N{x^t}-p_0(N-\sum_{t=1}^N{x^t})=0$$

After some simplifications, here is the result:

$$\frac{d \space \mathcal{L}(p_0|\mathcal{X})}{d \space p_0}=\sum_{t=1}^N{x^t}-p_0\sum_{t=1}^N{x^t}-p_0N+p_0\sum_{t=1}^N{x^t}=0$$

The following two terms cancel each other out:

$$-p_0\sum_{t=1}^N{x^t}+p_0\sum_{t=1}^N{x^t}$$

The derivative thus becomes:

$$\frac{d \space \mathcal{L}(p_0|\mathcal{X})}{d \space p_0}=\sum_{t=1}^N{x^t}-p_0N=0$$

Moving the negative term to the other side gives:

$$\sum_{t=1}^N{x^t}=p_0N$$

By dividing the two sides by $N$, the equation that calculates the parameter $p_0$ is:

$$p_0=\frac{\sum_{t=1}^N{x^t}}{N}$$

Simply, the parameter $p_0$ is calculated as the mean of all samples. Thus, the estimated parameter $p$ of the Bernoulli distribution is $p_0$.

Remember that $x^t \in \{0, 1\}$, which means the sum of all samples is the number of samples that have $x^t=1$. Thus, if there are 10 samples and out of them there are 6 ones, then $p_0=0.6$. By maximizing the likelihood (or the log-likelihood), the best Bernoulli distribution representing the data will be derived.
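
The closed-form result can be checked numerically. The sketch below uses a hypothetical sample of 10 tosses with 6 ones, matching the example above, and compares the closed-form estimate with a brute-force search over a grid of candidate values. The function name `log_likelihood` and the grid resolution are choices made for this illustration.

```python
import numpy as np

# Hypothetical sample of 10 coin tosses with 6 ones.
x = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])
N = len(x)

# Closed-form MLE: the fraction of ones in the sample.
p0 = x.sum() / N

def log_likelihood(p, x):
    # log(p) * sum(x) + log(1 - p) * (N - sum(x))
    ones = x.sum()
    return np.log(p) * ones + np.log(1 - p) * (len(x) - ones)

# Evaluate the log-likelihood on a grid of candidate values in (0, 1).
grid = np.linspace(0.01, 0.99, 99)
best_on_grid = grid[np.argmax(log_likelihood(grid, x))]

print(p0, best_on_grid)
```

Both approaches should agree on $p_0=0.6$, up to the resolution of the grid.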

Estimated Distribution

Remember that the probability function of the Bernoulli distribution is:

$$p(x)=p^x(1-p)^{1-x}, \space where \space x \in \{0,1\}$$

Once the parameter $p$ of the Bernoulli distribution is estimated as $p_0$, it is plugged into the generic formula of the Bernoulli distribution to return the estimated distribution of the sample $\mathcal{X}=\{x^t\}$:

$$p(x^t)=p_0^{x^t}(1-p_0)^{1-x^t}, \space where \space x^t \in \{0,1\}$$

Predictions can be made using the estimated distribution of the sample $\mathcal{X}=\{x^t\}$.

Multinomial Distribution

The Bernoulli distribution works with only two outcomes/states. To work with more than two outcomes the multinomial distribution is used, where the outcomes are mutually exclusive, so exactly one of them occurs in each experiment. The multinomial distribution is a generalization of the Bernoulli distribution.

A problem that can be modeled with the multinomial distribution is rolling a die. There are more than two possible outcomes, and each roll is independent of the others.

The probability of an outcome is $p_i$, where $i$ is the index of the outcome (i.e. class) and $K$ is the number of possible outcomes. Because the sum of the probabilities for all outcomes must be 1, the following applies:

$$\sum_{i=1}^K{p_i}=1$$

For each outcome $i$, there is an indicator variable $x_i$. The set of all variables is:

$$\mathcal{X}=\{x_i\}_{i=1}^K$$

A variable $x_i$ can be either 1 or 0. It is 1 if the outcome is $i$, and 0 otherwise. Remember that there is only a single outcome per experiment $t$. As a result, the sum of the variables $x_i^t$ over all the classes $i=1:K$ must be 1.

$$\sum_{i=1}^K{x_i^t}=1$$

Probability Function

The probability function can be stated as follows, where $K$ is the number of outcomes. It is the product of all probabilities for all outcomes.

$$p(x_1, x_2, x_3, ...x_K)=\prod_{i=1}^K{p_i^{x_i}}$$

Note that the sum of all $p_i$ is 1.

$$\sum_{i=1}^K{p_i}=1$$

Likelihood

The generic likelihood estimation formula is given below:

$$L(\theta|\mathcal{X}) \equiv P(X|\theta) =\prod_{t=1}^N{p(x^t|\theta)}$$

For the multinomial distribution, here is its likelihood where $K$ is the number of outcomes and $N$ is the number of samples.

$$L(p_i|\mathcal{X}) \equiv P(X|\theta)=\prod_{t=1}^N\prod_{i=1}^K{p_i^{x_i^t}}$$

Log-Likelihood

The log-likelihood for the multinomial distribution is as follows:

$$\mathcal{L}(p_i|\mathcal{X}) \equiv log \space L(p_i|\mathcal{X}) \equiv log \space P(X|\theta) =log \space \prod_{t=1}^N\prod_{i=1}^K{p_i^{x_i^t}}$$

The $log$ converts the products into summations:

$$\mathcal{L}(p_i|\mathcal{X})=\sum_{t=1}^N\sum_{i=1}^K{log \space p_i^{x_i^t}}$$

Based on the log power rule, the log-likelihood is:

$$\mathcal{L}(p_i|\mathcal{X})=\sum_{t=1}^N\sum_{i=1}^K{[{x_i^t} \space log \space p_i}]$$

Because $log \space p_i$ does not depend on the sample index $t$, the order of the two summations can be swapped and $log \space p_i$ pulled out of the inner sum:

$$\mathcal{L}(p_i|\mathcal{X})=\sum_{i=1}^K{log \space p_i\sum_{t=1}^N{x_i^t}}$$

The inner sum $\sum_{t=1}^N{x_i^t}$ is simply the number of samples whose outcome is $i$.

The next section uses MLE to estimate the parameter $p_i$.

Maximum Likelihood Estimation

Based on the log-likelihood of the multinomial distribution $\mathcal{L}(p_i|\mathcal{X})$, the parameters $p_i$ are estimated by maximizing it. The parameters are not free, however: they must satisfy the constraint $\sum_{i=1}^K{p_i}=1$. Setting the plain derivative to 0 would ignore this constraint, so a Lagrange multiplier $\lambda$ is introduced and the following function is maximized instead:

$$J(p_1, ..., p_K, \lambda)=\sum_{i=1}^K{log \space p_i\sum_{t=1}^N{x_i^t}}+\lambda(1-\sum_{i=1}^K{p_i})$$

With $log$ denoting the natural logarithm, the derivative of $log(p_i)$ is:

$$\frac{d \space log(p_i)}{dp_i}=\frac{1}{p_i}$$

Setting the derivative of $J$ with respect to $p_i$ to 0 gives:

$$\frac{\partial J}{\partial p_i}=\frac{\sum_{t=1}^N{x_i^t}}{p_i}-\lambda=0 \space \Rightarrow \space p_i=\frac{\sum_{t=1}^N{x_i^t}}{\lambda}$$

Summing both sides over all outcomes, and using $\sum_{i=1}^K{p_i}=1$ and $\sum_{i=1}^K{x_i^t}=1$, gives the value of $\lambda$:

$$1=\frac{\sum_{t=1}^N\sum_{i=1}^K{x_i^t}}{\lambda}=\frac{N}{\lambda} \space \Rightarrow \space \lambda=N$$

As a result, the MLE for the multinomial distribution is:

$$p_i=\frac{\sum_{t=1}^N{x_i^t}}{N}$$

Note that the multinomial distribution is just a generalization of the Bernoulli distribution. Their MLEs are similar, except that the multinomial distribution considers that there are multiple outcomes compared to just two in the case of the Bernoulli distribution.

The MLE is calculated for each outcome. It is the number of times an outcome $i$ appeared divided by the total number of samples. For example, if the face numbered 2 on a die appeared 10 times out of a total of 20 throws, then its MLE is $10 / 20 = 0.5$.

The multinomial experiment can be viewed as doing $K$ Bernoulli experiments. For each experiment, the probability of a single class $i$ is calculated.
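
The die example can be reproduced with a few lines of code. The sketch below assumes a hypothetical sequence of 20 throws in which face 2 appears 10 times, builds the indicator variables $x_i^t$ as a one-hot matrix, and computes $p_i$ as the column sums divided by $N$.

```python
import numpy as np

# Hypothetical outcomes of 20 throws of a six-sided die (faces 1..6).
throws = np.array([2, 5, 2, 1, 2, 6, 2, 3, 2, 4,
                   2, 2, 5, 2, 1, 6, 2, 3, 2, 4])
N, K = len(throws), 6

# One-hot indicator matrix: x[t, i] is 1 if throw t showed face i + 1.
x = np.zeros((N, K), dtype=int)
x[np.arange(N), throws - 1] = 1

# Each row has exactly one 1, matching the constraint sum_i x_i^t = 1.
assert (x.sum(axis=1) == 1).all()

# MLE: column sums (counts per face) divided by the number of throws.
p = x.sum(axis=0) / N
print(p)
```

The estimates sum to 1 by construction, since every throw contributes to exactly one face count.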

Estimated Distribution

Once the parameter $p_i$ of the multinomial distribution is estimated, it is plugged into the probability function of the multinomial distribution to return the estimated distribution for the sample $\mathcal{X}=\{x^t\}$.

$$p(x_1, x_2, x_3, ...x_K)=\prod_{i=1}^K{p_i^{x_i}}$$

Gaussian (Normal) Distribution

Both the Bernoulli and multinomial distributions have their inputs set to either 0 or 1. An input $x$ cannot be an arbitrary real number.

$$x^t \in \{0,1\}, \quad t=1,\dots,N, \quad N: \space Number \space of \space samples$$

In the Gaussian distribution, the input $x$ takes a value from $-\infty$ to $\infty$.

$$-\infty < x < \infty$$

Probability Function

The Gaussian (normal) distribution is defined based on two parameters: mean $\mu$ and variance $\sigma^2$.

$$p(x)=\mathcal{N}(\mu, \sigma^2)$$

Given these two parameters, here is the probability density function for the Gaussian distribution.

$$\mathcal{N}(\mu, \sigma^2)=p(x)=\frac{1}{\sqrt{2\pi}\sigma}\exp[-\frac{(x-\mu)^2}{2\sigma^2}]=\frac{1}{\sqrt{2\pi}\sigma}e^{[-\frac{(x-\mu)^2}{2\sigma^2}]}$$

$$-\infty < x < \infty$$

A random variable $X$ is said to follow the Gaussian (normal) distribution if its density function is calculated according to the previous function. In this case, its mean is $E[X]\equiv \mu$ and variance is $VAR(X) \equiv \sigma^2$. This is denoted as $\mathcal{N}(\mu, \sigma^2)$.

Likelihood

Let's start by revisiting the equation that calculates the likelihood estimation.

$$L(\theta|\mathcal{X})=\prod_{t=1}^N{p(x^t|\theta)} \space Where \space \mathcal{X}=\{x^t\}_{t=1}^N$$

For the Gaussian probability function, here is how the likelihood is calculated.

$$L(\mu,\sigma^2|\mathcal{X}) \equiv \prod_{t=1}^N{\mathcal{N}(\mu, \sigma^2)}=\prod_{t=1}^N{\frac{1}{\sqrt{2\pi}\sigma}\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}]}$$

Log-Likelihood

The log is introduced into the likelihood of the Gaussian distribution as follows:

$$\mathcal{L}(\mu,\sigma^2|\mathcal{X}) \equiv log \space L(\mu,\sigma^2|\mathcal{X}) \equiv log\prod_{t=1}^N{\mathcal{N}(\mu, \sigma^2)}=log\prod_{t=1}^N{\frac{1}{\sqrt{2\pi}\sigma}\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}]}$$

The log converts the product into summation as follows:

$$\mathcal{L}(\mu,\sigma^2|\mathcal{X}) \equiv log \space L(\mu,\sigma^2|\mathcal{X}) \equiv \sum_{t=1}^N{log \space \mathcal{N}(\mu, \sigma^2)}=\sum_{t=1}^N{log \space (\frac{1}{\sqrt{2\pi}\sigma}\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}])}$$

Using the log product rule, the log-likelihood is:

$$\mathcal{L}(\mu,\sigma^2|\mathcal{X}) \equiv log \space L(\mu,\sigma^2|\mathcal{X}) \equiv \sum_{t=1}^N{log \space \mathcal{N}(\mu, \sigma^2)}=\sum_{t=1}^N{(log \space (\frac{1}{\sqrt{2\pi}\sigma}) + log \space (\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}]))}$$

The summation operator can be distributed across the two terms:

$$\mathcal{L}(\mu,\sigma^2|\mathcal{X})=\sum_{t=1}^N{log \space \frac{1}{\sqrt{2\pi}\sigma} + \sum_{t=1}^Nlog \space \exp[-\frac{(x^t-\mu)^2}{2\sigma^2}]}$$

The equation has two separate terms. Let's now work on each term separately and then combine the results later.

For the first term, the log quotient rule can be applied. Here is the result:

$$\sum_{t=1}^Nlog \space (\frac{1}{\sqrt{2\pi}\sigma})=\sum_{t=1}^N(log(1)-log(\sqrt{2\pi}\sigma))$$

Given that $log(1)=0$, here is the result:

$$\sum_{t=1}^Nlog \space (\frac{1}{\sqrt{2\pi}\sigma})=-\sum_{t=1}^Nlog(\sqrt{2\pi}\sigma)$$

Based on the log product rule, the log of the first term is:

$$\sum_{t=1}^Nlog \space (\frac{1}{\sqrt{2\pi}\sigma})=-\sum_{t=1}^N[log \space \sqrt{2\pi}+log \space \sigma]$$

Note that neither term inside the summation depends on the summation variable $t$, and thus both are fixed terms. As a result, the summation just multiplies each term by $N$. Because $log \space \sqrt{2\pi}=\frac{1}{2}log(2\pi)$, the result is:

$$\sum_{t=1}^Nlog \space (\frac{1}{\sqrt{2\pi}\sigma})=-\frac{N}{2}log(2\pi)-N \space log \space \sigma$$

Let's now move onto the second term, which is given below.

$$\sum_{t=1}^Nlog \space (\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}])$$

The log power rule can be applied to simplify this term as follows:

$$\sum_{t=1}^Nlog \space (\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}])=\sum_{t=1}^Nlog \space e^{[-\frac{(x^t-\mu)^2}{2\sigma^2}]}=\sum_{t=1}^N[-\frac{(x^t-\mu)^2}{2\sigma^2}] \space log(e)$$

Given that the $log$ base is $e$, $log(e)=1$. Thus, the second term is now:

$$\sum_{t=1}^Nlog \space (\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}])=-\sum_{t=1}^N\frac{(x^t-\mu)^2}{2\sigma^2}$$

The denominator does not depend on the summation variable $t$, and thus the equation can be written as follows:

$$\sum_{t=1}^Nlog \space (\exp[-\frac{(x^t-\mu)^2}{2\sigma^2}])=-\frac{1}{2\sigma^2}\sum_{t=1}^N(x^t-\mu)^2$$

After simplifying the two terms, here is the log-likelihood of the Gaussian distribution:

$$\mathcal{L}(\mu,\sigma^2|\mathcal{X})=-\frac{N}{2}log(2\pi)-N \space log \space \sigma-\frac{1}{2\sigma^2}\sum_{t=1}^N(x^t-\mu)^2$$
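
A quick numerical check can confirm the simplification. The sketch below compares the closed-form log-likelihood with a direct sum of log densities. The constant is written as $-\frac{N}{2}log(2\pi)$, i.e. $N$ copies of $-log\sqrt{2\pi}$. The function names, the sample, and the parameter values are hypothetical and used only for illustration.

```python
import numpy as np

def gaussian_log_likelihood(x, mu, var):
    # Closed form: -(N/2) log(2*pi) - N log(sigma) - sum((x - mu)^2) / (2 var)
    n = len(x)
    sigma = np.sqrt(var)
    return (-0.5 * n * np.log(2 * np.pi)
            - n * np.log(sigma)
            - np.sum((x - mu) ** 2) / (2 * var))

def gaussian_log_likelihood_direct(x, mu, var):
    # Sum of the log of the density evaluated at each sample.
    pdf = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(np.log(pdf))

# Hypothetical sample and parameter values, for illustration only.
x = np.array([1.2, 0.7, 2.3, 1.9, 1.4])
closed = gaussian_log_likelihood(x, mu=1.5, var=0.25)
direct = gaussian_log_likelihood_direct(x, mu=1.5, var=0.25)
print(closed, direct)
```

The two values should match up to floating-point rounding.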

Maximum Likelihood Estimation

This section discusses how to find the MLE of the two parameters in the Gaussian distribution, which are $\mu$ and $\sigma^2$.

The MLE can be found by calculating the derivative of the log-likelihood with respect to each parameter. By setting this derivative to 0, the MLE can be calculated. The next subsection starts with the first parameter $\mu$.

MLE of Mean $\mu$

Starting with $\mu$, let's calculate the derivative of the log-likelihood:

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \space \mu}=\frac{d}{d \mu} [-\frac{N}{2}log(2\pi)-N \space log \space \sigma-\frac{1}{2\sigma^2}\sum_{t=1}^N(x^t-\mu)^2]=0$$

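The first two terms do not depend on $\mu$, and thus their derivative is 0. The constant factor $-\frac{1}{2\sigma^2}$ in front of the remaining term does not change where the derivative equals 0, so it is dropped in the equations that follow.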

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \mu}={\frac{d}{d \mu}\sum_{t=1}^N(x^t-\mu)^2}=0$$

The previous term could be written as follows:

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \mu}={\frac{d}{d \mu}\sum_{t=1}^N((x^t)^2-2x^t\mu+\mu^2)}=0$$

Because $(x^t)^2$ does not depend on $\mu$, its derivative is 0 and can be neglected.

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \mu}={\frac{d}{d \mu}\sum_{t=1}^N(-2x^t\mu+\mu^2)}=0$$

The summation can be distributed across the remaining two terms:

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \mu}=\frac{d}{d \mu}[-\sum_{t=1}^N2x^t\mu+\sum_{t=1}^N\mu^2]=0$$

The derivative of the log-likelihood becomes:

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \mu}=-\sum_{t=1}^N2x^t+2\sum_{t=1}^N\mu=0$$

The second term $2\sum_{t=1}^N\mu$ does not depend on $t$, and thus it is a fixed term which equals $2N\mu$. As a result, the derivative of the log-likelihood is as follows:

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \mu}=-\sum_{t=1}^N2x^t+2N\mu=0$$

By solving the previous equation, finally, the MLE of the mean is:

$$m=\frac{\sum_{t=1}^Nx^t}{N}$$

MLE of Variance $\sigma^2$

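The MLE of the variance follows the same steps. The first term of the log-likelihood does not depend on $\sigma$, so only the last two terms contribute to the derivative with respect to $\sigma$:

$$\frac{d \space \mathcal{L}(\mu,\sigma^2|\mathcal{X})}{d \sigma}=-\frac{N}{\sigma}+\frac{1}{\sigma^3}\sum_{t=1}^N(x^t-\mu)^2=0$$

Multiplying both sides by $\sigma^3$ and solving for $\sigma^2$, with $\mu$ replaced by its estimate $m$, the MLE of the variance is:

$$s^2=\frac{\sum_{t=1}^N(x^t-m)^2}{N}$$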
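
The two estimators can be tried on simulated data. The sketch below assumes a hypothetical sample of 10,000 draws from a Gaussian with mean 5 and variance 4, computes $m$ and $s^2$ from the formulas above, and compares them with NumPy's built-in functions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical sample drawn from a Gaussian with mean 5 and variance 4.
x = rng.normal(loc=5.0, scale=2.0, size=10_000)
N = len(x)

m = x.sum() / N                  # MLE of the mean
s2 = ((x - m) ** 2).sum() / N    # MLE of the variance (divides by N)

# NumPy's defaults compute the same quantities.
print(m, np.mean(x))
print(s2, np.var(x))             # np.var uses ddof=0, i.e. division by N

# The unbiased estimator divides by N - 1 instead, so it is slightly larger.
print(np.var(x, ddof=1))
```

Note that the MLE of the variance divides by $N$, which is what `np.var` does by default. Passing `ddof=1` gives the unbiased estimator instead, which is a different quantity.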

Conclusion

This tutorial worked through the math of the maximum likelihood estimation (MLE) method that estimates the parameters of a known distribution based on training data $x^t$. The three distributions discussed are Bernoulli, multinomial, and Gaussian.

The tutorial summarized the steps that the MLE uses to estimate parameters:

1. Claim the distribution of the training data.
2. Estimate the distribution's parameters using log-likelihood.
3. Plug the estimated parameters into the distribution's probability function.
4. Finally, estimate the distribution of the training data.
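
These steps can be sketched end to end for a simple classification problem. The example below is only an illustration under stated assumptions: the two-class, one-dimensional training data is simulated, each class is assumed to be Gaussian, the class priors are estimated as class frequencies, and the helper names `log_posterior_scores` and `predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 1-D training data for two classes.
x0 = rng.normal(loc=0.0, scale=1.0, size=300)   # class 0
x1 = rng.normal(loc=4.0, scale=1.5, size=700)   # class 1
classes = [x0, x1]
n_total = sum(len(c) for c in classes)

# Steps 1-2: assume each class is Gaussian and estimate its parameters by MLE.
means = np.array([c.mean() for c in classes])
variances = np.array([c.var() for c in classes])        # divides by N
priors = np.array([len(c) / n_total for c in classes])  # class frequencies

def log_posterior_scores(x):
    # Steps 3-4: plug the estimates into log p(x|C_i) + log P(C_i).
    log_lik = (-0.5 * np.log(2 * np.pi * variances)
               - (x - means) ** 2 / (2 * variances))
    return log_lik + np.log(priors)

def predict(x):
    # Pick the class with the largest (unnormalized) log posterior.
    return int(np.argmax(log_posterior_scores(x)))

print(predict(-0.5), predict(5.0))
```

Working with log scores avoids multiplying small probabilities, and the evidence term is omitted because it is the same for both classes.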

Once the log-likelihood is calculated, its derivative is calculated with respect to each parameter in the distribution. The estimated parameter is what maximizes the log-likelihood, which is found by setting the log-likelihood derivative to 0.

This tutorial discussed how MLE works for classification problems. In a later tutorial, the MLE will be applied to estimate the parameters for regression problems. Stay tuned.