Bayesian Decision Theory is a statistical approach to pattern classification. It uses probability to make classifications, and measures the risk (i.e. cost) of assigning an input to a given class.

In this article we'll start by taking a look at prior probability, and why it is not sufficient on its own for making predictions. Bayesian Decision Theory makes better predictions by using the prior probability, likelihood probability, and evidence to calculate the posterior probability. We'll discuss all of these concepts in detail. Finally, we'll map these concepts from Bayesian Decision Theory to their context in machine learning.

The outline of this article is as follows:

- Prior Probability
- Likelihood Probability
- Prior and Likelihood Probabilities
- Bayesian Decision Theory
- Sum of All Prior Probabilities Must be 1
- Sum of All Posterior Probabilities Must be 1
- Evidence
- Machine Learning & Bayesian Decision Theory
- Conclusion

After completing this article, stay tuned for Part 2, in which we'll apply Bayesian Decision Theory to both binary and multi-class classification problems. To assess the performance of the classifier, both the loss and the risk of making a prediction are discussed. If the classifier makes a weak prediction, a new class named "reject" is used to hold the samples whose predictions are highly uncertain. Part 2 discusses when and why a sample is assigned to the reject class.

Let's get started.

**Prior Probability**

To discuss probability, we should start with how to calculate the probability that an outcome occurs. The probability is calculated from the past occurrences of the outcomes (i.e. events). This is called the **prior probability** ("prior" meaning "before"). In other words, the prior probability is the probability estimated from past experience, before any new information is observed.

Assume that someone asks who will be the winner of a future match between two teams. Let $A$ and $B$ refer to the first or second team winning, respectively.

In the last 10 cup matches, $A$ occurred 4 times and $B$ occurred the remaining 6 times. So, what is the probability that $A$ occurs in the next match? Based on the experience (i.e., the events that occurred in the past), the prior probability that the first team ($A$) wins in the next match is:

$$ P(A)=\frac{4}{10}=0.4 $$
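
The same calculation can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the dictionary name `wins` is an assumption made for the example.

```python
# Past results of the last 10 matches, taken from the example above.
wins = {"A": 4, "B": 6}

total_matches = sum(wins.values())

# Prior probability of each outcome = its relative frequency in the past.
priors = {team: count / total_matches for team, count in wins.items()}

print(priors)  # {'A': 0.4, 'B': 0.6}
```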

But the past events may not always hold, because the situation or context may change. For example, team $A$ could have won only 4 matches because there were some injured players. When the next match comes, all of these injured players will have recovered. Based on the current situation, the first team may win the next match with a higher probability than the one calculated based on past events only.

The prior probability measures the probability of the next outcome without taking into consideration a current observation (i.e. the current situation). It's like predicting that a patient has a given disease based only on their past doctor's visits.

In other words, because the prior probability is calculated solely from past events (without present information), it can degrade the quality of the prediction. The past occurrences of the two outcomes $A$ and $B$ may have happened while some conditions were satisfied, but at the current moment, these conditions may not hold.

This problem is solved using the likelihood.

**Likelihood Probability**

The likelihood helps to answer the question: if a given outcome occurs, how probable are the conditions that we observe? It is denoted as follows:

$$ P(X|C_i) $$

Where $X$ refers to the conditions, and $C_i$ refers to the outcome. Because there may be multiple outcomes, the variable $C$ is given the subscript $i$.

The likelihood is read as follows:

Assuming that the outcome is $C_i$, what is the probability of observing the conditions $X$?

According to our example of predicting the winning team, the probability that the outcome $A$ occurs does not only depend on past events, but also on current conditions. The likelihood relates the occurrence of an outcome to the current conditions at the time of making a prediction.

Assume the conditions change so that the first team has no injured players, while the second team has many injured players. As a result, it is more likely that $A$ occurs than $B$. Without considering the current situation and using only the prior information, the outcome would be $B$, which is not accurate given the current situation.

For the example of diagnosing a patient, this would understandably give a better prediction, as the diagnosis takes into account their current symptoms rather than only their past condition.

A drawback of using only the likelihood is that it neglects experience (prior probability), which *is* useful in many cases. So, a better way to make a prediction is to combine them both.

**Prior and Likelihood Probabilities**

Using only the prior probability, the prediction is made based on past experience. Using only the likelihood, the prediction depends only on the current situation. When either of these two probabilities is used alone, the result is not accurate enough. It is better to use both the experience and the current situation together in predicting the next outcome.

The new probability would be calculated as follows:

$$ {P(C_i)}{P(X|C_i)} $$

For the example of diagnosing a patient, the outcome would then be selected based on their medical history as well as their current symptoms.
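
To make the combination concrete, here is a minimal sketch for the two-team example. The priors come from the last 10 matches; the likelihood values are hypothetical numbers chosen only for illustration (they are not given anywhere in this article).

```python
# Priors from the last 10 matches (from the example above).
priors = {"A": 0.4, "B": 0.6}

# Hypothetical likelihoods P(X | C_i): how probable the current conditions X
# (team A fully fit, team B with many injuries) are under each outcome.
# These numbers are assumptions for illustration only.
likelihoods = {"A": 0.8, "B": 0.3}

# Combined score for each outcome: P(C_i) * P(X | C_i).
scores = {c: priors[c] * likelihoods[c] for c in priors}

prior_only_choice = max(priors, key=priors.get)
combined_choice = max(scores, key=scores.get)

print(prior_only_choice)  # B
print(combined_choice)    # A
```

With these assumed numbers, the prior alone favors $B$, while the product of the prior and the likelihood favors $A$.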

Using both the prior and likelihood probabilities together is an important step towards understanding Bayesian Decision Theory.

**Bayesian Decision Theory**

Bayesian Decision Theory (i.e. the Bayesian Decision Rule) predicts the outcome not only based on previous observations, but also by taking into account the current situation. **The rule describes the most reasonable action to take based on an observation**.

The formula at the core of Bayesian Decision Theory (Bayes' rule) is given below:

$$ P(C_i|X)=\frac{P(C_i)P(X|C_i)}{P(X)} $$

The elements of the theory are:

- $P(C_i)$: Prior probability. This accounts for how frequently the class $C_i$ occurred independently of any conditions (i.e. regardless of the input $X$).
- $P(X|C_i)$: Likelihood. Among the cases where the outcome was $C_i$, this is how frequently the conditions $X$ were observed.
- $P(X)$: Evidence. How frequently the conditions $X$ occurred, regardless of the outcome.
- $P(C_i|X)$: Posterior. The probability that the outcome $C_i$ occurs given some conditions $X$.

Bayesian Decision Theory gives balanced predictions, as it takes into consideration the following:

- $P(X)$: How frequently did the conditions $X$ occur?
- $P(C_i)$: How frequently did the outcome $C_i$ occur?
- $P(X|C_i)$: When the outcome $C_i$ occurred, how frequently were the conditions $X$ present?

If any of the previous factors was not used, the prediction would be hindered. Let's explain the effect of excluding any of these factors, and mention a case where using each factor might help.

- $P(C_i)$: Assume that the prior probability $P(C_i)$ is not used; then we cannot know whether the outcome $C_i$ occurs frequently or not. If the prior probability is high, then the outcome $C_i$ frequently occurs, and it is an indication that it may occur again.
- $P(X|C_i)$: If the likelihood probability $P(X|C_i)$ is not used, then there is no information to associate the current input $X$ with the outcome $C_i$. For example, the outcome $C_i$ may have occurred frequently, but it rarely occurs with the current input $X$.
- $P(X)$: If the evidence probability $P(X)$ is excluded, then there is no information to reflect the frequency of $X$ occurring. The products $P(C_i)P(X|C_i)$ are then no longer normalized, so they cannot be read as probabilities that sum to 1 across the outcomes.

When there is information about the frequency of the occurrence of $C_i$ alone, $X$ alone, and both $C_i$ and $X$ together, then a better prediction can be made.

There are some things to note about the theory/rule:

- The sum of all prior probabilities must be 1.
- The sum of all posterior probabilities must be 1.
- The evidence is the sum of products of the prior and likelihood probabilities of all outcomes.

The next three sections discuss each of these points.

**Sum of All Prior Probabilities Must be 1**

Assuming there are two possible outcomes, then the following must hold:

$$ P(C_1)+P(C_2)=1 $$

The reason is that for a given input, its outcome must be one of these two. There are no uncovered outcomes.

If there are $K$ outcomes, then the following must hold:

$$ P(C_1)+P(C_2)+P(C_3)+...+P(C_K)=1 $$

Here is how it is written using the summation operator, where $i$ is the outcome index and $K$ is the total number of outcomes:

$$ \sum_{i=1}^{K}P(C_i)=1 $$

Note that the following condition must hold for all prior probabilities:

$$ P(C_i) \geq 0, \quad \forall i $$
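
These two conditions are easy to check in code. The helper below is a small sketch; the function name `is_valid_prior` and the tolerance value are assumptions made for this example.

```python
import math

def is_valid_prior(priors, tol=1e-9):
    """Return True if all values are non-negative and sum to 1 (within tol)."""
    values = list(priors.values())
    non_negative = all(p >= 0 for p in values)
    # Compare with a tolerance, since floating-point sums are rarely exact.
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

print(is_valid_prior({"A": 0.4, "B": 0.6}))  # True
print(is_valid_prior({"A": 0.4, "B": 0.5}))  # False
```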

**Sum of All Posterior Probabilities Must be 1**

Similar to the prior probabilities, the posterior probabilities must sum to 1. For two outcomes:

$$ P(C_1|X)+P(C_2|X)=1 $$

If the total number of outcomes is $K$, the sum becomes:

$$ P(C_1|X)+P(C_2|X)+P(C_3|X)+...+P(C_K|X)=1 $$

Here is how to sum all the posterior probabilities for $K$ outcomes using the summation operator:

$$ \sum_{i=1}^{K}P(C_i|X)=1 $$
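
One way to see why this holds is to substitute the formula for the posterior and use the fact, covered in the Evidence section, that $P(X)$ equals the sum of $P(X|C_i)P(C_i)$ over all outcomes:

$$ \sum_{i=1}^{K}P(C_i|X)=\sum_{i=1}^{K}\frac{P(C_i)P(X|C_i)}{P(X)}=\frac{1}{P(X)}\sum_{i=1}^{K}P(X|C_i)P(C_i)=\frac{P(X)}{P(X)}=1 $$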

**Evidence**

Here is how the evidence is calculated when only two outcomes occur:

$$ P(X)=P(X|C_1)P(C_1)+P(X|C_2)P(C_2) $$

For $K$ outcomes, here is how the evidence is calculated:

$$ P(X)=P(X|C_1)P(C_1)+P(X|C_2)P(C_2)+P(X|C_3)P(C_3)+...+P(X|C_K)P(C_K) $$

Here is how it is written using the summation operator:

$$ P(X)=\sum_{i=1}^{K}P(X|C_i)P(C_i) $$

Using the last equation for the evidence, the posterior in Bayesian Decision Theory can be written as follows:

$$ P(C_i|X)=\frac{P(C_i)P(X|C_i)}{\sum_{k=1}^{K}P(X|C_k)P(C_k)} $$
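
The formula above can be sketched as a short Python function. The function name `posterior` and the likelihood values are assumptions for illustration; only the priors (0.4 and 0.6) come from the two-team example.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule to every class.

    priors:      dict mapping class -> P(C_i)
    likelihoods: dict mapping class -> P(X | C_i) for one fixed observation X
    """
    # Evidence P(X): sum over all classes of P(X | C_k) * P(C_k).
    evidence = sum(likelihoods[c] * priors[c] for c in priors)
    if evidence == 0:
        raise ValueError("P(X) is zero; the posterior is undefined.")
    return {c: priors[c] * likelihoods[c] / evidence for c in priors}

priors = {"A": 0.4, "B": 0.6}
likelihoods = {"A": 0.8, "B": 0.3}  # hypothetical values

post = posterior(priors, likelihoods)
for c, p in post.items():
    print(f"P({c}|X) = {p:.2f}")
# P(A|X) = 0.64
# P(B|X) = 0.36
```

Because every posterior is divided by the same evidence, the results sum to 1 and can be compared directly as probabilities.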

**Machine Learning & Bayesian Decision Theory**

This section matches the concepts in machine learning to Bayesian Decision Theory.

First, the word **outcome** should be replaced by **class**. Rather than saying the outcome is $C_i$, it is more machine learning-friendly to say the class is $C_i$.

Here is a list that relates the factors in Bayesian Decision Theory to machine learning concepts:

- $X$ is the feature vector.
- $P(X)$ reflects how frequently feature vectors like $X$ appear in the data used to train the model.
- $C_i$ is the class label.
- $P(C_i)$ is the fraction of the training samples that belong to the class $C_i$. It is independent of the feature vector $X$.
- $P(X|C_i)$ reflects how frequently feature vectors like $X$ appear among the training samples of the class $C_i$. This relates the class $C_i$ to the current input $X$.

When the following conditions apply, the model has the information it needs to assign the feature vector $X$ to the class $C_i$:

- The model is trained on feature vectors that are close to the current input vector $X$. This increases $P(X)$, so the input is not unfamiliar to the model.
- The model is trained on many samples (i.e. feature vectors) that belong to the class $C_i$. This increases $P(C_i)$.
- Many of the training samples of the class $C_i$ are close to $X$. This increases $P(X|C_i)$.

When a classification model is trained, it learns how frequently a class $C_i$ occurs, and this information is represented as the prior probability $P(C_i)$. Without the prior probability $P(C_i)$, the classification model loses some of its learned knowledge.

Assuming that the prior probability $P(C_i)$ is the only probability to be used, the classification model classifies the input $X$ based on the past observations without even seeing the new input $X$. In other words, without even feeding the sample (feature vector) to the model, the model makes a decision and assigns it to a class.

The training data helps the classification model to map each input $X$ to its class label $C_i$. Such learned information is represented as the likelihood probability $P(X|C_i)$. Without the likelihood probability $P(X|C_i)$, the classification model cannot know if the input sample $X$ is related to the class $C_i$.
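
As a rough sketch of how these quantities can be estimated from data, the example below counts occurrences in a tiny, made-up training set with a single categorical feature. The dataset, the feature values (`"fit"`, `"injured"`), and the function name `classify` are all assumptions for illustration, not part of any particular library.

```python
from collections import Counter

# Made-up training set: (feature value X, class label C_i).
training_data = [
    ("fit", "A"), ("fit", "A"), ("fit", "A"), ("injured", "A"),
    ("fit", "B"), ("fit", "B"), ("injured", "B"),
    ("injured", "B"), ("injured", "B"), ("injured", "B"),
]

n = len(training_data)
class_counts = Counter(label for _, label in training_data)
pair_counts = Counter(training_data)  # counts of each (X, C_i) pair

def classify(x):
    """Return (predicted class, posteriors) for the feature value x."""
    # Prior: fraction of training samples in each class.
    priors = {c: class_counts[c] / n for c in class_counts}
    # Likelihood: fraction of class-c samples whose feature equals x.
    likelihoods = {c: pair_counts[(x, c)] / class_counts[c] for c in class_counts}
    # Evidence: sum of prior * likelihood over all classes.
    evidence = sum(priors[c] * likelihoods[c] for c in class_counts)
    if evidence == 0:
        raise ValueError(f"Feature value {x!r} never appeared in training.")
    posteriors = {c: priors[c] * likelihoods[c] / evidence for c in class_counts}
    return max(posteriors, key=posteriors.get), posteriors

label, posteriors = classify("fit")
print(label)  # A
```

In this toy data the prior alone favors class `B` (6 of 10 samples), but for the input `"fit"` the likelihood shifts the decision to class `A`.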

**Conclusion**

This article introduced Bayesian Decision Theory in the context of machine learning. It described all the elements of the theory, starting with the prior probability $P(C)$, then the likelihood probability $P(X|C)$, the evidence $P(X)$, and finally the posterior probability $P(C|X)$.

We then discussed how these concepts build to Bayesian Decision Theory, and how they work in the context of machine learning.

In the next article we'll discuss how to apply Bayesian Decision Theory to binary and multi-class classification problems, see how the loss and the risk are calculated, and finally, cover the idea of the "reject" class.