Introduction
We have seen how probability plays a key role in machine learning. Data is generated by a process that is not completely known, and since we lack complete knowledge of that process, probability theory is used to model its uncertainty. In this chapter, we discuss how probability theory can be used to represent uncertainty and to make predictions.
Probability Theory
A random process by which we observe something uncertain is called a random experiment: any situation in which the process can lead to more than one possible outcome. Consider the random experiment of tossing a coin, which results in one of two mutually exclusive outcomes, heads or tails. The outcome of this process cannot be predicted with certainty; we can only say how likely each outcome is. The chance of a particular event happening is called its probability. Another important concept associated with probability is the random variable: a numerical description of the outcome of a random experiment.
Consider the process of tossing a coin with two mutually exclusive outcomes, heads or tails. We define a random variable $X$ that takes one of two values: let $X = 1$ denote the outcome heads and $X = 0$ denote the outcome tails. The probability of heads is denoted by $P(X = 1)$ or simply $p$, and its value satisfies $0 \le p \le 1$ and $P(X = 0) = 1 - p$. A higher probability value indicates a higher chance of an event occurring. Thus, using the probability value, we can predict the outcome of the next toss: for example, the prediction will be heads if $p > 0.5$ and tails otherwise.
If $p$ is not known, we can estimate it by repeatedly tossing the coin and observing the outcomes. Given a sample of $N$ past tosses, the estimated probability of heads is
$$\hat{p} = \frac{N_H}{N}$$
where $N_H$ is the number of tosses with outcome heads. For example, if the number of tosses is $N = 10$ and, say, $N_H = 6$ of them come up heads, the estimated probability is $\hat{p} = 6/10 = 0.6$.
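As a minimal sketch, the estimate above can be computed directly from a list of observed outcomes; the sample below is made up for illustration, with 1 denoting heads and 0 denoting tails.

```python
# Estimate the probability of heads from a sample of observed tosses.
# Illustrative sample: 1 denotes heads, 0 denotes tails.
sample = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]  # N = 10 tosses, 6 heads

p_hat = sum(sample) / len(sample)  # N_H / N
print(p_hat)  # 0.6

# Predict the next toss: heads if p_hat > 0.5, tails otherwise.
print("heads" if p_hat > 0.5 else "tails")
```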
Consider two independent events $A$ and $B$, such as getting heads on each of two separate coin tosses; independence means that neither event affects the other. The probability of both events happening is
$$P(A \cap B) = P(A)\,P(B)$$
Some events depend on other events. Consider drawing two balls from a jar containing 5 red and 5 blue balls. Initially, red and blue balls have an equal chance of being drawn. If the first ball drawn was red, the second ball is less likely to be red because only 4 red balls are left in the jar. The probability of an outcome for the second draw given the first is called a conditional probability. Let $A$ and $B$ be the events associated with the second and the first draw respectively. The conditional probability of event $A$ given that event $B$ has occurred is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
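The conditional probability for the jar example can be checked by enumerating all ordered pairs of draws without replacement; the short sketch below is one way to do this.

```python
from itertools import permutations

# Jar with 5 red and 5 blue balls; draw two balls without replacement.
balls = ["red"] * 5 + ["blue"] * 5
draws = list(permutations(range(len(balls)), 2))  # all ordered pairs (first, second)

first_red = [(i, j) for i, j in draws if balls[i] == "red"]
both_red = [(i, j) for i, j in first_red if balls[j] == "red"]

# P(second red | first red) = P(first red and second red) / P(first red)
print(len(both_red) / len(first_red))  # 4/9 = 0.444...
```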
The conditional probability can be rewritten as
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where $P(B \mid A)$ is the conditional probability of event $B$ given $A$, also known as the likelihood, and $P(A \mid B)$ is the conditional probability of event $A$ given $B$, also known as the posterior probability. $P(A)$ and $P(B)$ are the probabilities of observing $A$ and $B$ respectively, which can be interpreted as the prior probability and the evidence. Using the law of total probability, $P(B) = P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})$. The above definition is known as Bayes' rule [1].
The following example illustrates the use of Bayes' rule in medical testing. Let $A$ be the event that a woman is pregnant and $B$ the event that her pregnancy test is positive. Suppose that for every 100 women who take the test, 25 are indeed pregnant, so the prior probability is $P(A) = 0.25$.
The sensitivity of the pregnancy test is 98%, which means that if a woman is pregnant, the test will be positive with probability $P(B \mid A) = 0.98$.
Additionally, the false positive rate of the test kit is 1.5%, i.e. $P(B \mid \bar{A}) = 0.015$.
Suppose a woman takes a pregnancy test and the outcome is positive. What is the probability that she is pregnant? Applying Bayes' rule,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \bar{A})\,P(\bar{A})} = \frac{0.98 \times 0.25}{0.98 \times 0.25 + 0.015 \times 0.75} \approx 0.956$$
Since the posterior probability is approximately 0.956, the woman is very likely pregnant.
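The same calculation can be written as a short script; this is a sketch of the arithmetic above, with variable names of my own choosing.

```python
# Bayes' rule for the pregnancy-test example.
prior = 0.25            # P(pregnant)
sensitivity = 0.98      # P(positive | pregnant)
false_positive = 0.015  # P(positive | not pregnant)

# Evidence via the law of total probability: P(positive).
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(pregnant | positive).
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # 0.956
```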
Naive Bayes
We have seen how classification can be performed when uncertainty is modeled using probabilities, i.e. the likelihood, the prior, and the evidence. In many cases, however, this information is not available to us. In its absence, we can estimate the probabilities from past observations (training data). Then, using the estimated probabilities, we can classify a feature vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ with class label $y$ by applying Bayes' rule:
$$P(y \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y)\,P(y)}{P(\mathbf{x})}$$
The key to using this approach is to estimate the class-conditional probabilities of the features, i.e. the likelihood. With $d$ features, the likelihood is
$$P(\mathbf{x} \mid y) = P(x_1, x_2, \ldots, x_d \mid y)$$
However, calculating this joint likelihood is cumbersome, especially when the number of features is large. Naive Bayes is a simple classifier based on applying Bayes' rule with a strong independence assumption between the features [2]. Under this assumption, the calculation of the likelihood simplifies to
$$P(\mathbf{x} \mid y) = \prod_{j=1}^{d} P(x_j \mid y)$$
Thus, the posterior probability becomes
$$P(y \mid \mathbf{x}) = \frac{P(y)\prod_{j=1}^{d} P(x_j \mid y)}{P(\mathbf{x})}$$
Since $P(\mathbf{x})$ is a constant for a given input and the prediction is made by comparing the posterior probabilities and choosing the class with the largest one, the denominator can be omitted and the classification rule becomes
$$\hat{y} = \operatorname*{arg\,max}_{y}\; P(y)\prod_{j=1}^{d} P(x_j \mid y)$$
The prior probability $P(y)$ can be estimated by assuming equiprobable classes or by computing the fraction of training instances that belong to class $y$. The conditional probability, i.e. the feature's distribution, is estimated by assuming a distribution for each feature [3]. The assumed distribution depends on the type of the feature, as described in the following sections.
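A minimal sketch of the classification rule above, assuming the prior and the per-feature conditional probabilities have already been estimated (the function and dictionary layout are my own convention, not a standard API):

```python
def predict(x, priors, cond_prob):
    """Return the class maximizing P(y) * prod_j P(x_j | y).

    priors:    dict mapping class c -> P(y = c)
    cond_prob: dict mapping (feature index j, value, class c) -> P(x_j = value | y = c)
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(x):
            score *= cond_prob[(j, value, c)]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```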
Categorical Naive Bayes
First, we discuss the case where the features are categorical. Assume there is a set of $d$ features and each feature has its own categorical distribution. For each feature $j$ in the training set, a categorical distribution is estimated conditioned on the class $y$. The conditional probability of category $l$ of feature $j$ given class $c$ is estimated as
$$P(x_j = l \mid y = c) = \frac{N_{jlc}}{N_c}$$
where $N_{jlc}$ is the number of instances of class $c$ in which feature $j$ takes the value $l$, and $N_c$ is the number of instances with class $c$.
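A minimal sketch of this estimate from a list of training instances (the helper name and toy data are illustrative, and no smoothing is applied):

```python
def estimate_categorical(X, y, j, label, c):
    """Estimate P(x_j = label | y = c) by counting (no smoothing)."""
    n_c = sum(1 for target in y if target == c)              # N_c
    n_jlc = sum(1 for row, target in zip(X, y)
                if target == c and row[j] == label)          # N_jlc
    return n_jlc / n_c

# Toy usage with two categorical features.
X = [["Sunny", "Hot"], ["Rain", "Cool"], ["Sunny", "Mild"]]
y = ["No", "Yes", "No"]
print(estimate_categorical(X, y, 0, "Sunny", "No"))  # 2/2 = 1.0
```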
Bernoulli Naive Bayes
The Bernoulli distribution is a discrete distribution with only two possible values of the random variable. Bernoulli naive Bayes is typically used for document or text classification, in which the features are binary word occurrences $\mathbf{x} = (x_1, \ldots, x_d)$, where $x_j$ is a Boolean feature expressing the presence (occurrence) or absence of the $j$-th vocabulary word in the document. If the probability that the $j$-th word occurs in a document of class $c$ is $p_{jc}$, then the probability of its absence is $1 - p_{jc}$. Thus, the likelihood of a document $\mathbf{x}$ given class $c$ is
$$P(\mathbf{x} \mid y = c) = \prod_{j=1}^{d} p_{jc}^{x_j}\,(1 - p_{jc})^{1 - x_j}$$
which is the product of the conditional probabilities of each word's occurrence or absence given class $c$. The parameter $p_{jc}$ can be estimated by counting the documents of class $c$ in which the $j$-th word occurs,
$$\hat{p}_{jc} = \frac{N_{jc}}{N_c}$$
where $N_{jc}$ is the number of documents of class $c$ that contain the $j$-th word and $N_c$ is the number of documents of class $c$.
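A minimal sketch of the Bernoulli likelihood of a single document, assuming the occurrence probabilities $p_{jc}$ for class $c$ have already been estimated (the values below are illustrative):

```python
def bernoulli_likelihood(x, p_c):
    """P(x | y = c) for a binary word-occurrence vector x, given
    estimated occurrence probabilities p_c = [p_1c, ..., p_dc]."""
    likelihood = 1.0
    for x_j, p_jc in zip(x, p_c):
        likelihood *= p_jc if x_j == 1 else (1.0 - p_jc)
    return likelihood

x = [1, 0, 1]          # presence/absence of each vocabulary word in the document
p_c = [0.6, 0.2, 0.5]  # illustrative estimates of p_jc for class c
print(bernoulli_likelihood(x, p_c))  # 0.6 * 0.8 * 0.5 = 0.24
```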
Multinomial Naive Bayes
The multinomial distribution is a generalization of the binomial distribution: a discrete distribution over a random variable with more than two possible outcomes. Using the multinomial distribution for document classification, rather than word-occurrence vectors we can use word frequencies as features, so that $x_j$ is the number of times the $j$-th word appears in the document. The likelihood of a document $\mathbf{x}$ given class $c$ is
$$P(\mathbf{x} \mid y = c) \propto \prod_{j=1}^{d} p_{jc}^{x_j}$$
(up to a multinomial coefficient that does not depend on the class), and $p_{jc}$ can be estimated as
$$\hat{p}_{jc} = \frac{N_{jc}}{\sum_{k=1}^{d} N_{kc}}$$
where $N_{jc}$ is the number of times the $j$-th word appears in documents of class $c$.
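A minimal sketch of this estimate from word-count vectors (the toy corpus is illustrative; Laplace smoothing, which is commonly added in practice, is omitted here):

```python
def estimate_multinomial(X, y, c):
    """Estimate p_jc for every word j from count vectors X with labels y."""
    rows_c = [row for row, target in zip(X, y) if target == c]
    word_counts = [sum(col) for col in zip(*rows_c)]  # N_jc for each word j
    total = sum(word_counts)                          # sum over k of N_kc
    return [n_jc / total for n_jc in word_counts]

# Toy corpus: each row is a word-count vector over a 3-word vocabulary.
X = [[2, 1, 0], [1, 0, 1], [0, 3, 1]]
y = ["spam", "spam", "ham"]
print(estimate_multinomial(X, y, "spam"))  # [0.6, 0.2, 0.2]
```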
Gaussian Naive Bayes
In the case of real-valued features, we typically assume that the features are distributed according to a Gaussian distribution. The likelihood of feature $x_j$ given class $c$ is
$$P(x_j \mid y = c) = \frac{1}{\sqrt{2\pi\sigma_{jc}^{2}}}\exp\!\left(-\frac{(x_j - \mu_{jc})^{2}}{2\sigma_{jc}^{2}}\right)$$
where $\mu_{jc}$ is the mean and $\sigma_{jc}^{2}$ is the variance of feature $x_j$ among the instances of class $c$.
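A minimal sketch of this density computed for a single feature value, with illustrative class statistics:

```python
import math

def gaussian_likelihood(x_j, mean, var):
    """Gaussian density of feature value x_j given the class mean and variance."""
    return math.exp(-(x_j - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative class statistics for a height feature (metres).
print(gaussian_likelihood(1.73, 1.70, 0.002))
```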
Log Trick
Multiplying many conditional probabilities may cause numerical instability known as numerical underflow, because the values $P(x_j \mid y)$ are often very small, especially when $\mathbf{x}$ is a high-dimensional vector. To avoid underflow, we take the logarithm of the probabilities when applying Bayes' rule; multiplication becomes addition in log space, and the classification rule becomes
$$\hat{y} = \operatorname*{arg\,max}_{y}\;\left[\log P(y) + \sum_{j=1}^{d} \log P(x_j \mid y)\right]$$
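The earlier decision-rule sketch can be rewritten in log space as follows; the dictionary layout mirrors the previous sketch and is my own convention.

```python
import math

def predict_log(x, priors, cond_prob):
    """Return argmax_y [ log P(y) + sum_j log P(x_j | y) ]."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for j, value in enumerate(x):
            score += math.log(cond_prob[(j, value, c)])  # multiplication -> addition
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```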
Numerical Example 1
Given the following dataset.
| Outlook | Temperature | Humidity | Play Tennis |
|---------|-------------|----------|-------------|
| Sunny   | Hot         | High     | No          |
| Cloudy  | Hot         | High     | Yes         |
| Rain    | Mild        | High     | Yes         |
| Rain    | Cool        | Normal   | No          |
| Cloudy  | Mild        | High     | Yes         |
| Sunny   | Cool        | Normal   | Yes         |
| Rain    | Mild        | High     | No          |
| Sunny   | Mild        | High     | No          |
Prior probability
$$P(\text{Yes}) = \frac{4}{8} = 0.5, \qquad P(\text{No}) = \frac{4}{8} = 0.5$$
Conditional probability for Outlook
$$P(\text{Sunny} \mid \text{Yes}) = \frac{1}{4}, \qquad P(\text{Sunny} \mid \text{No}) = \frac{2}{4}$$
$$P(\text{Cloudy} \mid \text{Yes}) = \frac{2}{4}, \qquad P(\text{Cloudy} \mid \text{No}) = \frac{0}{4}$$
$$P(\text{Rain} \mid \text{Yes}) = \frac{1}{4}, \qquad P(\text{Rain} \mid \text{No}) = \frac{2}{4}$$
Conditional probability for Temperature
$$P(\text{Hot} \mid \text{Yes}) = \frac{1}{4}, \qquad P(\text{Hot} \mid \text{No}) = \frac{1}{4}$$
$$P(\text{Mild} \mid \text{Yes}) = \frac{2}{4}, \qquad P(\text{Mild} \mid \text{No}) = \frac{2}{4}$$
$$P(\text{Cool} \mid \text{Yes}) = \frac{1}{4}, \qquad P(\text{Cool} \mid \text{No}) = \frac{1}{4}$$
Conditional probability for Humidity
$$P(\text{High} \mid \text{Yes}) = \frac{3}{4}, \qquad P(\text{High} \mid \text{No}) = \frac{3}{4}$$
$$P(\text{Normal} \mid \text{Yes}) = \frac{1}{4}, \qquad P(\text{Normal} \mid \text{No}) = \frac{1}{4}$$
Given the new instance $\mathbf{x} = (\text{Outlook} = \text{Sunny}, \text{Temperature} = \text{Cool}, \text{Humidity} = \text{High})$, predict the class of $\mathbf{x}$.
Posterior probability (up to the constant evidence)
$$P(\text{Yes} \mid \mathbf{x}) \propto 0.5 \times \frac{1}{4} \times \frac{1}{4} \times \frac{3}{4} \approx 0.023$$
$$P(\text{No} \mid \mathbf{x}) \propto 0.5 \times \frac{2}{4} \times \frac{1}{4} \times \frac{3}{4} \approx 0.047$$
Since $0.047 > 0.023$, the prediction is No.
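The example can be checked with a short script that estimates all the probabilities from the table and applies the classification rule; this sketch uses the same query instance as above.

```python
data = [
    ("Sunny",  "Hot",  "High",   "No"),
    ("Cloudy", "Hot",  "High",   "Yes"),
    ("Rain",   "Mild", "High",   "Yes"),
    ("Rain",   "Cool", "Normal", "No"),
    ("Cloudy", "Mild", "High",   "Yes"),
    ("Sunny",  "Cool", "Normal", "Yes"),
    ("Rain",   "Mild", "High",   "No"),
    ("Sunny",  "Mild", "High",   "No"),
]
query = ("Sunny", "Cool", "High")

for c in ("Yes", "No"):
    rows_c = [row for row in data if row[-1] == c]
    score = len(rows_c) / len(data)  # prior P(c)
    for j, value in enumerate(query):
        score *= sum(1 for row in rows_c if row[j] == value) / len(rows_c)
    print(c, score)  # Yes 0.0234375, No 0.046875 -> predict No
```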
Numerical Example 2
Given the following dataset, where Gender 0 denotes female and 1 denotes male.
| Height (m) | Weight (kg) | Gender |
|------------|-------------|--------|
| 1.72       | 68          | 1      |
| 1.83       | 79          | 1      |
| 1.71       | 65          | 0      |
| 1.95       | 83          | 1      |
| 1.68       | 60          | 0      |
| 1.81       | 76          | 1      |
| 1.75       | 69          | 0      |
| 1.67       | 57          | 0      |
Prior probability
$$P(y = 0) = \frac{4}{8} = 0.5, \qquad P(y = 1) = \frac{4}{8} = 0.5$$
Mean and variance of each feature for class $y = 0$ (female), using the sample variance $\sigma^2 = \frac{1}{n-1}\sum_i (x_i - \mu)^2$:
$$\mu_{\text{height},0} = 1.7025, \qquad \sigma^2_{\text{height},0} \approx 0.0013$$
$$\mu_{\text{weight},0} = 62.75, \qquad \sigma^2_{\text{weight},0} = 28.25$$
Mean and variance of each feature for class $y = 1$ (male):
$$\mu_{\text{height},1} = 1.8275, \qquad \sigma^2_{\text{height},1} \approx 0.0090$$
$$\mu_{\text{weight},1} = 76.5, \qquad \sigma^2_{\text{weight},1} \approx 40.33$$
Suppose we observe a new person with height 1.73 m and weight 66 kg, i.e. $\mathbf{x} = (1.73, 66)$, and want to predict the gender.
Probability density of each feature given class $y = 0$:
$$P(x_{\text{height}} = 1.73 \mid y = 0) \approx 8.28, \qquad P(x_{\text{weight}} = 66 \mid y = 0) \approx 0.0623$$
Posterior probability of class $y = 0$ (up to the constant evidence):
$$P(y = 0 \mid \mathbf{x}) \propto 0.5 \times 8.28 \times 0.0623 \approx 0.258$$
Probability density of each feature given class $y = 1$:
$$P(x_{\text{height}} = 1.73 \mid y = 1) \approx 2.48, \qquad P(x_{\text{weight}} = 66 \mid y = 1) \approx 0.0160$$
Posterior probability of class $y = 1$ (up to the constant evidence):
$$P(y = 1 \mid \mathbf{x}) \propto 0.5 \times 2.48 \times 0.0160 \approx 0.020$$
Since $0.258 > 0.020$, the prediction is class $0$ (female).
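The example can be checked with a short Gaussian naive Bayes script; it uses the sample variance (dividing by $n - 1$) and the same query as above.

```python
import math

data = [
    (1.72, 68, 1), (1.83, 79, 1), (1.71, 65, 0), (1.95, 83, 1),
    (1.68, 60, 0), (1.81, 76, 1), (1.75, 69, 0), (1.67, 57, 0),
]
query = (1.73, 66)

def mean_var(values):
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / (len(values) - 1)  # sample variance
    return m, v

def gaussian(x, m, v):
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

for c in (0, 1):
    rows_c = [row for row in data if row[-1] == c]
    score = len(rows_c) / len(data)  # prior P(c)
    for j, value in enumerate(query):
        m, v = mean_var([row[j] for row in rows_c])
        score *= gaussian(value, m, v)
    print(c, round(score, 3))  # class 0 scores higher -> predict 0 (female)
```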
Naive Bayes is a Linear Classifier
In many cases, naive Bayes leads to a linear decision boundary. Consider binary classification with classes $y \in \{0, 1\}$. Given an input $\mathbf{x}$, the prediction is 1 if
$$P(y = 1 \mid \mathbf{x}) > P(y = 0 \mid \mathbf{x})$$
Taking the logarithm of both sides and using Bayes' rule (the evidence $P(\mathbf{x})$ cancels), we have
$$\log P(\mathbf{x} \mid y = 1) + \log P(y = 1) > \log P(\mathbf{x} \mid y = 0) + \log P(y = 0)$$
Suppose the features are Gaussian,
$$P(\mathbf{x} \mid y = c) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_c|^{1/2}}\exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_c)^{\top}\Sigma_c^{-1}(\mathbf{x} - \boldsymbol{\mu}_c)\right)$$
where $\boldsymbol{\mu}_c$ and $\Sigma_c$ are the mean vector and covariance matrix of the class-$c$ distribution. Note that under the naive Bayes assumption $\Sigma_c$ is diagonal and hence symmetric.
Expanding the log-likelihood, we have
$$\log P(\mathbf{x} \mid y = c) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\Sigma_c| - \frac{1}{2}\mathbf{x}^{\top}\Sigma_c^{-1}\mathbf{x} + \mathbf{x}^{\top}\Sigma_c^{-1}\boldsymbol{\mu}_c - \frac{1}{2}\boldsymbol{\mu}_c^{\top}\Sigma_c^{-1}\boldsymbol{\mu}_c$$
Assuming both classes share the same covariance matrix, $\Sigma_1 = \Sigma_0 = \Sigma$, the quadratic terms (and the $\log|\Sigma_c|$ terms) cancel when the two sides are compared, and the condition for predicting 1 becomes
$$\mathbf{x}^{\top}\Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - \frac{1}{2}\boldsymbol{\mu}_1^{\top}\Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_0^{\top}\Sigma^{-1}\boldsymbol{\mu}_0 + \log\frac{P(y = 1)}{P(y = 0)} > 0$$
The terms that do not depend on $\mathbf{x}$ can be collected into a constant $b$. Thus, with $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$ and $b = -\frac{1}{2}\boldsymbol{\mu}_1^{\top}\Sigma^{-1}\boldsymbol{\mu}_1 + \frac{1}{2}\boldsymbol{\mu}_0^{\top}\Sigma^{-1}\boldsymbol{\mu}_0 + \log\frac{P(y = 1)}{P(y = 0)}$, we have
$$\mathbf{w}^{\top}\mathbf{x} + b > 0$$
which is a linear function of $\mathbf{x}$.
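A minimal numerical check of this result under illustrative parameters: two Gaussian classes with a shared diagonal covariance, where the linear rule and the full log-posterior comparison give the same decision.

```python
import numpy as np

# Illustrative parameters: shared diagonal covariance and two class means.
mu0 = np.array([1.70, 62.0])
mu1 = np.array([1.83, 76.0])
sigma = np.diag([0.002, 30.0])  # shared covariance (diagonal under naive Bayes)
prior0, prior1 = 0.5, 0.5

sigma_inv = np.linalg.inv(sigma)
w = sigma_inv @ (mu1 - mu0)
b = (-0.5 * mu1 @ sigma_inv @ mu1
     + 0.5 * mu0 @ sigma_inv @ mu0
     + np.log(prior1 / prior0))

x = np.array([1.80, 74.0])
print(int(w @ x + b > 0))  # linear rule

# Cross-check against the log-posteriors (constant terms cancel): same decision.
def log_post(x, mu, prior):
    d = x - mu
    return -0.5 * d @ sigma_inv @ d + np.log(prior)

print(int(log_post(x, mu1, prior1) > log_post(x, mu0, prior0)))
```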
References
[1] P. Davies, “Kendall’s Advanced Theory of Statistics. Volume 1. Distribution Theory.” Wiley Online Library, 1988.
[2] D. J. Hand and K. Yu, “Idiot’s Bayes—not so stupid after all?,” International Statistical Review, vol. 69, no. 3, pp. 385–398, 2001.
[3] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” arXiv preprint arXiv:1302.4964, 2013.