Naive Bayes

Introduction

We have seen how probability plays a key role in machine learning. In this chapter, we discuss further how probability theory can be used to make predictions. Data is generated by a process that is not completely known, and since we do not have complete knowledge of that process, probability theory is used to model the uncertainty within it. We will discuss how probability theory can be used to represent uncertainty and to make predictions.

Probability Theory

A random process by which we observe something uncertain is called a random experiment: any situation in which the process can lead to more than one possible outcome. Consider the random process of tossing a coin, which results in one of two mutually exclusive outcomes, heads or tails. The outcome of this process cannot be predicted with certainty; we can only say how likely each outcome is. The chance of a particular event happening is called probability. Another important concept associated with probability is the random variable, a numerical description of the outcome of a random experiment.

Consider the process of tossing a coin with two mutually exclusive outcomes, heads or tails. We define a random variable X that takes one of two values: X=1 denotes heads and X=0 denotes tails. The probability of an event X=x is denoted by p(X=x) or p(x), and its value satisfies 0 \le p(x) \le 1 and \sum_{x \in X} p(x) = 1. A higher probability value indicates a higher chance of an event occurring. Thus, using the probability values, we can predict the outcome of the next toss; for example, the prediction is heads if p(X=1) > 0.5 and tails otherwise.

If p(X) is not known, we can estimate it by repeatedly tossing the coin and observing the outcomes. Given a sample of the outcomes of N past tosses, the estimated probability of heads is

p(X=1) = \frac{N_h}{N}

where N_h is the number of tosses with outcome heads. For example, if the number of tosses is 10 and the sample \mathcal{X}=\{1,1,0,1,0,0,1,1,0,0\}, the estimated probability is

p(X=1)=\frac{5}{10}
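The estimate can be reproduced with a few lines of Python; a minimal sketch using the sample above:

# Estimate p(X=1) as the fraction of heads in the observed sample
sample = [1, 1, 0, 1, 0, 0, 1, 1, 0, 0]
p_heads = sum(sample) / len(sample)   # N_h / N
print(p_heads)                        # 0.5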

Consider two independent events A and B, such as getting heads on each of two coin tosses; events are independent when the outcome of one does not affect the outcome of the other. The probability of both events happening is calculated as follows:

p(A \cap B) = p(A,B) = p(A) \times p(B)

Some events depend on other events. Consider taking two balls from a jar containing 5 red and 5 blue balls. Initially, both colors have an equal chance of being drawn. If the first ball drawn was red, the second ball is less likely to be red, as only 4 red balls are left in the jar. The probability of the second draw therefore depends on the outcome of the first; this is called a conditional probability. Let A and B be the events associated with the second and first draws respectively. The conditional probability of event A given event B is defined as follows:

p(A \mid B)=\frac{p(A,B)}{p(B)}

Using the product rule p(A,B) = p(B \mid A)\,p(A), the conditional probability can be rewritten as follows:

p(A \mid B) = \frac{p(B \mid A)p(A)}{p(B)}

where p(B \mid A) is the conditional probability of event B given A, also known as the likelihood, and p(A \mid B) is the conditional probability of event A given B, also known as the posterior probability. p(A) and p(B) are the probabilities of observing A and B respectively, which can be interpreted as the prior probability and the evidence. Using the law of total probability, p(B) = \sum_j p(B \mid A_j)p(A_j). The above definition is known as Bayes' rule [1].

The following example illustrates the use of Bayes' rule in medical testing. Let y=1 denote that a woman is pregnant and x=1 denote a positive test result. Suppose that for every 100 women who take the test, 20 are indeed pregnant.

p(y=1) = 0.20

The sensitivity of the pregnancy test is 98% which means if a woman is pregnant, the test will be positive with a probability of 0.98.

p(x=1 \mid y=1) = 0.98

Additionally, the false positive rate of the test kit is 1.5%.

p(x=1 \mid y=0) = 0.015

Suppose a woman takes a pregnancy test and the outcome of the test is positive. What is the probability the woman is pregnant?

\begin{aligned} p(y=1 \mid x=1) &= \frac{p(x=1 \mid y=1)p(y=1)}{p(x=1 \mid y=1)p(y=1) + p(x=1 \mid y=0)p(y=0)} \\ &= \frac{0.98 \times 0.20}{0.98 \times 0.20 + 0.015 \times 0.80} \\ &= 0.942 \end{aligned}

Since the posterior probability is 0.942, we conclude that the woman is most likely pregnant.
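The posterior can be verified numerically; a short Python sketch using the prior, sensitivity, and false positive rate stated above:

prior = 0.20              # p(y=1)
sensitivity = 0.98        # p(x=1 | y=1)
false_positive = 0.015    # p(x=1 | y=0)

# Bayes' rule: posterior = likelihood * prior / evidence,
# where the evidence follows from the law of total probability
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence
print(round(posterior, 3))  # 0.942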

Naive Bayes

We have seen how we can perform classification when uncertainty is modeled using probabilities, i.e., the likelihood, prior, and evidence. In many cases, however, this information is not available to us. In its absence, we can estimate the probabilities from past observations (training data). Then, using the estimated probabilities, we can classify the feature vector \mathbf{x} by applying Bayes' rule as follows:

p(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c)p(y=c)}{p(\mathbf{x})}

The key to using this approach is to estimate the class-conditional probabilities of the features, i.e., the likelihood. Let d be the number of features; by the chain rule, the likelihood is

p(x_1,x_2,...,x_d \mid y=c) = p(x_1 \mid x_2, ...,x_d,y=c) \, p(x_2 \mid x_3,...,x_d,y=c) \cdots p(x_d \mid y=c)

However, calculating this likelihood is cumbersome, especially when the number of features is large. Naive Bayes is a simple classifier based on applying Bayes' rule with a strong independence assumption between the features [2]. Due to this assumption, the calculation of the likelihood simplifies to

p(\mathbf{x} \mid y=c) = \prod_{j=1}^d p(x_j \mid y=c)

Thus, the posterior probability becomes

p(y=c \mid \mathbf{x}) = \frac{p(y=c) \prod_{j=1}^d p(x_j \mid y=c)}{p(x_1, x_2, ..., x_d)}

Since p(x_1, x_2, ..., x_d) is constant for a given input, and the prediction is made by comparing the posterior probabilities and choosing the class with the largest probability, the denominator can be omitted and the classification rule becomes

p(y=c \mid \mathbf{x}) \propto p(y=c) \prod_{j=1}^d p(x_j \mid y=c)

The prior probability p(y=c) can be estimated by assuming equiprobable classes or by computing the fraction of instances in the dataset that belong to class c. The estimation of the conditional probability, or the feature's distribution, p(x_j \mid y=c) is done by assuming a distribution for the features [3]. The assumed distribution depends on the type of the feature, as described in the following sections.
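For the frequency-based option, a one-line Python sketch (the labels are illustrative):

labels = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No"]
prior_yes = labels.count("Yes") / len(labels)   # p(y=Yes) estimated as N_c / N
print(prior_yes)  # 0.5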

Categorical Naive Bayes

First, we discuss the case where the features are categorical. We assume each feature j has its own categorical distribution. For each feature j in the training set, the categorical distribution is estimated conditioned on class c. The conditional probability of label k in feature j given class c is estimated as

p(x_j=k \mid y=c) = \frac{N_{kjc}}{N_c}

where N_{kjc} is the number of times label k appears in feature j among the instances that belong to class c, and N_c is the number of instances with class c.
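A minimal Python sketch of this counting estimate, using data taken from Numerical Example 1 below (the function name is our own):

def categorical_conditional(feature_values, labels, k, c):
    """Estimate p(x_j = k | y = c) as N_kjc / N_c by counting."""
    n_c = sum(1 for y in labels if y == c)
    n_kjc = sum(1 for x, y in zip(feature_values, labels) if x == k and y == c)
    return n_kjc / n_c

outlook = ["Sunny", "Cloudy", "Rain", "Rain", "Cloudy", "Sunny", "Rain", "Sunny"]
play    = ["No", "Yes", "Yes", "No", "Yes", "Yes", "No", "No"]
print(categorical_conditional(outlook, play, "Sunny", "Yes"))  # 0.25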

Bernoulli Naive Bayes

The Bernoulli distribution is a discrete distribution with only two possible values for the random variable. Bernoulli naive Bayes is typically used for document or text classification in which the features are binary word occurrences x_j \in \{0,1\}, where x_j is a Boolean feature expressing the presence (occurrence) or absence of the j-th word in the document. If the probability of a word's occurrence is p, then the probability of its absence is 1-p. Thus, the likelihood of a document \mathbf{x} given class c is given by

p(\mathbf{x} \mid y=c) = \prod_{j=1}^{d} p(x_j \mid y=c)

which is the product of the conditional probabilities of each word's occurrence x_j given class c. Here p(x_j=1 \mid y=c) is the probability of the j-th word occurring in class c, and p(x_j=0 \mid y=c) = 1 - p(x_j=1 \mid y=c) is the probability of its absence. The occurrence probability can be estimated by counting the fraction of documents labeled with class c in which the j-th word occurs.
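A minimal sketch of the Bernoulli likelihood in Python, where the occurrence probabilities are illustrative values rather than estimates from a real corpus:

def bernoulli_likelihood(x, p):
    """p(x | y=c) for binary features: presence of word j contributes p[j],
    absence contributes 1 - p[j]."""
    likelihood = 1.0
    for x_j, p_j in zip(x, p):
        likelihood *= p_j if x_j == 1 else (1.0 - p_j)
    return likelihood

p_c = [0.8, 0.1, 0.4]                        # p(x_j=1 | y=c) for three words (illustrative)
print(bernoulli_likelihood([1, 0, 1], p_c))  # 0.8 * 0.9 * 0.4 = approx. 0.288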

Multinomial Naive Bayes

The multinomial distribution is a generalization of the binomial distribution: a discrete distribution of a random variable with more than two possible outcomes. Using the multinomial distribution, in the case of document classification, rather than word-occurrence vectors we can use word frequencies (counts) as features. The likelihood of a document given class c is given by

p(\mathbf{x} \mid y=c) = \frac{(\sum_{j=1}^{d} x_j)!}{\prod_{j=1}^{d} x_j! } \prod_{j=1}^{d} p(x_j \mid y=c)^{x_j}

and p(x_j \mid y=c) can be estimated as follows:

p(x_j \mid y=c)=\frac{N_{jc}}{N_c}

where N_{jc} is the number of times word j appears in the documents with class c and N_c = \sum_{j=1}^{d} N_{jc} is the total count of all words in the documents with class c.
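A minimal sketch of this estimate in Python, assuming the documents of class c are given as word-count vectors (the data is illustrative):

# Word-count vectors of the documents labeled with class c (illustrative)
docs_c = [
    [2, 0, 1],   # counts of words 1..3 in the first document
    [1, 1, 0],   # counts of words 1..3 in the second document
]
n_jc = [sum(counts) for counts in zip(*docs_c)]  # N_jc for each word j
n_c = sum(n_jc)                                  # total word count in class c
theta_c = [n / n_c for n in n_jc]                # p(x_j | y=c)
print(theta_c)  # [0.6, 0.2, 0.2]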

Gaussian Naive Bayes

In the case of real-valued features, we typically assume the features are distributed according to the Gaussian distribution. The likelihood of feature x_j given class c is given by

p(x_j \mid y=c) = \frac{1}{\sqrt{2 \pi \sigma_{jc}^2}} e^{-\frac{(x_j - \mu_{jc})^2}{2 \sigma_{jc}^2}}

where \mu_{jc} is the mean of feature x_j for class c and \sigma_{jc}^2 is the variance of feature x_j for class c.
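A minimal Python sketch of the Gaussian likelihood; the mean and variance are estimated from the heights of the male class in Numerical Example 2 below:

import math

def gaussian_pdf(x, mean, var):
    """p(x_j | y=c) under a Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

heights_male = [1.72, 1.83, 1.95, 1.81]
mu = sum(heights_male) / len(heights_male)
var = sum((h - mu) ** 2 for h in heights_male) / (len(heights_male) - 1)  # sample variance
print(round(gaussian_pdf(1.73, mu, var), 2))  # approx. 2.48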

Log Trick

Multiplying many conditional probabilities may cause numerical instability known as numerical underflow, because the value of p(\mathbf{x} \mid y=c) is often very small, especially when \mathbf{x} is a high-dimensional vector. To avoid underflow, we take the logarithm of the probabilities when applying Bayes' rule. Note that multiplication becomes addition in log space.

\begin{aligned}\log p(y=c \mid \mathbf{x}) &\propto \log ( \prod_{j=1}^d p(x_j \mid y=c) \cdot p(y=c)) \\ &\propto \sum_{j=1}^d \log p(x_j \mid y=c) + \log p(y=c) \end{aligned}
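A minimal sketch of the log-space scoring rule in Python; `cond_prob_c` is an assumed interface that returns p(x_j \mid y=c) for feature j, and the class with the largest score is the prediction:

import math

def log_posterior_score(x, prior_c, cond_prob_c):
    """Unnormalized log posterior: log p(y=c) + sum_j log p(x_j | y=c)."""
    score = math.log(prior_c)
    for j, x_j in enumerate(x):
        score += math.log(cond_prob_c(j, x_j))
    return score

# Toy usage: two binary features with p(x_j=1 | y=c) = 0.7 and p(x_j=0 | y=c) = 0.3
probs = {1: 0.7, 0: 0.3}
print(round(log_posterior_score([1, 0], 0.5, lambda j, v: probs[v]), 3))  # -2.254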

Numerical Example 1

Given the following dataset.

Outlook Temperature Humidity Play Tennis
Sunny Hot High No
Cloudy Hot High Yes
Rain Mild High Yes
Rain Cool Normal No
Cloudy Mild High Yes
Sunny Cool Normal Yes
Rain Mild High No
Sunny Mild High No

N=8

Prior probability

p(y=\texttt{Yes})=\frac{4}{8} = \frac{1}{2}, p(y=\texttt{No})=\frac{4}{8} = \frac{1}{2}

Conditional probability for x_1

p(x_1=\texttt{Sunny} \mid y=\texttt{Yes}) = \frac{1}{4}, p(x_1=\texttt{Sunny} \mid y=\texttt{No}) = \frac{2}{4}

p(x_1=\texttt{Cloudy} \mid y=\texttt{Yes}) = \frac{2}{4}, p(x_1=\texttt{Cloudy} \mid y=\texttt{No}) = \frac{0}{4}

p(x_1=\texttt{Rain} \mid y=\texttt{Yes}) = \frac{1}{4}, p(x_1=\texttt{Rain} \mid y=\texttt{No}) = \frac{2}{4}

Conditional probability for x_2

p(x_2=\texttt{Hot} \mid y=\texttt{Yes}) = \frac{1}{4}, p(x_2=\texttt{Hot} \mid y=\texttt{No}) = \frac{1}{4}

p(x_2=\texttt{Mild} \mid y=\texttt{Yes}) = \frac{2}{4}, p(x_2=\texttt{Mild} \mid y=\texttt{No}) = \frac{2}{4}

p(x_2=\texttt{Cool} \mid y=\texttt{Yes}) = \frac{1}{4}, p(x_2=\texttt{Cool} \mid y=\texttt{No}) = \frac{1}{4}

Conditional probability for x_3

p(x_3=\texttt{High} \mid y=\texttt{Yes}) = \frac{3}{4}, p(x_3=\texttt{High} \mid y=\texttt{No}) = \frac{3}{4}

p(x_3=\texttt{Normal} \mid y=\texttt{Yes}) = \frac{1}{4}, p(x_3=\texttt{Normal} \mid y=\texttt{No}) = \frac{1}{4}

Given \mathbf{\acute{x}}=\{x_1=\texttt{Sunny}, x_2=\texttt{Mild},x_3=\texttt{Normal}\}, predict the class of \mathbf{\acute{x}}.

Posterior probability

\begin{aligned}\log p(y=\texttt{Yes} \mid \mathbf{\acute{x}}) &\propto \log (\frac{1}{4} \cdot \frac{2}{4} \cdot \frac{1}{4} \cdot \frac{1}{2}) \\ &= -4.159 \end{aligned}

\begin{aligned}\log p(y=\texttt{No} \mid \mathbf{\acute{x}}) &\propto \log (\frac{2}{4} \cdot \frac{2}{4} \cdot \frac{1}{4} \cdot \frac{1}{2}) \\ &= -3.466 \end{aligned}

Since \log p(y=\texttt{No} \mid \mathbf{\acute{x}}) > \log p(y=\texttt{Yes} \mid \mathbf{\acute{x}}), the prediction is No.
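The whole example can be reproduced with a short Python script; a sketch of the categorical naive Bayes calculation (no smoothing is applied, so an unseen feature value would produce log 0):

import math

data = [
    ("Sunny", "Hot", "High", "No"),
    ("Cloudy", "Hot", "High", "Yes"),
    ("Rain", "Mild", "High", "Yes"),
    ("Rain", "Cool", "Normal", "No"),
    ("Cloudy", "Mild", "High", "Yes"),
    ("Sunny", "Cool", "Normal", "Yes"),
    ("Rain", "Mild", "High", "No"),
    ("Sunny", "Mild", "High", "No"),
]
query = ("Sunny", "Mild", "Normal")

for c in ("Yes", "No"):
    rows_c = [row for row in data if row[-1] == c]
    score = math.log(len(rows_c) / len(data))              # log prior
    for j, value in enumerate(query):                      # add log conditional probabilities
        count = sum(1 for row in rows_c if row[j] == value)
        score += math.log(count / len(rows_c))
    print(c, round(score, 3))
# Yes -4.159
# No -3.466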

Numerical Example 2

Given the following dataset where 0 is female and 1 is male.

Height (m) Weight (kg) Gender
1.72 68 1
1.83 79 1
1.71 65 0
1.95 83 1
1.68 60 0
1.81 76 1
1.75 69 0
1.67 57 0

N=8

Prior probability

p(y=\texttt{Male})=\frac{4}{8} = \frac{1}{2}, p(y=\texttt{Female})=\frac{4}{8} = \frac{1}{2}

\mu and \sigma^2 for x_1

\mu_{1,0} = 1.7025, \sigma_{1,0}^2 = 0.0013

\mu_{1,1} = 1.8275, \sigma_{1,1}^2 = 0.0090

\mu and \sigma^2 for x_2

\mu_{2,0} = 62.75, \sigma_{2,0}^2 = 28.2500

\mu_{2,1} = 76.50, \sigma_{2,1}^2 = 40.3333

Given \mathbf{\acute{x}}=\{\acute{x}_1=1.73, \acute{x}_2=66\}, predict the gender of \mathbf{\acute{x}}.

Probability density function of \mathbf{\acute{x}} given class 0.

\begin{aligned}p(\acute{x}_1 \mid y=0) &= \frac{1}{\sigma_{1,0} \sqrt{2 \pi }} e^{-\frac{(\acute{x}_1 - \mu_{1,0})^2}{2 \sigma_{1,0}^2}} \\ &= 8.2721 \end{aligned}

\begin{aligned}p(\acute{x}_2 \mid y=0) &= \frac{1}{\sigma_{2,0} \sqrt{2 \pi }} e^{-\frac{(\acute{x}_2 - \mu_{2,0})^2}{2 \sigma_{2,0}^2}} \\ &= 0.0622 \end{aligned}

Posterior probability of class 0.

\begin{aligned}\log p(y=0 \mid \mathbf{\acute{x}}) &\propto \log (p(\acute{x}_1 \mid y=0) \cdot p(\acute{x}_2 \mid y=0) \cdot p(y=0)) \\ &= -1.356 \end{aligned}

Probability density function of \mathbf{\acute{x}} given class 1.

\begin{aligned}p(\acute{x}_1 \mid y=1) &= \frac{1}{\sigma_{1,1} \sqrt{2 \pi }} e^{-\frac{(\acute{x}_1 - \mu_{1,1})^2}{2 \sigma_{1,1}^2}} \\ &= 2.4799 \end{aligned}

\begin{aligned}p(\acute{x}_2 \mid y=1) &= \frac{1}{\sigma_{2,1} \sqrt{2 \pi }} e^{-\frac{(\acute{x}_2 - \mu_{2,1})^2}{2 \sigma_{2,1}^2}} \\ &= 0.0160 \end{aligned}

Posterior probability of class 1.

\begin{aligned}\log p(y=1 \mid \mathbf{\acute{x}}) &\propto \log (p(\acute{x}_1 \mid y=1) \cdot p(\acute{x}_2 \mid y=1) \cdot p(y=1)) \\ &= -3.920 \end{aligned}

Since \log p(y=0 \mid \mathbf{\acute{x}}) > \log p(y=1 \mid \mathbf{\acute{x}}), the prediction is class 0 (female).
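The Gaussian case can be reproduced in the same way; a sketch that re-estimates the means and sample variances from the table and scores the query in log space (small differences from the values above come from rounding the intermediate results in the text):

import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

data = [
    (1.72, 68, 1), (1.83, 79, 1), (1.71, 65, 0), (1.95, 83, 1),
    (1.68, 60, 0), (1.81, 76, 1), (1.75, 69, 0), (1.67, 57, 0),
]
query = (1.73, 66)

for c in (0, 1):
    rows_c = [row[:2] for row in data if row[2] == c]
    score = math.log(len(rows_c) / len(data))                         # log prior
    for j, x_j in enumerate(query):
        values = [row[j] for row in rows_c]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / (len(values) - 1)  # sample variance
        score += math.log(gaussian_pdf(x_j, mu, var))
    print(c, round(score, 3))   # class 0 scores higher than class 1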

Naive Bayes is a Linear Classifier

In many cases, naive Bayes leads to a linear decision boundary. Given an input \mathbf{x}, the prediction is 1 if

p(y=1 \mid \mathbf{x}) > p(y=0 \mid \mathbf{x})

Taking the logarithm of both sides, applying Bayes' rule, and dropping the common evidence term p(\mathbf{x}), we have

\log p(\mathbf{x} \mid y=1) + \log p(y=1) - \log p(\mathbf{x} \mid y=0) - \log p(y=0) > 0

Suppose the features are Gaussian, p(\mathbf{x} \mid y=c) = \mathcal{N}(\mathbf{x} \mid \mu_c, \Sigma_c)

where \mu_c and \Sigma_c are the mean vector and covariance matrix of the class-conditional distribution. Due to the independence assumption, \Sigma_c is diagonal. Substituting the logarithm of the Gaussian density (and omitting the normalization terms, which cancel once the covariances are assumed to be shared below), we obtain

-\frac{1}{2}(\mathbf{x} - \mu_1)^T \Sigma_1^{-1} (\mathbf{x} - \mu_1) + \log p(y=1) + \frac{1}{2}(\mathbf{x} - \mu_0)^T \Sigma_0^{-1} (\mathbf{x} - \mu_0) - \log p(y=0) > 0

Expanding the expression, we have

-\frac{1}{2}\mathbf{x}^T \Sigma_1^{-1} \mathbf{x} + \mathbf{x}^T \Sigma_1^{-1} \mu_1 - \frac{1}{2}\mu_1^T \Sigma_1^{-1} \mu_1 + \frac{1}{2}\mathbf{x}^T \Sigma_0^{-1} \mathbf{x} - \mathbf{x}^T \Sigma_0^{-1} \mu_0 + \frac{1}{2}\mu_0^T \Sigma_0^{-1} \mu_0 + \log p(y=1) - \log p(y=0) > 0

Assuming both classes share the same covariance matrix, \Sigma_1 = \Sigma_0 = \Sigma, the quadratic terms cancel. We have

\mathbf{x}^T \Sigma^{-1} (\mu_1 - \mu_0) - \frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log p(y=1) - \log p(y=0) > 0

The term -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log p(y=1) - \log p(y=0) does not depend on \mathbf{x} and can be treated as a constant C. Thus, we have

\mathbf{x}^T \Sigma^{-1} (\mu_1 - \mu_0) + C > 0

which is linear in \mathbf{x}.
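A small NumPy sketch of the resulting linear rule, with illustrative class statistics and a shared diagonal covariance (the numbers are loosely based on Numerical Example 2, not estimates from real data):

import numpy as np

mu0 = np.array([1.70, 62.8])             # class 0 mean vector (illustrative)
mu1 = np.array([1.83, 76.5])             # class 1 mean vector (illustrative)
var = np.array([0.005, 34.0])            # diagonal of the shared covariance matrix
prior0, prior1 = 0.5, 0.5

sigma_inv = np.diag(1.0 / var)           # inverse of a diagonal matrix
w = sigma_inv @ (mu1 - mu0)              # linear weights
C = (-0.5 * mu1 @ sigma_inv @ mu1 + 0.5 * mu0 @ sigma_inv @ mu0
     + np.log(prior1) - np.log(prior0))  # the constant term

x = np.array([1.73, 66.0])
print("class 1" if x @ w + C > 0 else "class 0")  # the decision is linear in x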

References

[1] P. Davies, “Kendall’s Advanced Theory of Statistics. Volume 1. Distribution Theory.” Wiley Online Library, 1988.

[2] D. J. Hand and K. Yu, “Idiot’s Bayes—not so stupid after all?,” International statistical review, vol. 69, no. 3, pp. 385–398, 2001.

[3] G. H. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” arXiv preprint arXiv:1302.4964, 2013.