Naive Bayesian Classifiers - Multinomial, Bernoulli and Gaussian with Solved Examples and Laplace Smoothing

 Learn about the Naive Bayes Classifier in the following notes.

Numeric Example with Dataset (Transactional Data)


Consider the following dataset. Apply the Naïve Bayes classifier to the following frequency table and predict the type of fruit given it is {Yellow, Sweet, Long}.


Solution can be viewed in the following pdf.
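
For readers who want to try the computation in code, here is a minimal Python sketch of applying Bayes' rule with the naive independence assumption to a frequency table. All class names and counts below are hypothetical placeholders, not necessarily the numbers in the dataset above; substitute the actual table before drawing conclusions.

# Naive Bayes on a fruit frequency table - a minimal sketch.
# HYPOTHETICAL counts: replace with the actual frequency table from the problem.
counts = {
    "Mango":  {"total": 650, "Yellow": 350, "Sweet": 450, "Long": 0},
    "Banana": {"total": 400, "Yellow": 400, "Sweet": 300, "Long": 350},
    "Others": {"total": 150, "Yellow": 50,  "Sweet": 100, "Long": 50},
}
grand_total = sum(c["total"] for c in counts.values())
evidence = ["Yellow", "Sweet", "Long"]

scores = {}
for fruit, c in counts.items():
    prior = c["total"] / grand_total          # P(fruit)
    likelihood = 1.0
    for attr in evidence:
        likelihood *= c[attr] / c["total"]    # P(attr | fruit), attributes assumed independent
    scores[fruit] = prior * likelihood        # proportional to P(fruit | Yellow, Sweet, Long)

print(scores, "->", max(scores, key=scores.get))

Note that with these placeholder counts the "Long" count for Mango is zero, so its score collapses to zero; the Laplace Smoothing section below addresses exactly this issue.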


Numeric Example with Text Data

Multinomial Naive Bayes vs Bernoulli Naive Bayes

Multinomial Naive Bayes and Bernoulli Naive Bayes are both variations of the Naive Bayes algorithm, and they are used for different types of data distributions:

1. Multinomial Naive Bayes:

   - The Multinomial Naive Bayes classifier is used when the features are assumed to follow a multinomial distribution, which in practice means discrete count data.

   - It is particularly suitable for text classification problems where features (or words) can occur multiple times. For example, it can be used for document classification where the features are the frequencies of the words or n-grams within the documents.

   - The likelihood of the features is assumed to follow a multinomial distribution, which models how many times each feature (e.g., each word in the vocabulary) occurs.

The likelihood of observing a feature vector \( x = (x_1, x_2, \ldots, x_n) \) given a class \( c \) is calculated as:

\[ P(x \mid c) \;=\; \frac{\left(\sum_{i=1}^{n} x_i\right)!}{\prod_{i=1}^{n} x_i!} \prod_{i=1}^{n} P(w_i \mid c)^{x_i} \;\propto\; \prod_{i=1}^{n} P(w_i \mid c)^{x_i} \]

where \( x_i \) is the number of times feature (word) \( w_i \) occurs in the document and \( P(w_i \mid c) \) is estimated from the training counts for class \( c \).


2. Bernoulli Naive Bayes:

   - The Bernoulli Naive Bayes classifier is specifically designed for binary/boolean features, i.e., data in which each feature is an independent binary (present/absent) variable.

   - This model is useful when you’re working with binary feature vectors. For example, in text classification, instead of using word frequencies, you might use binary variables to indicate the presence or absence of a word.

   - The parameters that the model learns are the probabilities of a feature being present in a class, and it explicitly penalizes the non-occurrence of a feature that is indicative of the class, which is a significant distinction from Multinomial Naive Bayes.

The likelihood of observing a feature vector \( x = (x_1, x_2, \ldots, x_n) \) given a class \( c \) is calculated as:

\[ P(x \mid c) \;=\; \prod_{i=1}^{n} P(i \mid c)^{x_i} \, \bigl(1 - P(i \mid c)\bigr)^{1 - x_i} \]

where \( x_i \in \{0, 1\} \) indicates whether feature \( i \) is present and \( P(i \mid c) \) is the probability that feature \( i \) occurs in documents of class \( c \). Note the \( (1 - P(i \mid c)) \) factor: the absence of a feature contributes to the likelihood as well.

In short, use Multinomial Naive Bayes when your feature data is counts (like word counts in text classification), and use Bernoulli Naive Bayes when your features are binary (0s and 1s to represent the presence or absence of a feature). The main difference lies in the distribution they assume for the input data and consequently how they calculate the likelihoods of the features.
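
To make the distinction concrete, here is a short scikit-learn sketch (the toy corpus, labels, and test sentence are invented for illustration): the same documents are vectorized once as word counts for Multinomial Naive Bayes and once as presence/absence indicators for Bernoulli Naive Bayes.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

docs = ["free lottery win win", "meeting agenda attached",
        "win a free prize", "project meeting notes"]       # invented toy corpus
labels = ["spam", "ham", "spam", "ham"]

# Word counts -> Multinomial Naive Bayes
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Presence/absence (0/1) -> Bernoulli Naive Bayes
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

test = ["free meeting about the lottery prize"]
print(mnb.predict(count_vec.transform(test)),
      bnb.predict(binary_vec.transform(test)))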

Multinomial Naive Bayes

The following video solves a text classification problem using Multinomial Naive Bayes.
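
As a complement to the video, here is a from-scratch sketch of the Multinomial Naive Bayes computation on a small invented corpus (documents, labels, and the test sentence are all made up for illustration). Log-probabilities are used to avoid underflow, and add-1 (Laplace) smoothing is applied, as explained in the next section.

import math
from collections import Counter, defaultdict

# Invented toy training corpus: (tokens, label)
train = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["chinese", "macao"], "china"),
    (["tokyo", "japan", "chinese"], "japan"),
]
test = ["chinese", "chinese", "chinese", "tokyo", "japan"]

vocab = {w for doc, _ in train for w in doc}
class_tokens = defaultdict(list)
for doc, label in train:
    class_tokens[label].extend(doc)

scores = {}
for label, tokens in class_tokens.items():
    prior = sum(1 for _, l in train if l == label) / len(train)   # P(c)
    word_counts = Counter(tokens)
    total = len(tokens)
    log_score = math.log(prior)
    for w in test:
        # P(w | c) with add-1 (Laplace) smoothing over the vocabulary
        log_score += math.log((word_counts[w] + 1) / (total + len(vocab)))
    scores[label] = log_score

print(max(scores, key=scores.get), scores)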

Laplace Smoothing

Laplace smoothing (also known as additive smoothing or Laplacian correction) is a technique used to handle the problem of zero probabilities in Naive Bayes classifiers. Without smoothing, if a particular feature-class combination never appears in the training data, its estimated conditional probability is zero. This is problematic because, when the Naive Bayes rule multiplies the feature likelihoods together, a single zero drives the entire posterior for that class to zero.

To avoid this, Laplace smoothing adds a small positive number \( \alpha \) to each count. The modified formula for calculating the probability of a feature \( f \) given a class \( c \) with Laplace smoothing is:

\[ P(f \mid c) \;=\; \frac{\operatorname{count}(f, c) + \alpha}{\operatorname{count}(c) + \alpha \, |V|} \]

where \( \operatorname{count}(f, c) \) is the number of times feature \( f \) occurs with class \( c \), \( \operatorname{count}(c) \) is the total count of all features in class \( c \), \( |V| \) is the number of distinct features (the vocabulary size), and \( \alpha = 1 \) gives classical Laplace smoothing.

The effect of Laplace smoothing is to distribute some of the probability mass to unseen features to ensure that no feature has a probability of zero. This allows the classifier to make a prediction even when it encounters a previously unseen feature-class combination.
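
As a tiny numeric illustration (all numbers invented): suppose a class contains 8 feature occurrences in total, the vocabulary has 6 distinct words, and the word being scored never occurs with that class.

alpha = 1                 # Laplace smoothing parameter
vocab_size = 6            # |V|, number of distinct features (assumed)
count_word_in_class = 0   # the word never appears with this class
total_count_in_class = 8  # total feature count for this class (assumed)

unsmoothed = count_word_in_class / total_count_in_class
smoothed = (count_word_in_class + alpha) / (total_count_in_class + alpha * vocab_size)
print(unsmoothed, smoothed)   # 0.0 vs 1/14 ~ 0.0714: no longer zero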

Bernoulli Naive Bayes

The following video solves a text classification problem using Bernoulli Naive Bayes.
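
As a complement to the video, here is a from-scratch Bernoulli Naive Bayes sketch on an invented presence/absence dataset. The key point visible in the code is that every vocabulary word contributes to the likelihood, including absent words through the \( 1 - P(i \mid c) \) factor.

import math

# Invented toy data: each document is the SET of words it contains.
train = [
    ({"free", "win", "prize"}, "spam"),
    ({"free", "offer"}, "spam"),
    ({"meeting", "agenda"}, "ham"),
    ({"project", "meeting", "notes"}, "ham"),
]
test = {"free", "win"}

vocab = set().union(*(doc for doc, _ in train))
scores = {}
for c in {l for _, l in train}:
    docs_c = [doc for doc, l in train if l == c]
    log_score = math.log(len(docs_c) / len(train))            # log P(c)
    for w in vocab:
        df = sum(1 for doc in docs_c if w in doc)
        p = (df + 1) / (len(docs_c) + 2)                       # Bernoulli estimate with add-1 smoothing
        log_score += math.log(p if w in test else 1 - p)       # absence is penalized via 1 - p
    scores[c] = log_score

print(max(scores, key=scores.get), scores)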



Gaussian Naive Bayes

The following video solves a problem with Discrete and Continuous Features using Gaussian Naive Bayes.
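
As a complement to the video, here is a minimal Gaussian Naive Bayes sketch for a single continuous feature (the classes, values, and query point below are invented). For each class \( c \), the feature's mean \( \mu_c \) and variance \( \sigma_c^2 \) are estimated from the training data, and the class-conditional likelihood is \( P(x \mid c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} \exp\!\left(-\frac{(x - \mu_c)^2}{2\sigma_c^2}\right) \).

import math

# Invented continuous training data: one feature (say, a length in cm) per class.
data = {
    "A": [5.0, 5.5, 6.0, 5.8],
    "B": [4.5, 4.8, 5.0, 4.6],
}
x_new = 5.2

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

n_total = sum(len(v) for v in data.values())
scores = {}
for c, values in data.items():
    prior = len(values) / n_total                                    # P(c)
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)   # sample variance
    scores[c] = prior * gaussian_pdf(x_new, mean, var)               # proportional to P(c | x)

print(max(scores, key=scores.get), scores)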





