Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.
Multinomial Naive Bayes (Basics)¶
Multinomial Naive Bayes (MNB) is a specialized variant of the Naive Bayes algorithm, primarily designed for classification tasks involving discrete data, such as text and document classification. It operates on the assumption that the features (e.g., words in a document) follow a multinomial distribution, making it particularly effective in contexts where data is represented as counts or frequencies. MNB calculates the probability of a given class based on the product of the conditional probabilities of its features, weighted by their frequencies in the data.
Difference from General Naive Bayes¶
The general Naive Bayes algorithm includes a family of classifiers that assume independence among features but differ in the probability distributions they use. For example, Gaussian Naive Bayes assumes features follow a normal distribution, making it suitable for continuous data. In contrast, Multinomial Naive Bayes assumes a multinomial distribution, explicitly designed for discrete data, such as word occurrences or categorical variables. This makes MNB particularly adept at tasks like spam detection, sentiment analysis, or topic classification, where data is often in the form of word counts or term frequencies. A critical distinction lies in how these algorithms handle data representation. While Gaussian Naive Bayes can handle continuous variables, MNB is optimized for integer or non-negative data, such as word counts. This optimization allows MNB to outperform other Naive Bayes variants in tasks where text data or categorical features are involved.
Importance of Learning Multinomial Naive Bayes¶
Understanding Multinomial Naive Bayes is crucial for anyone working in natural language processing (NLP) or text mining. It is a cornerstone algorithm for text classification problems, offering simplicity, efficiency, and effectiveness. For example, when dealing with large-scale datasets like emails, news articles, or customer reviews, MNB provides a computationally efficient method to build reliable models without extensive preprocessing or computational overhead. Moreover, learning MNB provides insights into probabilistic reasoning, feature independence assumptions, and how distributions can be tailored to specific data types. It often serves as a benchmark for evaluating more complex machine learning algorithms, helping practitioners assess the trade-offs between simplicity and performance. Understanding MNB also aids in comprehending fundamental NLP techniques like the bag-of-words and TF-IDF representations. By mastering MNB, learners gain a practical and theoretical foundation for tackling real-world problems in text analytics, paving the way for advanced exploration of machine learning models. Its simplicity, combined with its practical relevance, makes it a vital tool in the machine learning toolkit.
Setting up the Notebook¶
Make Required Imports¶
This notebook requires the import of different Python packages but also additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.
from src.utils.libimports.multinomialnb import *
Motivating Example¶
Let's assume you want to build a classifier that automatically assigns a news article to a set of predefined categories (e.g., politics, sports, business, entertainment, lifestyle, etc.). This is a very common example of a text or document classification task. To keep it simple and allow for a better explanation of the MNB classifier, we consider only single sentences as our input documents and only two class labels (politics and sports). Lastly, let's assume we only have seven sentences in our whole training dataset, four of class politics and three of class sports. In more detail, the example dataset we will be using throughout the notebook looks as follows:
| Sentence | Class |
|---|---|
| The mayor was elected for this term and next term. | politics |
| A mayor's goal for the next term is to win. | politics |
| The goal for this term was to win the vote. | politics |
| This term's goals are next term's goals. | politics |
| The goal of any team player is the win. | sports |
| A win for the team is a win for each player. | sports |
| Players vote other players for another term. | sports |
Before we delve deeper into the inner workings of the MNB algorithm for classification, let's first try to get the basic underlying intuition the algorithm is based on. To do this, have another look at the example dataset above, but only consider the sentences and ignore the class labels. In other words, the class labels are not given, and assume that your task is now to decide whether a sentence came from a news article about politics or sports. Even without the class labels, most would probably say that the first four sentences are from political articles, while the last three sentences are from sports articles. But think about how you would explain your decisions. A simple and handwavy answer would be to say that the first four sentences just "talk about politics". But how can we make this statement more precise?
The core argument — and the core idea behind MNB — is that a sentence that "talks about politics" contains words that are more likely to be associated with political topics. For example, the word "mayor" is much more likely to appear in a political article compared to one about sports. Similarly, the word "player" is arguably more likely associated with topics about sports. However, this does not mean that words such as "mayor" may never appear in a sports article — for example, consider the sentence: "The team won the match and the mayor congratulated the players". In other words, we cannot apply hard rules that merely check for the presence or absence of words. Instead, we need soft rules that help us quantify the strength of the associations between words and topics. We can then use these learned associations to predict the class (e.g., the topic) of an article based on the words it contains.
In the context of the MNB classification algorithm, we express the associations between words and classes in terms of probabilities. In a nutshell, the MNB algorithm learns the probabilities of seeing a word in a document of a given class (for all different classes). Based on those probabilities, the algorithm can then predict the probability of a document belonging to a class based on the words within the document. The class with the highest probability for a document is the one that will be the final prediction for that document.
Formalization¶
Basic Definitions¶
Given an input document $x$, the MNB classification algorithm aims to calculate the probability of $x$ belonging to each class. For example, for our example dataset, we can ask: What are the two probabilities that the 3rd sentence "The goal for this term was to win the vote." belongs to class politics or sports? We can express this using conditional probabilities:
$P(\text{"politics"} | \text{"The goal for this term was to win the vote."})$
$P(\text{"sports"} | \text{"The goal for this term was to win the vote."})$
If we denote $Y=\{y_1, y_2\}$ as the set of classes — with $y_1\!=\!\text{politics}$ and $y_2\!=\!\text{sports}$ — and $x$ as our 3rd sentence in our example dataset $X$, we can rewrite the probabilities as:
$P(y_1 | x)$
$P(y_2 | x)$
Once we have both probabilities — or more in case of more than two classes — we use the class $y_i$ with the highest probability as the predicted class $y$ for input $x$.
$$\large y = \underset{y_i \in Y}{\text{argmax}}\ P(y_i|x) $$So how can we calculate all probabilities $P(y_i|x)$?
As the name suggests the Multinomial Naive Bayes classification algorithm utilizes Bayes' Theorem. Bayes' Theorem is a mathematical formula used to update probabilities based on new evidence. It helps answer the question: Given some prior knowledge, how should we update our beliefs when we receive new information? Using our notation as introduced above, the theorem is written as:
$$\large P(y_i|x) = \frac{P(x|y_i) P(y_i)}{P(x)} $$The MNB algorithm assumes a Bag-of-Words representation of document $x$ — each document is represented as the multiset (i.e., duplicates are preserved) of words contained in the document. We can therefore define $x = \{w_1, w_2, \dots, w_n \}$ as the multiset of words $w_1$ to $w_n$ appearing in document $x$. With this, we can rewrite the formula above as:
$$\large P(y_i|w_1, w_2, \dots, w_n) = \frac{P(w_1, w_2, \dots, w_n|y_i) P(y_i)}{P(w_1, w_2, \dots, w_n)} $$- $P(y_i|w_1, w_2, \dots, w_n)$ is the posterior, i.e., the probability of class $y_i$ given document $x = \{w_1, w_2, \dots, w_n \}$
- $P(w_1, w_2, \dots, w_n|y_i)$ is the likelihood, i.e., the probability of seeing $x$ given that it belongs to class $y_i$
- $P(w_1, w_2, \dots, w_n)$ is the marginal, i.e., the probability of seeing document $x$ regardless of its class
- $P(y_i)$ is the prior, i.e., the probability that a document belongs to class $y_i$ before seeing its content
Just by looking at this formula, it is not obvious what we have gained. On the contrary, we seemingly have replaced the calculation of one probability with the calculation of three probabilities. However, as we will see, we can now learn all required probabilities from the data. But before that, we can first simplify the formula and thus the calculation of the probabilities.
Ignoring the Marginals¶
Recall that the class we predict for an input document $x$ is the one with the highest probability $P(y_i|x)$. This implies that we are not really interested in the absolute values of $P(y_i|x)$; we only need to know which probability $P(y_i|x)$ is the largest among all classes. If you look at the formula for $P(y_i|w_1, w_2, \dots, w_n)$ above, you can see that the marginal $P(w_1, w_2, \dots, w_n)$ does not depend on the class $y_i$. In other words, the marginal has the same value for all calculations of $P(y_i|w_1, w_2, \dots, w_n)$, and we can treat it as a constant. We can therefore ignore the marginal since it does not affect the result of finding the class with the highest probability. Of course, ignoring the marginal no longer gives us the true probability $P(y_i|w_1, w_2, \dots, w_n)$, so we need to write
$$\large P(y_i|w_1, w_2, \dots, w_n) \propto P(w_1, w_2, \dots, w_n|y_i) P(y_i) $$for all classes $y_i$. This simplification does not require any assumption and does not lose any information needed to find the class with the highest posterior probability. While the prior $P(y_i)$ is very easy to calculate, as we will see in a bit, the likelihood is still very challenging.
The "Naive" Assumption¶
Let's first try to understand why calculating the likelihood probability $P(w_1, w_2, \dots, w_n|y_i)$ poses such a challenge. To do this, we can first apply the basic rule for conditional probabilities to get:
$$\large P(w_1, w_2, \dots, w_n|y_i) = \frac{P(w_1, w_2, \dots, w_n, y_i)}{P(y_i)} $$Again, $P(y_i)$ will be easy to calculate, but calculating the joint probability $P(w_1, w_2, \dots, w_n, y_i)$ causes problems in practice. To better see this, let's use the chain rule to express a joint probability in terms of only conditional probabilities:
$$\large P(w_1, w_2, \dots, w_n, y_i) = P(y_i)P(w_1|y_i)P(w_2|y_i,w_1)P(w_3|y_i, w_1, w_2)\dots P(w_n|y_i,w_1,\dots,w_{n-1}) $$One problem is computational complexity. Calculating joint probabilities by calculating all these conditional probabilities for all documents and classes quickly becomes computationally very expensive. This is particularly true if the number of features — the number of unique words/tokens (i.e., the vocabulary) in the context of document classification — increases. The second problem is data sparsity. In many real-world corpora, certain combinations of features (i.e., words/tokens) occur rarely or may not appear in the training data at all. This is particularly true for natural language text, which is highly variable and has a large vocabulary, resulting in many possible word combinations. Also, the sequence $w_1, w_2, \dots, w_n$ can be of arbitrary length, making the space of possible word sequences extremely large. This results in a high-dimensional problem where modeling the joint probability distribution is complex. During training, this means that many conditional probabilities will be very small. Since the likelihood requires the multiplication of $n$ conditional probabilities — with potentially very small values — this can cause problems with numerical stability due to the limited precision of machines and algorithms.
The MNB classification algorithm therefore makes the "naive" assumption that all features are conditionally independent given the class. Of course, this assumption does not strictly hold in practice when the features are words in a document. For example, when we see the word "York" in a sentence, it is very likely that we also see the word "New" in the same sentence. However, for document classification tasks, it turns out that assuming independence between words works well in practice, and the MNB algorithm often performs surprisingly well. For one, it still captures enough useful patterns in the data — that is, that individual words are often more associated with one class than others. Also, dependencies between words may not significantly harm the model's ability to discriminate between classes. So how does this assumption affect the calculation of the required probabilities?
$$\large P(y_i|w_1, w_2, \dots, w_n) \propto P(w_1, w_2, \dots, w_n|y_i) P(y_i) $$Assuming the independence between all words $w_1$, $w_2$, ..., $w_n$, we write this formula as
$$\large P(y_i|w_1, w_2, \dots, w_n) \propto P(w_1|y_i) P(w_2|y_i)\dots P(w_n|y_i) P(y_i) = P(y_i)\prod_{j=1}^{n}P(w_j|y_i) $$where we use $j$ to index the words of the document to avoid clashing with the class index $i$. In short, we no longer have to estimate any joint probabilities, which greatly reduces the issues concerning computational complexity and data sparsity.
Calculating Probabilities¶
The MNB classification algorithm uses Maximum Likelihood Estimation (MLE) to estimate the prior probabilities $P(y_i)$ and the conditional probabilities $P(w_i|y_i)$ from the training dataset. According to MLE, the estimate $\hat{P}(y_i)$ for the prior probability $P(y_i)$ is calculated as:
$$\large \hat{P}(y_i) = \frac{N_{y_i}}{N} = \frac{\text{number of documents of class }y_i}{\text{total number of documents}} $$For example, our example dataset contains 7 sentences, with 4 of them of class politics and 3 of them of class sports. With this, we can directly calculate our two estimates for both priors as:
$$\large \hat{P}(politics) = \frac{4}{7} \approx 57\% \quad \text{and}\quad \hat{P}(sports) = \frac{3}{7} \approx 43\% $$This means that if we had to classify an unseen sentence without actually knowing its content, we would predict its class to be politics, since $P(politics) > P(sports)$. Regarding the conditional probabilities, according to MLE, their estimates $\hat{P}(w_i|y_i)$ can be calculated using the following formula:
$$\large \hat{P}(w_i|y_i) = \frac{count(w_i,y_i)}{\sum_{w\in V}count(w, y_i)} $$where $count(w,y)$ is the number of occurrences of a word $w$ in documents of class $y$. For example, in our example dataset, the word "win" appears once in the 5th sentence and twice in the 6th sentence, both of class sports. Thus, $count(win, sports)=3$. The denominator of the estimate for the likelihood ensures that the result is a true probability. In short, training a MNB classifier for document classification boils down to counting the occurrences of each word contained in the training corpus. Performance considerations aside, this makes MNB a very simple classification algorithm that is easy to understand and implement.
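As a tiny sketch, we can confirm $count(win, sports)=3$ by counting over the (lowercased, punctuation-stripped) sports sentences from the table above:
# Count the occurrences of "win" across all sentences of class sports
sports_sentences = [
    "The goal of any team player is the win.",
    "A win for the team is a win for each player.",
    "Players vote other players for another term.",
]
# Lowercase, strip the final period, and count the word "win"
count_win_sports = sum(s.lower().replace(".", "").split().count("win") for s in sports_sentences)
print(count_win_sports)  # 3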
Zero Probabilities & Smoothing¶
Despite the naive assumption, in practice, we can still run into the problem of dealing with zero probabilities. Let's have another look at our formula for calculating the posterior probabilities as the product of the prior and the conditional probabilities representing the likelihoods:
$$\large P(y_i|w_1, w_2, \dots, w_n) \propto P(y_i)\prod_{i=1}^{n}P(w_i|y_i) $$In short, the posterior is calculated as the product of many probabilities. Let's assume we want to predict the class of the following example sentence "The mayor elected". This means we need to calculate $P(\text{"politics"}|\text{"the mayor elected"})$ and $P(\text{"sports"}|\text{"the mayor elected"})$. If we just focus on the latter we get:
$$\large P(\text{"sports"}|\text{"the mayor elected"}) \propto P(\text{"sports"})P(\text{"The"}|\text{"sports"})P(\text{"mayor"}|\text{"sports"})P(\text{"elected"}|\text{"sports"}) $$Now, have a look at our example dataset: none of the sentences of class sports contains the word "mayor". As such, $P(\text{"mayor"}|\text{"sports"}) = 0$. This means that the complete product will also be $0$, no matter the prior and the other conditional probabilities. Thus, for any document $x$ that contains the word "mayor", the posterior probability $P(\text{"sports"}|x)$ will always be $0$. Note that this implicitly assumes that no document of class sports will ever contain the word "mayor". In practice, this is a very unrealistic assumption, even for much larger datasets, due to data sparsity.
To avoid such zero probabilities, we need to ensure that $P(w_i|y_i) > 0$ for any word $w_i$ in our vocabulary and any class $y_i$. Smoothing accomplishes this by adding a positive non-zero value $k$ to all word counts $count(w,y)$:
$$\large \hat{P}(w_i|y_i) = \frac{count(w_i,y_i) + k}{\sum_{w\in V}count(w, y_i) + k|V|} $$where $|V|$ is the size of the vocabulary. Note that the added term $k|V|$ in the denominator ensures that the result remains a true probability. A straightforward choice is $k=1$, which is also called Laplace Smoothing. Intuitively, $k=1$ means that we act as if each word in our vocabulary had been seen one additional time in the documents of each class. However, $k$ does not have to be an integer, and values in the range $(0, 1]$ are common in practice. Any positive value of $k$ ensures that we do not get zero probabilities for any $\hat{P}(w_i|y_i)$.
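To see the effect of add-$k$ smoothing on such a zero count, here is a minimal sketch using values we will derive in the worked example below ($count(mayor, sports)=0$, 12 words of class sports in total, and $|V|=8$):
# Effect of add-k smoothing on a zero count (values taken from the worked example below)
count_mayor_sports = 0    # "mayor" never occurs in a sports sentence
total_sports_words = 12   # total number of words in all sports sentences
vocab_size = 8            # |V|
for k in [0, 0.5, 1]:
    p = (count_mayor_sports + k) / (total_sports_words + k * vocab_size)
    print(f"k={k}: P(mayor|sports) = {p:.4f}")
# k=0 yields 0.0, while k=1 yields 1/20 = 0.05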
In principle, we can also get zero probabilities for the priors $P(y_i)$. This can of course only happen if the dataset does not contain any document of a certain class. For example, if we wanted to predict the three classes politics, sports, and lifestyle, our example dataset contains samples only for the first two classes but none for lifestyle. This means that $P(\text{"lifestyle"}) = 0$. Again, we could use smoothing to prevent zero probabilities for all priors:
$$\large \hat{P}(y_i) = \frac{N_{y_i} + k}{N + k|Y|} $$where $|Y|$ is the number of classes. However, this is usually not done in practice since training a classifier using a dataset without any sample for one or more classes is very problematic and should not really be attempted in the first place. Thus, most existing implementations of the MNB classification algorithm assume that there are samples of each class and therefore only non-zero prior probabilities.
Log Probabilities¶
While smoothing ensures that none of the conditional probabilities $P(w_i|y_i)$ are $0$, those probabilities are typically still very small for real-world datasets. Particularly in the case of long documents, we need to multiply many very small probability values. On normal computers with their naturally limited precision, this can lead to arithmetic underflow, negatively affecting the numerical stability of the calculations. It is therefore common to calculate log probabilities instead of the raw probabilities. Since the logarithm is a monotonically increasing function, it does not affect the order of the final posterior probabilities. As such, we can write:
$$\large \log{P(y_i|w_1, w_2, \dots, w_n)} \propto \log{\left[ P(y_i)\prod_{j=1}^{n}P(w_j|y_i)\right]} = \log{ P(y_i)} + \sum_{j=1}^{n}\log{P(w_j|y_i)} $$where we applied the logarithm rule that $\log{(ab)} = \log{a} + \log{b}$. In simple terms, this transforms the product of probabilities into a sum of log probabilities, which provides a numerically stable calculation.
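The following small sketch illustrates the underflow problem and how log probabilities avoid it; the probability value 1e-5 and the document length of 100 words are arbitrary illustration choices:
# 100 word likelihoods of 1e-5 each; their true product is 1e-500
probs = np.full(100, 1e-5)
print(np.prod(probs))         # 0.0 -- underflows, since 1e-500 is below float64 precision
print(np.sum(np.log(probs)))  # approx. -1151.29 -- easily representable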
Worked Example¶
Let's go through a complete example with all the calculations and train a MNB classifier for our simple dataset for a binary classification task.
Data Preprocessing¶
The MNB classification algorithm only considers which words occur in a document and how often (but not their order) and therefore operates on the Bag-of-Words (BoW) representation of documents. With this focus on word occurrences, we often want to ignore the variation of words, including (a) whether verbs are in the present, past, or progressive form, (b) whether nouns are in singular or plural form, (c) whether adjectives are in their base, comparative, or superlative form, or (d) whether words are capitalized or not. Thus, it is very common to perform case-folding, and stemming or lemmatization as preprocessing steps for training a MNB classifier for document classification.
Furthermore, the class or topic of a document is arguably mainly expressed by nouns, verbs, and adjectives (and maybe adverbs). In contrast, function words such as determiners, prepositions, and pronouns, generally do not contribute to discriminating documents with respect to the class labels, since function words frequently appear in documents of all classes. It is therefore also common to perform stopword removal. This typically greatly reduces the total number of words in a document which reduces the number of required calculations.
The table below shows our initial example dataset, but now including the sentences (in BoW representation) after case-folding to lowercase, lemmatization, stopword removal, and punctuation mark removal.
| Sentence | Sentences (processed) | Class |
|---|---|---|
| The mayor was elected for this term and next term. | mayor elect term term | politics |
| A mayor's goal for the next term is to win. | mayor goal term win | politics |
| The goal for this term was to win the vote. | goal term win vote | politics |
| This term's goals are next term's goals. | term goal term goal | politics |
| The goal of any team player is the win. | goal team player win | sports |
| A win for the team is a win for each player. | win team win player | sports |
| Players vote other players for another term. | player vote player term | sports |
These processed sentences form the input of training a MNB classifier. From this dataset we can derive the vocabulary as the set of unique words as:
$$\large V = \{ mayor, elect, term, goal, win, vote, team, player \} $$giving us a vocabulary size of $|V| = 8$. In the code cell below, we define the training dataset X_train containing our seven preprocessed sentences, as well as the vector y_train of class labels. For the naming of both X_train and y_train, we adhere to common conventions, where $X$ generally denotes the samples with their features, and $y$ denotes the class labels. We also generate the vocabulary V from the dataset; its size $|V|$, which we will need later, can easily be obtained with len(V).
# Training data
X_train = np.asarray([
"mayor elect term term",
"mayor goal term win",
"goal term win vote",
"term goal term goal",
"goal team player win",
"win team win player",
"player vote player term"
])
# Labels for training data
y_train = np.asarray(["politics", "politics", "politics", "politics", "sports", "sports", "sports"])
# Extract vocabulary
V = set([word for doc in X_train for word in doc.split()])
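As a quick check, we can print the vocabulary and its size; for our seven preprocessed sentences, this should give the 8 words listed above.
# Inspect the extracted vocabulary and its size |V|
print(sorted(V))  # ['elect', 'goal', 'mayor', 'player', 'team', 'term', 'vote', 'win']
print(len(V))     # 8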
Calculating Priors¶
We already calculated the prior probabilities above as
$$\large \hat{P}(politics) = \frac{4}{7} \approx 57\% \quad \text{and}\quad \hat{P}(sports) = \frac{3}{7} \approx 43\% $$where $\hat{P}(politics) = \frac{4}{7}$ indicates that 4 of the 7 sentences are of class politics. The code cell below shows how we can calculate these values for our example dataset. The unique() function of NumPy finds the unique elements of an array. It returns the sorted unique values from the input array, optionally along with additional information such as the number of occurrences of each unique element when setting the optional parameter return_counts=True. In short, we can use this function to easily obtain $N_{politics}$ and $N_{sports}$, which are needed to calculate the prior probabilities.
def calculate_priors(y_train):
classes, counts = np.unique(y_train, return_counts=True)
return dict(zip(classes, counts/len(y_train)))
priors = calculate_priors(y_train)
print(priors)
{'politics': 0.5714285714285714, 'sports': 0.42857142857142855}
In short, we can summarize our result for the calculation of the prior probabilities as follows:
| $\large y_i$ | $\large P(y_i)$ |
|---|---|
| politics | $\large \frac{4}{7}$ |
| sports | $\large \frac{3}{7}$ |
Calculating Likelihoods¶
Recall that we calculate the likelihood probability for a word $w_i$ and a given class $y_i$ as:
$$\large \hat{P}(w_i|y_i) = \frac{count(w_i,y_i)}{\sum_{w\in V}count(w, y_i)} $$For an implementation, this means that we have to go through our training corpus and count (a) the number of occurrences of each word $w_i$ in all documents of class $y_i$, and (b) the total number of words in all documents of class $y_i$. The method calculate_counts() in the code cell below accomplishes this by iterating, for each class, over all documents (here: sentences) of that class and over all words in a document, counting the number of occurrences of each word. Notice that the method uses a special key __ALL__ for the count of all words in documents of the same class. This allows us to store all required counts in the same dictionary.
def calculate_counts(X_train, y_train):
# Get set of unique class labels
classes = np.unique(y_train)
# Initialize dictionary holding counts
counts = { label:defaultdict(int) for label in classes }
# Iterate over each class
for c in classes:
# Get the indices of all documents of that class
class_indices = np.where(y_train == c)[0]
# Iterate over all documents and words
for sentence in X_train[class_indices]:
for word in sentence.split():
counts[c]["__ALL__"] += 1
counts[c][word] += 1
return counts
Let's run this method to calculate all the counts and plot the result.
counts = calculate_counts(X_train, y_train)
print(json.dumps(counts, indent=2))
{
"politics": {
"__ALL__": 16,
"mayor": 2,
"elect": 1,
"term": 6,
"goal": 4,
"win": 2,
"vote": 1
},
"sports": {
"__ALL__": 12,
"goal": 1,
"team": 2,
"player": 4,
"win": 3,
"vote": 1,
"term": 1
}
}
Based on the output above, we can see that all documents of class politics contain a total of 16 words, while all documents of class sports contain a total of 12 words. We can also see that the word "win" appears twice in the documents of class politics and three times in documents of class sports. Notice that this dictionary only contains words with counts larger than $0$. For example, since no sports document contains the word "mayor", this word does not appear in the dictionary for class sports. Of course, this implicitly means that those counts are $0$. For easier visualization, we can also show the individual word counts using a table as shown below, now also including the zero counts.
| $\large w_i$ | $\large count(w_i, politics)$ | $\large count(w_i, sports)$ |
|---|---|---|
| elect | 1 | 0 |
| goal | 4 | 1 |
| mayor | 2 | 0 |
| player | 0 | 4 |
| team | 0 | 2 |
| term | 6 | 1 |
| vote | 1 | 1 |
| win | 2 | 3 |
In principle, we could already use these counts to make predictions since the conditional probabilities for the likelihoods are just ratios of those counts. However, let's explicitly calculate the likelihoods so we can see the actual values. The method calculate_likelihoods() below calculates all conditional probabilities based on the available counts. The method iterates over each class and first extracts the total number of words in all documents of that class from the counts dictionary. For each class, it then iterates over all words in the vocabulary to get the respective word counts. It finally calculates the likelihood by dividing the word count by the total word count — including the additive smoothing using $k$ — as defined by the formula above.
def calculate_likelihoods(counts, vocab, k=0):
# Initialize dictionary
likelihoods = { label:{} for label in counts.keys() }
# Calculate likelihoods based on formula
for label in counts.keys():
# Get number of words in all documents of current class
total_word_count = counts[label]["__ALL__"]
for word in vocab:
# Get the count for the current word
word_count = counts[label][word]
# Calculate likelihood for current word and store in dictionary
likelihoods[label][word] = (word_count + k) / (total_word_count + k*len(vocab))
# Return the dictionary of likelihoods
return likelihoods
As a side note about this implementation: since the counts dictionary contains a defaultdict for each class label, the line word_count = counts[label][word] will return $0$ if no count for word exists for this class — instead of throwing an error as a normal Python dictionary would. This is merely to simplify the required code.
Let's run the method calculate_likelihoods() using the calculated counts and $k=1$ to see the resulting likelihoods. Of course, you can change the value of smoothing parameter $k$ to see its effects on the values. However, we assume $k=1$ for later comparisons.
likelihoods = calculate_likelihoods(counts, V, k=1)
print(json.dumps(likelihoods, indent=2))
{
"politics": {
"vote": 0.08333333333333333,
"mayor": 0.125,
"term": 0.2916666666666667,
"goal": 0.20833333333333334,
"elect": 0.08333333333333333,
"win": 0.125,
"team": 0.041666666666666664,
"player": 0.041666666666666664
},
"sports": {
"vote": 0.1,
"mayor": 0.05,
"term": 0.1,
"goal": 0.1,
"elect": 0.05,
"win": 0.2,
"team": 0.15,
"player": 0.25
}
}
To better appreciate what is going on, the table below shows all the individual calculations and results. For example, the likelihood $P(win|politics)$ in the last row of the table with
$$\large P(win|politics) = \frac{2+1}{16+8} = \frac{3}{24} = 0.125 $$shows that the word "win" appears 2x in all documents of class politics, and that there are a total of 16 words in all documents of class politics. The $1$ represents our smoothing constant $k=1$, and the $8$ represents the constant $k|V| = 1\cdot 8 = 8$, which normalizes the probabilities with respect to the smoothing value $k$.
| $\large w_i$ | $\large P(w_i \mid politics)$ | $\large P(w_i \mid sports)$ |
|---|---|---|
| elect | $\large \frac{1+1}{16+8} = \frac{2}{24}$ | $\large \frac{0+1}{12+8} = \frac{1}{20}$ |
| goal | $\large \frac{4+1}{16+8} = \frac{5}{24}$ | $\large \frac{1+1}{12+8} = \frac{2}{20}$ |
| mayor | $\large \frac{2+1}{16+8} = \frac{3}{24}$ | $\large \frac{0+1}{12+8} = \frac{1}{20}$ |
| player | $\large \frac{0+1}{16+8} = \frac{1}{24}$ | $\large \frac{4+1}{12+8} = \frac{5}{20}$ |
| team | $\large \frac{0+1}{16+8} = \frac{1}{24}$ | $\large \frac{2+1}{12+8} = \frac{3}{20}$ |
| term | $\large \frac{6+1}{16+8} = \frac{7}{24}$ | $\large \frac{1+1}{12+8} = \frac{2}{20}$ |
| vote | $\large \frac{1+1}{16+8} = \frac{2}{24}$ | $\large \frac{1+1}{12+8} = \frac{2}{20}$ |
| win | $\large \frac{2+1}{16+8} = \frac{3}{24}$ | $\large \frac{3+1}{12+8} = \frac{4}{20}$ |
Having calculated all priors and likelihoods, our MNB classifier is done training.
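As a quick sanity check, the smoothed likelihoods of each class should sum to $1$, since they form a proper probability distribution over the vocabulary:
# Sanity check: for each class, the likelihoods over the vocabulary sum to 1
for label in likelihoods:
    print(label, sum(likelihoods[label].values()))  # 1.0 for both classes (up to floating-point error)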
Making Predictions¶
Now that we have a trained MNB classifier, we naturally want to use it to predict the classes of new unseen documents. To show an example, let's consider the following sentence for which we want to predict the class. The table below shows the original sentence as well as the sentence after applying the same preprocessing steps as for the training data (case-folding, lemmatization, stopword & punctuation mark removal).
| Sentence | Sentences (processed) | Class |
|---|---|---|
| The mayor and his team have another term next year. | mayor team term year | ??? |
Using our learned priors and likelihoods, we can now calculate both $P(politics|mayor, team, term)$ and $P(sports|mayor, team, term)$ as shown below to see which class has the highest posterior probability (at least proportionally).
$$ \begin{align} \large P(politics|mayor, team, term)\ &\large\propto P(politics)P(mayor|politics)P(team|politics)P(term|politics)\\[1em] \large P(sports|mayor, team, term)\ &\large\propto P(sports)P(mayor|sports)P(team|sports)P(term|sports) \end{align} $$Important: Notice that the word "year" does not appear in any calculation, since "year" did not appear at all in our training corpus and was therefore also not in our vocabulary. Any word that was not in the vocabulary of the training data will be completely ignored by the MNB classifier when making predictions.
To get the final posterior probabilities, let's just plug in the values for the priors and likelihoods:
$$ \begin{align} \large P(politics|mayor, team, term)\ &\large\propto \frac{4}{7}\cdot \frac{3}{24}\cdot \frac{1}{24}\cdot \frac{7}{24} = 0.000868...\\[1em] \large P(sports|mayor, team, term)\ &\large\propto \frac{3}{7}\cdot \frac{1}{20}\cdot \frac{3}{20}\cdot \frac{2}{20} = 0.000321... \end{align} $$Since $P(politics|mayor, team, term)$ has the highest probability among all classes, the predicted class for the document is politics.
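Before implementing a general predict() method, we can quickly verify these two numbers with a few lines of Python:
# Verify the two (unnormalized) posterior values calculated by hand
p_politics = (4/7) * (3/24) * (1/24) * (7/24)
p_sports   = (3/7) * (1/20) * (3/20) * (2/20)
print(p_politics)  # approx. 0.000868
print(p_sports)    # approx. 0.000321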
We can implement the calculations required to predict the class for a given document as another method. The method predict() in the code cell below takes the document as a string as well as the learned priors and likelihoods. For each class label, it then calculates the posterior probability as shown above. For this, it utilizes the representation of the priors and likelihoods as dictionaries. Notice how the check if word in likelihoods[label] ensures that we only consider the likelihoods of words that are in our training vocabulary.
def predict(doc, priors, likelihoods, verbose=False):
# Initialize dictionary holding the posterior probabilities for all classes
posteriors = { label:0 for label in priors.keys() }
for label in priors.keys():
# Find all likelihood probabilties for the given class label and words in the document
likelihood_probs = np.asarray([ likelihoods[label][word] for word in doc.split() if word in likelihoods[label] ])
# Calculate posterior by multiplying prior and all likelihood probabilities
posteriors[label] = priors[label] * np.prod(likelihood_probs)
# Show posterior probabilities if specified
if verbose == True: print(f"Posteriors: {posteriors}")
# Return the label (key of dictionary) with the maximum posterior (value of dictionary)
return max(posteriors, key=posteriors.get)
We can now use the predict() method to predict the class for a document. By default, the code cell below predicts the class for our (processed) example sentence "mayor team term" (with the out-of-vocabulary word "year" already dropped), but you can try out other sentences to see how they affect the prediction results. For the given example sentence and calling the method with verbose=True, you should see the same posterior probabilities as manually calculated above.
doc = "mayor team term"
#doc = "player team term"
#doc = "player team term elect"
#doc = "player team term term"
#doc = "player team term term term"
#doc = "year match group victory"
predicted_class = predict(doc, priors, likelihoods, verbose=True)
print(f"The predicted class for the sentence '{doc}' is: {predicted_class}")
Posteriors: {'politics': 0.0008680555555555555, 'sports': 0.0003214285714285714}
The predicted class for the sentence 'mayor team term' is: politics
Notice that the last example sentence "year match group victory" in the code cell above does not contain any word that was in our vocabulary of the training dataset. This means that we have no likelihood values for any of the words in this sentence, and the prediction of the class relies solely on the prior probabilities. Thus, if you run the code cell with this example sentence, you will see that the posterior probabilities are in fact the prior probabilities. And since politics is the dominating class in our example dataset, we would predict politics for documents containing only unknown words.
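Here is a minimal check of this behavior using our predict() method and the priors and likelihoods computed above:
# A document containing only out-of-vocabulary words falls back to the priors
oov_prediction = predict("year match group victory", priors, likelihoods, verbose=True)
print(oov_prediction)  # politics -- the class with the larger prior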
Practical Application¶
The code we have used so far for training a MNB classifier and making predictions focused on understanding the underlying calculations. The code also assumed text documents as input. In practice, the MNB classification algorithm is not limited to classifying text documents but is applicable to any kind of multinomial data where a data sample (e.g., a text document) is described by the presence or absence of nominal features (e.g., words) and their number of occurrences. Furthermore, for practical applications, optimizing the runtime performance is an important consideration.
The MultinomialNB class in the scikit-learn library implements the Multinomial Naive Bayes algorithm. It is designed for data that can be represented as frequency counts, such as word occurrences in a document. The algorithm assumes that the features (e.g., words) are conditionally independent given the class and follow a multinomial distribution, which models the probability of observing specific frequencies of events. Key hyperparameters include alpha, which controls the level of smoothing (what we called $k$), and options controlling the class priors. The fit() method trains the model using labeled data, while methods like predict() and predict_proba() allow for making predictions and obtaining probability estimates for each class.
Let's see how we can use the MultinomialNB class to train a classifier over our example corpus.
Creating a Feature Matrix¶
The MultinomialNB class does not directly take text documents as input. The expected input format consists of two main components: the feature matrix X and the target vector y. The latter should be a 1-dimensional array-like structure containing the class labels for each sample in $X$. The class labels can be integers or strings, representing distinct categories for classification. We can therefore directly use y_train as it meets all the requirements.
The feature matrix X should be a 2-dimensional array-like structure, where rows represent samples and columns represent features. The values in X must be non-negative, as the algorithm is designed to work with counts or frequencies, such as word occurrences in text classification tasks. Each row in X corresponds to a single sample, and each column represents a specific feature (e.g., a word in a vocabulary). The values in the matrix are typically integer counts or term frequencies, but they can also be non-negative real numbers if the features have been normalized (e.g., using TF-IDF). We therefore first need to convert our example dataset containing the seven sentences into this matrix representation.
The CountVectorizer class in scikit-learn is a feature extraction tool used to convert a collection of text documents into a matrix of token counts. It is commonly used in natural language processing (NLP) tasks to prepare text data for machine learning models by representing text as numerical data. Each row of the resulting matrix corresponds to a document, while each column represents a token (usually a word or a sequence of words). The value at a specific cell in the matrix indicates the frequency of the token in the document. The code cell below uses the CountVectorizer class to calculate all the word counts for our example dataset. To visualize the matrix, we convert it into a Pandas DataFrame. Notice that we transpose the matrix to get the more common Term-Document Matrix, where the rows represent the terms/words and the columns represent the documents.
count_vectorizer = CountVectorizer()
X_train_vectorized = count_vectorizer.fit_transform(X_train)
# Convert to pandas dataframe -- just for a nice visualization
df_tdm = pd.DataFrame(X_train_vectorized.A.T, columns=[ "d{}".format(d+1) for d in range(len(X_train)) ])
df_tdm = df_tdm.set_index(pd.Index(count_vectorizer.get_feature_names_out()))
df_tdm
| d1 | d2 | d3 | d4 | d5 | d6 | d7 | |
|---|---|---|---|---|---|---|---|
| elect | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| goal | 0 | 1 | 1 | 2 | 1 | 0 | 0 |
| mayor | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| player | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| team | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| term | 2 | 1 | 1 | 2 | 0 | 0 | 1 |
| vote | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| win | 0 | 1 | 1 | 0 | 1 | 2 | 0 |
For example, the "$1$" in the top-left corner of the Term-Document Matrix above indicates that the word "elect" appears once in document $d1$. Let's visualize the Term-Document Matrix again, but now also highlighting which documents belong to politics (light blue) and which belong to sports (light orange).
| d1 | d2 | d3 | d4 | d5 | d6 | d7 | |
|---|---|---|---|---|---|---|---|
| elect | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| goal | 0 | 1 | 1 | 2 | 1 | 0 | 0 |
| mayor | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| player | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| team | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| term | 2 | 1 | 1 | 2 | 0 | 0 | 1 |
| vote | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| win | 0 | 1 | 1 | 0 | 1 | 2 | 0 |
Appreciate how this Term-Document Matrix contains all the values needed to calculate our $count(w,y)$ values. For example, we can calculate $count(win, politics)$ by summing up all the counts in the row for the word "win" over the documents of class politics; this gives us $count(win, politics) = 2$. Similarly, we can calculate $count(win, sports)$ by summing up all the counts in the row for the word "win" over the documents of class sports; this gives us $count(win, sports) = 3$. The Term-Document Matrix below highlights the respective row and counts.
| d1 | d2 | d3 | d4 | d5 | d6 | d7 | |
|---|---|---|---|---|---|---|---|
| elect | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| goal | 0 | 1 | 1 | 2 | 1 | 0 | 0 |
| mayor | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| player | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
| team | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| term | 2 | 1 | 1 | 2 | 0 | 0 | 1 |
| vote | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| win | 0 | 1 | 1 | 0 | 1 | 2 | 0 |
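As a quick cross-check, we can reproduce these class-conditional word counts directly from X_train_vectorized by summing, for each class, the rows of the documents belonging to that class:
# Reproduce count(w, y) by summing the term counts of all documents of each class
X_counts = X_train_vectorized.toarray()
terms = count_vectorizer.get_feature_names_out()
for label in np.unique(y_train):
    class_counts = X_counts[y_train == label].sum(axis=0)
    print(label, dict(zip(terms, class_counts.tolist())))
# politics: e.g., term=6, goal=4, win=2; sports: e.g., player=4, win=3, term=1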
Training the Classifier¶
With X_train_vectorized and y_train, we have the two inputs in the proper format to train a MNB classifier using the MultinomialNB class. As with most model implementations provided by scikit-learn, the actual training requires only a single line of code, as shown in the code cell below.
mnb_classifier = MultinomialNB().fit(X_train_vectorized, y_train)
Now that the model is trained, let's see if it matches our result. Note that by default, MultinomialNB uses Laplace Smoothing — $k=1$ or alpha=1 — so we should see the same prior and likelihood probabilities as before when training the classifier "manually". The classes_ attribute of the MultinomialNB class is a list or array that contains the unique class labels identified during training. The attribute is automatically set when the model is trained using the fit() method. The class_log_prior_ attribute, on the other hand, stores the log of the prior probabilities for each class.
We can therefore use both attributes to print the two prior probabilities for our two classes politics and sports. Notice that we need to use the np.exp() method to convert the log probabilities back to "normal" probabilities to compare them with our earlier results.
mnb_classifier_prior = dict(zip(mnb_classifier.classes_, np.exp(mnb_classifier.class_log_prior_)))
print(mnb_classifier_prior)
{'politics': 0.5714285714285714, 'sports': 0.42857142857142866}
Unsurprisingly, the prior probabilities from the MultinomialNB match the ones we got from our own calculations. We can now also do the same check for the likelihood probabilities. The feature_log_prob_ attribute of the MultinomialNB class contains the logarithm of the conditional likelihood probabilities of each feature given each class, i.e., $\log{P(x_i|y_i)}$, where $x_i$ is a feature (e.g., a word, which we previously denoted as $w_i$). This attribute is stored as a 2D array where each row corresponds to a class and each column corresponds to a feature. In the code cell below, we use the feature_log_prob_ attribute to extract all probabilities and connect them to the feature names, i.e., the words, using the vectorizer. The specific implementation ensures that the output format of all probabilities matches the format we have used above for easy comparison.
mnb_classifier_likelihoods = { label:{} for label in mnb_classifier.classes_ }
for i, label in enumerate(mnb_classifier.classes_):
    # Map each feature name (word) to the exponentiated log probability for this class
    mnb_classifier_likelihoods[label] = dict(zip(count_vectorizer.get_feature_names_out(),
                                                 np.exp(mnb_classifier.feature_log_prob_[i])))
print(json.dumps(mnb_classifier_likelihoods, indent=2))
{
"politics": {
"elect": 0.08333333333333333,
"goal": 0.2083333333333333,
"mayor": 0.12499999999999997,
"player": 0.041666666666666664,
"team": 0.041666666666666664,
"term": 0.29166666666666663,
"vote": 0.08333333333333333,
"win": 0.12499999999999997
},
"sports": {
"elect": 0.05000000000000001,
"goal": 0.10000000000000002,
"mayor": 0.05000000000000001,
"player": 0.25,
"team": 0.15000000000000002,
"term": 0.10000000000000002,
"vote": 0.10000000000000002,
"win": 0.2
}
}
Again, as expected the probabilities are the same as the ones from our own calculations.
Making Predictions¶
The predict() method of the MultinomialNB class predicts the class labels for a given set of input samples. After the model has been trained using the fit() method, the predict() method takes as input a feature matrix X, where each row represents a sample and each column represents a feature, just like for training. It returns an array of predicted class labels, one for each sample, based on the model's learned parameters (e.g., class_log_prior_ and feature_log_prob_). This means that we first have to transform any input document into the data matrix X using the same trained CountVectorizer instance.
The code cell below contains the same example documents/sentences as before so you can check whether the predicted class labels match our expectations.
doc = "mayor team term"
#doc = "player team term"
#doc = "player team term elect"
#doc = "player team term term"
#doc = "player team term term"
#doc = "year match group victory"
# Convert sentence into data matrix (containing just this one sample)
doc_vectorized = count_vectorizer.transform([doc])
print(f"The predicted class for the sentence '{doc}' is: {mnb_classifier.predict(doc_vectorized)[0]}")
The predicted class for the sentence 'mayor team term' is: politics
Now we have seen how we can train and use a MNB classifier in practice using a well-established implementation of the algorithm from a popular library. Not only do these libraries contain tried and tested code, they are typically also optimized to maximize runtime performance.
Discussion & Considerations¶
The Multinomial Naive Bayes (MNB) classifier is a simple and efficient classification model, which can be very effective for handling text data for certain tasks. The MNB algorithm is computationally efficient, as it requires minimal resources to train and classify documents compared to more complex models. Its simplicity also ensures quick implementation and interpretability, which is useful for understanding how the model makes predictions. Another advantage is its robustness when working with sparse data, a common characteristic of text datasets. Since most documents contain only a small subset of the vocabulary, the classifier is designed to handle the sparsity of feature vectors effectively. While it may not capture complex relationships between words, its performance is often competitive on simpler tasks, especially when combined with techniques like n-grams or proper text preprocessing. In the following, we briefly discuss its limitations and potential mitigation strategies.
Independence Assumption¶
The independence assumption in the multinomial Naive Bayes (MNB) classifier for document classification states that the features (e.g., words in a document) are conditionally independent given the class label. In simpler terms, it assumes that the occurrence of one word in a document does not depend on the occurrence of any other word, once the document's class is known. For example, if a document is labeled as sports, the model assumes that knowing that the word "football" appears in the document provides no additional information about the likelihood of seeing the word "team" beyond the fact that the document is already classified as sports. Although this assumption does not, strictly speaking, hold, it is often acceptable in practice for multiple reasons:
High dimensionality of text data: Text data is inherently high-dimensional, meaning documents typically contain many unique words. In such cases, exact dependencies between words are hard to model without overfitting. The independence assumption simplifies this complexity, making the model computationally efficient while still performing reasonably well.
Class conditional word distribution: In document classification, the most important factor is the relative frequency of words in different classes. The multinomial Naive Bayes model captures this effectively. Even if the independence assumption is not strictly true, the model often succeeds because it focuses on capturing broad patterns rather than exact word relationships.
Robustness in real-world applications: Empirical studies show that the multinomial Naive Bayes classifier performs surprisingly well for tasks like spam filtering and sentiment analysis, despite its naive assumption. This is because in many text classification problems, the true dependencies between words do not critically affect classification accuracy.
This independence assumption has several advantages. Firstly, it greatly simplifies the calculations, as the model does not need to estimate the joint probabilities of all word combinations. This makes it fast and scalable to large datasets. Secondly, the simplicity of the model makes it easy to understand, implement, and train, even for non-experts. And lastly, in text classification, where most documents only contain a small subset of all possible words, the MNB model performs well because it focuses on capturing the most informative features.
However, the independence assumption may also result in an oversimplification of word dependencies. Words in natural language often have dependencies (e.g., "New" and "York" frequently occur together). Ignoring these relationships can result in a loss of information, potentially reducing accuracy. For tasks where the interaction between words is crucial (e.g., sarcasm detection, parsing ambiguous sentences, or capturing negation for sentiment analysis), the independence assumption can significantly limit the model's ability to understand context. Also, since MNB does not model word dependencies, the choice of features (e.g., unigrams, bigrams — see below) becomes critical. Poor feature selection can lead to degraded performance.
In short, while the independence assumption in multinomial Naive Bayes is a simplification of reality, it strikes a balance between computational efficiency and classification accuracy, especially in high-dimensional and sparse text datasets. However, for more complex tasks requiring a nuanced understanding of language, advanced models like logistic regression or neural networks might be more suitable. Despite its limitations, the MNB classifier remains a popular choice for text classification tasks due to its simplicity, speed, and robustness in many practical scenarios.
Beyond Unigrams¶
So far, in all examples for using a MNB classifier to train a document classification model, we considered features to be individual words (i.e., unigrams). A common use case where this might negatively affect the model's performance is sentiment analysis. Consider the two simple sentences:
- "The movie was not boring because the cast was great." (positive sentiment)
- "The movie was boring because the cast was not great." (negative sentiment)
When using only unigrams, both sentences would yield the exact same posterior probabilities — since both sentences contain exactly the same words — and would therefore yield the same prediction. We would therefore like to somehow capture the connections within phrases such as "not boring" and "not great". Fortunately, the MNB classification algorithm is completely agnostic about the "nature" of the nominal features. This means that, instead of limiting ourselves to words/unigrams, we can use n-grams as features, which only affects the preprocessing of the dataset and not the classifier itself, as the short sketch below illustrates.
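Here is a small sketch of this idea: using CountVectorizer with ngram_range=(1, 2) on the two sentiment sentences above, the bigrams "not boring" and "not great" become separate features that distinguish the two otherwise identical bags of words:
# Unigrams + bigrams make the two sentiment sentences distinguishable
sentiment_sentences = [
    "The movie was not boring because the cast was great.",
    "The movie was boring because the cast was not great.",
]
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
X_sentiment = bigram_vectorizer.fit_transform(sentiment_sentences)
features = set(bigram_vectorizer.get_feature_names_out())
print("not boring" in features)  # True -- occurs only in the first sentence
print("not great" in features)   # True -- occurs only in the second sentence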
To illustrate this, let's look again at our example dataset containing seven sentences. However, notice that no stopword removal has been performed. Stopwords are typically not removed when extracting n-grams of size 2 or larger because they often play an important role in preserving the context and meaning of phrases. In n-grams, especially bigrams or trigrams, stopwords (e.g., "is", "in", "of") can form part of meaningful expressions that provide critical information for text analysis. For example, in phrases like "out of stock" or "state of the art", removing stop words would break these n-grams and lose their semantic value. These phrases are often highly predictive features in tasks like sentiment analysis or topic classification. Additionally, when working with n-grams, the inclusion of stopwords helps maintain the natural structure of the text, allowing the model to capture more realistic language patterns. Removing stop words could lead to incomplete or nonsensical n-grams that may be less useful as features. However, this decision depends on the specific task and dataset—if stopwords are highly frequent and do not contribute meaningful distinctions between classes, their removal might still be considered to reduce noise and dimensionality.
# Training data
sentences = np.asarray([
"the mayor was elected for this term and next term",
"a mayor's goal for the next term is to win",
"the goal for this term was to win the vote",
"this term's goals are next term's goals",
"the goal of any team player is the win",
"a win for the team is a win for each player",
"players vote other players for another term"
])
Like before, we use the CountVectorizer class to convert our training dataset into a valid input for the MultinomialNB classifier implementation — but now going beyond unigrams. The ngram_range parameter of the CountVectorizer class specifies the range of n-grams to be extracted from the input text. It is a tuple (min_n, max_n) where min_n represents the minimum size of the n-grams and max_n represents the maximum size. For example, the default setting ngram_range=(1, 1) will extract only unigrams (single words), while ngram_range=(1, 2) will extract both unigrams and bigrams (two-word sequences). Similarly, ngram_range=(2, 3) will extract bigrams and trigrams. In the code cell below, we use ngram_range=(1, 3) to extract all unigrams, bigrams, and trigrams.
count_vectorizer_unibitrigrams = CountVectorizer(ngram_range=(1,3))
X_train_vectorized_unibitrigrams = count_vectorizer_unibitrigrams.fit_transform(sentences)
# Convert to pandas dataframe -- just for a nice visualization
df_tdm_unibitrigrams = pd.DataFrame(X_train_vectorized_unibitrigrams.A.T, columns=[ "d{}".format(d+1) for d in range(len(X_train)) ])
df_tdm_unibitrigrams = df_tdm_unibitrigrams.set_index(pd.Index(count_vectorizer_unibitrigrams.get_feature_names_out()))
df_tdm_unibitrigrams
| d1 | d2 | d3 | d4 | d5 | d6 | d7 | |
|---|---|---|---|---|---|---|---|
| and | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| and next | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| and next term | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| another | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| another term | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| win for | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| win for each | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| win for the | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| win the | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| win the vote | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
113 rows × 7 columns
Recall that we had only 8 features when considering unigrams alone, as this was the size of our vocabulary — to be fair, this was without stopwords. Now, with all unigrams, bigrams, and trigrams, we have 113 features in total. This has both advantages and disadvantages, and the choice of n-gram size typically depends on the specific problem, dataset, and the balance between model complexity and performance. The main advantages of using larger n-grams are:
Improved contextual representation: Larger n-grams (bigrams, trigrams) capture the sequential context and relationships between words that unigrams might miss. For example, in sentiment analysis, a bigram like "not good" or a trigram like "not at all helpful" provides clear sentiment cues that the individual words ("not", "good", "helpful") do not convey in isolation. This improved representation often leads to more accurate predictions in tasks where word order matters.
Disambiguation of meaning: Larger n-grams can help disambiguate words that might have multiple meanings in different contexts. For example, "New York" as a bigram is distinct from the individual words "new" and "York". This is particularly useful in tasks where preserving the meaning of multi-word expressions is critical.
Higher predictive power: Since larger n-grams encode specific phrases and patterns, they often have higher discriminative power for classifying documents into distinct categories. For instance, phrases like "breaking news" or "artificial intelligence" are strong indicators of certain topics.
On the flip side, considering larger n-grams often poses challenges in practice:
Increased dimensionality: Including larger n-grams significantly expands the feature space, as the number of possible n-grams grows exponentially with the size of the vocabulary and the n-gram length. This increased dimensionality can lead to higher computational costs, longer training times, and potentially memory issues, especially for large datasets.
Sparsity of data: Larger n-grams tend to appear less frequently in documents, resulting in sparse feature representations. Sparsity can reduce the effectiveness of the MNB classifier, as it relies on sufficient occurrences of features to estimate probabilities reliably. Sparse features may also contribute little to classification and may need pruning.
Risk of overfitting: When working with small datasets, larger n-grams can lead to overfitting, where the model memorizes specific patterns or phrases that are unique to the training set but do not generalize well to unseen data. This can harm the classifier's performance on new examples.
In short, the choice of n-gram size requires balancing these factors. In practice, using a combination of n-gram sizes (e.g., unigrams and bigrams, or unigrams through trigrams) often yields good results by capturing both individual terms and meaningful word sequences. Additionally, techniques like feature selection (e.g., keeping only the top-k most frequent n-grams), regularization, and dimensionality reduction can mitigate the downsides of larger n-grams while preserving their benefits.
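As a rough illustration of these trade-offs, the short sketch below (not from the original notebook, reusing the sentences array defined above) prints the vocabulary size for a few ngram_range settings and shows how the pruning options mentioned above, here the min_df and max_features parameters of CountVectorizer with arbitrarily chosen values, shrink the feature space.
# Illustrative sketch: feature-space growth with larger n-grams and simple pruning
from sklearn.feature_extraction.text import CountVectorizer

# Vocabulary size for increasing n-gram ranges (uses the sentences array from above)
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    vocab_size = len(CountVectorizer(ngram_range=ngram_range).fit(sentences).vocabulary_)
    print("ngram_range={}: {} features".format(ngram_range, vocab_size))

# Keep only n-grams occurring in at least 2 sentences, capped at the 20 most frequent
pruned_vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=2, max_features=20)
print("pruned: {} features".format(len(pruned_vectorizer.fit(sentences).vocabulary_)))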
Beyond Simple Counts¶
The MNB classification algorithm assumes that the features (e.g., word counts) are generated from a multinomial distribution and interprets them as discrete probabilities. With the Term-Document Matrix as the input feature matrix, this means that all values are integers representing word counts; hence the use of the CountVectorizer class in the previous examples. However, the learning algorithm itself does not require the values in the feature matrix to be integers and, in fact, accepts any kind of non-negative numerical values. This includes, for example, the Term-Document Matrix with TF-IDF (Term Frequency-Inverse Document Frequency) weights. The intuition behind using TF-IDF weights instead of simple counts lies in emphasizing the importance of words that are more informative and less frequent across the corpus. While word counts simply reflect how often a word appears in a document, they do not account for the fact that some words, such as stopwords ("the", "is", "and"), are very common and may not contribute much to distinguishing between classes, which is why stopwords are typically removed when working with unigrams only.
TF-IDF weights aim to strike a balance between two factors. First, the Term Frequency (TF) captures how frequently a word appears in a document; higher frequency implies greater relevance within that document. This corresponds to the basic word counts, although some implementations of TF-IDF apply a monotonic function such as the logarithm to scale this value. Second, the Inverse Document Frequency (IDF) measures how unique a word is across the corpus. Words that appear in many documents (e.g., "show", "step", "good") get lower weights, while rarer, more class-specific words (e.g., "quantum", "electron", or "gravity" in a physics-related dataset) are given higher weights. Very common stopwords such as "the", "is", and "and" that appear in virtually all documents will, under the standard definition, receive a TF-IDF weight of $0$ (scikit-learn's smoothed variant keeps a small positive weight instead; see the formula sketch after the list below). By combining these two measures, TF-IDF reduces the impact of common words that provide little discriminatory power and highlights words that are more likely to indicate the document's class. Using these weights when training an MNB classification model can offer several advantages:
Improved feature discrimination: With TF-IDF, the model gives more weight to words that are distinctive and less weight to ubiquitous terms that occur frequently but are less meaningful for classification. For example, in a spam detection task, the word "offer" might be more indicative of spam compared to "provide" which occurs in both spam and non-spam emails.
Reduced noise: Simple counts can inflate the importance of frequent but uninformative words, leading to noisy features that degrade the classifier's performance. TF-IDF mitigates this by penalizing such terms through the IDF component, making the feature set more meaningful for classification. This includes bigrams, trigrams, or generally n-grams beyond unigrams.
Balanced feature contribution: TF-IDF helps avoid domination by long documents that naturally contain more word occurrences. By normalizing term weights, TF-IDF ensures that each document's feature vector is better scaled, improving the fairness of the classification.
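For reference, and since the two factors above are easy to state formally, the textbook form of the TF-IDF weight of a term $t$ in document $d$ over a corpus of $N$ documents is shown below, together with the smoothed IDF used by scikit-learn's TfidfVectorizer (which additionally L2-normalizes each document vector by default):
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}, \qquad \text{idf}_{\text{smooth}}(t) = \log\frac{1 + N}{1 + \text{df}(t)} + 1$$
Here, $\text{df}(t)$ is the number of documents containing $t$. Under the textbook definition, a term with $\text{df}(t) = N$ receives an IDF, and hence a TF-IDF weight, of $0$; under the smoothed variant it keeps a small positive weight.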
Despite these benefits, TF-IDF is not perfectly aligned with the assumptions of the MNB model: TF-IDF produces continuous, real-valued weights, which do not strictly satisfy the assumption of features generated from a multinomial distribution. While TF-IDF can still work well in practice with MNB, its use introduces a minor mismatch between the data representation and the model's theoretical foundation. In practice, TF-IDF is particularly useful when dealing with text datasets that contain many common or irrelevant words and where the discriminatory power of individual terms needs to be highlighted. For datasets with highly distinctive terms or very short texts, simple counts may suffice. Overall, TF-IDF can be a valid choice for improving the quality of feature representations and boosting the performance of the MNB algorithm for document classification tasks.
With existing libraries, using TF-IDF weights instead of simple word counts is very straightforward. For example, apart from the CountVectorizer class, the scikit-learn library also provides a TfidfVectorizer class to calculate the TF-IDF weights for a given training dataset of text documents. Again, this class has an input parameter ngram_range to support n-grams of various sizes (see above). As a result, using TF-IDF weights instead of simple word counts requires changing only a single line, namely the first line of code in the code cell below, compared to our earlier example using the CountVectorizer.
tfidf_vectorizer_unibitrigrams = TfidfVectorizer(ngram_range=(1,3))
X_train_vectorized_unibitrigrams = tfidf_vectorizer_unibitrigrams.fit_transform(sentences)
# Convert to pandas dataframe -- just for a nice visualization
df_tdm_unibitrigrams = pd.DataFrame(X_train_vectorized_unibitrigrams.toarray().T, columns=[ "d{}".format(d+1) for d in range(len(sentences)) ])
df_tdm_unibitrigrams = df_tdm_unibitrigrams.set_index(pd.Index(tfidf_vectorizer_unibitrigrams.get_feature_names_out()))
df_tdm_unibitrigrams
| | d1 | d2 | d3 | d4 | d5 | d6 | d7 |
|---|---|---|---|---|---|---|---|
| and | 0.21558 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| and next | 0.21558 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| and next term | 0.21558 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| another | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.233945 |
| another term | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.233945 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| win for | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.421222 | 0.000000 |
| win for each | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.210611 | 0.000000 |
| win for the | 0.00000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.210611 | 0.000000 |
| win the | 0.00000 | 0.0 | 0.222777 | 0.0 | 0.0 | 0.000000 | 0.000000 |
| win the vote | 0.00000 | 0.0 | 0.222777 | 0.0 | 0.0 | 0.000000 | 0.000000 |
113 rows × 7 columns
With the same input parameter ngram_range=(1,3) as before, we again get 113 features representing all unigrams, bigrams, and trigrams. However, the feature values are now no longer integers representing the n-gram counts but real values representing the TF-IDF weights. Still, this feature matrix is a valid input for the MultinomialNB class of scikit-learn implementing the MNB classification algorithm.
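To close the loop, here is a minimal sketch (not part of the original notebook) that fits MultinomialNB once on the count-based and once on the TF-IDF feature matrices built above. Since this section's toy dataset has no class labels, the labels below are purely hypothetical and chosen only for illustration.
# Illustrative sketch: MultinomialNB accepts both count and TF-IDF features
from sklearn.naive_bayes import MultinomialNB

# Hypothetical class labels, assigned only for illustration
y_train = ["politics", "politics", "politics", "politics", "sports", "sports", "sports"]

# Reuse the fitted vectorizers from the cells above
X_counts = count_vectorizer_unibitrigrams.transform(sentences)
X_tfidf = tfidf_vectorizer_unibitrigrams.transform(sentences)

mnb_counts = MultinomialNB().fit(X_counts, y_train)
mnb_tfidf = MultinomialNB().fit(X_tfidf, y_train)

test_sentence = ["the team hopes to win the vote next term"]
print(mnb_counts.predict(count_vectorizer_unibitrigrams.transform(test_sentence)))
print(mnb_tfidf.predict(tfidf_vectorizer_unibitrigrams.transform(test_sentence)))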
Summary¶
The Multinomial Naive Bayes (MNB) algorithm is a probabilistic machine learning model widely used for document classification tasks such as spam filtering, sentiment analysis, and topic classification. It belongs to the family of Naive Bayes algorithms and is particularly well-suited for text data because it assumes that features follow a multinomial distribution. In this context, the features are typically word frequencies or occurrences derived from the documents, making MNB a natural fit for bag-of-words or term-frequency representations. The algorithm operates on Bayes' theorem, which calculates the posterior probability of a class given a document based on the prior probability of the class and the likelihood of the document under that class. The "naive" assumption in Naive Bayes is that all features (e.g., words) are conditionally independent of each other given the class label. While this assumption is often violated in real-world text data, the model still performs surprisingly well in many applications due to its simplicity and efficiency.
MNB computes the likelihood of a document belonging to a class by analyzing the frequency of words in the document and comparing it to the word distributions in the training data for each class. Specifically, it estimates the probability of each word occurring in each class using Maximum Likelihood Estimation (MLE) with additive smoothing (commonly Laplace smoothing) to handle words that may not appear in the training data. This ensures the algorithm can handle unseen words gracefully and avoids assigning zero probabilities.
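As a point of reference for the smoothed estimate described above, the standard additive-smoothing formula for the likelihood of word $w_i$ given class $c$ is:
$$\hat{P}(w_i \mid c) = \frac{N_{c,i} + \alpha}{N_c + \alpha \cdot |V|}$$
where $N_{c,i}$ is the count of $w_i$ across all training documents of class $c$, $N_c = \sum_i N_{c,i}$ is the total word count in class $c$, $|V|$ is the vocabulary size, and $\alpha = 1$ corresponds to Laplace smoothing (exposed as the alpha parameter of scikit-learn's MultinomialNB class).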
One of the key advantages of Multinomial Naive Bayes is its computational efficiency, as it requires minimal resources to train and predict. It works well with high-dimensional data, such as text corpora, and scales effectively to large datasets. However, its conditional independence assumption means that each feature contributes to the prediction on its own, which can limit performance when more complex patterns exist in the data. Additionally, MNB is sensitive to how the text is preprocessed (e.g., tokenization, stopword removal, or n-gram representation), and its results can depend on whether simple counts or weighted representations such as term frequency-inverse document frequency (TF-IDF) are used. In summary, Multinomial Naive Bayes is a robust, simple-to-implement algorithm for text classification that balances performance and computational efficiency. Its effectiveness in real-world applications, despite its simplicity, has made it a cornerstone of many NLP pipelines, especially in situations where interpretability and speed are critical.