Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.
Porter Stemmer¶
The Porter Stemmer is a popular algorithm used in natural language processing (NLP) and information retrieval to reduce words to their base or root form. It was developed by Martin Porter in 1980 and is one of the most widely used stemming algorithms due to its simplicity and effectiveness. The primary purpose of stemming is to group together words with the same root meaning, enabling text analysis systems to treat variations of a word (e.g., "running," "runs," "runner") as a single term ("run"). This process helps improve the efficiency of search engines, text mining, and other NLP tasks.
The algorithm works by systematically applying a series of rules to remove or modify suffixes from words. These rules are organized into steps, each targeting specific suffix patterns, such as "-ing," "-ed," or "-ly." For example, the word "playing" would be reduced to "play," and "happiness" to "happi." The rules are designed to ensure that the stemmed word retains its linguistic meaning and can be understood in context, although the resulting stems are not guaranteed to be valid words.
Despite its strengths, the Porter Stemmer has some limitations. It uses heuristic rules, which means it can sometimes produce stems that are not semantically meaningful or recognizable. For instance, words like "university" may be stemmed to "univers," which is not an actual word. Additionally, the algorithm is language-specific and primarily designed for English. For other languages, different stemming algorithms may be needed to account for unique morphological structures.
In modern NLP applications, stemming is often supplemented or replaced by lemmatization, which is a more sophisticated process that reduces words to their dictionary base form (lemma) using linguistic rules and vocabulary. However, the Porter Stemmer remains relevant in many cases where computational efficiency and simplicity are prioritized, such as in lightweight search engines or preprocessing pipelines for text analysis. Its enduring popularity highlights its significance in the evolution of text processing techniques.
Setting up the Notebook¶
Make Required Imports¶
This notebook requires the import of different Python packages but also additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have successfully been imported.
from src.text.preprocessing.stemming import MyPorterStemmer
Algorithm¶
In this notebook, we discuss the originally proposed Porter Stemmer algorithm in detail, using the same notation as the original description of the algorithm. The implementation of the Porter Stemmer used throughout the notebook is adapted from the NLTK implementation of the algorithm. The NLTK implementation includes extensions to the algorithm; those have been removed to simplify the code and to match the originally proposed algorithm.
Let's create an instance of the class MyPorterStemmer to show examples throughout the notebook.
stemmer = MyPorterStemmer()
Representation of Words¶
The Porter Stemmer converts a word into a sequence of c's (consonants) and v's (vowels) as part of its internal mechanism to simplify the process of identifying patterns in word structure. This transformation is not an output of the stemmer but a representation used during intermediate steps for rule application. The purpose is to provide a structured way to analyze and manipulate the word's morphology. Here's why this is done:
- Identifying Structural Patterns: Words in English often follow predictable patterns of alternating vowels and consonants. By representing a word as a sequence of c's and v's, the algorithm can focus on identifying these patterns more abstractly. For instance, the word "running" is represented as cvccvcc ("r" is a consonant, "u" a vowel, "n" a consonant, and so on). The word "aggressiveness" is represented as vcccvccvcvcvcc. These c-v sequences are then further transformed by aggregating all sequences of c's into a single C, and all sequences of v's into a single V. This means that cvccvcc becomes CVCVC, and vcccvccvcvcvcc becomes VCVCVCVCVC. This abstract representation allows the algorithm to apply rules based on the sequence's structure, rather than directly working with specific letters.
- Facilitating Rule-Based Operations: Many stemming rules in the Porter algorithm depend on the length and arrangement of vowel and consonant sequences. For example, a rule might only apply if a word has at least one VC sequence (e.g., "abate", represented as VCVCV, meets this criterion, while "see", represented as CV, does not). Another rule might target words ending in a specific suffix preceded by a VC structure, such as "-ing" in "playing". By abstracting words into c-v sequences, the algorithm can generalize these rules and make decisions without considering the complexity of individual letters.
The method _cv_sequence() of the class MyPorterStemmer implements this transformation of words into their internal representation for the stemming algorithm. Notice that this method does not combine sequences of c's and v's into a single C and V. For example, for the word "running" the method will return cvccvcc and not CVCVC. For implementation purposes, this last transformation step is not needed, as we will see in a bit.
Use the code cell below to call the _cv_sequence() on different example words and observe the output:
#word = "running"
word = "aggressiveness"
print(stemmer._cv_sequence(word))
vcccvccvcvcvcc
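For illustration, the transformation can also be sketched as a small standalone function (a simplified re-implementation, not the repository code; the name cv_sequence and the rule that "y" counts as a vowel after a consonant follow the original algorithm description):

```python
def cv_sequence(word):
    """Map each letter to 'c' (consonant) or 'v' (vowel).

    Following the Porter convention, "y" counts as a vowel
    when it directly follows a consonant.
    """
    vowels = "aeiou"
    seq = ""
    for i, ch in enumerate(word.lower()):
        if ch in vowels or (ch == "y" and i > 0 and seq[i - 1] == "c"):
            seq += "v"
        else:
            seq += "c"
    return seq

print(cv_sequence("running"))         # cvccvcc
print(cv_sequence("aggressiveness"))  # vcccvccvcvcvcc
```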
Determining the "Measure" (m)¶
A key concept in the Porter Stemmer is the Measure (m), which counts the number of VC sequences in a word. Notice that, using the VC representation, each word can be represented as
$$\large [C](VC)^{m}[V] $$
where the square brackets denote an arbitrary presence of consonants or vowels, and $(VC)^m$ denotes $VC$ repeated $m$ times. For example:
- The word "aggressiveness" has a measure of $m=5$ since VCVCVCVCVC contains five VC pairs.
- The word "running" has a measure of $m=2$ since CVCVC contains two VC pairs.
- The word "run" has a measure of $m=1$ since CVC contains only one VC pair.
- The word "see" has a measure of $m=0$ since CV contains no VC pair.
This measure is critical for deciding whether a suffix should be removed. For example, a suffix like "-ement" is removed only if the remaining stem has a measure greater than $1$. This prevents over-stemming and ensures meaningful stems.
The method _measure() implements the calculation of the measure $m$. Again, use the code cell below and try different words.
word = "aggressiveness"
#word = "running"
#word = "run"
#word = "see"
print(stemmer._measure(word))
5
If you check out the code of the method _measure(), you will notice that it simply calls the method _cv_sequence() and then counts the number of occurrences of the substring vc. Recall that _cv_sequence() returns, say, for the word "aggressiveness" the sequence vcccvccvcvcvcc and not VCVCVCVCVC. However, the number of occurrences of vc in vcccvccvcvcvcc will always be the same as the number of occurrences of VC in VCVCVCVCVC. This is why we do not need to merge sequences of c's and v's in practice.
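This counting trick can be sketched as a standalone function (a simplified re-implementation, independent of MyPorterStemmer):

```python
def measure(word):
    """Count the number of VC sequences (the measure m) by counting
    occurrences of 'vc' in the raw, unmerged c/v sequence."""
    vowels = "aeiou"
    seq = ""
    for i, ch in enumerate(word.lower()):
        is_vowel = ch in vowels or (ch == "y" and i > 0 and seq[i - 1] == "c")
        seq += "v" if is_vowel else "c"
    return seq.count("vc")

print(measure("aggressiveness"))  # 5
print(measure("running"))         # 2
print(measure("see"))             # 0
```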
Rules¶
A rule in the Porter Stemmer algorithm is a specific pattern-based instruction used to transform a word by removing or modifying its suffix. These rules are designed to simplify the word while retaining its root meaning for tasks like text analysis or search. The rules operate in a systematic, step-by-step process, targeting different suffixes and conditions at each stage. The general syntax of a rule is:
$$\large (condition)\ S1 \rightarrow S2 $$This means that if a word ends with the suffix $S1$, and the stem before $S1$ satisfies the given $condition$, $S1$ is replaced by $S2$. The condition is usually given in terms of the measure $m$ of the stem before $S1$. For example, consider the rule
$$\large (m > 1)\ EMENT \rightarrow $$Here $S1$ is "EMENT" and $S2$ is null. This would map "REPLACEMENT" to "REPLAC", since "REPLAC" is a word part for which $m = 2$. In contrast, the word "CEMENT" would remain unchanged since the part before "EMENT" (i.e., "C") has a measure of $m = 0$ which does not satisfy the condition.
Side note: In the original description of the algorithm, all rules are defined using capital letters. In practice, however, the rules are case-insensitive. To align with the original description, we adhere to the notation of using capital letters in rules.
Conditions¶
Beyond the measure $m$, the Porter stemming algorithm also defines the following conditions:
- $*S$ — the stem ends with $S$ (and similarly for the other letters)
- $*v*$ — the stem contains a vowel
- $*d$ — the stem ends with a double consonant (e.g. "-TT", "-SS")
- $*o$ — the stem ends with a cvc sequence, where the second c is not W, X, or Y (e.g., "-WIL", "-HOP")
The individual conditions can be combined using negation, the logical "and" and logical "or" to form more complex conditions. For example, the condition
$$\large (*d\ \text{and}\ \text{not}\ (*L\ \text{or}\ *S\ \text{or}\ *Z)) $$tests if a stem ends with a double consonant and does not end with the letters "L", "S", or "Z".
Application of Rules¶
The original Porter Stemmer algorithm defines 8 rule sets which are applied sequentially. Each rule set is a list of rules that get tested and applied when the condition evaluates to true. If any rule of a rule set gets applied to an input word, the rule set is done; that is, no other rules in the same rule set get checked, and the algorithm moves on to the next rule set.
The class MyPorterStemmer implements the method _apply_rule_list() that takes a word and a list of rules as input. The method iterates over the list of rules to see if a rule is applicable (i.e., if the condition of the current rule evaluates to true). If this is the case, the method applies this rule and immediately returns the result (again, because no further rules are supposed to be checked).
The code cell below shows an example. The list of rules reflects the first rule set evaluated over a given word. In the given implementation, each rule has the form $(S1, S2, condition)$. In the rule set below, all conditions are None, which simply means that a rule is always applied if the word ends with $S1$.
word = "abilities"
rules = [
("sses", "ss", None), # SSES -> SS
("ies", "i", None), # IES -> I
("ss", "ss", None), # SS -> SS
("s", "", None), # S ->
]
print(f"{word} ==> {stemmer._apply_rule_list(word, rules)}")
abilities ==> abiliti
Note that the output "abiliti" for the word "abilities" is not the final stemmed version of the word, as further rule sets will be evaluated. The Porter Stemmer algorithm denotes these rule sets as steps to highlight that they are evaluated and potentially applied in sequential order. Let's go through all steps to see what they are checking.
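Before moving on, the behavior of such a rule list can be sketched as a standalone function (a simplified version modeled on the description above; the repository's _apply_rule_list() may differ in details):

```python
def apply_rule_list(word, rules):
    """Apply the first rule (s1, s2, condition) whose suffix matches, then stop.

    If the suffix matches but the condition fails, the word is returned
    unchanged: no further rules in the rule set are considered.
    """
    for s1, s2, condition in rules:
        if word.endswith(s1):
            stem = word[: len(word) - len(s1)]
            if condition is None or condition(stem):
                return stem + s2
            return word
    return word

rules = [("sses", "ss", None), ("ies", "i", None), ("ss", "ss", None), ("s", "", None)]
print(apply_rule_list("abilities", rules))  # abiliti
print(apply_rule_list("classes", rules))    # class
```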
Step 1a¶
Step 1a mainly addresses the plural form of nouns as well as the 3rd-person singular form of verbs. More specifically, the rules are:
- $SSES \rightarrow SS$
- $IES \rightarrow I$
- $SS \rightarrow SS$
- $S \rightarrow$
We already saw how the rules for Step 1a are implemented and evaluated by _apply_rule_list(); see above. The code cell below applies Step 1a to a list of words; feel free to modify the list and inspect the results.
words = ["abilities", "busses", "sings", "class", "classes"]
for word in words:
print(f"{word} ==> {stemmer._step1a(word)}")
abilities ==> abiliti busses ==> buss sings ==> sing class ==> class classes ==> class
Step 1b¶
Step 1b focuses on handling the past tense and progressive forms of verbs by applying the following rules:
- $(m>0)\ EED \rightarrow EE$
- $(*v*)\ ED\ \rightarrow$
- $(*v*)\ ING\ \rightarrow$
Let's look at some examples, but only focusing on the first rule.
words = ["freed", "succeed"]
for word in words:
print(f"{word} ==> {stemmer._step1b(word)}")
freed ==> freed succeed ==> succee
Notice that "freed" has not been stemmed as the part before "eed" (i.e., "fr") has measure of $m=0$, and therefore the condition $m>0$ evaluates to false.
Rules 2 and 3 get additional treatments. This is due to the basic distinction in how the past tense and progressive forms of verbs are formed. If the word is a regular verb ending with a consonant, the suffixes "-ed" and "-ing" are added to form the past tense and progressive form, respectively. However, there is the exception that if the verb has only one syllable and ends in a single consonant letter preceded by a single vowel letter, the final consonant letter is doubled first (with the additional exception that "w", "x", and "y" never get doubled). Examples of such words include "swim", "get", "fix", etc. In contrast, if a verb ends with an "e", this "e" is dropped before adding the suffixes "-ed" and "-ing". This means that if we stem a word like "exhaled", we want to add the "e" back in after removing "-ed". Therefore, Step 1b includes additional rules to handle those cases. Let's check out some examples:
words = ["swimming", "fixed", "begged", "begging", "controlling"]
for word in words:
print(f"{word} ==> {stemmer._step1b(word)}")
swimming ==> swim fixed ==> fix begged ==> beg begging ==> beg controlling ==> controll
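The extra treatments described above can be sketched as a standalone post-processing function (the name step1b_cleanup is hypothetical, and the function assumes the "-ed"/"-ing" suffix has already been removed):

```python
def step1b_cleanup(stem):
    """Sketch of the extra Step 1b rules, applied after "-ed"/"-ing" removal."""
    vowels = "aeiou"

    def cv(word):
        seq = ""
        for i, ch in enumerate(word):
            is_vowel = ch in vowels or (ch == "y" and i > 0 and seq[i - 1] == "c")
            seq += "v" if is_vowel else "c"
        return seq

    # AT, BL, IZ: add back an "e" (e.g., "conflat(ed)" -> "conflate")
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    # *d: undouble a final double consonant, except "l", "s", "z"
    # (e.g., "swimm(ing)" -> "swim")
    if len(stem) > 1 and stem[-1] == stem[-2] and cv(stem).endswith("cc") and stem[-1] not in "lsz":
        return stem[:-1]
    # (m=1 and *o): add an "e" (e.g., "fil(ing)" -> "file");
    # *o means the stem ends in cvc with the last letter not "w", "x", or "y"
    seq = cv(stem)
    if seq.count("vc") == 1 and seq.endswith("cvc") and stem[-1] not in "wxy":
        return stem + "e"
    return stem

print(step1b_cleanup("swimm"))     # swim
print(step1b_cleanup("conflat"))   # conflate
print(step1b_cleanup("fil"))       # file
print(step1b_cleanup("controll"))  # controll ("l" is excluded from undoubling)
```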
Step 1c¶
Step 1c is all about words ending with "y" and has only a single rule:
- $(*v*)\ Y \rightarrow I$
In other words, if the stem contains a vowel, replace the final "y" with an "i". In English, many words with "y" at the end take an "i" when forming derived forms or suffixes (e.g., "happy" $\rightarrow$ "happiness", "cry" $\rightarrow$ "cried"). To handle these variations consistently, the Porter Stemmer replaces "y" with "i" during the stemming process. This helps group related word forms (e.g., "happy" and "happiness") under the same base. The code cell below shows a few examples of the effect of Step 1c.
words = ["baby", "slowly", "cry", "fully", "immediately"]
for word in words:
print(f"{word} ==> {stemmer._step1c(word)}")
baby ==> babi slowly ==> slowli cry ==> cry fully ==> fulli immediately ==> immediateli
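Since Step 1c has only a single rule, it is easy to sketch as a standalone function (a simplified version with a plain a/e/i/o/u vowel check):

```python
def step1c(word):
    """Step 1c sketch: (*v*) Y -> I.

    Replace a final "y" with an "i" if the stem before the "y"
    contains a vowel.
    """
    if word.endswith("y") and any(ch in "aeiou" for ch in word[:-1]):
        return word[:-1] + "i"
    return word

print(step1c("baby"))  # babi
print(step1c("cry"))   # cry ("cr" contains no vowel)
```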
Step 2¶
Step 2 of the Porter Stemmer algorithm plays a crucial role in refining the stemming process by handling longer suffixes that are often found in derived words. This step targets suffixes such as "-ational", "-ization", and "-fulness", which frequently appear in formal or technical language. Step 2 applies transformations carefully, with conditions to ensure accuracy. For one, a suffix is only replaced if the remaining stem satisfies certain structural criteria (e.g., having a measure $m>0$, indicating that the stem contains at least one vowel-consonant pair). This prevents over-stemming and helps maintain meaningful stems. The complete list of rules as defined by the original Porter Stemmer algorithm is:
- $(m>0)\ ATIONAL \rightarrow ATE$
- $(m>0)\ TIONAL \rightarrow TION$
- $(m>0)\ ENCI \rightarrow ENCE$
- $(m>0)\ ANCI \rightarrow ANCE$
- $(m>0)\ IZER \rightarrow IZE$
- $(m>0)\ ABLI \rightarrow ABLE$
- $(m>0)\ ALLI \rightarrow AL$
- $(m>0)\ ENTLI \rightarrow ENT$
- $(m>0)\ ELI \rightarrow E$
- $(m>0)\ OUSLI \rightarrow OUS$
- $(m>0)\ IZATION \rightarrow IZE$
- $(m>0)\ ATION \rightarrow ATE$
- $(m>0)\ ATOR \rightarrow ATE$
- $(m>0)\ ALISM \rightarrow AL$
- $(m>0)\ IVENESS \rightarrow IVE$
- $(m>0)\ FULNESS \rightarrow FUL$
- $(m>0)\ OUSNESS \rightarrow OUS$
- $(m>0)\ ALITI \rightarrow AL$
- $(m>0)\ IVITI \rightarrow IVE$
- $(m>0)\ BILITI \rightarrow BLE$
From looking at those rules, it is rather obvious that Step 2 assumes that Step 1c was already performed to replace a final "y" in a word with an "i". Thus, the example shown below focuses on words that do not originally end in "y".
words = ["seriousness", "creator", "organization", "organizer"]
for word in words:
print(f"{word} ==> {stemmer._step2(word)}")
seriousness ==> serious creator ==> creator organization ==> organize organizer ==> organize
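The table-driven nature of Step 2 can be sketched as follows (a standalone approximation; the simplified measure treats "y" as a consonant, which suffices for these examples):

```python
STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def measure(word):
    # simplified measure: "y" is treated as a consonant here
    seq = "".join("v" if ch in "aeiou" else "c" for ch in word)
    return seq.count("vc")

def step2(word):
    for s1, s2 in STEP2_RULES:
        if word.endswith(s1):
            stem = word[: len(word) - len(s1)]
            if measure(stem) > 0:
                return stem + s2
            return word
    return word

print(step2("organization"))  # organize
print(step2("seriousness"))   # serious
print(step2("creator"))       # creator (m("cre") = 0, so the rule does not fire)
```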
Step 3¶
Step 3 of the Porter Stemmer algorithm focuses on refining stems further by handling suffixes associated with grammatical transformations, such as changes in part of speech or formality. This step primarily targets suffixes like "-ness", "-ful", "-ative", and "-al", reducing words to their more basic forms. The importance of this step lies in its ability to simplify derived words while maintaining their core meaning, improving the overall efficiency of text analysis tasks. Step 3 deals with suffixes that are often used to form adjectives or nouns from base words, such as: "-ness" (e.g., "happiness" $\rightarrow$ "happi"), "-ative" (e.g., "informative" $\rightarrow$ "inform"), or "-ful" (e.g., "hopeful" $\rightarrow$ "hope"). Similar to Step 2, a suffix is only replaced if the remaining stem has a measure of $m>0$ to prevent over-stemming. The complete list of rules for Step 3 is:
- $(m>0)\ ICATE \rightarrow IC$
- $(m>0)\ ATIVE \rightarrow$
- $(m>0)\ ALIZE \rightarrow AL$
- $(m>0)\ ICITI \rightarrow IC$
- $(m>0)\ ICAL \rightarrow IC$
- $(m>0)\ FUL \rightarrow$
- $(m>0)\ NESS \rightarrow$
Again, some of the rules assume that a final "y" has already been replaced with an "i". The code cell below applies Step 3 to a few example words:
words = ["critical", "fearful", "indicate", "blindness"]
for word in words:
print(f"{word} ==> {stemmer._step3(word)}")
critical ==> critic fearful ==> fear indicate ==> indic blindness ==> blind
Step 4¶
Step 4 of the Porter Stemmer algorithm is designed to remove common suffixes that act as grammatical markers in English, such as "-al", "-ance", "-ence", "-able", "-ible", "-ment", and others. These suffixes often appear in longer, derived words and are typically less essential to the word's core meaning. The purpose of Step 4 is to reduce such words to their bare stems by stripping away these secondary suffixes, enabling more efficient text processing and better grouping of related terms. Secondary suffixes modify the word but do not drastically alter its meaning; they often indicate abstract concepts, properties, or states. Examples include "-ment" (e.g., "development" $\rightarrow$ "develop"), "-ance" (e.g., "allowance" $\rightarrow$ "allow"), and "-able" (e.g., "adjustable" $\rightarrow$ "adjust"). The exact rules are:
- $(m>1)\ AL \rightarrow$
- $(m>1)\ ANCE \rightarrow$
- $(m>1)\ ENCE \rightarrow$
- $(m>1)\ ER \rightarrow$
- $(m>1)\ IC \rightarrow$
- $(m>1)\ ABLE \rightarrow$
- $(m>1)\ IBLE \rightarrow$
- $(m>1)\ ANT \rightarrow$
- $(m>1)\ EMENT \rightarrow$
- $(m>1)\ MENT \rightarrow$
- $(m>1)\ ENT \rightarrow$
- $(m>1\ \text{and}\ (*S\ \text{or}\ *T))\ ION \rightarrow$
- $(m>1)\ OU \rightarrow$
- $(m>1)\ ISM \rightarrow$
- $(m>1)\ ATE \rightarrow$
- $(m>1)\ ITI \rightarrow$
- $(m>1)\ OUS \rightarrow$
- $(m>1)\ IVE \rightarrow$
- $(m>1)\ IZE \rightarrow$
To show the effects of Step 4, let's look at some examples:
words = ["replacement", "tolerant", "difference", "activism"]
for word in words:
print(f"{word} ==> {stemmer._step4(word)}")
replacement ==> replac tolerant ==> toler difference ==> differ activism ==> activ
Step 5a¶
Step 5a of the Porter Stemmer algorithm is focused on removing a final "e" from words under specific conditions. The purpose of this step is to simplify words to their most basic form without stripping away meaning. Since many English words end with a silent "e" that does not contribute to the word's core meaning, removing it can produce cleaner, more generalized stems for text processing tasks. The final "e" often serves as a grammatical or phonetic marker rather than contributing to the meaning of the word, e.g., "probate" $\rightarrow$ "probat", "cease" $\rightarrow$ "ceas". The two specific rules are:
- $(m>1)\ E \rightarrow$
- $(m=1\ \text{and}\ \text{not}\ *o)\ E \rightarrow$
The code cell below runs Step 5a for a few example words:
words = ["probate", "rate", "accelerate", "cease"]
for word in words:
print(f"{word} ==> {stemmer._step5a(word)}")
probate ==> probat rate ==> rate accelerate ==> accelerat cease ==> ceas
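The two rules of Step 5a can be sketched as a standalone function (again with a simplified c/v mapping that treats "y" as a consonant):

```python
def step5a(word):
    """Step 5a sketch: (m>1) E ->  and (m=1 and not *o) E -> ."""
    def cv(w):
        # simplified mapping: "y" is treated as a consonant here
        return "".join("v" if ch in "aeiou" else "c" for ch in w)

    if word.endswith("e"):
        stem = word[:-1]
        m = cv(stem).count("vc")
        # *o: stem ends in cvc and the last letter is not "w", "x", or "y"
        ends_o = cv(stem).endswith("cvc") and stem[-1] not in "wxy"
        if m > 1 or (m == 1 and not ends_o):
            return stem
    return word

print(step5a("probate"))  # probat
print(step5a("rate"))     # rate ("rat" satisfies *o, so the "e" is kept)
print(step5a("cease"))    # ceas
```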
Step 5b¶
The purpose of Step 5b in the Porter Stemmer algorithm is to remove a redundant "l" from words ending in "ll", provided the stem meets a specific measure condition ($m>1$). This step further simplifies word forms, ensuring consistency across related terms while avoiding over-stemming. As the final step of the algorithm, it fine-tunes the stems, making them ready for use in text analysis, retrieval, and other NLP tasks. The single rule of Step 5b is:
- $(m>1\ \text{and}\ *d\ \text{and}\ *L)\ \rightarrow \text{single letter}$
Let's check out a few examples:
words = ["full", "recall", "enroll"]
for word in words:
print(f"{word} ==> {stemmer._step5b(word)}")
full ==> full recall ==> recal enroll ==> enrol
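The single rule of Step 5b can be sketched as a standalone function (the endswith("ll") check combines the *d and *L conditions; the simplified measure treats "y" as a consonant):

```python
def step5b(word):
    """Step 5b sketch: (m>1 and *d and *L) -> single letter."""
    # simplified measure: "y" is treated as a consonant here
    seq = "".join("v" if ch in "aeiou" else "c" for ch in word)
    if word.endswith("ll") and seq.count("vc") > 1:
        return word[:-1]
    return word

print(step5b("recall"))  # recal
print(step5b("enroll"))  # enrol
print(step5b("full"))    # full (m = 1, so the rule does not fire)
```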
As usual, keep in mind that Step 5b will be applied after all previous steps. For example, the word "controlling" will be stemmed to "controll" in Step 1b, and therefore further stemmed to "control" in Step 5b.
Applying all Rules¶
Now that we have covered all 8 steps of the Porter Stemming algorithm, we apply them to a few example words to see the full effect of this stemming algorithm.
words = ["abilities", "aggressivness", "running", "run", "see", "international",
"organization", "controlling", "nervousness", "accomplishment"]
for word in words:
print(f"{word} ==> {stemmer.stem(word)}")
abilities ==> abil aggressivness ==> aggressiv running ==> run run ==> run see ==> see international ==> intern organization ==> organ controlling ==> control nervousness ==> nervous accomplishment ==> accomplish
Limitations¶
Language-Specific¶
As the most obvious limitation, the Porter Stemmer is specifically designed for English text because it relies on rules tailored to the morphology and grammar of the English language. Stemming for any language must consider its unique linguistic characteristics, such as word formation rules, suffixes, and grammatical structure. In fact, with respect to stemming, English can be considered a rather simple language, requiring only a relatively small set of rules to yield good results. Other languages often pose bigger challenges for designing an effective stemming algorithm; common reasons include:
Complex Morphology: Some languages have more complex word structures than English, with rich inflectional or derivational forms. For example, agglutinative languages (e.g., Turkish, Finnish) have words that can consist of long chains of morphemes, each adding specific grammatical meaning. In Turkish, "evlerinizden" ("from your houses") contains multiple affixes that need to be carefully stripped. Fusional languages (e.g., Russian, Spanish) have words where a single suffix can encode multiple grammatical properties, such as tense, mood, and number, making it harder to design rules for consistent stemming.
Non-Latin Scripts: Languages written in non-Latin scripts, such as Arabic, Hindi, or Chinese, pose unique challenges. For example, in Arabic, root-based morphology means that words are formed from three-letter roots, often requiring root extraction rather than suffix removal. Chinese and Japanese use logographic or syllabic scripts, meaning words are not formed with prefixes and suffixes. Instead of stemming, segmentation (splitting sentences into words) is a prerequisite.
Compound Words: Languages like German and Dutch frequently use compound words: For example, in German, "Hausmeisterwohnung" ("caretaker's apartment") combines "Haus" ("house"), "Meister" ("master"), and "Wohnung" ("apartment"). Stemming must recognize and split such compounds before processing.
Common Failure Types¶
Irregular Word Forms: The Porter Stemmer does not handle irregular word forms or exceptions well because it relies on fixed rules rather than a dictionary or semantic understanding. For example, the stemmer does not affect words such as "better" or "went", since removing suffixes alone cannot map them to "good" and "go", respectively.
Over-Stemming: Over-stemming happens when the algorithm reduces words to stems that are too short or unrelated, causing semantically distinct words to collapse into the same stem. For example, the two words "university" and "universe" have distinct meanings, but they are reduced to the same stem "univers", leading to a loss of differentiation.
Under-Stemming: Under-stemming occurs when the stemming algorithm reduces inflected words that should share a stem to different word stems. For example, the Porter Stemmer algorithm does not reduce the words "alumnus", "alumnae", and "alumni" to the same word stem, although they should be treated as synonyms. It is easy to see that this is likely to happen with words directly borrowed from other languages, which therefore do not adhere to rules commonly found in the English language.
Summary¶
The Porter Stemmer is an algorithm used in natural language processing (NLP) to reduce words to their root or base form, called the stem. For example, words like running, runner, and runs are simplified to run. It uses a series of rules to strip suffixes like -ing, -ed, or -ly, focusing on English words. This helps standardize words so that similar ones are grouped together, making it easier for computers to process and analyze text.
This algorithm is especially important in tasks like search engines, text classification, and sentiment analysis. By reducing words to their stems, it ensures that different forms of the same word are treated as one, improving the accuracy of tasks like finding relevant search results or identifying patterns in text. While the Porter Stemmer isn't perfect — it may sometimes oversimplify or fail to connect related words — it remains a widely used tool because it is simple, fast, and effective for many applications.