Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Stemming & Lemmatization¶

Consider the following two sentences:

  • "Dogs make the best friends."
  • "A dog makes a good friend."

Semantically, both sentences convey essentially the same message, but syntactically they are very different since the vocabulary differs: "dogs" vs. "dog", "make" vs. "makes", "friends" vs. "friend". This is a big problem when comparing documents or when searching for documents in a database. For example, when one uses "dog" as a search term, both sentences should be returned and not just the second one.

Stemming and lemmatization are two common techniques used in natural language processing (NLP) for text normalization. Both methods aim to reduce words to their base or root forms, but they differ in their approaches and outcomes.

Stemming: Stemming is a process of reducing words to their "stems" by removing prefixes and suffixes, typically through simple heuristic rules. The resulting stems may not always be actual words. The goal of stemming is to normalize words that have the same base meaning but may have different inflections or variations. For example, stemming the words "running" and "runs" would result in the common stem "run." A popular stemming algorithm is the Porter stemming algorithm.

Lemmatization: Lemmatization, on the other hand, is a more advanced technique that aims to transform words to their "lemmas," which are the base or dictionary forms of words. Lemmatization takes into account the morphological analysis of words and considers factors such as part-of-speech (POS) tags to determine the correct lemma. The output of lemmatization is usually a real word that exists in the language. For example, lemmatizing the words "running," "runs," and "ran" would yield the lemma "run." Lemmatization requires more linguistic knowledge and often relies on dictionaries or language-specific resources.
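
As a quick, hedged preview of this difference (using NLTK classes that also appear later in this notebook; NLTK and its WordNet data must be installed for this sketch to run), the code below stems and lemmatizes a few word forms of "run":

In [ ]:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "runs", "ran"]:
    # The stemmer strips suffixes by rule; the lemmatizer looks words up in WordNet
    # (here we treat every word as a verb for illustration)
    print(f"{word}: stem = {stemmer.stem(word)}, lemma (verb) = {lemmatizer.lemmatize(word, pos='v')}")

In particular, the irregular past tense "ran" is typically only normalized by the lemmatizer, which illustrates the additional linguistic knowledge it relies on.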

The choice between stemming and lemmatization depends on the specific NLP task and its requirements. Stemming is a simpler and faster technique, often used when the exact word form is not critical, such as in information retrieval or indexing tasks. Lemmatization, being more linguistically sophisticated, is preferred in tasks where the base form and the semantic meaning of words are important, such as in machine translation, sentiment analysis, or question-answering systems.

It's important to note that stemming and lemmatization may not always produce the same results, and the choice between them should consider the trade-offs between accuracy and computational complexity.

Both stemming and lemmatization are methods for normalizing documents on a syntactic level. Often the same word is used in different forms depending on its grammatical role in a sentence.

Setting up the Notebook¶

Make Required Imports¶

This notebook requires the import of different Python packages as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.

In [1]:
from src.utils.libimports.stemlem import *
from src.utils.data.files import *

Lemmatization requires information about whether a word is a noun, verb, or adjective. We therefore need a Part-of-Speech tagger to extract this information. The code cell below downloads averaged_perceptron_tagger_eng, a Part-of-Speech tagger of NLTK (in case it is not already available in the current NLTK installation).

In [2]:
nltk.download('averaged_perceptron_tagger_eng')
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/vdw/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
Out[2]:
True

Download Required Data¶

Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws any error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.

In [3]:
lemmatizer_lexicon, _ = download_dataset("text/lexicons/lemmatization/lemmatizer-lexicon.dat")
File 'data/datasets/text/lexicons/lemmatization/lemmatizer-lexicon.dat' already exists (use 'overwrite=True' to overwrite it).

Stemming¶

Stemming is a process in natural language processing (NLP) that reduces words to their base or root forms, called stems. Stemming algorithms apply heuristic rules to remove prefixes and suffixes from words, aiming to normalize variations of words that share a common root. There are several popular stemming algorithms, each with its own approach and characteristics. The main differences between different stemmers include:

  • Porter Stemmer: The Porter stemming algorithm, developed by Martin Porter, is one of the most widely used stemmers. It applies a series of rules and transformations to remove common English word endings, focusing on the structure of the word rather than its linguistic meaning. The Porter stemmer is known for its simplicity and speed but may produce stems that are not actual words.

  • Snowball Stemmer: The Snowball stemmer, also known as the Porter2 stemmer, is an extension of the Porter stemmer. It provides stemmers for multiple languages, including English, German, Spanish, French, and more (a short sketch of instantiating a non-English stemmer follows after this list). The Snowball stemmer is an improvement over the original Porter stemmer, addressing some of its limitations and offering better performance and accuracy for different languages.

  • Lancaster Stemmer: The Lancaster stemming algorithm, developed by Chris D. Paice, is an aggressive stemming algorithm that focuses on removing prefixes and suffixes from words. It applies a set of rules that are more aggressive than those used in the Porter stemmer, often resulting in shorter stems. The Lancaster stemmer is known for its aggressive stemming behavior and can produce stems that are not recognizable as actual words.

  • Lovins Stemmer: The Lovins stemmer, developed by J. H. Lovins, is an early stemming algorithm that uses a set of rules based on linguistic principles to remove common word endings. It aims to produce stems that are linguistically meaningful and recognizable as real words. The Lovins stemmer is not as widely used as the Porter or Lancaster stemmers but can be useful in certain contexts.

The choice of stemmer depends on the specific NLP task, the language being processed, and the trade-offs between simplicity, speed, accuracy, and the desired level of stemming aggressiveness. It's important to evaluate and compare the performance of different stemmers for a particular application to determine the most suitable one.
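
As mentioned in the Snowball bullet above, NLTK's SnowballStemmer supports several languages. The code cell below is a small optional sketch (not executed as part of this notebook): it lists the supported languages and instantiates a German stemmer. The example words are chosen freely for illustration, and SnowballStemmer is assumed to be available via the wildcard imports at the top of this notebook.

In [ ]:
# Optional sketch: NLTK's SnowballStemmer takes a language name at construction time
print(SnowballStemmer.languages)  # tuple of all supported languages

german_stemmer = SnowballStemmer('german')
for word in ['Häuser', 'laufen', 'gelaufen']:
    # The exact stems depend on the German Snowball rules
    print(word, '->', german_stemmer.stem(word))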

Define Set of Stemmers¶

We first define a few stemmers provided by NLTK. For more stemmers, see http://www.nltk.org/api/nltk.stem.html

In [4]:
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')

# Put all stemmers into a list to make their use easier
stemmer_list = [porter_stemmer, snowball_stemmer, lancaster_stemmer]

Define List of Example Words¶

To illustrate the effects of stemming, let's consider a list of individual words instead of a complete text document. This makes it easier to point out the differences between the stemmers. The choice of words below covers relevant cases, including:

  • Plural form of nouns
  • Different verb tenses
  • Irregular verbs (e.g., verbs with an irregular form in the past tense)
  • Irregular adjectives (e.g., adjectives with irregular forms in the comparative and superlative)
In [5]:
word_list = ['only', 'accepted', 'studying','study','studied', 'dogs', 'cats', 
             'running', 'phones', 'viewed', 'presumably', 'crying', 'went', 
             'packed', 'worse', 'best', 'mice', 'friends', 'makes']

Perform Stemming¶

We can now stem each word using all three of our defined stemmers and print the output in a way that makes the differences easy to spot.

In [6]:
for word in word_list:
    print (f"{word}:")
    for stemmer in stemmer_list:
        stemmed_word = stemmer.stem(word)
        print (f"\t{stemmed_word} ({type(stemmer).__name__})")
only:
	onli (PorterStemmer)
	onli (SnowballStemmer)
	on (LancasterStemmer)
accepted:
	accept (PorterStemmer)
	accept (SnowballStemmer)
	acceiv (LancasterStemmer)
studying:
	studi (PorterStemmer)
	studi (SnowballStemmer)
	study (LancasterStemmer)
study:
	studi (PorterStemmer)
	studi (SnowballStemmer)
	study (LancasterStemmer)
studied:
	studi (PorterStemmer)
	studi (SnowballStemmer)
	study (LancasterStemmer)
dogs:
	dog (PorterStemmer)
	dog (SnowballStemmer)
	dog (LancasterStemmer)
cats:
	cat (PorterStemmer)
	cat (SnowballStemmer)
	cat (LancasterStemmer)
running:
	run (PorterStemmer)
	run (SnowballStemmer)
	run (LancasterStemmer)
phones:
	phone (PorterStemmer)
	phone (SnowballStemmer)
	phon (LancasterStemmer)
viewed:
	view (PorterStemmer)
	view (SnowballStemmer)
	view (LancasterStemmer)
presumably:
	presum (PorterStemmer)
	presum (SnowballStemmer)
	presum (LancasterStemmer)
crying:
	cri (PorterStemmer)
	cri (SnowballStemmer)
	cry (LancasterStemmer)
went:
	went (PorterStemmer)
	went (SnowballStemmer)
	went (LancasterStemmer)
packed:
	pack (PorterStemmer)
	pack (SnowballStemmer)
	pack (LancasterStemmer)
worse:
	wors (PorterStemmer)
	wors (SnowballStemmer)
	wors (LancasterStemmer)
best:
	best (PorterStemmer)
	best (SnowballStemmer)
	best (LancasterStemmer)
mice:
	mice (PorterStemmer)
	mice (SnowballStemmer)
	mic (LancasterStemmer)
friends:
	friend (PorterStemmer)
	friend (SnowballStemmer)
	friend (LancasterStemmer)
makes:
	make (PorterStemmer)
	make (SnowballStemmer)
	mak (LancasterStemmer)

Different stemmers will generally yield different outputs depending on their underlying rules, although for our example words only the LancasterStemmer deviates from the other two. Such differences do not automatically make one stemmer better or worse than another.
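
To connect this back to the two sentences from the introduction, the code cell below is a small sketch that stems every token of both sentences. It uses word_tokenize, which we assume is available via the imports at the top of this notebook (it requires NLTK's punkt tokenizer data). After stemming, both sentences contain the stem "dog", so a search for the stemmed query "dog" would match both documents.

In [ ]:
query_stem = porter_stemmer.stem("dog")

for text in ["Dogs make the best friends.", "A dog makes a good friend."]:
    # Tokenize the sentence and stem every token
    stems = [porter_stemmer.stem(token) for token in word_tokenize(text)]
    print(stems)
    print("Matches query stem 'dog':", query_stem in stems)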


Lemmatization¶

A lemmatizer is a tool or algorithm that transforms words into their base or dictionary forms, known as lemmas. Unlike stemming, which simplifies words by removing prefixes and suffixes without considering linguistic context, lemmatization takes into account the morphological analysis of words, part-of-speech (POS) tags, and language-specific rules to produce meaningful and valid lemmas.

Here is a brief summary of how a lemmatizer for NLP typically works:

  • Tokenization: The text is divided into individual words or tokens using tokenization techniques. This is typically a separate step performed before lemmatization; the lemmatizer assumes already tokenized text as input.

  • POS tagging: Each word is assigned a part-of-speech tag, such as noun, verb, adjective, etc. POS tagging helps determine the appropriate lemma based on the word's grammatical role.

  • Lemmatization rules: The lemmatizer applies language-specific rules and patterns to convert words to their lemmas. These rules consider factors like a word's POS tag, its inflections, and other linguistic properties. For example, for English verbs, the lemmatizer would handle verb conjugations to identify the base form.

  • Lookup in dictionary or lexicon: The lemmatizer may consult a dictionary or lexicon that contains information about word forms and their corresponding lemmas. This can be helpful for irregular words that don't follow regular morphological rules.

  • Lemmatization output: The lemmatizer generates the lemma for each word, which represents the base or canonical form of the word. The resulting lemmas are typically real words that exist in the language and are recognized by native speakers.

  • Post-processing: In some cases, additional post-processing steps may be applied to refine or improve the lemmatization results. These steps could include handling special cases, resolving ambiguities, or dealing with out-of-vocabulary terms.

(Note: Both tokenization and POS tagging are topics in their own right, and their detailed discussion is beyond the scope of this notebook. Therefore, in the following we either use toy examples that are already tokenized and POS tagged, or we use existing libraries and methods to perform these two tasks.)

Lemmatization requires linguistic knowledge, language-specific resources (such as dictionaries or lexicons), and morphological analysis to accurately identify and generate the appropriate lemmas. It is a more sophisticated technique compared to stemming and is generally preferred when preserving the semantic meaning and grammatical correctness of words is crucial in NLP tasks like machine translation, information retrieval, or sentiment analysis.

Building a Simple Lemmatizer¶

As mentioned above, most lemmatizers rely on dictionaries and lexicons to look up the lemma of a word given its POS tag. This means that the main effort of building a basic lemmatizer is often creating and curating such dictionaries and lexicons. For resource-rich languages such as English this can be rather straightforward. A resource-rich language is one that has extensive computational and linguistic resources readily available to support natural language processing tasks. These resources include large datasets, annotated corpora, dictionaries, and language models, as well as software tools and frameworks tailored for the language.

The provided file lemmatizer-lexicon.dat has been generated by combining multiple annotated datasets containing English text. While the purposes of these datasets may have differed, they all have in common that each word is annotated with both its POS tag and its lemma. The content of the file lemmatizer-lexicon.dat looks as follows:

...
drive###VERB::drive;;NOUN::drive
drove###VERB::drive
driving###VERB::drive;;ADJ::driving
driven###VERB::drive
drives###VERB::drive;;NOUN::drive
...

Each line in the file starts with an English word, followed by a list of (POS tag, lemma) pairs for that word; different separators ("###", ";;", "::") are used to uniquely identify the words and pairs.

To create a simple lookup from words to lemmas, we can read the file line by line and add the information to a Python dictionary; let's call it word_to_lemma. The keys of the dictionary will be all English words listed in the data file. The values are again dictionaries: the keys of these word-level dictionaries are POS tags, and the values are the actual lemmas. The code cell below generates the dictionary word_to_lemma.

In [7]:
word_to_lemma = {}

with open(lemmatizer_lexicon) as file:
    # Line format: driving###VERB::drive;;ADJ::driving
    for line in file:
        line = line.strip()
        word, lemmas = line.split("###")
        
        if word not in word_to_lemma:
            word_to_lemma[word] = {}
        
        for pos_lemma_pair in lemmas.split(";;"):
            pos, lemma = pos_lemma_pair.split("::")
            word_to_lemma[word][pos] = lemma

To have a look inside the dictionary, we can create a temporary copy that contains only a subset of the words so we can print it in full. The code cell below creates such a simplified dictionary by considering only the words contained in the sentence "she's driving faster than allowed.". Notice that we assume that tokenization has already been performed, and that most tokenizers will split "'s" into a separate token.

In [8]:
word_to_lemma_tmp = {key: word_to_lemma[key] for key in ["she", "'s", "driving", "faster", "than", "allowed", "."]}

print(json.dumps(word_to_lemma_tmp, indent=2))
{
  "she": {
    "PRON": "she",
    "NOUN": "she"
  },
  "'s": {
    "PART": "'s",
    "AUX": "be",
    "VERB": "be",
    "PRON": "we"
  },
  "driving": {
    "VERB": "drive",
    "ADJ": "driving"
  },
  "faster": {
    "ADJ": "fast"
  },
  "than": {
    "SCONJ": "than",
    "NOUN": "than",
    "ADP": "than",
    "ADV": "than"
  },
  "allowed": {
    "VERB": "allow"
  },
  ".": {
    "PUNCT": "."
  }
}

With this dictionary, we can now find all available lemmas for a given word and their respective POS tags; try out the following example, and feel free to check out other words.

In [9]:
word = "running"

print(f"The lemmata for {word} in the dictionary are: {word_to_lemma[word]}")
The lemmata for running in the dictionary are: {'VERB': 'run', 'NOUN': 'run', 'ADJ': 'running'}

Of course, for lemmatizing a word, both the word and its POS tag are the input for the lemmatizer. For this, we use the nested nature of our dictionary: first the word serves as the key for the main dictionary, and then the POS tag serves as the key for the word-level dictionary:

In [10]:
word = "running"
ptag = "VERB"

print(f"The lemma for {word} with the POS tag {ptag} is: {word_to_lemma[word][ptag]}")
The lemma for running with the POS tag VERB is: run

Of course, you can probably already spot a problem. If our dictionary does not contain a word, or the word-level dictionary does not contain the given POS tag, trying to get the lemma of a word will fail and raise an error. In practice, this is a common case since such dictionaries are never guaranteed to be complete. For one, language is constantly evolving, with new words being added. But more commonly, such dictionaries generally do not cover named entities, that is, the names of persons, organizations, companies, locations, and so on. Such names are not expected to be lemmatized anyway. The same holds for non-word tokens such as numbers, abbreviations, punctuation marks, URLs, etc.

However, our handling of the case where no match is found in the lemma dictionary for a given word and POS tag is very pragmatic: we simply return the word unchanged. The method lemmatize() below implements this basic idea by wrapping the access to the lemma dictionary in a try ... except ... block. Thus, if accessing the lemma dictionary fails for a given word and POS tag, a KeyError is raised, and the except block ensures that the input word is returned unchanged.

In [11]:
def lemmatize(word, pos):
    try:
        # Look up the lemma for the lowercased word and the given POS tag
        return word_to_lemma[word.lower()][pos]
    except KeyError:
        # Unknown word or POS tag: return the input word unchanged
        return word
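
As a side note, the same fallback behavior can be expressed without exceptions by chaining dict.get() calls; the sketch below is purely a stylistic alternative and behaves the same way for our dictionary lookup (the name lemmatize_get is our own choice).

In [ ]:
# Alternative sketch: same lookup-with-fallback behavior, using dict.get() with defaults
def lemmatize_get(word, pos):
    return word_to_lemma.get(word.lower(), {}).get(pos, word)

print(lemmatize_get("running", "VERB"))     # known word and POS tag
print(lemmatize_get("runnnnning", "VERB"))  # unknown word: returned unchanged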

Now we can use our dictionary to safely lemmatize any combination of a word and POS tag (even for made-up words and tags). Try it out!

In [12]:
#word = "runnnnning"
word = "running"
ptag = "VERB"

print(f"The lemma for {word} with the POS tag {ptag} is: {lemmatize(word, ptag)}")
The lemma for running with the POS tag VERB is: run

This means we are ready to lemmatize complete text documents. However, recall that we assume that tokenization and POS tagging have already been performed. The code cell below manually defines a sentence as a list of (token/word, POS tag) pairs. Such a list could be the output of a library for tokenization and POS tagging.

In [13]:
# Sentence after Tokenization and POS Tagging
sentence = [
    ("She", "PRON"), 
    ("'s", "AUX"), 
    ("driving", "VERB"), 
    ("faster", "ADJ"), 
    ("than", "SCONJ"), 
    ("allowed", "VERB"),
    (".", "PUNCT")
]

Each (token/word, POS tag) pair contains the required information for running lemmatize() on all items in the list. So let's lemmatize each word/token of our example sentence and look at the results. From our look into word_to_lemma_tmp, we already know that each word/token is an existing key in our lemma dictionary, including the punctuation mark ".". This is not surprising, since all the words in our example sentence are very common and therefore very likely to appear in existing datasets.

In [14]:
lemmas = [lemmatize(word, ptag) for word, ptag in sentence]

print(f"All lemmas for the input sentence: {lemmas}")
All lemmas for the input sentence: ['she', 'be', 'drive', 'fast', 'than', 'allow', '.']

In short, assuming the availability of a suitable dictionary or lexicon, implementing a basic lemmatizer is reasonably straightforward. Of course, more sophisticated lemmatizers go beyond simple dictionary lookups. For example, a lemmatizer may analyze the morphology of words, i.e., their structure in terms of prefixes and suffixes. Some modern lemmatizers use machine learning models trained on large corpora to predict lemmas. These models can account for patterns that might not be explicitly codified and can adapt to new words or usages that may not yet exist in dictionaries. Advanced lemmatizers, especially those based on neural networks, analyze the entire sentence to determine the most appropriate lemma for a word, which is crucial for handling polysemous words (words with multiple meanings). Beyond standard dictionary entries, lemmatizers may also handle non-standard words such as contractions, abbreviations, slang, and misspellings by mapping them to their canonical forms. By combining these techniques, a lemmatizer provides a more nuanced and accurate output than a simple dictionary lookup, especially in complex or ambiguous linguistic contexts.
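
To make the point about context concrete, the small sketch below lemmatizes the word "meeting" in two different sentences. It assumes the nlp object (a loaded spaCy English model) provided by the imports at the top of this notebook and used again in the spaCy section further below; the exact lemmas depend on the loaded model, but the POS tagger will typically assign different tags and therefore different lemmas.

In [ ]:
# Sketch only: the same surface form may receive different lemmas depending on its context
for text in ["We are meeting her tomorrow.", "The meeting was long."]:
    doc = nlp(text)
    for token in doc:
        if token.text.lower() == "meeting":
            print(f"{text!r}: meeting ==[{token.pos_}]==> {token.lemma_}")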

Lemmatization with NLTK¶

Define Lemmatizer Using NLTK¶

The WordNetLemmatizer is a lemmatization tool provided by the Natural Language Toolkit (NLTK), which is a popular library for NLP in Python. NLTK is widely used for various NLP tasks, including lemmatization, and the WordNetLemmatizer is one of the lemmatization options it offers. It is specifically designed to lemmatize English words based on WordNet, a lexical database for English. WordNet organizes words into synsets (sets of synonyms), and each synset is linked to various lemmas representing different word forms. The WordNetLemmatizer in NLTK utilizes WordNet's information and applies lemmatization rules to transform words to their lemmas. It takes into account the part-of-speech (POS) tag of each word and provides options for lemmatizing nouns, verbs, adjectives, and adverbs.

In [15]:
wordnet_lemmatizer = WordNetLemmatizer()

Perform Lemmatization w.r.t. all Word Types¶

The WordNetLemmatizer distinguishes between nouns, verbs, adjectives, and adverbs. This Part-of-Speech information must be provided as input. The four choices of input parameters are n (noun), v (verb), a (adjective), and r (adverb). In the code cell below, we lemmatize each of our example words using these four different word types and inspect the output.

In [16]:
pos_list = ['n', 'v', 'a', 'r']

for word in word_list:
    print (word + ':')
    for pos in pos_list:
        lemmatized_word = wordnet_lemmatizer.lemmatize(word, pos=pos) # default is 'n'
        print ('\t', word, '=[{}]=>'.format(pos), lemmatized_word)
only:
	 only =[n]=> only
	 only =[v]=> only
	 only =[a]=> only
	 only =[r]=> only
accepted:
	 accepted =[n]=> accepted
	 accepted =[v]=> accept
	 accepted =[a]=> accepted
	 accepted =[r]=> accepted
studying:
	 studying =[n]=> studying
	 studying =[v]=> study
	 studying =[a]=> studying
	 studying =[r]=> studying
study:
	 study =[n]=> study
	 study =[v]=> study
	 study =[a]=> study
	 study =[r]=> study
studied:
	 studied =[n]=> studied
	 studied =[v]=> study
	 studied =[a]=> studied
	 studied =[r]=> studied
dogs:
	 dogs =[n]=> dog
	 dogs =[v]=> dog
	 dogs =[a]=> dogs
	 dogs =[r]=> dogs
cats:
	 cats =[n]=> cat
	 cats =[v]=> cat
	 cats =[a]=> cats
	 cats =[r]=> cats
running:
	 running =[n]=> running
	 running =[v]=> run
	 running =[a]=> running
	 running =[r]=> running
phones:
	 phones =[n]=> phone
	 phones =[v]=> phone
	 phones =[a]=> phones
	 phones =[r]=> phones
viewed:
	 viewed =[n]=> viewed
	 viewed =[v]=> view
	 viewed =[a]=> viewed
	 viewed =[r]=> viewed
presumably:
	 presumably =[n]=> presumably
	 presumably =[v]=> presumably
	 presumably =[a]=> presumably
	 presumably =[r]=> presumably
crying:
	 crying =[n]=> cry
	 crying =[v]=> cry
	 crying =[a]=> crying
	 crying =[r]=> crying
went:
	 went =[n]=> went
	 went =[v]=> go
	 went =[a]=> went
	 went =[r]=> went
packed:
	 packed =[n]=> packed
	 packed =[v]=> pack
	 packed =[a]=> packed
	 packed =[r]=> packed
worse:
	 worse =[n]=> worse
	 worse =[v]=> worse
	 worse =[a]=> bad
	 worse =[r]=> worse
best:
	 best =[n]=> best
	 best =[v]=> best
	 best =[a]=> best
	 best =[r]=> best
mice:
	 mice =[n]=> mouse
	 mice =[v]=> mice
	 mice =[a]=> mice
	 mice =[r]=> mice
friends:
	 friends =[n]=> friend
	 friends =[v]=> friends
	 friends =[a]=> friends
	 friends =[r]=> friends
makes:
	 makes =[n]=> make
	 makes =[v]=> make
	 makes =[a]=> makes
	 makes =[r]=> makes

Lemmatization in Practice¶

Usually, we only want to lemmatize each word in a document using its correct word type (i.e., Part-of-Speech). This means that we first need to apply a Part-of-Speech (POS) tagger that tells us the type for each word in a sentence; see the dedicated notebook about POS tagging. In the code cell below, we simply use a POS tagger provided by NLTK.

In [17]:
sentence = "The newest study has shown that cats have mostly a better sense of smell than dogs."

# First, tokenize sentence
token_list = word_tokenize(sentence)

# Second, calculate POS tags for each token
pos_tag_list = pos_tag(token_list)

for pos in pos_tag_list:
    print(pos)
('The', 'DT')
('newest', 'JJS')
('study', 'NN')
('has', 'VBZ')
('shown', 'VBN')
('that', 'IN')
('cats', 'NNS')
('have', 'VBP')
('mostly', 'RB')
('a', 'DT')
('better', 'JJR')
('sense', 'NN')
('of', 'IN')
('smell', 'NN')
('than', 'IN')
('dogs', 'NNS')
('.', '.')

The POS tagger distinguishes several dozen word types. However, we are only interested in whether a word is a noun, verb, adjective, or adverb. We therefore need to map the output of the POS tagger to the 4 valid options "n", "v", "a", and "r"; see above. This is relatively easy to do since we only have to look at the first character of the resulting POS tags: all tags for nouns start with an "N", all tags for verbs start with a "V", all tags for adjectives start with a "J", and all tags for adverbs start with an "R".

In [18]:
print ("\nOutput of NLTK lemmatizer:\n")
for token, tag in pos_tag_list:
    word_type = 'n' # Default if all fails
    tag_simple = tag[0].lower() # Converts, e.g., "VBD" to "v"
    if tag_simple in ['n', 'v', 'r']:
        # If the POS tag starts with "n","v", or "r", we know it's a noun, verb, or adverb
        word_type = tag_simple 
    elif tag_simple in ['j']:
        # If the POS tag starts with a "j", we know it's an adjective
        word_type = 'a' 
    lemmatized_token = wordnet_lemmatizer.lemmatize(token.lower(), pos=word_type)
    print(f"{token} ==[{tag}]==[{word_type}]==> {lemmatized_token}")
Output of NLTK lemmatizer:

The ==[DT]==[n]==> the
newest ==[JJS]==[a]==> new
study ==[NN]==[n]==> study
has ==[VBZ]==[v]==> have
shown ==[VBN]==[v]==> show
that ==[IN]==[n]==> that
cats ==[NNS]==[n]==> cat
have ==[VBP]==[v]==> have
mostly ==[RB]==[r]==> mostly
a ==[DT]==[n]==> a
better ==[JJR]==[a]==> good
sense ==[NN]==[n]==> sense
of ==[IN]==[n]==> of
smell ==[NN]==[n]==> smell
than ==[IN]==[n]==> than
dogs ==[NNS]==[n]==> dog
. ==[.]==[n]==> .
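
For convenience, the steps above (tokenization, POS tagging, tag mapping, and lemmatization) can be bundled into a single helper function. The sketch below shows one way to do this; the name lemmatize_sentence is our own choice and not part of NLTK.

In [ ]:
def lemmatize_sentence(sentence):
    # Tokenize, POS-tag, and lemmatize a sentence with NLTK (helper sketch)
    lemmas = []
    for token, tag in pos_tag(word_tokenize(sentence)):
        # Map the Penn Treebank tag to one of the 4 WordNet word types ('n', 'v', 'a', 'r')
        tag_simple = tag[0].lower()
        if tag_simple in ['n', 'v', 'r']:
            word_type = tag_simple
        elif tag_simple == 'j':
            word_type = 'a'
        else:
            word_type = 'n'  # Fall back to noun for all other tags
        lemmas.append(wordnet_lemmatizer.lemmatize(token.lower(), pos=word_type))
    return lemmas

print(lemmatize_sentence(sentence))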

Lemmatization with spaCy¶

spaCy performs lemmatization by default when processing a document; no additional commands are required. This makes it much more convenient to use than NLTK.

In [19]:
print ("\nOutput of spaCy lemmatizer:\n")
doc = nlp(sentence) # doc is an object, not just a simple list

for token in doc:
    print(f"{token.text} ==[{token.pos_}]==> {token.lemma_}")
Output of spaCy lemmatizer:

The ==[DET]==> the
newest ==[ADJ]==> new
study ==[NOUN]==> study
has ==[AUX]==> have
shown ==[VERB]==> show
that ==[SCONJ]==> that
cats ==[NOUN]==> cat
have ==[VERB]==> have
mostly ==[ADV]==> mostly
a ==[DET]==> a
better ==[ADJ]==> well
sense ==[NOUN]==> sense
of ==[ADP]==> of
smell ==[NOUN]==> smell
than ==[ADP]==> than
dogs ==[NOUN]==> dog
. ==[PUNCT]==> .

Compare the results from NLTK and spaCy. While most words are lemmatized the same way, a noticeable difference is the word "better". Arguably, NLTK does a better job here, as "good" seems to be the more appropriate lemmatized form in this sentence.


Summary¶

Stemming and lemmatization are essential techniques in natural language processing (NLP) that help normalize and reduce words to their base forms. Here is a brief summary of their uses and importance:

  • Stemming:
    • Uses: Stemming is primarily employed in tasks where the exact word form is not crucial, such as information retrieval, indexing, and search engines.
    • Importance: Stemming allows for the reduction of words to their common base form, which helps in matching variations of words, handling inflections, and improving recall in search queries. It reduces the vocabulary size and can enhance computational efficiency.

  • Lemmatization:
    • Uses: Lemmatization is useful in NLP tasks where preserving the semantic meaning and grammatical correctness of words is important, such as machine translation, sentiment analysis, question-answering systems, and language generation.
    • Importance: Lemmatization provides the base or canonical form of words, capturing their underlying meaning. It helps in resolving word variants, handling different inflections, and maintaining the integrity of the language structure. Lemmatization enables better accuracy and precision in language understanding and generation tasks.

  • Overall Importance:

    • Vocabulary Normalization: Stemming and lemmatization help reduce the dimensionality of text data by grouping words with similar meanings. They assist in avoiding redundancy and noise in the data, leading to better generalization and improved performance in NLP models.

    • Language Understanding: By reducing words to their base forms, stemming and lemmatization enhance the ability of NLP systems to understand and process text. They facilitate tasks such as part-of-speech tagging, syntactic parsing, and semantic analysis by providing consistent representations of words.

    • Information Retrieval: Stemming and lemmatization contribute to more effective information retrieval by matching user queries with relevant documents. They improve recall by accounting for different word forms and variations, enabling a broader range of matching possibilities.

    • Text Analysis and Mining: Stemming and lemmatization aid in analyzing and mining large text corpora by simplifying and standardizing word representations. They assist in extracting meaningful patterns, identifying recurring themes, and gaining insights from textual data.

Choosing the appropriate technique (stemming or lemmatization) depends on the specific NLP task, language, and trade-offs between precision, recall, and computational complexity. It is crucial to evaluate and experiment with both techniques to ensure optimal performance and accurate language processing in various NLP applications.

In [ ]: