Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Text Normalization¶

Motivation¶

Natural language is inherently expressive and exhibits significant variation due to its role as a primary medium for human communication. It allows individuals to convey not just factual information but also emotions, intentions, cultural nuances, and creativity. This expressiveness enables a rich diversity of expression, ranging from poetic language to technical jargon, informal slang, and structured academic writing. The same message can often be communicated in countless ways, influenced by context, culture, and personal style.

One main dimension of variation — the one that text normalization aims to address — arises from morphology and syntax. Morphological variation in text documents increases complexity for natural language processing (NLP) tasks because it introduces a high degree of word form variation that can make it difficult for machines to accurately understand and process language. Morphology refers to the structure and form of words, including how words are constructed from roots and affixes (such as prefixes, suffixes, and infixes). In languages with rich morphology, a single root word can take many different forms depending on tense, aspect, number, gender, case, and other grammatical features. The table below lists some of the most common linguistic phenomena that result in morphological and syntactic variations.

Linguistic Phenomenon | Description | Example Before Normalization
Spelling Variations | The same word may have alternate spellings, often due to regional differences. | "color / colour"
Case Sensitivity | The same word may appear with different capitalization. | "HELLO / hello"
Inflected Forms | Verbs may have different forms depending on the tense; nouns may have different singular and plural forms. | "run / runs / ran / running", "foot / feet"
Contractions | Words or phrases may appear in contracted or expanded form. | "I am / I'm", "will not / won't"
Stop Words | Some words may not contribute useful information to the task at hand. | "a / an", "the", "and", "or", "from", "with"
Punctuation | Punctuation that might interfere with tokenization or processing is removed or standardized. | "Hello!!!" / "Hello"
Non-Standard Words | A text may contain words that are not common dictionary words, particularly in user-generated content on social media. | "u (you)", "gr8 (great)", "lol (laughing out loud)"
Numerical Expressions | The same number can be represented differently using digits and words. | "1000 / 1000.00 / 1,000 / 1k / one thousand"
Compound Words | Compounds may be spelled differently depending on conventions. | "ice-cream / ice cream"
Unicode Variants | Characters that are visually similar may in fact be different Unicode characters. | "fi, fi" (ligature and standard)
Diacritics and Accents | The same word may be spelled with or without diacritical marks. | "café / cafe"
Special Characters | Characters like emojis, symbols, and hashtags are removed or replaced. | "@Selene Hello 😊 #AI"
Whitespace | Unnecessary spaces or line breaks are removed. | "Hello   world!"
Ambiguity in Abbreviations | Abbreviations are expanded or standardized for better comprehension. | "e.g.", "i.e."

These variations in text documents can complicate tasks like text classification, information retrieval, machine translation, and sentiment analysis since they introduce data sparsity issues, especially in languages with complex inflectional systems. A large corpus of text may contain many different forms of a single root word, making it difficult for machine learning models to capture the full range of forms. For example, a machine learning model trained on a corpus might encounter only a small subset of possible word forms, leading to problems when it needs to process unseen forms. If a model only encounters the word "run" during training, it may fail to understand the various forms such as "running," "ran," or "runner" in future documents.

The goal of text normalization is to reduce morphological and syntactic variations and convert documents into a canonical form — that is, to transform raw, unstructured text into a standard, consistent representation that makes it easier for NLP systems to analyze and process. The goal is to reduce the complexity and variability inherent in natural language, simplifying the text to its most basic, meaningful components. This often means removing different word forms, typographical errors, informal language, or inconsistencies that could confuse machine learning models or complicate analysis.

Preliminaries¶

Scope¶

Text normalization is a crucial preprocessing step in NLP, but its implementation varies significantly depending on the specific task or application. For instance, in sentiment analysis, normalization might involve converting text to lowercase, removing punctuation, or expanding contractions to standardize inputs without altering the sentiment-bearing elements. In contrast, machine translation systems might require more nuanced normalization, such as handling diacritics or resolving locale-specific variations, to preserve linguistic accuracy and ensure correct translations.

The dependency on the application stems from the need to balance retaining meaningful information with simplifying the text for the model. In tasks like named entity recognition (NER), it may be essential to preserve case sensitivity and special characters that contribute to identifying entities. On the other hand, applications such as search engine optimization might focus on normalizing text for keyword matching by removing stop words or stemming words. This task-specific adaptability underscores the importance of tailoring normalization approaches to the objectives and constraints of the given NLP application.

As such, there is no single fixed series of normalization steps that need to be performed every time. In this notebook, we cover a wide range of the most common normalization steps and include some discussion for which use cases they are more suitable than others. Keep in mind, however, that the covered steps are not comprehensive. Many NLP tasks or applications might involve text that requires custom normalization (e.g., handling LaTeX math formulas, domain-specific vocabulary such as chemical compounds, or mixed-language content).

Strings vs Token Lists¶

Different normalization steps may be more easily applied before or after tokenization. For example, changing a text to all lowercase characters is equally straightforward when considering a text as a single string or as a list of tokens. Removing stopwords, numbers, or punctuation marks is generally easier when a text has already been tokenized — although doing so with a string is still very much possible. For the normalization steps covered in this notebook, we either look at both approaches or the more convenient one. Apart from showing the "manual" implementation of different normalization steps, we also show, if applicable, how they can be performed using common text processing libraries. In this notebook, we consider spaCy for those examples.
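
As a small illustration, the sketch below applies the same two steps (lowercasing and punctuation removal) once to a raw string and once to a hypothetical, already tokenized token list. It is only meant to show that both routes are viable; the string variant leans on a regular expression, the token variant on a simple filter. The example text is made up for illustration.

import re
import string

text   = "Hello, World! This is GREAT."
tokens = ["Hello", ",", "World", "!", "This", "is", "GREAT", "."]

# String-based: strip punctuation with a regular expression, then lowercase
print(re.sub(r"[^\w\s]", "", text).lower())

# Token-based: filter out punctuation tokens and lowercase the rest
print([ token.lower() for token in tokens if token not in string.punctuation ])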

Setting up the Notebook¶

Make Required Imports¶

This notebook requires the import of different Python packages but also additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have successfully been imported.

In [1]:
from src.utils.libimports.textnorm import *
from src.text.preprocessing.normalizing import EmoticonNormalizer
from src.utils.data.files import *

Download Required Data¶

Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws any error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.

In [2]:
slang_dictionary, _    = download_dataset("text/lexicons/normalization/slang-dictionary.csv")
english_vocabulary, _  = download_dataset("text/lexicons/normalization/vocabulary-american-english.csv")
british_to_american, _ = download_dataset("text/lexicons/normalization/british-to-american.csv")
emoji_mapping, _       = download_dataset("text/lexicons/normalization/emoji-mapping-subset.csv")
File 'data/datasets/text/lexicons/normalization/slang-dictionary.csv' already exists (use 'overwrite=True' to overwrite it).
File 'data/datasets/text/lexicons/normalization/vocabulary-american-english.csv' already exists (use 'overwrite=True' to overwrite it).
File 'data/datasets/text/lexicons/normalization/british-to-american.csv' already exists (use 'overwrite=True' to overwrite it).
File 'data/datasets/text/lexicons/normalization/emoji-mapping-subset.csv' already exists (use 'overwrite=True' to overwrite it).

Case Folding¶

Case folding is a fundamental step in text preprocessing used to standardize text data by converting all characters to the same case, typically lowercase. This standardization ensures that words with the same meaning but different capitalization are treated as identical. For example, "Data," "DATA," and "data" would all be converted to "data." This process reduces redundancy in text data, simplifying analysis and improving the performance of natural language processing (NLP) models.

One of the primary purposes of case folding is to enhance the efficiency and accuracy of text-matching algorithms, such as search engines or sentiment analysis tools. By eliminating case distinctions, case folding ensures that queries and documents align regardless of capitalization. For instance, a search for "Python" would return results for "python," "PYTHON," and "PyThOn" equally. This uniformity makes case-insensitive matching possible, which is crucial for tasks like keyword extraction and information retrieval.

In addition, case folding reduces the complexity of text processing by minimizing the number of unique tokens in a dataset. This is particularly important for machine learning models, as it reduces the dimensionality of input data and ensures that models focus on semantic meaning rather than superficial differences. For example, without case folding, a model might interpret "AI" and "ai" as separate entities, leading to inefficiency and potential inaccuracies.

Application¶

Converting a text to all lowercase or all uppercase is one of the easiest text normalization steps to perform. All modern programming languages or tools support strings and provide built-in methods for this task. In the case of Python, these methods are lower() and upper(). Here is a simple example:

In [3]:
text = "That movie was AMAZING! I went to see it twice."

print(text)
print(text.lower())
print(text.upper())
That movie was AMAZING! I went to see it twice.
that movie was amazing! i went to see it twice.
THAT MOVIE WAS AMAZING! I WENT TO SEE IT TWICE.

Discussion¶

While case folding is a valuable tool for text normalization, it does come with certain risks that can impact the quality and accuracy of downstream tasks. These risks primarily arise from the loss of information carried by capitalization, which can be critical in some contexts.

  • Loss of semantic information: Capitalization often conveys meaning, and removing it through case folding can result in ambiguity. For instance, "Apple" (the company) and "apple" (the fruit) have distinct meanings, but case folding would treat them as identical. This loss of differentiation can negatively affect tasks like named entity recognition (NER) or sentiment analysis, where proper nouns and specific terms are significant.

  • Challenges in domain-specific texts: In domains like legal or medical texts, capitalization may follow specific conventions to denote importance, abbreviations, or categories. For example, "HIV" (the virus) and "hiv" (a casual or misspelled term; e.g., a misspelling of "hive") could convey very different contexts. Case folding could lead to misinterpretation or reduced precision in such cases.

  • Issues with acronyms and initialisms: Acronyms and initialisms like "NASA" and "nasa" might lose their emphasis or context when converted to lowercase. In some applications, especially those involving technical or formal documents, this can result in decreased readability or unintended changes in meaning.

  • Impact on style and formatting: Case folding might strip text of stylistic or formatting nuances important for specific tasks, such as analyzing social media posts, where capitalization might indicate emphasis (e.g., "THIS is important" vs. "this is important").

To mitigate these risks, it is crucial to consider the context and goals of the task before applying case folding. In scenarios where capitalization carries important meaning, additional preprocessing steps or hybrid approaches that selectively apply case folding may be necessary to retain essential information while standardizing the text.
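
To illustrate one such hybrid approach, the code below is a minimal sketch of selective case folding that lowercases tokens unless they look like acronyms or initialisms (two or more characters, all uppercase). The heuristic and the helper name selective_lower() are assumptions made for illustration, not a fixed recipe.

def selective_lower(token):
    # Keep likely acronyms/initialisms (e.g., "NASA", "HIV", "AI") unchanged
    if len(token) >= 2 and token.isupper():
        return token
    return token.lower()

tokens = ["The", "NASA", "Mission", "Was", "Delayed"]

print([ selective_lower(token) for token in tokens ])
# ['the', 'NASA', 'mission', 'was', 'delayed']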


Punctuation & Non-Word Removal¶

For fundamental NLP tasks such as text classification, it is typically the "normal" words — in crude terms, every token that can be found in a standard dictionary — that matter most. In contrast, punctuation marks, URLs, email addresses, and specific numbers often carry hardly any meaning and could even be considered noise depending on the task. For example, consider the task of classifying to which news category an article belongs (e.g., politics, economy, sports, technology, weather, etc.). Words such as "match", "player", "ball", "goal", "halftime" and similar are much more indicative that an article talks about sports than a specific date or number/digit (representing the result of a game). Thus, to allow a classifier to focus on the most relevant parts of a document, punctuation marks and other non-words are often simply removed.

Application¶

In general, the decision whether and which types of tokens should be removed depends on the application use case. In the code cell below, we make the simple assumption that we want to remove all tokens that are not completely composed of letters. Or simply speaking, we want to remove each token that is not a word. The isalpha() method in Python is a string method that checks if all the characters in a string are alphabetic (letters of the alphabet) and returns True if they are, and False otherwise. It does not accept numbers, spaces, or special characters, and the string must contain at least one character for it to return True. For example, "hello".isalpha() would return True, whereas "hello123".isalpha() or "".isalpha() would return False. This method is commonly used to validate input or filter strings for purely alphabetic content, and makes our task very simple.

In [4]:
tokens = ["Yesterday", "@", "9.30", "pm", ",", "the", "match", "ended", "0:0", "--", "all", "players", "left", "dissappointed", "."]

def is_invalid_token(token):
    if token.isalpha() is False:
        return True
    return False

print([ token for token in tokens if is_invalid_token(token) is False ])
#print([ token for token in tokens if token.isalpha() is True ]) # Simpler but less flexible
['Yesterday', 'pm', 'the', 'match', 'ended', 'all', 'players', 'left', 'dissappointed']

For our simple example, the method is_invalid_token() is not really needed as it only wraps the isalpha() method. However, it should be easy to see how is_invalid_token() can be extended to include other, and potentially more intricate, conditions to test if a token is invalid and should therefore be removed. Since basic token categories such as words, numbers, and punctuation marks are common filter criteria, spaCy automatically derives the following attributes for each token in a text when using the default analysis pipeline:

  • is_alpha: Token text consists of alphabetic characters
  • is_digit: Token text consists of digits
  • is_punct: Token is a punctuation character

Using these three attributes, we can easily implement the removal of punctuation, numbers, and everything else that is not a "normal" word:

In [5]:
text = "Yesterday @ 9.30 pm, the match ended 0:0 -- all players left dissappointed."

print([ token.text for token in nlp(text) if token.is_alpha == True and token.is_punct == False and token.is_digit == False ])
['Yesterday', 'pm', 'the', 'match', 'ended', 'all', 'players', 'left', 'dissappointed']
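
Beyond these three attributes, spaCy also provides flags such as like_url and like_email, which can be used to also drop the URLs and email addresses mentioned above as typical non-words. The following is a minimal sketch, assuming the same nlp pipeline object as before; the example sentence is made up for illustration.

text = "Contact test@example.com or visit https://example.com for the 2024 schedule."

def keep_token(token):
    # Drop punctuation, digits, URLs, and email addresses
    return not (token.is_punct or token.is_digit or token.like_url or token.like_email)

print([ token.text for token in nlp(text) if keep_token(token) ])
# Roughly: ['Contact', 'or', 'visit', 'for', 'the', 'schedule']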

Discussion¶

The removal of punctuation and non-words is typically meaningful for Natural Language Processing (NLP) tasks where the focus is on analyzing the semantic or syntactic meaning of text without the noise introduced by punctuation or irrelevant symbols. The two most common scenarios are, for one, text classification tasks, where removing non-words helps models focus on meaningful features such as word frequencies. The second common application use case is search and information retrieval, where non-words and punctuation may interfere with query matching or ranking algorithms, so preprocessing often involves their removal.


Stopword Removal¶

Stopwords are common words in a language that are often filtered out in text preprocessing because they carry little to no significant meaning for text analysis or modeling tasks. Examples of stopwords in English include "a", "an", "the", "and", "in", "on", "of", and "to". These words are ubiquitous and usually do not contribute much to the context or semantics of the text being analyzed. There are various reasons why stopwords are removed:

  • Reduce noise in text: Stopwords often clutter text data without adding meaningful information. For example, in the sentence "The cat is on the mat", the significant words for analysis are likely "cat" and "mat," while "the", "is", and "on" provide grammatical structure but little semantic value. Removing stop words helps focus on the core content.

  • Improve computational efficiency: Text data can be large, and processing irrelevant words increases computational overhead. Removing stopwords reduces the size of the text corpus and simplifies subsequent steps, like tokenization and vectorization, thereby speeding up the analysis or training process.

  • Enhance model performance: Machine learning models can become distracted by irrelevant features. For instance, including stopwords in text classification or sentiment analysis may dilute the importance of meaningful words, leading to less accurate predictions. Removing stopwords ensures the model focuses on content that carries actual weight.

  • Reduce dimensionality: Stopwords are often among the most frequent tokens in a dataset. By eliminating them, the dimensionality of text representations, such as term frequency-inverse document frequency (TF-IDF) matrices or word embeddings, is reduced. This makes the data less sparse and easier to analyze.

  • Improve clarity in information retrieval: In search engines or recommendation systems, stopwords can lead to irrelevant matches if retained. For instance, searching for "history of computers" might yield more relevant results when stopwords like "of" are removed, leaving "history" and "computers" as the primary keywords.

Application¶

Once we have decided which words to consider stopwords, their removal is rather straightforward to implement, particularly if we assume that an input text has already been tokenized. In this case, we only need to check whether each token is in the predefined set of stopwords. However, using regular expressions, removing stopwords directly from a text string is equally straightforward. For the following example, we define a very small list of stopwords and also assume a simple sentence from which we want to remove the stopwords.

In [6]:
stopwords = ["a", "an", "the", "not", "and", "or", "but", "to"]

text = "Alice and Bob went to KFC but did not eat anything"

We can now define a Regular Expression that matches any of the predefined stopwords. For this we can make use of the \b anchor. This anchor in Regular Expressions represents a word boundary. It matches the position between a word character (letters, digits, or underscore: [a-zA-Z0-9_]) and a non-word character (anything else, such as spaces, punctuation, or the start/end of a string). It does not consume characters itself but ensures that a match occurs only at the boundary of a word. For example, the pattern \bcat\b matches the word "cat" when it stands alone (e.g., in "cat" or "the cat is here"), but not in "catch" or "scatter". This makes the \b anchor particularly useful for tasks such as searching for whole words, enforcing word-level constraints, or tokenizing text. So let's remove the stopwords from the example sentence; see the code cell below. Notice that we also require some minor additional cleaning steps to remove unnecessary whitespace that may occur after removing stopwords. Depending on the next preprocessing steps, such a cleaning of the string might not be needed.

In [7]:
pattern = r"|".join([ r"\b({})\b".format(w) for w in stopwords ])

text = re.sub(pattern, r"", text, flags=re.I)  # Remove stopwords
text = re.sub(r"\s+", r" ", text)              # Remove duplicate whitespace (introduced by removing stopwords)
text = text.strip()                            # Remove trailing whitespace (needed if the string started or ended with a stopword)

print(text)
Alice Bob went KFC did eat anything

Of course, we can also show the removal of stopwords assuming the text was already tokenized, making it even simpler.

In [8]:
tokens = [ token for token in text.split() if token not in stopwords ]

print(tokens)
['Alice', 'Bob', 'went', 'KFC', 'did', 'eat', 'anything']

The consideration whether a token is a stopword or not is common for many NLP tasks. As such, libraries such as spaCy often include this annotation of tokens as part of the analysis. For example, spaCy derives for each token the Boolean attribute is_stop that specifies if a token is a stopword or not. Keep in mind, however, that under the hood, spaCy also assumes a predefined list of stopwords for this analysis step. Thus, using spaCy, we can remove the stopwords from our example sentence as follows:

In [9]:
doc = nlp(text)

tokens = [ token.text for token in doc if token.is_stop == False ]

print(tokens)
['Alice', 'Bob', 'went', 'KFC', 'eat']

Discussion¶

Removing stop words is a common normalizing step, but it comes with important caveats that can impact the quality of the analysis or model performance. These caveats highlight situations where stopwords may carry significant meaning or where their removal could lead to unintended consequences.

  • Loss of semantic meaning: Stopwords, though seemingly insignificant, can be crucial for conveying relationships or context in text. For instance, in idiomatic expressions like "out of the blue" or "in the dark", the stopwords are essential to the meaning. Similarly, in sentiment analysis, words like "not" are sometimes treated as stopwords, but their removal could invert the sentiment of a phrase (e.g., "not happy" vs. "happy").

  • Domain-specific importance: In certain domains, stopwords might carry valuable information. For example, in legal and financial texts, words like "and" or "or" can denote critical logical relationships (e.g., in a contract, "A and B" implies something different from "A or B"). In biomedical or scientific texts, articles like "the" might specify a unique entity or subject, and their removal could change the context or make the text ambiguous.

  • Phrase-level context: Removing stop words may disrupt the structure of multi-word terms or phrases, potentially reducing the ability to understand the text. For example, the bigram "New York" might lose its meaning if "New" is removed as a stopword, leaving just "York". Similarly, in "The University of Singapore", removing "of" and "the" might reduce clarity or mislead downstream tasks.

  • Impact on grammatical integrity: Stop words contribute to the grammatical structure of sentences. Removing them can make sentences harder to parse and interpret, especially in tasks like text summarization or machine translation where sentence fluency is important.

The decision whether to remove or retain stopwords comes down to whether stopwords are likely to play a critical role for the specific NLP task or application, whether by assumption, expectation, or established fact. For example, in the case of basic text classification, the presence or absence of discriminative words typically matters more than grammatical integrity. Here, stopwords can typically be safely removed, and doing so may even improve the performance of the task. However, particularly for tasks where "all words matter" (e.g., text summarization, machine translation, question answering), stopwords matter too and should therefore not be removed. In summary, while removing stopwords can streamline NLP tasks, it is essential to carefully evaluate their importance based on the context, domain, and goals of the analysis. An overly aggressive approach to stopword removal can lead to loss of information or misinterpretation.

To give a concrete example, look again at our example sentence "Alice and Bob went to KFC but did not eat anything". After stopword removal, we were left with "Alice Bob went KFC eat". Depending on the exact use case, the meaning of the sentence may have been significantly altered, as it might have been important to capture that Alice and Bob did not actually eat there; see "Loss of semantic meaning" above. By removing words/tokens such as "not", "n't", "never", "neither", etc., we run the risk of losing the linguistic phenomenon of negation. Again, whether this is critical or not depends on the task or application. It should be easy to see that applications such as sentiment analysis will be negatively affected if negation is not properly captured. A common solution is therefore to use a customized stopword list: for example, start with a common stopword list but remove all words that might indicate negation; a minimal sketch using spaCy is shown below.
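
The following is a minimal sketch, assuming the same nlp pipeline object as in the examples above, of how negation words can be excluded from spaCy's stopword handling so that they survive the filtering step. The set of negation words below is an illustrative choice, not an exhaustive list.

negations = {"not", "no", "never", "neither", "nor"}

# Tell spaCy that these lexemes should not be flagged as stopwords
for word in negations:
    nlp.vocab[word].is_stop = False

doc = nlp("Alice and Bob went to KFC but did not eat anything")

print([ token.text for token in doc if token.is_stop == False ])
# The negation is now retained, e.g.: ['Alice', 'Bob', 'went', 'KFC', 'not', 'eat']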


Stemming and Lemmatization¶

Consider the following two sentences:

  • "Dogs make the best friends."
  • "A dog makes a good friend."

Semantically, both sentences are essentially conveying the same message, but syntactically they are very different since the vocabulary is different: "Dogs" vs. "dog", "make" vs. "makes", "friends" vs. "friend". This is a big problem when comparing documents or when searching for documents in a database. For example, when one uses "dog" as a search term, both sentences should be returned and not just the second one. Stemming and lemmatization are two common techniques used in natural language processing (NLP) for text normalization. Both methods aim to reduce words to their base or root forms, but they differ in their approaches and outcomes.

  • Stemming: Stemming is a process of reducing words to their "stems" by removing prefixes and suffixes, typically through simple heuristic rules. The resulting stems may not always be actual words. The goal of stemming is to normalize words that have the same base meaning but may have different inflections or variations. For example, stemming the words "running", "runs", and "runner" would result in the common stem "run". A popular stemming algorithm is the Porter stemming algorithm.

  • Lemmatization: Lemmatization, on the other hand, is a more advanced technique that aims to transform words to their "lemmas," which are the base or dictionary forms of words. Lemmatization takes into account the morphological analysis of words and considers factors such as part-of-speech (POS) tags to determine the correct lemma. The output of lemmatization is usually a real word that exists in the language. For example, lemmatizing the words "running" and "runs" would yield the lemma "run", assuming that both words are used as verbs in a given sentence; "runner" is a noun and would not be lemmatized to "run". Lemmatization requires more linguistic knowledge and often relies on dictionaries or language-specific resources.

Both stemming and lemmatization are methods to normalize documents on a syntactic level. Often the same words are used in different forms depending on their grammatical use in a sentence. The choice between stemming and lemmatization depends on the specific NLP task and its requirements. Stemming is a simpler and faster technique, often used when the exact word form is not critical, such as in information retrieval or indexing tasks. Lemmatization, being more linguistically sophisticated, is preferred in tasks where the base form and the semantic meaning of words are important, such as in machine translation, sentiment analysis, or question-answering systems. It's also important to note that stemming and lemmatization may not always produce the same results, and the choice between them should consider the trade-offs between accuracy and computational complexity.
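
As a point of reference before we turn to lemmatization with spaCy below, the following is a minimal sketch of stemming using NLTK's Porter stemmer. It assumes the nltk package is installed; nltk is not part of this notebook's required imports.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

for word in ["running", "runs", "runner", "universal", "universe"]:
    print(word, "->", stemmer.stem(word))
# "running" and "runs" are both reduced to the stem "run"; note that the
# resulting stems are not guaranteed to be real dictionary words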

Application¶

Compared to, say, stopword and punctuation removal, stemming and lemmatization are less trivial text normalization tasks, and we cover them in a separate notebook. However, due to their importance, many, if not most, existing NLP libraries support either stemming, lemmatization, or both out of the box. For example, spaCy performs lemmatization as part of its default analysis pipeline to map each word/token to its respective lemma. The lemma_ attribute of a spaCy token provides the lemmatized form of the token, which is its base or dictionary form. In spaCy, lemma_ is a string, while lemma gives the corresponding hash integer value. This makes lemmatizing a text very simple:

In [10]:
print([ token.lemma_ for token in nlp("Dogs make the best friends.") ])
print([ token.lemma_ for token in nlp("A dog makes a good friend.") ])
['dog', 'make', 'the', 'good', 'friend', '.']
['a', 'dog', 'make', 'a', 'good', 'friend', '.']

To further improve the example, we can also combine lemmatization with stopword and punctuation removal as seen before:

In [11]:
print([ token.lemma_ for token in nlp("Dogs make the best friends.") if token.is_stop is False and token.is_punct is False ])
print([ token.lemma_ for token in nlp("A dog makes a good friend.") if token.is_stop is False and token.is_punct is False ])
['dog', 'good', 'friend']
['dog', 'make', 'good', 'friend']

Notice how our two example sentences, which were syntactically very different in the beginning, are now much more similar and appropriately reflect the similar (and almost identical) semantics.

Discussion¶

Stemming and lemmatization can greatly reduce variation in documents, as variations caused by different tenses of verbs or singular/plural forms of nouns are very common. Still, like most common text normalization steps, stemming and lemmatization only affect variations with respect to the morphology and syntax of words and not variation due to synonymy. For example, the sentence "Canines are inclined to be excellent companions." arguably conveys the same meaning as "Dogs make the best friends." However, their canonical forms after lemmatization (and stopword/punctuation removal) would still be very different. When deciding whether to use stemming or lemmatization, the application context is a critical consideration. Stemming is computationally faster and might suffice for tasks where precision in word meaning is less critical, such as search engine indexing or topic modeling. However, the crudeness of stemming can lead to errors, such as treating "universal" and "universe" as equivalent. Lemmatization, being more sophisticated, is suited for applications requiring linguistic accuracy, like sentiment analysis or machine translation, but it is computationally more intensive and requires access to a lexicon or Part-of-Speech (POS) tagger.

There are scenarios where stemming or lemmatization might not be necessary or beneficial. For example, in applications like named entity recognition or when analyzing specific domains like legal or medical texts, reducing terms to their root forms could remove critical nuances. Similarly, if the corpus relies on inflections or derivations to convey meaning, such as poetry or language with complex morphological rules, stemming or lemmatization could distort the interpretation. In such cases, preserving the original word forms might be more appropriate. Another consideration is language diversity. Stemming and lemmatization tools are typically designed for specific languages and may perform poorly with multilingual corpora. Additionally, for embeddings or transformer-based models, these techniques are often unnecessary because such models process tokens in their original forms and derive contextual meaning directly. Thus, the decision to apply stemming or lemmatization should be guided by the specific requirements and constraints of the NLP task at hand.


Unicode Variants¶

Unicode is a universal character encoding standard designed to ensure that text and symbols from all the world's writing systems can be consistently represented, processed, and displayed across different platforms and devices. Developed by the Unicode Consortium, it assigns a unique code point (a numeric identifier) to every character, symbol, or glyph, regardless of the platform, program, or language. This eliminates ambiguities and inconsistencies caused by earlier encoding systems like ASCII, which were limited to specific languages or character sets.

A Unicode character refers to any textual or symbolic element encoded in the Unicode standard, ranging from letters, numbers, and punctuation marks to emojis, mathematical symbols, and characters from ancient scripts. For instance, the English letter "A" has the Unicode code point U+0041, while the emoji 😊 has U+1F60A. These code points are typically represented in hexadecimal notation, and Unicode supports over 149,000 characters as of its latest version, covering more than 150 scripts. Unicode's broad scope and standardization have made it indispensable for modern computing, enabling seamless text exchange and display across diverse languages and applications. Whether you're reading an email in Japanese, viewing a webpage with Arabic text, or sending emojis in a message, Unicode ensures consistent interpretation and rendering of characters.

However, Unicode also poses challenges when working with text data. For example, some Unicode characters look very similar because the Unicode standard aims to encode every character from all the world's writing systems, including those with overlapping visual designs. This can lead to the inclusion of characters that are nearly indistinguishable in appearance but are distinct in their linguistic, cultural, or technical usage. To give a simple example, have a look at the following three Unicode characters:

In [12]:
print("\U00000027") # Apostrophe (equivalent to ASCII character)
print("\U00002019") # Right Single Quotation Mark
print("\U000002BC") # Modifier Letter Apostrophe
'
’
ʼ

All three characters can be used to represent an apostrophe. The Unicode Standard 16.0, Section 6.2.7 Apostrophes, actually states:

U+0027 APOSTROPHE is the most commonly used character for apostrophe. For historical reasons, U+0027 is a particularly overloaded character. In ASCII, it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apostrophe modifier or acute accent). [...] When text is set, U+2019 RIGHT SINGLE QUOTATION MARK is preferred as apostrophe. [...] U+02BC MODIFIER LETTER APOSTROPHE is preferred where the apostrophe is to represent a modifier letter (for example, in transliterations to indicate a glottal stop). [...] An implementation cannot assume that users' text always adheres to the distinction between these characters. [...]

And this is just for a single character, the apostrophe; the situation is very similar for many other characters such as hyphens/dashes, punctuation marks, letters, and more.

Another issue is that the same — that is, same-looking — character may be represented by different Unicode code points. For example, the German umlaut "ä" has its own single code point U+00E4. However, Unicode also supports multi-code point characters. These are sequences of multiple code points that together represent a single visual or semantic unit. While many characters are represented by a single code point, others require multiple code points to fully encode. A common example of using multiple code points are diacritics. Diacritics are small marks or symbols added to letters to modify their pronunciation, tone, or meaning. They are commonly used in many languages to indicate aspects like vowel quality, stress, intonation, or nasalization. For example, the acute accent ("é"), grave accent ("è"), and umlaut ("ä") are diacritics that alter how the base letter is pronounced. Diacritics can also serve grammatical purposes, such as distinguishing homonyms (e.g., "resume" vs. "résumé") or marking emphasis. In Unicode, diacritics are often represented as combining characters that attach to a base letter, allowing for a flexible representation of diverse scripts. For example, the umlaut "ä" can be encoded by combining the code point for "a" and the code point for the diacritic representing the two dots above a character. Thus, we can print the umlaut "ä" in two different ways:

In [13]:
print("\U000000E4")
print("\U00000061\U00000308")
ä
ä

Application¶

Converting different Unicode characters that essentially have the same appearance to a canonical form typically requires some hand-crafted mapping, since there is no single agreed-upon convention. Such a mapping can be done by creating and curating a dictionary that maps Unicode code points to their canonical form. For example, the code cell below shows a simple solution that maps the two code points for the umlaut "ä" to the ASCII character "a".

In [14]:
unicode_map1 = {"\U000000E4": "a", "\U00000061\U00000308": "a"}

text = u"Der Bär lässt sich ärgern" # German for "The bear let itself get taunted" (silly; just to have multiple umlauts)

print(''.join(idx if idx not in unicode_map1 else unicode_map1[idx] for idx in text))
Der Bar lasst sich argern

Given the vast size of the Unicode standard, compiling such a dictionary for all (relevant) code points takes a lot of effort. However, there are libraries available to implement such a mapping. For example, the unidecode library in Python is a utility that transliterates Unicode text into ASCII characters. It is particularly useful for converting non-Latin scripts, accented characters, and special symbols into their closest Latin equivalents. For example, "ä" becomes "a", and "你好" becomes "Ni Hao". This is helpful in scenarios like data preprocessing, where ASCII-only text is required, such as in URL generation, search indexing, or working with systems that do not support Unicode. By preserving the readability of the original text as much as possible, unidecode makes it easier to handle multilingual or accented input in ASCII-restricted environments. Let's use this library to convert our German example sentence from above.

In [15]:
print(unidecode(text))
Der Bar lasst sich argern

Discussion¶

While libraries such as unidecode can make life quite simple, they assume that their output is indeed what a given application use case expects or requires. For one, unidecode converts Unicode characters to ASCII characters, but the desired canonical representation of a text may still contain Unicode characters. For example, we could use our mapping simply to map both representations of the same umlaut to the same code point:

In [16]:
unicode_map2 = {"\U000000E4": "\U000000E4", "\U00000061\U00000308": "\U000000E4"}

print(''.join(idx if idx not in unicode_map2 else unicode_map2[idx] for idx in text))
Der Bär lässt sich ärgern

But we might also want to map Unicode characters to different characters or sets of characters. For example, in German, the umlaut "ä" is commonly replaced with the digraph "ae" if only the Latin alphabet is available. This means that a German using an English keyboard would write the example sentence from before as "Der Baer laesst sich aergern". Of course, with a custom, manually curated dictionary for the mapping, this is simple to implement, e.g.:

In [17]:
unicode_map3 = {"\U000000E4": "ae", "\U00000061\U00000308": "ae"}

print(''.join(idx if idx not in unicode_map3 else unicode_map3[idx] for idx in text))
Der Baer laesst sich aergern

To sum up, working with Unicode for text normalization is challenging due to the vast diversity of scripts, languages, and character representations it supports. Unicode is designed to be a universal encoding system, but this universality introduces complexities. A single character can often be represented in multiple ways which look identical but have different underlying code points. Ensuring that these representations are treated equivalently requires normalization, a process that can be computationally intensive and error-prone when dealing with large datasets or multiple languages. Mistakes in choosing the appropriate normalization form can lead to inconsistencies or data loss. The cultural and linguistic diversity supported by Unicode introduces edge cases that make normalization even harder. Scripts like Arabic, Indic scripts, or Chinese may involve context-dependent rendering, bidirectional text, or complex glyph shaping, all of which can complicate the process. In practice, text normalization requires not only technical expertise in Unicode standards but also awareness of the linguistic nuances involved, making it a multifaceted challenge in software development.
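
To make the notion of normalization forms mentioned above concrete, the following is a minimal sketch using Python's standard unicodedata module: applying normalization form NFC composes the two-code-point spelling of "ä" from earlier into the single precomposed code point, so both spellings compare as equal afterwards.

import unicodedata

single   = "\U000000E4"            # precomposed "ä"
combined = "\U00000061\U00000308"  # "a" + combining diaeresis

print(single == combined)          # False: different code point sequences
print(unicodedata.normalize("NFC", single) == unicodedata.normalize("NFC", combined))  # True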

Lastly — beyond letters, punctuation marks, diacritics, and similar — the Unicode standard also includes emojis. Emojis are pictorial symbols used in digital communication to express emotions, ideas, objects, or actions. Each emoji is defined in the Unicode standard, with many represented as single code points, while others are formed using multiple code points, often combined with modifiers for skin tone, gender, or family compositions. Emojis add context, tone, and personality to messages, making them more engaging and expressive. While emojis are part of the Unicode standard, due to their special meanings, we discuss their potential normalization separately later.


Handling Non-Standard Words¶

Internet Slang¶

Internet slang refers to informal, non-standardized language used in online communication, such as instant messaging, online forums, social media, and other digital platforms. It evolves rapidly and reflects the casual, fast-paced nature of digital communication. Internet slang often includes abbreviations, acronyms, and creative spellings that help users express themselves quickly and efficiently. Common examples are "gr8 (great)", "yolo (you only live once)", "brb (be right back)", "omg (oh my god)", and many more. Given the popularity of Internet slang in online conversations, it is perfectly valid to treat such words "as is" without any form of normalization, even if they are not (yet!) in a standard English dictionary. This is particularly true if the text is used to train a new model based on this larger vocabulary. After all, to any algorithm "gr8" is just as much a word as "great"; the vocabulary is just slightly inflated.

However, the size of the vocabulary might be an issue, or we might have to rely on a model that has been trained with a vocabulary that does not contain Internet slang, in which case the slang terms would need to be treated as out-of-vocabulary (OOV) terms. In this situation, it can be meaningful to normalize slang words by converting them into their corresponding standard words — that is, to convert the OOV terms to terms that are (more likely to be) in the vocabulary.

Application¶

The most common approach to handle slang words is the use of dictionaries or lexicons. Such precompiled slang dictionaries map slang terms to their standard equivalents (e.g., "brb" $\rightarrow$ "be right back", or "idk" $\rightarrow$ "I don't know"). Given the popularity of Internet slang, various such dictionaries or lexicons are available online. In the following, we will be using a simple dictionary in the form of a csv file with 5,300 entries. The code cell below loads this file into a pandas DataFrame.

In [18]:
df_slang = pd.read_csv(slang_dictionary)

df_slang.head(8)
Out[18]:
slang translation
0 *4u kiss for you
1 *67 unknown
2 *eg* evil grin
3 07734 hello
4 0day software illegally obtained before it was rele...
5 0noe oh no
6 0vr over
7 10q thank you

In most cases, the "translation" of a slang term is the word or phrase it stands for. However, there are examples such as "0day", where the translation is a definition or explanation of the slang term. Normalizing a slang term with such a definition is arguably not perfectly suitable. However, these examples are rather uncommon, so we simply ignore them here. In short, normalizing Internet slang simply means replacing the slang term with the translation given by the dictionary file. To simplify the implementation, we first convert the DataFrame into a Python dictionary with the keys being the slang terms and the values their corresponding translations.

In [19]:
slang_dict = df_slang.set_index('slang')['translation'].to_dict()

print(slang_dict["brb"])
be right back

Next we define a simple auxiliary method translate_slang() which captures two considerations to help us with the normalization. Firstly, we do not care whether the slang term is uppercase, lowercase, or anything in between, particularly since all keys in the dictionary are lowercase. The method therefore checks if the lowercase version of the input term is in the dictionary or not. And secondly, since not all (or even most) words/terms in a text are slang terms, we need to handle the case where we try to translate a term that is not in the dictionary. In the method, we achieve this by simply using a try ... except ... block when accessing the dictionary. If the access fails due to the key not being found, we return the original input word.

In [20]:
def translate_slang(word, slang_dict):
    try:
        # Look up the lowercased term (all keys in the dictionary are lowercase)
        return slang_dict[word.lower()]
    except KeyError:
        # Not a known slang term; return the input word unchanged
        return word
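
As a side note, the same behavior can be expressed more compactly with the dictionary's get() method, which returns a default value (here, the original word) when the key is missing; this is just an equivalent alternative to the try/except variant above.

def translate_slang_get(word, slang_dict):
    # Equivalent to translate_slang(), using dict.get() with a default value
    return slang_dict.get(word.lower(), word)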

Let's apply this method to a couple of example words/terms.

In [21]:
print(translate_slang("brb", slang_dict))
print(translate_slang("omg", slang_dict))
print(translate_slang("gn8", slang_dict))
print(translate_slang("car", slang_dict))
print(translate_slang("bus", slang_dict))
be right back
oh my god
good night
car
bus

In the examples above, "brb", "omg", and "gn8" are Internet slang terms and part of the dictionary. They therefore get translated into their respective proper word or phrase. In contrast, "car" and "bus" are not valid keys in the slang dictionary, and as such do not get normalized. We can now use this method and apply it to a complete input text. It is easy to see that this is much more convenient if the text is already tokenized.

In [22]:
tokens = ["Good", "joke", "lol", "!", "The",  "delivery",  "was", " crazy", "OMG"]

print([ translate_slang(token, slang_dict) for token in tokens ])
['Good', 'joke', 'laughing out loud', '!', 'The', 'delivery', 'was', ' crazy', 'oh my god']

The only additional consideration is that an individual slang term might be translated into multiple words. This means that the resulting list is no longer a proper token list. However, this can easily be handled using the flatten() method provided by the pandas library. This method takes in an arbitrarily nested list and returns a flat list of all the items. The only change to the initial call of translate_slang() is to split the output to get a list of individual words. The resulting list of lists can then be given to the flatten() method. The code cell below shows these two steps for our tokenized example sentence.

In [23]:
tokens_normalized = [ translate_slang(token, slang_dict).split() for token in tokens ]
print(tokens_normalized)

tokens_normalized = list(pd.core.common.flatten(tokens_normalized))
print(tokens_normalized)
[['Good'], ['joke'], ['laughing', 'out', 'loud'], ['!'], ['The'], ['delivery'], ['was'], ['crazy'], ['oh', 'my', 'god']]
['Good', 'joke', 'laughing', 'out', 'loud', '!', 'The', 'delivery', 'was', 'crazy', 'oh', 'my', 'god']

Discussion¶

One main issue with normalizing Internet slang terms by translating them into "normal" words is that a term/word in the slang dictionary might in fact be a valid token in itself. For example, "4" is often used as a short form for the word "for". However, a "4" in a sentence might very well refer to the actual digit, e.g., "I ate 4 pieces of cake.". Converting this sentence to "I ate for pieces of cake." would alter the meaning of the sentence. Here are some more examples, where "F8" may refer to the function key on a keyboard, and "GM" to the company General Motors.

In [24]:
print(translate_slang("4", slang_dict))
print(translate_slang("F8", slang_dict))
print(translate_slang("GM", slang_dict))
for
fate
good morning

Although not many slang terms can also be considered proper words, the use of numbers/digits or greetings such as "GM" for "good morning" is very common in online conversations. The most straightforward solution is to manually curate the slang dictionary to remove such ambiguous instances. More general approaches might use simple heuristics that always ignore numbers/digits when normalizing tokens; a minimal sketch of such a heuristic is shown below.
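
The following is a minimal sketch of such a heuristic: purely numeric tokens are never translated, so a literal "4" stays a digit. The helper name translate_slang_safe() is illustrative and not part of the repository code.

def translate_slang_safe(word, slang_dict):
    # Heuristic: never translate purely numeric tokens
    if word.isdigit():
        return word
    return slang_dict.get(word.lower(), word)

print(translate_slang_safe("4", slang_dict))    # '4' is kept as a number
print(translate_slang_safe("brb", slang_dict))  # 'be right back'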

Emojis¶

Emojis are small digital images or icons used to express emotions, ideas, or concepts in text-based communication. Originating in Japan in the late 1990s, they have become a universal feature of online interaction, ranging from social media posts to messaging apps. Emojis add a layer of non-verbal communication, allowing users to convey tone, mood, or feelings that might otherwise be ambiguous in written text. For instance, a 😊 (smiling face) can indicate happiness, while a 😢 (crying face) can convey sadness, enhancing the emotional context of a message.

In sentiment analysis, emojis play a crucial role because they act as direct indicators of a user's emotional state or intent. Sentiment analysis involves analyzing text to determine the sentiment it expresses, such as positive, negative, or neutral. Emojis help enrich this analysis by providing explicit emotional cues. For example, a tweet containing a thumbs-up emoji (👍) likely suggests approval or positivity, whereas one with an angry face emoji (😠) may indicate dissatisfaction. By considering emojis, sentiment analysis models can achieve greater accuracy in interpreting the nuances of human communication.

Emojis can also bridge gaps in sentiment analysis where text alone might be ambiguous. For instance, the phrase "great" could be sarcastic or genuine, depending on the accompanying emoji. "Great 🙄" suggests annoyance or sarcasm, while "Great 😊" indicates enthusiasm. Thus, incorporating emoji analysis into sentiment analysis allows for a more precise understanding of the emotional context and intent behind a message.

Analyzing emojis poses its challenges, however, as their meanings can vary depending on cultural context, personal usage, or the platform displaying them. For example, the 🙏 emoji might be interpreted as a prayer gesture in some cultures but as a thank-you or high-five in others. Despite these complexities, advancements in NLP and machine learning have enabled models to account for such variability, improving the reliability of sentiment analysis.

Application¶

By now, the official list of Unicode code points representing emojis contains 3,790 entries (Unicode 16.0). The vast majority of them are typically not used to express a certain mood or sentiment (e.g., animals, cars, flags, everyday objects, etc.). However, some of the most popular emojis are face emojis that people use to express happiness, sadness, anger, joy, and so on. When building a sentiment analysis system, such emojis can therefore be extremely useful. On the other hand, it is typically not needed to distinguish between, say, 😀 and 😊. It is mainly important that both these faces indicate a positive mood or emotion. We can therefore treat the normalization of emojis as converting them to predefined placeholder terms ("[EMOJI+]" for positive emojis, "[EMOJI0]" for neutral emojis, and "[EMOJI-]" for negative emojis). For this, we provide a csv file that contains this label for each covered emoji.

In [25]:
df_emojis = pd.read_csv(emoji_mapping)

df_emojis.head()
Out[25]:
CODE_POINT EMOJI LABEL
0 1F600 😀 [EMOJI+]
1 1F603 😃 [EMOJI+]
2 1F604 😄 [EMOJI+]
3 1F601 😁 [EMOJI+]
4 1F606 😆 [EMOJI+]

Note that this file only includes the most popular face emojis. In principle, one can maintain such a file for all existing emojis, but it has already been argued that most emojis are not really used to express any sentiment. While we could always map such cases to "[EMOJI0]", it is typically perfectly fine to simply remove them.

Like for the slang terms, let's first create a Python dictionary to easily find the label for a given emoji — that is, the keys are the emojis and the values are the respective labels.

In [26]:
emoji_dict = df_emojis.set_index("EMOJI")["LABEL"].to_dict()

print(emoji_dict["😆"])
[EMOJI+]

And again, more for convenience, we create a small auxiliary method translate_emoji() that returns the label if the emoji is in the dictionary, and returns the emoji "as is" otherwise.

In [27]:
def translate_emoji(emoji, emoji_dict):
    try:
        # Look up the sentiment label for the emoji
        return emoji_dict[emoji]
    except KeyError:
        # Emoji (or any other token) not in the mapping; return it unchanged
        return emoji

Let's see how it works when applied to individual emojis.

In [28]:
print(translate_emoji("😆", emoji_dict))
print(translate_emoji("🤔", emoji_dict))
print(translate_emoji("🤢", emoji_dict))
print(translate_emoji("🙈", emoji_dict))
[EMOJI+]
[EMOJI0]
[EMOJI-]
🙈

Assuming a tokenized input text, we can now use this method to normalize each token; here is an example:

In [29]:
tokens = ["The", "movie", "was", "😆", ",", "only", "the", "ending", "was", "😩", "🙈", "."]

print([ translate_emoji(token, emoji_dict) for token in tokens ])
['The', 'movie', 'was', '[EMOJI+]', ',', 'only', 'the', 'ending', 'was', '[EMOJI-]', '🙈', '.']

Since the method translate_emoji() returns the same emoji if it was not found in the dictionary, the normalized token list may still contain emojis, as in the example above. If this is a problem and removing such emojis is the preferred output, we can easily accomplish this by first converting all Unicode characters to ASCII — which will fail for the emojis and return an empty string — and then removing all empty strings from the token list. The code cell below implements all steps to get the final normalized token list.

In [30]:
tokens_normalized = [ translate_emoji(token, emoji_dict) for token in tokens ]
print(tokens_normalized)

tokens_normalized = [ token.encode("ascii", "ignore").decode("utf-8") for token in tokens_normalized ]
print(tokens_normalized)

tokens_normalized = [ token for token in tokens_normalized if token.strip() != "" ]
print(tokens_normalized)
['The', 'movie', 'was', '[EMOJI+]', ',', 'only', 'the', 'ending', 'was', '[EMOJI-]', '🙈', '.']
['The', 'movie', 'was', '[EMOJI+]', ',', 'only', 'the', 'ending', 'was', '[EMOJI-]', '', '.']
['The', 'movie', 'was', '[EMOJI+]', ',', 'only', 'the', 'ending', 'was', '[EMOJI-]', '.']

Discussion¶

Normalizing emojis can significantly enhance the accuracy and depth of sentiment analysis because emojis often convey strong emotional, contextual, or relational cues. Emojis are widely used in informal communication, especially on social media platforms, to express emotions succinctly. For example, a smiling face 😊 typically conveys positivity, while a crying face 😢 indicates sadness. Without normalization, sentiment analysis models may fail to recognize these semantic meanings, leading to reduced performance in tasks requiring an understanding of the user's emotional state. Incorporating emojis into sentiment analysis by normalizing them into text labels or embeddings ensures that their meanings are captured effectively. The example shown above is very simple and various extensions are conceivable. For one, we could map emojis to more fine-grained labels, e.g., mapping 😍 to "love" or 😭 to "crying"; a minimal sketch of such a mapping is shown below. Also, the dictionary could include other emojis beyond face emojis that are often used to express some sentiment, mood, or feeling (e.g., ❤️, 👍🏼, 🌞, 🍀).
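
The following is a minimal sketch of such a fine-grained mapping; the emoji-to-label pairs below are illustrative assumptions and are not part of the provided csv file.

fine_grained = {"😍": "love", "😭": "crying", "👍": "approval", "🙄": "annoyance"}

tokens = ["I", "😍", "this", "movie", ",", "but", "the", "ending", "😭"]

print([ fine_grained.get(token, token) for token in tokens ])
# ['I', 'love', 'this', 'movie', ',', 'but', 'the', 'ending', 'crying']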

Emoticons¶

Text emoticons are symbolic representations of facial expressions or emotions created using standard keyboard characters. Originating before the widespread use of graphical emojis, they offer a way to convey emotions, tone, or intent in text-based communication. Examples include ":)" for a smile, ":(" for sadness, and ":D" for excitement or happiness. These simple combinations of punctuation marks and letters add emotional context to messages, making them particularly useful in digital communication, where tone can often be misunderstood. In sentiment analysis, text emoticons are valuable indicators of a user's emotional state and can enhance the interpretation of a text's sentiment. Sentiment analysis involves identifying the emotional tone of a text, such as whether it is positive, negative, or neutral. Emoticons provide explicit cues; for instance, a message with "I got the job! :D" clearly expresses happiness and positivity. By incorporating emoticon detection into sentiment analysis algorithms, analysts can improve the accuracy of their assessments.

One advantage of emoticons is their universality and simplicity, as they rely on a limited set of keyboard characters. Unlike emojis, which can vary in appearance across platforms, emoticons retain a consistent form. This makes them particularly useful in contexts where graphical emojis may not be supported, such as older devices, programming environments, or text-only systems. Moreover, their straightforward nature makes them easier to categorize for sentiment analysis tasks. However, the interpretation of text emoticons is not without challenges. Some emoticons, such as ";)" (winking face), can carry nuanced or context-dependent meanings, indicating sarcasm, humor, or flirtation depending on the context. Additionally, variations like ":-P" (playful or teasing) may not always align perfectly with a binary sentiment classification of positive or negative. To address this, sentiment analysis systems must incorporate contextual understanding and possibly combine emoticon analysis with linguistic cues in the surrounding text.

There are two fundamental types of emoticons, Western-style emoticons and Japanese-style emoticons, which have distinct ways of expressing emotions and facial expressions using text characters. Each style reflects cultural and linguistic preferences, resulting in unique approaches to conveying emotion in digital communication.

  • Western-style emoticons are typically horizontal and require the viewer to tilt their head to interpret them. Examples include ":-)" (smile), ":-(" (frown), and ":-D" (big smile). These emoticons often use basic punctuation marks like colons, semicolons, dashes, and parentheses to represent facial features. They are minimalist and rely on a small set of characters, making them straightforward and universally recognizable. Western emoticons focus primarily on the mouth to convey emotion, such as ":P" for a playful tongue-out expression or ;) for a wink.

  • Japanese-style emoticons, known as kaomoji, are designed to be read upright, and they tend to include a wider range of characters to create more detailed and expressive faces. Examples include "(^^)", "(T_T)", and "(¬¬)". Kaomoji often emphasize the eyes rather than the mouth, as eyes play a significant role in conveying emotion in Japanese culture. These emoticons also use non-standard characters, such as underscores, carets, and Japanese text symbols, to achieve greater variety and expressiveness.

The primary difference lies in their orientation and level of detail. Western emoticons are simpler and rely on a horizontal layout, while Japanese kaomoji are more complex, upright, and visually detailed. This distinction reflects cultural differences in emotional expression and communication style. Western emoticons often focus on efficiency and brevity, whereas kaomoji emphasize creativity and nuance, providing users with a broader emotional palette to communicate their feelings.

Application¶

In principle, we can handle emoticons similarly to emojis (and Internet slang) by collecting a dictionary that maps popular emoticons to some meaningful value or label. For example, the lexicon of the VADER sentiment analysis system, which is also part of the NLTK library, contains a selection of emoticons together with a sentiment score reflecting the average of human-annotated scores between $-4$ and $+4$ (a small lookup sketch is shown after the list below). Such lexicons are essentially required for Japanese-style emoticons. However, many if not most Western-style emoticons have a well-defined structure. At a minimum, these emoticons use two characters to represent the eyes and the mouth, respectively (e.g., ":(" or ";)"). Two optional characters may be used to represent a nose or a top (e.g., hat, hair, furrowed brows), respectively (e.g., ">:o(" or "{:p"). Since the choice of characters for each part of the emoticon (mouth, nose, eyes, top) is typically limited, we can

  • Identify (almost) arbitrary emoticons based on pattern matching (e.g., using Regular Expressions)
  • Automatically derive the general sentiment without the need for human annotation or curation
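As a brief aside on the lookup-based option, the code cell below sketches how the VADER lexicon mentioned above could be queried for emoticon scores. It assumes the vader_lexicon resource has been downloaded via NLTK and that the listed emoticons are covered by the lexicon, so treat it as an illustration rather than part of this notebook's pipeline.

In [ ]:
# Sketch: looking up emoticon sentiment scores in the VADER lexicon (assumes the
# "vader_lexicon" resource is available and that these emoticons are covered by it)
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

for emoticon in [":)", ":(", ":D"]:
    # The lexicon maps tokens to human-annotated valence scores between -4 and +4
    print(emoticon, sia.lexicon.get(emoticon, "not in lexicon"))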

To showcase the pattern-matching idea, we provide you with the implementation of an EmoticonNormalizer class; see the src/normalizer.py file. Let's first see how it works:

In [31]:
emoticon_normalizer = EmoticonNormalizer()

for emoticon in [":-)]]", ";o))))", ":-(((", "((:", ":o|"]:
    _, _, emoticon_normalized, sentiment_label = emoticon_normalizer.normalize(emoticon)
    print(f"{emoticon} ==> {emoticon_normalized} / {sentiment_label}")
:-)]] ==> :-) / [EMOTICON+]
;o)))) ==> ;o) / [EMOTICON+]
:-((( ==> :-( / [EMOTICON-]
((: ==> (: / [EMOTICON+]
:o| ==> :o| / [EMOTICON-]

Since this implementation is an example of a highly customized normalization step, we only give a high-level explanation of how this class works; the source code is not difficult to follow, so check it out if you are interested. The EmoticonNormalizer uses a Regular Expression to check if a string is an emoticon by trying to match the mouth, nose, eyes, and top against predefined sets of characters. For example, the eyes can be represented by any character in the set .:;8BX= (the nose and the top characters are optional). In fact, the normalizer uses two regular expressions to handle both orientations: mouth-nose-eyes-top and top-eyes-nose-mouth.

Once a string has been identified as an emoticon, the normalizer tries to automatically derive the general sentiment (positive, neutral, or negative). To this end, the normalizer makes the assumption that the sentiment is captured by the character for the mouth alone. In fact, the mouth character is often duplicated to make the sentiment or mood more explicit. For example, the emoticons ":o)" and ":)))" arguably express a positive sentiment since the parentheses for the mouth mimic a smile. The normalizer therefore checks whether the mouth characters belong to a predefined set of characters indicating a positive or negative sentiment, again taking both orientations into account.
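To make this more concrete, the code cell below gives a minimal sketch of this pattern-matching idea. It is not the actual implementation in src/normalizer.py: the character sets are simplified assumptions, only the top-eyes-nose-mouth orientation is handled, and no corner cases are covered.

In [ ]:
# Minimal sketch of regex-based emoticon detection (simplified character sets; one orientation only)
import re

TOP   = r"[>}{]?"      # optional hat/hair/furrowed brows
EYES  = r"[.:;8BX=]"   # eye characters
NOSE  = r"[-o^]?"      # optional nose
SMILE = r"[)\]D]+"     # mouth characters suggesting a positive sentiment
FROWN = r"[(\[]+"      # mouth characters suggesting a negative sentiment

def sketch_normalize_emoticon(token):
    # Orientation: top-eyes-nose-mouth, e.g., ">:o(" or ":-)))"
    if re.fullmatch(TOP + EYES + NOSE + SMILE, token):
        return "[EMOTICON+]"
    if re.fullmatch(TOP + EYES + NOSE + FROWN, token):
        return "[EMOTICON-]"
    return token

print(sketch_normalize_emoticon(";o))))"))   # expected: [EMOTICON+]
print(sketch_normalize_emoticon(">:-((("))   # expected: [EMOTICON-]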

While this is the basic approach of the EmoticonNormalizer class, its implementation contains some refinements to handle additional corner cases. But again, these details are beyond the scope here. Instead, let's use the class to normalize a tokenized example sentence.

In [32]:
tokens = ["The", "movie", "was", ":p", ",", "only", "the", "ending", "was", ":(((", ">:o|", "."]

print([ emoticon_normalizer.normalize(token)[-1] for token in tokens ])
['The', 'movie', 'was', '[EMOTICON+]', ',', 'only', 'the', 'ending', 'was', '[EMOTICON-]', '[EMOTICON-]', '.']

Discussion¶

As mentioned before, the EmoticonNormalizer relies on the observation that Western-style emoticons commonly feature a well-defined structure using well-defined sets of characters. These characteristics allow us to treat the identification of an emoticon and its general sentiment as a pattern-matching task. Japanese-style emoticons generally lack this simple structure, so a basic lookup-based approach is likely to perform much better for them. The EmoticonNormalizer is also not capable of properly normalizing all conceivable Western-style emoticons, but it should cover the most popular ones. In the end, whether it is worth putting such additional effort into handling emoticons (as well as emojis) depends on the exact task or application. For sentiment analysis, these efforts are arguably justified, as many people make ample use of emoticons and emojis to express their mood or feelings in online conversations.


Misspelled Words¶

Misspelled words are words that are not spelled correctly according to the standard rules of a particular language. These errors can arise from various reasons, including typographical mistakes, lack of familiarity with correct spelling, phonetic spelling, or even intentional alterations like slang. For example, writing "definately" instead of "definitely" is a common misspelling in English. Unfortunately, the automatic correction of misspelled words poses several challenges:

  • Ambiguity in intent: One of the primary difficulties in correcting misspelled words is determining the writer's intent. A misspelled word might resemble multiple valid words, and without context, it's challenging to discern which word was intended. For instance, "baet" could be a misspelling of "bait" or "beat," and the correct choice depends on the surrounding text.

  • Context dependence: Automatic correction tools often rely on context to decide the correct replacement for a misspelled word. For example, in the sentence "He reed the book", the word "reed" could be corrected to "read" (past tense) or kept as "reed" (a type of plant), depending on the sentence's broader meaning. Ensuring tools can understand and analyze context effectively is a complex task, especially in longer or more nuanced text.

  • Nonstandard language use: Variations in dialect, slang, and creative writing often intentionally deviate from standard spelling rules. For example, phrases like "gonna" (informal for "going to") or "luv" (informal for "love") may not technically be misspellings but are often flagged by automated systems. Correcting these words incorrectly could disrupt the intended tone or style of the text.

  • Typographical variations: Typos are another layer of complexity, as they often result in words that are not close to the intended word in spelling but might resemble other valid words. For example, "wokr" could be a typo for "work" or "woke" and distinguishing between these options requires advanced pattern recognition and linguistic understanding.

  • Dynamic vocabulary: Language evolves constantly, with new words, names, and acronyms being introduced regularly. Spell-checking systems must be updated to accommodate these changes, which is challenging and resource-intensive. Without regular updates, these systems may misidentify newly coined or culturally specific terms as misspelled words.

While algorithms and machine learning models have made significant strides in correcting misspellings, the task remains inherently complex due to the intricacies of human language, the need for contextual understanding, and the dynamic nature of vocabulary and language use. For this reason, spell checking is generally considered its own dedicated task that goes beyond "classic" text normalization. However, some spelling mistakes have straightforward and "systematic" causes that can be fixed in a rather simple manner as part of normalization. In the following, we look at two concrete examples.

Americanize Text¶

American English and British English differ due to historical, cultural, and linguistic developments that emerged after the colonization of North America by English settlers in the 17th century. When English speakers migrated to the New World, they carried with them the language and dialects of their time. Over the centuries, as the American colonies developed independently from Britain, their language evolved separately, influenced by diverse factors such as geography, politics, cultural exchange, and contact with other languages.

One major factor contributing to the differences is the influence of other languages on American English. The United States became a melting pot of cultures, incorporating vocabulary and linguistic features from Native American, Spanish, French, Dutch, and German languages, among others. Words like "raccoon", "barbecue", "prairie", and "cookie" reflect these influences. Meanwhile, British English continued to be shaped by interactions with European languages closer to home.

Noah Webster, a key figure in shaping American English, played a significant role in formalizing many of its distinctions. In the late 18th and early 19th centuries, Webster advocated for spelling reforms to simplify and differentiate American English from British English. His dictionary introduced spellings such as "color" instead of "colour" and "theater" instead of "theatre", aiming for a more phonetic and uniquely American style.

Another distinction between American and British English is the use of different words for the same concept; for example:

British English American English Description
Flat Apartment A place to live
Lift Elevator Used to move between floors
Lorry Truck A large vehicle for goods transport
Petrol Gas/Gasoline Fuel for cars
Boot Trunk The storage area of a car
Bonnet Hood The front cover of a car engine
Biscuit Cookie A baked sweet treat
Crisps Chips Thinly sliced fried potatoes
Chips Fries Fried potato strips
Holiday Vacation A break or time off work/school
Jumper Sweater A knitted upper garment
Torch Flashlight A portable light source
Queue Line A row of people waiting
Dustbin Trash can A container for garbage
Trousers Pants Lower body clothing

Application¶

While all these differences could be considered when americanizing an input text, to keep it simple we only focus on the differences in the spelling of certain words (e.g., "color" vs. "colour", "theater" vs. "theatre"). Like normalizing slang terms or emojis, we solve this task using a publicly available dictionary to look up the American spelling of a British word and replace it.

In [33]:
df_be2ae = pd.read_csv(british_to_american)

df_be2ae.head()
Out[33]:
BRITISH ENGLISH AMERICAN ENGLISH
0 africanisation africanization
1 africanise africanize
2 americanisation americanization
3 americanise americanize
4 arabise arabize

We convert the DataFrame to a Python dictionary first to make the lookup easier.

In [34]:
be2ae = df_be2ae.set_index("BRITISH ENGLISH")["AMERICAN ENGLISH"].to_dict()

print(be2ae["organise"])
organize

The auxiliary method americanize_word() helps us handle the case where an input token has no entry in the dictionary (i.e., it is not a British spelling), in which case we return the word unchanged.

In [35]:
def americanize_word(word, word_dict):
    # Return the American spelling if the word has an entry in the dictionary;
    # otherwise, return the word unchanged.
    try:
        return word_dict[word]
    except KeyError:
        return word

Let's normalize a tokenized text by applying americanize_word() to each token in the token list.

In [36]:
tokens = ["The", "programme", "was", "gruelling", "but", "useful", "."]

print([ americanize_word(token, be2ae) for token in tokens ])
['The', 'program', 'was', 'grueling', 'but', 'useful', '.']

Discussion¶

Automatically converting text from British spelling to American spelling is especially useful in scenarios where consistency and alignment with local preferences are important. For instance, in content localization, businesses or websites targeting an American audience may need to convert British spelling to American spelling to match the regional language norms. This is common for e-commerce platforms, product manuals, or promotional content, ensuring that terms like "colour" become "color" or "favour" becomes "favor", making the content more relatable and accessible to local users.

In the academic and research publishing domain, it's common for journals to require submissions in a specific style guide. For example, many American academic publications prefer American English, so authors may need to change spellings like "organise" to "organize" or "defence" to "defense" when submitting their work. This ensures the research aligns with the expectations of the journal and its readers, providing a more standardized approach to international academic writing.

NLP systems also benefit from converting spelling between regional varieties. Chatbots or AI systems designed for customer interaction may need to adjust to local norms to better understand and respond to users. For example, if the NLP model is trained on American English, it might struggle with words like "analyse" or "realise" from British English. By converting these to their American equivalents ("analyze" and "realize"), the system becomes more efficient and accurate in processing user input, improving communication.

Global journalism and multinational corporations are also key use cases for automatic spelling conversion. Media outlets might need to adapt their content depending on the regional focus, such as adjusting a news article's spelling for the U.S. edition. Similarly, multinational companies that operate across different regions may standardize their internal communications or marketing materials in American English, especially if the majority of their stakeholders or customers are based in the U.S. This helps maintain uniformity in corporate communication, ensuring it resonates well with the target audience.

Expressive Lengthenings¶

Expressive lengthenings in text refer to the deliberate extension of letters within a word to emphasize emotion, tone, or intensity. For example, writing "sooo cute" or "nooooo" stretches certain vowels or consonants to convey feelings such as excitement, hesitation, distress, or urgency. These lengthenings mimic spoken intonations, translating non-verbal cues like pitch, volume, and elongation of speech into written form. By doing so, they add an emotional or conversational quality that standard spelling might lack. In digital communication, expressive lengthenings often appear in informal contexts, such as social media posts, texts, or chats. They provide nuance, enabling the writer to convey sarcasm, enthusiasm, or exaggerated reactions without additional explanation. For instance, "yeees" communicates excitement and approval, while "whyyyy" suggests frustration or disbelief. These modifications break away from conventional grammar rules, reflecting the casual and creative nature of modern digital language. This stylistic choice is not merely playful but also functional, fostering a sense of intimacy and connection in text-based interactions. Expressive lengthenings help readers infer emotional undertones, enriching the overall meaning. As a result, they contribute to the evolving nature of written communication, blending textual precision with the dynamic, expressive qualities of spoken language.

Application¶

When trying to normalize expressive lengthenings, we have to consider the fact that many words do indeed contain repeated characters (e.g., "book", "beer", "bookkeeping"). There are also cases where simply removing duplicate characters might actually change the meaning of a word (e.g., "boot" vs. "bot", both being proper English words). We therefore have to take a slightly smarter approach. Firstly, we need to be able to check whether a modification of a word yields a string that is no longer a valid English word. For this, we can first load a dictionary of common English words.

In [37]:
df_vocab = pd.read_csv(english_vocabulary)

df_vocab.head()
Out[37]:
WORD
0 A
1 AA
2 AAA
3 AA's
4 AB

For a quick check whether a string is an English word (at least one covered by the dictionary), let's convert the DataFrame into a Python set.

In [38]:
vocab_ae = set(df_vocab["WORD"])

The dictionary we just loaded covers only "normal" words. However, expressive lengthenings may also be used for Internet slang terms (e.g., "looool"). Luckily, we already loaded a dictionary of slang terms, so we can just add them to our vocabulary of valid words/terms.

In [39]:
vocab_ae.update(set(slang_dict.keys()))

Since the number of additional characters may be arbitrary, we can use a Regular Expression to match substrings where the same character appears at least three times in a row, and then replace them with only two of those characters. As noted above, matching already at two repeated characters and/or reducing repetitions to a single character could turn valid words into invalid ones (e.g., "book" would become "bok"). The code cell below shows the Regular Expression applied to an example word.

In [40]:
print(re.sub(r"(\w)\1{2,}", r"\1\1", "cooool", flags=re.I))
cool

This example works just fine since "cool" does indeed feature two o's. However, if we try:

In [41]:
print(re.sub(r"(\w)\1{2,}", r"\1\1", "looool", flags=re.I))
lool

We won't get the arguably correct result of "lol". To address this issue, the method handle_expressive_lengthening() implements the normalization of expressive lengthenings as a two-step process:

  • Firstly, the method reduces all characters that are repeated at least three times in a row down to two characters. If this change indeed modified the input token, the method checks whether the new token is a valid word by looking it up in the vocabulary. If that is the case, this new token gets returned as the normalization result.

  • Secondly, if the new token is not a valid word, the method uses a second Regular Expression to reduce repeated characters down to only one character. Again, the method then checks whether this further modified token is a valid word and returns it if that is the case. If this step also fails, the method simply returns the initial input token without any changes.

In [42]:
def handle_expressive_lengthening(token, vocabulary):
    token_copy = token
    # No character repeated at least three times in a row ==> nothing to normalize
    if re.search(r"(\w)\1{2,}", token_copy, flags=re.I) is None:
        return token
    # Step 1: reduce all runs of 3+ identical characters down to two characters
    token_copy = re.sub(r"(\w)\1{2,}", r"\1\1", token_copy, flags=re.I)
    if token_copy.lower() in vocabulary:
        return token_copy
    # Step 2: reduce the remaining duplicated characters down to a single character
    token_copy = re.sub(r"(\w)\1", r"\1", token_copy, flags=re.I)
    if token_copy.lower() in vocabulary:
        return token_copy
    else:
        return token

Let's check out some examples to see how the method handle_expressive_lengthening() behaves.

In [43]:
print(handle_expressive_lengthening("looool", vocab_ae))
print(handle_expressive_lengthening("cooooool", vocab_ae))
print(handle_expressive_lengthening("daaaaamn", vocab_ae))
print(handle_expressive_lengthening("niiiiceee", vocab_ae))
print(handle_expressive_lengthening("heeeckkkk", vocab_ae))
lol
cool
damn
nice
heck

And of course, we can also apply the method to an example sentence that has already been tokenized.

In [44]:
tokens = ["The", "movie", "was", "amaaaazing", "looool", "."]

print([ handle_expressive_lengthening(token, vocab_ae) for token in tokens ])
['The', 'movie', 'was', 'amazing', 'lol', '.']

Discussion¶

The implementation of the method handle_expressive_lengthening() makes several simplifying assumptions and is therefore far from perfect. Most obviously, it relies on the vocabulary being complete to reliably check whether a modified token is a valid word. For example, our current vocabulary does not contain named entities such as "Microsoft". So any mention of this word with some expressive lengthening will not be normalized correctly:

In [45]:
print(handle_expressive_lengthening("microsooooft", vocab_ae))
microsooooft

A more subtle (and arguably less likely) case of normalization errors occurs if a word contains multiple expressive lengthenings for different characters, but those characters need to be reduced to different lengths (either one or two). The code cell below shows three examples where only the first one returns the expected result:

In [46]:
print(handle_expressive_lengthening("loooollll", vocab_ae)) # Works because both "o" and "l" need to get reduced to one character each
print(handle_expressive_lengthening("coooollll", vocab_ae)) # Does arguably not work since "cooll" is not a proper word so further reduction
print(handle_expressive_lengthening("beeerrrr", vocab_ae)) # Does not work since neither "beerr" nor "ber" are proper words
lol
col
beeerrrr

In the case of "coooollll" the algorithm first reduces the expressive lengthenings to "cooll". Since this is not a proper word, it further reduced it the "col". While this is a proper word, it is arguably not the correct one (which is most likely "cool"). In the case, of "beeerrrr", both "beerr" after the first reduction step and "ber" after the second step are not valid words found in the vocabulary. There, the method falls back to returning the input token unchanged. In principle, many of such cases can be covered by improving the method handle_expressive_lengthening(). However, this would quickly increase its complexity, and in practice it is often not worth the effort. This includes the consideration that the more complex the method becomes, the more likely it will behave unexpectedly in other situations, including cases that work right now with the simple implementation.


Summary¶

Text normalization is a vital preprocessing step in natural language processing (NLP) that involves transforming raw text into a standardized and consistent format. This process ensures compatibility and enhances the performance of NLP systems by addressing the inconsistencies, variations, and noise present in raw text. Such issues might include misspellings, abbreviations, slang, or irregular punctuation, all of which can hinder the accuracy and reliability of computational models.

The process typically begins with transforming text to lowercase to standardize its appearance, particularly in languages where case does not alter meaning. Another common step is removing unnecessary punctuation and special characters, which often do not contribute significantly to the semantic understanding of the text. Additionally, tokenization is employed to split the text into smaller units, such as words or sentences, which serve as the foundation for subsequent processing. Normalization often includes correcting misspellings or expanding contractions and abbreviations, such as converting "u" to "you" or "can't" to "cannot". In some cases, words are reduced to their root forms through stemming or lemmatization, and uninformative words, like "and" or "the", may be removed if they do not add value to the task.

Text normalization poses several challenges. Languages differ in their grammatical rules, syntax, and cultural nuances, making it essential to tailor normalization strategies for each language. Moreover, preserving context and meaning during normalization is difficult; for instance, indiscriminate removal of words or symbols might strip the text of valuable information, particularly in tasks like sentiment analysis. Social media and informal text add another layer of complexity, as they often include unconventional spellings, slang, and emojis. Named entities, such as names of people or brands, must often be retained in their original form, further complicating the process.

The exact approach to text normalization is highly dependent on the NLP task or application. For sentiment analysis, it is crucial to preserve elements like emojis, contractions, and slang, as these carry emotional nuance. In contrast, tasks like machine translation benefit from stricter uniformity in text, such as consistent casing and corrected spelling, to ensure accurate translations. Similarly, search engines rely on stemming or lemmatization to match user queries with indexed content effectively. Task-specific requirements emphasize the importance of tailoring normalization strategies. Over-normalizing can degrade performance in specialized applications; for instance, stemming might improve search retrieval but could impair readability in tasks like text summarization. A balanced and adaptive approach is essential to align text normalization with the specific goals of the NLP task, ensuring both computational efficiency and semantic fidelity. Ultimately, text normalization serves as a foundational step, bridging the gap between raw text data and sophisticated NLP analysis.
