Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Data Preparation for Training LLMs — An Overview¶

Data preparation is a foundational step in training large language models (LLMs), playing a critical role in determining the quality, safety, and performance of the resulting model. These models learn patterns and information from vast text datasets, so the accuracy, diversity, and cleanliness of that data directly influence their capabilities. Unlike traditional machine learning tasks, where curated datasets are often small and well-defined, LLMs require enormous volumes of heterogeneous data pulled from various domains and sources. This makes data preparation a large-scale, complex process that demands careful planning and execution.

The preparation pipeline typically begins with data collection, where text is gathered from sources like websites, books, academic publications, forums, and other digital content repositories. Once collected, the data must be converted into a consistent format, ensuring compatibility with tokenization systems and model training frameworks. This often involves converting files from HTML, PDF, or DOC formats into plain text or structured JSON representations. Following this, deduplication is essential to remove redundant content, which helps reduce training bias, prevents overfitting, and saves computational resources. Additional cleaning steps include removing boilerplate content (e.g., ads, navigation menus), filtering low-quality or irrelevant text, and normalizing Unicode characters to ensure consistency.

One of the main challenges in this process lies in balancing scale and quality. Given the sheer volume of data, it’s not feasible to manually inspect all content, so automated tools and heuristics must be used—yet these are prone to false positives and negatives. Language diversity, formatting inconsistency, and domain variability further complicate the task. Additionally, model training requires data to be tokenized and structured into sequences of manageable length, introducing further complexity in ensuring contextual coherence and content coverage.

Beyond the technical challenges, ethical considerations are crucial. Training data must be vetted for harmful, biased, or private content, as LLMs can memorize and regurgitate sensitive or toxic material if not carefully filtered. This includes avoiding personal data, respecting copyright laws, and mitigating the propagation of social biases found in online text. Transparency in the data sources used and the filtering methods applied is increasingly seen as a best practice to foster trust and accountability in AI development.

Setting up the Notebook¶

Make Required Imports¶

This notebook requires the import of several Python packages as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.

In [1]:
from src.utils.libimports.llmdataprep import *
from src.utils.data.files import *

Download Required Data¶

Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws any error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.

In [2]:
example_pdf,  _ = download_dataset("text/docs/llm-example-document.pdf")
example_docx, _ = download_dataset("text/docs/llm-example-document.docx")
File 'data/datasets/text/docs/llm-example-document.pdf' already exists (use 'overwrite=True' to overwrite it).
File 'data/datasets/text/docs/llm-example-document.docx' already exists (use 'overwrite=True' to overwrite it).

Preliminaries¶

The purpose of this notebook is to provide an overview of common steps and challenges when collecting and preparing a dataset for training LLMs. In practice, this process often requires a lot of customization and tweaks depending on the specific task. This also means that there is no simple checklist to guarantee that your final dataset is properly prepared.

This notebook also contains many short code examples to illustrate various data collection and preparation/preprocessing steps. Again, those code snippets are kept very simple on purpose, and practical implementations will require much more effort. In fact, data collection and data preparation are arguably the most time-consuming part of training machine learning models, including LLMs.


Data Collection¶

Types of Data¶

Datasets used to train LLMs can be broadly categorized into two types: generalized data and specialized data. This distinction reflects the scope and purpose of the information being used. Generalized data encompasses a wide range of topics and writing styles, providing the model with a broad understanding of language and knowledge across domains. In contrast, specialized data focuses on specific fields or industries, allowing the model to gain deep expertise and perform well on domain-specific tasks. Recognizing the difference between these dataset types is essential for designing effective training strategies that balance versatility with precision.

General Data¶

Generalized data for training LLMs refers to a broad and diverse set of textual information drawn from a wide range of domains, topics, and sources. This type of data is not tailored to a specific industry or task, but instead is meant to give the model a wide-ranging understanding of human language, knowledge, and reasoning. The goal of using generalized data is to enable LLMs to perform well across many types of queries and tasks, from casual conversation to more formal or technical problem-solving, without requiring specialized training for each.

Generalized data ensures that LLMs can learn the structure, semantics, and nuances of language in a way that generalizes to new and unseen inputs. This data often includes examples of different writing styles, languages, dialects, and cultural references. By training on such a wide corpus, the model gains a foundational linguistic competence that can later be fine-tuned with more specific data if needed. Common examples of generalized data used in LLM training:

  • Wikipedia articles
  • News websites and journalism archives
  • Public books (e.g., from Project Gutenberg)
  • Online encyclopedias
  • Scientific abstracts and open-access research papers
  • Internet forums and Q&A sites (e.g., Stack Exchange)
  • Social media posts and comments (filtered for quality and safety)
  • Technical documentation (e.g., software manuals, open-source code comments)
  • Web crawl data (filtered from the open internet)
  • Public government records and reports

Specialized Data¶

Specialized data for training LLMs refers to domain-specific or task-specific information used to fine-tune or enhance a model’s performance in particular areas. Unlike generalized data, which covers a wide array of topics, specialized data is focused on a narrow subject area such as law, medicine, finance, or customer service. This type of data is used to adapt an LLM to perform more accurately and reliably in professional or technical applications where precision and domain understanding are critical.

Training with specialized data helps models develop a deeper grasp of industry terminology, context, and use cases. It enables the LLM to generate more accurate outputs, answer domain-specific questions, and follow specialized workflows or standards. This data is often curated, proprietary, and may include sensitive or confidential information, requiring strict handling and compliance measures. Examples of specialized data include:

  • Electronic health records (EHRs) or clinical notes (for medical LLMs)
  • Legal case documents and court rulings (for legal models)
  • Financial reports, balance sheets, and investment analyses
  • Technical manuals and engineering specifications
  • Customer support chat logs and ticket data
  • Scientific datasets in niche research fields
  • Programming documentation and source code from specific libraries or domains
  • Internal corporate documentation and knowledge bases
  • Industry-specific compliance and regulatory texts
  • Patent filings and intellectual property databases

Collection Methods¶

The collection and creation of large text corpora often rely on online resources due to the sheer volume, diversity, and accessibility of textual data available on the internet. Online platforms host vast amounts of written content across domains — ranging from news articles and social media posts to academic publications and user-generated content—making them invaluable for building representative and comprehensive language datasets. The internet provides a dynamic and up-to-date source of language use, capturing both formal and informal registers, emerging linguistic trends, and multilingual content that is difficult to obtain through traditional means.

There are three main methods for acquiring such data from online sources: downloading publicly available datasets, using publicly accessible APIs, and web scraping. Public datasets, such as those released by governments, research institutions, or open data platforms, offer structured and often pre-cleaned text data. APIs (Application Programming Interfaces) allow for controlled, often real-time access to content from websites or services like Twitter or Reddit. Web scraping, meanwhile, involves automatically extracting data from web pages that do not provide APIs or datasets, allowing researchers to access a wider range of unstructured content. Together, these methods form the foundation for scalable and efficient corpus construction in modern computational linguistics and natural language processing.

Public Datasets¶

Downloading publicly available datasets is the easiest and most efficient way to collect data for training LLMs or for other NLP tasks because these datasets are already curated, accessible, and cover a wide range of content types and domains. Unlike web scraping (see below) which often involves legal, ethical, and technical barriers, public datasets come with clear licensing terms, structured formats, and documentation — making them immediately usable for research and development. This streamlines the data collection process and significantly reduces the time, resources, and expertise required to gather large-scale text corpora. Since many of these datasets are updated regularly and maintained by academic or open-source communities, they also offer a reliable and sustainable source of data.

Here are some popular public datasets (among many others):

  • Common Crawl Corpus: The Common Crawl Corpus is a massive, publicly available dataset of web page data collected regularly by the nonprofit organization Common Crawl. It consists of petabytes of web content—HTML pages, metadata, and extracted text—crawled from billions of websites across the internet since 2008. The data is stored in standardized formats (like WARC, WET, and WAT files) and hosted on Amazon S3, allowing researchers, developers, and organizations to access and analyze web-scale data freely.

  • Wikipedia Article Dump: The Wikipedia Article Dump is a publicly available dataset containing the full content of Wikipedia, including articles, templates, and metadata, as of specific snapshot dates. These dumps are released regularly by the Wikimedia Foundation and include both the raw wikitext source (with all the markup and edit history) and optionally pre-processed XML or plain text formats. The most commonly used version for research and development is the current pages dump, which contains the latest version of each article without the full edit history. The dumps are free to download and serve as a reliable, consistent snapshot of human knowledge at a given point in time.

  • arXiv: The arXiv public dataset is a large, openly accessible collection of scientific papers from arXiv.org, a preprint repository widely used by researchers in fields like physics, mathematics, computer science, and more. The dataset contains metadata (such as titles, abstracts, authors, categories), full-text content (often in LaTeX or PDF format), and publication history for millions of academic papers. It is periodically released or made accessible via APIs, bulk downloads, and public datasets hosted by platforms like Kaggle or Semantic Scholar.

  • Standardized Project Gutenberg Corpus: The Standardized Project Gutenberg Corpus is a cleaned and structured version of the original Project Gutenberg collection, which contains thousands of public domain literary works, including novels, plays, essays, and poems. While the original Project Gutenberg texts can vary in formatting and metadata quality, the standardized version processes and normalizes the content to make it more consistent and suitable for computational analysis. This includes removing boilerplate text (like licensing info), correcting formatting issues, and tagging metadata such as author, title, and language.
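
To make the mechanics concrete, the sketch below streams a dataset file to disk in chunks using the requests library. The URL is a placeholder (not a real dataset location) and must be replaced with the actual download link from the dataset's documentation.

import requests

# Placeholder URL -- replace with the actual download link of the public
# dataset you want (e.g., a Wikipedia dump file or a Common Crawl segment).
dataset_url = "https://example.org/path/to/public-dataset-file.txt.gz"

# Stream the response so that large files are written to disk in chunks
# instead of being loaded into memory all at once.
with requests.get(dataset_url, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open("public-dataset-file.txt.gz", "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)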

Important: While publicly available datasets provide a convenient starting point for training LLMs, they still require significant quality control and preprocessing to ensure effective learning and reliable outcomes. These datasets may contain inconsistencies, duplicates, formatting errors, or low-quality content such as spam, boilerplate text, or irrelevant data. Without careful cleaning and filtering, such noise can negatively impact model performance, leading to issues like factual inaccuracies, poor grammar, or biased outputs. Moreover, even well-structured datasets may lack uniformity across sources in terms of language, style, or encoding, requiring normalization and alignment. Ultimately, although public datasets are easy to access, turning them into high-quality training data still demands thoughtful curation and engineering effort.

APIs¶

An API (Application Programming Interface) is a set of rules and protocols that allows different software systems to communicate with each other. In the context of web services, an API enables users or applications to request specific data or functionality from a server in a structured and predictable way — usually via HTTP requests that return data in formats like JSON or XML. APIs are commonly used to interact with services like weather apps, social media platforms, and public data providers without needing to scrape web pages or access databases directly.

To collect public data from websites, developers can use APIs provided by those sites to programmatically request and retrieve the data they need. For example, services like Wikipedia and arXiv offer public APIs that let users fetch article content or research paper metadata, respectively. By writing scripts to send repeated queries to these APIs, one can gather large datasets efficiently and legally, often with the ability to filter or customize the data being returned. This method is more reliable, scalable, and respectful of websites' structures and terms of use compared to web scraping.

Particularly for popular APIs, there are often software libraries that provide a simplified, higher-level interface for interacting with the API. Instead of manually writing HTTP requests, handling authentication, and parsing raw responses (often in JSON or XML), such a wrapper library abstracts these details into easy-to-use functions or classes. This makes it faster and more convenient for developers to access API functionality without dealing with the complexity of the underlying protocol or data structures.

For example, instead of manually sending a request to the Twitter API endpoint and parsing the JSON response, a wrapper like Tweepy (for Python) allows you to call methods like api.user_timeline() or api.search_tweets() directly. These libraries often include built-in error handling, pagination, and rate-limit management, reducing boilerplate code and potential bugs. In short, API wrapper libraries boost developer productivity and make working with public data sources more intuitive and efficient.

For example, the wikipedia library provides a wrapper for Wikipedia to perform searches and get content such as full articles, summaries, links, images and more. For a minimal working example, let's perform a search using the search term "Python". Instead of manually writing and executing the required HTTP request, the library provides a method search() that wraps this HTTP request, making the code much leaner and less error-prone. The code cell below shows the search for "Python" and prints the top-3 results.

In [3]:
wiki_search_results = wikipedia.search("Python", results=3)

for rank, result in enumerate(wiki_search_results):
    print(f"{rank+1}. {result}")
1. Python
2. Monty Python
3. Python (programming language)

With another method page(), we can now fetch an actual Wikipedia article. Again, this method simply wraps a corresponding HTTP request to the API. Let's fetch the article for Python the programming language — the correct title of the article we got from the search results (see above).

In [4]:
wiki_page = wikipedia.page("Python (programming language)")

The result is the Wikipedia article as an instance of a class of the wikipedia library. This class has multiple member variables and methods to access the actual data. For example, the code cell below shows the unique ID and title of the page — this information can be used to deduplicate data (see below).

In [5]:
print(wiki_page.pageid)
print(wiki_page.title)
23862
Python (programming language)
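
As a minimal sketch of this idea (the list of candidate titles below is purely illustrative), we can keep a set of page IDs we have already collected and skip any page whose ID has been seen before, even if it was reached via a different title:

# Illustrative titles only -- in practice these would come from search results
# or a crawl frontier, where the same article may be reached under several names.
candidate_titles = ["Python (programming language)", "Python"]

seen_page_ids = set()
collected_pages = []

for title in candidate_titles:
    try:
        page = wikipedia.page(title, auto_suggest=False)
    except (wikipedia.exceptions.DisambiguationError, wikipedia.exceptions.PageError):
        continue  # skip ambiguous or missing titles
    if page.pageid in seen_page_ids:
        continue  # duplicate article reached via a different title
    seen_page_ids.add(page.pageid)
    collected_pages.append(page.content)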

The actual content of the article is also stored in a member variable. Since the article for the Python programming language is quite long, the code cell below only prints the first several hundred characters.

In [6]:
print(f"{wiki_page.content[:1200]}...")
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.
Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language, and he first released it in 1991 as Python 0.9.0. Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming languages, and it has gained widespread use in the machine learning community.


== History ==

Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands; it was conceived as a successor to the ABC programming language, which was inspired by SETL, capable of exception handling and interfacing with the Amoeba operating system. Python implementation began in Dec...

This content can now be added to a dataset for training an LLM.

Of course, there might not be a wrapper library for all APIs you want to access. In this case, you will need to write the actual HTTP requests to get the data. While this is generally also a straightforward task, the code will be a bit more verbose.
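
As an illustration, the same search we performed above with the wrapper can be issued as a raw HTTP request against the public MediaWiki API using the requests library (introduced in more detail in the web scraping section below); the parameters follow the MediaWiki search endpoint and would look different for other APIs:

# Raw HTTP request against the MediaWiki search API -- roughly what the
# wikipedia library's search() method wraps for us.
params = {
    "action": "query",
    "list": "search",
    "srsearch": "Python",
    "srlimit": 3,
    "format": "json",
}

response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response.raise_for_status()

for rank, hit in enumerate(response.json()["query"]["search"]):
    print(f"{rank+1}. {hit['title']}")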

In some sense, the Wikipedia API is an exception since it does not require any access credentials (e.g., private keys). Most APIs, even if they provide access to public data, require access credentials such as API keys to ensure secure and controlled usage. These credentials allow the API provider to track how the service is used, enforce rate limits, and prevent abuse such as spam, scraping, or denial-of-service attacks. By tying requests to specific users or applications, providers can monitor usage patterns and detect suspicious behavior. Additionally, access credentials enable access control and usage analytics. They allow API providers to offer different levels of service (e.g., free vs. premium tiers) and gather data on how their APIs are being used.
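
A common pattern, sketched below with a purely hypothetical endpoint and header scheme, is to pass such a credential with every request, for example as an HTTP header; the actual URL, header name, and authentication flow are defined by each provider's documentation:

import os

# Hypothetical endpoint and header scheme -- consult the provider's API
# documentation for the real URL, header name, and authentication flow.
api_key = os.environ.get("SOME_API_KEY")  # keep credentials out of source code

response = requests.get(
    "https://api.example.com/v1/posts",
    headers={"Authorization": f"Bearer {api_key}"},
    params={"limit": 100},
)
response.raise_for_status()
data = response.json()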

Web Scraping¶

If a website or online platform is not offering an API to access the data, it is often still possible to get the data through web scraping. Web scraping is the automated process of extracting data from websites by simulating human browsing behavior. It involves writing scripts or using tools to send HTTP requests to web pages, parse the HTML content, and extract specific information such as text, links, images, or tables. Web scraping is commonly used to gather data that is publicly visible on websites but not provided through structured APIs.

In many cases, web scraping in terms of simply getting the raw HTML source code can be very straightforward. The code below shows a minimal working example using the popular requests library (not part of the Python standard library, but included in many Python distributions and easy to install via pip or conda). You can use this library to send HTTP requests to web servers, allowing you to retrieve or interact with web content programmatically. It simplifies making GET, POST, and other HTTP method calls, handling things like URL parameters, headers, and cookies with an easy-to-use interface.

To fetch a page, we can use the get() method, which takes the URL of the page you want as its main input argument. In the code cell below, we again fetch the Wikipedia article about the Python programming language. This time, however, we do so not via the API but by directly fetching the HTML source code of the article page. We can extract the HTML using the text member variable of the response object returned by the GET request.

In [7]:
response = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")

wiki_page_html = response.text

Let's have a look at the first thousand characters of the HTML source we have just received.

In [8]:
print(wiki_page_html[:1000])
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-sticky-header-enabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Python (programming language) - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enab

HTML does not just hold the actual content you see on a web page — like text, images, or links — but also includes instructions on how that content should be structured and displayed. These instructions come in the form of tags (like <h1>, <p>, <img>, and <a>) which tell the browser things like "this is a heading", "this is a paragraph", or "insert an image here". Beyond that, HTML can also include metadata (information about the page), like the page title, character set, and instructions for search engines — often found in the <head> section. It can also include links to external stylesheets (CSS for design) or scripts (JavaScript for interactivity), helping control how the page looks and behaves, even if those parts are not visible in the actual content. Extracting the actual content that we can use as training data from a web page (e.g., the paragraphs of a Wikipedia or news article) is a separate step we will discuss in a bit.

Like with the API, Wikipedia makes fetching content through web scraping relatively easy; see the two or three lines of code in the previous example. This is mostly because Wikipedia articles are (mostly) static websites and the site is "OK" with web scraping. However, web scraping is becoming more technically challenging because many modern websites are built using dynamic content and employ anti-scraping measures to protect their data.

  • Dynamic content: Instead of loading all the information at once with basic HTML, many websites now use JavaScript to load content after the page initially loads. This means tools like requests (which only fetch HTML) cannot see the real content. To deal with this, scrapers often need to use more advanced tools like Selenium or Playwright which simulate a real browser and can interact with the JavaScript; a minimal Selenium sketch is shown after this list.

  • Anti-scraping measures: More and more websites increasingly use techniques to detect and block bots. These include rate limiting, CAPTCHAs, requiring login sessions, checking headers for signs of automation, and even using services like Cloudflare to block suspicious traffic. To bypass these, scrapers need to mimic human behavior more closely, rotate IP addresses, manage cookies and sessions, and sometimes even solve CAPTCHAs.
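
The following is a minimal Selenium sketch of the first point (it assumes Selenium and a matching browser driver are installed): the browser executes the page's JavaScript, so the HTML we read back also contains dynamically loaded content that plain requests would miss.

from selenium import webdriver

# Run the browser without opening a visible window.
options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    # The browser loads the page and executes its JavaScript before we read the HTML.
    driver.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
    rendered_html = driver.page_source
finally:
    driver.quit()

print(rendered_html[:200])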

Regarding the latter issue, web scraping exists in a legal grey area because whether it’s allowed often depends on how and what you're scraping, as well as the terms and conditions of the website. While scraping publicly available data is not automatically illegal, many websites explicitly prohibit automated access in their Terms of Service (ToS) — violating these terms could lead to legal action. The legal uncertainty also stems from different interpretations of what constitutes unauthorized access or data misuse. For example, scraping publicly visible information might be allowed in one jurisdiction but considered unauthorized access in another. Furthermore, scraping copyrighted content, personal data, or bypassing technical barriers (like CAPTCHAs or logins) can all raise serious legal risks. This is why it is crucial to check a site's ToS, consult legal advice if unsure, and respect ethical boundaries when scraping.

Data Extraction & Conversion¶

The datasets for training LLMs are often collected from online resources which are typically not stored or represented as "ready-to-use" plain text. Common formats like HTML, PDF, DOCX and others contain more than just the main written content — they also include a wide range of structural, formatting, and metadata information. In HTML, for example, there are numerous tags (<div>, <span>, <nav>, <script>, etc.) used to define layout, style, and functionality rather than meaningful content. PDF files often include visual layout data such as font positioning, headers and footers, page numbers, or embedded forms and images. DOCX files store rich formatting instructions (like bold, italics, styles, and comments), tracked changes, and even the identity of the author.

When preparing training data for large language models (LLMs), this extra non-textual information needs to be removed because it introduces noise and bias into the learning process. LLMs learn patterns in natural language — not in document formatting or web layout logic. If this metadata or formatting is left in, the model may waste capacity learning irrelevant structure (like how tables are aligned in PDFs or how navigation menus look in HTML), which dilutes its ability to understand and generate meaningful language. Cleaning and normalizing this data ensures the model focuses on pure linguistic content, leading to better generalization and more natural language generation.

By default, the pure linguistic content is represented as plain text containing the words, numbers, punctuation marks, etc. of a document. However, Markdown has become a preferred format for training large language models (LLMs) because it strikes an ideal balance between readability, structure, and simplicity — all of which help LLMs learn from high-quality, well-organized text. Markdown uses lightweight, human-readable syntax to represent things like headings, lists, code blocks, and links without the clutter of complex tags (like in HTML or DOCX). For example, a heading looks like # Heading, and a code block uses triple backticks — formats that are easy for both humans and models to interpret. This structure helps the model recognize patterns like "this is a title", "this is a section", or "this is code" which improves its ability to understand document layouts and respond with appropriately formatted output.

Additionally, Markdown is widely used in technical documentation, wikis (e.g. GitHub, Stack Overflow), blogs, and open-source datasets — many of which are rich, high-quality sources of natural language. In fact, Jupyter notebooks use Markdown to format narrative content! Because Markdown is both clean and expressive, it provides LLMs with clear training signals while minimizing noise, making it ideal for learning from structured yet natural text. So let's have a look at how to convert some common formats into plain text or Markdown.

HTML (Hypertext Markup Language)¶

HTML (HyperText Markup Language) is the standard language used to create and structure content on the web. It provides the basic building blocks for web pages by using a system of tags and elements to define things like headings, paragraphs, links, images, and other types of content. These tags tell web browsers how to display the content and how different elements relate to one another. HTML is not a programming language; it's a markup language, meaning it organizes and labels content rather than performs logic or calculations.

A major challenge when working with HTML is that it is not expected to be 100% correct. This means that web browsers are designed to be forgiving and can still display web pages even if the HTML code has errors or does not strictly follow the rules. For example, you might forget to close a tag like <p> (for paragraph), or nest elements incorrectly, and the browser will usually still try to interpret and render the page as best it can. This flexibility helps ensure that websites don’t break easily due to small mistakes, making HTML more accessible to beginners.

However, this poses challenges when it comes to using web pages as sources for training data. It is therefore recommended to use established libraries that are as forgiving as web browsers. One of the more common libraries is BeautifulSoup, which can parse poorly formatted or incomplete HTML and "fix" it internally, allowing you to navigate and extract data as if the HTML were valid. It provides an easy-to-use, Pythonic interface to search and modify the HTML DOM tree using tags, attributes, and text, even when the underlying HTML structure is messy. For example, the code cell below shows how to convert a variable with HTML into the internal representation of BeautifulSoup using the Wikipedia article we have scraped previously.

In [9]:
# Parse HTML
wiki_soup = BeautifulSoup(wiki_page_html, 'html.parser')

We can now use the provided API to navigate the document and extract the bits we are interested in. The find_all() method in BeautifulSoup is used to search the entire HTML document and return a list of all elements that match a specific tag, attribute, or other criteria. It is commonly used for extracting multiple elements, such as all <div> tags, all links (<a> tags), or all paragraphs (<p> tags) on a page. Paragraphs are often used to mark up parts of the main content. For example, the code snippet below prints the first three (non-empty) paragraphs of our Wikipedia article.

In [10]:
num_printed = 0
for p in wiki_soup.find_all("p"):
    if p.text.strip() != "":
        print(p.text)
        num_printed += 1
        if num_printed >= 3:
            break
Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.[33]

Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language, and he first released it in 1991 as Python 0.9.0.[34] Python 2.0 was released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.[35]

The text property of a BeautifulSoup object (or any tag object) returns all the text content within the HTML element, with all HTML tags stripped out. It is useful when you want to extract just the readable text from a tag or an entire HTML document. However, note that any tags nested inside the <p> tags also get removed. This means that, for example, information about links or text decorations is also lost.

While we can use BeautifulSoup to preserve all required information — and potentially convert it into Markdown — there are also libraries that simplify this process. For example, the html2text library is a Python module that converts HTML content into clean, readable Markdown-formatted plain text. It is especially useful when you want to extract text from web pages or emails while preserving some of the structure — like headings, links, bold text, and lists — without keeping the raw HTML tags. Unlike simply stripping out HTML tags, html2text maintains formatting elements in a way that is both lightweight and meaningful.

The code cell below shows a slightly more advanced example for processing Wikipedia articles using both BeautifulSoup and html2text. First, we use BeautifulSoup to extract the main part of the article — notice that this requires knowing how to find those parts (here the <div> element with the id mw-content-text). Since the main content also includes the commonly featured info box, we use the find() and decompose() methods of BeautifulSoup to find and remove this info box. Lastly, we let html2text convert the remaining HTML source code into Markdown.

In [11]:
# Parse HTML
wiki_soup = BeautifulSoup(wiki_page_html, 'html.parser')

# Find the DIV that holds the main content of the Wikipedia article
wiki_page_html_content = wiki_soup.find("div", {"id": "mw-content-text"})

# Remove info box with tabular data
wiki_page_html_content.find("table", {"class": "infobox vevent"}).decompose()

# Convert HTML to markdown
wiki_page_markdown_content = html2text.html2text(str(wiki_page_html_content))

# Print the first 1,000 characters of the page in Markdown format
print(f"{wiki_page_markdown_content[:1000]}...")
General-purpose programming language

**Python** is a [high-level](/wiki/High-level_programming_language "High-level
programming language"), [general-purpose programming language](/wiki/General-
purpose_programming_language "General-purpose programming language"). Its
design philosophy emphasizes [code readability](/wiki/Code_readability "Code
readability") with the use of [significant
indentation](/wiki/Significant_indentation "Significant indentation").[33]

Python is [dynamically type-checked](/wiki/Type_system#DYNAMIC "Type system")
and [garbage-collected](/wiki/Garbage_collection_\(computer_science\) "Garbage
collection \(computer science\)"). It supports multiple [programming
paradigms](/wiki/Programming_paradigm "Programming paradigm"), including
[structured](/wiki/Structured_programming "Structured programming")
(particularly [procedural](/wiki/Procedural_programming "Procedural
programming")), [object-oriented](/wiki/Object-oriented "Object-oriented") and
[functional programmi...

Notice how this output has preserved the information about links and text decorations using the Markdown format.

Overall, it is important to keep in mind that converting HTML to Markdown is not always accurate, even with established libraries, because HTML is far more expressive and flexible than Markdown. HTML supports complex layouts, nested structures, and visual elements (like tables, forms, CSS styling, and JavaScript interactions) that simply do not have direct equivalents in Markdown. Markdown is a simplified markup language designed for readability and ease of use, not for full-fidelity representation of web pages. As a result, during conversion:

  • Layout and styling are lost (e.g., CSS classes, inline styles).
  • Complex elements like tables, forms, and embedded media may be converted imperfectly or omitted entirely.
  • Nested HTML tags can produce unexpected or broken Markdown formatting.
  • Some HTML-specific constructs (like <span>, <div>, or custom elements) have no Markdown equivalent and may be ignored or flattened in ways that lose meaning.

Because of these limitations, conversions often require manual cleanup or adjustment, especially when accuracy and structure are critical.

PDF (Portable Document Format)¶

PDF (Portable Document Format) is a file format created by Adobe that lets you share documents while keeping their original layout, fonts, images, and formatting the same on any device or computer. Whether you're reading a PDF on a phone, tablet, or desktop, it looks exactly the way the author intended. PDFs are often used for things like reports, resumes, eBooks, and forms because they are great at preserving how a document looks, including multiple columns, headers, page numbers, and graphics. But because of this fixed layout, extracting just the main text (for example, for machine learning or editing) can be tricky — it may include hidden formatting, repeated headers, or jumbled reading order that is not obvious to the human eye.

The PDF format is designed primarily for presentation, not structure — meaning it focuses on how content looks on a page rather than how it is logically organized. As a result, when converting a PDF to Markdown, it is difficult to accurately extract and reconstruct elements like paragraphs, headings, lists, or tables, because these are not explicitly marked in the file. Instead, text is positioned using coordinates, often in fragments, across a fixed layout. This makes it hard to tell where one section ends and another begins, especially with multi-column layouts, overlapping elements, or repeated headers and footers. Markdown, by contrast, requires clean, hierarchical structure, so bridging this gap requires sophisticated models to interpret layout, semantics, and reading order — making PDF-to-Markdown conversion a complex and error-prone task.

However, the omnipresence of PDF and the desire to extract its content to train LLMs spurred the development of many sophisticated libraries that try to convert a PDF to Markdown as accurately as possible. One of these libraries, which we will use for an example, is marker-pdf. This library is an advanced tool designed to extract clean, structured text from PDF documents by combining traditional layout analysis with machine learning models. Unlike basic PDF parsers that simply dump text based on coordinates, marker-pdf intelligently interprets document structure — such as headings, paragraphs, lists, tables, and reading order — even in complex layouts. Its strength lies in its ability to produce high-quality, semantically meaningful output (often in Markdown-like form), making it especially useful for preparing PDF data for machine learning, natural language processing, or content repurposing where preserving the logical flow of information is critical.

In the marker-pdf library, the PdfConverter class is what actually handles the core logic of reading and converting PDF files into your desired output format — such as Markdown, HTML, JSON, or extracted document chunks. Under the hood, it performs the following main steps:

  • Initializing various models (e.g. layout detection, OCR, text recognition, table detection, and inline math detection) to work together seamlessly.
  • Rendering each page of a PDF, which involves loading the file from a path or URL, running layout analysis to segment the page into blocks, applying text/OCR, table parsing, and inline math recognition and constructing a structured “rendered” object with all detected elements.
  • Passing the rendered output to a renderer such as Markdown but also HTML or JSON, to turn the parsed document into a string output.
  • Returning the fully rendered representation, which is then usually post-processed to extract text, images, and other artifacts.

The code snippet below shows a minimal example of extracting the text from a PDF and converting it into the Markdown format.

In [12]:
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)

rendered = converter(example_pdf)

pdf_doc_markdown, _, _ = text_from_rendered(rendered)
Recognizing layout: 100%|███████████████████████████████████████████████| 1/1 [00:01<00:00,  1.88s/it]
Running OCR Error Detection: 100%|██████████████████████████████████████| 1/1 [00:00<00:00, 13.13it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]

pdf_doc_markdown is a string variable that holds the text content of the PDF as Markdown; so let's just print it:

In [13]:
print(f"{pdf_doc_markdown}")
# Data Preparation for Training LLMs

### 1 Introduction

Data preparation is a critical foundational step in training large language models (LLMs), involving the collection, cleaning, formatting, and structuring of vast amounts of textual data to ensure quality, diversity, and relevance. This process includes tasks such as deduplication, normalization, tokenization, filtering harmful or low-quality content, and balancing data across domains and languages to minimize bias and improve model performance. Properly prepared data enables LLMs to learn effectively, generalize across tasks, and generate coherent, informative responses, making data preparation as essential to success as model architecture or training techniques.

## 2 Data Collection

Data collection for training large language models (LLMs) involves gathering extensive and diverse text sources from the internet, books, academic articles, and other publicly available materials. The goal is to compile a broad and representative dataset that captures the richness of human language, knowledge, and context. This stage is crucial, as the quality and diversity of collected data directly impact the model's capabilities and performance.

#### 2.1 Public Datasets

...

2.2 APIs

...

### 2.3 Web Scraping

...

Notice that the conversion of the PDF into Markdown may not be perfect (the exact output might also depend on the version of marker-pdf you have installed), and you may want to look at the original PDF to observe the discrepancies. Again, converting a PDF to Markdown is challenging because even dedicated tools cannot always fully recover the PDF's layout and semantic structure. PDFs do not store content in a logical reading order, so reconstructing elements like headings, lists, or tables in clean Markdown often leads to formatting errors or missing content. This makes the conversion process inherently lossy and inconsistent.

DOCX¶

DOCX is a widely used file format for word processing documents, developed by Microsoft as part of the Office Open XML (OOXML) standard. Introduced with Microsoft Word 2007, DOCX replaced the older DOC format to provide better compression, improved data recovery, and enhanced compatibility across platforms. The format stores documents as a collection of XML files and associated resources, compressed into a single ZIP archive. Due to its open and structured nature, DOCX allows for easier integration with other software, enabling developers to create, edit, or extract content without needing Microsoft Word itself. This has made DOCX the default format not only for Microsoft Word but also for many other word processors and online document tools, supporting consistent document formatting and content preservation across different systems.
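
Because a DOCX file is simply a ZIP archive of XML files, this structure can be inspected directly with Python's standard zipfile module; a minimal sketch using the example document downloaded earlier (assuming example_docx holds its local file path):

import zipfile

# A DOCX file is a ZIP archive; the main text lives in word/document.xml.
with zipfile.ZipFile(example_docx) as archive:
    # List the first few files inside the archive.
    for name in archive.namelist()[:10]:
        print(name)

    # Peek at the raw XML that stores the document body.
    print(archive.read("word/document.xml")[:200])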

Converting DOCX to Markdown is generally easier and more accurate than converting PDF. DOCX files are structured using XML and include clear semantic tags for elements like headings, lists, bold text, and links — features that align closely with Markdown syntax. This also means that a variety of libraries are available to convert DOCX to Markdown. In the following short example, we use markitdown by Microsoft. First, we need to create an instance of the converter class:

In [14]:
md = MarkItDown()

Now we can use the convert() method to convert a specified DOCX document and print the Markdown output:

In [15]:
result = md.convert(example_docx)

print(result.text_content)
Data Preparation for Training LLMs

# 1 Introduction

Data preparation is a critical foundational step in training large language models (LLMs), involving the collection, cleaning, formatting, and structuring of vast amounts of textual data to ensure quality, diversity, and relevance. This process includes tasks such as deduplication, normalization, tokenization, filtering harmful or low-quality content, and balancing data across domains and languages to minimize bias and improve model performance. Properly prepared data enables LLMs to learn effectively, generalize across tasks, and generate coherent, informative responses, making data preparation as essential to success as model architecture or training techniques.

# 2 Data Collection

Data collection for training large language models (LLMs) involves gathering extensive and diverse text sources from the internet, books, academic articles, and other publicly available materials. The goal is to compile a broad and representative dataset that captures the richness of human language, knowledge, and context. This stage is crucial, as the quality and diversity of collected data directly impact the model's capabilities and performance.

## 2.1 Public Datasets

...

## 2.2 APIs

...

## 2.3 Web Scraping

...

In fact, markitdown also supports the conversion of PDF documents, but marker-pdf typically does a much better job.

In short, open-source libraries play a crucial role in converting document formats like HTML, PDF, and DOCX into Markdown, enabling standardized and readable text representations. Tools such as html2text, marker-pdf, and markitdown — but also many others — allow you to extract and structure content from diverse formats, preserving key elements like headings, lists, and links. Markdown's simplicity and consistency make it ideal for processing and further text manipulation.

These conversions are especially valuable when creating datasets for training large language models (LLMs). By transforming rich-format documents into clean, lightweight Markdown, developers can compile diverse and high-quality training data with consistent formatting. This improves both the scalability of dataset creation and the overall quality of input used to train models in tasks like summarization, question answering, or content generation.

Preliminary Data Cleaning¶

Data cleaning is the process of refining and preparing a raw dataset — for example, the original pages and documents collected from the Web — to ensure it is of high quality, consistent, and safe for model training. This involves removing duplicates, filtering out low-quality or irrelevant content, eliminating personally identifiable information (PII), ensuring language consistency, and excluding harmful or offensive material. The goal is to create a clean, diverse, and representative dataset that helps the model learn effectively while minimizing the risk of bias, memorization, or the generation of inappropriate outputs.

We discuss all these steps in more detail in subsequent sections. The focus right now is on basic but very important preliminary data cleaning to ensure that steps such as data deduplication or quality-based filtering of content perform well. These preliminary data cleaning steps typically refer to removing noise or normalizing strings that represent a document.
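
A typical example of such a preliminary step is Unicode and whitespace normalization. The sketch below uses Python's standard unicodedata module; the chosen normalization form (NFKC here) is a common but not universal choice and depends on the downstream pipeline:

import re
import unicodedata

def normalize_text(text):
    # Map compatibility characters (e.g., ligatures, full-width forms)
    # to their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of whitespace (including non-breaking spaces) into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(normalize_text("ﬁle\u00a0name  with   odd\tspacing"))
# file name with odd spacing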

Markup¶

Markup in a text document refers to special symbols or codes embedded within the text that describe its structure, formatting, or presentation, rather than being part of the content itself. Markup helps computers interpret how to render or process text, such as identifying headings, links, emphasis (bold/italic), or layout elements. Common examples of markup are HTML (HyperText Markup Language), Markdown, LaTeX, XML (eXtensible Markup Language), or BBCode (Bulletin Board Code).

Markup is essential for organizing and displaying text, but for training language models, it is often stripped away unless the model is specifically designed to understand structured or formatted input — like in the case of Markdown. As an example, let's consider a simple HTML document. In the code cell below, the given string will be rendered as "Words can be in bold or in italics."

In [16]:
html_string = "<p>Words can be in <b>bold</b> or in <i>italics</i>.</p>"

Here, removing the markup simply means removing all the HTML tags, which can efficiently be done using regular expressions (RegEx). The method remove_html_tags() uses a regular expression to replace all tags — that is, anything between and including the angled brackets — with an empty string, effectively removing all tags.

In [17]:
def remove_html_tags(text):
    return re.sub(r'<[^>]+>', '', text, flags=re.IGNORECASE)

We can now apply this method on our example HTML document to remove all formatting information.

In [18]:
print(remove_html_tags(html_string))
Words can be in bold or in italics.

Note that the HTML source code for real websites is much more complex, and simply removing all the tags is unlikely to yield desired outputs. Libraries like BeautifulSoup are typically preferred over regular expressions for cleaning HTML documents because they are specifically designed to parse and manipulate HTML and XML, even when the markup is malformed or inconsistent. HTML is not a regular language, meaning it has nested and hierarchical structures that regular expressions struggle to handle reliably. BeautifulSoup understands these structures, allowing for more accurate, robust, and readable extraction or removal of elements based on tags, attributes, or content—something that's error-prone and brittle with RegEx alone.
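
For comparison, the same markup removal on our small example can be done by parsing instead of pattern matching; a minimal sketch using get_text() (discussed in more detail below):

# Parse the HTML and return only the text content, with all tags stripped.
print(BeautifulSoup(html_string, "html.parser").get_text())
# Words can be in bold or in italics.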

Boilerplate Content¶

In the context of a text document, boilerplate content refers to sections of text that are repetitive, standardized, or non-informative and appear across many documents with little to no variation. This type of content is often automatically generated or reused verbatim, providing minimal unique or meaningful information for training purposes. In large-scale text datasets, boilerplate can dilute the quality of the data and introduce redundancy or noise, which is why it’s commonly removed during preprocessing.

Common examples of boilerplate include website headers and footers, cookie consent banners, navigation menus, copyright notices, disclaimers, and templated phrases like “subscribe to our newsletter” or “all rights reserved.” In academic or legal documents, boilerplate may include standardized legal clauses or citation formats. Identifying and removing such content helps ensure that language models focus on learning from substantive, diverse, and original language.

It is arguably intuitive that boilerplate content should be removed when creating a dataset to train large language models because it adds noise and redundancy without contributing meaningful linguistic or contextual variety. Since boilerplate text is often repeated across many documents—such as headers, footers, disclaimers, or template phrases—it can cause the model to overrepresent certain patterns or phrases, leading to biased or unnatural outputs. Additionally, boilerplate content tends to lack semantic depth or diversity, which reduces the overall quality and informativeness of the training data. By removing it, you ensure that the model learns from richer, more diverse, and contextually meaningful language, improving its ability to generalize, generate coherent responses, and understand nuanced inputs.

As mentioned before, the type and amount of boilerplate content typically depends on the types of documents and their representations. Particularly modern web pages contain a lot of boilerplate content because they are designed not only to present information but also to serve multiple functional, navigational, and commercial purposes. Elements like navigation bars, footers, cookie consent banners, social media links, advertisements, and user interface components (e.g., modals, carousels) are reused across pages to provide a consistent user experience and support business goals such as engagement and monetization. Additionally, content management systems (CMS) and templating frameworks often generate standardized layouts by default, embedding large amounts of repetitive HTML, scripts, and styling unrelated to the main content. This makes boilerplate both a byproduct of modern web design and a challenge for content extraction tasks.

A very basic method to remove boilerplate content from the HTML source code of a web page is to remove the content in HTML tags associated with boilerplate parts of the page. To give an example, consider the following very simple HTML document.

In [19]:
example_page_html = """
<html>
<head><title>Sample Page</title></head>
<body>
    <header><h1>Site Header</h1></header>
    <nav>Main navigation menu</nav>
    <article>
        <h2>Main Content Title</h2>
        <p>This is the main article content.</p>
    </article>
    <aside>Related links and ads</aside>
    <footer>Footer with contact info</footer>
</body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(example_page_html, "html.parser")

HTML tags like <header>, <footer>, <nav>, <aside>, <script>, and <style> serve structural and functional roles in organizing and enhancing web content. The <header> tag typically contains introductory content or navigation links relevant to the page or a section, while <footer> holds metadata, contact info, or copyright notices usually found at the bottom. The <nav> tag defines navigational menus that help users move through the site, and <aside> is used for tangential content like sidebars, callouts, or related links that are not central to the main narrative.

On the functional side, <script> is used to embed or reference JavaScript, enabling interactive features like form validation, dynamic updates, or analytics. The <style> tag allows for embedding CSS rules directly within the HTML document, defining how elements are displayed. These tags are essential in modern web development for separating content, design, and interactivity, but they often introduce boilerplate content when extracting the core textual information from a page.

Using the BeautifulSoup library, removing such content is quite straightforward; see the code cell below. First, we define all the tags we consider to contain boilerplate content. We can then use the find_all() and decompose() methods of BeautifulSoup to find and remove those tags (incl. their content).

In [20]:
# Define boilerplate tags to remove
boilerplate_tags = ['header', 'footer', 'nav', 'aside', 'script', 'style']

# Remove the boilerplate elements
for tag in boilerplate_tags:
    for element in soup.find_all(tag):
        element.decompose()

To see the effect of the previous code, we can print the resulting HTML document as well as the final output after removing all HTML tags. The get_text() method in the BeautifulSoup library is used to extract all the visible text from an HTML or XML document, removing tags and returning a plain string. It concatenates the text from all descendant elements and is useful for retrieving the human-readable content of a page. The method also supports optional arguments like strip=True to remove leading/trailing whitespace and separator to specify how text segments are joined.

In [21]:
# Print cleaned HTML or text
cleaned_html = str(soup)
cleaned_text = soup.get_text(strip=True, separator="\n")

print("CLEANED HTML:\n", cleaned_html)
print("\nCLEANED TEXT:\n", cleaned_text)
CLEANED HTML:
 
<html>
<head><title>Sample Page</title></head>
<body>


<article>
<h2>Main Content Title</h2>
<p>This is the main article content.</p>
</article>


</body>
</html>


CLEANED TEXT:
 Sample Page
Main Content Title
This is the main article content.

It is important to keep in mind that website providers are encouraged, but not strictly required, to use semantic HTML tags—such as <header>, <footer>, <nav>, and <aside> — to indicate boilerplate or structural content. These tags are part of the HTML5 standard and are promoted as best practices because they improve accessibility, search engine optimization (SEO), and maintainability by giving structure and meaning to web content. However, there is no formal enforcement mechanism requiring developers to use these tags consistently. Many sites still use generic containers like <div> or <span> for layout purposes, sometimes with CSS classes (e.g., class="navbar" or class="footer"), which can obscure the semantic intent. As a result, tools that extract content must often rely on heuristics, tag names, and class patterns to detect and filter out boilerplate, since adherence to semantic tagging varies widely across websites.

Particularly for HTML documents — since they often contain a lot of diverse boilerplate content, an alternative to removing this content is to extract all relevant content (i.e., all non-boilerplate content). Extracting the main content from a single website is generally easy because its HTML structure and layout are consistent across pages. Once the relevant tags, classes, or patterns for the main content are identified (e.g., a specific <div> or <article>), they can be reliably targeted for extraction.

However, when automated web scraping is used to collect a dataset, not all sites might be known a priori. In this case, more or less complex heuristics are applied to extract the main content of a page, often in combination with looking at tags. For example, a longer multi-sentence paragraph in a page is more likely to represent (some part of) the main content than, say, the header/footer or navigation components; a small sketch of such a heuristic follows below.
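
To make this more concrete, the code cell below sketches such a length-based heuristic using the example page from above. It is only an illustration, not a production-ready extractor, and the thresholds are arbitrary.

In [ ]:
# A minimal sketch of a length-based content heuristic (not a production-ready
# extractor): keep only <p> elements whose text is reasonably long and looks
# like prose, assuming such blocks are more likely to be main content.
candidate_soup = BeautifulSoup(example_page_html, "html.parser")

main_paragraphs = []
for p in candidate_soup.find_all("p"):
    text = p.get_text(strip=True)
    # Hypothetical thresholds: at least 20 characters and a sentence-ending period
    if len(text) >= 20 and text.endswith("."):
        main_paragraphs.append(text)

print(main_paragraphs)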

Python offers several libraries for extracting the main content from web pages, with popular ones including Readability (via readability-lxml), BoilerPy3, and Goose3. These tools generally work by analyzing the HTML structure and applying heuristics to identify and score content-rich blocks. For example, they may look at tag types (<p>, <article>, etc.), text density, link-to-text ratios, and the size or depth of DOM nodes. The goal is to locate sections with high information density and minimal noise, assuming that main content is longer, contains more paragraphs, and has fewer links compared to boilerplate sections. This heuristic-based approach allows them to generalize across many websites without needing custom rules for each one.

For a short example, the Document class in the Readability library is a high-level interface used to extract the main content from an HTML document. It applies content extraction heuristics inspired by the original Readability.js algorithm to isolate the most relevant parts of a web page — typically the main article or body text — while removing boilerplate such as navigation, sidebars, and ads. When you create a Document object by passing raw HTML as input, it analyzes the structure and allows you to access cleaned output through methods like .summary() for the extracted HTML content; see the code cell below using the example HTML document we have created earlier.

In [22]:
doc = Document(example_page_html)

print(doc.summary())
<html><body><div><body id="readabilityBody">
    
    <nav>Main navigation menu</nav>
    <article>
        <h2>Main Content Title</h2>
        <p>This is the main article content.</p>
    </article>
    
    
</body>
</div></body></html>

While <head> and <header> have been removed, <nav> has not (note: this output may depend on the exact version of the library). This shows that the implemented heuristics classify this part as main content. The reason for this might be that <nav> content typically contains several links which are not present in this example.

In general, extracting the main content from web pages using heuristics is challenging in practice because web pages vary widely in structure, design, and coding conventions. As mentioned before, there is no universal standard for how main content is marked up. Additionally, modern websites often include large amounts of non-content elements such as ads, navigation menus, popups, and dynamic content injected via JavaScript, which can confuse heuristic algorithms. Heuristic methods must rely on imperfect signals like text length, tag types, or link density to infer which sections are meaningful, and these signals are not always reliable — especially on content-light pages or those with unusual layouts. Furthermore, websites frequently change their design, breaking previously effective rules and requiring constant adaptation.

Unicode¶

Unicode is a universal character encoding standard that provides a unique number (called a code point) for every character, symbol, or emoji in nearly every language and writing system in the world. Its goal is to enable consistent encoding, representation, and handling of text across different platforms, programs, and languages. For example, the letter "A" is represented in Unicode as U+0041, while the Chinese character "你" is U+4F60.

Before Unicode, different systems used various encodings (like ASCII, ISO 8859, or Shift-JIS), which led to compatibility issues and data corruption when exchanging text between systems. Unicode solves this by standardizing how text is stored and transmitted. It supports over 100,000 characters and continues to expand to include new symbols, scripts, and even emojis. Unicode is the foundation of modern text handling on the web, in programming languages (like Python), and in operating systems.

However, Unicode also poses challenges when working with text data. For example, some Unicode characters look very similar because the Unicode standard aims to encode every character from all the world's writing systems, including those with overlapping visual designs. This can lead to the inclusion of characters that are nearly indistinguishable in appearance but are distinct in their linguistic, cultural, or technical usage. To give a simple example, have a look at the following three Unicode characters:

In [23]:
print("\U00000027") # Apostrophe (equivalent to ASCII character)
print("\U00002019") # Right Single Quotation Mark
print("\U000002BC") # Modifier Letter Apostrophe
'
’
ʼ

Another issue with Unicode is that the same-looking character can be represented by different code points. For instance, the German umlaut "ä" can be a single code point (U+00E4), or a combination of two: the base letter "a" and a separate diacritic for the umlaut. While many characters use a single code point, others — especially those with diacritics — require multiple. Diacritics are small marks that modify pronunciation or meaning and are common in many languages, such as the acute ("é"), grave ("è"), or umlaut ("ä"). In Unicode, these are often encoded as combining characters that attach to a base letter, allowing for flexible and accurate representation of diverse scripts.

In [24]:
print("\U000000E4")
print("\U00000061\U00000308")
ä
ä

This means, for example, that two texts that look like duplicates to a human might not look like duplicates to a machine or algorithm. Ensuring a consistent use of Unicode characters across a diverse set of documents from a diverse set of data sources is very challenging, and a more detailed discussion is beyond the scope of this notebook. However, to give an example, let's look at the simple use case of handling German umlauts (lowercase only).

In [25]:
sentence1 = "Der Bär hört die Hühner."
sentence2 = "Der Bär hört die Hühner."

Although both sentences might look the same — they are rendered the same way — on the byte/character level they are different. We can confirm this by checking whether both strings are the same using the basic equality comparison in Python.

In [26]:
if sentence1 == sentence2:
    print("Both sentences are identical.")
else:
    print("Both sentences are different.")
Both sentences are different.

As expected, both sentences are considered different. In fact, from the rendered output alone you cannot tell which umlaut in which sentence is encoded as a single code point and which as a base letter with a combining diacritic.

In practice, there is typically no single meaningful way to address such issues. To show one simple approach for our context of handling German umlauts, we can define a mapping from the code points using diacritics to code points not using them; see the code cell below. Note that we again limit ourselves only to the lowercase versions of the German umlauts to keep the example simple.

In [27]:
unicode_map = {
    "\U00000061\U00000308": "\U000000E4",
    "\U0000006F\U00000308": "\U000000F6",
    "\U00000075\U00000308": "\U000000FC"
}

We can normalize any string by replacing umlauts that are represented by diacritic code points. For example, we can do this using a regular expression together with our mapping. The method multiple_replace() implements this regular expression. Notice that this method takes any mapping as an argument, making it easy to extend this idea beyond replacing German umlauts by curating a more comprehensive collection of mappings between code points.

In [28]:
def multiple_replace(mapping, text): 
    regex = re.compile("|".join(map(re.escape, mapping.keys())))
    return regex.sub(lambda mo: mapping[mo.group(0)], text)

Let's apply the method multiple_replace() to both example sentences.

In [29]:
sentence1_mapped = multiple_replace(unicode_map, sentence1)
sentence2_mapped = multiple_replace(unicode_map, sentence2)

print(sentence1_mapped)
print(sentence2_mapped)
Der Bär hört die Hühner.
Der Bär hört die Hühner.

Of course, both sentences still look the same — which should not be a surprise. However, let's now check again for equality.

In [30]:
if sentence1_mapped == sentence2_mapped:
    print("Both sentences are identical.")
else:
    print("Both sentences are different.")
Both sentences are identical.

Now, both sentences are indeed considered the same since they match character by character.

Overall, working with Unicode can be challenging in practice because documents from different sources often use varying encodings, character sets, and representations for the same symbols. When aggregating data from diverse origins — such as websites, PDFs, or text files—these subtle variations can lead to inconsistencies in processing, indexing, or matching content. Moreover, non-ASCII characters (e.g., emojis, accented letters, or non-Latin scripts) may be corrupted or misinterpreted if encoding is not properly detected or handled. Unicode allows visually similar or even identical characters to have multiple valid encodings — see our example for German umlauts. This inconsistency affects tasks like deduplication, search, or training machine learning models where text integrity is critical. Ensuring Unicode consistency requires explicit handling, normalization, and validation throughout the data pipeline — especially in multilingual or web-scraped datasets.
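
As a more general alternative to hand-curated mappings like the one above, Python's standard unicodedata module implements the official Unicode normalization forms. The sketch below applies NFC normalization, which composes a base letter and a combining diacritic into a single code point where possible (NFD does the opposite). Note that unicodedata is part of the standard library but is not used elsewhere in this notebook.

In [ ]:
# A minimal sketch using Unicode normalization from the standard library.
import unicodedata

# NFC composes "a" + combining diaeresis (U+0308) into the single code point U+00E4
norm1 = unicodedata.normalize("NFC", sentence1)
norm2 = unicodedata.normalize("NFC", sentence2)

if norm1 == norm2:
    print("Both normalized sentences are identical.")
else:
    print("Both normalized sentences are different.")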


Data Quality¶

High-quality data is the foundation for training reliable and effective large language models (LLMs). When training data is clean, diverse, and representative, models are more likely to generate accurate, coherent, and unbiased outputs. However, most LLMs today are trained on massive datasets collected from the web — a source that is inherently noisy, redundant, and inconsistent. Web data includes everything from high-quality articles and research papers to low-quality user comments, duplicated content, spam, and misinformation. Without careful filtering and preprocessing, such data can introduce harmful biases, reduce factual accuracy, and impair the model’s ability to generalize.

Ensuring data quality in this context involves removing duplicates, filtering toxic or biased language, correcting formatting issues, and balancing across domains and perspectives. Semantic deduplication, for example, helps prevent the model from overfitting to repeated information, while content filtering helps minimize the spread of harmful or misleading text. By prioritizing data quality during dataset construction, researchers can significantly improve model performance, fairness, and trustworthiness. In essence, high-quality data not only shapes the capabilities of LLMs — it determines their ethical and practical reliability in real-world use.

Basic Quality-Based Filtering¶

In general, deciding whether a text or document is of low or high quality with respect to its inclusion in a dataset to train LLMs is not always obvious. However, there are often some general criteria — and thus corresponding filtering strategies — that mark a text as unsuitable to be used for training. Here are some basic strategies:

Heuristic filtering. Heuristic or rule-based filtering involves applying a set of predefined rules or heuristics to filter out low-quality content. These rules are often based on common characteristics of undesirable text (a small code sketch follows the list below); for example:

  • Length filters: Removing extremely short or unusually long documents, as they may be incomplete snippets or contain excessive boilerplate.
  • Repetition filters: Identifying and removing documents with excessive n-gram repetition, which can indicate boilerplate, automatically generated content, or low-information text. This includes things like long lists of keywords or repetitive disclaimers.
  • Punctuation and character filters: Removing documents with unusual character distributions, excessive special characters, or a lack of proper punctuation, which can indicate corrupted or poorly formatted text.
  • Source-based filters: Excluding content from websites such as low-quality forums, spam or phishing sites, or simply websites known to feature a lot of misinformation, toxic or biased content, etc.
  • Harmful content filters: Identifying and removing toxic, explicit, or otherwise inappropriate content to improve the safety and fairness of the LLM. This often involves keyword lists, regular expressions, or more advanced content moderation techniques.
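
The code cell below gives a minimal sketch of two of these heuristics, a length filter and a simple repetition filter. The thresholds are hypothetical and would need to be tuned on real data.

In [ ]:
# A minimal sketch of two heuristic filters; all thresholds are hypothetical.
from collections import Counter

def passes_length_filter(text, min_words=50, max_words=100_000):
    # Reject extremely short or extremely long documents
    n_words = len(text.split())
    return min_words <= n_words <= max_words

def passes_repetition_filter(text, ngram_size=3, max_ratio=0.2):
    # Reject documents where a single word n-gram dominates the text
    words = text.split()
    ngrams = [tuple(words[i:i + ngram_size]) for i in range(len(words) - ngram_size + 1)]
    if not ngrams:
        return False
    most_common_count = Counter(ngrams).most_common(1)[0][1]
    return most_common_count / len(ngrams) <= max_ratio

spammy_doc = "buy now " * 100
print(passes_length_filter(spammy_doc), passes_repetition_filter(spammy_doc))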

Statistical and embeddings-based filtering. These methods leverage computational techniques to assess data quality more broadly.

  • Perplexity thresholding: Perplexity is a common evaluation metric in language modeling that measures how well a probabilistic model predicts a sequence of words. It quantifies the model's uncertainty: lower perplexity indicates the model is more confident and accurate in its predictions, while higher perplexity suggests the model struggles to predict the next word. Lower-quality text often has higher perplexity (i.e., the model is more "surprised" by it, indicating the text is less coherent or natural). Filtering based on a perplexity threshold using a small, pre-trained language model can help remove highly disorganized or nonsensical text; see the sketch after this list.
  • Outlier detection (e.g., using embeddings): Documents can be embedded into a vector space (e.g., using Sentence-BERT or other embedding models). Outliers in this space, especially those far from common clusters, may represent low-quality, irrelevant, or anomalous content.
  • Readability scores: Using metrics like Flesch-Kincaid or SMOG readability to filter for a desired reading level. While not universally applicable, it can be useful for specific LLM applications.
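
The code cell below sketches perplexity thresholding with a small pretrained model. It assumes that the transformers and torch packages are installed; neither is part of the imports used elsewhere in this notebook, and the threshold value is purely illustrative.

In [ ]:
# A minimal sketch of perplexity-based filtering with GPT-2 (assumes the
# transformers and torch packages are installed; the threshold is illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ppl_tokenizer = AutoTokenizer.from_pretrained("gpt2")
ppl_model = AutoModelForCausalLM.from_pretrained("gpt2")
ppl_model.eval()

def perplexity(text):
    # Average next-token cross-entropy under the model, exponentiated
    inputs = ppl_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = ppl_model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

docs = ["This is a well-formed English sentence about dogs.", "asdf qwer zxcv uiop lkjh"]
kept = [d for d in docs if perplexity(d) < 1000]  # hypothetical threshold
print(kept)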

Model-based filtering. This advanced approach uses other models (often smaller LLMs or classifiers) to assess the quality of the training data.

  • Classifier-based filtering: Training a lightweight classifier (e.g., using fastText or a smaller BERT-style model) to distinguish between "high-quality" and "low-quality" text. This classifier is trained on a manually labeled dataset of good and bad examples. The classifier then scores unlabeled data, and only samples above a certain quality threshold are kept.
  • LLM-as-Judge filtering: Leveraging a more capable LLM to act as a "critic" or "judge" to evaluate the quality of text snippets. This can involve prompting the LLM with specific criteria (e.g., coherence, factual accuracy, relevance) and asking it to assign a score or a qualitative assessment. This is particularly effective for filtering instruction-tuning datasets.
  • Verification strategy: A more sophisticated approach involves using a nearly-trained LLM as a foundation and incorporating candidate data during the final training steps. The performance improvement (or degradation) on benchmark tasks then serves as a metric for data quality, allowing for efficient identification of high-quality data.

These strategies are often used in combination, forming a multi-stage filtering pipeline to progressively refine the dataset and ensure the highest possible quality for LLM training. There are also many other strategies aiming to implement some form of quality control. For example, GPT-2 has been trained on online content that was linked from Reddit and which received at least 3 karma points — in other words, some form of crowd-sourced quality control. Of course, any filtering potentially reduces the size of the training datasets. While quantity was once the sole focus, researchers now recognize that the quality of the data significantly impacts an LLM's performance, preventing issues like bias, misinformation, and poor generalization.

Data Deduplication¶

Duplicate training data poses several problems that can negatively affect both performance and efficiency of an LLM. When the same content appears multiple times in a dataset, the model may overfit to that repeated information, assigning it undue importance compared to more diverse or less-represented content. This can lead to skewed learning, where the model becomes biased toward certain styles, topics, or phrasing, reducing its ability to generalize and respond flexibly across a wide range of inputs.

Moreover, duplicate data wastes computational resources by artificially inflating the size of the dataset without adding new information. It can also increase the risk of memorization, where the model learns to reproduce specific passages verbatim, potentially exposing copyrighted or sensitive information. To mitigate these issues, data deduplication techniques are essential during preprocessing, helping to maintain a clean, diverse, and balanced training corpus that supports more accurate and ethical model behavior.

Duplicate data is very common when using web-sourced content to train large language models (LLMs) because the internet is filled with repeated, mirrored, and recycled information. Many websites republish the same articles, news stories, product descriptions, or documentation across different domains, while forums and social media often quote or copy content verbatim in replies, reposts, and threads. Search engine optimization (SEO) practices also lead to content duplication, as publishers frequently duplicate popular text to increase visibility. Additionally, web archives, translation sites, and content aggregators contribute further to redundancy by hosting multiple versions of the same material. Without careful deduplication during preprocessing, these repeated patterns can become overrepresented in the training data, skewing the model’s learning and efficiency.

Example: Online News. A news agency (also called a wire service) is an organization that gathers, writes, and distributes news reports to other media outlets, while an online news site is a platform that publishes news directly to the public, often relying on content from multiple sources. News agencies like Reuters, AP, or AFP focus on producing fast, accurate, and broadly relevant stories that can be syndicated. Many news sites purchase the same articles from these agencies because it is more cost-effective and efficient than producing all original reporting — especially for breaking news or international coverage. By licensing agency content, news sites can quickly provide credible information to their audiences without having to maintain large, global reporting teams. To give a concrete example, the figure below shows a collection of news headlines citing and summarizing the same study about dogs.

Of course, this figure only shows the headlines and not the actual content. But like the headline, the content is often almost identical or at least very similar. Also keep in mind that there are many more sites that reported on that study.

Exact Deduplication¶

Exact duplicates in the context of strings and documents refer to texts that are identical character-for-character, with no differences in content, formatting, punctuation, or spacing. For example, two news articles with the exact same headline, body, and metadata would be considered exact duplicates.

Technically, comparing two strings for exact duplication is simple — computers can quickly determine if strings are identical using basic equality checks or hashing algorithms. For a trivial example, let's consider the following two headlines from articles collected from different online news sites.

In [31]:
headline1 = "dogs can associate words with objects, study finds"
headline2 = "dogs can associate words with objects, study finds"
headline3 = "dogs can associate words with objects; studies find"
headline4 = "dogs can connect words with things, experiments show"

Like all modern programming languages, Python directly allows checking whether two strings are equal; see the code cell below.

In [32]:
if headline1 == headline2:
    print("Both headlines are exact duplicates")
else:
    print("Both headlines are NOT exact duplicates")
Both headlines are exact duplicates

In practice, the detection of exact duplicate text documents is typically done using hash functions because they provide a fast, space-efficient, and reliable way to compare large amounts of data. A hash function takes an input (in this case, the entire text of a document) and produces a fixed-size string of characters, known as a hash value or digest. If two documents are exactly the same, their hash values will also be identical. This makes it far more efficient to compare hash values — typically just a few bytes — than to perform a full byte-by-byte comparison of entire documents, especially when dealing with large datasets or files. The code cell below uses Python's built-in hash() function to adapt the previous example:

In [33]:
if hash(headline1) == hash(headline2):
    print("Both headlines are exact duplicates")
else:
    print("Both headlines are NOT exact duplicates")
Both headlines are exact duplicates

Note: The previous code snippet is only for illustration purposes. Python's built-in hash() method has significant limitations when used to detect exact duplicates of text documents. One major issue is that hash() is not consistent across different runs of a Python program. By default, Python adds randomization (known as hash randomization) to its hash values for security reasons, meaning the same string will produce different hash values each time the interpreter is restarted. This makes it unsuitable for persistent duplicate detection across sessions or systems. Additionally, hash() is not designed to avoid collisions in large datasets. It generates a relatively small integer value (typically platform-dependent, such as 64 bits), which increases the likelihood of two different documents producing the same hash (a collision). In contrast, cryptographic hash functions like SHA-256 offer a much larger output space and are specifically designed to minimize collisions, making them far more reliable for identifying exact duplicates. Therefore, for accurate and consistent duplicate detection, Python's hash() should be avoided in favor of functions from the hashlib module.
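
To sketch this more robust approach, the code cell below uses SHA-256 digests from Python's standard hashlib module to compare the headlines; hashlib is not used elsewhere in this notebook.

In [ ]:
# A minimal sketch of exact duplicate detection using SHA-256 digests.
import hashlib

def sha256_digest(text):
    # Encode the text to bytes and return its hexadecimal SHA-256 digest
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

print(sha256_digest(headline1) == sha256_digest(headline2))  # exact duplicates
print(sha256_digest(headline1) == sha256_digest(headline3))  # not exact duplicates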

But most importantly, appreciate that the detection of exact duplicates in the case of strings or documents assumes that both strings indeed match character by character. For example, if we simply replaced the comma in one headline with a semicolon, both code snippets would report that the headlines are not duplicates — feel free to try. In real-world web data, such small but frequent variations are very common: differences in formatting, encoding, additional/missing whitespace or newline characters, or minor edits like punctuation changes or author attribution. Two articles may have 99.9% of the same content but differ in a single character or footer text, making them appear non-identical to a naive string comparison. Additionally, content may be duplicated across different websites with slight rephrasing or added boilerplate.

Some variations such as formatting, punctuation, case (uppercase vs. lowercase), extra whitespace, or the use of synonyms and abbreviations can be addressed by careful text cleaning and normalization. By cleaning the text — such as removing special characters, normalizing whitespace, or converting all text to lowercase — and applying normalization techniques like stemming or lemmatization, these non-essential differences are eliminated, making it easier to identify documents with equivalent content. In short, text cleaning and normalization help detect duplicate text documents by reducing superficial differences between texts that are otherwise identical. However, even minor rephrasing or some bits of added or missing content can typically not be captured with this.

Anything beyond exact-duplicate detection is considered near-duplicate detection, and typically relies on some form of similarity metrics and thresholds to decide when two documents are considered duplicates. While not a strict terminology, in the context of documents, we can typically distinguish between fuzzy deduplication and semantic deduplication — sometimes, fuzzy deduplication covers both aspects.

Fuzzy Deduplication¶

Fuzzy deduplication focuses on identifying near-duplicate text documents that are highly similar in wording or structure but not exact copies. These differences can arise from minor modifications, such as corrected typos, changed punctuation, synonym substitutions, or sentence reordering. For instance, two versions of a product description or web page may convey the same information but differ slightly due to rephrasing or formatting updates. Unlike exact deduplication—which detects byte-for-byte matches—fuzzy deduplication is designed to catch these more subtle and frequent variations in real-world data.

To detect such cases, fuzzy deduplication methods rely primarily on surface-level similarity metrics. Common approaches include n-gram overlap, which compares sequences of adjacent words or characters; edit distance (e.g., Levenshtein distance), which measures the number of operations needed to convert one string into another; and token-based similarity methods like Jaccard or cosine similarity applied to sets of words or tokens. These techniques are relatively efficient and can handle slight differences in phrasing or spelling.
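
As a small illustration of an edit-distance-style comparison, the code cell below uses difflib.SequenceMatcher from Python's standard library. Its ratio() method is not the Levenshtein distance itself but a related character-level similarity between 0 and 1; difflib is not used elsewhere in this notebook.

In [ ]:
# A minimal sketch of a character-level similarity ratio (0 = no overlap, 1 = identical).
from difflib import SequenceMatcher

print(f"Headline 1 vs. Headline 2: {SequenceMatcher(None, headline1, headline2).ratio():.2f}")
print(f"Headline 1 vs. Headline 3: {SequenceMatcher(None, headline1, headline3).ratio():.2f}")
print(f"Headline 1 vs. Headline 4: {SequenceMatcher(None, headline1, headline4).ratio():.2f}")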

For a very simple example, let's consider an approach using the character n-gram overlap between two strings. A character n-gram is a contiguous sequence of $n$ characters extracted from a text string. N-grams break text into overlapping substrings of fixed length, allowing for fine-grained analysis of spelling, structure, and style. The method get_ngrams() in the code cell below splits a given input string into its corresponding n-grams; by default, 3-grams (i.e., $n=3$). Notice that the method returns the set instead of the list of all n-grams, therefore removing duplicate n-grams. This is to keep the calculation of the overlap very simple.

In [34]:
def get_ngrams(s, ngram_size=3):
    return set([ s[i:i+ngram_size] for i in range(len(s)-ngram_size+1)])

We can now generate the sets of n-grams for all four headlines. To show an example output, the code cell below also prints the n-grams for the first headline. Keep in mind that, since we have a set, the order of n-grams is arbitrary and might not reflect the order of n-grams in the original string.

In [35]:
ngrams1 = get_ngrams(headline1)
ngrams2 = get_ngrams(headline2)
ngrams3 = get_ngrams(headline3)
ngrams4 = get_ngrams(headline4)

print(ngrams1)
{'s, ', ' st', 'ate', ' ob', 'stu', 'ect', 'h o', 'bje', 'ts,', 'rds', 'cia', 'dog', ' ca', 'ind', ', s', 'ord', 'udy', 'n a', 'soc', 'oci', 'obj', 'sso', 'ds ', ' fi', 'wit', 'th ', ' as', 's w', 'te ', 'ass', 's c', 'ogs', 'tud', 'jec', ' wi', 'cts', 'y f', 'an ', 'gs ', 'ith', 'fin', ' wo', 'can', 'dy ', 'e w', 'wor', 'nds', 'iat'}

To measure the overlap or similarity between the sets of n-grams, we can use the Jaccard Similarity. It is defined as the size of their intersection divided by the size of their union, and it quantifies how much two sets have in common relative to their total number of unique elements. More formally, given two sets $A$ and $B$, the Jaccard Similarity between both sets is defined as:

$$\large Jaccard(A, B) = \frac{|A\cap B|}{|A\cup B|} $$

In the context of text, the sets are often formed from tokens, words, or n-grams extracted from two strings or documents. For example, if two documents share many common words but also contain some differences, their Jaccard Similarity will be a value between 0 and 1, where 1 indicates identical sets and 0 means no shared elements. This metric is widely used in text comparison tasks like duplicate detection, but also clustering, and information retrieval due to its simplicity and effectiveness.

Using Python's built-in set methods intersection() and union() to compute the intersection and union of two sets, we can quickly implement the Jaccard Similarity as follows:

In [36]:
def jaccard_similarity(ngrams1, ngrams2):
    return len(ngrams1.intersection(ngrams2)) / len(ngrams1.union(ngrams2))

Let's calculate the Jaccard Similarities between our headlines:

In [37]:
print(f"Jaccard similarity between Headline 1 and Headline 2: {jaccard_similarity(ngrams1, ngrams2):.2f}")
print(f"Jaccard similarity between Headline 1 and Headline 3: {jaccard_similarity(ngrams1, ngrams3):.2f}")
print(f"Jaccard similarity between Headline 1 and Headline 4: {jaccard_similarity(ngrams1, ngrams4):.2f}")
Jaccard similarity between Headline 1 and Headline 2: 1.00
Jaccard similarity between Headline 1 and Headline 3: 0.73
Jaccard similarity between Headline 1 and Headline 4: 0.24

Since Headline 1 and Headline 2 are identical, their Jaccard Similarity is of course $1.0$. In contrast, Headline 3 is slightly different, resulting in a lower Jaccard Similarity. Specifying a meaningful threshold to decide at what point — for which minimum Jaccard Similarity — two documents are considered near-duplicates is typically not straightforward and may require careful calibration. Headline 4, unsurprisingly, shows a very low Jaccard Similarity. Since it uses different words — although (almost) synonyms — the resulting sets of n-grams are very different, resulting in only a small n-gram overlap.

Of course, this (over-)simplified implementation for calculating n-gram overlap has obvious limitations. For example, if we assume two documents $A$ and $B$, where $A$ is a true subset of $B$ (i.e., $A\subset B$), then $Jaccard(A, B) = |A|/|B|$, which shrinks the larger $B$ is relative to $A$. This means that if $B$ is a document containing two paragraphs, and $A$ a document containing only the first paragraph, the similarity is only roughly $0.5$ and we may well fail to detect the duplication. While this is technically true on a document level, our training data would still contain duplicate content. Thus, practical implementations of fuzzy deduplication typically rely on more sophisticated strategies; one simple variation is sketched below.
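
One simple refinement for exactly this subset case, shown here only as a sketch, is the containment (or overlap) coefficient, which divides the size of the intersection by the size of the smaller set instead of the union. If $A \subset B$, the containment of $A$ in $B$ is $1$, even when the Jaccard Similarity is small.

In [ ]:
# A minimal sketch of the containment (overlap) coefficient as an alternative
# to the Jaccard Similarity for near-duplicates contained in larger documents.
def containment(ngrams_a, ngrams_b):
    return len(ngrams_a.intersection(ngrams_b)) / min(len(ngrams_a), len(ngrams_b))

print(f"Containment between Headline 1 and Headline 3: {containment(ngrams1, ngrams3):.2f}")
print(f"Containment between Headline 1 and Headline 4: {containment(ngrams1, ngrams4):.2f}")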

However, the main limitation of fuzzy deduplication is its focus on syntax (i.e., surface-level features). Fuzzy deduplication does not capture deeper semantic meaning, so it fails to recognize documents as duplicates if the same content is expressed in significantly different language or structure. For example, the headline "dogs can connect words with things, experiments show" conveys essentially the same message as the initial wording, but since very different words have been used, fuzzy deduplication methods such as measuring n-gram overlap would fail to detect such duplicates.

Semantic Deduplication¶

Semantic deduplication aims to detect duplicates beyond just syntax and other surface-level features, by capturing or encoding the actual meaning (i.e., semantics) of a string or document. Techniques for semantic deduplication often involve embedding text documents into a high-dimensional vector space using methods like transformer-based models (e.g., Sentence-BERT) or other deep learning approaches. Documents that are semantically similar will have their embedding vectors clustered closely together in this vector space. Algorithms like clustering (e.g., K-means, DBSCAN) or approximate nearest neighbor search (e.g., FAISS) can then be employed to identify groups of semantically similar documents, from which only one representative document is typically retained for the final training dataset.

For a simple example, let's consider the semantic deduplication of individual sentences. The SentenceTransformer class is the core component of the Sentence-Transformers library, designed to generate high-quality sentence and text embeddings using transformer-based models like BERT, RoBERTa, or DistilBERT. Unlike traditional transformer models that output token-level embeddings, SentenceTransformer is optimized to produce fixed-size vector representations (embeddings) for entire sentences, paragraphs, or short texts. These embeddings capture semantic meaning, allowing for effective comparison and downstream tasks like semantic search, clustering, or deduplication.

As a concrete model, we will use all-MiniLM-L6-v2, a lightweight, high-performance sentence embedding model from the Sentence-Transformers library, designed for efficient semantic similarity tasks. Based on the MiniLM architecture, it has only 6 transformer layers, making it significantly faster and more memory-efficient than larger models like BERT, while still delivering strong performance. It was trained on a large-scale dataset using a contrastive learning objective to produce embeddings that work well for tasks such as semantic search, clustering, and deduplication. The model outputs 384-dimensional sentence embeddings and is especially popular for real-time or large-scale applications due to its excellent balance between speed and accuracy.

We load the model by simply specifying its name when creating an instance of the SentenceTransformer class; see the code cell below.

In [38]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

Using the provided encode() method, we can transform all four headlines into their corresponding 384-dimensional embedding vectors:

In [39]:
v1 = embedding_model.encode(headline1)
v2 = embedding_model.encode(headline2)
v3 = embedding_model.encode(headline3)
v4 = embedding_model.encode(headline4)

print(f"Shape of embedding vectors: {v1.shape}")
Shape of embedding vectors: (384,)

The SentenceTransformer class also provides a method similarity() computing the cosine similarity between pairs of sentence embeddings. This score, ranging from $-1$ to $1$, indicates how semantically similar the sentences are — where $1$ means identical meaning, $0$ means no similarity, and $-1$ implies opposite meanings. The code cell below computes the similarity between different headlines (i.e., their embedding vectors).

In [40]:
print(f"Cosine similarity between Headline 1 and Headline 2: {embedding_model.similarity(v1, v3).item():.2f}")
print(f"Cosine similarity between Headline 1 and Headline 4: {embedding_model.similarity(v1, v4).item():.2f}")
Cosine similarity between Headline 1 and Headline 2: 0.98
Cosine similarity between Headline 1 and Headline 4: 0.80

We already know that Headline 1 and Headline 3 are similar with respect to their syntax since they have a high n-gram overlap. However — compared to fuzzy deduplication using n-grams — Headline 1 and Headline 4 now also show a fairly high similarity. This is because their corresponding embedding vectors, which capture the semantics of both headlines, are similar. While finding a meaningful threshold to distinguish between duplicates and non-duplicates is still not straightforward, such embedding-based methods for semantic deduplication allow detecting duplicates beyond surface-level features.
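
To sketch how such similarities could be turned into an actual deduplication step, the code cell below greedily keeps only headlines whose embedding is not too similar to any headline that has already been kept. The threshold of 0.9 is purely illustrative and would need to be calibrated on real data.

In [ ]:
# A minimal greedy semantic deduplication sketch; the threshold is illustrative.
headlines = [headline1, headline2, headline3, headline4]
embeddings = embedding_model.encode(headlines)

threshold = 0.9
kept_indices = []

for i in range(len(headlines)):
    # Treat the headline as a duplicate if it is too similar to any kept headline
    is_duplicate = any(
        embedding_model.similarity(embeddings[i], embeddings[j]).item() >= threshold
        for j in kept_indices
    )
    if not is_duplicate:
        kept_indices.append(i)

for i in kept_indices:
    print(headlines[i])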

Data Decontamination¶

Under the hood, training an LLM is considered a supervised machine learning task because it involves learning from input-output pairs, where the model is given a context (input text) and trained to predict the next token or word (output) based on that context. Although the data is often unlabeled in the traditional sense, the labels are inherently derived from the structure of the text itself — for example, given a sentence fragment, the model learns to predict the next word as the target output. This is also often called self-supervised learning. This setup aligns with the principles of supervised learning, where the model improves its predictions by minimizing the error between its outputs and the correct, known outputs provided during training.

In a supervised learning setup, the dataset is typically divided into three parts: training data, validation data, and test data, each serving a distinct purpose:

  • Training data is used to teach the model; it contains input-output pairs that the model learns from by adjusting its internal parameters.
  • Validation data is used during training to tune hyperparameters and monitor the model's performance on unseen data, helping to prevent overfitting. Methods such as cross-validation can be used to rotate different subsets of the data as the validation data.
  • Test data is used only after training is complete to evaluate the model's final performance and generalization ability. It provides an unbiased estimate of how the model will perform on completely new data. It is therefore important that the model has never seen the test data during training.

This separation ensures that the model not only learns patterns but can also generalize them effectively. The figure below illustrates the general training and evaluation setup of supervised models.


Note that, in contrast to "traditional" supervised models, LLMs are not typically trained using methods such as cross-validation, primarily due to their enormous computational cost. Cross-validation involves training and evaluating multiple models across different data splits (e.g., k-folds), which is feasible for smaller models but becomes impractical when training LLMs that require weeks of compute time and massive resources for just a single run. Instead, LLMs are usually trained once on a large training dataset, with a separate validation set used intermittently during training to monitor performance and adjust hyperparameters if needed. After training, an unseen test set is still used for final evaluation. This approach strikes a practical balance between model quality and computational efficiency in large-scale machine learning.

In short, the proper evaluation of a supervised model (incl. an LLM) requires that the test data is truly unseen, i.e., the test data did in no way affect the training of the model. Ensuring this clear separation between training and test data becomes very challenging in the context of training LLMs. This is particularly true if you want to evaluate the performance of a pretrained LLM using your own (additional) data. Since you did not train the original model, you often do not know (in detail) which data has been used for the training. This makes it difficult to guarantee that your test data does not contain any content from the training data. An overlap of training and test data might yield skewed performance results.

But even when training an LLM from scratch, data contamination remains an issue due to other factors such as data duplication (see above). As such, many strategies to avoid data contamination are similar to data deduplication methods:

  • Exact and near-exact deduplication: Implement robust deduplication techniques at various granularities (document, paragraph, sentence, or even token level) to remove identical or highly similar content. This is particularly challenging with large web-scale datasets; a small n-gram-based sketch follows this list.
  • Filtering by metadata/source/time: If possible, filter data based on its source or metadata to exclude known benchmark datasets or data that is highly likely to contain benchmark content (e.g., Wikipedia, GitHub). For example, if a benchmark is known to consist of specific articles, filter out those articles from the training corpus. Filtering may also involve temporal cutoffs where documents published after a certain date are excluded from the training if evaluation sets contain newer content.
  • Careful data acquisition and curation: Overall, have a clear understanding of where your training data comes from and how it was collected. Actively exclude commonly used public benchmarks from your training data. When fine-tuning, prioritize high-quality, domain-specific data that is unlikely to have been part of the large-scale pre-training corpora. For smaller, high-stakes datasets (e.g., for specific tasks), manual review and annotation can help ensure data quality and avoid contamination.
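
The code cell below sketches the first of these strategies at the n-gram level: a training document is flagged if it shares any word-level 8-gram with a benchmark example. The n-gram size and the toy data are hypothetical; real pipelines operate on billions of tokens and use scalable hashing or index structures.

In [ ]:
# A minimal sketch of n-gram-based decontamination; n-gram size and data are toy values.
def word_ngrams(text, n=8):
    words = text.lower().split()
    return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

benchmark_examples = [
    "dogs can associate words with objects according to a new study of canine cognition",
]
benchmark_ngrams = set().union(*(word_ngrams(ex) for ex in benchmark_examples))

training_docs = [
    "Researchers report that dogs can associate words with objects according to a new study of canine cognition published this week",
    "A completely unrelated article about gardening tips for the summer season",
]

for doc in training_docs:
    contaminated = bool(word_ngrams(doc) & benchmark_ngrams)
    print(contaminated, "-", doc[:50])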

Apart from these strategies to avoid data contamination, there are also more advanced methods that focus on detecting contamination after the fact, such as:

  • Log probability analysis: Methods that assess the likelihood of data being present in a model's training set by analyzing its log probability. If the model assigns an unusually high probability to certain evaluation examples, it might indicate memorization due to contamination.

  • Prompting-based approaches: Querying the LLM directly about its knowledge of specific instances or datasets can reveal contamination. For example, providing a partial prompt from a test set and checking if the LLM completes it accurately.

  • Reconstruction attacks: Methods that try to reconstruct training data from the model's outputs, which can indicate memorization.

  • Perturbation quizzes: Testing if the LLM can distinguish between a slightly perturbed version of a test example and the original, where a strong preference for the original might suggest memorization.

Data decontamination is an important but also a challenging task, and any effort typically involves a combination of the strategies mentioned above. Particularly when fine-tuning and evaluating a pretrained model — where details about the original training data might not be fully known — a perfect split between the training and test data for the evaluation of the fine-tuned model can often not be fully guaranteed.

Model "Inbreeding"¶

Model "inbreeding" in the context of training Large Language Models (LLMs) refers to a concerning phenomenon where LLMs are increasingly trained on data that was generated by other LLMs, rather than primarily on human-created content. This can lead to a degradation in the quality, diversity, and accuracy of the models over successive generations, analogous to genetic inbreeding in biological populations.

Definition & Causes¶

Model inbreeding occurs when the training datasets for new generations of LLMs become progressively saturated with content that was previously generated by other LLMs (including prior versions of themselves) — rather than fresh, diverse, human-generated content. This creates a recursive loop where models are learning from "copies of copies" of information, rather than original human expression. The result is a gradual degradation of the model's capabilities, particularly in terms of diversity, creativity, and accuracy. This phenomenon is also closely related to what is known as "model collapse". Common causes for this phenomenon are:

  • Increasing prevalence of AI-generated Content (AIGC): As generative AI models become more accessible and widely used, the internet is rapidly filling with AI-generated text, images, and other forms of content. Since LLMs are often trained by scraping vast amounts of data from the internet, the likelihood of ingesting AIGC into new training datasets increases significantly over time.

  • Data scarcity and costs: Training LLMs requires enormous amounts of data. Sourcing, curating, and annotating high-quality human-generated data can be expensive and time-consuming. Synthetic data, on the other hand, can be generated much more cheaply and quickly, leading some developers to use it to augment or even replace real-world data in training sets.

  • Recursive training loops: Some training methodologies involve repeatedly training models on the outputs of their predecessors or other AI models. For example, if LLM A generates text, and then LLM B is trained on a dataset that includes LLM A's output, and so on, this creates a recursive "inbreeding" effect.

  • Difficulty in distinguishing AIGC from human-generated content: It is becoming increasingly challenging to reliably distinguish between human-generated content and AI-generated content. Current tools for detection are often insufficient, meaning AIGC can slip into training datasets undetected.

  • Focus on specific patterns in training data: When models are trained on homogeneous or low-variance datasets (even if initially human-generated), they become overly specialized to those specific patterns. If these patterns are then replicated and amplified by subsequent AI generations, it further reduces diversity.

(Potential) Consequences¶

Model inbreeding can — just like genetic inbreeding — have severe negative consequences on the performance of a trained LLM, such as:

  • Loss of originality and diversity: The model starts to regurgitate and amplify existing patterns instead of learning from rich, varied human knowledge.

  • Reinforcement of errors or biases: Mistakes or hallucinations from previous model generations can be amplified.

  • Reduction in robustness: The model becomes less capable of dealing with edge cases or novel prompts.

  • Diminishing returns: It becomes harder to improve the next model generation if it's mostly learning from slightly earlier versions of itself.

As a consequence, possible symptoms of model inbreeding you may observe include a homogenized or "bland" writing style that persists across generations of models, leading to reduced linguistic diversity. Models may begin to echo specific idioms, phrases, or recurring errors introduced by earlier versions, creating a feedback loop of repetition. Additionally, hallucinations — confident but incorrect outputs — can become more consistent, though not necessarily more accurate, further entrenching misinformation. Over time, the model's grasp of world knowledge may narrow, making it less capable of reflecting current events or novel information.

Mitigation Strategies¶

To mitigate model inbreeding when training large language models (LLMs), one of the most important strategies is to ensure high-quality, diverse, and human-authored training data. This involves careful curation of datasets to minimize the inclusion of content that has been generated by previous models. Filtering pipelines can be built to detect and exclude AI-generated text using classifiers or watermarking techniques. Additionally, researchers can prioritize data from trusted, original sources such as books, academic publications, and verified journalism to maintain a rich diversity of language, styles, and perspectives. Regularly refreshing training data with recent, human-written content also helps prevent a model from becoming stale or overly self-referential.

Another key strategy is active monitoring and evaluation of data quality and model behavior. This includes tracking how much of the training data overlaps with previous model outputs and using adversarial evaluations to identify stylistic drift or repeated hallucinations. Reinforcement Learning from Human Feedback (RLHF) and preference modeling should be designed to avoid favoring model-conforming responses over truthful or creative ones. Techniques like data augmentation from real-world interaction, human-in-the-loop fine-tuning, and contrastive learning can help anchor the model in more grounded, factual, and contextually diverse data. Ultimately, maintaining a clean data lineage and promoting epistemic diversity are essential to avoiding the performance degradation that comes with model inbreeding.


Ethical Considerations¶

As large language models (LLMs) become increasingly integrated into everyday tools and services—from search engines and virtual assistants to education, healthcare, and legal support — their influence on society continues to grow rapidly. With millions of users relying on these systems for information, guidance, and decision-making, it is crucial to prioritize ethical considerations during model training. This includes addressing issues such as bias, toxicity, privacy, misinformation, and fairness, as the outputs of these models can shape opinions, reinforce stereotypes, and impact real-world outcomes. Ensuring that LLMs are developed responsibly is not just a technical challenge — it is a societal obligation to protect users, promote equity, and build trust in AI systems that increasingly mediate how people interact with knowledge and each other.

Toxicity & Biases¶

Toxicity and biases are among the most significant ethical challenges in LLMs. These issues arise when a model reproduces or amplifies harmful language, stereotypes, or discriminatory viewpoints present in its training data. Since LLMs learn from massive datasets sourced from the internet — including forums, news, social media, books, and more — they inevitably encounter text that reflects both overt and subtle societal prejudices. Without intervention, models can internalize and replicate this content, leading to outputs that are offensive, exclusionary, or factually distorted.

The causes of toxicity and bias are deeply rooted in the data itself. Much of the text on the internet is unfiltered and contains racism, sexism, xenophobia, ableism, and other forms of harmful language. Additionally, data imbalances—where certain groups are overrepresented while others are underrepresented — can skew a model's perception of the world. Models may, for instance, associate certain professions more with one gender or use derogatory language when prompted with terms related to marginalized communities. These biases aren't just linguistic quirks; they reflect and reinforce systemic inequalities, potentially leading to discriminatory outputs.

The consequences can be far-reaching. Toxic or biased outputs can cause direct harm to users, particularly those from vulnerable or marginalized communities. In applications like customer service, education, hiring, or healthcare, biased model behavior can perpetuate inequality, erode trust, and result in legal and reputational risks for developers. Furthermore, toxic content can hinder model usability, making it unsafe or inappropriate for deployment in sensitive or public-facing environments. The reputational damage to organizations deploying such models — especially if they fail to anticipate or correct these behaviors — can be substantial.

Mitigating toxicity and bias involves both technical and procedural strategies. During dataset preparation, toxic content can be filtered using classifiers trained to detect harmful language, though these systems must be carefully designed to avoid over-censorship or unintended exclusions. Data balancing techniques, such as augmenting underrepresented perspectives or correcting skewed distributions, can reduce bias during training. On the model side, techniques like reinforcement learning from human feedback (RLHF) can help steer outputs toward more inclusive and respectful responses. Post-training audits, red-teaming, and bias evaluations are critical steps to assess and refine model behavior before deployment.

Ultimately, addressing toxicity and bias in LLMs requires a multi-disciplinary approach. Technical solutions must be paired with human oversight, ethical frameworks, and ongoing research into fairness, inclusivity, and cultural sensitivity. Transparency around data sources, evaluation benchmarks, and known limitations is also key to building public trust. As LLMs become more deeply embedded in society, mitigating these harms is not just a technical challenge — it’s a moral imperative.

Privacy Protection & Anonymization¶

Privacy concerns in training LLMs stem from the risk of exposing sensitive or personally identifiable information (PII) that may be embedded in training data. Since LLMs are trained on vast datasets scraped from the internet, including websites, forums, documents, and public repositories, there's a high likelihood that some of this content contains private data—such as names, addresses, phone numbers, financial records, passwords, or medical histories — often without the consent of the individuals involved. These privacy risks are amplified by the model's ability to memorize and regurgitate specific phrases or data points, especially if such content appears repeatedly during training.

The causes of privacy violations typically arise from insufficient filtering and oversight during the data collection and curation process. Large-scale scraping pipelines may inadvertently capture content from sources that are not intended for public reuse or that contain embedded sensitive information. Additionally, even data that is technically "public" can still be private in context (e.g., a comment on a small forum about mental health or a legal issue). When this data is used in training without proper anonymization or redaction, models may retain the ability to reproduce it, especially when prompted in specific ways or with rare combinations of queries.

The consequences of privacy breaches are serious. If a model leaks PII during interaction—whether intentionally or accidentally—it can result in harm to individuals, including identity theft, reputational damage, or violation of confidentiality (e.g., in healthcare or legal contexts). From a legal standpoint, this could violate data protection laws such as the General Data Protection Regulation (GDPR) or California Consumer Privacy Act (CCPA), exposing developers and organizations to regulatory penalties, lawsuits, or bans. More broadly, privacy violations erode user trust and raise ethical questions about consent and surveillance in AI development.

To mitigate privacy risks, developers must implement a variety of technical and procedural safeguards. At the data level, this includes robust data filtering, de-duplication, and redaction pipelines that remove PII before training. Tools like named-entity recognition (NER) systems and PII detectors can help identify and remove sensitive content. Differential privacy techniques, which mathematically guarantee that the presence or absence of any single data point doesn’t significantly affect the model’s behavior, can also be employed—though they may reduce model performance and are still an area of active research. Additionally, regular red-teaming exercises and audits can test for privacy leaks post-training by simulating adversarial or edge-case prompts.
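
As a purely illustrative sketch (real pipelines rely on dedicated PII detectors and NER models rather than hand-written patterns), the code cell below redacts two obvious kinds of identifiers, e-mail addresses and phone-number-like digit sequences, using regular expressions.

In [ ]:
# A minimal, illustrative PII redaction sketch; the patterns are simplistic
# and would miss many real-world cases.
import re

email_pattern = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
phone_pattern = re.compile(r"\+?\d[\d\s()-]{7,}\d")

def redact_pii(text):
    text = email_pattern.sub("[EMAIL]", text)
    text = phone_pattern.sub("[PHONE]", text)
    return text

print(redact_pii("Contact Jane Doe at jane.doe@example.com or +1 555 123 4567."))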

Beyond technical measures, data governance and policy frameworks are essential (see below). Developers should maintain transparency about data sources and curation practices, obtain proper licensing or consent where applicable, and provide mechanisms for data removal or appeals. Building privacy-conscious models is not only a legal requirement in many jurisdictions but a foundational part of developing AI systems that respect human rights and promote long-term trust in the technology.

Data Governance & Transparency¶

Data governance in the context of training LLMs refers to the structured management of data throughout its lifecycle — including its collection, curation, documentation, usage, and oversight — to ensure it is handled responsibly, ethically, and in compliance with legal and organizational standards. It encompasses practices such as source tracking, consent verification, data quality control, access management, and auditing. Transparency, on the other hand, involves openly communicating key details about the data used to train models, including where it comes from, how it was processed, what types of content were included or excluded, and what limitations or biases may exist as a result.

These concepts are critically important for several reasons. First, strong data governance helps prevent the inclusion of harmful, biased, or illegally sourced data, thereby reducing risks such as privacy violations, copyright infringement, and the propagation of misinformation or toxicity. Second, transparency builds trust and accountability among users, regulators, researchers, and the public. Without visibility into the data foundations of an LLM, it becomes difficult to evaluate its fairness, safety, or appropriateness for sensitive use cases. As AI becomes more integrated into decision-making systems, the demand for explainability, auditability, and ethical assurance will continue to grow — making governance and transparency essential pillars for responsible AI development.


Summary¶

Generating and preparing datasets for training large language models (LLMs) is a foundational step that directly influences model performance, safety, and trustworthiness. The quality, diversity, and representativeness of the data significantly impact a model's ability to reason, respond accurately, and generalize across domains. However, assembling such datasets is a complex and resource-intensive task that involves far more than simply collecting large volumes of text. Every stage — from sourcing to cleaning to filtering — requires careful attention to ensure the data supports responsible and effective model training.

One of the primary challenges lies in data collection. High-quality, human-authored content is scattered across a wide variety of sources, many of which are governed by copyright restrictions or limited access. Ensuring that data comes from legitimate, legal, and ethically sound sources is a non-trivial task. Additionally, the internet has become saturated with machine-generated content, making it increasingly difficult to distinguish between original human writing and synthetic outputs. If not properly identified and excluded, these AI-generated texts can lead to "model inbreeding" and degrade future model generations by reinforcing repetitive patterns and hallucinations.

Data quality, duplication, and decontamination are equally critical concerns. Duplicated data can overweight certain topics or writing styles and skew the model's understanding of language. Meanwhile, contaminated data — such as benchmarks, test sets, or evaluation materials — can inadvertently leak into the training set, leading to inflated performance metrics and reduced reliability. Effective deduplication and decontamination require sophisticated tools and large-scale infrastructure to identify overlaps across billions of tokens, along with human oversight to verify and validate data boundaries.

Ethical considerations introduce further complexity. Datasets must be filtered to reduce exposure to toxic, harmful, or biased content that could otherwise be reproduced or amplified by the model. This includes explicit content, hate speech, and subtle forms of social bias (e.g., gender, race, or cultural stereotypes). Additionally, privacy is a major concern: personally identifiable information (PII) such as names, addresses, or medical records may be unintentionally included, raising legal and moral implications. Rigorous redaction processes, privacy-preserving methods, and careful auditing are necessary to mitigate these risks.

Finally, issues of data governance and transparency are increasingly central to the development of trustworthy LLMs. Organizations must document where their data comes from, how it was processed, and what filtering criteria were used. Transparent reporting builds public trust, facilitates academic scrutiny, and supports regulatory compliance. However, full transparency often clashes with proprietary concerns or privacy protections, creating a delicate balance between openness and confidentiality. As LLMs continue to evolve and scale, the process of dataset curation must evolve with them, blending technical precision with ethical responsibility.

In [ ]: