Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.

Using Pretrained LLMs Locally — A Starter Guide¶

Pretrained large language models (LLMs) are no longer confined to cloud-based APIs or powerful research clusters. With advances in model compression, quantization, and optimized runtimes, it has become increasingly practical to run these models directly on personal computers or edge devices. This shift reflects a broader trend of democratizing AI, making it accessible not just to enterprises with vast resources, but also to individual developers, researchers, and hobbyists. As a result, the landscape of LLM deployment is expanding, offering new opportunities for experimentation, customization, and innovation outside the traditional cloud environment.

Running LLMs locally offers several clear advantages. One of the biggest benefits is data privacy: sensitive inputs remain on the user’s machine, avoiding potential exposure to third-party servers. Local inference also reduces latency, since responses do not rely on round trips to external APIs, which is especially important for real-time applications. Moreover, running models locally can be more cost-effective, as it avoids recurring API usage fees and allows users to leverage existing hardware. Finally, local deployment fosters greater control and customization, enabling fine-tuning, domain adaptation, or integration into specialized workflows without external constraints.

However, there are also significant challenges. Pretrained LLMs are resource-intensive, and even with optimizations, they often require powerful GPUs or large amounts of memory to run efficiently. This can limit accessibility for users with modest hardware. Furthermore, keeping models updated, managing dependencies, and troubleshooting performance bottlenecks can introduce additional complexity compared to simply using a cloud API. In some cases, cloud-based solutions may still be preferable for scaling, reliability, or accessing cutting-edge models without the burden of local setup.

Learning about local LLM deployment is important because it equips practitioners with the ability to make informed choices between cloud and on-device solutions depending on their goals. For organizations, understanding this approach can open paths toward greater data security and cost efficiency. For individuals, it provides a hands-on way to explore, customize, and innovate with AI beyond what is possible through closed APIs. As tools and frameworks continue to evolve, being familiar with local LLM deployment ensures that developers and researchers can fully leverage the growing ecosystem, balancing its strengths and limitations to meet their unique needs.

This notebook provides a guide on how to get started with running pretrained models on your local machine, whether for experimenting with LLMs or for building powerful LLM-driven applications without relying on cloud-based APIs. We first discuss in more detail the pros and cons of using APIs versus running LLMs locally, and then cover popular runtimes and frameworks to load, run, and use pretrained models. Let's get started!

Setting up the Notebook¶

Make Required Imports¶

This notebook requires the import of different Python packages (such as the Hugging Face transformers library and the ollama client library) as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.

In [1]:
# Some required base Libraries
import requests, json 

# PyTorch library (used as backend for transformers library)
import torch

# transformers library from Hugging Face to load, run, and use pretrained models
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# ollama wrapper library for Ollama
import ollama

# Auxiliary LangChain libraries to integrate Ollama and Hugging Face Transformers
from langchain_huggingface import HuggingFacePipeline
from langchain_ollama import OllamaLLM

from src.utils.compute.gpu import *

Checking & Setting Computing Device¶

PyTorch allows you to run neural networks on a supported GPU to significantly speed up computation. If you have a supported GPU, feel free to utilize it. However, for this notebook it is not strictly needed, since we only work with very small models. We provide an auxiliary method to automatically select the best device. It checks if a supported GPU is available and, if so, uses it as the preferred device.

In [2]:
# Select preferred device (GPU, if available; CPU otherwise); you can enforce the use of the CPU
device = select_device(force_cpu=False)

print("Available device: {}".format(device))
Available device: cuda:0

Preliminaries¶

Before checking out this notebook, please consider the following:

  • In this notebook, we only look at using locally running LLMs for basic inference — submitting simple prompts and reading the generated responses. More advanced techniques such as Retrieval-Augmented Generation (RAG), quantization, fine-tuning, and so on are each a topic of their own and beyond the scope here.

  • Our focus is on running LLMs locally to access the models programmatically within an application or script. While desktop applications for interactively conversing with local LLMs are also very popular, we only briefly mention those at the end.


Motivation: Cloud-Based APIs vs Local Inference¶

When working with large language models (LLMs), there are two primary approaches: using cloud-based APIs or running models locally. Cloud APIs provide convenient, scalable access to powerful models without the need for specialized hardware or setup, while local deployments give users direct control over the model and its environment. Both approaches are widely adopted in practice, but each comes with trade-offs that influence how they can be used effectively. The choice between cloud-based and local LLMs depends on factors such as performance requirements, data sensitivity, cost considerations, and the level of control needed for specific applications.

Limitations of Cloud-Based APIs¶

Using cloud-based APIs to access large language models (LLMs) offers several significant advantages, particularly in terms of ease of use. With a cloud API, you do not need to worry about downloading, configuring, or maintaining large models locally. The provider hosts the model, handles hardware requirements, and ensures it is optimized for performance. This allows you to focus purely on application logic and prompt design rather than infrastructure, making it ideal for rapid prototyping, experimentation, or deploying production applications without large upfront hardware costs.

Another major advantage is scalability and availability. Cloud APIs can handle high volumes of requests and support concurrent users, often with minimal latency and automated load balancing. You can access state-of-the-art models that may be too large or computationally expensive to run locally, including models with billions of parameters or specialized fine-tuned versions. This allows businesses and developers to leverage the latest advances in AI without worrying about GPU memory, parallelization, or model updates. They also relieve users from the burden of maintenance, versioning, and security updates, as the provider handles these aspects.

However, there are also several challenges and downsides that come with using cloud-based APIs to access LLMs.

Data Privacy¶

When working with cloud-based APIs, data privacy is a major concern because any input sent to the API is typically processed on servers owned by the provider. This means sensitive or confidential information — such as personal data, company secrets, or proprietary research — leaves your local environment and is transmitted over the internet. Even with encryption in transit, there is a risk that this data could be logged, cached, or analyzed by the service provider for model improvement or troubleshooting, depending on the provider's data retention policies.

For example, if you are a company that uses a cloud LLM to summarize confidential client reports, those reports must be sent to the cloud servers. If the provider logs prompts for fine-tuning purposes, there is a chance that sensitive client information could be stored outside your control. Similarly, if you are a healthcare professional using a cloud LLM to analyze patient records, you would need to ensure compliance with regulations like HIPAA; sending patient data to an external API without proper safeguards could lead to legal and ethical issues.

Another concern is accidental data leakage through the model's responses. Cloud LLMs are trained on massive datasets and sometimes may regenerate or expose patterns from previously seen data. If sensitive or proprietary information is sent to the model, there is a small risk that it could inadvertently appear in output generated for another user or application, depending on how the API handles model state or caching. In practice, you need to carefully review privacy policies, data retention policies, and access controls to ensure compliance.

Cost Control¶

Usage of cloud-based APIs to access LLMs is typically billed per request, per token, or per unit of compute time. The costs can escalate quickly, especially for applications that process large volumes of data, handle many concurrent users, or generate long outputs. This introduces ongoing operational costs that must be monitored and managed to avoid unexpected bills. For example, consider your business integrating a cloud LLM to summarize customer support tickets in real time. If each ticket is 1,000 tokens and there are 10,000 tickets per month, the cumulative token usage could result in a significant monthly cost. Without proper monitoring or usage limits, this could easily exceed the budget allocated for AI services.

Another cost concern comes from inefficient prompts or excessive token generation. Using overly long prompts or setting high max_tokens for generation increases the number of tokens processed per request. For instance, asking a model to generate a 5,000-token essay for every prompt when only a 500-token summary is needed can multiply costs unnecessarily. Additionally, some applications might inadvertently make repeated or redundant API calls, further inflating usage costs. Finally, managing scaling costs is also important. Cloud APIs are convenient for handling spikes in traffic, but sudden surges can dramatically increase monthly expenses. For example, a chatbot embedded in your public website could see unexpected high traffic during a marketing campaign, resulting in heavy API usage. Organizations often need to implement rate limits, caching strategies, or batch processing to control costs while still maintaining performance. Monitoring dashboards and budget alerts provided by API providers can help mitigate these risks, ensuring predictable and manageable expenses.
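To make the arithmetic above concrete, here is a minimal back-of-the-envelope sketch; the per-token price is a hypothetical placeholder and should be replaced with your provider's actual pricing:

# Rough monthly cost estimate for the support-ticket example above
price_per_1k_tokens = 0.002      # hypothetical USD per 1,000 tokens; check your provider's price sheet
tokens_per_ticket = 1_000
tickets_per_month = 10_000

monthly_tokens = tokens_per_ticket * tickets_per_month
monthly_cost = monthly_tokens / 1_000 * price_per_1k_tokens
print(f"~{monthly_tokens:,} tokens/month -> ~${monthly_cost:,.2f}/month")

Even such a crude estimate makes it easy to see how prompt length, output length, and request volume each scale the monthly bill.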

Flexibility & Customization¶

Cloud-based APIs offer convenience and scalability, but they come with limitations in customization and flexibility. Since the models are hosted and maintained by the provider, users often have limited control over the underlying architecture, training data, or fine-tuning capabilities. You can usually adjust various basic parameters, but deeper modifications (e.g., adding domain-specific knowledge, changing model behavior significantly, or integrating new task-specific components) are typically restricted. This can be a significant limitation for organizations or developers who need highly specialized AI behavior. For example, a legal tech company may want an LLM to provide responses strictly aligned with local regulations. Using a cloud API, it can prompt the model carefully, but it cannot retrain the model on proprietary legal datasets directly. Similarly, a biotech startup may want an LLM to answer technical questions about novel proteins. Without the ability to fine-tune the model on proprietary research papers, the API may produce generic or less accurate answers.

Another limitation is workflow integration and tool use. Cloud-based APIs generally provide a request-response interface, which may not easily support multi-step reasoning, external tool invocation, or custom memory management. For instance, building an autonomous agent that queries internal databases, executes code, and stores context for later decisions is harder with a cloud API alone. While some providers offer enhanced features (e.g., plugin systems or fine-tuning services), the flexibility is still constrained compared to frameworks like LangChain or Ollama that allow programmatic orchestration of multiple LLMs and external tools.

Finally, response variability and control can be a challenge. Cloud APIs may update models in the background or serve different versions depending on load and availability, which can affect reproducibility. If you require precise, predictable outputs for a specialized application, relying on a cloud model that you cannot fully control may introduce inconsistencies. Running a model locally or using a customizable framework ensures the same model version, tokenization, and configuration every time, giving developers more control over outputs and behavior.

Local Inference¶

The alternative to using cloud-based APIs is running LLMs locally. This offers significant benefits, particularly when data privacy is a priority. By keeping all inputs and outputs on a local machine or internal network, sensitive information never leaves your environment, reducing the risk of data leaks or exposure to third-party servers. Also, unlike cloud-based APIs, which charge per token, request, or compute time, local deployments incur primarily upfront hardware and electricity costs, making them more predictable over time. For applications with high-volume or frequent usage, this can result in substantial cost savings compared to recurring cloud fees. Lastly, running LLMs locally enables greater flexibility and customization. You can fine-tune models, apply quantization, experiment with different architectures, or integrate models into complex workflows without restrictions imposed by cloud providers. This makes it possible to tailor model behavior to specific domains, optimize performance for available hardware, or embed models into custom applications and agents. Local deployment empowers you to leverage the full potential of LLMs, combining control, adaptability, and efficiency in ways that cloud-only solutions often cannot match.

On the flip side, running LLMs locally comes with several downsides and challenges that need to be considered:

  • Hardware requirements: Many LLMs are extremely large and require significant computational resources. Running them efficiently often needs high-end GPUs, substantial RAM, or fast storage. On lower-end machines, models may run slowly, be limited in size, or be impossible to load entirely.

  • Setup and maintenance complexity: Installing, configuring, and managing local LLMs can be technically challenging. Users may need to handle dependencies, runtime environments, quantization for memory optimization, and model updates. Troubleshooting performance or compatibility issues can require advanced technical knowledge.

  • Scalability limitations: Local deployments are constrained by the hardware they run on. Serving multiple users, handling high request volumes, or running models in production can quickly overwhelm a single machine. Unlike cloud APIs, local setups often lack built-in mechanisms for load balancing or distributed inference.

  • Performance trade-offs: Without specialized hardware, local models may run slower and have higher latency than cloud-hosted models optimized for large-scale inference. Some advanced features available in cloud environments—like streaming outputs, dynamic batching, or very large state-of-the-art models—may not be feasible locally.

  • Ongoing maintenance and versioning: Keeping models up to date, managing multiple versions, or integrating new models can be labor-intensive. Unlike cloud providers, which handle updates and optimizations automatically, local deployments require manual intervention to maintain performance and security.

In short, there is no single best answer when it comes to choosing between cloud-based APIs and running LLMs locally. The choice highly depends on your exact use case, requirements, and constraints (e.g., limited budget). That being said, one of the most practical challenges is the set of technical considerations involved. While running LLMs locally is technically feasible, the requirements vary greatly depending on model size, hardware (CPU vs GPU), and the format in which the model is stored (e.g., full precision vs quantized). To give some numbers, the table below shows rough estimates of RAM and VRAM usage based on model size and precision.

| Model | CPU RAM (int8, quantized) | GPU VRAM (fp16) | Notes |
|-------|---------------------------|-----------------|-------|
| 3B (e.g., TinyLlama) | 4–6 GB | 2–3 GB | Works on most laptops |
| 7B (e.g., Mistral) | 12–16 GB | 8–10 GB | Requires decent GPU or CPU + swap |
| 13B (e.g., LLaMA-2) | 24–30 GB | 16–20 GB | Needs high-end consumer GPU or 64 GB RAM |
| 65B (e.g., LLaMA) | 64+ GB | 48+ GB | Requires professional workstation/server |

While smaller pretrained models may perform very well for a given task, they generally do not offer the same capabilities as the very large LLMs that nowadays reach up to a trillion parameters or more and run on huge GPU clusters. Overall, however, running pretrained models locally on your own hardware typically requires some additional considerations and steps compared to easy-to-use cloud-based APIs. In the following, we therefore provide a basic guide to get started with using LLMs for local inference.
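As a rough rule of thumb underlying the table above, the memory needed to hold the weights is simply the parameter count times the bytes per parameter. The small sketch below computes this weights-only lower bound; actual usage is higher due to activations, the KV cache, and runtime overhead, so the numbers will not exactly match the rough estimates in the table:

# Weights-only memory estimate: parameters x bytes per parameter (lower bound, no overhead)
def weights_memory_gib(num_params_billions, bytes_per_param):
    return num_params_billions * 1e9 * bytes_per_param / 1024**3

for params in (1.1, 7.0, 13.0):
    print(f"{params:>4}B parameters: "
          f"fp16 ~ {weights_memory_gib(params, 2):.1f} GiB, "
          f"int8 ~ {weights_memory_gib(params, 1):.1f} GiB, "
          f"int4 ~ {weights_memory_gib(params, 0.5):.1f} GiB")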


Popular Tools & Frameworks¶

The growing popularity and capabilities of LLMs have led to an increased demand for tools and frameworks that make it easier to run these models locally. Beyond simple inference, developers are increasingly looking to integrate LLMs into applications, enabling them to interact with other systems, manage context, and perform complex tasks. This has spurred the development of platforms like Ollama, Hugging Face Transformers, and LangChain, which provide the infrastructure and APIs to embed LLMs seamlessly into software workflows. These frameworks not only allow local execution of models but also simplify the programmatic access and orchestration of LLMs within applications. Let's go through some practical examples using these tools.

Ollama¶

Ollama is a powerful, open-source tool designed to make it easy for users to run large language models (LLMs) on their own local machines. It acts as a bridge between your computer and a wide variety of open-source LLMs, simplifying the complex process of downloading, configuring, and running these models. By using a straightforward command-line interface, Ollama allows users to download and manage different models, much like a package manager. This gives developers, researchers, and AI enthusiasts a user-friendly and accessible way to experiment with and deploy AI, all without needing to rely on third-party cloud services.

Ollama is built on top of llama.cpp, but they serve different purposes and offer distinct user experiences. Think of llama.cpp as the core engine for running large language models (LLMs), a highly efficient C++ library that handles the heavy lifting of model inference. It is powerful and highly customizable, giving advanced users granular control over settings like quantization, GPU offloading, and memory management. However, it is a command-line-based tool that requires some technical knowledge to set up and use, making it less accessible for beginners. Ollama is a user-friendly wrapper around the llama.cpp engine. It is designed to simplify the entire process of running LLMs locally, making it a "plug-and-play" solution. This abstraction makes it far more accessible for developers and everyday users who do not want to deal with the complexities of compiling code or manually managing command-line flags.

Installation¶

To download and install Ollama, just go to the official download page and follow the instructions, which differ slightly depending on your operating system (Windows, macOS, or Linux). To confirm that the installation was successful, open a command line terminal and run the following command to show the installed version:

ollama --version

If this command runs successfully, Ollama is ready to go.

Loading & Serving Models¶

Once Ollama is installed, we first need to load at least one pretrained model and expose it via a local API. To see which models are available, you can go to the official Ollama Library and browse and search for models that may serve your purpose best. Once you have identified a model you want to use, you can download it with the ollama pull command; for example, the command below downloads the tinyllama model — we go with a very small model to avoid any memory concerns since this notebook is just about providing a simple guide to get going.

ollama pull tinyllama

You can naturally download multiple models; note that just downloading a model does not actually run it. To see which models you have downloaded and are therefore available, you can list them all using the command below.

ollama list

For example, the tinyllama model we have downloaded should appear with the name tinyllama:latest in this list. This is the name you typically specify to refer to the model you want to use. If you want to use the model interactively using a one-off session from your terminal — ideal for testing, experimentation, or small-scale usage where you do not need programmatic access — run the following command:

ollama run tinyllama:latest

This loads the model into memory, and you can now directly write your prompts into this terminal and see the responses immediately. Once you exit the session, the model stops running. However, if you want to use the model programmatically within your application, you can start a local server that keeps the model running in the background and exposes it via an API endpoint. For this, you just need to run the following command in a command line terminal.

ollama serve

Now other programs, scripts, or services can send HTTP requests to the server and receive responses from the model. This is ideal for integration into applications, automated workflows, or multi-user scenarios, because the model does not need to be reloaded for each request. The server keeps running until you manually stop it, allowing persistent, programmatic access.

Using Ollama in Python¶

With Ollama "serving" pretrained models and making them accessible via a locally(!) running API, we can submit prompts to the model using HTTP requests against this API. However, a more convenient method is to use the ollama Python library. It essentially acts as a bridge between Python applications and the Ollama service, making it easier to integrate LLM-powered features into scripts, apps, or pipelines. Within this library, the Client class is the main entry point for interacting with Ollama. It wraps Ollama's HTTP API and provides convenient methods to run models, stream responses, and manage models (e.g., pulling or listing them). In short, the Client abstracts away the networking details and exposes Ollama's capabilities in a simple Pythonic interface.
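For comparison, this is roughly what the raw HTTP route looks like, using the requests package imported at the top of the notebook and assuming the server runs at the default address and port; the /api/generate endpoint is part of Ollama's standard REST API:

# Send a single, non-streaming generation request directly to the local Ollama server
payload = {
    "model": "tinyllama:latest",
    "prompt": "How old are the pyramids of Giza?",
    "stream": False,
}
r = requests.post("http://localhost:11434/api/generate", json=payload)
r.raise_for_status()
print(r.json()["response"])

The ollama library wraps exactly this kind of request for you, which is why we use it in the rest of this section.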

The code cell below creates a client object using the Client class. The main argument is the address of the API. By default, the API listens on port 11434 on localhost. In this case, we could also omit the host argument. Of course, if Ollama runs on a different machine or on a different port, we need to explicitly tell the Client class where to find the API.

In [3]:
client = ollama.Client(host="http://localhost:11434")
#client = ollama.Client() # Also works since "http://localhost:11434" is the default

We can now submit prompts to any available model. The generate() method of the library is used to run a model and produce text completions based on a given prompt; see the example in the code cell below. You provide it with the model name (e.g., "tinyllama:latest") and an input prompt, and it returns the model's generated response. This method is typically used for single-turn interactions where you want the model to complete or expand text, rather than maintain a multi-turn conversation. Under the hood, generate() sends a request to the Ollama service, streams back the model's output, and collects the result for you. It's useful for tasks like summarization, code generation, or drafting text when you don't need the conversational context management that chat() provides.

In [4]:
model =  "tinyllama:latest"
prompt = "How old are the pyramids of Giza?"

response = client.generate(model=model, prompt=prompt)

print(response['response'])
The Pyramids of Giza, including the Great Pyramid and the Pyramid of Khufu, were built over two thousand years ago during the Old Kingdom (2570-2184 BCE) of ancient Egypt. They are estimated to be around 4,500 years old, which makes them among the oldest extant manmade structures on Earth. The Great Pyramid of Giza was completed around 2569 BCE and is believed to have been the largest pyramid in the world at its construction.
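By default, generate() collects the full completion before returning. The library also supports streaming, where the response is delivered chunk by chunk as it is generated, which is handy for interactive applications. A minimal sketch, reusing the client and model from above (depending on the library version, chunks support the dict-style access used elsewhere in this notebook):

# Stream the response chunk by chunk instead of waiting for the full completion
for chunk in client.generate(model=model, prompt="Why is the sky blue?", stream=True):
    print(chunk["response"], end="", flush=True)
print()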

In contrast, the chat() method is designed for multi-turn conversations with a model. Unlike generate(), which just takes a single prompt and returns a completion, chat() accepts a structured list of messages (with roles like "system", "user", and "assistant") and maintains the conversational context across turns. This makes it ideal for building chatbots or assistants where the model needs to remember prior exchanges and respond coherently. Internally, chat() organizes the dialogue into a sequence of messages and sends them to the model, which then generates a continuation based on the full conversation history. This enables richer, more interactive behavior compared to generate(), since the model can reason about what was said earlier and adjust its responses accordingly. In short, chat() is Ollama’s higher-level interface for dialogue, while generate() is for one-off completions.

Note: tinyllama itself is not inherently a chat-capable model. It is a smaller, more lightweight version of the LLaMA model primarily designed for inference on resource-constrained devices. By default, it is trained for general-purpose text generation, not for dialogue. The code cell below will still return some meaningful output, but for proper dialogue-style response generation you would need to load and run a chat-capable model (e.g., vicuna, openchat, llama2).

In [5]:
# Ensure you have a chat-capable model pulled and running
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I am doing well, thank you! How can I help you today?"},
    {"role": "user", "content": "How old are the pyramids of Giza?"}
]

reply = ollama.chat(model=model, messages=messages)

print(reply['message']['content'])
The pyramids of Giza are believed to have been built during the reign of King Menkaure (2586-2541 BCE), who ruled over Egypt between 2573 and 2549 BCE. The exact age of the pyramids themselves is not known, but they were constructed over several centuries during King Menkaure's reign. Archaeological research has suggested that they could have been built as far back as the late Predynastic period (c. 3100-3000 BCE).

Besides generate() and chat(), the ollama Python library provides several other useful methods through its Client class, mostly for managing models and interacting with the local Ollama service:

  • pull(): downloads a model from the Ollama model library to your local machine.
  • list(): returns a list of all models currently available locally, along with details like size and modification date.
  • show(): displays information about a specific model, such as metadata, parameters, and system prompts used.
  • ps(): shows models that are currently running.
  • delete(): removes a model from your local system to free up disk space.

Together, these methods give you the ability to manage models (download, inspect, delete) and to run or converse with them (generate() and chat()). This makes the Ollama library not just an inference tool, but also a lightweight model management interface. For a quick example, the code cell below uses the list() method to show all the currently available models; of course, the output will depend on the models you have downloaded.

In [6]:
response = client.list()

for model in response.models:
    print(f"{model.model} (size={model.size/(1024*1024*1024):.3f} GB, #parameters: {model.details.parameter_size})")
gemma3:1b (size=0.759 GB, #parameters: 999.89M)
tinyllama:latest (size=0.594 GB, #parameters: 1B)

Overall, Ollama provides an easy way to run pretrained large language models (LLMs) locally because it abstracts away the complexities of setup, dependencies, and hardware configuration. Instead of dealing with intricate model weights, tokenizers, and serving frameworks, you simply install Ollama, pull a model, and start generating text or chatting with it. This makes experimenting with state-of-the-art open-source models accessible even to those without deep ML infrastructure knowledge, while still giving power users control over running models on their own hardware without relying on cloud services.

The ollama Python library builds on this simplicity by giving developers a clean and Pythonic interface to work with LLMs programmatically. With just a few method calls, you can generate completions (generate()), hold conversations (chat()), or manage models (pull(), list(), delete()). This allows seamless integration of LLMs into applications, automation scripts, or research workflows. In short, Ollama makes local LLM deployment straightforward, and its Python library makes interacting with those models just as simple in code.

Hugging Face¶

Hugging Face is a very popular open-source platform and community for natural language processing (NLP) and machine learning. It is best known for its transformers library, which provides thousands of pretrained models for tasks such as text generation, translation, classification, and more. Researchers and developers can easily share, discover, and fine-tune models on the Hugging Face Hub, making it a central ecosystem for modern AI development.

When it comes to running pretrained LLMs locally, Hugging Face makes the process straightforward through its transformers and datasets libraries. With just a few lines of code, you can download a model (e.g., GPT-2, LLaMA, or BLOOM) and run inference directly on your own hardware, whether CPU or GPU. Hugging Face also provides optimized backends (such as accelerate, optimum, and integrations with frameworks like PyTorch and TensorFlow) that handle device placement, quantization, and performance tuning. This means that you do not need to manually set up tokenizers, manage model weights, or worry about compatibility issues.

In practice, Hugging Face supports local deployment by letting developers choose between out-of-the-box pretrained models for quick experimentation, or fine-tuned/custom models for specific applications. Combined with tools like accelerate for distributed computing and optimum for hardware-specific optimizations (e.g., NVIDIA GPUs or Apple Silicon), Hugging Face makes it both accessible and efficient to run powerful LLMs locally. This flexibility is why it is one of the most widely adopted platforms in the AI developer community.

Loading Model¶

Like for Ollama, we first need to specify which model we want to use. The model library of Hugging Face, often called the Hugging Face Hub, is a vast open-source repository where researchers, developers, and organizations share pretrained models for machine learning. It covers a wide range of tasks such as text generation, translation, summarization, question answering, sentiment analysis, image recognition, audio processing, and more. Each model entry typically includes its weights, configuration files, tokenizer, and documentation, making it easy to load and use directly with the Hugging Face transformers library.

In the following example, we focus on ChatGPT-style text generation using the small TinyLlama/TinyLlama-1.1B-Chat-v1.0 model. This model is a lightweight, open-source LLM developed as part of the TinyLlama project. It has about 1.1 billion parameters, making it much smaller than mainstream LLMs like LLaMA-2 (7B+) or GPT-style models. Despite its reduced size, it is trained to be efficient and effective for conversational use, allowing it to run on more modest hardware such as consumer GPUs or even some high-end CPUs.

The “Chat” designation means this version has been fine-tuned specifically for dialogue and instruction following, rather than just raw text completion. This tuning makes it better at holding back-and-forth conversations, answering questions, and following structured prompts in a way that feels more natural and assistant-like. In essence, TinyLlama-1.1B-Chat-v1.0 is a compact, resource-friendly chat model, well-suited for local deployment and lightweight applications where larger LLMs would be too demanding.

In [7]:
# Specify the model name from Hugging Face Hub
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

What makes the Hugging Face model library powerful is its standardized interface — any model hosted there can usually be loaded with a single line of code. This removes the friction of dealing with raw model files and ensures interoperability across frameworks. The AutoModelForCausalLM class of transformers library is a generic wrapper for loading pretrained causal language models (models trained to predict the next token in a sequence). It provides a unified interface so you do not need to know the exact underlying architecture (e.g., GPT-2, LLaMA, BLOOM). Instead, you simply call AutoModelForCausalLM.from_pretrained("model-name"), and the correct model class is automatically selected and initialized with the appropriate weights and configuration.

Apart from the model name, we also set two arguments of the from_pretrained() method:

  • dtype: This argument controls the precision (data type) of the model weights when loading it into memory. By default, models are usually loaded in 32-bit floating point (torch.float32), which is precise but also memory-heavy and slower to run. By specifying a different dtype, such as torch.float16 or torch.bfloat16, you can reduce memory usage and speed up inference, especially on GPUs that support mixed precision. Below we use torch.float32 since this ensures compatibility no matter whether the model runs on the CPU or a GPU; and the model is very small anyway.

  • device_map: This argument specifies how the model layers are distributed across available devices (CPUs, GPUs, or multiple GPUs). By default, a model is loaded entirely on a single device, but large models often exceed the memory capacity of one GPU. Using device_map, you can assign different parts of the model to different devices or let Hugging Face automatically decide the optimal placement with device_map="auto". This is particularly useful for efficient inference and training of large models. For example, on a multi-GPU system, device_map can split layers across GPUs to fit the model in memory while maximizing performance.

In [8]:
# Load the pretrained model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype=torch.float32,  # Use float16 only if you have a supported GPU with enough memory
    device_map="auto"
)

Loading Tokenizer¶

Pretrained models come with their own tokenizer because the tokenization process is tightly coupled with how the model was trained. A tokenizer determines how raw text is split into tokens (subwords, words, or characters), and these tokens are then mapped to integer IDs that the model processes. If a different tokenizer were used at inference time, the mapping between text and token IDs would not match what the model learned during training, leading to degraded performance or even nonsensical outputs. For example, the word "unbelievable" might be split into ["un", "believ", "able"] in one tokenizer and ["unbel", "ievable"] in another, which would produce different embeddings that the model was not trained to understand.

Another reason is that different pretrained models optimize tokenization for specific trade-offs and languages. Models like BERT use WordPiece, GPT-2 and GPT-3 rely on Byte-Pair Encoding (BPE), while LLaMA models adopt SentencePiece with ByteFallback. These choices affect vocabulary size, sequence length, and efficiency. By providing a tokenizer along with the pretrained model, developers ensure compatibility and reproducibility: anyone using the model can preprocess text in exactly the same way as during training, guaranteeing consistent results. In short, a pretrained model without its tokenizer is incomplete. The tokenizer ensures that raw text is transformed into the exact token sequences the model was trained to interpret, making it an essential part of the model package.
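To make this concrete, the small sketch below uses the AutoTokenizer class (imported at the top of the notebook) to load two unrelated tokenizers from the Hub — gpt2 and bert-base-uncased, neither of which is used elsewhere in this notebook — and shows how each splits the same word; the exact splits depend on each model's vocabulary:

# Compare how two different pretrained tokenizers split the same word
for name in ("gpt2", "bert-base-uncased"):
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: {tok.tokenize('unbelievable')}")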

The transformers library provides the AutoTokenizer.from_pretrained() method to automatically load the correct tokenizer associated with a given pretrained model. When you call it with a model name, it downloads the model's tokenizer configuration, vocabulary, and special tokens from the Hugging Face Hub (or a local directory), and returns an instance of the appropriate tokenizer class. So let's do this for our chosen model.

In [9]:
# Load the tokenizer associated with model
tokenizer = AutoTokenizer.from_pretrained(model_name)

For a quick example, we can use the tokenizer to convert a simple sentence into its corresponding sequence of token ids; see the code below. Notice that the sentence contains the word "myoglobin", which is arguably not a very common word — we will see why this makes the example a bit more interesting in a moment.

The argument return_tensors="pt" tells the tokenizer to return its outputs (like input_ids and attention_mask) as PyTorch tensors instead of plain Python lists or NumPy arrays. This is important because models in PyTorch expect tensor inputs, not lists. Similarly, you can use return_tensors="tf" for TensorFlow tensors or "jax"/"np" for JAX and NumPy arrays. But in this notebook, we work with PyTorch. We also move the output of the tokenizer to the same device (CPU or GPU) as the model using the to() method.

Note: In the example below and throughout the notebook, we consider only single prompts. While the tokenizer allows converting/encoding multiple strings at the same time, this would require additional considerations to ensure that the resulting arrays of token ids have the same length, e.g., through automatic padding. We skip this issue in the rest of the notebook to keep things simple.
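For reference only, here is a minimal sketch of batch encoding with padding (not used elsewhere in this notebook); note that LLaMA-family tokenizers do not define a padding token by default, so a common workaround is to reuse the end-of-sequence token:

# Encode several prompts at once; shorter sequences are padded to the length of the longest one
batch_sentences = ["myoglobin is a protein", "the pyramids of Giza are in Egypt"]

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the padding token

batch_inputs = tokenizer(batch_sentences, padding=True, return_tensors="pt").to(model.device)
print(batch_inputs.input_ids.shape)  # (batch size, length of the longest sequence)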

In [10]:
example_sentence = "myoglobin is a protein"

# Tokenize the prompt and move it to the same device as the model (CPU or GPU)
inputs = tokenizer([example_sentence], return_tensors="pt").to(model.device)

# Print the resulting token ids
print(inputs.input_ids[0])
tensor([    1,   590,   468,   417,  2109,   338,   263, 26823],
       device='cuda:0')

If you look at the output, you will notice that the number of token ids is larger than the number of words in the example sentence. To see what is going on, we can convert the token ids back to the actual tokens using the convert_ids_to_tokens() method provided by the tokenizer class:

In [11]:
tokenizer.convert_ids_to_tokens(inputs.input_ids[0])
Out[11]:
['<s>', '▁my', 'og', 'lo', 'bin', '▁is', '▁a', '▁protein']

There are two main reasons for the number of tokens to be larger than the number of words in the input sequence. Most importantly, the tokenizer is performing subword tokenization — that is, splitting less common words into more common tokens. Basically all modern models that take text as input use subword tokenization methods. The second reason for the larger number of token ids is the addition of special tokens. For example, in the case of our TinyLlama model, the special token <s> is used as a start-of-sequence marker that signals the model where the input begins. It helps the model establish context, especially when distinguishing between prompt text and generated text, and ensures consistency during training and inference. By always starting with <s>, the model can better align with the structure it was trained on, improving generation quality and reducing ambiguity in how it interprets the beginning of prompts.

Of course, in practice, particularly when receiving the output of the model in the form of arrays of token ids, we are generally not interested in the individual tokens but in the human-readable text. The tokenizer class therefore also comes with the method decode() to convert an array of token ids into text. Let's apply this method to the token ids we just generated from the example sentence.

In [12]:
tokenizer.decode(inputs.input_ids[0])
Out[12]:
'<s> myoglobin is a protein'

Now all individual tokens have been appropriately merged into the proper words. However, we still have the special token <s>, and typically we want to ignore special tokens that might be contained in the output. As such, we can set the argument skip_special_tokens=True to ignore any special tokens during the decoding step.

In [13]:
tokenizer.decode(inputs.input_ids[0], skip_special_tokens=True)
Out[13]:
'myoglobin is a protein'

With the model and tokenizer loaded and ready to use, we can now submit prompts to the LLM.

Use Model for Inferencing¶

Pretrained language models often expect a specific input structure because they are typically fine-tuned or trained with a consistent prompt format. The model learns not only the language patterns but also how to interpret roles (e.g., user vs. assistant) and instructions based on the structure. If the input deviates from this expected format, the model may misinterpret the prompt, produce incomplete responses, or fail to follow instructions correctly. This is especially true for chat or instruction-following models, which rely on clearly defined message roles and separators to maintain context and provide coherent answers. For the TinyLlama-1.1B-Chat-v1.0 model, the prompt template follows a simple structured format to indicate turns in a conversation:

<|user|>
Hello, how are you?</s> 
<|assistant|>
I am doing well, thank you! How can I help you today?</s> 
<|user|>
How old are the pyramids of Giza?</s> 

In principle, we can "manually" write any prompt in this format and give it directly to the tokenizer. However, much more conveniently, we can let the tokenizer handle this automatically. The apply_chat_template() method of a tokenizer class is used to format input messages according to a model's expected chat or instruction-following template before tokenization. Once the template is applied, the tokenizer converts this structured text into token ids that the model can process. This ensures that every input, whether from a single-turn query or a multi-turn conversation, preserves the conversational structure. Thus, we can use the well-established format of writing conversational prompts supporting different roles, as shown by the messages variable in the code cell below.

When calling the apply_chat_template() method, notice the return_tensors="pt" argument. It specifies that the output of the tokenizer should be returned as PyTorch tensors. By default, tokenizers may return token IDs as Python lists or NumPy arrays, which are not directly compatible with PyTorch models. Setting return_tensors="pt" ensures that the tokenized input is converted into a torch.Tensor, which can be immediately fed into a PyTorch-based model for inference or training. This is important because LLMs in Hugging Face, like those loaded with AutoModelForCausalLM, expect inputs as tensors on a specific device (CPU or GPU). Using return_tensors="pt" simplifies the workflow by producing the correctly typed input, avoiding additional conversion steps, and allowing you to leverage PyTorch’s GPU acceleration and autograd capabilities seamlessly.

Lastly, we have to ensure that the resulting tensor resides on the same device as the model (CPU or GPU). We can do so by calling the to() method on that tensor and specifying as the target the device where the model is located.

In [14]:
# Define prompt
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I am doing well, thank you! How can I help you today?"},
    {"role": "user", "content": "How old are the pyramids of Giza?"}
]

# Apply prompt template and convert to token ids
prompt_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

Before we actually pass the tensor containing all token ids of our prompt to the model, let's first have a look at the prompt in human-readable form. We already saw how we can use the decode() method of the tokenizer for that.

In [15]:
print(tokenizer.decode(prompt_ids[0]))
<|user|>
Hello, how are you?</s> 
<|assistant|>
I am doing well, thank you! How can I help you today?</s> 
<|user|>
How old are the pyramids of Giza?</s> 

We can see how the prompt adheres to the expected template for our model. This also means that we can easily switch to a different model and a different tokenizer without changing our messages variable. The apply_chat_template() method will always ensure that the final prompt has the correct format.

Let's finally submit our prompt to the loaded model to get its response. The generate() method of the AutoModelForCausalLM class is used to produce text sequences from a causal language model. Calling generate() instructs the model to predict the next tokens in sequence, continuing until a stopping criterion is met (like a maximum length, end-of-sequence token, or custom stopping condition). This method handles the autoregressive generation internally, efficiently producing outputs without requiring you to manually implement the token-by-token loop. The method also supports a variety of generation strategies and parameters, such as greedy decoding, beam search, top-k/top-p sampling, temperature scaling, and repetition penalties. These options allow fine control over the creativity, diversity, and coherence of the generated text.

In [16]:
outputs = model.generate(
    prompt_ids,
    max_new_tokens=200,                     # Limit the number of tokens in the response
    do_sample=True,                         # Enable sampling for diversity
    temperature=0.7,                        # Sampling temperature; lower = more deterministic
)
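For reference, switching to a different decoding strategy only requires changing the arguments of the same generate() call; the parameter values below are illustrative choices rather than tuned recommendations:

# Greedy decoding: deterministic, always picks the most likely next token
greedy_outputs = model.generate(prompt_ids, max_new_tokens=200, do_sample=False)

# Nucleus (top-p) sampling with a mild repetition penalty: more diverse, less repetitive output
sampled_outputs = model.generate(
    prompt_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)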

The output of the generate() method is a list of generated token ids. We therefore need the tokenizer to decode those ids into human-readable text, as we have seen before. The output contains both the input prompt and the generated response. Typically, we are only interested in the latter. For our TinyLlama-1.1B-Chat-v1.0 model, we know that the actual response text is preceded by the <|assistant|> marker indicating the role of the model. We can therefore extract everything after the role marker and print the model's response. All three steps are performed in the code cell below.

In [17]:
# Decode the generated token IDs back into text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Extract only the assistant's reply (remove the prompt)
response_text = response.split("<|assistant|>")[-1].strip()

# Print the final response
print(f"{response_text}")
The pyramids of Giza date back to around 2500 BC, which is around 5,000 years ago. The oldest pyramid, the Great Pyramid of Khufu, was completed around 2580 BC and was estimated to be between 140,000 and 160,000 years old when it was built. The pyramids have been the subject of debate and research for centuries, and there is no definitive answer as to their age. The exact age of the pyramids is still the subject of scholarly debate, and archeological evidence is limited to the archaeological site itself.

Similar to Ollama, the Hugging Face transformers library provides an easy and consistent way to run pretrained large language models (LLMs) locally. By offering thousands of models on the Hugging Face Hub, it allows developers to quickly download and use state-of-the-art models for a wide variety of tasks such as text generation, summarization, translation, and question answering. Users don't need to worry about the low-level details of model weights, tokenization, or architecture; the library abstracts these complexities and provides a standardized API that works across different model types and sizes.

Programmatically, the transformers library makes it simple to interact with models using Python code. With classes like AutoModelForCausalLM and AutoTokenizer, you can load a model, tokenize input text, and generate responses with just a few lines of code. Methods like generate() handle the autoregressive text generation automatically, while tokenizers manage the conversion between text and model-readable token IDs. Additional arguments, such as device_map or dtype, make it straightforward to optimize the model for your hardware and memory constraints. This combination of a rich model library, easy-to-use API, and flexible runtime options allows you to build applications, prototypes, or research experiments that leverage LLMs locally without relying on cloud services.

LangChain¶

LangChain is an open-source framework designed to build applications that use language models as components in larger systems. Unlike Hugging Face Transformers or Ollama, which primarily focus on providing access to pretrained LLMs, LangChain focuses on orchestrating language models with external tools, data sources, and memory. It provides abstractions for chains (sequences of operations), agents (LLMs that can take actions), and retrievers (accessing external knowledge), making it ideal for creating chatbots, question-answering systems, or multi-step reasoning pipelines.

In fact, the LangChain auxiliary libraries provide classes for the integration of Ollama and Hugging Face Transformers. The langchain_ollama library provides the OllamaLLM class to access models that are served by Ollama, very similar to the ollama library — but here integrated into the LangChain framework. To submit a single prompt to the model, we can use the invoke() method. This is the primary way to execute a chain or a runnable object in LangChain. It is a synchronous method that takes a single input and returns a single output, making it suitable for standard, one-off operations. The code cell below shows a minimal example.

In [18]:
# Initialize Ollama with the chosen model
ollama_llm = OllamaLLM(model="tinyllama:latest")

# Invoke the model with a query
response = ollama_llm.invoke("How tall are the pyramids of Giza?")

# Print LLM-generated response
print(response)
The pyramids of Giza, also known as Khamzeh-ye Ghaem or Sazman-e Pouya, are not officially named Pyramid I or II. They are actually pyramids that are located in Iran and are not directly associated with the ancient Egyptian civilization. The exact heights of these structures have been estimated at varying sizes, but they are generally thought to be around 15 meters tall. The exact dimensions of the pyramids have changed over time due to modifications made by later civilizations who wanted to enhance their status and reputation as well as add or subtract certain features.

The counterpart for Hugging Face Transformers is the langchain_huggingface library. However, this library still requires loading the model and the tokenizer using the respective classes from the transformers library. In fact, we can reuse the model and tokenizer we loaded earlier in the notebook; if needed, the (commented-out) code cell below includes all the required commands.

In [19]:
# Load a pretrained model and tokenizer from Hugging Face
#model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
#tokenizer = AutoTokenizer.from_pretrained(model_name)
#model = AutoModelForCausalLM.from_pretrained(model_name)

LangChain works best with Hugging Face pipelines. In the transformers library, a pipeline is a high-level abstraction that makes it easy to use pretrained models for common tasks without worrying about the details of tokenization, model architecture, or output formatting. With just a few lines of code, you can load a model and run inference for tasks like text classification, sentiment analysis, translation, summarization, question answering, and more. Pipelines are especially useful for quick prototyping and experimentation since they hide much of the complexity of model handling. Instead of manually loading tokenizers, encoding inputs, passing them to the model, and decoding outputs, the pipeline takes care of all of this under the hood, allowing you to focus on the task at hand. This makes it a convenient entry point for beginners and a time-saver for experienced users who want fast results.

The code below creates a pipeline specialized for text generation. In this example, we give the pipeline both the model and the tokenizer object and set max_length, the maximum total length of the prompt plus the generated text, to $100$ tokens. The HuggingFacePipeline class in LangChain is a wrapper that lets you use Hugging Face's transformers pipelines directly inside the LangChain framework. Instead of calling a Hugging Face pipeline separately, you can wrap it with HuggingFacePipeline and then use it like any other LangChain LLM or chain component. This makes it easy to integrate Hugging Face models into LangChain workflows such as prompt chaining, agents, or retrieval-augmented generation. It's particularly useful when you already have a custom Hugging Face model or pipeline configured (for example, a text-generation or text-classification pipeline) and you want to leverage LangChain's tooling around it.

In [20]:
# Create a Hugging Face pipeline for text generation
text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=100
)

# Wrap the pipeline in a LangChain LLM object
huggingface_llm = HuggingFacePipeline(pipeline=text_gen_pipeline)
Device set to use cuda:0
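For comparison, the underlying Hugging Face pipeline can also be called directly, without going through LangChain; a minimal sketch:

# Call the text-generation pipeline directly; it returns a list with one dict per input
result = text_gen_pipeline("How tall are the pyramids of Giza?")
print(result[0]["generated_text"])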

With the pipeline set up, we again can use the invoke() method to submit prompts to the model:

In [21]:
# Invoke the model with a query
response = huggingface_llm.invoke("How tall are the pyramids of Giza?")

# Print LLM-generated response
print(response)
How tall are the pyramids of Giza?

As mentioned before, although it can be used for it, LangChain is not just a tool for running pretrained models like large language models (LLMs) for inference; it is a full framework designed to integrate LLMs into larger, structured applications. Instead of treating a model as a standalone black box, LangChain provides abstractions for chaining prompts, incorporating external tools (such as databases, APIs, or search engines), managing context, and handling workflows. This makes it possible to build applications where LLMs interact with other systems in a coordinated way.

The advantages of using LangChain include modularity, extensibility, and orchestration. Developers can combine pretrained models with memory components, retrieval systems, or tool-using agents, all while benefiting from a consistent API. This allows them to build more reliable and context-aware applications than just raw inference would allow. LangChain also makes it easier to experiment with different LLMs, swap models, or connect them with specialized pipelines without rewriting the core logic. Example applications include chatbots and virtual assistants that can retrieve information from a knowledge base, document summarizers that work across large collections of text, question-answering systems powered by retrieval-augmented generation (RAG), and autonomous agents that can plan multi-step tasks using external tools. By embedding pretrained LLMs into such workflows, LangChain enables developers to move from simple text generation to building sophisticated AI-powered applications. However, such applications are beyond the scope of this notebook.
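Still, to give a small taste of what such orchestration looks like in code, the sketch below chains a prompt template with the Ollama-backed LLM created earlier using LangChain's runnable (pipe) syntax; the template wording is just an illustrative example:

from langchain_core.prompts import PromptTemplate

# Build a simple chain: prompt template -> local LLM
prompt_template = PromptTemplate.from_template(
    "Answer the following question in one short paragraph:\n\n{question}"
)
chain = prompt_template | ollama_llm

# Run the chain with a concrete question
print(chain.invoke({"question": "How tall are the pyramids of Giza?"}))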

Other Popular Alternatives¶

In this section, we focused on three tools and frameworks to load, run, and use pretrained LLMs on your local machine. However, there are many other alternatives that may differ with respect to their usability, flexibility, hardware requirements, and so on. Let's briefly look at some other popular alternatives for running pretrained models locally.

LM Studio¶

LM Studio is a desktop platform for local AI whose goal is to let users run, experiment with, and interact with open-source large language models (LLMs) entirely on their own machines (Windows, macOS, Linux). It provides a GUI (graphical user interface) along with built-in tools for downloading models (e.g., from Hugging Face and other sources), managing them, chatting with them, and performing document-based retrieval tasks (i.e., RAG: retrieving content from local documents to assist in chat). LM Studio also exposes APIs and SDKs, so you can not only use it via its desktop UI but also integrate model serving into your own scripts, workflows, or applications; for example, you can run a model locally and expose it via an OpenAI-compatible REST endpoint. Its purpose is to give both casual users and developers a way to experiment with LLMs locally, maintain data privacy (because inference is local), avoid cloud costs, and build prototypes or tools that use LLM capabilities without depending on cloud APIs.

LM Studio stands out for its ease of use, providing a simple graphical interface that removes the technical hurdles of running LLMs through command-line tools, dependency management, or runtime setup. Users can easily discover, download, and configure models, making it accessible even to those without deep technical expertise. Beyond usability, it offers strong flexibility and integration, functioning not just as a playground but as a local model server with API support, SDKs, and document-based retrieval (RAG) features. Combined with its support for a wide variety of open-source models in different sizes and formats, users can experiment, prototype, and deploy solutions tailored to their hardware and use cases, whether for personal assistants, apps, or research tools.
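
Because LM Studio can serve models through an OpenAI-compatible endpoint, you can talk to it with the official OpenAI Python client. The sketch below is only illustrative: the base URL assumes LM Studio's default local server address, and the model name is a placeholder for whichever model you have loaded, so adjust both to your own setup.

In [ ]:
# Illustrative sketch: query a model served by LM Studio via its OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # assumed default address of LM Studio's local server
    api_key="lm-studio"                   # any non-empty string; the key is not checked locally
)

response = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "How tall are the pyramids of Giza?"}],
    max_tokens=100
)

# Print the generated reply
print(response.choices[0].message.content)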

Text Generation Web UI¶

The Text Generation Web UI (often called the oobabooga Text Generation Web UI) is an open-source, browser-based interface for running and interacting with large language models locally. It provides a simple web dashboard where users can load models (such as LLaMA, GPT-J, MPT, Falcon, etc.), configure generation parameters, and chat with them. The goal is to make experimenting with different open-source LLMs easier without needing to write Python code or rely heavily on command-line tools. It also supports features like role-play, multi-user chat, model fine-tuning (LoRA adapters), and extensions for retrieval or other integrations. The main purpose of the Text Generation Web UI is to offer a user-friendly, customizable, and extensible environment for people who want to run and test LLMs locally on their own hardware. By exposing advanced parameters (like temperature, top-p, and repetition penalty) and allowing plugins, it enables fine-grained control over how the model behaves. It is widely used by hobbyists, researchers, and developers who want more flexibility than a pre-packaged desktop app but still want a convenient interface.

Compared to LM Studio, the pros of the Text Generation Web UI include its open-source nature, high degree of customization, and an active community that contributes plugins, model support, and updates. Because it is browser-based, you can also access it from multiple devices on the same network. It supports a wide range of models and inference backends, making it one of the most versatile local LLM frontends available. On the other hand, its setup can be more technical and complex than LM Studio's, since you need to install Python, dependencies, and sometimes GPU runtimes. The interface, while powerful, may feel overwhelming for casual users, and performance still depends heavily on your hardware. Unlike LM Studio's polished, ready-to-use design, the Text Generation Web UI leans more toward tinkerers who are comfortable adjusting settings and experimenting.

Text Generation Inference (TGI)¶

Text Generation Inference (TGI) is Hugging Face's open-source, production-ready server for running and serving large language models efficiently. Unlike graphical frontends such as LM Studio or the Text Generation Web UI, TGI focuses on high-performance inference in enterprise and developer environments. It provides an optimized backend for transformer-based models with support for quantization, batching, tensor parallelism, and streaming, all designed to maximize throughput and minimize latency when deploying LLMs at scale. Hugging Face uses TGI under the hood for its own Inference Endpoints and the Hugging Face Hub.

The main purpose of TGI is to make serving large language models reliable and efficient in production contexts. It exposes a simple REST API for text generation (including an OpenAI-compatible chat endpoint), so developers can swap it into existing workflows easily. Beyond raw generation, it also supports features like token streaming, log probabilities, and stopping criteria, which are important for building advanced AI applications. It is particularly well-suited for organizations that want to self-host powerful LLMs and benefit from these optimizations without writing custom serving code. While LM Studio is a desktop-focused tool designed for personal, local experimentation and integration, TGI is an enterprise-grade inference solution built with scalable deployments and efficiency in mind.
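
To give a sense of how this looks from the client side, the sketch below sends a request to a TGI server's generate endpoint. It assumes you have started TGI locally (for example, via its Docker image) and mapped it to port 8080; adjust the URL and generation parameters to match your own deployment.

In [ ]:
# Illustrative sketch: call a locally running TGI server over HTTP
import requests

payload = {
    "inputs": "How tall are the pyramids of Giza?",
    "parameters": {"max_new_tokens": 100}
}

# Send the request to the (assumed) local TGI endpoint and print the generated text
response = requests.post("http://localhost:8080/generate", json=payload)
print(response.json()["generated_text"])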


Summary¶

Running pretrained large language models (LLMs) locally has become increasingly accessible thanks to tools and frameworks such as Ollama, Hugging Face Transformers, and LangChain. These platforms allow developers to load models onto their own machines and interact with them programmatically using Python or other supported languages. Unlike cloud-based APIs, local LLM deployment provides full control over the model, data privacy, and lower latency, which is especially valuable for sensitive data or offline use cases. With a few lines of code, developers can instantiate models, tokenize inputs, and generate outputs, making local deployment suitable for experimentation, prototyping, or building custom applications.

The Hugging Face Transformers library is widely used for programmatic access to LLMs. Its Python API provides a unified interface to load pretrained models and tokenizers, run inference, and even fine-tune models. Using the pipeline abstraction or the AutoModel and AutoTokenizer classes, developers can quickly set up models for tasks such as text generation, classification, or question answering. The library also integrates well with PyTorch and TensorFlow, enabling both CPU and GPU execution. Its main advantage is ease of integration with existing Python workflows, along with a large catalog of models that can be accessed programmatically.

Ollama provides a different approach by combining a local model repository with a simple Python client and command-line interface. It allows developers to download pretrained models and interact with them in a chat-oriented fashion or programmatically. The Python client supports sending prompts, streaming responses, and chaining interactions, making it suitable for building chatbots, agents, or other conversational tools locally. One key benefit of Ollama is simplicity in model management, as the framework handles downloading, caching, and running models efficiently.

LangChain, on the other hand, is a higher-level framework that allows developers to embed LLMs into structured workflows and applications. Instead of just calling a model for inference, LangChain enables chaining prompts, adding memory, integrating retrieval systems, or connecting models with external tools. It supports using local or cloud-based models, including those from Hugging Face or Ollama, and exposes a consistent Python API for orchestrating multi-step interactions. This allows developers to build complex applications such as autonomous agents, document assistants, or retrieval-augmented generation systems programmatically.

The pros of running LLMs locally using these frameworks include data privacy, low latency, cost control, and full control over the model. Developers can experiment freely without cloud restrictions or API limits and integrate models into custom Python applications. The cons include hardware requirements, since larger models demand high RAM and GPUs, and complexity in setup for some frameworks, particularly when optimizing for performance. Additionally, keeping models up to date or integrating the latest model versions can require manual management. Overall, these frameworks provide a powerful, programmatic way to harness LLMs locally, balancing flexibility with control over resources and workflows.

In [ ]: