Disclaimer: This Jupyter Notebook contains content generated with the assistance of AI. While every effort has been made to review and validate the outputs, users should independently verify critical information before relying on it. The SELENE notebook repository is constantly evolving. We recommend downloading or pulling the latest version of this notebook from Github.
Working with the OpenAI API — An Introduction¶
The main benefit of accessing LLMs through provided APIs is that developers and organizations can leverage powerful, state-of-the-art models without needing to train or maintain them themselves. Training large language models requires massive datasets, specialized expertise, and extremely costly compute infrastructure — resources that are out of reach for most teams. With an API, all of that complexity is abstracted away, and you gain instant access to reliable, production-grade models through simple function calls in your code.
Another advantage is flexibility and scalability. By using an API, you can integrate language understanding, reasoning, and generation into a wide range of applications — such as chatbots, search systems, content generation tools, or code assistants — without being tied to a single platform or architecture. The API ensures you always have access to the latest improvements, updates, and safety features of the model, while scaling seamlessly with demand. This allows teams to focus on building innovative applications rather than managing machine learning infrastructure.
In this notebook, we will explore the basic use of the OpenAI API with Python. The OpenAI API provides programmatic access to powerful foundation large language models (LLMs) such as GPT, which are capable of understanding and generating natural language, writing code, analyzing data, and much more. By integrating these capabilities directly into Python workflows, developers and researchers can experiment with language models in an interactive and flexible way.
Setting up the Notebook¶
Make Required Imports¶
This notebook requires importing several Python packages (particularly the official OpenAI Python API library) as well as additional Python modules that are part of the repository. If a package is missing, use your preferred package manager (e.g., conda or pip) to install it. If the code cell below runs without any errors, all required packages and modules have been successfully imported.
import openai
import random
from IPython.display import Image, Audio
from src.utils.data.files import *
Download Required Data¶
Some code examples in this notebook use data that first needs to be downloaded by running the code cell below. If this code cell throws an error, please check in the configuration file config.yaml whether the URL for downloading datasets is up to date and matches the one on Github. If not, simply download or pull the latest version from Github.
mp3_voice_sample_english, _ = download_dataset("audio/voice/samples/voice-sample-female-english-01.mp3")
mp3_voice_sample_spanish, _ = download_dataset("audio/voice/samples/voice-sample-female-spanish-01.mp3")
File 'data/datasets/audio/voice/samples/voice-sample-female-english-01.mp3' already exists (use 'overwrite=True' to overwrite it). File 'data/datasets/audio/voice/samples/voice-sample-female-spanish-01.mp3' already exists (use 'overwrite=True' to overwrite it).
Preliminaries¶
Before checking out this notebook, please consider the following:
The OpenAI API is not free; it operates on a prepaid, pay-as-you-go model, and you must add a valid payment method and prepay at least a small amount to use the service. OpenAI no longer provides free trial credits for new accounts, so to use the API, you need to fund your account with credits (note: this information may not be up to date; check the official pricing page for the latest information).
This notebook serves as a basic introduction to the OpenAI API. The API offers many endpoints and endpoint parameters, making it very flexible and powerful. For many more, and more sophisticated, examples, you can check out the official OpenAI Cookbook as well as the official API documentation.
The OpenAI API as well as the information this notebook refers to are subject to change. This means that some information in this notebook may be outdated or some of the code examples will no longer work ("as is"). This notebook was last updated in September 2025.
Authentication & API Keys¶
The purpose of authentication when accessing an API is to verify the identity of the user or application making the request, ensuring that only authorized parties can interact with the service. This protects the API from unauthorized access, misuse, or abuse, while also allowing the provider to track usage, enforce quotas, and associate requests with specific accounts for billing or monitoring. In short, authentication establishes trust between the client and the API provider, enabling secure and accountable communication.
API keys act as a secure identifier that authenticates a user or application when accessing an API. In other words, an API key proves to the service provider (like OpenAI) that the request is coming from a valid, authorized source. This ensures that only approved users can make requests and that usage can be tracked and billed appropriately. API keys also serve an important role in usage monitoring and security. They allow providers to enforce rate limits, track how much each user or organization is consuming, and prevent abuse or unauthorized access. For developers, API keys make it possible to integrate powerful external services into applications while maintaining control over who can access those services and under what conditions.
The OpenAI API distinguishes between two types of keys; the main difference lies in ownership, usage scope, and billing management:
Personal API keys are tied directly to an individual OpenAI account. They are meant for personal use — such as testing, learning, or building small projects. Billing is linked to the individual user's payment method, and usage is tracked under that person's account. These keys are ideal if you are experimenting on your own or developing prototypes without team collaboration.
Organization API keys are associated with an organization's OpenAI account. Multiple team members can share access, and all API usage is consolidated under the organization's billing and quotas. This makes it easier for companies, research groups, or teams to manage shared resources, monitor usage, and ensure that projects are funded centrally rather than through individual accounts.
In the following, we assume you want to create a personal API key. If you are a member of an organization, company, or team with an organization API key, you can check with the creator of the organization to get access to the API.
Creating an API Key¶
To create an API key for accessing the OpenAI API, you must first sign in to your OpenAI account. From the account dashboard, you can navigate to the API Keys section (left side bar). There, you can then click the "Create new secret key" button (top right) to generate a new API key. Once created, the key will be displayed only once, so it is important to copy and store it securely (e.g., in a password manager or environment variable). You can then use this key in your code to authenticate requests to the OpenAI API. If your key is ever compromised (e.g., accidentally shared or leaked to other people), you should revoke it immediately and generate a new one.
Storing your API Key¶
It is not a good practice to include API keys directly in your code because doing so exposes them to potential misuse. If the code is shared publicly (for example, on GitHub, in a notebook, or even within a team), the keys can easily be copied and used by unauthorized parties. This can lead to unexpected costs, data breaches, or service abuse since the API provider cannot distinguish between your legitimate use and someone else's.
Instead, best practice is to store API keys securely outside of your code — for example, in environment variables, a configuration file that is excluded from version control, or a secrets manager. Your application can then load the key at runtime without exposing it in the source code, keeping access secure while still enabling smooth integration. In the following, we describe how you can store your newly generated OpenAI API key using an environment variable; this depends on your type of operating system.
Windows¶
To set a variable that persists across reboots and new sessions, use the setx command in the Windows PowerShell or the Windows Command Prompt. Note that changes made with setx won't be visible in your current command prompt session; you will need to open a new one. In case of setting an environment variable for the OpenAI API key, the required command is
setx OPENAI_API_KEY "your_api_key_here"
Of course, you need to replace your_api_key_here with the API key you have just generated. Alternatively, you can use the graphical user interface. Navigate to System Properties, then click on Environment Variables. There, you can add or edit user-specific or system-wide variables.
MacOS & Linux¶
For a temporary definition of an environment variable, in both macOS and Linux, which are Unix-like systems, you can use the export command to set a variable for the current shell session. However, this requires that you start your Jupyter environment in the same session in which you defined the variable. A more convenient solution is therefore to store the API key permanently. To make a variable permanent, you need to add the export command to one of your shell's startup files. The correct file depends on the shell you're using. Some common files are:
- ~/.bash_profile or ~/.bashrc for Bash shells
- ~/.zshrc for Zsh shells (default on modern macOS)
You can edit these files using a text editor like nano or vim to add the following export command to the file (e.g., at the end):
export OPENAI_API_KEY="your_api_key_here"
After adding the variable, you'll need to source the file or open a new terminal session for the changes to take effect, for example:
source ~/.bash_profile
To check whether your environment variable is set correctly, you can print it in the terminal using the following command:
echo $OPENAI_API_KEY
If you see your API key, the environment variable for the key has been set correctly, and you are ready to go.
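If you are working inside a Jupyter notebook, you can also check from Python whether the variable is visible to the running kernel. A minimal sketch (it only reports whether the key is set, without printing it):

import os

# Report whether the key is visible to this Python process, without printing the key itself
print("OPENAI_API_KEY is set:", os.environ.get("OPENAI_API_KEY") is not None)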
Creating a Client¶
Once you have set the environment variable for the OpenAI API key, you can create a client to use the API for requests. In the OpenAI Python library, the client is the main object you use to communicate with the OpenAI API. It acts as a wrapper around all API endpoints, handling authentication, request formatting, and responses for you. You can create a client using the following line of code:
client = openai.OpenAI()
The OpenAI class looks for the OPENAI_API_KEY environment variable by default, so you do not need to pass the API key manually if it is already set. Again, this makes the code simpler as well as safer. You are now ready to use this client to send requests to the different available API endpoints, and receive the responses in a structured way for further processing in your application. This is what we look at next.
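If you prefer to be explicit rather than relying on this default lookup, the key can also be passed directly when constructing the client. A minimal sketch, still reading the key from the environment rather than hard-coding it:

import os
import openai

# Explicitly pass the API key instead of relying on the default environment lookup
client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))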
API Endpoints — Overview¶
OpenAI provides a variety of models via its API, each optimized for different tasks, levels of performance, and cost. The primary categories include GPT models for text generation and understanding, embedding models for semantic representations, and multimodal models that handle text, images, or audio. Within these categories, there are multiple versions and variants, such as the GPT-4 family, GPT-3.5, and specialized models like gpt-4o-mini, each designed to balance capabilities, speed, and cost.
GPT models differ mainly in performance, capabilities, and pricing. For instance, GPT-4 models are more capable than GPT-3.5 in terms of reasoning, creativity, and handling complex instructions, but they are also more expensive per token and may have higher latency. Lighter variants like gpt-4o-mini provide faster responses at a lower cost but may sacrifice some depth of reasoning or nuance. This allows developers to choose models based on the specific needs of their application, whether that’s high-quality content generation, interactive chatbots, or rapid prototyping.
Embedding models, on the other hand, are optimized for representing text or other data in high-dimensional vector spaces for tasks like search, clustering, or recommendation. They generally have lower costs than full GPT models because they are specialized and do not perform free-form text generation. Finally, multimodal models expand the range of tasks to include image understanding, audio transcription, and code generation, offering flexibility for applications that require multiple types of input or output. By offering a range of models with different strengths and pricing, OpenAI enables developers to select the most suitable tool for their project while balancing performance and cost.
The code cell below uses the client to retrieve a list of all available models; keep in mind that this list is likely to change over time:
models = client.models.list()
for model in models:
print(model.id)
gpt-4-0613 gpt-4 gpt-3.5-turbo gpt-audio gpt-5-nano gpt-audio-2025-08-28 gpt-realtime gpt-realtime-2025-08-28 davinci-002 babbage-002 gpt-3.5-turbo-instruct gpt-3.5-turbo-instruct-0914 dall-e-3 dall-e-2 gpt-4-1106-preview gpt-3.5-turbo-1106 tts-1-hd tts-1-1106 tts-1-hd-1106 text-embedding-3-small text-embedding-3-large gpt-4-0125-preview gpt-4-turbo-preview gpt-3.5-turbo-0125 gpt-4-turbo gpt-4-turbo-2024-04-09 gpt-4o gpt-4o-2024-05-13 gpt-4o-mini-2024-07-18 gpt-4o-mini gpt-4o-2024-08-06 chatgpt-4o-latest o1-mini-2024-09-12 o1-mini gpt-4o-realtime-preview-2024-10-01 gpt-4o-audio-preview-2024-10-01 gpt-4o-audio-preview gpt-4o-realtime-preview omni-moderation-latest omni-moderation-2024-09-26 gpt-4o-realtime-preview-2024-12-17 gpt-4o-audio-preview-2024-12-17 gpt-4o-mini-realtime-preview-2024-12-17 gpt-4o-mini-audio-preview-2024-12-17 o1-2024-12-17 o1 gpt-4o-mini-realtime-preview gpt-4o-mini-audio-preview o3-mini o3-mini-2025-01-31 gpt-4o-2024-11-20 gpt-4o-search-preview-2025-03-11 gpt-4o-search-preview gpt-4o-mini-search-preview-2025-03-11 gpt-4o-mini-search-preview gpt-4o-transcribe gpt-4o-mini-transcribe o1-pro-2025-03-19 o1-pro gpt-4o-mini-tts o3-2025-04-16 o4-mini-2025-04-16 o3 o4-mini gpt-4.1-2025-04-14 gpt-4.1 gpt-4.1-mini-2025-04-14 gpt-4.1-mini gpt-4.1-nano-2025-04-14 gpt-4.1-nano gpt-image-1 gpt-4o-realtime-preview-2025-06-03 gpt-4o-audio-preview-2025-06-03 gpt-5-chat-latest gpt-5-2025-08-07 gpt-5 gpt-5-mini-2025-08-07 gpt-5-mini gpt-5-nano-2025-08-07 gpt-3.5-turbo-16k tts-1 whisper-1 text-embedding-ada-002
Unfortunately, this request does not include any pricing information, which needs to be looked up on the official page. However, the table below, which gives an overview of the main model families, also lists a general pricing focus alongside the naming patterns and typical use cases.
| Model Family | Naming Pattern | Examples | Typical Use Cases | Pricing Focus |
|---|---|---|---|---|
| GPT (Text & Chat) | `gpt-<version>[-variant]` | `gpt-4o`, `gpt-4o-mini`, `gpt-3.5-turbo` | Natural language generation, chatbots, reasoning, coding assistance | Mini/3.5 = cheaper, 4/4o = more powerful |
| Embeddings | `text-embedding-<generation>-<size>` | `text-embedding-3-small`, `text-embedding-3-large` | Semantic search, clustering, recommendations, similarity comparisons | Small = cost-efficient, Large = higher accuracy |
| Images | `gpt-image-<version>` | `gpt-image-1` | Text-to-image generation, creative design, visual prototyping | Pay-per-image, higher resolution costs more |
| Audio | `whisper-<version>` | `whisper-1` | Speech-to-text transcription, audio translation | Flat pricing per minute of audio |
| Moderation | `<family>-moderation-<version>` | `omni-moderation-latest`, `omni-moderation-dev` | Content safety filtering, compliance, harmful content detection | Free of charge |
In the following, we will go through each of the model families and show basic examples of their use in practice.
Chat / Conversation¶
The client.chat.completions.create endpoint is used to generate responses in a conversational format. This endpoint is optimized for chat-based interactions where input and output are structured as a sequence of messages between roles such as "system", "user", and "assistant". This structure allows developers to define context (e.g., rules or instructions with "system"), provide user queries ("user"), and receive coherent, context-aware responses ("assistant"). It is commonly used for building chatbots, virtual assistants, customer support agents, and other interactive applications that require back-and-forth dialogue. The endpoint supports advanced features like function calling, multi-turn memory, and response formatting, making it versatile for both natural conversation and structured tasks such as querying APIs, summarization, or data extraction.
Let's run this method with a simple example prompt. We use the gpt-3.5-turbo model as it offers a strong balance of performance, cost, and speed, making it one of the most practical models for real-world applications. It is significantly faster and cheaper than previous GPT-3 models while maintaining high-quality responses suitable for conversational AI, text generation, and reasoning tasks. Its optimization for chat-based interactions allows developers to build scalable applications like chatbots, support agents, and content tools without incurring the higher costs of larger models like GPT-4, making it an efficient choice for both experimentation and production.
Apart from the client and the model, the example below only includes a "user" prompt (ignoring all other roles). However, notice how the user prompt is an element of a list which is the value for the argument messages. This list can easily be extended to include additional prompts from the other two roles ("system" and "assistant"), as illustrated in the sketch at the end of this section.
model = "gpt-3.5-turbo"
prompt = "What is the distance between the earth and the moon?"
try:
# Submit prompt to model
completion = client.chat.completions.create(model=model, messages=[{"role": "user","content": prompt}]
)
# Return response (content only)
print(completion.choices[0].message.content)
except Exception as e:
print(f"An error occurred: {e}")
The average distance between the Earth and the moon is about 238,855 miles (384,400 kilometers). However, this distance can vary due to the elliptical orbit of the moon around the Earth.
The client.chat.completions.create endpoint is one of the most commonly used and flexible endpoints in the OpenAI API because it is designed specifically for conversational workflows while also supporting a wide range of other tasks. Its structured message format with roles ("system", "user", "assistant") makes it easy to guide the model's behavior, maintain context across turns, and create dynamic multi-turn conversations. This design aligns naturally with how most real-world applications — such as chatbots, assistants, or tutoring systems — interact with users. Beyond simple dialogue, the endpoint's flexibility comes from its ability to handle advanced features like function calling, tool integration, and structured output formatting. This allows developers to go beyond natural conversation and use it for automation, data processing, and API orchestration. Because it can serve both as a conversational engine and a general-purpose text generator, it has become the central entry point for building with OpenAI models.
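To illustrate this role structure, the sketch below extends the messages list with a "system" instruction and an earlier "assistant" turn; the prompts themselves are arbitrary examples:

model = "gpt-3.5-turbo"

messages = [
    {"role": "system", "content": "You are a concise astronomy tutor."},
    {"role": "user", "content": "What is the largest planet in the solar system?"},
    {"role": "assistant", "content": "Jupiter is the largest planet in our solar system."},
    {"role": "user", "content": "And how many moons does it have?"},
]

try:
    # Submit the full conversation history to the model
    completion = client.chat.completions.create(model=model, messages=messages)
    # Return response (content only)
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"An error occurred: {e}")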
Responses¶
The client.responses.create endpoint is a newer, more flexible alternative to the classic chat.completions.create endpoint in the OpenAI API. It is designed as a unified interface for generating model outputs, whether they are plain text, structured responses, or tool-augmented actions. Instead of requiring a structured array of messages, it accepts a simpler input field, and it can return richer, event-based outputs in addition to plain text. This makes it easier to integrate into applications that need structured data, streaming responses, or more complex workflows. The endpoint is primarily used for building stateful, agent-like applications. It supports features such as maintaining conversational context across requests (via previous_response_id), calling external tools (e.g., web search, file search, code execution), and producing structured outputs like JSON that align with schemas. In practice, this means developers can use this endpoint not just for chatbots, but also for advanced assistants, reasoning tasks, and applications where the model must interact with external systems in a controlled, reliable way.
A minimal working example for calling the client.responses.create endpoint is shown in the code cell below. Note that unlike the client.chat.completions.create endpoint, which requires a messages array with roles ("system", "user", "assistant"), the newer client.responses.create endpoint works differently. You do not explicitly assign roles like "system" or "assistant" in the same way. Instead, you provide input, which can be plain text or a list of structured content items. If you need role-like behavior (e.g., system instructions vs. user input), that is handled through the input structure or by passing system-level instructions separately. The Responses API is designed to be simpler and more general-purpose, so it abstracts away explicit roles. We will see more practical examples later in this notebook.
model = "gpt-3.5-turbo"
prompt = "What is the distance between the earth and the moon?"
try:
# Submit prompt to the model by calling the API endpoint
response = client.responses.create(model=model, input=prompt)
# Return response (content only)
print(response.output_text)
except Exception as e:
print(f"An error occurred: {e}")
The average distance between the Earth and the Moon is about 384,400 kilometers (238,855 miles). However, this distance can vary due to the Moon's elliptical orbit around the Earth.
The newer responses.create endpoint is designed as a superset of chat.completions.create. It simplifies multi-turn dialogues and agent workflows by allowing the service to manage conversation state. You can pass a previous_response_id to continue a dialogue naturally without resending the full message history. It also unifies the arguments and usage with other endpoints. The table below provides a brief comparison between both endpoints.
| Feature | `chat.completions.create` | `responses.create` |
|---|---|---|
| Context Management | Client-managed message history | Server-managed via `previous_response_id` |
| Input Format | Array of messages (roles + content) | Simple `input` (text or structured list) |
| Response Format | `choices` array | Structured output with semantic events and `output_text` |
| Tooling / Agent Capabilities | Manual integration | Native support for web search, file search, computer use |
| Statefulness | Stateless, conversation managed client-side | Stateful, maintained by service |
| Ease of advanced flows | Requires manual orchestration | Simplified orchestration with built-in support |
| Status / Future Role | Long-standing standard; remains supported | The new, extensible, forward-looking default for agents |
Both endpoints are very flexible and powerful, and the choice between them generally depends on the kind of task you want to solve using the OpenAI API.
Moderation¶
The moderations.create endpoint of the OpenAI API is a specialized tool for content safety. Its main purpose is to analyze a given piece of text (or an image, with newer models) and determine if it violates OpenAI's usage policies. This is a critical step for developers to proactively filter or flag harmful content, such as hate speech, self-harm, sexual content, or violence, before it is displayed to users or processed by a language model. The API returns a classification of the content across a variety of categories, along with a boolean flag indicating if any policy was violated. The models suitable for this endpoint are specifically trained for moderation tasks. The most up-to-date and recommended model is omni-moderation-latest, which is built on GPT-4o and supports both text and image inputs, with improved accuracy, especially for non-English languages. There is also a legacy model, text-moderation-latest, which only handles text. The use of the moderation API is free of charge, and it is a vital component for any application that handles user-generated content, acting as a powerful and scalable guardrail for content safety.
The code cell below lists a few example prompts you can try with this endpoint, but you can also write your own prompts to see how well the model performs. We use the omni-moderation-latest model because it is OpenAI's most up-to-date, accurate, and cost-efficient moderation model. This model is continuously improved to better detect harmful or policy-violating content. By always pointing to -latest, developers automatically benefit from ongoing updates without needing to change their code whenever a new moderation model is released.
model = "omni-moderation-latest"
#prompt = "Earth and venus are two planets of very similar size" # (harmless prompt)
prompt = "I wish all my enemies would suffer!"
#prompt = "I hate the world. I wish I wouldn't wake up tomorrow."
#prompt = "Let's hunt those bastards down and give them a beating."
try:
# Submit prompt to the model by calling the API endpoint
response = client.moderations.create(model=model, input=prompt)
# Extract main moderation output from JSON object
moderation_output = response.results[0]
except Exception as e:
print(f"An error occurred: {e}")
The output of the moderations.create endpoint is returned in JSON format, structured to give moderation results for each input text. The most basic result is a boolean flag, where True indicates that the moderation identified problems with the prompt.
print(f"Response flag: {moderation_output.flagged}")
Response flag: True
Apart from a simple flag indicating whether a prompt violates any usage policy, the response also includes more fine-grained information about various categories that indicate whether the input text potentially falls into certain sensitive or restricted content areas. Some of the main categories (as of the current omni-moderation-latest model) are:
- hate: content that expresses, incites, or promotes hate based on identity.
- hate/threatening: hate content that also includes threats of violence.
- self-harm: content promoting, encouraging, or depicting self-harm.
- sexual: sexually explicit or pornographic content.
- sexual/minors: sexual content involving or targeting minors.
- violence: content that promotes or depicts violence.
- violence/graphic: explicit or gory depictions of violence.
The response provides boolean values (true/false) for each category, indicating whether the content is likely to fall under it. Additionally, there are category scores, which give a confidence score (between 0 and 1) for each category. This allows developers to build moderation logic that is either strict (blocking on any flag) or flexible (using thresholds on category scores). The code snippet below shows, for each available category, the boolean value as well as the score.
for category, score in zip(moderation_output.categories, moderation_output.category_scores):
print(f"{category[0]}: {category[1]} ({score[1]:.4f})")
harassment: True (0.5244) harassment_threatening: True (0.4626) hate: False (0.0084) hate_threatening: False (0.0046) illicit: False (0.0166) illicit_violent: False (0.0002) self_harm: False (0.0005) self_harm_instructions: False (0.0002) self_harm_intent: False (0.0003) sexual: False (0.0001) sexual_minors: False (0.0000) violence: True (0.4232) violence_graphic: False (0.0016) harassment/threatening: True (0.4626) hate/threatening: False (0.0046) illicit/violent: False (0.0002) self-harm/intent: False (0.0003) self-harm/instructions: False (0.0002) self-harm: False (0.0005) sexual/minors: False (0.0000) violence/graphic: False (0.0016)
At least for the simple examples given in the previous code cell, both the boolean flags and numerical scores for each category are arguably a good match. Of course, language is very expressive and can also be very subtle, typically making it harder for the model to guarantee a reliable moderation.
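As a small illustration of such threshold-based logic, the sketch below blocks a prompt if it was flagged or if any category score exceeds a threshold; the threshold value of 0.3 is an arbitrary choice for illustration:

threshold = 0.3  # arbitrary example threshold

# Collect all categories whose confidence score exceeds the threshold
flagged_categories = [name for name, score in moderation_output.category_scores if score > threshold]

if moderation_output.flagged or flagged_categories:
    print(f"Prompt blocked (categories above threshold: {flagged_categories})")
else:
    print("Prompt accepted")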
Embeddings¶
The embeddings.create endpoint of the OpenAI API is used to convert text (or other data types) into high-dimensional numerical vectors, known as embeddings. Embeddings capture the semantic meaning of the input, so texts with similar meanings have vectors that are close to each other in the vector space. Unlike text generation endpoints, which produce human-readable outputs, embeddings are primarily used for mathematical and computational operations such as similarity comparison, clustering, or search. This endpoint is commonly used for semantic search and information retrieval. By embedding both a query and a collection of documents, developers can measure the similarity between them using cosine similarity or other distance metrics. This allows applications to retrieve the most relevant documents even if the exact words in the query do not appear in the documents, making it far more powerful than traditional keyword-based search. Embeddings are also widely used in recommendation systems, clustering, and classification tasks. For example, they can help group similar customer reviews, recommend similar products, or categorize content based on semantic meaning.
A minimal example using this endpoint is, again, very easy to implement. We use a simple prompt and the text-embedding-3-small model. This model offers a highly efficient and cost-effective solution for generating embeddings while still maintaining strong performance across semantic search, clustering, classification, and recommendation tasks. Compared to larger embedding models, it is much faster and cheaper, making it ideal for large-scale applications where millions of vectors need to be computed or stored. Despite its smaller size, it delivers competitive accuracy in capturing semantic meaning, making it a practical default choice for most use cases that balance quality with affordability.
model = "text-embedding-3-small"
prompt = "I wish all my enemies would suffer!"
try:
# Generate embedding vector by calling endpoint
response = client.embeddings.create(model=model, input=prompt)
# Extract embedding vector from response object
embedding_vector = response.data[0].embedding
# Print the length of the embedding vector and the first 3 elements as examples
print(f"Length of embedding vector: {len(embedding_vector)}")
print(f"First 3 vector elements: {embedding_vector[:3]}")
except Exception as e:
print(f"An error occurred: {e}")
Length of embedding vector: 1536 First 3 vector elements: [0.0013678136747330427, -0.027956858277320862, 0.017621532082557678]
These embedding vectors can then be used to perform a variety of tasks that rely on measuring similarity or clustering of texts. For example, in semantic search, you can compare the embedding of a query with embeddings of a document database using cosine similarity or Euclidean distance to find the most relevant documents. Similarly, embeddings can be used for clustering similar texts, which is useful for topic modeling, grouping customer feedback, or organizing large text corpora. Embeddings are also commonly used for recommendation systems, where items (e.g., products, articles, or media) are represented by embeddings and recommendations are made by finding items with vectors closest to a user's preferences. In classification tasks, you can feed embeddings into a downstream machine learning model to predict categories based on the semantic content of the text. Overall, embeddings serve as a versatile, numerical representation of language that enables efficient similarity comparisons, clustering, search, and integration into AI-driven applications.
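As a small, self-contained illustration of the similarity use case, the sketch below embeds two short example sentences in a single API call and compares them with cosine similarity (the sentences are arbitrary):

import math

model = "text-embedding-3-small"
texts = ["The cat sits on the mat.", "A kitten is resting on a rug."]

def cosine_similarity(a, b):
    # Cosine similarity = dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

try:
    # Embed both sentences with one request (the endpoint accepts a list as input)
    response = client.embeddings.create(model=model, input=texts)
    vec1, vec2 = response.data[0].embedding, response.data[1].embedding
    print(f"Cosine similarity: {cosine_similarity(vec1, vec2):.4f}")
except Exception as e:
    print(f"An error occurred: {e}")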
Audio / Speech¶
The OpenAI API features two endpoints for audio content in the form of speech: audio.transcriptions.create to convert audio to text (speech-to-text), and audio.translations.create to translate audio into another language, typically English. Under the hood, both endpoints are very similar and may also use the same model.
Transcriptions¶
The audio.transcriptions.create endpoint is designed to convert spoken audio into written text, essentially performing speech-to-text transcription. You provide an audio file in a supported format (such as MP3, WAV, or M4A), and the model processes the audio to produce an accurate textual representation of the speech. This endpoint leverages the Whisper model, which is optimized for understanding natural language in audio across multiple languages. This endpoint is commonly used for transcribing meetings, interviews, podcasts, or lectures into written form, making spoken content easier to search, analyze, or archive. It is also useful for building applications such as voice assistants, automated captioning systems, or multilingual transcription services, where converting audio to text is a critical step for further processing or user interaction. By automating transcription, it saves time and reduces the need for manual note-taking or transcription services.
To test the endpoint, we provide a small mp3 file from a speech recognition dataset containing a voice recording of an English speaker. You can listen to the recording using the built-in audio player below to check whether the transcription indeed matches the recording.
Audio(mp3_voice_sample_english)
We can now take this mp3 file, load it as a binary file, and pass it as the value for the file argument of the endpoint. This is rather different from the previous endpoints, since audio.transcriptions.create does not expect a text prompt as input. The response, however, is a simple JSON document containing the string representing the transcript of the file. The whisper-1 model is OpenAI's speech-to-text system, designed for automatic speech recognition (ASR) tasks such as transcribing spoken language into written text. It is built on top of the open-source Whisper architecture, which was trained on a large, diverse, and multilingual dataset, giving it strong performance across many languages, dialects, and noisy environments.
model = "whisper-1"
try:
# Read mp3 file as binary file
audio_file = open(mp3_voice_sample_english, "rb")
# Generate transcript by calling the endpoint
response = client.audio.transcriptions.create(model=model, file=audio_file)
# Show generated transcription
print(f"Transcription:\n{response.text}")
except Exception as e:
print(f"An error occurred: {e}")
Transcription: The invention of movable metal letters in the middle of the 15th century may justly be considered as the invention of the art of printing.
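The endpoint also accepts a few optional parameters. For example, the sketch below requests plain text output via response_format and passes a language hint (an ISO-639-1 code); both values are illustrative and can be omitted:

model = "whisper-1"

try:
    with open(mp3_voice_sample_english, "rb") as audio_file:
        # Request plain text output and pass a language hint for the recording
        transcript = client.audio.transcriptions.create(
            model=model, file=audio_file, response_format="text", language="en"
        )
    # With response_format="text", the return value is a plain string rather than a JSON object
    print(f"Transcription:\n{transcript}")
except Exception as e:
    print(f"An error occurred: {e}")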
Translations¶
In contrast to transcriptions, the audio.translations.create endpoint is designed to translate spoken audio from one language into another written language, typically English. You provide an audio file in a supported format (like MP3, WAV, or M4A), and the model processes the audio to generate a translated text output. This endpoint also uses the Whisper model, which is capable of both transcription and translation, making it suitable for multilingual audio content. This endpoint is commonly used for localizing audio content, such as interviews, lectures, podcasts, or international meetings, into a target language for broader accessibility. It enables applications like multilingual transcription services, global content distribution, and real-time translation tools, where users can understand spoken content in languages they may not speak. By automating translation of audio, it reduces the need for manual translation and allows developers to integrate multilingual capabilities directly into their software.
As before, let's use a short mp3 file containing a voice recording of a Spanish speaker to test the endpoint.
Audio(mp3_voice_sample_spanish)
The use of this endpoint is very similar to the one for generating transcriptions; see the code cell below. Again, the important difference compared to most endpoints is that one of the main arguments is an audio file instead of a text prompt. We also use the same model whisper-1, as it was trained on a multilingual dataset and is therefore also capable of translating voice recordings (here from Spanish to English).
model = "whisper-1"
try:
# Read mp3 file as binary file
audio_file = open(mp3_voice_sample_spanish, "rb")
# Generate transcript by calling the endpoint
response = client.audio.translations.create(model=model, file=audio_file)
# Show generated translation
print(f"Translation:\n{response.text}")
except Exception as e:
print(f"An error occurred: {e}")
Translation: Now, he said, the door below is well closed. With keys and locks, Tobias replied. The boards are solid, lined with iron, and the posts, the posts too.
The audio endpoint using the Whisper model for speech-to-text is powerful for transcription and translation but has several limitations. Accuracy can be affected by background noise, overlapping speakers, heavy accents, or low-quality audio recordings, which may lead to misinterpretations or incomplete transcriptions. The model may also struggle with specialized vocabulary, technical jargon, or rare languages, limiting its reliability for domain-specific applications. There are also practical constraints like latency, file size limits, and rate restrictions. Large audio files require longer processing times and more tokens, which can increase costs and slow real-time applications. Additionally, the API outputs primarily textual transcriptions, meaning it does not natively provide speaker diarization, emotion detection, or other enriched audio analyses without additional processing. These factors make the audio endpoints best suited for general-purpose transcription and translation rather than high-precision or highly interactive audio analysis tasks.
Images¶
The images.generate endpoint is used for text-to-image generation, where you provide a written prompt and the model produces an image that matches the description. This endpoint leverages models like gpt-image-1, which are trained to interpret natural language prompts and translate them into detailed, coherent visuals. You can also specify parameters such as image size or the number of variations, giving flexibility in how images are generated. It is commonly used for creative content generation, design prototyping, and visualization tasks. For example, developers and artists can use it to generate concept art, marketing materials, or illustrative graphics without requiring manual design work. Businesses can also use it to create product mockups, social media visuals, or personalized artwork at scale. By enabling programmatic image generation, this endpoint allows applications to dynamically produce visual content tailored to user input or specific scenarios.
Let's run — or at least try to run — this function with a simple prompt. The gpt-image-1 model is designed for image generation and editing based on text prompts. It allows users to create new images from scratch, modify existing ones (such as adding, removing, or altering objects), and even apply style transformations, all guided by natural language descriptions. In the example below, we only use the model to generate an image for a given text prompt. Apart from the model and the prompt, the size of the generated images is an important argument here. At the time of writing, the only supported values are: "1024x1024", "1024x1536", "1536x1024", and "auto". The response of the call is not the rendered image itself but, depending on the model and request parameters, either a URL pointing to the generated image or the image data as a base64-encoded string. If needed, the image can then be saved locally using some additional simple code (see the sketch further below).
model = "gpt-image-1"
prompt = "A snowman with a colorful drink standing on a beach."
try:
# Generate image by calling the endpoint
response = client.images.generate(model=model, prompt=prompt, size="1024x1024")
# Print URL pointing to the generated image
print("Generated image URL:", response.data[0].url)
except Exception as e:
print(f"An error occurred: {e}")
An error occured: Error code: 403 - {'error': {'message': 'Your organization must be verified to use the model `gpt-image-1`. Please go to: https://platform.openai.com/settings/organization/general and click on Verify Organization. If you just verified, it can take up to 15 minutes for access to propagate.', 'type': 'invalid_request_error', 'param': None, 'code': None}}
Important: Depending on your exact OpenAI account, you are likely to get an error. The OpenAI API endpoint for generating images requires a verification process primarily to prevent misuse and ensure responsible access. Image generation models can be used to create realistic or harmful visual content, including deepfakes, violent imagery, or copyrighted material. By requiring identity verification, OpenAI can confirm that users are accountable for how they use the service, which helps mitigate risks associated with the distribution of unsafe or illegal content. This also allows OpenAI to enforce usage policies and comply with legal and regulatory requirements. Additionally, verification helps manage resource allocation and abuse prevention. Generating images is computationally intensive, and unrestricted access could lead to excessive or malicious usage that impacts service availability for legitimate users. By verifying accounts, OpenAI can tie usage to a responsible entity, monitor usage patterns, and apply safeguards such as rate limits, ensuring that the API remains reliable and sustainable for everyone. This process balances accessibility with safety and ethical considerations for AI-generated visual content.
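If the call succeeds (e.g., on a verified account), the generated image still needs to be saved locally. Depending on the model, the response item may carry a url or base64-encoded data in b64_json; the sketch below handles both cases and assumes a successful response object from the cell above:

import base64
import urllib.request

try:
    image = response.data[0]
    if getattr(image, "b64_json", None):
        # Image returned as base64-encoded data
        with open("generated_image.png", "wb") as f:
            f.write(base64.b64decode(image.b64_json))
    elif getattr(image, "url", None):
        # Image returned as a URL pointing to the generated file
        urllib.request.urlretrieve(image.url, "generated_image.png")
    print("Image saved to generated_image.png")
except Exception as e:
    print(f"An error occurred: {e}")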
Files & Fine-tuning¶
The OpenAI API provides functionality not only for interacting with pre-trained models but also for customizing models through fine-tuning. Two key endpoints in this process are files.upload and fine_tunes.create. These endpoints enable developers to prepare training data, upload it to OpenAI, and create a fine-tuned version of an existing model that is better suited for specific tasks or domains:
- files.upload: this endpoint allows users to securely upload files to OpenAI's servers for use in fine-tuning or other purposes. Typically, these files are in JSONL format, where each line represents a training example with a "prompt" and a "completion" field. This endpoint ensures that the training data is properly formatted, stored, and accessible for subsequent fine-tuning processes. Users receive a file ID after uploading, which acts as a reference to the uploaded file in other API calls.
- fine_tunes.create: this endpoint is used to initiate a fine-tuning job. This endpoint takes the base model (e.g., davinci or curie) and the uploaded file ID as input and begins the process of training a new model that adapts the base model to the custom dataset. Fine-tuning allows the model to perform better on domain-specific tasks, produce consistent outputs, and follow particular styles or instructions defined in the training data.
Together, these endpoints form a workflow that makes model customization accessible and manageable. By uploading curated training datasets and initiating fine-tuning jobs, developers can leverage OpenAI's powerful models while tailoring them to specific business, research, or application requirements. This capability is particularly valuable for applications that require high accuracy in niche domains, specialized terminology, or unique response patterns that generic pre-trained models may not handle optimally. However, since fine-tuning a model involves the collection, cleaning, and preparation of a training dataset, using those endpoints is beyond the scope of this notebook.
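Although fine-tuning is beyond the scope of this notebook, the sketch below outlines the two steps using the current Python SDK, where the corresponding calls are named client.files.create and client.fine_tuning.jobs.create (slightly different from the endpoint names above); the training file name and base model are placeholders:

try:
    # Step 1: Upload a JSONL training file (placeholder file name)
    training_file = client.files.create(file=open("training_data.jsonl", "rb"), purpose="fine-tune")
    # Step 2: Start a fine-tuning job on a base model (placeholder model name)
    job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
    print(f"Fine-tuning job started: {job.id}")
except Exception as e:
    print(f"An error occurred: {e}")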
Legacy Endpoints¶
The legacy endpoints of the OpenAI API are the older interfaces that were used before the introduction of the chat-based API and other specialized endpoints. They are still available for backward compatibility, but OpenAI recommends using the newer endpoints for most tasks. The main legacy endpoints are briefly covered in this section.
Text Completion¶
The completions.create endpoint is designed for single-turn text generation. You provide a prompt, and the model generates a continuation or response based on that prompt. This endpoint is suitable for tasks such as summarization, creative writing, code generation, or completing a text snippet. It uses the traditional completion models (like text-davinci-003) and does not require the chat-style messages format. You can control the output using parameters like max_tokens, temperature, and stop sequences.
Note that this endpoint is considered a legacy endpoint because it predates the newer chat-based API (chat.completions.create) that is now the recommended standard for most text generation tasks. While it still works for generating text from a prompt, it lacks the built-in mechanisms for managing conversational context that chat models provide. OpenAI’s newer chat endpoints are more flexible and better suited for modern applications, including multi-turn interactions, role conditioning (system, user, assistant), and improved handling of instructions. The chat API also supports newer, more capable models like GPT-4 and GPT-4o, which are optimized for conversational reasoning. Because of these advantages, completions.create is now mostly maintained for backward compatibility, while developers are encouraged to use chat.completions.create for new projects.
Edits¶
The edits.create endpoint of the OpenAI API is a legacy endpoint that was specifically designed to edit or correct a piece of text. Its main purpose was to take an existing input string and an instruction, then return the edited version of the text. This was a dedicated API for tasks like proofreading, rewriting a sentence to be more concise, or fixing grammatical errors. The key advantage was its focused functionality for a specific task, differing from general text generation. However, this endpoint, along with the models it was designed for, has been officially deprecated by OpenAI. The models text-davinci-edit-001 and code-davinci-edit-001 were created for this purpose. OpenAI now recommends using the chat.completions.create or responses.create endpoints with their newer, more capable models for all text editing and correction tasks. The modern chat-based models, such as gpt-3.5-turbo and gpt-4, are more versatile and can perform editing tasks with greater accuracy and flexibility by simply providing a clear instruction in the prompt.
Responses API — Practical Examples¶
The responses.create endpoint is the next-generation interface of the OpenAI API, designed to go beyond traditional chat completions. Unlike the older chat.completions.create endpoint — which relies on a fixed messages structure with roles — the Responses API accepts a more flexible input format and provides richer, structured outputs. It supports not only text generation, but also stateful conversations, tool usage, and structured response formats such as JSON. This makes it especially well-suited for building agent-like applications, assistants that interact with external systems, and workflows requiring precise control over output.
In the previous overview of all endpoints, we only included a very basic example of using this endpoint. This section now provides more details. We will first walk through the different arguments supported by responses.create, explaining how each affects the model's behavior. We will then explore a series of practical examples, starting with simple text generation and moving towards more advanced use cases such as structured outputs, reasoning control, and multi-turn interactions with memory. By the end, you should have a clear understanding of how to leverage the Responses API to build powerful and flexible AI applications.
Input Arguments¶
In our introductory example for the responses.create endpoint, we only considered the two required arguments model and input. However, the endpoint provides a whole series of arguments to make its use more flexible and its output more customizable. The table below provides an overview of the input arguments supported by the endpoint.
| Argument | Purpose |
|---|---|
| `model` (required) | Chooses which language model to use (e.g., `gpt-3.5-turbo`) |
| `input` (required) | Main prompt content (text or structured list) |
| `previous_response_id` | Enables context retention from prior responses |
| `temperature` | Adjusts randomness and creativity of responses |
| `top_p` | Alternative sampling parameter to control diversity |
| `max_output_tokens` | Caps the length of generated output |
| `stream` | Enables streaming of response data |
| `tools` | Activates built-in or custom tools like search or function calling |
| `reasoning` | Controls internal reasoning style or effort |
| `truncation` | Manages input truncation behavior automatically |
| `instructions` | Higher-level instructions embedded in the prompt |
| `metadata` | Attach custom metadata for tracking purposes |
| `user` | Identifies the end user for logging or policy use |
The purpose/effect of some arguments is arguably more intuitive than for others. For example, max_output_tokens allows you to manually restrict the length of the response. The temperature argument controls how deterministic or creative the model's output is. A lower value (close to 0) makes the model more focused and deterministic, consistently choosing the most likely responses, which is useful for tasks requiring accuracy and reliability. A higher value (closer to 1 or above) increases randomness, encouraging more diverse and creative outputs, which is useful for brainstorming or generating varied text; top_p behaves in a similar way. We will see in more detail how some of the more interesting arguments work when going through practical examples next.
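As a quick first illustration, the sketch below combines a low temperature (for a more deterministic answer) with max_output_tokens to cap the response length; the specific values are arbitrary:

model = "gpt-3.5-turbo"
prompt = "Suggest a name for a coffee shop on the moon."

try:
    # Low temperature for a more deterministic answer, capped at 50 output tokens
    response = client.responses.create(model=model, input=prompt, temperature=0.2, max_output_tokens=50)
    print(response.output_text)
except Exception as e:
    print(f"An error occurred: {e}")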
Basic Text Generation¶
We already saw the most basic use of the responses.create endpoint for generating a response text for a given prompt; the example below is almost identical to the one above. However, instead of just printing the generated text, here we print the complete JSON response object, just to see at least once what kind of information a response from the API typically contains.
model = "gpt-3.5-turbo"
prompt = "What is the distance between the earth and the moon?"
try:
# Generate model response by calling the endpoint
response = client.responses.create(model=model, input=prompt)
# Print complete JSON object of response
print(response.model_dump_json(indent=2))
except Exception as e:
print(f"An error occurred: {e}")
{
"id": "resp_68bbf28c88f48193a41c12f3cad3dffc0b65059e0aec4874",
"created_at": 1757147788.0,
"error": null,
"incomplete_details": null,
"instructions": null,
"metadata": {},
"model": "gpt-3.5-turbo-0125",
"object": "response",
"output": [
{
"id": "msg_68bbf28cf9808193aeb04fd45f4eecdc0b65059e0aec4874",
"content": [
{
"annotations": [],
"text": "The average distance between the Earth and the Moon is about 238,855 miles (384,400 kilometers). However, this distance can vary due to the elliptical orbit of the Moon around the Earth.",
"type": "output_text",
"logprobs": []
}
],
"role": "assistant",
"status": "completed",
"type": "message"
}
],
"parallel_tool_calls": true,
"temperature": 1.0,
"tool_choice": "auto",
"tools": [],
"top_p": 1.0,
"background": false,
"max_output_tokens": null,
"previous_response_id": null,
"reasoning": {
"effort": null,
"generate_summary": null,
"summary": null
},
"service_tier": "default",
"status": "completed",
"text": {
"format": {
"type": "text"
},
"verbosity": "medium"
},
"truncation": "disabled",
"usage": {
"input_tokens": 18,
"input_tokens_details": {
"cached_tokens": 0
},
"output_tokens": 42,
"output_tokens_details": {
"reasoning_tokens": 0
},
"total_tokens": 60
},
"user": null,
"max_tool_calls": null,
"prompt_cache_key": null,
"safety_identifier": null,
"store": true,
"top_logprobs": 0
}
A JSON response from the OpenAI API typically contains the model's generated output along with metadata about the request and response. For example, in the case of the responses.create or chat.completions.create endpoint, the JSON includes the generated text (or multiple possible outputs), information about the role of the message (e.g., "system", "assistant", "user"), and usage statistics like token counts. This structure ensures that both the content and important context are returned, allowing developers to handle not only the model's answer but also performance and cost-tracking details.
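For example, the token counts in the usage block can be read directly from the response object of the previous cell, which is handy for simple cost tracking:

# Inspect token usage of the previous request (useful for cost tracking)
print(f"Input tokens:  {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Total tokens:  {response.usage.total_tokens}")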
Text Generation with Instructions¶
The instructions field in the responses.create endpoint is used to provide high-level guidance to the model, similar to how the "system" role works in the older Chat Completions API. Instead of mixing long-winded setup text into the main input, you can separate global directives (e.g., "You are a helpful tutor who explains concepts step by step" or "Always respond in JSON format") into the instructions parameter. This helps clearly distinguish meta-level guidance from the actual user request, making prompts easier to manage, reuse, and maintain. This is especially useful when building applications where the style, tone, or constraints must remain consistent across multiple interactions. By centralizing these instructions, you avoid repeating them in every input and reduce prompt engineering overhead. For example, you might set instructions to ensure the assistant always responds formally, while the input can change dynamically with each user query. This separation of concerns makes your system more modular and easier to scale for complex or multi-turn use cases.
model = "gpt-3.5-turbo"
prompt = "Explain what photosynthesis is."
instructions = "Write all responses in Internet slang."
try:
# Create a response with high-level instructions
response = client.responses.create(model=model, instructions=instructions, input=prompt)
# Print the model's output
print(response.output_text)
except Exception as e:
print(f"An error occurred: {e}")
Photosynthesis iz da process by which green plantz, algae, Nd some bacterias convert light energy into chemical energy in da form of glucose. It involves de use of carbon dioxide, water, nd sunlight to produce oxygen nd glucose. Itz basically how plantz make their own food nd release oxygen into da atmosphere. #ScienceIsLit
Chaining Responses¶
By default, as done in the previous example, all endpoint calls are independent from each other. However, this causes problems for creating the conversations needed in applications such as chatbots. A conversation is characterized by interdependent statements, where each utterance depends on and shapes the meaning of the next. Unlike isolated sentences, conversational statements are linked through context, reference, and coherence: a response usually presupposes what was said before (e.g., answering a question, clarifying a claim, or challenging a point). This creates a chain of dependencies, where meaning is built incrementally rather than contained in single, standalone statements. These dependencies can take different forms, mainly:
Referential dependencies: Referential dependencies in a conversation occur when an utterance uses words or expressions (like pronouns, definite descriptions, or ellipses) that rely on earlier statements for their meaning. For example, in "I met Sarah yesterday." — "Oh, what did she say?", the pronoun she depends on the prior mention of Sarah. These dependencies help maintain coherence while avoiding repetition.
Logical dependencies: These dependencies in a conversation arise when one statement relates to another through reasoning, such as agreement, contradiction, cause-and-effect, or providing evidence. For instance, if someone says, "It's going to rain", and the other replies, "Then we should bring umbrellas", the second statement logically depends on the first. These dependencies drive the progression of ideas and support coherent argumentation or decision-making.
Pragmatic dependencies: A pragmatic dependency in a conversation occurs when a statement's meaning or appropriateness depends on the social context, speaker intentions, or conversational norms rather than just literal content. For example, if someone says, "Can you pass the salt?" the expected response is to hand over the salt, not to comment on one's physical ability. These dependencies ensure that contributions remain relevant, polite, and aligned with the shared goals of the interaction.
In principle, you can preserve the context "manually" by adding all previous prompts and their responses to a history (e.g., a list of previous prompts and responses), and calling the endpoint with this history together with the new prompt; see the sketch below. The Responses API simplifies this by introducing the concept of chaining.
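A minimal sketch of this manual approach, using chat.completions.create and arbitrary example prompts:

model = "gpt-3.5-turbo"
history = []  # manually maintained conversation history

def ask(prompt):
    # Append the user prompt, send the full history, and store the assistant's reply
    history.append({"role": "user", "content": prompt})
    completion = client.chat.completions.create(model=model, messages=history)
    answer = completion.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

try:
    print(ask("What is the largest planet in the solar system?"))
    print(ask("And the smallest?"))
except Exception as e:
    print(f"An error occurred: {e}")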
Chaining responses using the responses.create endpoint allows you to maintain stateful, multi-turn conversations without manually resending the entire conversation history. By passing the previous_response_id from an earlier response, the API can automatically carry forward context, instructions, and relevant conversation data. This simplifies building applications where the model needs to remember prior interactions, such as chatbots, tutoring systems, or task-oriented assistants. It effectively enables the model to "remember" what has been discussed, reducing the need for clients to reconstruct full conversation context on every call. This approach is particularly useful for agentic or multi-step workflows, where the model’s response in one step might influence the next step. For example, you could generate an initial answer, then ask the model to refine, expand, or perform additional calculations based on that output. Chaining responses in this way not only improves efficiency and reduces token usage but also makes your application logic cleaner and easier to maintain, since the API handles context propagation automatically.
The code cell below shows a simple example of chaining by calling the endpoint twice, where the second call uses the id of the first response as the value for the previous_response_id argument. Notice how the second prompt would be completely meaningless without the context of the first prompt and its returned response. Using chaining allows you to mimic natural conversation much more closely.
model = "gpt-3.5-turbo"
prompt1 = "What is the largest planet in the solar system?"
prompt2 = "And the smallest?"
try:
# Step 1: Initial prompt
response1 = client.responses.create(model=model, input=prompt1)
print(f"Step 1 response:\n{response1.output_text}\n")
# Step 2: Follow-up prompt using previous_response_id to continue the conversation
response2 = client.responses.create(model=model, input=prompt2, previous_response_id=response1.id)
print(f"Step 2 response:\n{response2.output_text}")
except Exception as e:
print(f"An error occured: {e}")
Step 1 response:
Jupiter is the largest planet in our solar system.

Step 2 response:
Mercury is the smallest planet in the solar system.
Of course, chaining can be used across more than just two prompts. However, the overall process remains the same: the value of the previous_response_id argument for the next API call is set to the id of the response from the previous call.
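As a minimal sketch, the loop below chains three prompts (chosen only for illustration), each call passing on the id of the previous response:

model = "gpt-3.5-turbo"
prompts = [
    "What is the largest planet in the solar system?",
    "And the smallest?",
    "How many moons does the larger of the two have?",
]
previous_id = None
try:
    for prompt in prompts:
        # Pass the id of the previous response (None for the first call) to carry the context forward
        response = client.responses.create(model=model, input=prompt, previous_response_id=previous_id)
        print(f"{prompt}\n{response.output_text}\n")
        previous_id = response.id
except Exception as e:
    print(f"An error occurred: {e}")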
While chaining can enable sophisticated workflows (e.g., multi-step reasoning, retrieval-augmented generation, or structured data extraction) it introduces the risk of error propagation. If an early step produces an incorrect, ambiguous, or incomplete result, subsequent steps may compound the mistake, ultimately affecting the accuracy and reliability of the final output. Unlike a single-step response, the correctness of the chain depends not only on the model’s reasoning but also on the design of the intermediate prompts and how information is passed between steps. Chaining also comes with practical constraints. Each additional step requires a separate API call, which increases latency and cost, potentially making real-time applications slower or more expensive. Maintaining context across multiple steps can be challenging, especially for long or complex workflows, as the model may lose track of earlier details if prompts are not carefully structured. Developers often need to implement state management, caching, or summarization strategies to preserve context and ensure consistency, adding complexity to the application. While powerful, chaining requires careful planning to balance accuracy, performance, and cost.
Reasoning Levels¶
The reasoning argument allows you to control how the model approaches complex tasks or multi-step problem solving. Essentially, it provides a way to guide the model's internal reasoning process, such as specifying whether it should use more thorough, step-by-step thinking or produce a quicker, less detailed response. This can be particularly useful when accuracy, logical consistency, or structured problem solving is important, for example in coding tasks, math problems, or data analysis. By adjusting the reasoning parameter, developers can balance performance and computational cost. For instance, a higher reasoning effort might generate more reliable and detailed explanations but could take slightly longer and consume more tokens, while a lower reasoning setting could be used for simpler tasks where speed is more critical than depth. This makes it a powerful tool for tailoring the model’s behavior to the specific needs of your application, ensuring that outputs are aligned with your desired level of rigor and reliability. Valid values for reasoning.effort are
low: The model performs minimal reasoning, leading to faster responses with less detailed explanations. This setting is suitable for straightforward tasks where speed is prioritized over depth.
medium: The default setting, balancing between speed and detailed reasoning. It provides adequate depth for most tasks without significant delays.
high: The model engages in extensive reasoning, offering thorough and detailed explanations. This setting is ideal for complex problems requiring careful analysis and step-by-step breakdowns.
However, it is important to note that not all models support all parameters within reasoning. Models that support it include o1, o3, o3-mini, and o4-mini. If you try using a different model, the endpoint will respond with a corresponding error code and error message. The example below uses the o4-mini model to process the same prompt using the three available levels of reasoning effort.
model = "o4-mini"
prompt = "What is heavier, 1 kg of lead or 1 kg of wood?"
try:
# Low reasoning effort
response_low = client.responses.create(model=model, input=prompt, reasoning={"effort": "low"})
print(f"Low effort response:\n{response_low.output_text}\n")
# Medium reasoning effort
response_medium = client.responses.create(model=model, input=prompt, reasoning={"effort": "medium"})
print(f"Medium effort response:\n{response_medium.output_text}\n")
# High reasoning effort
response_high = client.responses.create(model=model, input=prompt, reasoning={"effort": "high"})
print("High effort response:\n", response_high.output_text)
except Exception as e:
print(f"An error occured: {e}")
Low effort response:
Both have the same mass—1 kg—so in that sense neither is heavier. (If you put each on a scale in air, the lead will register very slightly heavier, because the bulkier wood displaces more air and so experiences a larger buoyant lift. But by definition of mass, 1 kg of lead = 1 kg of wood.)

Medium effort response:
They weigh the same: 1 kg of lead is 1 kg of wood. (If you put them on a scale in air, the wood will actually register a tiny bit less—because it displaces more air and so experiences a slightly larger buoyant force—but their true masses are identical.)

High effort response:
Neither one is heavier—1 kg of lead and 1 kg of wood both have the same mass (1 kg). The only difference is density: the wood takes up much more volume than the lead. (In air, the wood’s greater volume displaces more air and so experiences a slightly larger buoyant force, which can make it register a tiny bit “lighter” on a scale—but their true masses are identical.)
Keep in mind that different reasoning levels often return the same output because the underlying model may already produce an answer that is sufficiently confident and unambiguous. In cases where the question is straightforward or the context is clear, increasing the reasoning level does not change the result, since the model's standard reasoning is already adequate to generate a correct or complete response. Essentially, the extra reasoning capacity may not be "activated" if the problem does not require deeper or multi-step inference. Another factor is that the reasoning levels primarily influence the model's internal deliberation process rather than guaranteeing a different external output. The model may internally consider more reasoning chains at higher levels, but if all chains converge to the same conclusion, the final response remains unchanged. This behavior is typical for deterministic or high-confidence questions, whereas more complex, ambiguous, or multi-step problems are where differences between reasoning levels tend to manifest more clearly.
Analyzing Images¶
The responses.create endpoint supports image analysis by allowing you to send an image (or multiple images) as part of the input alongside text. Instead of only handling text prompts, the model can process visual data, extract relevant features, and generate insights in natural language. This enables tasks such as describing an image, answering questions about its content, identifying objects, reading text within an image, or reasoning about diagrams and charts.
The key idea is that the model treats the image as an input modality in the same way it handles text, enabling multimodal reasoning. You simply provide an image file or URL in the request, and the model integrates the visual information with the text prompt to produce a coherent response. This makes it possible to build applications like visual assistants, content moderation tools, document analyzers, or tutoring systems that can explain diagrams or handwritten notes. The models that support image inputs through the endpoint are: gpt-4o, gpt-4.1, gpt-4.5, o3, and o4-mini.
Let's consider a public domain image of a dog from Wikimedia Commons, but you can change the URL to test the endpoint using different images.
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/Hazel_the_German_Shepherd_by_a_river.png/330px-Hazel_the_German_Shepherd_by_a_river.png"
Image(url=image_url)
Since we now have to combine both a text prompt and the information (i.e., the URL) about the image, we cannot just submit a simple string as the prompt. Instead, we have to define the prompt as a list containing all the relevant "parts". The code below shows a simple example of how this can be done.
# Define prompt incl. the link to the image to be analyzed
prompt = [
    {
        "role": "user",
        "content": [
            {"type": "input_text", "text": "Please describe this image in 1-2 sentences."},
            {"type": "input_image", "image_url": image_url}
        ]
    }
]
Using this structured prompt as the input, we can call the responses.create endpoint as usual. The output itself is generated text, in our case describing the content of the provided image.
model = "o4-mini"
try:
# Analyze image by calling the endpoint
response = client.responses.create(model=model, input=prompt)
# Print the generates response text
print(response.output_text)
except Exception as e:
print(f"An error occured: {e}")
A tan German Shepherd–type dog stands alert on a rocky bank beside a calm creek, its ears perked as it gazes off-camera. Lush green trees and shrubs line the water’s edge, with sunlight filtering through the foliage and reflecting on the stream.
While the endpoint can describe content, answer questions about an image, or extract text, its understanding is bounded by the model's training and inference capabilities. It may struggle with fine-grained visual details, subtle patterns, or highly technical diagrams. Complex visual reasoning, such as interpreting charts with multiple variables, detecting anomalies, or performing precise measurements, can be unreliable. Similarly, the model may misinterpret ambiguous elements or overlook small but important features, limiting its accuracy for tasks that require expert-level visual analysis.
Another limitation is contextual dependency and output variability. The model's responses are influenced by the prompt, the way the image is presented, and the reasoning level selected, which can result in inconsistent or incomplete analyses across similar images. Additionally, the endpoint does not perform true symbolic computation or object-level recognition like specialized computer vision models; it produces text-based interpretations, meaning it cannot directly return structured visual data (e.g., bounding boxes, pixel-level labels) without extra tooling or post-processing. These factors make it best suited for general image understanding rather than precise or highly technical visual tasks.
Tools¶
In the OpenAI Responses API, tools are specialized functionalities that extend the capabilities of language models, enabling them to perform specific tasks beyond text generation. These tools allow models to interact with external systems, process data, and execute actions, making them suitable for building intelligent agents capable of handling complex workflows. The main purposes of tools are:
Enhancing Model Capabilities: By integrating tools, models can access real-time information, perform computations, and interact with external services, thereby providing more accurate and context-aware responses.
Streamlining Workflows: Tools enable models to automate tasks such as web searches, file retrieval, and code execution, reducing the need for manual intervention and improving efficiency.
Building Intelligent Agents: Combining models with tools allows developers to create agents that can autonomously handle a variety of tasks, from answering questions to performing actions on behalf of users.
The following table outlines the built-in tools available in the Responses API:
| Tool Name | Description |
|---|---|
| Web Search | Retrieves up-to-date information from the internet. |
| Function Calling | Gives models access to new functionality and data they can use to follow instructions and respond to prompts. |
| File Search | Searches and retrieves information from uploaded files. |
| Computer Use | Executes tasks on a computer, such as opening applications or interacting with the file system. |
| Code Interpreter | Executes code in a sandboxed environment for computations and data analysis. |
| Image Generation | Generates images from text prompts. |
| Hosted MCP Tools | Integrates with external Model Context Protocol (MCP) servers to access custom tools. |
These tools can be utilized by specifying them in the tools parameter when making a request to the responses.create endpoint. This integration allows models to perform actions such as fetching real-time data, executing code, and interacting with external systems, thereby enhancing their utility in various applications. Let's look at two tools, Web Search and Function Calling, more closely by providing working examples.
Web Search¶
The web_search tool allows the model to access real-time information from the internet. Instead of relying solely on its pretrained knowledge, which may be outdated or limited, the model can perform web searches to retrieve the latest data, news, or specific details relevant to a user query. This is particularly useful for questions about current events, recent research, or niche topics that are unlikely to be fully covered in the model's training data. By integrating web search, responses can be more accurate, timely, and factually grounded. The benefits of using the web_search tool include improved reliability and relevance of answers, as the model can validate or supplement its knowledge with external sources. It also allows developers to build applications that handle dynamic, real-world queries, such as generating summaries of recent news, providing up-to-date statistics, or retrieving specific product information. Overall, the web search tool expands the model's utility beyond static knowledge, enabling more versatile and trustworthy AI-powered solutions.
model = "gpt-4o-mini"
prompt = "What is the current temperature in Singapore?"
try:
response = client.responses.create(model=model, input=prompt, tools=[{"type": "web_search"}])
print(response.output_text)
except Exception as e:
print(f"An error occured: {e}")
As of 4:36 PM local time in Singapore, the current temperature is 29°C (84°F) with partly cloudy skies.

## Weather for Singapore, Singapore:
Current Conditions: Clouds and sun, 91°F (33°C)
Daily Forecast:
* Saturday, September 6: Low: 80°F (26°C), High: 91°F (33°C), Description: Some sun, then turning cloudy
* Sunday, September 7: Low: 78°F (26°C), High: 87°F (30°C), Description: Cloudy with a thunderstorm in spots
* Monday, September 8: Low: 77°F (25°C), High: 86°F (30°C), Description: Cloudy with a thunderstorm in spots
* Tuesday, September 9: Low: 77°F (25°C), High: 85°F (30°C), Description: Remaining cloudy with a couple of thunderstorms
* Wednesday, September 10: Low: 78°F (26°C), High: 89°F (31°C), Description: Cloudy; a morning thunderstorm in parts of the area followed by a drenching thunderstorm in the afternoon
* Thursday, September 11: Low: 79°F (26°C), High: 87°F (30°C), Description: Remaining cloudy with a couple of thunderstorms, especially early in the day
* Friday, September 12: Low: 79°F (26°C), High: 88°F (31°C), Description: Mostly cloudy with a thunderstorm in spots
When using the web search tool, one key limitation is latency and reliability. The model relies on external search results, which can vary in availability, relevance, and freshness. Network issues, temporary unavailability of certain pages, or poorly indexed content may lead to incomplete or outdated information. Additionally, the model's interpretation of search results can introduce errors if it misreads context or overgeneralizes from limited snippets, potentially producing responses that are inaccurate or misleading.
Another limitation is scope and filtering. The web search tool retrieves content from the open web, which may include biased, low-quality, or unverified sources. The model cannot independently verify facts beyond what is retrieved, so the quality of answers heavily depends on the sources found. Furthermore, there are practical rate limits and token constraints that affect how many searches can be performed in a single session, which can restrict the depth of research and the number of results the model can effectively process. These factors make it most suitable for general information retrieval rather than high-stakes decision-making or specialized research.
Function Calling¶
The idea behind function calling in the OpenAI API is to let the model act as a natural language interface between the user and external functions or tools. Instead of returning only plain text, the model can recognize when a user’s request matches a function you’ve defined (e.g., retrieving weather data, querying a database, or running calculations) and respond by outputting a structured JSON object with the function name and its arguments. This structured output can then be programmatically executed by your system, ensuring reliable integration between natural language inputs and precise function calls. Function calling bridges human-friendly conversation with machine-executable actions, enabling more interactive and automated workflows.
Example scenario: Let's assume we are building a chatbot for the city that allows users to ask how crowded public places such as shopping malls, parks, museums, etc. are. A traditional LLM has no access to this localized and real-time information. However, we have a system that is able to estimate the crowd level at different locations, perhaps by analyzing sensor data, noise levels, or camera feeds. In short, for a given location, we can get the crowd level at that location as a value between, say, $0$ (low) and $1$ (high). The function get_crowd_level() below illustrates this idea by returning a random value for the crowd level. Of course, in practice, this function would contain the logic to retrieve the actual crowd level (e.g., from a database).
def get_crowd_level(location):
    # The proper logic to retrieve the crowd level would be here
    return {"location": location, "crowd_level": random.random()}
The endpoint and the LLM now need to know about your function and how it can be used. This is again done using the tools argument; see the code cell below. In the case of a function, we need to specify the name as well as the expected parameters (arguments) of the function. We also need to add a meaningful description so that the LLM "knows", given a prompt, whether the function should be used and which parts of the prompt represent the parameters.
tools = [
    {
        "type": "function",
        "name": "get_crowd_level",
        "description": "Get the current crowd level at a given location from 0 (low) to 1 (high)",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The public space (shopping mall, museum, train station, parks, tourist spots, etc.)",
                },
            },
            "required": ["location"],
        },
    }
]
We can now prompt the responses.create endpoint; see the example below. Instead of giving the endpoint the prompt directly as a simple string, we wrap it into a conversation variable, which is just a simple list; we will see in a bit why we are doing this. The code cell below shows two example prompts to illustrate that not all prompts require calling our function. The prompt "How many people are currently in VivoCity?" clearly indicates that a user is interested in the crowd level at VivoCity (a large shopping mall in Singapore) and therefore would rely on our function to give a proper response. In contrast, the prompt "How far is the VivoCity mall from Changi Airport?" would not. By default, we let the LLM decide if our function is needed or not by setting tool_choice="auto".
model = "gpt-4o-mini"
conversation = [{"role": "user", "content": "How many people are currently in VivoCity?"}]
#conversation = [{"role": "user", "content": "How far is the VivoCity mall from Changi Airport?"}]
try:
response = client.responses.create(
model=model,
input=conversation,
tools=tools,
tool_choice="auto"
)
except Exception as e:
print(f"An error occured: {e}")
The response is, as usual, an elaborate JSON object. Most important here is that the response tells us whether the LLM requires a function call or not; see the first if statement in the code cell below. Note that we assume only a single function call is required to keep the example simple. In principle, the output attribute of the response is a list which may contain multiple function calls.
If indeed a function call is required, the output also tells us which function that is (by its name) as well as the arguments for the respective function call. Again, this is the main purpose of the API endpoint here: taking a prompt in natural language and (a) deciding if and which function calls are required, and (b) extracting all relevant arguments to call those functions. After calling our function get_crowd_level() with the extracted arguments, we add both the call itself and the result of our function to the conversation and make another call to the API endpoint. Sending the whole conversation ensures that the LLM has the full context to generate an appropriate reply.
However, if the response to the initial prompt does not require any function call, we just print the response text; see the else branch at the end of the code cell below.
import json

# Check if the model wants to call a function
if response.output and response.output[0].type == "function_call":
    tool_call = response.output[0]
    # Check which function the model wants to call
    if tool_call.name == "get_crowd_level":
        # Get the extracted arguments and convert them to a JSON object
        args = json.loads(tool_call.arguments)
        print(f"Call {tool_call.name} with arguments {args}\n")
        # Call the function with the extracted arguments to get the result
        data = get_crowd_level(**args)
        # Add the function call and its result to the conversation
        conversation.append(tool_call)
        conversation.append({
            "type": "function_call_output",
            "call_id": tool_call.call_id,
            "output": json.dumps(data)
        })
        try:
            # Get the final response with the function results incorporated
            final_response = client.responses.create(model=model, input=conversation, tools=tools)
            # Print the final response text
            print(f"Final response:\n{final_response.output_text}")
        except Exception as e:
            print(f"An error occurred: {e}")
else:
    # Just print the response text if no function call was involved
    print(f"Final response:\n{response.output_text}")
Call get_crowd_level with arguments {'location': 'VivoCity'}
Final response:
Currently, the crowd level at VivoCity is approximately 15.6%, indicating a low number of people.
As the example shows, function calling provides a structured and reliable bridge between natural language and programmatic actions. Instead of parsing free-form text, developers get well-defined JSON outputs that specify the function name and its arguments, which makes automation more accurate and reduces the risk of misinterpretation. This enables seamless integration with external systems like databases, APIs, or computational tools, and allows users to interact with complex systems through simple natural language queries. Another strength is that developers can constrain the model's outputs by defining available functions and expected parameters, which improves safety, predictability, and alignment with application needs.
However, function calling also has limitations. The model does not actually execute the function but only suggests when and how it should be called. This means developers still need to validate the arguments, handle errors, and ensure security when passing data to external systems. Additionally, function calling is limited by the model's understanding of context: if the user’s request is ambiguous, the model might generate incomplete or incorrect parameters. In high-stakes or mission-critical applications, relying solely on the model's judgment without proper checks can be risky. In practice, function calling is therefore most effective when combined with robust backend logic — using the model for intent recognition and argument filling, while letting the application enforce constraints, validate inputs, and control execution. This balance ensures that developers can take advantage of natural language flexibility without sacrificing reliability or safety.
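As a rough sketch of such backend checks, the hypothetical helper below validates the arguments the model proposes for get_crowd_level() before the function is actually executed; the whitelist of known locations is purely an assumption for illustration:

# Hypothetical whitelist of locations our backend actually supports (illustrative only)
KNOWN_LOCATIONS = {"VivoCity", "Gardens by the Bay", "Changi Airport"}

def validate_crowd_level_args(args):
    # Reject calls with a missing or non-string location
    location = args.get("location")
    if not isinstance(location, str) or not location.strip():
        return False, "Missing or invalid 'location' argument"
    # Only allow locations our system can actually handle
    if location not in KNOWN_LOCATIONS:
        return False, f"Unknown location: {location}"
    return True, None

# Example usage with arguments as extracted from a function call
ok, error = validate_crowd_level_args({"location": "VivoCity"})
if ok:
    data = get_crowd_level(location="VivoCity")
else:
    print(f"Rejected function call: {error}")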
Summary¶
OpenAI provides a suite of APIs that allow developers to integrate powerful large language models (LLMs) into their applications. At its core, the Responses API is the most flexible endpoint, capable of generating text, handling conversations, calling functions, and even using tools like web search or file analysis. Other specialized endpoints include the Embeddings API (for generating numerical vector representations of text, useful for semantic search, clustering, and recommendation systems), the Audio API (for speech-to-text transcription and translation using Whisper), and the Image API (for generating or editing images with models like DALL·E). Together, these endpoints cover a wide range of tasks — from natural language understanding and reasoning to multimodal generation.
Using these APIs, developers can build applications such as conversational agents, code assistants, knowledge retrieval systems, personalized tutors, and content generation tools. The ability to combine endpoints is especially powerful. For example, embeddings can be paired with a vector database for retrieval-augmented generation (RAG), while function calling in the Responses API can connect an LLM to external APIs or business logic. This creates systems where the LLM not only generates text but also acts as an orchestrator of workflows, enabling richer interactivity and automation.
From a practical standpoint, pricing is an important consideration. OpenAI charges on a usage basis, typically measured in tokens for text models and in seconds or images for audio and image models. More capable models (such as GPT-4.1) are priced higher than smaller, faster models (like GPT-4.1-mini), which encourages developers to balance performance with cost depending on their use case. For large-scale applications, optimizing prompts, caching responses, or mixing model tiers can help reduce expenses while maintaining quality.
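As a very simple sketch of the caching idea, identical prompts could be served from an in-memory dictionary instead of triggering a new API call; a production system would more likely use a persistent or distributed cache and also include sampling settings in the cache key:

# Minimal in-memory response cache keyed by model and prompt (sketch only)
response_cache = {}

def cached_response(model, prompt):
    key = (model, prompt)
    # Return the cached answer if this exact prompt has been seen before
    if key in response_cache:
        return response_cache[key]
    response = client.responses.create(model=model, input=prompt)
    response_cache[key] = response.output_text
    return response.output_text

print(cached_response("gpt-4o-mini", "What is the capital of France?"))
# The second call with the same prompt is answered from the cache, saving tokens
print(cached_response("gpt-4o-mini", "What is the capital of France?"))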
Another practical issue is rate limits, which cap the number of requests per minute and the maximum tokens per request. These limits vary by model and account tier, and they play a crucial role in scaling production systems. Applications that expect high user traffic need to be designed with queuing, batching, or fallback strategies to ensure a smooth experience. Additionally, developers must handle errors gracefully, as exceeding rate limits or encountering transient service issues can disrupt workflows.
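A minimal sketch of such graceful error handling is shown below: the request is retried a few times with exponential backoff whenever an exception (e.g., a rate-limit or transient service error) is raised; the retry count and delays are arbitrary illustrative choices:

import time

def create_with_retries(model, prompt, max_retries=3, base_delay=1.0):
    # Retry the request a few times, doubling the wait time after each failure
    for attempt in range(max_retries):
        try:
            return client.responses.create(model=model, input=prompt)
        except Exception as e:
            # Give up after the last attempt
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Request failed ({e}); retrying in {delay:.1f} seconds...")
            time.sleep(delay)

response = create_with_retries("gpt-4o-mini", "What is the largest planet in the solar system?")
print(response.output_text)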