The Ultimate Guide to LLM Structured Outputs
Jul 20, 2025 · 1602 words
Every developer who wants to use LLMs for Agent systems has likely encountered the same confusion: How can I make the LLM output accurate structured content?
It is important to understand that Large Language Models are essentially probabilistic generative models; the content returned by an LLM can differ every time. This is perfectly fine for chat conversations. However, in an Agent system, we often need the LLM to output structured data that can be parsed and passed to downstream components. We need its output to be completely accurate, because if the output format is wrong, the entire system breaks down!
For example, suppose we want to extract a user's name, age, and city from a piece of text.
The most appropriate method is to have the LLM output data in JSON format and then parse it:
```json
{
  "name": "Zhang San",
  "age": 25,
  "city": "Shanghai"
}
```
Method 1: Direct Prompting
The prompt you write might look something like this:
```
Extract the user's name, age, and city from the following user input and output it strictly in JSON format. Note: The output should only contain the JSON content.

Example:
{"name": "Zhang San", "age": 25, "city": "Shanghai"}

User input:
Hello, my name is Zhang San, I am 25 years old, and I currently live in Shanghai.
```
This actually uses a few Prompt Engineering techniques:
- Emphasizing only outputting JSON content to prevent the LLM from mixing in other text.
- Providing an example, which is "One-shot prompting," allowing the LLM to imitate the correct JSON keys directly.
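Wiring this prompt into an actual call and naively parsing the reply might look like the following sketch (the string assembly and the bare `json.loads` call are ours; `client` is assumed to be any OpenAI-compatible client):

```python
import json

prompt = (
    "Extract the user's name, age, and city from the following user input "
    "and output it strictly in JSON format. "
    "Note: The output should only contain the JSON content.\n"
    "Example:\n"
    '{"name": "Zhang San", "age": 25, "city": "Shanghai"}\n'
    "User input:\n"
    "Hello, my name is Zhang San, I am 25 years old, and I currently live in Shanghai."
)

# Assuming `client` is an OpenAI-compatible client instance
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

data = json.loads(reply)  # breaks whenever the model wraps the JSON in a code fence
```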
However, the prompt above is often not enough. You will find that the LLM's output might come in two formats:
Format 1 (JSON only):
{"name": "Zhang San", "age": 25, "city": "Shanghai"}
Format 2 (Wrapped in a code block):
```json
{"name": "Zhang San", "age": 25, "city": "Shanghai"}
```
While the latter format technically "only includes JSON" within the code block, it cannot be directly passed to a JSON parser. Consequently, we have to add a processing step: stripping extra characters from the beginning and end of the result to keep only the JSON information within the curly braces.
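In practice, that processing step usually ends up as a small helper like the sketch below (the helper is ours, not from any library): strip a possible Markdown code fence, then hand the rest to the JSON parser.

```python
import json
import re

def parse_llm_json(text: str) -> dict:
    """Strip an optional Markdown code fence, then parse the remaining JSON."""
    cleaned = text.strip()
    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)  # leading ```json or ```
    cleaned = re.sub(r"\s*```$", "", cleaned)           # trailing ```
    return json.loads(cleaned)

print(parse_llm_json('```json\n{"name": "Zhang San", "age": 25, "city": "Shanghai"}\n```'))
# {'name': 'Zhang San', 'age': 25, 'city': 'Shanghai'}
```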
Soon, you'll realize the system is filled with "patches" just to handle the uncertainty of LLM outputs. So, is there a way to make the LLM behave and give us perfectly accurate JSON?
Method 2: JSON Mode
Making large models output valid JSON is a very common requirement. Therefore, model providers have introduced their own solutions.
OpenAI introduced JSON Mode early on. Enabling it simply requires adding a response_format parameter during the API call. For example:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    # Enable JSON Mode (the messages must mention "JSON", otherwise the API rejects the request)
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract the user's name, age, and city and reply in JSON."},
        {"role": "user", "content": "Hello, my name is Zhang San, I am 25 years old, and I currently live in Shanghai."}
    ]
)
```
JSON Mode guarantees that the model's output will be a valid JSON object. Subsequently, OpenAI also supported passing a JSON Schema into response_format to ensure the generated JSON follows the exact structure you want.
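For reference, here is a sketch of that schema-constrained variant (OpenAI calls it Structured Outputs); the field names follow OpenAI's Chat Completions API and may differ on other providers:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    # Structured Outputs: the model's reply must conform to this exact schema
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_profile",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "number"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"],
                "additionalProperties": False  # required by strict mode
            }
        }
    },
    messages=[
        {"role": "user", "content": "Hello, my name is Zhang San, I am 25 years old, and I currently live in Shanghai."}
    ]
)
```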
Unfortunately, while JSON Mode is useful, OpenAI never pushed it as a cross-vendor standard (which seems to be their usual approach, much as with Function Calling). As a result, different model providers have implemented similar capabilities with incompatible protocols. Since an Agent system often needs to talk to several models, adapting to each provider individually is tedious.
So, is there a method that works across different models to output structured JSON data?
Method 3: Forced Tool Calling
Most people are familiar with Tool Calling (Function Calling). To build an Agent system that performs truly useful work, tool calling is indispensable.
Consequently, most mainstream models now support tool-calling capabilities. While the parameters for implementing tool calls vary slightly between models, they are generally compatible with OpenAI's Function Calling protocol. This is much better than the heavily fragmented JSON Mode protocols.
If we look closely at the Function Calling protocol, we see that tool calls also rely on strict schema definitions: each function's parameters are described with JSON Schema, which is exactly the capability we need for structured output.
Therefore, we only need a clever shift in perspective to use tool calling for generating structured data.
- Original Idea: Ask the LLM for information and demand the return format strictly match JSON.
- New Idea: Provide the LLM with a "questionnaire," and the LLM must strictly fill in the information according to the format on the questionnaire.
This "questionnaire" is the Function Calling protocol.
For the same user information extraction task, our code could look like this:
```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    # Define the user_profile tool as a "questionnaire"
    tools=[{
        "type": "function",
        "function": {
            "name": "user_profile",
            "description": "Fill in user information",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "number"},
                    "city": {"type": "string"}
                },
                "required": ["name", "age", "city"]
            }
        }
    }],
    # Force the model to call the user_profile tool
    tool_choice={"type": "function", "function": {"name": "user_profile"}},
    messages=[
        {"role": "user", "content": "I am Wang Wu, I am 40 years old, and I live in Shenzhen."}
    ]
)
```
Code Explanation:
- The `tools` parameter provides the function definition, including a strict schema that specifies exactly how the information must be filled in.
- The `tool_choice` parameter forces the tool call, requiring the model to fill out this questionnaire instead of producing free-form text.
Bingo! We have cleverly achieved structured information extraction through tool calling.
Compared to the previous two methods, the benefits of this approach are clear:
- Compared to Method 2: The biggest issue with JSON Mode is the lack of a unified protocol across models. Function Calling protocols are largely standardized, so we can write one piece of code that works across multiple models.
- Compared to Method 1: Because Function Calling is such a widely adopted protocol, model providers have specifically fine-tuned their models to follow the JSON Schema strictly during tool calls. This is far more reliable than an ad-hoc prompt.
Code in Action
Let's put this into practice using the previous example. We will use the forced tool calling method combined with Pydantic for data validation.
First, we define the data structure for user information using Pydantic:
```python
import pydantic

class UserProfile(pydantic.BaseModel):
    """Pydantic model for storing and validating user information"""
    name: str = pydantic.Field(description="The user's name")
    age: int = pydantic.Field(description="The user's age")
    city: str = pydantic.Field(description="The city where the user lives")
```
The power of Pydantic lies in its ability to automatically generate a JSON Schema from a data structure, eliminating the risk of errors in manual Schema writing.
```python
schema = UserProfile.model_json_schema()
```
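For the UserProfile model above, the generated schema looks roughly like this (exact keys and ordering can vary slightly between Pydantic versions):

```python
{
    "title": "UserProfile",
    "type": "object",
    "description": "Pydantic model for storing and validating user information",
    "properties": {
        "name": {"title": "Name", "type": "string", "description": "The user's name"},
        "age": {"title": "Age", "type": "integer", "description": "The user's age"},
        "city": {"title": "City", "type": "string", "description": "The city where the user lives"},
    },
    "required": ["name", "age", "city"],
}
```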
The way we call the LLM is similar to what was mentioned before: we provide the tool and force the model to call it. The only difference is that we directly use the Schema automatically generated by Pydantic.
```python
import os
from openai import OpenAI

# Assuming you have set the environment variable OPENAI_API_KEY
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    # Define the user_profile tool as a "questionnaire"
    tools=[{
        "type": "function",
        "function": {
            "name": "user_profile",
            "description": "Fill in user information",
            # Use the JSON Schema generated automatically by Pydantic
            "parameters": UserProfile.model_json_schema()
        }
    }],
    # Force the model to call the user_profile tool
    tool_choice={"type": "function", "function": {"name": "user_profile"}},
    messages=[
        {"role": "user", "content": "I am Wang Wu, I am 40 years old, and I live in Shenzhen."}
    ]
)
```
This time, the LLM's response is no longer text but a tool_calls object. We need to extract the user information filled in by the model.
```python
# Extract the tool call arguments returned by the model (this is a JSON string)
tool_call = response.choices[0].message.tool_calls[0]
arguments_json = tool_call.function.arguments
print(arguments_json)
# Output: {"name":"Wang Wu","age":40,"city":"Shenzhen"}
```
Finally, Pydantic comes to the rescue again. We use the data structure defined at the beginning; Pydantic can automatically parse the JSON and convert it into a Python object, validating the data in the process:
```python
# Use the Pydantic model for final parsing, validation, and instantiation
try:
    user_instance = UserProfile.model_validate_json(arguments_json)
    print("Python object after successful parsing and validation:")
    print(user_instance)
    # Now you can use it like any other Python object
    print(f"Name: {user_instance.name}, Age: {user_instance.age}")
except pydantic.ValidationError as e:
    print(f"Data validation failed: {e}")
```
In this way, we have elegantly implemented structured data output from an LLM. This solution not only allows the model to output structured data stably but also remains compatible with various models. More importantly, we no longer need to write tedious examples or manual Schemas; we just define a data structure, and everything else happens naturally. When we need the LLM to output a different structure, we simply define a new data structure.
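To make the "just define a new data structure" point concrete, the whole pattern fits into one small generic helper. This is our own sketch (the function name and error handling are not from any library), reusing the client defined above:

```python
from typing import Type, TypeVar

import pydantic

T = TypeVar("T", bound=pydantic.BaseModel)

def extract(client, model: str, text: str, schema_cls: Type[T]) -> T:
    """Force a tool call whose arguments follow schema_cls, then validate the result."""
    tool_name = schema_cls.__name__.lower()
    response = client.chat.completions.create(
        model=model,
        tools=[{
            "type": "function",
            "function": {
                "name": tool_name,
                "description": schema_cls.__doc__ or "Fill in the structured data",
                "parameters": schema_cls.model_json_schema(),
            },
        }],
        tool_choice={"type": "function", "function": {"name": tool_name}},
        messages=[{"role": "user", "content": text}],
    )
    arguments_json = response.choices[0].message.tool_calls[0].function.arguments
    return schema_cls.model_validate_json(arguments_json)

# Swap in any Pydantic model without touching the call site:
user = extract(client, "gpt-4o-mini", "I am Wang Wu, I am 40 years old, and I live in Shenzhen.", UserProfile)
print(user)
```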
Summary
Looking through these three solutions, we can see that using tool calling for structured output is currently the most reliable approach. Combined with Pydantic, it allows for an elegant engineering implementation.
You might have heard of industry libraries that encapsulate similar capabilities, such as Instructor, Pydantic AI, or LangChain's with_structured_output. However, I won't directly recommend a specific library because everyone uses different Agent frameworks, and some libraries might not be easy to integrate into your specific project.
Additionally, some libraries secretly modify your prompts. If you want fine-grained manual control, you probably won't like the feeling of your prompts being altered behind the scenes.
Therefore, the best approach is to familiarize yourself with the principles first and master the "Forced Tool Calling + Pydantic" baseline solution. Once you understand the underlying mechanics, you will know which library to use (as they are essentially variations of the solutions we discussed, with added compatibility for different models). This way, you can truly master the toolsets without feeling lost in their abstractions, and you'll be able to identify the root cause immediately when issues arise.
Good luck with implementing structured LLM outputs!