Enforce Structured Outputs from LLMs with PydanticAI


Introduction – LLMs and Structured Outputs

Language models such as OpenAI's GPT series are increasingly used to build tools that understand and generate human-like text.

However, one major challenge is that these models often return unstructured text, which can be unpredictable and difficult to interpret. If you’re expecting clean, structured data, such as a JSON object with keys like ‘first_name’, ‘last_name’, ‘experience’, and ‘primary_skill’, you may find the model returning values in an unstructured form.

Here’s a basic example using OpenAI’s API without any validation to demonstrate this:

```python
#| eval: false
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

response = client.responses.create(
    model="gpt-4o-mini-2024-07-18",
    instructions="Extract first name, last name, years of experience, and primary skill from the job applicant description.",
    input="Khuyen Tran is a data scientist with 5 years of experience, skilled in Python and machine learning.",
)

print(response.output_text)
```

This might output:

- **First Name:** Khuyen
- **Last Name:** Tran
- **Years of Experience:** 5 years
- **Primary Skill:** Python and machine learning

While this is readable to a human, it lacks a structured format like JSON, which makes it difficult to reliably parse and use in downstream applications.
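To see why this is fragile, consider what parsing that free-form answer might look like. The sketch below runs a hand-written regular expression against the bullet-list reply above; the pattern is purely illustrative, and a small change in the model's wording (different labels, no bullets, extra prose) would break it.

```python
import re

# The free-form reply from the previous example (illustrative only)
raw_reply = "- **First Name:** Khuyen - **Last Name:** Tran - **Years of Experience:** 5 years - **Primary Skill:** Python and machine learning"

# A hand-written pattern tied to this exact wording; any change in labels,
# punctuation, or layout would make the extraction fail.
pattern = r"\*\*First Name:\*\* (\w+).*\*\*Last Name:\*\* (\w+).*\*\*Years of Experience:\*\* (\d+)"
match = re.search(pattern, raw_reply, re.DOTALL)

if match:
    first_name, last_name, years = match.groups()
    print(first_name, last_name, int(years))
else:
    print("Could not parse the model's reply")
```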

Pydantic AI and Structured LLM Outputs

PydanticAI helps solve this problem. It combines the power of language models with Pydantic, a Python library for data validation. By doing so, it allows you to define exactly what kind of output you expect and ensures the model sticks to that format.

In this guide, you’ll learn how to:

  • Understand what outputs to expect from language models
  • Use Pydantic to define a “schema” for expected outputs
  • Validate and structure LLM responses automatically
  • Safely build reliable AI agents for real-world data science workflows

The source code for this article can be found here:

Source Code

Prerequisites

Make sure you have the following packages installed:

```bash
#| eval: false
pip install pydantic openai pydantic-ai
```

You also need access to the OpenAI API with a valid key:

```bash
#| eval: false
export OPENAI_API_KEY="your-api-key"
```

Core Workflow: Building a Type-Safe Agent

First, define a Pydantic Model that describes the expected structure of your agent’s output:

```python
#| eval: false
from typing import List

from pydantic import BaseModel


class ApplicantProfile(BaseModel):
    first_name: str
    last_name: str
    experience_years: int
    primary_skill: List[str]
```

This model acts as a contract, ensuring that the language model returns a structured object with the correct fields and types.
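To make the "contract" concrete, here is a small sketch (not part of the original workflow) showing how Pydantic itself enforces the schema: well-formed data is coerced to the declared types, while data that can't be coerced raises a `ValidationError` instead of slipping through.

```python
from pydantic import ValidationError

# Well-formed data passes validation and is coerced to the declared types
profile = ApplicantProfile(
    first_name="Khuyen",
    last_name="Tran",
    experience_years="5",        # coerced from str to int
    primary_skill=["Python"],
)
print(profile.experience_years)  # 5 (an int)

# Malformed data raises a ValidationError instead of passing silently
try:
    ApplicantProfile(
        first_name="Khuyen",
        last_name="Tran",
        experience_years="five",  # cannot be coerced to int
        primary_skill=["Python"],
    )
except ValidationError as e:
    print(e.error_count(), "validation error")
```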

Now, use the output_type parameter to connect this model to your agent:

```python
#| eval: false
from pydantic_ai import Agent

agent = Agent(
    'gpt-4o-mini-2024-07-18',
    system_prompt='Extract name, years of experience, and primary skill from the job applicant description.',
    output_type=ApplicantProfile,
)

result = agent.run_sync(
    'Khuyen Tran is a data scientist with 5 years of experience, skilled in Python and machine learning.'
)
print(result.output)
```

Output:

```
first_name='Khuyen' last_name='Tran' experience_years=5 primary_skill=['Python', 'machine learning']
```

This structured output is safe to pass directly into downstream applications without modification.
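Because result.output is an instance of ApplicantProfile, its fields are already typed Python values rather than strings to parse. As a small illustration, assuming the result from the previous run:

```python
#| eval: false
profile = result.output

# Fields are real typed attributes, not text to re-parse
print(profile.first_name, profile.last_name)  # Khuyen Tran
print(profile.experience_years + 1)           # 6 (experience_years is an int)
print(profile.primary_skill[0])               # Python
```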

result.output is a Pydantic model instance. To convert it into a standard Python dictionary for further use, call:

```python
#| eval: false
result.output.model_dump()
```

Output:

{ "first_name": "Khuyen", "last_name": "Tran", "experience_years": 5, "primary_skill": [ "Python", "machine learning" ] } 

You can now easily integrate this into other data workflows. For example, to convert the output into a pandas DataFrame:

```python
#| eval: false
import pandas as pd

# Scalar fields broadcast against the list column, producing one row per skill
df = pd.DataFrame(result.output.model_dump())
df
```

Output:

```
  first_name last_name  experience_years     primary_skill
0     Khuyen      Tran                 5            Python
1     Khuyen      Tran                 5  machine learning
```

Using the DuckDuckGo Search Tool

Have you ever tried to make your AI app respond to current events or user queries with real-world data without managing a custom search backend?

PydanticAI supports integrating tools like DuckDuckGo search to enhance your AI agents with live web results.

```python
#| eval: false
from typing import List

from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.common_tools.duckduckgo import duckduckgo_search_tool


class UnemploymentDataSource(BaseModel):
    title: List[str]
    description: List[str]
    url: List[str]


# Define the agent with the DuckDuckGo search tool
search_agent = Agent(
    'gpt-4o-mini-2024-07-18',
    tools=[duckduckgo_search_tool()],
    system_prompt='Search DuckDuckGo and return links or resources that match the query.',
    output_type=UnemploymentDataSource,
)

# Run a search for an unemployment rate dataset
unemployment_result = search_agent.run_sync(
    'Monthly unemployment rate dataset for US from 2018 to 2024'
)
print(unemployment_result.output)
```

Example output:

```
title=[
    'Civilian unemployment rate - U.S. Bureau of Labor Statistics',
    'Databases, Tables & Calculators by Subject - U.S. Bureau of Labor Statistics',
    'Unemployment Rate (UNRATE) | FRED | St. Louis Fed',
    'US Unemployment Rate Monthly Analysis: Employment Situation - YCharts',
    'U.S. Unemployment Rate 1991-2025 - Macrotrends'
]
description=[
    'The U.S. Bureau of Labor Statistics provides information on the civilian unemployment rate.',
    'Access various data tables and calculators related to employment situations in the U.S.',
    "Access historical unemployment rates and data through the St. Louis Fed's FRED database.",
    'In-depth view into historical data of the U.S. unemployment rate including projections.',
    'Details on U.S. unemployment rate trends and statistics from 1991 to 2025.'
]
url=[
    'https://www.bls.gov/charts/employment-situation/civilian-unemployment-rate.htm',
    'https://www.bls.gov/data/',
    'https://fred.stlouisfed.org/series/UNRATE/',
    'https://ycharts.com/indicators/us_unemployment_rate',
    'https://www.macrotrends.net/global-metrics/countries/USA/united-states/unemployment-rate'
]
```

This output is fully structured and aligns with the UnemploymentDataSource schema. It makes the data easy to load into tables or use in downstream analytics workflows without additional transformation.
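For instance, here is a small sketch of loading the result into a pandas DataFrame; because each field is a list of equal length, the columns line up into one row per search result:

```python
#| eval: false
import pandas as pd

# Each list field becomes a column, with one row per search result
sources = pd.DataFrame(unemployment_result.output.model_dump())
print(sources[['title', 'url']].head())
```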

Comparison with LangChain Structured Output

How PydanticAI Handles Structured Output

PydanticAI returns Pydantic objects directly, so you can immediately access structured fields like cook_time without extra parsing.

```python
#| eval: false
from typing import List, Optional

from pydantic import BaseModel
from pydantic_ai import Agent


class RecipeExtractor(BaseModel):
    ingredients: List[str]
    instructions: str
    cook_time: Optional[str]


recipe_agent = Agent(
    "gpt-4o-mini-2024-07-18",
    system_prompt="Pull ingredients, instructions, and cook time.",
    output_type=RecipeExtractor,
)

recipe_result = recipe_agent.run_sync(
    "Sugar, flour, cocoa, eggs, and milk. Mix, bake at 350F for 30 min."
)
print(recipe_result.output.cook_time)  # 30 minutes
```

PydanticAI shines for standalone LLM tasks: you prompt a model once and immediately use the structured output, with no extra steps, chaining, or external orchestration.

How LangChain Handles Structured Output

LangChain binds a Pydantic model to the tool, but you must manually extract values from tool_calls, adding an extra step.

```python
#| eval: false
from typing import List, Optional

from pydantic import BaseModel
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage


# The same schema used in the PydanticAI example above
class RecipeExtractor(BaseModel):
    ingredients: List[str]
    instructions: str
    cook_time: Optional[str]


# Initialize the chat model
model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)

# Bind the response formatter schema
model_with_tools = model.bind_tools([RecipeExtractor])

# Create the messages to send to the model
messages = [
    SystemMessage("Pull ingredients, instructions, and cook time."),
    HumanMessage("Sugar, flour, cocoa, eggs, and milk. Mix, bake at 350F for 30 min."),
]

# Invoke the model with the prepared messages
ai_msg = model_with_tools.invoke(messages)

# Manually access the tool call arguments from the model invocation
print(ai_msg.tool_calls[0]['args']['cook_time'])  # 30 minutes
```

LangChain is better suited for multi-step workflows, such as combining several tools, using routing logic, or building custom chains.
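As a rough sketch of what that looks like (the prompts and the two-step flow here are purely illustrative, not from the example above), LangChain's expression language lets you pipe the output of one model call into the next:

```python
#| eval: false
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

model = ChatOpenAI(model="gpt-4o-mini-2024-07-18", temperature=0)

# Step 1: extract the ingredients from a free-form recipe description
extract_prompt = ChatPromptTemplate.from_template(
    "List only the ingredients mentioned in this recipe: {recipe}"
)

# Step 2: turn that ingredient list into a shopping list
shopping_prompt = ChatPromptTemplate.from_template(
    "Group these ingredients into a shopping list by grocery aisle: {ingredients}"
)

# The output of the first model call feeds the second one
chain = (
    extract_prompt
    | model
    | StrOutputParser()
    | (lambda ingredients: {"ingredients": ingredients})
    | shopping_prompt
    | model
    | StrOutputParser()
)

print(chain.invoke({"recipe": "Sugar, flour, cocoa, eggs, and milk."}))
```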

Final Thoughts

I find PydanticAI to be an easy-to-use tool for structuring LLM outputs effectively. It keeps workflows organized and predictable, which makes pipelines easier to reason about and maintain.

With just a few lines of code, you get robust schema validation that integrates naturally into Python-based pipelines. That makes it a practical choice for data scientists aiming to move beyond simple prototypes.
