
LangChain Structured Output with Pydantic

Unpredictable LLM output got you down? If so, you're in the right place!

In this blog post, we'll dive into how to use Pydantic models not only to structure your LangChain LLM outputs but also to guide the LLM itself toward the format you need. Say goodbye to parsing headaches and hello to predictable, well-defined data shaped by your schema.

Pydantic is a fantastic Python library for data validation and parsing using type hints. But when combined with LangChain's structured output capabilities, it becomes much more than just validation. Your Pydantic model, with its docstrings and field descriptions, acts as a blueprint and a set of instructions for the LLM, influencing the very output it creates.

Why is this "prompting through schema" approach so powerful?

  • Schema as Prompt: Your Pydantic model is the prompt for structured output. The LLM understands the desired format from your schema definition.
  • Validation is Built-in: LangChain parses the LLM output into your Pydantic model, so a response that does not conform to your schema raises a validation error instead of silently breaking your downstream processes. The schema prompts the LLM; Pydantic checks the result.
  • Effortless Serialization: Easily convert your structured data into JSON (for APIs), Markdown (for content), or any format you need.
  • Pythonic Type Safety: Benefit from Python's type hints to catch data structure errors early in development, making your code more robust.
  • Tailored Output Formats: Create custom methods to render your structured data in user-friendly formats like Markdown, making it immediately usable for different teams.

Let's get practical and explore the critical aspects of using Pydantic models in your LangChain workflows, focusing on how they act as prompts.

Key Techniques: Pydantic for LangChain Output

1. Defining Your Data Schema with Pydantic Fields (Schema as Prompt)

The pydantic.Field function is your best friend when defining your data schema, and crucially, when instructing the LLM. It allows you to specify field types, descriptions, and validation rules, all of which contribute to the prompt LangChain sends to the LLM. Here's a breakdown of common patterns:

  • Mandatory Fields: The Field(...) Power Move - Instructing "This is Required"

Need a field to be absolutely present in the LLM output? Use ... (ellipsis) within Field(). This signals that the field is required, and this requirement is communicated to the LLM as part of the prompt.

from pydantic import BaseModel, Field
from typing import List

class SubSection(BaseModel):
    """Represents a sub-section within a larger section.""" # Docstring for the class - prompt for the LLM about the object

    header: str = Field(..., description="Sub-section header (H3).") # Field description - prompt for the LLM about the 'header' field
    content_outline: str = Field(..., description="Detailed content points.") # Field description - prompt for the LLM about the 'content_outline' field

class Section(BaseModel):
    """Represents a section of content, potentially with sub-sections.""" # Docstring for the class - prompt for the LLM about the object
    header: str = Field(..., description="The main section header (H2).") # Field description - prompt for the LLM about the 'header' field
    sub_sections: List[SubSection] = Field(default_factory=list, description="List of sub-sections (H3).") # Field description - prompt for the LLM about 'sub_sections'

Crucially, the docstrings for Section and SubSection, and the description arguments within Field(), are not just for documentation – they are part of the prompt that tells the LLM what kind of output to generate. LangChain uses this information to guide the LLM towards producing structured data that matches your schema.
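
To see what the schema-as-prompt looks like, you can print the JSON schema Pydantic generates from these models. This is a minimal sketch assuming Pydantic v2 (where the method is model_json_schema()); the exact payload LangChain sends depends on the provider and LangChain version, but it is derived from this kind of schema.

```python
import json
from typing import List

from pydantic import BaseModel, Field


class SubSection(BaseModel):
    """Represents a sub-section within a larger section."""

    header: str = Field(..., description="Sub-section header (H3).")
    content_outline: str = Field(..., description="Detailed content points.")


class Section(BaseModel):
    """Represents a section of content, potentially with sub-sections."""

    header: str = Field(..., description="The main section header (H2).")
    sub_sections: List[SubSection] = Field(
        default_factory=list, description="List of sub-sections (H3)."
    )


# Assumes Pydantic v2; on v1 the equivalent call is Section.schema().
schema = Section.model_json_schema()

# The class docstring becomes the schema's top-level description.
print(schema["description"])
# Each field description travels with its property.
print(schema["properties"]["header"]["description"])
# Only fields without a default are listed as required.
print(schema["required"])

# Full schema, including the nested SubSection definition under "$defs".
print(json.dumps(schema, indent=2))
```

Reading the printed schema is a quick way to review your "prompt" before spending any tokens on it.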

  • Setting Default Values: Field(default_value) - Affecting Output, Not Prompt

Default values, specified using Field(default_value), primarily affect the output of the parser. They determine what value is assigned to a field if the LLM doesn't provide one in its response. They are not instructions to the LLM in the way descriptions are. Note, however, that a field with a default is no longer marked as required in the generated schema, and the default value itself appears there, so treat defaults as mainly a parsing-stage feature rather than as something invisible to the model.

class Section(BaseModel):
    title: str = Field(default="Default Section Title", description="The title of the section.")
    # ... other fields

In this case, if the LLM doesn't provide a value for title, the parser will use "Default Section Title". The LLM is still prompted to provide a title based on the description, but the default value is only used if the LLM fails to do so.
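
Here is a small sketch of that fallback, assuming Pydantic v2. The dictionary stands in for a parsed LLM response that omitted title, and the summary field is a hypothetical addition used only to make the payload non-empty. The last two lines show one nuance worth knowing: the default is also recorded in the generated schema, and the field drops out of the required list.

```python
from pydantic import BaseModel, Field


class Section(BaseModel):
    title: str = Field(default="Default Section Title", description="The title of the section.")
    # Hypothetical extra field, added only for this illustration.
    summary: str = Field(..., description="One-sentence summary of the section.")


# Stand-in for a parsed LLM response that did not include 'title'.
llm_payload = {"summary": "How defaults behave during parsing."}

section = Section.model_validate(llm_payload)
print(section.title)  # falls back to the declared default

schema = Section.model_json_schema()
print(schema["required"])                        # 'title' is not listed here
print(schema["properties"]["title"]["default"])  # the default is part of the schema
```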

  • Handling Mutable Defaults with default_factory=... - Output Behavior

Similar to default values, default_factory primarily affects the output of the parser. It ensures that mutable default values (like lists or dictionaries) are created anew each time, preventing unintended sharing of default values between instances. Like default_value, it is not an instruction to the LLM, although it does make the field optional in the generated schema.

class Section(BaseModel):
    tags: List[str] = Field(default_factory=list, description="Keywords or tags for the section.")
    # ... other fields

Here, default_factory=list ensures that each Section instance gets its own empty list for tags if the LLM doesn't provide one. The LLM is still prompted to provide a list of strings for tags based on the description.
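
A quick way to convince yourself that each instance gets its own list is to mutate one and inspect the other. This is a minimal sketch assuming Pydantic v2; the instances are built by hand rather than by an LLM.

```python
from typing import List

from pydantic import BaseModel, Field


class Section(BaseModel):
    header: str = Field(..., description="The main section header (H2).")
    tags: List[str] = Field(default_factory=list, description="Keywords or tags for the section.")


# Neither instance supplies 'tags', so default_factory runs once per instance.
first = Section(header="First")
second = Section(header="Second")

first.tags.append("pydantic")

print(first.tags)   # only the first instance was modified
print(second.tags)  # the second instance still has its own empty list
```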

Quick Reference: Field Syntax and Prompting Power

| Syntax | Purpose | Example | Prompting Effect |
| --- | --- | --- | --- |
| Field(...) | Mandatory Field | title: str = Field(...) | Instructs the LLM that title is a required part of the output. |
| Field(default_value) | Default Value | title: str = Field(default="Default Title") | Value used if the LLM doesn't provide a value for the title field. Not an instruction to the LLM, though the field is no longer marked as required. |
| Field(default_factory=list) | Default Factory | tags: List[str] = Field(default_factory=list) | Gives each instance a fresh empty list when the LLM omits tags. Not an instruction to the LLM, though the field is no longer marked as required. |
| description="..." | Field Description | header: str = Field(..., description="...") | Provides explicit instructions to the LLM about the content of the header field. |
| Class Docstring | Class Description | class Section(BaseModel): """...""" | Provides high-level instructions to the LLM about the overall object/structure. |

2. Crafting Custom Output Formats (e.g., to_markdown and to_dict)

These methods are for after the LLM has generated the structured output based on your Pydantic models and prompts. They are about how you use the structured data, not about prompting the LLM itself.

Here are two examples I use frequently:

  • to_markdown Property: For Human-Readable Output

Create a @property method to transform your structured data into a Markdown string. This is perfect for displaying content to users or for content teams.

class SubSection(BaseModel):
    header: str = Field(..., description="Sub-section header (H3).")
    content_outline: str = Field(..., description="Detailed content points.")

    @property
    def to_markdown(self) -> str:
        return f"- **H3: {self.header}**\n    - **Content Outline:** {self.content_outline}"

  • to_dict Method: For API and Machine Consumption

Pydantic models already have a built-in method for converting to dictionaries: .model_dump() in Pydantic v2 (.dict() in v1, which still works in v2 but is deprecated). However, you might want a custom to_dict method if you need to perform any pre-processing or transformations before exporting to a dictionary (and potentially JSON). In many cases, simply calling the built-in method directly is sufficient!

class SubSection(BaseModel):
    # ... (fields and to_markdown as above)

    def to_dict(self) -> dict:
        return self.model_dump()  # Pydantic v2 (self.dict() on v1); add custom processing here if needed
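
To make the two helpers concrete, here is a self-contained sketch that builds a SubSection by hand and renders it both ways. It assumes Pydantic v2, so it calls model_dump() and model_dump_json(); the sample header and outline text are made up for illustration.

```python
from pydantic import BaseModel, Field


class SubSection(BaseModel):
    header: str = Field(..., description="Sub-section header (H3).")
    content_outline: str = Field(..., description="Detailed content points.")

    @property
    def to_markdown(self) -> str:
        return f"- **H3: {self.header}**\n    - **Content Outline:** {self.content_outline}"

    def to_dict(self) -> dict:
        # model_dump() is the Pydantic v2 name; .dict() is the v1 equivalent.
        return self.model_dump()


# Illustrative values; in practice the instance comes back from the LLM.
sub = SubSection(header="Field Definitions", content_outline="Required vs. optional fields.")

print(sub.to_markdown)        # property access, so no parentheses
print(sub.to_dict())          # plain dict, ready for further processing
print(sub.model_dump_json())  # JSON string, e.g. for an API response
```

Because to_markdown is a property and to_dict is a method, the first is read like an attribute while the second is called.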

3. Seamless Integration with LangChain's with_structured_output (Prompting in Action)

LangChain's .with_structured_output method is the magic that takes your Pydantic model (your schema-as-prompt) and uses it to guide the LLM. It leverages the model's underlying capabilities (like function calling or JSON mode) to interpret your schema and generate output accordingly.

Let's see this in action:

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field
from typing import List

# Define your Pydantic models (Section and SubSection - as defined earlier)
class SubSection(BaseModel): # Docstring acts as prompt: "Represents a sub-section..."
    """Represents a sub-section within a larger section."""
    header: str = Field(..., description="Sub-section header (H3).") # Field description acts as prompt
    content_outline: str = Field(..., description="Detailed content points.") # Field description acts as prompt

    @property
    def to_markdown(self) -> str:
        return f"- **H3: {self.header}**\n    - **Content Outline:** {self.content_outline}"

    def to_dict(self) -> dict:
        return self.model_dump()  # Pydantic v2; use self.dict() on v1

class Section(BaseModel): # Docstring acts as prompt: "Represents a section..."
    """Represents a section of content, potentially with sub-sections."""
    header: str = Field(..., description="The main section header (H2).") # Field description acts as prompt
    sub_sections: List[SubSection] = Field(default_factory=list, description="List of sub-sections (H3).") # Field description acts as prompt

    @property
    def to_markdown(self) -> str:
        md = f"**H2: {self.header}**\n"
        for sub in self.sub_sections:
            md += sub.to_markdown + "\n"
        return md

    def to_dict(self) -> dict:
        return self.model_dump()  # Pydantic v2; use self.dict() on v1


# Initialize your LangChain LLM (e.g., ChatOpenAI)
model = ChatOpenAI(model="gpt-4", temperature=0) # Make sure you have your API key set!

# Wrap your LLM with structured output, using your Section Pydantic model - SCHEMA AS PROMPT!
structured_llm = model.with_structured_output(Section)

# Invoke the LLM with your prompt - the *user prompt* is still important, but the *structure* is guided by Pydantic
prompt = "Create a section about 'Pydantic and LangChain Integration' with two subsections: 'Field Definitions' and 'Output Formatting'."
response = structured_llm.invoke(prompt)

# 'response' is now a Pydantic Section object - structured output achieved through schema prompting!
print(response.to_markdown)
print(response.to_dict())
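
The example above needs an API key, so here is an offline sketch of the validation step that happens once the LLM responds. It assumes Pydantic v2 and uses two hand-written JSON strings as stand-ins for model output; it does not reproduce LangChain's internal parsing, only the Pydantic check that the parsed result must pass.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class SubSection(BaseModel):
    """Represents a sub-section within a larger section."""

    header: str = Field(..., description="Sub-section header (H3).")
    content_outline: str = Field(..., description="Detailed content points.")


class Section(BaseModel):
    """Represents a section of content, potentially with sub-sections."""

    header: str = Field(..., description="The main section header (H2).")
    sub_sections: List[SubSection] = Field(default_factory=list, description="List of sub-sections (H3).")


# Hand-written stand-ins for LLM output: one conforming, one missing required fields.
good_output = (
    '{"header": "Pydantic and LangChain Integration", '
    '"sub_sections": [{"header": "Field Definitions", "content_outline": "How Field() works."}]}'
)
bad_output = '{"sub_sections": [{"header": "Field Definitions"}]}'

section = Section.model_validate_json(good_output)
print(section.sub_sections[0].header)

try:
    Section.model_validate_json(bad_output)
except ValidationError as exc:
    # Collect the dotted path of every field that failed validation.
    failed_paths = [".".join(str(part) for part in err["loc"]) for err in exc.errors()]
    print(failed_paths)
```

If you would rather inspect failures than catch exceptions, with_structured_output also accepts include_raw=True, which returns the raw message alongside the parsed object and any parsing error; check the documentation for your LangChain version for the exact return shape.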

Best Practices for Pydantic and LangChain Structured Output

  • Design Clear Schemas (as Prompts): Carefully design your Pydantic models to accurately represent the structure you expect from the LLM. Think of your schema as the instruction manual for the LLM's output.
  • Write Descriptive Docstrings and Field Descriptions (Prompt Engineering): Use the description parameter in Field() and class docstrings to clearly document the purpose and content of each field and the overall object. These descriptions are crucial prompts for the LLM. The better your descriptions, the better the LLM understands what to generate.
  • Avoid Mutable Defaults: Always use default_factory for mutable types to prevent shared state between instances.
  • Modular Design: Break complex outputs into nested Pydantic models for clarity and maintainability.
  • Conversion Helpers: Implement properties like to_markdown and methods like to_dict to simplify downstream processing and export.

Final Thoughts

Pydantic models are not just about validation; they are a powerful way to prompt LLMs for structured output. By defining clear schemas with descriptive docstrings and field descriptions, you are effectively instructing the LLM on the exact format and content you require. LangChain's .with_structured_output method then leverages these schemas to guide the LLM's generation process, resulting in reliable, validated, and easily usable structured data.

Stop thinking of Pydantic as just a validation library and embrace it as a powerful prompting tool for structured LLM output. This schema-as-prompt approach is key to building robust and dependable LLM-powered applications.

Happy coding!


Jason Melman