Python Dataclasses vs Pydantic: The Complete Production Guide

TL;DR

Python dataclasses (standard library) give you type-annotated classes with auto-generated __init__, __repr__, and __eq__, plus opt-in ordering methods, which solves the boilerplate problem for plain data containers. Pydantic v2 adds runtime validation, type coercion, and JSON parsing, powered by a Rust core for performance, with settings management available through the separate pydantic-settings package. Use dataclasses for internal domain models where types are guaranteed correct. Use Pydantic at system boundaries (APIs, configs, external data) where validation matters. This guide covers every feature, footgun, and real-world pattern for both.


PART 1 — Python Dataclasses: Complete Coverage

1. Motivation and Design Goals

Before Python 3.7, creating a simple data container required verbose boilerplate:

class User:
    def __init__(self, id: int, name: str, email: str):
        self.id = id
        self.name = name
        self.email = email
    
    def __repr__(self):
        return f"User(id={self.id!r}, name={self.name!r}, email={self.email!r})"
    
    def __eq__(self, other):
        if not isinstance(other, User):
            return NotImplemented
        return (self.id, self.name, self.email) == (other.id, other.name, other.email)

Maintaining this is tedious and error-prone. Add a field? Update three methods. Forget to update __eq__? Subtle bugs.

Dataclasses solve this: they auto-generate these methods from type annotations. The design goals:

  • Zero runtime overhead: Generated code is identical to what you’d write by hand
  • Type-checker friendly: Annotations drive behavior, not runtime inspection
  • Opt-in magic: Control exactly which methods get generated
  • No new base class: Works with regular classes, inheritance, descriptors

from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str
    email: str

That’s it. You get __init__, __repr__, and __eq__ for free. The decorator introspects class annotations at definition time and injects methods.

2. The @dataclass Decorator In Depth

@dataclass(
    init=True,          # Generate __init__
    repr=True,          # Generate __repr__
    eq=True,            # Generate __eq__
    order=False,        # Generate __lt__, __le__, __gt__, __ge__
    unsafe_hash=False,  # Generate __hash__ (dangerous, see below)
    frozen=False,       # Make immutable
    match_args=True,    # Generate __match_args__ for pattern matching (3.10+)
    kw_only=False,      # All fields keyword-only (3.10+)
    slots=False,        # Use __slots__ (3.10+)
    weakref_slot=False  # Add __weakref__ to __slots__ (3.11+, requires slots=True)
)
class Example:
    ...

init, repr, eq

These are self-explanatory. Setting init=False means you’ll provide your own __init__. Useful when you need custom initialization logic but still want __repr__ and __eq__.

from datetime import datetime, timezone

@dataclass(init=False)
class Timestamped:
    created_at: datetime
    
    def __init__(self):
        self.created_at = datetime.now(timezone.utc)

order=True

Generates comparison methods based on field order. Fields are compared as tuples.

@dataclass(order=True)
class Version:
    major: int
    minor: int
    patch: int

v1 = Version(1, 2, 3)
v2 = Version(1, 3, 0)
assert v1 < v2  # Compares (1,2,3) < (1,3,0)

Footgun: order=True requires eq=True. Combining eq=False with order=True raises ValueError at class definition time, so the decorator will not let you build a class whose ordering and equality disagree.
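A quick way to see how the decorator treats the eq=False, order=True combination is to try it inside a try/except. This is a minimal sketch; the class name Broken and the helper try_define are made up for illustration.

```python
from dataclasses import dataclass


def try_define():
    """Attempt to define a dataclass with eq=False and order=True."""
    try:
        @dataclass(eq=False, order=True)
        class Broken:  # hypothetical class, never successfully created
            value: int
    except ValueError as exc:
        # The decorator refuses the combination at class definition time.
        return str(exc)
    return None


message = try_define()
print(message)
```

The error surfaces when the class is defined, not when instances are compared, so a bad combination cannot reach production unnoticed.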

unsafe_hash=True: The Danger Zone

Hashing mutable objects is a bug waiting to happen:

@dataclass(unsafe_hash=True)  # BAD IDEA
class Mutable:
    value: int

d = {}
m = Mutable(value=1)
d[m] = "found"
m.value = 2  # MUTATE
print(d[m])  # KeyError! Hash changed, dict can't find it

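Safe pattern: Prefer frozen=True, which (with the default eq=True) generates __hash__ for you. Reserve unsafe_hash=True for classes you treat as immutable by convention but cannot freeze.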

frozen=True: Immutability

Makes instances immutable after __init__. Attempts to assign raise FrozenInstanceError.

@dataclass(frozen=True)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
p.x = 3.0  # FrozenInstanceError

Frozen dataclasses are hashable by default (no need for unsafe_hash), making them safe for dict keys and sets.
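As a small illustration of that hashability, the sketch below uses a frozen dataclass as a set member and a dict key. GridPoint and the sample values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class GridPoint:  # hypothetical example class
    x: int
    y: int


# Equal frozen instances hash identically, so the set deduplicates them.
visited = {GridPoint(0, 0), GridPoint(0, 0), GridPoint(1, 2)}

# A fresh but equal instance finds the same dict entry.
labels = {GridPoint(1, 2): "treasure"}
found = labels[GridPoint(1, 2)]

# Assignment is blocked, which is what keeps the hash stable.
try:
    GridPoint(0, 0).x = 5
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Hashing still depends on every field value being hashable, so a frozen dataclass holding a list cannot be used as a key.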

Performance: Frozen dataclasses aren’t faster at runtime. The immutability is enforced by replacing __setattr__ and __delattr__, not through memory layout tricks.

slots=True: Memory Optimization

Python 3.10+ allows slots=True to use __slots__ for memory efficiency:

@dataclass(slots=True)
class Compact:
    a: int
    b: str

Without slots, each instance carries a __dict__ (~200 bytes overhead). With slots, attributes are stored in a fixed array. For millions of instances, this matters.

Trade-off: You can’t add arbitrary attributes:

c = Compact(1, "hi")
c.new_field = 123  # AttributeError: 'Compact' object has no attribute 'new_field'

Inheritance caveat: All classes in the hierarchy must use slots=True, or you lose the benefit.

3. Field Mechanics

field() Function

from dataclasses import dataclass, field

@dataclass
class Record:
    id: int
    tags: list[str] = field(default_factory=list)
    metadata: dict = field(default_factory=dict, repr=False)
    _internal: int = field(default=0, init=False)

Parameters:

  • default: Default value (unhashable values such as list, dict, or set are rejected; use default_factory for those)
  • default_factory: Callable returning default (for mutable defaults)
  • init: Include in __init__ (default True)
  • repr: Include in __repr__ (default True)
  • compare: Include in __eq__ and ordering (default True)
  • hash: Include in __hash__ (default None means use compare)
  • metadata: Arbitrary dict for tooling (not used by dataclasses itself)

The Mutable Default Footgun

@dataclass
class Bad:
    items: list = []  # ValueError at class definition: mutable default

@dataclass
class Good:
    items: list = field(default_factory=list)

Why? With an ordinary class attribute, all instances would share the same list, so the decorator rejects list, dict, and set defaults outright. default_factory creates a new list per instance.
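Both halves of this rule can be checked in a few lines. This is a sketch with hypothetical names (define_bad, Basket): the first part shows the decorator rejecting a list default, the second shows default_factory giving each instance its own list.

```python
from dataclasses import dataclass, field


def define_bad():
    """Try to declare a dataclass with a bare list as a default."""
    try:
        @dataclass
        class Bad:
            items: list = []
    except ValueError:
        return "ValueError"
    return "defined"


@dataclass
class Basket:  # hypothetical example class
    items: list = field(default_factory=list)


first, second = Basket(), Basket()
first.items.append("apple")  # only affects the first basket
```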

init=False: Manual Initialization

@dataclass
class Computed:
    width: int
    height: int
    area: int = field(init=False)
    
    def __post_init__(self):
        self.area = self.width * self.height

c = Computed(10, 20)
print(c.area)  # 200

repr=False: Hide Sensitive Data

@dataclass
class Credentials:
    username: str
    password: str = field(repr=False)

creds = Credentials("admin", "secret123")
print(creds)  # Credentials(username='admin')  # password hidden

compare=False: Exclude From Comparisons

@dataclass
class CachedData:
    key: str
    value: str
    cache_time: float = field(compare=False)
    
# Two instances are equal if key/value match, ignoring cache_time
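To make the effect of compare=False concrete, the sketch below compares instances that differ only in the excluded field. The class name CachedValue and the sample data are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class CachedValue:  # hypothetical example class
    key: str
    value: str
    cache_time: float = field(compare=False)  # ignored by __eq__


old = CachedValue("user:1", "Alice", cache_time=100.0)
new = CachedValue("user:1", "Alice", cache_time=250.0)
other = CachedValue("user:1", "Bob", cache_time=100.0)

same = old == new         # differs only in cache_time
different = old != other  # differs in a compared field
```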

metadata: Custom Annotations

Used by third-party libraries (e.g., serialization frameworks):

@dataclass
class APIModel:
    user_id: int = field(metadata={"json_name": "userId"})
    created_at: datetime = field(metadata={"format": "iso8601"})

# Access via fields()
from dataclasses import fields
for f in fields(APIModel):
    print(f.name, f.metadata)

4. Post-Init Lifecycle

__post_init__

Called after __init__ completes. Use for validation, computed fields, or normalization:

@dataclass
class Email:
    address: str
    
    def __post_init__(self):
        if "@" not in self.address:
            raise ValueError(f"Invalid email: {self.address}")
        self.address = self.address.lower()

Modifying Frozen Dataclasses

You can’t assign to frozen instances normally, but __post_init__ has a workaround:

@dataclass(frozen=True)
class Normalized:
    name: str
    normalized: str = field(init=False)
    
    def __post_init__(self):
        # Use object.__setattr__ to bypass frozen check
        object.__setattr__(self, "normalized", self.name.lower())

n = Normalized("HELLO")
print(n.normalized)  # "hello"
n.normalized = "x"   # FrozenInstanceError

5. Inheritance Rules and Pitfalls

Subclasses inherit fields from parents:

@dataclass
class Base:
    x: int

@dataclass
class Derived(Base):
    y: int
    
# Derived.__init__(x, y)

Ordering matters: Fields without defaults must come before fields with defaults:

@dataclass
class Parent:
    a: int
    b: int = 10

@dataclass
class Child(Parent):
    c: int  # TypeError: non-default argument 'c' follows default argument

Fix: Give c a default, declare the child with kw_only=True (Python 3.10+), or rework the hierarchy.

Override fields:

@dataclass
class Parent:
    x: int = 10

@dataclass
class Child(Parent):
    x: int = 20  # Overrides default

c = Child()
print(c.x)  # 20

Slots inheritance: If the parent doesn’t use slots, the child won’t get slot benefits even if it specifies slots=True.

6. Dataclasses + Typing

Optional, Union, Literal

from typing import Optional, Literal

@dataclass
class Config:
    host: str
    port: int
    tls: bool
    log_level: Literal["DEBUG", "INFO", "ERROR"] = "INFO"
    proxy: Optional[str] = None

Dataclasses don’t validate types at runtime. Type checkers (mypy, pyright) will catch errors, but:

c = Config(host=123, port="wat", tls="yes")  # No error at runtime!

This is intentional. Dataclasses are about reducing boilerplate, not validation.
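If a light runtime check is wanted without leaving the standard library, __post_init__ can loop over fields() and compare each value against its annotation. The sketch below is deliberately limited: it only handles annotations that are plain classes (it skips Optional, Literal, generics, and string annotations), and CheckedConfig is a hypothetical name.

```python
from dataclasses import dataclass, fields


@dataclass
class CheckedConfig:  # hypothetical example class
    host: str
    port: int

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            # Only plain classes can be passed to isinstance(); skip everything else.
            if isinstance(f.type, type) and not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )


ok = CheckedConfig(host="localhost", port=5432)

try:
    CheckedConfig(host=123, port="wat")
    error = None
except TypeError as exc:
    error = str(exc)  # reports the first offending field only
```

Unlike Pydantic, this stops at the first bad field and performs no coercion.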

ClassVar: Class-Level Attributes

from dataclasses import dataclass
from typing import ClassVar

@dataclass
class Versioned:
    VERSION: ClassVar[int] = 2
    data: str

ClassVar tells the dataclass decorator to ignore this field (not in __init__, etc.).

InitVar: Init-Only Parameters

Fields that exist only during __init__, not as instance attributes:

from dataclasses import dataclass, field, InitVar

@dataclass
class Database:
    host: str
    port: int
    timeout: InitVar[int] = 30
    connection_string: str = field(init=False)
    
    def __post_init__(self, timeout: int):
        self.connection_string = f"postgresql://{self.host}:{self.port}?timeout={timeout}"

db = Database("localhost", 5432, timeout=60)
print(db.connection_string)  # Uses timeout
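# timeout is not stored on the instance. Because this InitVar has a default,
# db.timeout falls back to the class attribute (30), not the 60 passed in.
# Without a default, db.timeout would raise AttributeError.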

7. Dataclasses + Serialization

asdict() and astuple()

from dataclasses import asdict, astuple

@dataclass
class Point:
    x: float
    y: float

p = Point(1.5, 2.5)
print(asdict(p))   # {'x': 1.5, 'y': 2.5}
print(astuple(p))  # (1.5, 2.5)

Deep conversion: Nested dataclasses are recursively converted:

@dataclass
class Line:
    start: Point
    end: Point

line = Line(Point(0, 0), Point(10, 10))
print(asdict(line))
# {'start': {'x': 0, 'y': 0}, 'end': {'x': 10, 'y': 10}}

Footgun: asdict() doesn’t handle arbitrary objects gracefully:

@dataclass
class Record:
    timestamp: datetime

r = Record(datetime.now())
asdict(r)  # Returns {'timestamp': <datetime object>}, NOT a string

You need custom serialization for complex types. Common pattern:

def to_serializable(obj):
    if isinstance(obj, datetime):
        return obj.isoformat()
    return obj

def serialize_dataclass(dc):
    return {k: to_serializable(v) for k, v in asdict(dc).items()}

Deserializing: No Built-In Support

Dataclasses don’t have from_dict(). You must construct instances manually:

data = {"x": 1.5, "y": 2.5}
p = Point(**data)

For nested structures, you need recursion or a library (e.g., dacite, cattrs).
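For simple cases, the recursion can be hand-rolled. The following is a minimal sketch, not a replacement for dacite or cattrs: the helper name from_dict is hypothetical, and it only handles fields annotated directly with a dataclass type (no list[Point], Optional, or unions).

```python
from dataclasses import dataclass, fields, is_dataclass
from typing import get_type_hints


@dataclass
class Point:
    x: float
    y: float


@dataclass
class Line:
    start: Point
    end: Point


def from_dict(cls, data):
    """Build a dataclass instance from a dict, recursing into nested dataclasses."""
    hints = get_type_hints(cls)
    kwargs = {}
    for f in fields(cls):
        if not f.init or f.name not in data:
            continue  # let defaults (or the constructor's TypeError) apply
        value = data[f.name]
        target = hints[f.name]
        if is_dataclass(target) and isinstance(value, dict):
            value = from_dict(target, value)
        kwargs[f.name] = value
    return cls(**kwargs)


line = from_dict(Line, {"start": {"x": 0, "y": 0}, "end": {"x": 10, "y": 10}})
```

No type checking or coercion happens here; the values are passed through exactly as they appear in the dict.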

8. Performance Characteristics

Memory Layout

Standard dataclass with __dict__:

  • Instance overhead: ~200 bytes + attribute storage
  • Attribute access: hash table lookup

With slots=True:

  • Instance overhead: ~40 bytes + attribute storage
  • Attribute access: direct array index (faster)

Benchmark (1 million instances; approximate figures that vary with Python version and platform):

@dataclass
class NoSlots:
    a: int
    b: int
    c: int
# Memory: ~350 MB

@dataclass(slots=True)
class WithSlots:
    a: int
    b: int
    c: int
# Memory: ~120 MB

Construction Speed

Dataclasses are just Python code. No metaclass overhead, no dynamic dispatch. Construction time is identical to hand-written __init__.

Comparison time: Generated __eq__ is tuple comparison under the hood. With slots=True, attribute access is faster, so equality checks are slightly faster.

9. Common Anti-Patterns

Mutable Default Anti-Pattern

@dataclass
class Container:
    items: list = []  # NO! Raises ValueError at class definition

Always use default_factory.

Overusing frozen=True

Frozen dataclasses force immutability at the Python level, but they’re not truly immutable if they contain mutable objects:

@dataclass(frozen=True)
class Config:
    settings: dict

c = Config(settings={"debug": True})
c.settings["debug"] = False  # Mutates "frozen" object!

True immutability requires immutable data structures (e.g., frozendict, tuples).
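One standard-library option for a read-only mapping is types.MappingProxyType wrapped around a private copy of the input. This is a sketch under that assumption; ReadOnlyConfig is a hypothetical name, and the object.__setattr__ call is the same frozen workaround shown earlier.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from types import MappingProxyType


@dataclass(frozen=True)
class ReadOnlyConfig:  # hypothetical example class
    settings: Mapping[str, bool]

    def __post_init__(self):
        # Copy the caller's dict, then expose it through a read-only view.
        frozen_view = MappingProxyType(dict(self.settings))
        object.__setattr__(self, "settings", frozen_view)


source = {"debug": True}
cfg = ReadOnlyConfig(settings=source)

source["debug"] = False  # the caller's dict is no longer linked to cfg

try:
    cfg.settings["debug"] = False  # the proxy rejects item assignment
    blocked = False
except TypeError:
    blocked = True
```

The copy is shallow, so nested mutable values inside the mapping would need the same treatment.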

Using Dataclasses for Validation

Dataclasses don’t validate. This silently succeeds:

@dataclass
class Age:
    value: int

age = Age(value=-50)  # No error!

Solution: Use __post_init__ for checks, or switch to Pydantic.

10. When NOT to Use Dataclasses

Don’t use dataclasses when:

  • You need validation: Dataclasses don’t validate types or constraints
  • You’re parsing untrusted input: No coercion or error handling
  • You need complex serialization: asdict() recurses into nested dataclasses but leaves values such as datetime unconverted
  • You want ORM features: Just use SQLAlchemy or similar
  • You need before/after hooks: Dataclasses only have __post_init__

Use cases for dataclasses:

  • Internal domain models with trusted data
  • Type-safe configuration objects (when initialized from code)
  • DTOs between layers (when types are guaranteed)
  • Replacing namedtuples with better type checking

PART 2 — Pydantic v2: Complete Guide

1. Philosophy and Design Differences

Pydantic solves a different problem: validating and parsing data from untrusted external sources. While dataclasses reduce boilerplate, Pydantic adds:

  • Runtime validation: Type annotations are enforced at runtime
  • Type coercion: Convert "123" → 123, "true" → True
  • JSON parsing: Direct deserialization from JSON strings
  • Error aggregation: Collect all validation errors, not just the first
  • Settings management: Parse environment variables with validation

Key insight: Pydantic is built for system boundaries (APIs, configs, files). Dataclasses are for internal models.

2. BaseModel Deep Dive

from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str
    email: str

Unlike dataclasses, you inherit from BaseModel. This gives you:

  • __init__ with validation
  • model_dump() for serialization
  • model_validate() for parsing dicts
  • model_validate_json() for parsing JSON strings

Model Construction

user = User(id=1, name="Alice", email="alice@example.com")
print(user.id)  # 1

# Type coercion
user2 = User(id="2", name="Bob", email="bob@example.com")
print(user2.id, type(user2.id))  # 2 <class 'int'>

Pydantic converts compatible types automatically.

Immutability

By default, Pydantic models are mutable:

user.name = "Eve"  # OK

For immutability:

from pydantic import ConfigDict

class ImmutableUser(BaseModel):
    model_config = ConfigDict(frozen=True)
    
    id: int
    name: str

u = ImmutableUser(id=1, name="Alice")
u.name = "Eve"  # ValidationError: Instance is frozen

Slots

Pydantic v2 models store field values in a per-instance __dict__, and BaseModel has no configuration option that switches field storage to __slots__:

class CompactModel(BaseModel):
    value: int

    # No ConfigDict option enables slotted field storage.
    # For memory efficiency, keep slotted dataclasses internally
    # and validate with Pydantic at the boundary.

For memory-sensitive use cases, consider hybrid patterns (covered later).

3. Validation System

Type Coercion Rules

Pydantic tries to convert input to the annotated type:

class Data(BaseModel):
    count: int
    ratio: float
    active: bool

# All of these work
d1 = Data(count="42", ratio="3.14", active="yes")
print(d1.count, d1.ratio, d1.active)  # 42 3.14 True

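# Floats are accepted for int fields only when they have no fractional part;
# count=42.7 would raise ValidationError
d2 = Data(count=42.0, ratio=5, active=1)
print(d2.count, d2.ratio, d2.active)  # 42 5.0 True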

Bool coercion: "yes", "true", "1", "on" → True; "no", "false", "0", "off" → False.

Strict vs Non-Strict Mode

Disable coercion:

from pydantic import Field

class StrictData(BaseModel):
    count: int = Field(strict=True)

StrictData(count="42")  # ValidationError: Input should be a valid integer
StrictData(count=42)    # OK

Global strict mode:

class AllStrict(BaseModel):
    model_config = ConfigDict(strict=True)
    
    count: int
    ratio: float

Field Validators

After validators (run after type coercion):

from pydantic import field_validator

class User(BaseModel):
    username: str
    age: int
    
    @field_validator("username")
    @classmethod
    def username_alphanumeric(cls, v: str) -> str:
        if not v.isalnum():
            raise ValueError("Username must be alphanumeric")
        return v
    
    @field_validator("age")
    @classmethod
    def age_positive(cls, v: int) -> int:
        if v < 0:
            raise ValueError("Age must be positive")
        return v

User(username="alice", age=25)  # OK
User(username="alice!", age=25)  # ValidationError: username
User(username="alice", age=-5)  # ValidationError: age

Before validators (run before type coercion):

class Normalized(BaseModel):
    email: str
    
    @field_validator("email", mode="before")
    @classmethod
    def lowercase_email(cls, v):
        if isinstance(v, str):
            return v.lower()
        return v

Normalized(email="Alice@Example.COM")  # Stores "alice@example.com"

Wrap validators (control the entire validation):

from pydantic import field_validator

class Logged(BaseModel):
    value: int
    
    @field_validator("value", mode="wrap")
    @classmethod
    def log_validation(cls, v, handler):
        print(f"Validating: {v}")
        result = handler(v)  # Call default validation
        print(f"Result: {result}")
        return result

Logged(value="123")
# Output:
# Validating: 123
# Result: 123

Model Validators

Validate across multiple fields:

from pydantic import model_validator

class DateRange(BaseModel):
    start: datetime
    end: datetime
    
    @model_validator(mode="after")
    def check_dates(self) -> "DateRange":
        if self.end <= self.start:
            raise ValueError("end must be after start")
        return self

DateRange(start=datetime(2024, 1, 1), end=datetime(2023, 1, 1))  # ValidationError

Use mode="before" to access raw dict:

class FlexibleInput(BaseModel):
    value: int
    
    @model_validator(mode="before")
    @classmethod
    def handle_legacy(cls, data):
        if isinstance(data, dict) and "old_value" in data:
            data["value"] = data.pop("old_value")
        return data

FlexibleInput(old_value=42)  # Works, converts to new format

4. Field Definitions

from pydantic import Field

class Product(BaseModel):
    id: int
    name: str = Field(min_length=1, max_length=100)
    price: float = Field(gt=0, le=1_000_000)
    quantity: int = Field(default=0, ge=0)
    description: str | None = Field(default=None, description="Product description")
    tags: list[str] = Field(default_factory=list)

Constraints:

  • Strings: min_length, max_length, pattern (regex)
  • Numbers: gt, ge, lt, le, multiple_of
  • Collections: min_length, max_length

default vs default_factory

Same as dataclasses:

class Config(BaseModel):
    options: dict = Field(default_factory=dict)  # New dict per instance

Aliasing

Map Python names to JSON/external names:

class APIResponse(BaseModel):
    user_id: int = Field(alias="userId")
    created_at: datetime = Field(alias="createdAt")

# Parse from API
data = {"userId": 123, "createdAt": "2024-01-01T00:00:00Z"}
response = APIResponse(**data)
print(response.user_id)  # 123

# Serialize with aliases
print(response.model_dump(by_alias=True))
# {'userId': 123, 'createdAt': datetime(...)}

Population by name:

class Flexible(BaseModel):
    model_config = ConfigDict(populate_by_name=True)
    
    user_id: int = Field(alias="userId")

# Accept both
Flexible(userId=1)   # OK
Flexible(user_id=1)  # Also OK

5. Parsing Inputs

From Dicts

data = {"id": 1, "name": "Alice", "email": "alice@example.com"}
user = User(**data)  # OK
user = User.model_validate(data)  # Explicit validation

From JSON Strings

json_data = '{"id": 1, "name": "Alice", "email": "alice@example.com"}'
user = User.model_validate_json(json_data)

This is faster than json.loads() + User(**data) because Pydantic’s Rust core parses JSON natively.

Lists of Models

users_data = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "email": "bob@example.com"},
]

# Option 1: List comprehension
users = [User(**d) for d in users_data]

# Option 2: TypeAdapter (preferred)
from pydantic import TypeAdapter

UserList = TypeAdapter(list[User])
users = UserList.validate_python(users_data)

TypeAdapter for Non-BaseModel Types

from pydantic import TypeAdapter

# Validate basic types
IntValidator = TypeAdapter(int)
print(IntValidator.validate_python("123"))  # 123

# Validate complex structures
DictAdapter = TypeAdapter(dict[str, list[int]])
result = DictAdapter.validate_python({"nums": ["1", "2", "3"]})
print(result)  # {'nums': [1, 2, 3]}

6. Serialization

model_dump()

user = User(id=1, name="Alice", email="alice@example.com")

# Default
print(user.model_dump())
# {'id': 1, 'name': 'Alice', 'email': 'alice@example.com'}

# Exclude fields
print(user.model_dump(exclude={"email"}))
# {'id': 1, 'name': 'Alice'}

# Include only certain fields
print(user.model_dump(include={"id", "name"}))
# {'id': 1, 'name': 'Alice'}

# Use aliases
class APIModel(BaseModel):
    user_id: int = Field(alias="userId")

m = APIModel(userId=123)
print(m.model_dump(by_alias=True))  # {'userId': 123}

model_dump_json()

json_str = user.model_dump_json()
print(json_str)  # '{"id":1,"name":"Alice","email":"alice@example.com"}'

# With indentation
print(user.model_dump_json(indent=2))

Nested Models

class Address(BaseModel):
    street: str
    city: str

class Person(BaseModel):
    name: str
    address: Address

p = Person(name="Alice", address={"street": "123 Main", "city": "NYC"})
print(p.model_dump())
# {'name': 'Alice', 'address': {'street': '123 Main', 'city': 'NYC'}}

Exclude Unset

Only serialize fields that were explicitly set:

class Partial(BaseModel):
    a: int = 1
    b: int = 2
    c: int = 3

p = Partial(a=10)
print(p.model_dump())  # {'a': 10, 'b': 2, 'c': 3}
print(p.model_dump(exclude_unset=True))  # {'a': 10}

Useful for PATCH operations in REST APIs.

7. Error System

from pydantic import ValidationError

class User(BaseModel):
    id: int
    age: int = Field(gt=0, lt=150)
    email: str

try:
    User(id="not_an_int", age=-5, email=123)
except ValidationError as e:
    print(e.json())

Output (formatted; the ctx and url keys that Pydantic also emits are omitted):

[
  {
    "type": "int_parsing",
    "loc": ["id"],
    "msg": "Input should be a valid integer, unable to parse string as an integer",
    "input": "not_an_int"
  },
  {
    "type": "greater_than",
    "loc": ["age"],
    "msg": "Input should be greater than 0",
    "input": -5
  },
  {
    "type": "string_type",
    "loc": ["email"],
    "msg": "Input should be a valid string",
    "input": 123
  }
]

Key properties:

  • type: Error code
  • loc: Field location (tuple for nested fields)
  • msg: Human-readable message
  • input: The invalid value

Custom Error Messages

class User(BaseModel):
    age: int = Field(gt=0, lt=150, description="User age")
    
    @field_validator("age")
    @classmethod
    def validate_age(cls, v):
        if v < 13:
            raise ValueError("Users must be at least 13 years old")
        return v

8. Settings Management

Pydantic’s killer feature for application config:

from pydantic_settings import BaseSettings

class AppSettings(BaseSettings):
    database_url: str
    redis_host: str = "localhost"
    redis_port: int = 6379
    debug: bool = False
    api_key: str

# Reads from environment variables
settings = AppSettings()

Set environment variables:

export DATABASE_URL="postgresql://localhost/mydb"
export API_KEY="secret123"

Run Python:

print(settings.database_url)  # "postgresql://localhost/mydb"
print(settings.debug)  # False (default)

Custom Prefix

from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="APP_")
    
    database_url: str

# Now looks for APP_DATABASE_URL

.env File Support

from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")
    
    database_url: str
    api_key: str

.env file:

DATABASE_URL=postgresql://localhost/mydb
API_KEY=secret123

Nested Settings

class DatabaseSettings(BaseModel):
    host: str
    port: int
    name: str

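from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_nested_delimiter="__")

    database: DatabaseSettings

# Set via environment:
# DATABASE__HOST=localhost
# DATABASE__PORT=5432
# DATABASE__NAME=mydb

With env_nested_delimiter="__" set, a double underscore separates nested field names.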

Secrets Support

from pydantic import SecretStr

class AppSettings(BaseSettings):
    api_key: SecretStr

settings = AppSettings(api_key="secret123")
print(settings.api_key)  # **********  (repr shows SecretStr('**********'))
print(settings.api_key.get_secret_value())  # "secret123"

Prevents accidental logging of secrets.

9. Advanced Features

Computed Fields

Values derived from other fields:

from pydantic import computed_field

class Rectangle(BaseModel):
    width: float
    height: float
    
    @computed_field
    @property
    def area(self) -> float:
        return self.width * self.height

r = Rectangle(width=10, height=5)
print(r.area)  # 50.0
print(r.model_dump())  # {'width': 10.0, 'height': 5.0, 'area': 50.0}

Computed fields are included in serialization by default.

Private Attributes

Not validated, not serialized:

class Stateful(BaseModel):
    public_value: int
    _cache: dict = {}
    
    def compute(self):
        if "result" not in self._cache:
            self._cache["result"] = self.public_value * 2
        return self._cache["result"]

s = Stateful(public_value=10)
s._cache["custom"] = 123
print(s.model_dump())  # {'public_value': 10}  # _cache excluded

Root Models

Validate types that aren’t dicts:

from pydantic import RootModel

class ItemList(RootModel[list[int]]):
    pass

items = ItemList([1, 2, 3])
print(items.root)  # [1, 2, 3]

data = ItemList.model_validate(["1", "2", "3"])
print(data.root)  # [1, 2, 3] (coerced)

Useful for APIs that return arrays at the top level.

Discriminated Unions

Type-safe polymorphism:

from typing import Annotated, Literal, Union
from pydantic import Field

class Cat(BaseModel):
    type: Literal["cat"]
    meow_volume: int

class Dog(BaseModel):
    type: Literal["dog"]
    bark_volume: int

class Snake(BaseModel):
    type: Literal["snake"]
    length: float

# Attach the discriminator to the union itself so it also applies inside list[...]
Animal = Annotated[Union[Cat, Dog, Snake], Field(discriminator="type")]

class Zoo(BaseModel):
    animals: list[Animal]

zoo_data = {
    "animals": [
        {"type": "cat", "meow_volume": 8},
        {"type": "dog", "bark_volume": 10},
        {"type": "snake", "length": 2.5},
    ]
}

zoo = Zoo(**zoo_data)
for animal in zoo.animals:
    if isinstance(animal, Cat):
        print(f"Cat: {animal.meow_volume}")
    elif isinstance(animal, Dog):
        print(f"Dog: {animal.bark_volume}")

Pydantic looks at the type field to decide which model to use.

Generics

from typing import Generic, TypeVar

T = TypeVar("T")

class Response(BaseModel, Generic[T]):
    data: T
    status: int

# Use with different types
IntResponse = Response[int]
r1 = IntResponse(data=123, status=200)

UserResponse = Response[User]
r2 = UserResponse(data={"id": 1, "name": "Alice", "email": "alice@example.com"}, status=200)

Recursive Models

class TreeNode(BaseModel):
    value: int
    children: list["TreeNode"] = []

# Pydantic resolves the self-reference automatically; call TreeNode.model_rebuild()
# only if a forward reference cannot be resolved when the class is defined

tree = TreeNode(
    value=1,
    children=[
        TreeNode(value=2),
        TreeNode(value=3, children=[TreeNode(value=4)])
    ]
)

10. Performance and Internals

The Rust Core

Pydantic v2 rewrote core validation in Rust (pydantic-core). Benefits:

  • 5-50x faster validation than v1
  • Native JSON parsing: Faster than Python’s json module
  • Lower memory overhead: Efficient internal repr

Benchmark (parsing 10k user objects from JSON; approximate figures that vary with hardware and library versions):

  • Pydantic v1: ~500ms
  • Pydantic v2: ~20ms
  • Manual json.loads() + dict access: ~15ms (but no validation)

When Validation is Expensive

Complex validators can be slow:

class Expensive(BaseModel):
    data: list[int]
    
    @field_validator("data")
    @classmethod
    def unique_check(cls, v):
        if len(v) != len(set(v)):  # O(n) check
            raise ValueError("Items must be unique")
        return v

# For 1 million items, this is slow

Optimization: Use Pydantic’s built-in constraints when possible:

from typing import Set

class Better(BaseModel):
    data: Set[int]  # Duplicates are dropped silently (not rejected); order is not preserved

When to Avoid Pydantic

  • Hot inner loops: If you’re validating the same trusted data millions of times per second, validation overhead matters. Use dataclasses or plain classes.
  • Memory-constrained environments: Pydantic models use more memory than slotted dataclasses.
  • No external data: If your data is generated internally, dataclasses are simpler.

11. Migration Notes: v1 → v2

Major breaking changes:

Config class → model_config

v1:

class Model(BaseModel):
    class Config:
        frozen = True

v2:

class Model(BaseModel):
    model_config = ConfigDict(frozen=True)

Validators

v1: @validator v2: @field_validator

v1:

@validator("field")
def check_field(cls, v):
    return v

v2:

@field_validator("field")
@classmethod
def check_field(cls, v):
    return v

Serialization

v1: .dict(), .json() v2: .model_dump(), .model_dump_json()

Parsing

v1: .parse_obj(), .parse_raw() v2: .model_validate(), .model_validate_json()

12. Common Mistakes and Footguns

Assuming Validation Runs on Every Assignment

class Expensive(BaseModel):
    values: list[int]
    
    @field_validator("values")
    @classmethod
    def validate_values(cls, v):
        print("Validating!")  # Prints at construction only, by default
        return v

m = Expensive(values=[1, 2, 3])  # "Validating!"
m.values = [4, 5, 6]  # Nothing printed: assignment is not validated by default
m.values = "oops"     # Also accepted, silently

Validation on assignment is off by default. Set model_config = ConfigDict(validate_assignment=True) if reassignment must be validated, and expect every assignment to pay the validation cost.

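Mutable Defaults Are Copied, but Be Explicit

class Implicit(BaseModel):
    items: list = []  # Pydantic allows this and copies the default per instance

b1 = Implicit()
b2 = Implicit()
b1.items.append(1)
print(b2.items)  # []  # Not shared, unlike a plain class attribute

Field(default_factory=list) is still the clearer choice, and it matches the dataclass idiom.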

Over-Validating

Don’t validate internal data that’s already correct:

# BAD: Internal domain logic using Pydantic
def process_user(user: User):  # User is Pydantic model
    # Every User(...) construction re-validates data that is already known to be correct
    ...

# GOOD: Use Pydantic at boundaries, dataclasses internally

PART 3 — Dataclasses vs Pydantic

1. Feature Comparison Table

| Feature | Dataclasses | Pydantic |
|---|---|---|
| Purpose | Reduce boilerplate | Validate external data |
| Runtime validation | ❌ No | ✅ Yes |
| Type coercion | ❌ No | ✅ Yes |
| JSON parsing | ❌ Manual | ✅ Built-in |
| Serialization | asdict() (recursive, no type conversion) | model_dump() (rich) |
| Immutability | frozen=True | ConfigDict(frozen=True) |
| Slots | slots=True (3.10+) | ❌ No (uses __dict__) |
| Memory overhead | Low (especially with slots) | Higher |
| Speed | Fastest (no validation) | Fast (Rust core), but slower than no validation |
| Settings from env | ❌ Manual | ✅ BaseSettings |
| Error aggregation | ❌ N/A | ✅ All errors at once |
| Nested validation | ❌ No | ✅ Recursive |
| Field constraints | ❌ Manual via __post_init__ | ✅ Built-in (Field) |
| Standard library | ✅ Yes | ❌ No (third-party) |

2. Performance Comparison

Benchmark: Create 100k instances from dicts (approximate timings; measure on your own hardware)

# Dataclass (no validation)
@dataclass
class DC:
    id: int
    name: str
    value: float

for d in data:
    DC(**d)  # ~50ms

# Pydantic (with validation)
class PM(BaseModel):
    id: int
    name: str
    value: float

for d in data:
    PM(**d)  # ~200ms

# Pydantic (construct without validation)
for d in data:
    PM.model_construct(**d)  # ~70ms

Lessons:

  • Dataclasses are faster when data is trusted
  • In this measurement, Pydantic validation costs roughly 4x the dataclass construction time
  • model_construct() bypasses validation for internal use
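The trade-off behind model_construct() is easiest to see side by side with normal construction. In this sketch (Pydantic v2; Reading is a hypothetical model), the validated path coerces strings to numbers, while model_construct() stores whatever it is given.

```python
from pydantic import BaseModel, ValidationError


class Reading(BaseModel):  # hypothetical example model
    id: int
    value: float


# Normal construction validates and coerces.
validated = Reading(id="7", value="1.5")

# model_construct() skips validation entirely: the strings are stored as-is.
unchecked = Reading.model_construct(id="7", value="1.5")

# Bad input is only caught on the validated path.
try:
    Reading(id="seven", value=1.5)
    rejected = False
except ValidationError:
    rejected = True
```

Because nothing is checked, model_construct() is only appropriate for data that has already been validated or that your own code produced.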

3. Correct Use Cases

Use Dataclasses When:

Internal domain models

@dataclass(frozen=True, slots=True)
class OrderLine:
    product_id: int
    quantity: int
    unit_price: Decimal

Performance-critical paths

# Processing millions of records
@dataclass(slots=True)
class LogEntry:
    timestamp: float
    level: str
    message: str

Simple DTOs between layers

@dataclass
class ServiceResult:
    success: bool
    data: Any
    error: str | None = None

Use Pydantic When:

API request/response models

class CreateUserRequest(BaseModel):
    username: str = Field(min_length=3, max_length=20)
    email: str
    age: int = Field(ge=13)

Configuration from environment

class AppConfig(BaseSettings):
    database_url: str
    api_key: SecretStr
    debug: bool = False

Parsing external data (JSON, YAML, etc.)

class APIResponse(BaseModel):
    user_id: int
    created_at: datetime

response = APIResponse.model_validate_json(api_response_text)

Validation boundaries

# Validate at system edge
def create_user(request: CreateUserRequest) -> User:
    # request is validated
    # Convert to internal domain model (dataclass)
    return User(id=generate_id(), username=request.username)

4. Hybrid Patterns

Pattern 1: Pydantic for Input, Dataclasses for Domain

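# API layer
class OrderLineRequest(BaseModel):
    product_id: int
    quantity: int
    unit_price: Decimal

class CreateOrderRequest(BaseModel):
    customer_id: int
    items: list[OrderLineRequest]  # typed items get validated; list[dict] would not

# Domain layer
@dataclass(frozen=True)
class Order:
    id: int
    customer_id: int
    items: tuple[OrderLine, ...]  # a tuple keeps the frozen order immutable
    created_at: datetime

# Service layer
def create_order(request: CreateOrderRequest) -> Order:
    # request was validated at the boundary; convert it to domain objects
    items = tuple(OrderLine(**item.model_dump()) for item in request.items)
    return Order(
        id=generate_id(),
        customer_id=request.customer_id,
        items=items,
        created_at=datetime.now(timezone.utc)
    )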

Pattern 2: Pydantic Settings + Dataclass Models

# Config with Pydantic
class DatabaseConfig(BaseSettings):
    host: str
    port: int
    name: str

# Runtime models with dataclasses
@dataclass(slots=True)
class User:
    id: int
    name: str

Pattern 3: Dataclasses with Pydantic Validation

Use pydantic.dataclasses for dataclass syntax with Pydantic validation:

from pydantic.dataclasses import dataclass as pydantic_dataclass

@pydantic_dataclass
class User:
    id: int
    name: str
    age: int

# This is a dataclass, but with Pydantic validation!
User(id="123", name="Alice", age="30")  # Coerces id and age to int

Trade-off: You get validation but lose some performance.

5. Decision Framework

                Is data from external source?
                          |
                    Yes   |   No
                          |
           +--------------+--------------+
           |                             |
      Use Pydantic                 Do you need validation?
           |                             |
           |                       Yes   |   No
           |                             |
           |              +--------------+--------------+
           |              |                             |
           |        Pydantic or                   Use dataclasses
           |    dataclass + __post_init__              |
           |              |                             |
           |              |                             |
           +-------> Validation boundary <--------------+

Questions to ask:

  1. Is the data coming from users, APIs, files, or environment? → Pydantic
  2. Do I need type coercion? → Pydantic
  3. Is this a performance bottleneck? → Dataclasses (especially with slots)
  4. Do I need settings management? → Pydantic BaseSettings
  5. Is this an internal domain model used everywhere? → Dataclasses
  6. Do I need comprehensive validation rules? → Pydantic
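
The "dataclass + __post_init__" branch in the diagram can be sketched with the standard library alone. The example first shows that a plain dataclass accepts a wrongly typed value, then adds hand-written checks. Unchecked and Quantity are hypothetical names, and the specific rules (int only, positive) are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Unchecked:
    quantity: int

@dataclass(frozen=True)
class Quantity:
    value: int

    def __post_init__(self):
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(self.value, bool) or not isinstance(self.value, int):
            raise TypeError(f"value must be int, got {type(self.value).__name__}")
        if self.value <= 0:
            raise ValueError("value must be positive")

# Annotations are not enforced at runtime: this succeeds
unchecked = Unchecked(quantity="three")
print(type(unchecked.quantity).__name__)  # str

errors = []
for raw in ("3", 0):
    try:
        Quantity(raw)
    except (TypeError, ValueError) as exc:
        errors.append(type(exc).__name__)

print(errors)       # ['TypeError', 'ValueError']
print(Quantity(3))  # Quantity(value=3)
```

Note that this version rejects "3" rather than coercing it, and it stops at the first failure. If you need coercion or aggregated errors, that is the point where Pydantic becomes the simpler option.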

PART 4 — Real-World Patterns

1. API Request/Response Models

FastAPI with Pydantic:

from datetime import datetime

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

class CreateUserRequest(BaseModel):
    username: str = Field(min_length=3, max_length=20, pattern=r"^[a-zA-Z0-9_]+$")
    email: str
    age: int = Field(ge=13, le=120)

class UserResponse(BaseModel):
    id: int
    username: str
    email: str
    created_at: datetime

@app.post("/users", response_model=UserResponse)
def create_user(request: CreateUserRequest):
    # Request is automatically validated
    user = save_user(request)  # Internal logic
    return UserResponse(
        id=user.id,
        username=user.username,
        email=user.email,
        created_at=user.created_at
    )

Key points:

  • Pydantic validates incoming JSON automatically
  • response_model validates and filters outgoing data, so fields not declared on UserResponse are dropped
  • Errors return 422 with detailed validation messages

2. Config Systems

Three-tier config with Pydantic:

from pydantic import BaseModel, Field, SecretStr
from pydantic_settings import BaseSettings, SettingsConfigDict

class DatabaseConfig(BaseModel):
    host: str = "localhost"
    port: int = 5432
    name: str
    user: str
    password: SecretStr
    
    @property
    def url(self) -> str:
        pwd = self.password.get_secret_value()
        return f"postgresql://{self.user}:{pwd}@{self.host}:{self.port}/{self.name}"

class RedisConfig(BaseModel):
    host: str = "localhost"
    port: int = 6379
    db: int = 0

class AppSettings(BaseSettings):
    # env_nested_delimiter is a settings option, so use SettingsConfigDict
    model_config = SettingsConfigDict(env_nested_delimiter="__")
    
    env: str = "development"
    debug: bool = False
    database: DatabaseConfig
    redis: RedisConfig
    secret_key: SecretStr
    api_keys: list[str] = Field(default_factory=list)

# Load from environment
# DATABASE__HOST=localhost
# DATABASE__PORT=5432
# DATABASE__NAME=myapp
# DATABASE__USER=postgres
# DATABASE__PASSWORD=secret
# REDIS__HOST=redis
# SECRET_KEY=supersecret
# API_KEYS=["key1","key2"]

settings = AppSettings()
print(settings.database.url)

3. Domain Models

Internal domain models with dataclasses:

from dataclasses import dataclass, field
from decimal import Decimal
from datetime import datetime

@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str = "USD"
    
    def __add__(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError(f"Cannot add {self.currency} and {other.currency}")
        return Money(self.amount + other.amount, self.currency)

@dataclass(frozen=True)
class OrderLine:
    product_id: int
    product_name: str
    quantity: int
    unit_price: Money
    
    @property
    def total(self) -> Money:
        return Money(self.unit_price.amount * self.quantity, self.unit_price.currency)

@dataclass(frozen=True)
class Order:
    id: int
    customer_id: int
    lines: tuple[OrderLine, ...]
    created_at: datetime
    
    @property
    def total(self) -> Money:
        if not self.lines:
            return Money(Decimal(0))
        return sum((line.total for line in self.lines[1:]), start=self.lines[0].total)

# Use frozen dataclasses for immutable domain objects
# Use tuples instead of lists for truly immutable collections
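
A short usage sketch shows what frozen=True buys for a value object like Money: equality and hashing by value, and a hard error on mutation. The class is repeated here so the snippet runs on its own; the amounts are arbitrary.

```python
from dataclasses import dataclass, FrozenInstanceError
from decimal import Decimal

@dataclass(frozen=True)
class Money:
    amount: Decimal
    currency: str = "USD"

    def __add__(self, other: "Money") -> "Money":
        if self.currency != other.currency:
            raise ValueError(f"Cannot add {self.currency} and {other.currency}")
        return Money(self.amount + other.amount, self.currency)

price = Money(Decimal("9.99"))
total = price + Money(Decimal("0.01"))
print(total)  # Money(amount=Decimal('10.00'), currency='USD')

# Value semantics: equal fields mean equal objects and equal hashes
print(total == Money(Decimal("10.00")))        # True
print(len({price, Money(Decimal("9.99"))}))    # 1

# Mutation is blocked
try:
    price.amount = Decimal("0")
    mutated = True
except FrozenInstanceError:
    mutated = False

# Mixed currencies are refused by __add__
try:
    price + Money(Decimal("1"), "EUR")
    mixed_allowed = True
except ValueError:
    mixed_allowed = False

print(mutated, mixed_allowed)  # False False
```

Because frozen dataclasses with the default eq=True are hashable, value objects like this can be used as dict keys or set members.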

4. Validation Boundaries

Clean architecture with validation at edges:

# == API Layer (Pydantic) ==
class CreateOrderAPI(BaseModel):
    customer_id: int
    items: list[dict[str, int]]  # e.g. {"product_id": 1, "quantity": 2}

# == Application Layer ==
# OrderLineData comes first because CreateOrderCommand's annotation refers to it
@dataclass
class OrderLineData:
    product_id: int
    quantity: int

@dataclass
class CreateOrderCommand:
    customer_id: int
    items: list[OrderLineData]

class OrderService:
    def create_order(self, command: CreateOrderCommand) -> Order:
        # Business logic with validated data
        lines = [
            OrderLine(
                product_id=item.product_id,
                product_name=self.get_product_name(item.product_id),
                quantity=item.quantity,
                unit_price=self.get_product_price(item.product_id)
            )
            for item in command.items
        ]
        return Order(
            id=self.generate_id(),
            customer_id=command.customer_id,
            lines=tuple(lines),
            created_at=datetime.now(timezone.utc)
        )

# == API Handler ==
@app.post("/orders")
def create_order_endpoint(request: CreateOrderAPI):
    # Validate at boundary
    command = CreateOrderCommand(
        customer_id=request.customer_id,
        items=[OrderLineData(**item) for item in request.items]
    )
    # Pass validated data to domain
    order = order_service.create_order(command)
    return OrderResponse.from_domain(order)

Pattern:

  1. API layer: Pydantic validates external input
  2. Application layer: Simple dataclasses (commands/queries)
  3. Domain layer: Rich dataclasses with business logic
  4. No validation inside domain: Data is pre-validated

5. Large-Scale Codebase Recommendations

Directory Structure

project/
├── api/
│   ├── models/          # Pydantic request/response models
│   │   ├── requests.py
│   │   └── responses.py
│   └── routes/
├── domain/
│   ├── entities/        # Dataclass domain entities
│   ├── value_objects/   # Frozen dataclass value objects
│   └── services/
├── infrastructure/
│   ├── database/
│   └── external_apis/   # Pydantic models for external APIs
└── config/
    └── settings.py      # Pydantic BaseSettings

Naming Conventions

  • Pydantic models: CreateUserRequest, UserResponse, ExternalAPIModel
  • Dataclass entities: User, Order, Product
  • Value objects: Email, Money, Address (frozen dataclasses)

Type Hints

# Use Protocol for interfaces (not BaseModel or dataclass)
from typing import Protocol

class UserRepository(Protocol):
    def find_by_id(self, user_id: int) -> User | None: ...
    def save(self, user: User) -> None: ...

# Implementations use dataclasses
@dataclass
class InMemoryUserRepository:
    users: dict[int, User] = field(default_factory=dict)
    
    def find_by_id(self, user_id: int) -> User | None:
        return self.users.get(user_id)
    
    def save(self, user: User) -> None:
        self.users[user.id] = user

Testing

# Use dataclasses for test fixtures
from dataclasses import dataclass, replace

@dataclass
class UserBuilder:
    id: int = 1
    name: str = "Test User"
    email: str = "test@example.com"
    
    def with_id(self, id: int) -> "UserBuilder":
        # replace() returns a copy with the given fields changed
        return replace(self, id=id)
    
    def build(self) -> User:
        return User(id=self.id, name=self.name, email=self.email)

# In tests
def test_user_service():
    user = UserBuilder().with_id(42).build()
    result = service.process(user)
    assert result.success

Performance Guidelines

  1. Hot paths: Use slotted dataclasses
  2. API boundaries: Pydantic is fine (amortized over network I/O)
  3. Bulk processing: Consider model_construct() for Pydantic models when re-validating trusted data
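  4. Serialization: In Pydantic v2, model_dump_json() and model_validate_json() already run in the Rust core, and the v1 json_dumps/json_loads config hooks no longer exist. If you still want orjson, call it on model_dump(mode="json")

import orjson
from pydantic import BaseModel

class FastModel(BaseModel):
    id: int
    name: str

model = FastModel(id=1, name="example")

# Built-in path: serialized by pydantic-core, returns str
as_text = model.model_dump_json()

# orjson path: dump to JSON-compatible Python objects first, returns bytes
as_bytes = orjson.dumps(model.model_dump(mode="json"))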

Migration Strategy

Migrating a large codebase from ad-hoc dicts to typed models:

  1. Start with API boundaries: Add Pydantic models to all endpoints
  2. Config next: Move to BaseSettings
  3. Domain models: Gradually introduce dataclasses for core entities
  4. Don’t refactor everything: Focus on high-value areas
  5. Use TypedDict as intermediate: When full model migration is too much

from typing import TypedDict

class UserDict(TypedDict):
    id: int
    name: str
    email: str

# Later, upgrade to dataclass or Pydantic
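
The upgrade step can be sketched as follows, assuming the dataclass keeps the same field names as the TypedDict. It also shows the limitation that motivates upgrading: a TypedDict is an ordinary dict at runtime, so nothing checks its contents until a real class is constructed. The sample values are made up.

```python
from dataclasses import dataclass
from typing import TypedDict

class UserDict(TypedDict):
    id: int
    name: str
    email: str

@dataclass
class User:
    id: int
    name: str
    email: str

# At runtime a TypedDict value is just a dict; only type checkers see the shape
raw: UserDict = {"id": 1, "name": "Ada", "email": "ada@example.com"}
print(type(raw).__name__)  # dict

# Matching field names make the upgrade a one-liner
user = User(**raw)
print(user)  # User(id=1, name='Ada', email='ada@example.com')

# The dataclass constructor at least catches missing keys
incomplete = {"id": 2, "name": "Grace"}
try:
    User(**incomplete)
    missing_detected = False
except TypeError:
    missing_detected = True
print(missing_detected)  # True
```

The dataclass still does not check value types, so data arriving from outside the process is better routed through a Pydantic model first.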

Conclusion

Dataclasses are for reducing boilerplate in trusted internal code. They’re fast, memory-efficient (with slots), and part of the standard library. Use them for domain models, DTOs, and anywhere you need structured data without validation overhead.

Pydantic is for validating data from external sources. It coerces types, aggregates errors, parses JSON natively, and handles settings management. Use it at system boundaries: APIs, configs, file parsing.

The right approach: Don’t choose one or the other. Use both. Pydantic at the edges, dataclasses in the core. This gives you safety where you need it and performance where it matters.

Key takeaway: Type hints alone don’t validate. Dataclasses enforce structure at development time (via type checkers). Pydantic enforces correctness at runtime. Know which problem you’re solving.
