Engineering Reliable AI Agents: Why Programmatic Tests Must Replace Prompt-Only Control Flow

Babysitter, Auditor, Prayer. Or Tests.

Michael Tuszynski identifies a critical failure point in agent engineering: treating prompt chains as if they were deterministic control flow. As complexity grows, these systems break down because a step can report ‘Success’ while its output is hallucinated, so verification has to move out of the prompt and into code.

Why This Matters

In practice, an LLM behaves like a flaky external API: instructions in a prompt act as suggestions rather than commands. Relying on ‘vibe-accepting’ outputs or on manual human oversight does not scale and leaves risks unmanaged. Runtime assertions and schema checks turn LLM outputs into trusted inputs, and they let engineers reuse existing infrastructure such as CI/CD and assertion libraries to gate deployments. This approach moves past the ‘prompt chain’ ceiling by enforcing a strict contract before the next code branch executes.
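A minimal sketch of such a contract check, using only the Python standard library. The `TicketTriage` fields, the allowed categories, and the `parse_triage` name are all hypothetical and chosen for illustration; the source does not prescribe a specific schema or library.

```python
import json
from dataclasses import dataclass

# Hypothetical contract for one agent step; field names are illustrative only.
ALLOWED_CATEGORIES = {"billing", "bug", "feature_request"}


@dataclass(frozen=True)
class TicketTriage:
    category: str
    priority: int


class ContractViolation(ValueError):
    """Raised when model output does not satisfy the step's contract."""


def parse_triage(raw: str) -> TicketTriage:
    """Turn raw model text into a trusted value, or raise ContractViolation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("expected a JSON object")
    if set(data) != {"category", "priority"}:
        raise ContractViolation(f"unexpected keys: {sorted(data)}")
    category, priority = data["category"], data["priority"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ContractViolation(f"unknown category: {category!r}")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise ContractViolation(f"priority must be an integer: {priority!r}")
    if not 1 <= priority <= 4:
        raise ContractViolation(f"priority out of range: {priority}")
    return TicketTriage(category, priority)
```

Downstream code then branches only on a `TicketTriage` value, never on raw model text, so a reply such as a bare "Success" fails at the boundary instead of flowing onward.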

Key Insights

Deterministic control flow: Prompt chains fail because they lack the programmatic verification required for complex software systems (Michael Tuszynski, 2026).
Structured outputs as schema assertions: Using tool-use or structured output APIs acts as a contract at the API boundary, rejecting malformed data before it reaches application logic.
Evals as regression tests: AI evaluation suites serve as versioned test suites with pass/fail thresholds; a run that misses its threshold should block deployment.
Blast-radius declarations: A runtime check that each tool call stays within the scope declared for the task prevents agents from exceeding authorized actions, such as unauthorized database deletions.
The Honesty Test: If an engineer cannot write a programmatic assertion that gates the next step after an LLM call, the system is operating on ‘prayer’ rather than engineering principles.
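Two of the insights above, blast-radius declarations and evals as deployment gates, can be sketched in a few lines. This is an illustration under assumptions, not the author's implementation: `TaskDeclaration`, `check_tool_scope`, `eval_gate`, and the tool names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class TaskDeclaration:
    """Hypothetical declaration of what a task is allowed to touch."""
    name: str
    allowed_tools: FrozenSet[str]


class ScopeViolation(PermissionError):
    """Raised when an agent requests a tool outside the declared scope."""


def check_tool_scope(task: TaskDeclaration, requested_tool: str) -> None:
    """Run before every tool call; raises instead of trusting the prompt."""
    if requested_tool not in task.allowed_tools:
        raise ScopeViolation(
            f"task {task.name!r} may not call {requested_tool!r}"
        )


def eval_gate(results: List[bool], threshold: float) -> bool:
    """Return True only if the eval pass rate meets the threshold.

    An empty result list fails the gate: no evidence is not a pass.
    """
    if not results:
        return False
    return sum(results) / len(results) >= threshold
```

In a CI job, a false return from `eval_gate` would translate into a non-zero exit code so the pipeline stops, the same way a failing unit test would.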

Practical Applications

Use case: Implementing dry-runs for destructive operations, such as Railway volume deletions, to ensure human sign-off blocks unauthorized calls. Pitfall: Relying on emphatic system prompts instead of runtime assertions, leading to irreversible data loss.
Use case: Pairing negative prompting with output filters that run predicate checks on responses before they move downstream. Pitfall: Accepting responses without verifying intermediate reasoning (chain-of-thought), allowing implicit contract violations to go unnoticed.
Use case: Wiring structured outputs into existing CI/CD pipelines to treat LLM responses as standard external API data. Pitfall: Treating LLM outputs as ‘special’ and bypassing traditional assertion libraries, resulting in silent failures.
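The dry-run use case above might look like the following sketch. It assumes a two-phase flow in which planning never performs the action and execution refuses to run without a recorded approver. All names are hypothetical, and `delete_fn` is a stand-in for whatever real deletion call a platform provides; no actual Railway API is shown.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DeletePlan:
    """Describes a destructive action without performing it."""
    volume_id: str
    approved_by: Optional[str] = None


class ApprovalRequired(RuntimeError):
    """Raised when a destructive call is attempted without sign-off."""


def plan_delete(volume_id: str) -> DeletePlan:
    """Dry run: produce a reviewable plan and touch nothing."""
    return DeletePlan(volume_id)


def approve(plan: DeletePlan, reviewer: str) -> DeletePlan:
    """Record human sign-off by returning a new, approved plan."""
    return DeletePlan(plan.volume_id, approved_by=reviewer)


def execute_delete(plan: DeletePlan, delete_fn: Callable[[str], None]) -> None:
    """Perform the deletion only if the plan carries an approver."""
    if not plan.approved_by:
        raise ApprovalRequired(f"no sign-off for volume {plan.volume_id!r}")
    delete_fn(plan.volume_id)
```

The guard lives in code that the agent cannot talk its way around, which is the difference between a runtime assertion and an emphatic system prompt.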

References:

https://dev.to/michaeltuszynski/babysitter-auditor-prayer-or-tests-3cgi
