How to Build Traceable and Evaluated LLM Workflows with Promptflow and Prompty
These articles are AI-generated summaries. Please check the original sources for full details.
How to Build Traceable and Evaluated LLM Workflows Using Promptflow, Prompty, and OpenAI
Sana Hassan details a complete production-style workflow that uses Promptflow and GPT-4o-mini to bridge the gap between simple chat interfaces and enterprise pipelines. The system implements a class-based flex flow that supports hybrid reasoning: deterministic code computes a hint, which is injected into the prompt before the model responds.
Why This Matters
Moving LLMs from prototype to production takes more than prompts; it requires a structured framework for observability and performance measurement. Combining deterministic preprocessing with LLM-as-a-judge evaluation lets developers offset the stochasticity of model output, check responses against expected facts, and replace much of the manual verification with automated accuracy metrics.
Key Insights
- Promptflow’s Flex Flow: Enables class-based pipelines in which __init__ arguments become flow parameters, as demonstrated in the ResearchAssistant implementation.
- Hybrid Reasoning: The safe_calc tool evaluates arithmetic deterministically, so mathematical results are computed in code before the LLM generates a response rather than being left to the model.
- LLM-as-a-Judge: The system uses a specialized Judge Prompty to grade responses against expected facts, returning structured JSON with binary scores and reasoning.
- Traceability: The start_trace function turns on trace collection, so each traced step of a single-query or batch run is recorded and can be audited afterwards, including in environments like Colab.
- Prompty Asset Management: Externalizing prompt logic into .prompty files separates model configuration and system instructions from the core Python execution logic.
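The summary does not show the body of ResearchAssistant, so the following is only a minimal sketch of how a class-based flow with hint injection might be organized. The regular expression used to spot arithmetic, the returned dictionary shape, and the `llm` constructor argument are all assumptions made for illustration; in the actual flow the callable would presumably be the loaded researcher.prompty, and the methods would carry the @trace decorator.

```python
import re
from typing import Callable

# Characters the article's safe_calc tool allows in an expression.
_ALLOWED = set("0123456789+-*/(). ")
# Hypothetical heuristic: digits, parentheses, and at least one operator.
_EXPR = re.compile(r"[\d(][\d\s+\-*/().]*[+\-*/][\d\s+\-*/().]*[\d)]")


def safe_calc(expression: str) -> str:
    """Evaluate arithmetic only if every character is whitelisted."""
    if not set(expression) <= _ALLOWED:
        return "unsafe"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"error:{e}"


class ResearchAssistant:
    """Sketch of a class-based flow; constructor arguments act as flow parameters."""

    def __init__(self, llm: Callable[..., str]):
        # Assumption: llm accepts question= and hint= keyword arguments,
        # mirroring the inputs declared in researcher.prompty.
        self.llm = llm

    def __call__(self, question: str) -> dict:
        hint = ""
        match = _EXPR.search(question)
        if match:
            # Compute the result in code, then pass it to the model as a hint.
            expr = match.group().strip()
            hint = f"{expr} = {safe_calc(expr)}"
        answer = self.llm(question=question, hint=hint)
        return {"answer": answer, "hint": hint}
```

Injecting the model as a callable keeps the sketch runnable without an API key: a stub function can stand in for the Prompty while the hint logic is exercised.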
Working Examples
Defining a structured Prompty file with model configuration and Jinja2 templating for inputs.
# WORK_DIR is assumed to be a pathlib.Path defined earlier in the tutorial.
(WORK_DIR / "researcher.prompty").write_text("""---
name: Researcher
description: Concise research assistant.
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0.2
    max_tokens: 350
inputs:
  question: {type: string}
  hint: {type: string, default: ""}
---
system:
You are a precise research assistant. Answer in 1-3 sentences. If a `hint` is given, weave it in.
user:
Q: {{question}}
{% if hint %}Hint: {{hint}}{% endif %}
""")
A deterministic tool used within the flow to handle arithmetic operations safely before calling the LLM.
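from promptflow.tracing import trace

@trace
def safe_calc(expression: str) -> str:
    # Reject anything that is not a digit, arithmetic operator, parenthesis, dot, or space.
    if not set(expression) <= set("0123456789+-*/(). "):
        return "unsafe"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"error:{e}"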
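The character whitelist blocks names and calls, but it still lets `**` through, so an input such as a tower of exponents can consume a great deal of CPU and memory. As an alternative that is not part of the original tutorial, the sketch below walks the parsed syntax tree and accepts only numbers, the four basic operators, and unary signs. The name `ast_calc` is hypothetical, and the return strings simply mirror the "unsafe" and "error:" conventions of safe_calc.

```python
import ast
import operator

# Only these binary and unary operators are evaluated; everything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST):
    """Recursively evaluate a whitelisted arithmetic syntax tree."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def ast_calc(expression: str) -> str:
    """Evaluate +, -, *, / arithmetic without calling eval."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_eval_node(tree.body))
    except ValueError:
        # Raised by _eval_node for names, calls, exponentiation, and so on.
        return "unsafe"
    except Exception as e:
        # Syntax errors, division by zero, and similar failures.
        return f"error:{e}"
```

Because exponentiation, names, and function calls never reach an evaluator, the set of accepted inputs is defined by the operator tables rather than by which characters happen to be allowed.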
Executing an evaluation run that links the outputs of a base run to a judge model for automated scoring.
# pf is assumed to be a promptflow PFClient; base_run is the earlier batch run.
eval_run = pf.run(
    flow=str(WORK_DIR / "eval.flex.yaml"),
    data=str(data_path),
    run=base_run,
    column_mapping={
        "question": "${data.question}",
        "expected": "${data.expected}",
        "answer": "${run.outputs.answer}",
    },
    stream=True,
)
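Each row of the evaluation run carries a judge reply that must be parsed before an accuracy figure can be computed. The summary does not show that step, so the following is a hedged sketch: the field names "score" and "reason" and both function names are assumptions, chosen to match the description of structured JSON with a binary score and reasoning.

```python
import json
from typing import Iterable


def parse_judgement(raw: str) -> dict:
    """Parse one judge reply into {"score": 0 or 1, "reason": str}."""
    try:
        # Tolerate text around the JSON object by slicing from the first "{" to the last "}".
        start, end = raw.index("{"), raw.rindex("}") + 1
        data = json.loads(raw[start:end])
        score = 1 if int(data["score"]) == 1 else 0
        return {"score": score, "reason": str(data.get("reason", ""))}
    except (ValueError, KeyError, TypeError) as e:
        # An unparseable reply is counted as a failure instead of crashing the batch.
        return {"score": 0, "reason": f"unparseable judge output: {e}"}


def accuracy(judgements: Iterable[dict]) -> float:
    """Fraction of judgements with score 1; 0.0 for an empty batch."""
    scores = [j["score"] for j in judgements]
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring unparseable replies as 0 is a conservative choice: it keeps the metric from being inflated by malformed judge output, at the cost of occasionally penalizing a correct answer.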
Practical Applications
- Use Case: Research assistants using GPT-4o-mini for concise summaries with external data injection. Pitfall: Passing unvalidated strings to eval; mitigated here by a character whitelist, although the whitelist still admits ** (two * characters), so very expensive exponentiations remain possible.
- Use Case: Batch processing of Q&A datasets to compute accuracy metrics across enterprise data. Pitfall: Unparseable LLM judge outputs; handled via try-except blocks and JSON schema enforcement.
- Use Case: Production monitoring of LLM steps using Promptflow Tracing. Pitfall: Lack of visibility into intermediate steps of the pipeline; addressed by the @trace decorator, which records calls to each decorated function.