How to Build Traceable and Evaluated LLM Workflows with Promptflow and Prompty

Sana Hassan details a production-style workflow that uses Promptflow and GPT-4o-mini to bridge the gap between simple chat interfaces and enterprise pipelines. The system implements a class-based flex flow that supports hybrid reasoning: deterministic code computes hints, which are injected into the prompt before the model responds.

Why This Matters

Moving LLMs from prototype to production requires more than prompts; it requires a structured framework for observability and performance measurement. Combining deterministic preprocessing with LLM-as-a-judge evaluation lets developers mitigate the stochasticity of model output and check that responses contain specific expected facts. Automated accuracy metrics then reduce manual verification overhead.

Key Insights

Promptflow’s Flex Flow: Enables class-based pipelines where __init__ arguments become flow parameters, as demonstrated in the ResearchAssistant implementation.
Hybrid Reasoning: The safe_calc function is a deterministic arithmetic tool: expressions are computed in Python before the LLM generates a response, and the result is passed to the model as a hint.
LLM-as-a-Judge: The system uses a specialized Judge Prompty to grade responses against expected facts, returning structured JSON with binary scores and reasoning.
Traceability: The start_trace function records every execution step, providing a full audit trail for both single-query and batch-run operations in environments like Colab.
Prompty Asset Management: Externalizing prompt logic into .prompty files separates model configuration and system instructions from the core Python execution logic.
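
The class-based pattern and the hint injection described in this list can be sketched in plain Python. This is a minimal sketch under assumptions, not the article's ResearchAssistant: the `llm` callable stands in for a loaded Prompty, and `ResearchAssistantSketch`, `extract_expression`, and `fake_llm` are hypothetical names introduced for illustration. The compact `safe_calc` mirrors the whitelist approach the article describes.

```python
import re
from typing import Callable


def safe_calc(expression: str) -> str:
    """Whitelist-based calculator, mirroring the approach described in the article."""
    if not set(expression) <= set("0123456789+-*/(). "):
        return "unsafe"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"error:{e}"


def extract_expression(question: str) -> str:
    """Hypothetical helper: return the first arithmetic-looking span, or ''."""
    match = re.search(r"\d[\d\s+\-*/().]*", question)
    return match.group(0).strip() if match else ""


class ResearchAssistantSketch:
    """Class-based flow: constructor arguments play the role of flow parameters."""

    def __init__(self, llm: Callable[..., str], calc: Callable[[str], str]):
        self.llm = llm    # assumption: stands in for a loaded Prompty
        self.calc = calc  # deterministic arithmetic tool

    def __call__(self, question: str) -> dict:
        expr = extract_expression(question)
        hint = ""
        # Only compute a hint when the span contains an operator.
        if expr and any(op in expr for op in "+-*/"):
            result = self.calc(expr)
            if result != "unsafe" and not result.startswith("error:"):
                hint = f"{expr} = {result}"
        # The computed hint goes into the prompt, not into the response.
        answer = self.llm(question=question, hint=hint)
        return {"answer": answer, "hint": hint}


def fake_llm(question: str, hint: str = "") -> str:
    """Hypothetical stub so the sketch runs without an API key."""
    return f"[hint: {hint}]" if hint else "[no hint]"


flow = ResearchAssistantSketch(llm=fake_llm, calc=safe_calc)
with_hint = flow("What is 12 * (3 + 4)?")
without_hint = flow("What is the capital of France?")
```

Passing the model and the calculator through the constructor keeps the per-question call signature small and makes the flow easy to exercise with stubs.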

Working Examples

Defining a structured Prompty file with model configuration and Jinja2 templating for inputs.

(WORK_DIR / "researcher.prompty").write_text("""---
name: Researcher
description: Concise research assistant.
model:
  api: chat
  configuration:
    type: openai
    connection: open_ai_connection
    model: gpt-4o-mini
  parameters:
    temperature: 0.2
    max_tokens: 350
inputs:
  question: {type: string}
  hint: {type: string, default: ""}
---
system:
You are a precise research assistant. Answer in 1-3 sentences. If a `hint` is given, weave it in.
user:
Q: {{question}}
{% if hint %}Hint: {{hint}}{% endif %}
""")

A deterministic tool used within the flow to handle arithmetic operations safely before calling the LLM.

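from promptflow.tracing import trace

@trace
def safe_calc(expression: str) -> str:
    # Reject anything beyond digits, basic operators, parentheses, dots, and spaces.
    if not set(expression) <= set("0123456789+-*/(). "):
        return "unsafe"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"error:{e}"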
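
The character whitelist still hands the string to eval, so an input built only from allowed characters, such as 9**9**9, can consume a lot of CPU and memory. One alternative, offered here as an assumption rather than the article's method, parses the expression with the standard-library ast module and evaluates only numeric literals and the four basic operators. The names `ast_calc` and `_eval_node` are hypothetical.

```python
import ast
import operator

# Only these operators are evaluated; everything else (including **) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST) -> float:
    """Recursively evaluate a parsed expression, raising ValueError on anything unexpected."""
    if isinstance(node, ast.Expression):
        return _eval_node(node.body)
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported expression")


def ast_calc(expression: str) -> str:
    """Return the result as a string, 'unsafe' for rejected input, or 'error:...'."""
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_eval_node(tree))
    except ValueError:
        return "unsafe"
    except Exception as e:  # SyntaxError, ZeroDivisionError, RecursionError, ...
        return f"error:{e}"
```

The return convention matches safe_calc (a result string, "unsafe", or an "error:" prefix), so a flow that branches on those values would not need to change.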

Executing an evaluation run that links the outputs of a base run to a judge model for automated scoring.

eval_run = pf.run(
    flow=str(WORK_DIR / "eval.flex.yaml"),
    data=str(data_path),
    run=base_run,
    column_mapping={
        "question": "${data.question}",
        "expected": "${data.expected}",
        "answer": "${run.outputs.answer}",
    },
    stream=True,
)
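
To make the column mapping concrete, the plain-Python sketch below resolves the same `${data...}` and `${run.outputs...}` references for rows paired by position. It only illustrates the idea and is not Promptflow's implementation; `resolve_mapping` and the sample rows are hypothetical.

```python
import re

# Matches "${data.<column>}" or "${run.outputs.<column>}".
_REF = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(column_mapping: dict, data_row: dict, run_output: dict) -> dict:
    """Build one evaluator input from a data row and the matching base-run output."""
    resolved = {}
    for name, ref in column_mapping.items():
        match = _REF.match(ref)
        if match is None:
            # In this sketch, anything that is not a reference is kept as a literal.
            resolved[name] = ref
            continue
        source, column = match.groups()
        resolved[name] = data_row[column] if source == "data" else run_output[column]
    return resolved


column_mapping = {
    "question": "${data.question}",
    "expected": "${data.expected}",
    "answer": "${run.outputs.answer}",
}
data_rows = [{"question": "What is 2 + 2?", "expected": "4"}]
run_outputs = [{"answer": "2 + 2 equals 4."}]

# Each judge input combines the dataset row with the base run's answer for that row.
judge_inputs = [
    resolve_mapping(column_mapping, row, out)
    for row, out in zip(data_rows, run_outputs)
]
```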

Practical Applications

Use Case: Research assistants using GPT-4o-mini for concise summaries with external data injection. Pitfall: Passing unvalidated strings to eval; mitigated here by character-set whitelisting.
Use Case: Batch processing of Q&A datasets to compute accuracy metrics across enterprise data. Pitfall: Unparseable LLM judge outputs; handled via try-except blocks and JSON schema enforcement.
Use Case: Production monitoring of LLM steps using Promptflow Tracing. Pitfall: Lack of visibility into intermediate pipeline steps; addressed by wrapping functions with the @trace decorator.
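
The unparseable-judge-output pitfall in the batch-processing use case can be handled with a defensive parser along these lines. This is a sketch under assumptions: the source says only that the judge returns JSON with a binary score and reasoning, so the field names `score` and `reason` and the helpers `parse_judgment` and `accuracy` are hypothetical.

```python
import json


def parse_judgment(raw: str) -> dict:
    """Parse a judge reply; fall back to score 0 if it is not usable JSON."""
    # Tolerate extra text around the JSON object by slicing from '{' to the last '}'.
    start, end = raw.find("{"), raw.rfind("}")
    candidate = raw[start:end + 1] if start != -1 and end > start else raw
    try:
        parsed = json.loads(candidate)
        score = int(parsed["score"])
        if score not in (0, 1):
            raise ValueError(f"non-binary score: {score}")
        return {"score": score, "reason": str(parsed.get("reason", ""))}
    except (KeyError, TypeError, ValueError) as e:  # JSONDecodeError is a ValueError
        return {"score": 0, "reason": f"unparseable judge output: {e}"}


def accuracy(judgments: list) -> float:
    """Fraction of rows the judge scored 1; 0.0 for an empty batch."""
    return sum(j["score"] for j in judgments) / len(judgments) if judgments else 0.0


replies = [
    '{"score": 1, "reason": "Matches the expected fact."}',
    'Sure! {"score": 0, "reason": "Wrong year."}',
    "I think the answer is correct.",
]
judgments = [parse_judgment(r) for r in replies]
batch_accuracy = accuracy(judgments)
```

Scoring an unparseable reply as 0 is a conservative choice that keeps the batch metric from being inflated by judge failures; counting such rows separately would be a reasonable alternative.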

References:

https://www.marktechpost.com/2026/04/28/how-to-build-traceable-and-evaluated-llm-workflows-using-promptflow-prompty-and-openai/
