How to Build Traceable and Evaluated LLM Workflows with Promptflow and Prompty

These articles are AI-generated summaries. Please check the original sources for full details.

Sana Hassan walks through a complete production-style workflow that uses Promptflow and GPT-4o-mini to bridge the gap between simple chat interfaces and enterprise pipelines. The system implements a class-based flex flow that supports hybrid reasoning by injecting deterministically computed hints into the prompt before the model responds.

Why This Matters

Moving LLMs from prototype to production takes more than prompts: it requires a structured framework for observability and performance measurement. Combining deterministic preprocessing with LLM-as-a-judge evaluation lets developers offset model stochasticity, check that responses contain specific expected facts, and replace part of the manual verification work with automated accuracy metrics.

Key Insights

  • Promptflow’s Flex Flow: Enables class-based pipelines where __init__ arguments become flow parameters, as demonstrated in the ResearchAssistant implementation.
  • Hybrid Reasoning: The safe_calc tool evaluates arithmetic deterministically, so the result is computed exactly before the LLM generates a response.
  • LLM-as-a-Judge: The system uses a specialized Judge Prompty to grade responses against expected facts, returning structured JSON with binary scores and reasoning.
  • Traceability: The start_trace function records every execution step, providing a full audit trail for both single-query and batch-run operations in environments like Colab.
  • Prompty Asset Management: Externalizing prompt logic into .prompty files separates model configuration and system instructions from the core Python execution logic.
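The first two insights can be sketched with the standard library alone. This is a minimal illustration, not the article's actual class: the arithmetic-detection regex, the llm callable, and fake_llm are hypothetical stand-ins (a real flex flow would load the Prompty file and call the model instead), and tracing decorators are left out so the sketch runs without Promptflow installed.

```python
import re
from typing import Callable


def safe_calc(expression: str) -> str:
    """Character-whitelisted arithmetic, mirroring the tool shown under Working Examples."""
    if not set(expression) <= set("0123456789+-*/(). "):
        return "unsafe"
    try:
        return str(eval(expression))
    except Exception as e:
        return f"error:{e}"


# Hypothetical pattern: starts with a digit, ends with a digit or ")",
# with only digits, operators, parentheses, dots, and spaces in between.
_ARITH = re.compile(r"\d[\d+\-*/(). ]*[\d)]")


class ResearchAssistant:
    def __init__(self, llm: Callable[[str, str], str], model: str = "gpt-4o-mini"):
        # In a flex flow, constructor arguments such as `model` are the values
        # supplied as flow parameters. `llm` is a test stand-in (assumption).
        self.llm = llm
        self.model = model

    def compute_hint(self, question: str) -> str:
        match = _ARITH.search(question)
        if not match:
            return ""
        expr = match.group(0).strip()
        if not any(op in expr for op in "+-*/"):
            return ""  # a bare number is not an expression
        result = safe_calc(expr)
        if result == "unsafe" or result.startswith("error:"):
            return ""
        return f"{expr} = {result}"

    def __call__(self, question: str) -> dict:
        # Deterministic step first, then the (stubbed) model call with the hint.
        hint = self.compute_hint(question)
        return {"answer": self.llm(question, hint), "hint": hint}


def fake_llm(question: str, hint: str) -> str:
    """Stub that echoes the hint so the data flow is visible."""
    return f"(stub) {hint}" if hint else "(stub) no hint"


assistant = ResearchAssistant(llm=fake_llm)
print(assistant("What is 12 * (3 + 4)?"))
# {'answer': '(stub) 12 * (3 + 4) = 84', 'hint': '12 * (3 + 4) = 84'}
```

The point of the design is ordering: the exact result is computed before the model is called, so the model only has to phrase the answer rather than do the arithmetic.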

Working Examples

Defining a structured Prompty file with model configuration and Jinja2 templating for inputs.

    (WORK_DIR / "researcher.prompty").write_text("""---
    name: Researcher
    description: Concise research assistant.
    model:
      api: chat
      configuration:
        type: openai
        connection: open_ai_connection
        model: gpt-4o-mini
      parameters:
        temperature: 0.2
        max_tokens: 350
    inputs:
      question: {type: string}
      hint: {type: string, default: ""}
    ---
    system:
    You are a precise research assistant. Answer in 1-3 sentences. If a `hint` is given, weave it in.
    user:
    Q: {{question}}
    {% if hint %}Hint: {{hint}}{% endif %}
    """)

A deterministic tool used within the flow to handle arithmetic operations safely before calling the LLM.

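    from promptflow.tracing import trace

    @trace
    def safe_calc(expression: str) -> str:
        # Allow only digits, arithmetic operators, parentheses, dots, and spaces.
        if not set(expression) <= set("0123456789+-*/(). "):
            return "unsafe"
        try:
            return str(eval(expression))
        except Exception as e:
            return f"error:{e}"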
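A character whitelist keeps names and function calls out of eval, but it still admits the ** operator (two * characters), so a long exponent chain can be very slow to evaluate. As a possible alternative, not something the article uses, the sketch below walks the parsed syntax tree and permits only +, -, *, / and unary signs. The name ast_calc and the max_len limit are hypothetical.

```python
import ast
import operator

# Only these operators are evaluated; anything else (including **) is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a restricted arithmetic syntax tree."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
        return _BIN_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("unsupported syntax")


def ast_calc(expression: str, max_len: int = 100) -> str:
    """Return the result as a string, "unsafe" for disallowed input, or "error:..."."""
    if len(expression) > max_len:
        return "unsafe"
    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_eval_node(tree.body))
    except ValueError:
        return "unsafe"
    except (SyntaxError, ZeroDivisionError, OverflowError) as e:
        return f"error:{e}"


print(ast_calc("2 * (3 + 4)"))  # 14
print(ast_calc("2**10"))        # unsafe
```

The return convention ("unsafe" and "error:" strings) matches safe_calc, so a function like this could be swapped in without changing callers.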

Executing an evaluation run that links the outputs of a base run to a judge model for automated scoring.

    eval_run = pf.run(
        flow=str(WORK_DIR / "eval.flex.yaml"),
        data=str(data_path),
        run=base_run,
        column_mapping={
            "question": "${data.question}",
            "expected": "${data.expected}",
            "answer": "${run.outputs.answer}",
        },
        stream=True,
    )
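To make the column_mapping semantics concrete, the sketch below resolves the same three references for a single row. It is an illustration of the idea only, not Promptflow's implementation: resolve_mapping and the sample rows are hypothetical, and treating a non-reference value as a literal is an assumption of this sketch.

```python
import re

# Matches "${data.<column>}" or "${run.outputs.<column>}".
_REF = re.compile(r"^\$\{(data|run\.outputs)\.(\w+)\}$")


def resolve_mapping(column_mapping: dict, data_row: dict, run_output_row: dict) -> dict:
    """Build the evaluator's inputs for one row from the dataset and the base run's outputs."""
    resolved = {}
    for target, ref in column_mapping.items():
        match = _REF.match(ref)
        if match is None:
            resolved[target] = ref  # assumption: anything else is a literal
            continue
        source, column = match.groups()
        row = data_row if source == "data" else run_output_row
        resolved[target] = row[column]
    return resolved


mapping = {
    "question": "${data.question}",
    "expected": "${data.expected}",
    "answer": "${run.outputs.answer}",
}
data_row = {"question": "What is 2 + 2?", "expected": "4"}
run_output_row = {"answer": "2 + 2 equals 4."}

print(resolve_mapping(mapping, data_row, run_output_row))
# {'question': 'What is 2 + 2?', 'expected': '4', 'answer': '2 + 2 equals 4.'}
```

Each evaluator call therefore sees the original question and expected fact from the dataset next to the answer the base run produced for that same row.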

Practical Applications

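  • Use Case: Research assistants using GPT-4o-mini for concise summaries with computed hints injected into the prompt. Pitfall: Passing unvalidated strings to eval; mitigated here by character set whitelisting, although the whitelist still admits the ** operator, so very large exponent expressions can be slow to evaluate.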
  • Use Case: Batch processing of Q&A datasets to compute accuracy metrics across enterprise data. Pitfall: Unparseable LLM judge outputs; handled via try-except blocks and JSON schema enforcement.
  • Use Case: Production monitoring of LLM steps using Promptflow Tracing. Pitfall: Lack of visibility into intermediate pipeline steps; addressed by the @trace decorator, which records the inputs and outputs of each decorated function.
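The judge-output pitfall above can be handled with a small defensive parser. The following is a sketch under assumptions: the field names score and reason are hypothetical (the article only says the judge returns JSON with a binary score and reasoning), and counting an unparseable reply as 0 is one possible policy rather than the article's stated one.

```python
import json


def parse_judge(raw: str) -> dict:
    """Parse a judge reply into {"score": 0 or 1, "reason": str}."""
    text = raw.strip()
    # Tolerate extra prose or code fences around the JSON object.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return {"score": 0, "reason": "unparseable judge output"}
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return {"score": 0, "reason": "unparseable judge output"}
    score = obj.get("score")
    if score not in (0, 1):
        return {"score": 0, "reason": "score missing or not binary"}
    return {"score": int(score), "reason": str(obj.get("reason", ""))}


def accuracy(raw_replies: list) -> float:
    """Fraction of replies scored 1; an empty batch yields 0.0."""
    results = [parse_judge(r) for r in raw_replies]
    return sum(r["score"] for r in results) / len(results) if results else 0.0


replies = [
    '{"score": 1, "reason": "matches expected fact"}',
    'Sure! {"score": 0, "reason": "wrong year"}',
    "I think the answer is fine.",
]
print(accuracy(replies))  # one of three replies scores 1
```

Scoring failures as 0 keeps the accuracy metric conservative; logging the raw reply alongside the fallback reason would make such cases easy to audit in a trace.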
