An Implementation of Fully Traced and Evaluated Local LLM Pipeline Using Opik

This tutorial demonstrates a complete workflow for building, tracing, and evaluating a local large language model (LLM) pipeline using Opik, an open-source platform for debugging LLM and RAG applications. The system starts with a lightweight model (distilgpt2) and progresses through prompt-based planning, dataset creation, and automated evaluations.

Why This Matters

LLM development often lacks robust tracing and evaluation, which makes model behavior unpredictable and results hard to reproduce. Idealized setups assume clean data and consistent performance, but real-world applications suffer from prompt sensitivity, context drift, and hallucination. Without proper instrumentation, debugging and improving an LLM application becomes costly: engineering hours are wasted, and undetected errors can propagate into downstream business processes.

Key Insights

Opik’s tracing capabilities log nested spans, LLM calls, token usage, and metadata: This level of detail is crucial for understanding complex pipeline behavior.
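Structured prompts enhance consistency: Using Opik’s Prompt class gives each prompt a name and a template with explicit {{placeholders}}, so the same instructions reach the model on every call and only the variables change.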
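Comparison with workflow engines: Temporal, used by Stripe and Coinbase, orchestrates complex, stateful workflows. Opik does not orchestrate; it provides comparable step-by-step visibility into multi-step LLM pipelines through tracing and evaluation.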

Working Example

# Notebook cell (Colab/Jupyter); in a shell, drop the leading "!".
!pip install -q opik transformers accelerate torch

import textwrap

import torch
from transformers import pipeline

import opik
from opik import Opik, Prompt, track
from opik.evaluation import evaluate
from opik.evaluation.metrics import Equals, LevenshteinRatio

# Hugging Face pipelines take a GPU index, or -1 for CPU.
device = 0 if torch.cuda.is_available() else -1
print("Using device:", "cuda" if device == 0 else "cpu")

# Set up the Opik connection (may prompt interactively for details).
opik.configure()
PROJECT_NAME = "opik-hf-tutorial"

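# Load a small local text-generation model.
llm = pipeline(
    "text-generation",
    model="distilgpt2",
    device=device,
)

def hf_generate(prompt: str, max_new_tokens: int = 80) -> str:
    """Generate a continuation for `prompt` and return only the new text."""
    result = llm(
        prompt,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.3,
        # GPT-2 models have no pad token, so reuse the end-of-sequence token.
        pad_token_id=llm.tokenizer.eos_token_id,
    )[0]["generated_text"]
    # The pipeline output echoes the prompt; strip it off.
    return result[len(prompt):].strip()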

plan_prompt = Prompt(
    name="hf_plan_prompt",
    prompt=textwrap.dedent("""
        You are an assistant that creates a plan to answer a question
        using ONLY the given context.
        Context:
        {{context}}
        Question:
        {{question}}
        Return exactly 3 bullet points as a plan.
    """).strip(),
)

answer_prompt = Prompt(
    name="hf_answer_prompt",
    prompt=textwrap.dedent("""
        You answer based only on the given context.
        Context:
        {{context}}
        Question:
        {{question}}
        Plan:
        {{plan}}
        Answer the question in 2–4 concise sentences.
    """).strip(),
)
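
The two prompts are meant to be chained: the plan produced from the first template is fed into the `{{plan}}` slot of the second. The sketch below shows that flow in isolation, under stated assumptions. `render` and `fake_generate` are hypothetical helpers introduced here so the example runs without Opik or a model download; the template strings are shortened copies of the ones above.

```python
import re
from typing import Callable

# Shortened copies of the two templates defined above.
PLAN_TEMPLATE = (
    "Context:\n{{context}}\nQuestion:\n{{question}}\n"
    "Return exactly 3 bullet points as a plan."
)
ANSWER_TEMPLATE = (
    "Context:\n{{context}}\nQuestion:\n{{question}}\nPlan:\n{{plan}}\n"
    "Answer the question in 2-4 concise sentences."
)


def render(template: str, **values: str) -> str:
    """Hypothetical helper: replace every {{name}} with values[name]."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"missing template variable: {key}")
        return str(values[key])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def qa_pipeline(context: str, question: str,
                generate: Callable[[str], str]) -> dict:
    """Two-step pipeline: generate a plan, then an answer that sees the plan."""
    plan = generate(render(PLAN_TEMPLATE, context=context, question=question))
    answer = generate(
        render(ANSWER_TEMPLATE, context=context, question=question, plan=plan)
    )
    return {"plan": plan, "answer": answer}


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call; only reports the prompt size."""
    return f"[model output for a {len(prompt)}-character prompt]"


result = qa_pipeline(
    context="Opik logs traces and spans.",
    question="What does Opik log?",
    generate=fake_generate,
)
print(result["plan"])
print(result["answer"])
```

With the real components, `hf_generate` would be passed as `generate`, and the Opik `Prompt` objects would supply the rendered text, assuming they expose a formatting method for their placeholders. Passing the generator in as an argument keeps the pipeline easy to test with a stub.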

Practical Applications

Use Case: The Marktechpost tutorial referenced below uses Opik to build and evaluate a context-grounded question-answering pipeline, in which a plan is generated first and the answer must rely only on the supplied context.
Pitfall: Relying solely on LLM output without tracing and evaluation can lead to inconsistent results and difficulty in identifying the root cause of errors, ultimately impacting user trust.
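
To make the evaluation side concrete, here is a minimal sketch of the two kinds of score that the imported `Equals` and `LevenshteinRatio` metrics suggest: an exact-match check and an edit-distance similarity. The functions below are written from scratch for illustration. In particular, the normalization in `similarity_ratio` (one minus distance over the longer length) is an assumption; Opik's own metric may normalize differently, so the numbers need not match.

```python
def equals_score(output: str, reference: str) -> float:
    """Exact match: 1.0 if the strings are identical, else 0.0."""
    return 1.0 if output == reference else 0.0


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def similarity_ratio(output: str, reference: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(output), len(reference))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein_distance(output, reference) / longest


def score_pairs(pairs):
    """Score (output, reference) pairs with both metrics."""
    return [
        {
            "output": output,
            "reference": reference,
            "equals": equals_score(output, reference),
            "ratio": similarity_ratio(output, reference),
        }
        for output, reference in pairs
    ]


# Hypothetical outputs and references, for illustration only.
rows = score_pairs([
    ("Opik logs nested spans.", "Opik logs nested spans."),
    ("Opik logs spans.", "Opik logs nested spans."),
])
for row in rows:
    print(f'{row["equals"]:.1f}  {row["ratio"]:.2f}  {row["output"]}')
```

Exact match is strict and rarely fires for free-form generations, which is why a graded similarity score is useful alongside it when comparing sampled outputs against references.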

References:

https://www.marktechpost.com/2025/11/21/an-implementation-of-fully-traced-and-evaluated-local-llm-pipeline-using-opik-for-transparent-measurable-and-reproducible-ai-workflows/
