Custom Evals: A Unified Evaluation Framework for 17+ LLM Agent Frameworks

Anjaiah Methuku introduced Custom Evals to solve the fragmentation of AI testing. The framework supports over 17 different agent frameworks through a single unified interface.

Why This Matters

Engineering teams often demo agents that look smooth, then spend weeks firefighting hallucinations in production. Most evaluation tools are too heavy (requiring full observability stacks), too niche (RAG-only), or too opinionated (requiring specific test runners), so developers end up shipping without proper validation.

Key Insights

Four-layer architecture: Custom Evals separates metrics into Code-Based (deterministic), LLM-as-Judge (semantic), NLP Similarity (BLEU/ROUGE), and OCR/Document metrics.
Reference-free evaluation: Systems can be validated without ground truth using context-based checks. For example, the HallucinationEvaluator requires only input, output, and context.
Universal Adapter Pattern: Every integration across cloud platforms (AWS Bedrock, Google ADK) and community frameworks (CrewAI, Pydantic AI) reduces to a standardized eval_input dictionary.
Non-LLM Pipeline Validation: The framework includes OCR metrics like Character Error Rate (CER) and Bounding Box IoU for extraction pipelines using AWS Textract or Azure Form Recognizer.
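
The two OCR metrics named above are standard and can be computed without any LLM. The following is a minimal sketch of both in plain Python, not the Custom Evals implementation: the function names, the (x1, y1, x2, y2) box format, and the handling of empty inputs are assumptions made for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (lower is better)."""
    if not reference:
        # Assumed convention: empty reference is perfect only if the output is empty too.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def bbox_iou(box_a: Box, box_b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes (higher is better)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # One substituted character out of seven.
    print(character_error_rate("invoice", "invo1ce"))
    # Two 2x2 boxes overlapping in a 1x1 square.
    print(bbox_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Note that CER can exceed 1.0 when the extracted text is much longer than the reference, so a pipeline threshold should not assume the value is bounded by 1.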

Working Examples

Basic implementation of a semantic coherence evaluation using an LLM judge.

from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)
score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})
print(f"{score.label}: {score.explanation}")

Creating a custom deterministic evaluator for JSON validity.

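import json

from custom.evals import create_evaluator, Score

@create_evaluator(name="json_validity", direction="maximize")
def json_validity(output: str) -> Score:
    # Score 1.0 when the output parses as JSON, 0.0 otherwise.
    try:
        json.loads(output)
        return Score(score=1.0, label="valid", name="json_validity")
    except (ValueError, TypeError):
        # Malformed JSON (a ValueError subclass) or a non-string output counts as invalid.
        return Score(score=0.0, label="invalid", name="json_validity")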

Practical Applications

Use case: RAG pipelines utilizing concurrent async calls with asyncio.gather to run multiple evaluators (Faithfulness, Relevance) simultaneously without serial bottlenecks.
Pitfall: Relying solely on labeled datasets for ground truth; this leads to unused evaluation infrastructure in production where user queries are unpredictable.
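
The concurrent pattern in the use case above can be sketched with the standard library alone. The two evaluators below are hypothetical stand-ins, not the Custom Evals classes: each sleeps briefly to simulate a network-bound judge call and then applies a deliberately naive heuristic. The dictionary keys mirror the input/output/context fields mentioned earlier.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

EvalInput = Dict[str, str]
Evaluator = Callable[[EvalInput], Awaitable[dict]]


async def faithfulness(eval_input: EvalInput) -> dict:
    """Hypothetical stand-in: is the output literally contained in the context?"""
    await asyncio.sleep(0.01)  # simulates the latency of an LLM judge call
    supported = eval_input["output"].lower() in eval_input["context"].lower()
    return {"name": "faithfulness", "score": 1.0 if supported else 0.0}


async def relevance(eval_input: EvalInput) -> dict:
    """Hypothetical stand-in: fraction of query terms that appear in the output."""
    await asyncio.sleep(0.01)
    query_terms = set(eval_input["input"].lower().split())
    output_terms = set(eval_input["output"].lower().split())
    overlap = len(query_terms & output_terms) / max(len(query_terms), 1)
    return {"name": "relevance", "score": overlap}


async def run_evaluators(eval_input: EvalInput, evaluators: List[Evaluator]) -> Dict[str, float]:
    # gather schedules all coroutines at once, so total wall time is roughly
    # that of the slowest evaluator rather than the sum of all of them.
    results = await asyncio.gather(*(evaluate(eval_input) for evaluate in evaluators))
    return {result["name"]: result["score"] for result in results}


def evaluate_sample(sample: EvalInput) -> Dict[str, float]:
    """Synchronous entry point for scripts and tests."""
    return asyncio.run(run_evaluators(sample, [faithfulness, relevance]))


if __name__ == "__main__":
    sample = {
        "input": "capital of France",
        "output": "Paris is the capital of France",
        "context": "Paris is the capital of France and its largest city.",
    }
    print(evaluate_sample(sample))
```

With real LLM-backed evaluators, passing return_exceptions=True to asyncio.gather may be worth considering so that one failed judge call does not discard the other results.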

References:

https://dev.to/anjaiahspr/stop-flying-blind-we-built-an-llm-evaluation-framework-that-works-across-17-agent-frameworks-1698
