Custom Evals: A Unified Evaluation Framework for 17+ LLM Agent Frameworks
These articles are AI-generated summaries. Please check the original sources for full details.
Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks
Anjaiah Methuku introduced Custom Evals to solve the fragmentation of AI testing. The framework supports over 17 different agent frameworks through a single unified interface.
Why This Matters
Engineering teams often demo AI agents that run smoothly, then spend weeks firefighting hallucinations in production. Most evaluation tools are either too heavy (requiring full observability stacks), too niche (RAG-only), or too opinionated (requiring specific test runners), so developers end up shipping without proper validation.
Key Insights
- Four-layer architecture: Custom Evals separates metrics into Code-Based (deterministic), LLM-as-Judge (semantic), NLP Similarity (BLEU/ROUGE), and OCR/Document metrics.
- Reference-free evaluation: Systems can be validated without ground truth using context-based checks, such as the HallucinationEvaluator, which requires only input, output, and context.
- Universal Adapter Pattern: Every integration across cloud platforms (AWS Bedrock, Google ADK) and community frameworks (CrewAI, Pydantic AI) reduces to a standardized eval_input dictionary.
- Non-LLM Pipeline Validation: The framework includes OCR metrics like Character Error Rate (CER) and Bounding Box IoU for extraction pipelines using AWS Textract or Azure Form Recognizer.
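The two OCR metrics named above have standard definitions that are easy to sketch in plain Python. The block below is a minimal illustration and not the framework's own implementation: the function names (levenshtein, character_error_rate, bbox_iou), the (x_min, y_min, x_max, y_max) box layout, and the handling of an empty reference string are all assumptions made for the example.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not reference:
        # Assumed convention: empty reference scores 0.0 only if the
        # hypothesis is empty too, otherwise 1.0.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def bbox_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union for boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # One substituted character out of seven.
    print(character_error_rate("invoice", "inv0ice"))
    # Two 2x2 boxes overlapping in a 1x1 square.
    print(bbox_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Both metrics are deterministic and need no LLM call, which is why they suit extraction pipelines where the expected text and box coordinates are known.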
Working Examples
Basic implementation of a semantic coherence evaluation using an LLM judge.
from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

# The judge model that scores the output.
llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)

# Only input and output are passed; no reference answer is supplied.
score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})
print(f"{score.label}: {score.explanation}")
Creating a custom deterministic evaluator for JSON validity.
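import json

from custom.evals import create_evaluator, Score

@create_evaluator(name="json_validity", direction="maximize")
def json_validity(output: str) -> Score:
    # Deterministic check: the output either parses as JSON or it does not.
    try:
        json.loads(output)
        return Score(score=1.0, label="valid", name="json_validity")
    except (ValueError, TypeError):
        # json.JSONDecodeError is a ValueError; TypeError covers non-string output.
        return Score(score=0.0, label="invalid", name="json_validity")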
Practical Applications
- Use case: RAG pipelines utilizing concurrent async calls with asyncio.gather to run multiple evaluators (Faithfulness, Relevance) simultaneously without serial bottlenecks.
- Pitfall: Relying solely on labeled datasets for ground truth; this leads to unused evaluation infrastructure in production, where user queries are unpredictable.
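The concurrent pattern in the use case above can be sketched with the standard library alone. The two evaluators here are hypothetical stand-ins with toy scoring rules, not the framework's Faithfulness or Relevance evaluators, and the dictionary keys follow the input/output/context shape described earlier; asyncio.sleep stands in for the latency of an LLM judge call.

```python
import asyncio


async def faithfulness_stub(eval_input: dict) -> dict:
    """Hypothetical evaluator: 1.0 if the output appears verbatim in the context."""
    await asyncio.sleep(0.05)  # stands in for a network call to a judge model
    grounded = eval_input["output"].lower() in eval_input["context"].lower()
    return {"name": "faithfulness", "score": 1.0 if grounded else 0.0}


async def relevance_stub(eval_input: dict) -> dict:
    """Hypothetical evaluator: fraction of query terms that appear in the output."""
    await asyncio.sleep(0.05)
    terms = set(eval_input["input"].lower().split())
    output_tokens = set(eval_input["output"].lower().split())
    score = len(terms & output_tokens) / len(terms) if terms else 0.0
    return {"name": "relevance", "score": score}


async def run_evaluators(eval_input: dict, evaluators) -> list:
    """Start every evaluator at once and wait for all of them to finish.

    asyncio.gather returns results in the same order as the awaitables
    passed in, regardless of which one completes first.
    """
    return await asyncio.gather(*(ev(eval_input) for ev in evaluators))


if __name__ == "__main__":
    sample = {
        "input": "refund window",
        "output": "30 days",
        "context": "Refunds are accepted within 30 days.",
    }
    for result in asyncio.run(run_evaluators(sample, [faithfulness_stub, relevance_stub])):
        print(result["name"], result["score"])
```

Because the evaluators wait on I/O rather than CPU, total latency under this pattern tends toward that of the slowest evaluator instead of the sum of all of them.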