How to Build a Self-Evaluating Agentic AI System with LlamaIndex and OpenAI

This tutorial demonstrates building an advanced agentic AI workflow using LlamaIndex and OpenAI, focusing on a reliable retrieval-augmented generation (RAG) agent capable of reasoning, using tools, and evaluating its outputs. The system structures operations around retrieval, answer synthesis, and self-evaluation, moving beyond basic chatbots towards trustworthy, controllable AI systems.

Why This Matters

Current AI systems often struggle with factual accuracy and relevance, producing “hallucinations” and shallow responses. A reliable RAG system needs retrieval, synthesis, and verification to work together, but this is hard to achieve in practice: inaccurate outputs have to be monitored and corrected continuously, which consumes significant development and maintenance effort.

Key Insights

Model Choice: The example uses OpenAI’s “gpt-4o-mini” model with a low temperature (0.2) for more deterministic responses.
Faithfulness and Relevancy Evaluation: The system incorporates automated evaluation of responses using FaithfulnessEvaluator and RelevancyEvaluator from LlamaIndex.
ReAct Agent Framework: The implementation leverages the ReAct agent framework for iterative reasoning and tool use, combining retrieval and answer generation.
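
The retrieve, answer, evaluate, and revise control flow implied by these points can be sketched without any LLM calls. The version below is a minimal illustration, not the tutorial's implementation: `answer_with_review`, the `Scores` container, the stub functions, the `threshold` of 1.0, and the revision hint string are all hypothetical names and values chosen for this sketch. In the full system the agent decides when to call its tools; here the loop is written out explicitly so the revise-once policy is visible.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Scores:
    faithfulness: float
    relevancy: float


def answer_with_review(
    question: str,
    retrieve: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str, List[str]], Scores],
    threshold: float = 1.0,   # assumed cutoff, not taken from the tutorial
    max_revisions: int = 1,   # "revise once"
) -> Tuple[str, Scores, int]:
    """Retrieve evidence, draft an answer, score it, and revise while scores are low."""
    evidence = retrieve(question)
    answer = synthesize(question, evidence)
    scores = evaluate(question, answer, evidence)
    revisions = 0
    while revisions < max_revisions and min(scores.faithfulness, scores.relevancy) < threshold:
        # Redraft with an explicit instruction to stay within the evidence.
        answer = synthesize(question + " Use only the evidence provided.", evidence)
        scores = evaluate(question, answer, evidence)
        revisions += 1
    return answer, scores, revisions


# Stubs standing in for the retriever, the LLM, and the evaluators.
def fake_retrieve(q: str) -> List[str]:
    return ["RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality."]


def fake_synthesize(q: str, evidence: List[str]) -> str:
    # The stub only produces a grounded answer when the revision hint is present.
    if "Use only the evidence" in q:
        return evidence[0]
    return "RAG evaluation focuses on speed."


def fake_evaluate(q: str, a: str, evidence: List[str]) -> Scores:
    # Treat an answer as faithful only if it matches a retrieved passage verbatim.
    return Scores(faithfulness=1.0 if a in evidence else 0.0, relevancy=1.0)


answer, scores, revisions = answer_with_review(
    "What does RAG evaluation focus on?", fake_retrieve, fake_synthesize, fake_evaluate
)
print(revisions, scores, answer)
```

With these stubs the first draft fails the faithfulness check, so the loop performs exactly one revision before returning the grounded answer.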

Working Example

!pip -q install -U llama-index llama-index-llms-openai llama-index-embeddings-openai nest_asyncio
import os
import asyncio
import nest_asyncio
nest_asyncio.apply()
from getpass import getpass
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
texts = [
    "Reliable RAG systems separate retrieval, synthesis, and verification. Common failures include hallucination and shallow retrieval.",
    "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
    "Tool-using agents require constrained tools, validation, and self-review loops.",
    "A robust workflow follows retrieve, answer, evaluate, and revise steps."
]
docs = [Document(text=t) for t in texts]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=4)

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
faith_eval = FaithfulnessEvaluator(llm=Settings.llm)
rel_eval = RelevancyEvaluator(llm=Settings.llm)
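def retrieve_evidence(q: str) -> str:
    """Retrieve the passages most relevant to the query, numbered for citation."""
    r = query_engine.query(q)
    out = []
    for i, n in enumerate(r.source_nodes or []):
        # Truncate each passage to 300 characters to keep the tool output short.
        out.append(f"[{i+1}] {n.node.get_content()[:300]}")
    return "\n".join(out)
def score_answer(q: str, a: str) -> str:
    """Score an answer for faithfulness and relevancy against retrieved context."""
    resp = query_engine.query(q)
    contexts = [n.node.get_content() for n in resp.source_nodes or []]
    f = faith_eval.evaluate(query=q, response=a, contexts=contexts)
    # Use a separate name so the evaluation result does not overwrite the query response.
    rel = rel_eval.evaluate(query=q, response=a, contexts=contexts)
    return f"Faithfulness: {f.score}\nRelevancy: {rel.score}"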

from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context
agent = ReActAgent(
    tools=[retrieve_evidence, score_answer],
    llm=Settings.llm,
    system_prompt="""
Always retrieve evidence first.
Produce a structured answer.
Evaluate the answer and revise once if scores are low.
""",
    verbose=True
)
ctx = Context(agent)

async def run_brief(topic: str):
    q = f"Design a reliable RAG + tool-using agent workflow and how to evaluate it. Topic: {topic}"
    handler = agent.run(q, ctx=ctx)
    async for ev in handler.stream_events():
        # Only streaming events carry a text delta; print nothing for other events.
        print(getattr(ev, "delta", "") or "", end="")
    res = await handler
    return str(res)
topic = "RAG agent reliability and evaluation"
# nest_asyncio.apply() above allows asyncio.run inside an already-running notebook loop.
result = asyncio.run(run_brief(topic))
print("\n\nFINAL OUTPUT\n")
print(result)
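
The system prompt tells the agent to revise "if scores are low" but leaves "low" undefined. One way to make that rule explicit is to parse the two-line report that score_answer returns ("Faithfulness: <score>" and "Relevancy: <score>") and compare each value against a cutoff. The sketch below assumes that report format; `parse_score_report`, `needs_revision`, and the 0.5 threshold are hypothetical names and values introduced for illustration, and a missing or unparseable score is treated as a reason to revise.

```python
import re
from typing import Dict, Optional

# Matches one "Name: value" line from the report produced by score_answer.
_SCORE_LINE = re.compile(r"^\s*(Faithfulness|Relevancy)\s*:\s*(\S+)\s*$", re.MULTILINE)


def parse_score_report(report: str) -> Dict[str, Optional[float]]:
    """Extract the two scores; a value that is absent or not a number becomes None."""
    scores: Dict[str, Optional[float]] = {"Faithfulness": None, "Relevancy": None}
    for name, raw in _SCORE_LINE.findall(report):
        try:
            scores[name] = float(raw)
        except ValueError:
            scores[name] = None  # e.g. the literal text "None"
    return scores


def needs_revision(report: str, threshold: float = 0.5) -> bool:
    """Return True if any score is missing, NaN, or below the (assumed) threshold."""
    scores = parse_score_report(report)
    # "not v >= threshold" is also True for NaN, so NaN scores trigger a revision.
    return any(v is None or not v >= threshold for v in scores.values())


print(needs_revision("Faithfulness: 1.0\nRelevancy: 1.0"))
print(needs_revision("Faithfulness: 0.0\nRelevancy: 1.0"))
```

Keeping the decision in plain code rather than in the prompt makes the revision rule testable and independent of how the model interprets "low".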

Practical Applications

Research Automation: Automating literature reviews and report generation with verifiable sources.
Pitfall: Relying solely on LLM-generated evaluations without human oversight can lead to reinforcing existing biases or overlooking subtle inaccuracies.
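
One way to act on this pitfall is to send every low-scoring answer to a person and also spot-check a fraction of the answers that pass the automated evaluation. The sketch below is an assumption-laden illustration rather than part of the tutorial: `route_for_review`, the route labels, the 0.5 threshold, and the 10% audit rate are all hypothetical. Sampling is hash-based so the same question always receives the same routing decision.

```python
import hashlib


def route_for_review(
    question: str,
    faithfulness: float,
    relevancy: float,
    threshold: float = 0.5,   # assumed cutoff for "low" scores
    audit_rate: float = 0.1,  # assumed share of passing answers to spot-check
) -> str:
    """Return 'human_review', 'spot_check', or 'auto_accept' for one scored answer."""
    if min(faithfulness, relevancy) < threshold:
        return "human_review"
    # Map the question to a stable 64-bit bucket and audit the lowest audit_rate share.
    digest = hashlib.sha256(question.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return "spot_check" if bucket < audit_rate * 2**64 else "auto_accept"


print(route_for_review("What does RAG evaluation focus on?", 0.0, 1.0))
```

The audit rate trades reviewer time against the chance of catching errors that the LLM-based evaluators miss.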

References:

https://www.marktechpost.com/2026/01/17/how-to-build-a-self-evaluating-agentic-ai-system-with-llamaindex-and-openai-using-retrieval-tool-use-and-automated-quality-checks/
