
AI Agents in Production: The Eval and Observability Playbook

Models fail loudly. Agents fail quietly — wrong answers, drifting tools, stale context. The eval and observability stack that makes them production-grade.

Mohammad Rahman

Mohadata

12 May 2026
9 min read
[Hero image: abstract 3D rendering of interconnected data cubes with glowing links, representing an agent's tool calls and the trace of state it leaves behind]

Models fail loudly. Agents fail quietly.

A model that hallucinates says something demonstrably wrong on a prompt you can hand to a reviewer. An agent that fails in production calls the wrong tool, retrieves yesterday's data, takes a plausible-but-wrong reasoning step, and confidently returns an answer that nobody flags until a customer or a regulator does.

That gap — between "the model is fine" and "the agent is reliable" — is the single biggest reason teams who shipped AI demos in 2024 have spent 2025 and 2026 trying to make them production-grade.

This piece is about the operational stack that closes the gap. It's the same shape as a data observability stack, just adapted to the failure modes that agents actually have.

What goes wrong with agents

The first thing that helps is having a clean taxonomy of how agents fail. The patterns we see in production look like this:

  • Hallucinated facts — the classic, but in an agent it's usually downstream of bad context, not bad models.
  • Tool-call errors — wrong tool selected, malformed arguments, missing arguments, the right tool called with stale input.
  • Context drift / stale retrieval — RAG returns yesterday's policy document; the agent confidently quotes it.
  • Reasoning regressions on model upgrade — a model swap that scores better on benchmarks subtly breaks a multi-step workflow.
  • Latency and cost blowouts — recursive tool loops, retries, oversized contexts. The agent works; the bill doesn't.
  • Compliance and audit gaps — no trace of what was retrieved, what was decided, why.
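
It pays to treat that taxonomy as data rather than prose. Tag every triaged incident and every failed eval case with one of these categories, and the counts land in the same place as your traces. A minimal sketch; the tag_trace helper and the shape of the trace dict are illustrative assumptions, not any particular vendor's API:

observability/failure_taxonomy.py

from enum import Enum

class AgentFailure(str, Enum):
    """The taxonomy above, as stable tags for traces, incidents and eval failures."""
    HALLUCINATED_FACT = "hallucinated_fact"
    TOOL_CALL_ERROR = "tool_call_error"
    STALE_CONTEXT = "stale_context"
    REASONING_REGRESSION = "reasoning_regression"
    LATENCY_OR_COST_BLOWOUT = "latency_or_cost_blowout"
    AUDIT_GAP = "audit_gap"

def tag_trace(trace: dict, failure: AgentFailure, note: str) -> dict:
    """Attach a failure tag to a trace record so incidents can be counted by category."""
    trace.setdefault("failure_tags", []).append({"category": failure.value, "note": note})
    return trace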

This is not a fringe problem. In LangChain's State of Agent Engineering 2025 — surveyed Nov–Dec 2025 across 1,340 practitioners — 57% of respondents have agents in production, and the single most-cited blocker to scaling them is quality (32%), ahead of latency, security and cost. The teams shipping reliable agents aren't the ones with the cleverest prompts; they are the ones who built the stack that catches these failures before customers do.

Why model evals aren't enough

Most evaluation tooling — OpenAI Evals, benchmark suites, prompt regression tests — was designed for models. You score a single output against a fixed benchmark, you compare two models, you ship the winner.

Agents need something different. An agent run is a multi-step trajectory: the model calls a tool, observes the result, decides what to do next, possibly calls another tool, eventually answers. To evaluate an agent properly you have to score the whole trajectory:

  • Did it call the right tool, in the right order, with the right arguments?
  • Was the retrieved context fresh and on-topic?
  • Was the final answer grounded in what was actually retrieved?
  • Did it stay inside latency, cost and safety budgets?

You can't do that with a single output check. You need traces, structured assertions, and the discipline of a regression suite that runs on every change to the prompt, the model, the toolset, or the underlying data.

The four-layer stack

The shape of the stack we deploy looks like this:

[Figure: the four-layer agent observability stack. Traces at the bottom; metrics, evals, and audit on top, each enabling the next.]

Traces are the bedrock — every prompt, every tool call, every retrieval, with inputs and outputs captured. Metrics are aggregations over those traces — tool error rate, grounding rate, refusal rate, p95 latency, cost per session. Evals are the active checks — both offline regression suites and online sampling of live traffic. Audit is the immutable, regulator-readable layer on top — what was asked, what was retrieved, what the agent did, what it answered.
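
To make the second layer concrete: once traces are structured records, the headline metrics are a few lines of aggregation. A sketch that assumes each trace is a dict with the fields named below; the field names are illustrative, not a specific vendor's export format:

observability/rollup.py

from statistics import quantiles

def rollup(traces: list) -> dict:
    """Aggregate per-run trace records into the headline agent metrics."""
    tool_calls = [c for t in traces for c in t.get("tool_calls", [])]
    latencies = sorted(t["elapsed_s"] for t in traces)
    p95 = quantiles(latencies, n=20)[-1] if len(latencies) >= 2 else 0.0
    return {
        "tool_error_rate": sum(c["error"] for c in tool_calls) / max(len(tool_calls), 1),
        "grounding_rate": sum(t["answer_grounded"] for t in traces) / max(len(traces), 1),
        "refusal_rate": sum(t["refused"] for t in traces) / max(len(traces), 1),
        "p95_latency_s": round(p95, 2),
        "cost_per_session_usd": sum(t["cost_usd"] for t in traces) / max(len(traces), 1),
    }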

The numbers from the LangChain survey line up with this: 94% of teams running production agents have implemented some form of observability and 71.5% have full tracing — but only 44.8% are running online evaluations on production traffic. Tracing is becoming table stakes; the eval discipline is where most of the differentiation still sits.

The eval harness — what it actually looks like

Hosted platforms like LangSmith, Braintrust, and the open-source Promptfoo and Langfuse all give you runnable variations on the same pattern. The shape that survives contact with production is small enough to write yourself:

evals/order_status_agent.py

import time
from typing import Callable
 
# A single test case is a triple: input, what the agent should call,
# what the answer should contain (or pass through a structured check).
class AgentCase:
    def __init__(self, prompt, expected_tools, expected_answer_contains, latency_budget_s):
        self.prompt = prompt
        self.expected_tools = expected_tools
        self.expected_answer_contains = expected_answer_contains
        self.latency_budget_s = latency_budget_s
 
 
def score_run(agent: Callable, case: AgentCase) -> dict:
    start = time.monotonic()
    result = agent(case.prompt)             # returns {answer, tool_calls, retrieved_context}
    elapsed = time.monotonic() - start
 
    called_tools = [c["name"] for c in result["tool_calls"]]
    answer = result["answer"].lower()
 
    return {
        "prompt": case.prompt,
        "trajectory_ok": called_tools == case.expected_tools,
        "answer_grounded": all(
            phrase.lower() in answer for phrase in case.expected_answer_contains
        ),
        "within_budget": elapsed <= case.latency_budget_s,
        "elapsed_s": round(elapsed, 2),
    }
 
 
SUITE = [
    AgentCase(
        prompt="What's the status of order #4521?",
        expected_tools=["lookup_order", "format_status_for_user"],
        expected_answer_contains=["#4521", "shipped"],
        latency_budget_s=4.0,
    ),
    AgentCase(
        prompt="Cancel my order from yesterday.",
        expected_tools=["lookup_recent_orders", "request_cancellation_confirmation"],
        expected_answer_contains=["confirm", "yesterday"],
        latency_budget_s=4.0,
    ),
]
 
# Run on every prompt, model or tool change; fail CI if the pass-rate drops.
# my_agent is your agent's entry point: prompt in, {answer, tool_calls, retrieved_context} out.
results = [score_run(my_agent, c) for c in SUITE]

Three things matter here. First, trajectory checking — the agent has to call the right tools, not just produce a plausible answer. Second, grounding — the answer has to actually contain the facts the tool returned. Third, budget — slow agents in CI become slow agents in production.

This runs on every prompt change, every model upgrade, every new tool. When the pass-rate drops, the build fails and somebody investigates before the regression ships.
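
The gate itself can be as blunt as a threshold on the suite's pass-rate. A sketch of what sits at the bottom of evals/order_status_agent.py, reading the results list computed above; the 95% threshold is a choice you tune, not a rule:

PASS_RATE_THRESHOLD = 0.95

def case_passed(r: dict) -> bool:
    return r["trajectory_ok"] and r["answer_grounded"] and r["within_budget"]

pass_rate = sum(case_passed(r) for r in results) / len(results)
if pass_rate < PASS_RATE_THRESHOLD:
    failing = [r["prompt"] for r in results if not case_passed(r)]
    raise SystemExit(
        f"Eval pass rate {pass_rate:.0%} is below {PASS_RATE_THRESHOLD:.0%}; "
        f"failing cases: {failing}"
    )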

The golden set — keep it alive

A static eval suite written on day one is dead by day ninety. Customer phrasing drifts. Tools get added. Edge cases that didn't exist start showing up in traces.

The pattern that works:

  • Mine the golden set from real traces. When the agent gets something wrong in production, the first thing to do — after fixing it — is add the input as a regression case.
  • Tier the cases. A small set of fast, deterministic cases for every CI run. A larger set of slower, model-judged cases for nightly. A full audit suite for releases.
  • Redact before you save. Treat the golden set like production data. Strip PII; respect the same retention policies as the underlying tables.
  • Version it. When a metric definition changes, the eval expectations may also change. Tag the suite with the schema version it ran against.
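
Pulling those four rules together, the loop from production trace to regression case is short enough to live next to the suite. A sketch; the trace fields, the redact_pii helper and the JSONL location are assumptions about your own setup:

evals/mine_golden_case.py

import json
from pathlib import Path

SUITE_SCHEMA_VERSION = "2026-05"   # bump when metric or tool definitions change

def trace_to_case(trace: dict, expected_tools: list, key_phrases: list,
                  redact_pii, tier: str = "nightly") -> dict:
    """Turn a misbehaving production trace into a redacted, versioned regression case.
    The expectations come from whoever fixed the incident, not from the trace itself."""
    return {
        "schema_version": SUITE_SCHEMA_VERSION,
        "tier": tier,                                  # "ci", "nightly" or "release"
        "prompt": redact_pii(trace["input"]),
        "expected_tools": expected_tools,
        "expected_answer_contains": [redact_pii(p) for p in key_phrases],
        "latency_budget_s": 4.0,
    }

def append_to_suite(case: dict, path: str = "evals/golden_set.jsonl") -> None:
    """Golden cases are mined continuously, so the suite is an append-only file."""
    with Path(path).open("a") as f:
        f.write(json.dumps(case) + "\n")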

Tracing — what good instrumentation looks like

Agent tracing is consolidating fast around OpenTelemetry's GenAI semantic conventions — currently in development status, but already implemented by every major hosted observability vendor and by the open-source OpenLLMetry library. Adopting the conventions now means you don't have to rewrite the instrumentation when the spec stabilises.

A minimal instrumented agent step looks like this:

agent_step.py

from opentelemetry import trace
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as gen_ai
 
tracer = trace.get_tracer("agent.order_status")
 
def call_model(prompt: str, tools: list, model: str):
    # model_client and count_tokens below are placeholders for your own client and tokenizer.
    with tracer.start_as_current_span("chat") as span:
        span.set_attributes({
            gen_ai.GEN_AI_OPERATION_NAME: "chat",
            gen_ai.GEN_AI_REQUEST_MODEL: model,
            gen_ai.GEN_AI_USAGE_INPUT_TOKENS: count_tokens(prompt),
        })
        response = model_client.chat(prompt=prompt, tools=tools)
        span.set_attributes({
            gen_ai.GEN_AI_USAGE_OUTPUT_TOKENS: response.usage.output_tokens,
            gen_ai.GEN_AI_RESPONSE_FINISH_REASONS: [response.finish_reason],
        })
        return response

Once every model call and every tool call is shaped this way, the metrics layer falls out for free — error rates, p95 latencies, token-cost rollups by user or feature, grounding-rate dashboards. And anything you ship to a hosted backend (LangSmith, Langfuse, Braintrust, your existing APM) keeps working.
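
Tool calls get the same treatment as model calls; the incubating conventions define an execute_tool operation and a gen_ai.tool.name attribute. A sketch on the same assumptions as the snippet above; the registry of callables is illustrative, and the attribute constants may still move while the spec is in development:

agent_tool_step.py

from opentelemetry import trace
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as gen_ai
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("agent.order_status")

def call_tool(tool_name: str, arguments: dict, registry: dict):
    """Wrap one tool invocation in an execute_tool span; errors feed the tool-error-rate metric."""
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attributes({
            gen_ai.GEN_AI_OPERATION_NAME: "execute_tool",
            gen_ai.GEN_AI_TOOL_NAME: tool_name,
        })
        try:
            return registry[tool_name](**arguments)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR)
            raise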

Closing the loop with contracts and the semantic layer

The reason this stack works is that it sits on top of two layers I've written about before. Data contracts make the inputs an agent reads predictable. The semantic layer makes the metrics it returns consistent. Eval scope shrinks dramatically when both are in place: the agent isn't asked to reconcile three different definitions of "active customer" on the fly, because there's only one. It isn't fed yesterday's data, because the contract has a freshness SLA the platform enforces.

When an agent answers a question by querying the dbt Semantic Layer or a governed feature store, the surface area you have to evaluate is narrow and well-defined. When it answers by reading raw tables and doing inline calculations, the surface area is the entire warehouse. The first is testable. The second is hope.
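
One cheap way to make that testability concrete in the agent itself is a freshness guard on retrieval: even when the platform enforces the contract's SLA upstream, a belt-and-braces check keeps stale context out of answers. A sketch; the max_staleness value and the source_updated_at field on each retrieved chunk are assumptions about how your contract and retriever expose freshness:

guards/freshness.py

from datetime import datetime, timedelta, timezone

def fresh_only(retrieved: list, max_staleness: timedelta) -> list:
    """Keep only chunks whose source data is within the contract's freshness SLA."""
    now = datetime.now(timezone.utc)
    return [c for c in retrieved if now - c["source_updated_at"] <= max_staleness]

def answer_or_refuse(retrieved: list, max_staleness: timedelta, answer_fn) -> str:
    """Refuse, rather than improvise, when nothing fresh enough survives the guard."""
    fresh = fresh_only(retrieved, max_staleness)
    if not fresh:
        return "I don't have current data on that."
    return answer_fn(fresh)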

This is also where the Model Context Protocol becomes interesting in 2026 — it standardises how agents discover and call governed tools, which makes the trace data consistent across vendors and the eval harness portable.

Production patterns

A few patterns that consistently pay off once the stack is in place:

  • Circuit breakers on tool errors. If a tool's error rate crosses a threshold in the last fifteen minutes, the agent stops calling it and returns a structured "I can't answer that right now" rather than improvising. A sketch follows the list.
  • Structured refusals over confident hallucinations. Train the agent to say "I don't have current data on that" when grounding fails. Measure the refusal rate as a positive signal.
  • Human-in-the-loop gates for high-stakes actions. Refunds, account changes, anything regulatory. The agent prepares the action; a human approves the trigger. Log both.
  • Feature flags on the agent itself. Roll out new prompts, new tools, new models to 5% of traffic and watch the eval metrics before going wide.
  • Model-version pinning. Pin the exact model snapshot in production, score the new one offline, gate the rollout on pass-rate. "The new model is better" is not a deployment strategy.
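
The circuit breaker in the first item is small enough to sketch. This assumes you can observe each call's success or failure as it happens; the window, threshold and minimum-call numbers are illustrative defaults, not recommendations:

guards/tool_breaker.py

import time
from collections import deque

class ToolCircuitBreaker:
    """Stop calling a tool when its recent error rate crosses a threshold."""

    def __init__(self, window_s: float = 900.0, error_threshold: float = 0.25, min_calls: int = 10):
        self.window_s = window_s                 # fifteen-minute rolling window
        self.error_threshold = error_threshold
        self.min_calls = min_calls
        self.calls = deque()                     # (timestamp, was_error) pairs

    def record(self, was_error: bool) -> None:
        """Call after every tool invocation, successful or not."""
        now = time.monotonic()
        self.calls.append((now, was_error))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()

    def is_open(self) -> bool:
        """True means: skip the tool and return a structured refusal instead."""
        if len(self.calls) < self.min_calls:
            return False
        errors = sum(was_error for _, was_error in self.calls)
        return errors / len(self.calls) >= self.error_threshold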

A 90-day rollout

If you're standing this up from scratch, the work breaks roughly into three months. In the first month, instrument — wrap the agent in OpenTelemetry GenAI spans, ship traces to whatever backend you've already standardised on, and start watching tool error rate, grounding rate, and p95 latency. In the second month, build the eval harness — start with ten to fifteen real cases mined from your earliest traces, wire them into CI, and fail builds when the pass-rate drops. In the third month, harden the production patterns — circuit breakers, structured refusals, human-in-loop gates for high-stakes paths, online evals on a sample of live traffic. By day sixty the agent is observable; by day ninety it is governable.

Where this lands

Reliable AI agents are not built by chasing better models. They are built by teams that instrument what the agent does, evaluate every change against a living regression suite, and treat the data underneath the agent as part of the production surface — not someone else's problem.

Models keep getting better. The discipline around them is what will separate the agents that ship from the ones that quietly disappear after their second incident.

Talk to us

Move from AI demos to production.

If you're working through these foundation problems, we'd like to help. We'll send back a short, honest assessment of where you are and what we'd do first.