Pillar guide

AI agent evaluation, powered by replay.

Static prompt suites pass while production breaks. SafeRun evals run real traces against your candidate agent — and fail the build when it regresses.

The problem

Evals that don't reflect reality don't catch real bugs.

Most agent eval suites are a folder of hand-written prompts and an LLM-as-judge rubric. They're easy to write and almost always green — because real users never ask the questions you imagined.

Production traffic is the only fair eval set. The job of an eval system is to capture it, replay it against new versions, and score what actually happened at the action layer.

Six methods

How agent evaluation actually works.

Replay-based eval

Re-run real production failures against a candidate agent. The eval set writes itself from the bugs your users already found.
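
A minimal sketch of the mechanism in plain TypeScript: replay captured failure traces through a candidate agent and collect the ones that still fail. The Trace shape, agent signature, and pass check are illustrative assumptions, not the SafeRun SDK surface.

// Replay each captured failure through the candidate and report
// the trace ids that still produce the wrong tool call.
type Trace = {
  id: string;
  input: string;        // the original user request
  expectedTool: string; // the tool a correct run should call
};

type CandidateAgent = (input: string) => Promise<{ tool: string }>;

async function replayFailures(traces: Trace[], agent: CandidateAgent): Promise<string[]> {
  const stillFailing: string[] = [];
  for (const trace of traces) {
    const run = await agent(trace.input); // re-run the real request
    if (run.tool !== trace.expectedTool) stillFailing.push(trace.id);
  }
  return stillFailing;
}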

Model & prompt diffs

Compare runs across models (GPT-5 vs Claude vs Gemini) or prompt versions. See which one picks the right tool, returns the right value, and costs less.
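
Sketched below, under assumed shapes: run two candidates over the same traces and tally tool accuracy and cost side by side. None of these names come from the SafeRun SDK.

type RunResult = { tool: string; costUsd: number };
type Candidate = { name: string; run: (input: string) => Promise<RunResult> };
type DiffCase = { input: string; expectedTool: string };

// Score one candidate over the shared trace set.
async function scoreCandidate(traces: DiffCase[], c: Candidate) {
  let correct = 0;
  let cost = 0;
  for (const t of traces) {
    const r = await c.run(t.input);
    if (r.tool === t.expectedTool) correct++;
    cost += r.costUsd;
  }
  return { name: c.name, toolAccuracy: correct / traces.length, totalCostUsd: cost };
}

// Run both candidates on identical inputs so the diff is apples to apples.
async function diffCandidates(traces: DiffCase[], a: Candidate, b: Candidate) {
  return Promise.all([scoreCandidate(traces, a), scoreCandidate(traces, b)]);
}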

Outcome scoring

Score by tool-call success, policy decisions, human-approval overrides, and final task resolution — not just BLEU or token loss.
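
As a hedged illustration, a composite scorer over action-layer signals might look like this. The RunOutcome shape and the weights are placeholders to tune per agent, not SafeRun's scoring rubric.

type RunOutcome = {
  toolCallsSucceeded: boolean; // every tool call returned without error
  policyAllowed: boolean;      // the policy engine allowed the actions
  humanOverrode: boolean;      // a reviewer had to step in
  taskResolved: boolean;       // the end-to-end task actually completed
};

// Blend hard action-layer signals into a single 0..1 outcome score.
function outcomeScore(o: RunOutcome): number {
  return (
    0.3 * Number(o.toolCallsSucceeded) +
    0.2 * Number(o.policyAllowed) +
    0.2 * Number(!o.humanOverrode) +
    0.3 * Number(o.taskResolved)
  );
}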

Reliability score over time

One number per agent that rolls up success, blocks, retries, and approvals. Trend it across releases.
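
For illustration only, a rollup could be computed like this; the exact formula is an assumption, not SafeRun's published definition.

type ReleaseStats = {
  runs: number;      // total agent runs in the window
  successes: number; // runs that resolved the task cleanly
  blocks: number;    // actions blocked by policy
  retries: number;   // tool calls that needed retrying
  overrides: number; // human-approval overrides
};

// Success rate, discounted by how often the guardrails had to intervene.
function reliabilityScore(s: ReleaseStats): number {
  if (s.runs === 0) return 0;
  const interventionRate = (s.blocks + s.retries + s.overrides) / s.runs;
  return Math.max(0, s.successes / s.runs - 0.1 * interventionRate); // 0..1, trend per release
}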

Regression catching

Snapshot known-good runs. Fail the build when a candidate version regresses on any captured trace.
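
A snapshot-diff sketch, assuming a JSON file keyed by trace id: record known-good tool choices on the first run, then report every trace where the candidate diverges.

import { existsSync, readFileSync, writeFileSync } from "node:fs";

type Snapshot = Record<string, string>; // trace id -> known-good tool choice

function checkAgainstSnapshot(
  path: string,
  results: { traceId: string; tool: string }[],
): string[] {
  if (!existsSync(path)) {
    // First run: persist current behavior as the known-good baseline.
    const snap: Snapshot = Object.fromEntries(results.map((r) => [r.traceId, r.tool]));
    writeFileSync(path, JSON.stringify(snap, null, 2));
    return [];
  }
  const snap: Snapshot = JSON.parse(readFileSync(path, "utf8"));
  // Any captured trace whose tool choice changed is a regression.
  return results
    .filter((r) => snap[r.traceId] !== undefined && snap[r.traceId] !== r.tool)
    .map((r) => r.traceId);
}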

Synthetic + real traffic

Mix curated test cases with sampled production traces. Catches both happy-path bugs and weird-prompt edge cases.
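
A small sketch of the mixing step, with assumed shapes and a naive sampler: combine curated cases with a random sample of production traces into one eval set.

type EvalCase = { input: string; source: "curated" | "production" };

function buildEvalSet(
  curated: string[],
  productionTraces: string[],
  sampleSize: number,
): EvalCase[] {
  // Cheap shuffle-and-slice sample; fine for illustration, biased in theory.
  const sampled = [...productionTraces]
    .sort(() => Math.random() - 0.5)
    .slice(0, sampleSize);
  return [
    ...curated.map((input) => ({ input, source: "curated" as const })),
    ...sampled.map((input) => ({ input, source: "production" as const })),
  ];
}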

Avoid these

Four eval patterns that look fine and aren't.

Static prompt eval suites

Real users don't ask the questions you wrote. Your suite passes; production breaks.

LLM-as-judge only

Judges drift, agree with themselves, and miss tool-call correctness. Use them, but pair with tool-outcome scoring (see the sketch below).

Token-level metrics

BLEU and perplexity say nothing about whether the agent refunded the right customer.

Eval-once-at-launch

Models, tools, and prompts change weekly. Eval has to run on every change, against real traces.
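
One way to do the pairing noted above: let the hard tool outcome gate the soft judge score, so a fluent answer can't pass when the action layer failed. All names here are illustrative assumptions.

// Hard outcome gates soft judge score: action-layer failure is a zero.
async function pairedScore(
  judgeScore: () => Promise<number>, // 0..1 from an LLM-as-judge rubric
  toolOutcomeOk: boolean,            // did the tool calls actually succeed?
): Promise<number> {
  if (!toolOutcomeOk) return 0;      // fail hard regardless of the judge
  return judgeScore();               // the judge only refines passing runs
}
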
Drop-in CI

Block regressions before they ship.

eval.ts
import { evaluate } from "@saferun/sdk";

const result = await evaluate({
  agent: candidateAgent,
  traces: { source: "production", limit: 200 },
  scorers: ["tool-success", "policy-decision", "human-override"],
});

if (result.regressions.length > 0) process.exit(1);
// Wire into CI. Block PRs that regress real production traces.

FAQ

Common questions about AI agent evaluation.

Free up to 10k actions a month

Ship agents your on-call won't dread.

Add SafeRun in three lines. Validate, block, and replay every risky tool call — before it touches production.