How is AI agent monitoring different from LLM observability?

LLM observability watches the model — prompts, tokens, latency, cost. Agent monitoring watches actions — what tool was called, with what arguments, what the policy decided, what the tool returned, and whether the user got a good outcome. Same agent, completely different signal.

What's the most important metric to watch?

Tool-call success rate per agent, broken down by tool. It correlates with user-facing outcomes more directly than token count, latency, or model loss. The reliability score is the rolled-up version for execs.

Do I need monitoring if I already have SafeRun's policies and approvals?

Yes. Policies stop bad actions; monitoring tells you when policies need to change, when an agent is regressing, and when a tool is silently degrading. You want both — control and feedback.


Pillar guide

AI agent monitoring, done at the action layer.

The signals, metrics, and tooling for monitoring agents in production — not just the LLMs underneath them.

Why it's its own discipline

LLM tracing isn't agent monitoring.

LLM observability tools are great at the model layer: token counts, prompt traces, eval scores. None of that tells you whether your agent just refunded the wrong customer, looped on a flaky tool, or quietly stopped picking the right one after a model upgrade.

Agent monitoring lives one layer up — at the action. What did the agent do? Did the tool succeed? Did policy block it? Did a human have to step in? Those are the signals that correlate with user outcomes.

Six signals

What you actually want on the dashboard.

Live action stream

Every tool call, argument, policy decision, and outcome streamed in real time. Filter by agent, tool, status, or user.
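
As a rough illustration of what one entry in such a stream could look like, here is a minimal sketch. The `ActionEvent` shape, its field names, and the `matchesFilter` helper are all assumptions made for this example and are not taken from SafeRun's SDK.

```typescript
// Hypothetical event shape: every name here is an assumption for illustration.
type PolicyDecision = "allow" | "block" | "needs_approval";
type Outcome = "success" | "error" | "timeout" | "not_run";

interface ActionEvent {
  timestamp: number; // epoch milliseconds
  agent: string;
  tool: string;
  user: string;
  args: Record<string, unknown>;
  decision: PolicyDecision;
  outcome: Outcome;
}

// Any field left undefined matches everything.
interface StreamFilter {
  agent?: string;
  tool?: string;
  user?: string;
  outcome?: Outcome;
}

// Returns true when the event matches every field that is set on the filter.
function matchesFilter(ev: ActionEvent, filter: StreamFilter): boolean {
  if (filter.agent !== undefined && ev.agent !== filter.agent) return false;
  if (filter.tool !== undefined && ev.tool !== filter.tool) return false;
  if (filter.user !== undefined && ev.user !== filter.user) return false;
  if (filter.outcome !== undefined && ev.outcome !== filter.outcome) return false;
  return true;
}

const sampleEvents: ActionEvent[] = [
  { timestamp: 1, agent: "support-agent-v2", tool: "refund", user: "u_1", args: { amount: 20 }, decision: "allow", outcome: "success" },
  { timestamp: 2, agent: "support-agent-v2", tool: "refund", user: "u_2", args: { amount: 35 }, decision: "allow", outcome: "error" },
  { timestamp: 3, agent: "support-agent-v2", tool: "lookup_order", user: "u_2", args: { id: "A1" }, decision: "allow", outcome: "error" },
  { timestamp: 4, agent: "billing-agent", tool: "refund", user: "u_3", args: { amount: 900 }, decision: "block", outcome: "not_run" },
];

// Narrow the stream to failed refund calls, regardless of agent or user.
const refundErrors = sampleEvents.filter((ev) => matchesFilter(ev, { tool: "refund", outcome: "error" }));
```

Keeping the policy decision and the tool outcome as separate fields lets a blocked call (`not_run`) be told apart from a call that ran and failed.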

Anomaly alerts

Page when block rate, retry rate, approval rate, or error rate moves outside an agent's learned baseline.
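
One simple way to approximate a "learned baseline" is a z-score check against recent history. The sketch below assumes hourly rate samples per agent; the function names, the threshold of 3 standard deviations, and the minimum spread are illustrative choices, not a description of how SafeRun computes baselines.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Learn a baseline from historical rate samples (for example, hourly block rates).
function learnBaseline(samples: number[]): Baseline {
  if (samples.length === 0) throw new Error("need at least one sample");
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  const variance = samples.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / samples.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flags a value that sits more than `threshold` standard deviations from the mean.
// `minStdDev` keeps a very flat history from turning tiny wiggles into pages.
function isAnomalous(current: number, baseline: Baseline, threshold = 3, minStdDev = 0.005): boolean {
  const spread = Math.max(baseline.stdDev, minStdDev);
  return Math.abs(current - baseline.mean) / spread > threshold;
}

// Assumed sample data: block rate per hour for one agent.
const hourlyBlockRates = [0.02, 0.03, 0.02, 0.025, 0.03, 0.02];
const blockRateBaseline = learnBaseline(hourlyBlockRates);

const spikeShouldPage = isAnomalous(0.15, blockRateBaseline); // a jump to a 15% block rate
const normalHourShouldPage = isAnomalous(0.025, blockRateBaseline); // an ordinary hour
```

The same check can be run separately for retry rate, approval rate, and error rate, each against its own baseline.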

Per-agent reliability score

A single number per agent derived from policy decisions, tool outcomes, and human overrides. It should trend up and to the right; if it doesn't, the agent ships back to dev.

Latency & cost per tool

p50, p95, and p99 latency, plus cost, broken down by tool and agent. Find the one tool that's eating your budget.
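
For reference, percentiles like these can be computed with the nearest-rank method. The sketch below groups calls by agent and tool; the `ToolCall` record and its fields are assumptions made for the example.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Hypothetical per-call record.
interface ToolCall {
  agent: string;
  tool: string;
  latencyMs: number;
  costUsd: number;
}

interface ToolStats {
  calls: number;
  p50: number;
  p95: number;
  p99: number;
  totalCostUsd: number;
}

// Groups calls under an "agent/tool" key and summarizes each group.
function statsByAgentAndTool(calls: ToolCall[]): Record<string, ToolStats> {
  const grouped: Record<string, ToolCall[]> = {};
  for (const call of calls) {
    const key = `${call.agent}/${call.tool}`;
    (grouped[key] = grouped[key] || []).push(call);
  }
  const result: Record<string, ToolStats> = {};
  for (const key of Object.keys(grouped)) {
    const latencies = grouped[key].map((c) => c.latencyMs);
    result[key] = {
      calls: latencies.length,
      p50: percentile(latencies, 50),
      p95: percentile(latencies, 95),
      p99: percentile(latencies, 99),
      totalCostUsd: grouped[key].reduce((sum, c) => sum + c.costUsd, 0),
    };
  }
  return result;
}

const sampleCalls: ToolCall[] = [
  { agent: "support-agent-v2", tool: "search", latencyMs: 100, costUsd: 0.01 },
  { agent: "support-agent-v2", tool: "search", latencyMs: 120, costUsd: 0.01 },
  { agent: "support-agent-v2", tool: "search", latencyMs: 110, costUsd: 0.01 },
  { agent: "support-agent-v2", tool: "search", latencyMs: 900, costUsd: 0.01 },
  { agent: "support-agent-v2", tool: "refund", latencyMs: 300, costUsd: 0.05 },
];

const toolStats = statsByAgentAndTool(sampleCalls);
```

With only four samples, a single 900 ms outlier becomes both the p95 and the p99 for `search`, which is why tail percentiles are only meaningful once a tool has enough traffic.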

Policy decision analytics

What got blocked, why, and how often. Tighten policies that never trigger, and loosen the ones that block known-good calls.
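
A minimal sketch of that analysis, assuming each policy evaluation is logged with the policy that fired and whether a human later overrode the block. The record shape, the helper names, and the 50% override cutoff are hypothetical.

```typescript
// Hypothetical log record for one policy evaluation.
interface PolicyEvaluation {
  policyId: string;
  decision: "allow" | "block";
  overriddenByHuman: boolean; // a reviewer later marked the blocked call as known-good
}

interface PolicyReport {
  evaluations: number;
  blocks: number;
  overriddenBlocks: number;
}

// Tallies evaluations per policy, including policies that never appear in the log.
function summarizePolicies(policyIds: string[], log: PolicyEvaluation[]): Record<string, PolicyReport> {
  const report: Record<string, PolicyReport> = {};
  for (const id of policyIds) {
    report[id] = { evaluations: 0, blocks: 0, overriddenBlocks: 0 };
  }
  for (const entry of log) {
    const row = report[entry.policyId];
    if (!row) continue; // ignore evaluations for policies we were not asked about
    row.evaluations += 1;
    if (entry.decision === "block") {
      row.blocks += 1;
      if (entry.overriddenByHuman) row.overriddenBlocks += 1;
    }
  }
  return report;
}

// Policies that never blocked anything: candidates for review.
function neverTriggered(report: Record<string, PolicyReport>): string[] {
  return Object.keys(report).filter((id) => report[id].blocks === 0);
}

// Policies whose blocks are mostly overridden: candidates for loosening.
function mostlyOverridden(report: Record<string, PolicyReport>, maxOverrideShare = 0.5): string[] {
  return Object.keys(report).filter((id) => {
    const row = report[id];
    return row.blocks > 0 && row.overriddenBlocks / row.blocks > maxOverrideShare;
  });
}

const policyLog: PolicyEvaluation[] = [
  { policyId: "refund-limit", decision: "block", overriddenByHuman: true },
  { policyId: "refund-limit", decision: "block", overriddenByHuman: true },
  { policyId: "refund-limit", decision: "block", overriddenByHuman: false },
  { policyId: "pii-export", decision: "block", overriddenByHuman: false },
  { policyId: "pii-export", decision: "allow", overriddenByHuman: false },
];

const policyReport = summarizePolicies(["refund-limit", "pii-export", "delete-account"], policyLog);
const reviewCandidates = neverTriggered(policyReport);
const loosenCandidates = mostlyOverridden(policyReport);
```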

Approval queue depth

How fast humans clear the queue, how many actions time out waiting, and which agents over-rely on approvals.
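
These three questions map onto a small summary over approval requests. The sketch below assumes each request records when it was raised, when it was resolved, and how; the `ApprovalRequest` shape is an assumption for illustration.

```typescript
// Hypothetical approval request record.
interface ApprovalRequest {
  agent: string;
  requestedAt: number; // epoch milliseconds
  resolvedAt: number | null; // null while still pending
  resolution: "approved" | "rejected" | "timed_out" | "pending";
}

interface QueueSummary {
  depth: number; // requests still waiting on a human
  timedOut: number; // requests that expired before anyone acted
  medianClearMs: number | null; // median time for a human to approve or reject
  requestsByAgent: Record<string, number>;
}

function summarizeQueue(requests: ApprovalRequest[]): QueueSummary {
  const clearTimes: number[] = [];
  const requestsByAgent: Record<string, number> = {};
  let depth = 0;
  let timedOut = 0;

  for (const req of requests) {
    requestsByAgent[req.agent] = (requestsByAgent[req.agent] || 0) + 1;
    if (req.resolution === "pending") {
      depth += 1;
    } else if (req.resolution === "timed_out") {
      timedOut += 1;
    } else if (req.resolvedAt !== null) {
      clearTimes.push(req.resolvedAt - req.requestedAt);
    }
  }

  clearTimes.sort((a, b) => a - b);
  const mid = Math.floor(clearTimes.length / 2);
  let medianClearMs: number | null = null;
  if (clearTimes.length > 0) {
    medianClearMs = clearTimes.length % 2 === 1 ? clearTimes[mid] : (clearTimes[mid - 1] + clearTimes[mid]) / 2;
  }

  return { depth, timedOut, medianClearMs, requestsByAgent };
}

const approvalRequests: ApprovalRequest[] = [
  { agent: "support-agent-v2", requestedAt: 0, resolvedAt: 60000, resolution: "approved" },
  { agent: "support-agent-v2", requestedAt: 0, resolvedAt: 120000, resolution: "rejected" },
  { agent: "support-agent-v2", requestedAt: 1000, resolvedAt: 31000, resolution: "approved" },
  { agent: "billing-agent", requestedAt: 0, resolvedAt: 300000, resolution: "timed_out" },
  { agent: "billing-agent", requestedAt: 5000, resolvedAt: null, resolution: "pending" },
];

const queueSummary = summarizeQueue(approvalRequests);
```

Comparing `requestsByAgent` against each agent's total call volume shows which agents lean on approvals more than their peers.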

The metrics

Track these. Skip the rest.

Metric	What it is	Why it matters
Tool-call success rate	% of agent tool calls that returned a usable result	Top-line agent health
Policy block rate	% of calls blocked by SafeRun policies	Detect regressions and over-tight policies
Approval rate	% of calls escalated to a human	Trust calibration over time
Loop trip count	Times the loop breaker fired this period	Catches runaway agents early
Cost per resolved task	$ spent end-to-end per successful agent run	The real unit economics number
p95 tool latency	95th percentile latency per tool, per agent	Find the slow tool dragging the run
Reliability score (0–100)	Composite of success, blocks, retries, approvals	One number for product reviews
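
To make the table concrete, here is a sketch of how several of these metrics could be derived from one period's raw counts. The `PeriodCounts` fields and, in particular, the weights in the composite score are illustrative assumptions; the source does not state how the reliability score is actually weighted.

```typescript
// Hypothetical raw counts for one agent over one reporting period.
interface PeriodCounts {
  toolCalls: number; // all attempted tool calls, including blocked ones
  succeeded: number; // calls that returned a usable result
  blocked: number; // calls stopped by policy
  escalated: number; // calls sent to a human for approval
  retried: number; // calls that needed at least one retry
  resolvedTasks: number; // agent runs that ended successfully
  totalCostUsd: number; // end-to-end spend for the period
}

interface AgentMetrics {
  successRate: number;
  blockRate: number;
  approvalRate: number;
  retryRate: number;
  costPerResolvedTask: number | null; // null when nothing was resolved
  reliabilityScore: number; // 0 to 100
}

function computeMetrics(c: PeriodCounts): AgentMetrics {
  const rate = (n: number): number => (c.toolCalls === 0 ? 0 : n / c.toolCalls);
  const successRate = rate(c.succeeded);
  const blockRate = rate(c.blocked);
  const approvalRate = rate(c.escalated);
  const retryRate = rate(c.retried);

  // Assumed weights: start from success rate, then subtract penalties.
  const raw = 100 * (successRate - 0.5 * blockRate - 0.25 * retryRate - 0.25 * approvalRate);
  const reliabilityScore = Math.round(Math.min(100, Math.max(0, raw)));

  return {
    successRate,
    blockRate,
    approvalRate,
    retryRate,
    costPerResolvedTask: c.resolvedTasks === 0 ? null : c.totalCostUsd / c.resolvedTasks,
    reliabilityScore,
  };
}

const weeklyMetrics = computeMetrics({
  toolCalls: 1000,
  succeeded: 900,
  blocked: 40,
  escalated: 20,
  retried: 60,
  resolvedTasks: 200,
  totalCostUsd: 50,
});
```

With these sample counts the score works out to 86 and the cost per resolved task to $0.25. Whatever weights are used, keeping them fixed over time matters more than their exact values, since the score is mainly read as a trend.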

Three lines

Wire monitoring in once.

agent.ts (TypeScript)

import { monitor } from "@saferun/sdk";

monitor(agent, { service: "support-agent-v2", env: "prod" });
// Live stream, scores, and alerts. No new dashboard to wire up.

FAQ

Common questions about monitoring agents.

Keep reading

Runtime action-control for AI agents

Give production agents a checkpoint before they act.

Wrap risky tool calls, pause or block what shouldn't run, and replay the decision so teams can turn each near-miss into a rule.
