Pillar guide

AI agent monitoring, done at the action layer.

The signals, metrics, and tooling for monitoring agents in production — not just the LLMs underneath them.

Why it's its own discipline

LLM tracing isn't agent monitoring.

LLM observability tools are great at the model layer: token counts, prompt traces, eval scores. None of that tells you whether your agent just refunded the wrong customer, looped on a flaky tool, or quietly stopped picking the right one after a model upgrade.

Agent monitoring lives one layer up — at the action. What did the agent do? Did the tool succeed? Did policy block it? Did a human have to step in? Those are the signals that correlate with user outcomes.

Six signals

What you actually want on the dashboard.

Live action stream

Every tool call, argument, policy decision, and outcome streamed in real time. Filter by agent, tool, status, or user.
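
To make the filtering concrete, here is a minimal sketch of what one streamed action might look like and how a filter could be applied to it. The `ActionEvent` shape, its field names, and `filterEvents` are hypothetical and only for illustration; they are not SafeRun's schema or API.

```typescript
// Hypothetical event shape; field names are assumptions, not SafeRun's schema.
type ActionStatus = "success" | "error" | "blocked" | "pending_approval";

interface ActionEvent {
  agent: string;
  tool: string;
  user: string;
  status: ActionStatus;
  args: Record<string, unknown>;
  timestamp: number; // Unix epoch milliseconds
}

// Any subset of the four filterable fields named in the text above.
type EventFilter = Partial<Pick<ActionEvent, "agent" | "tool" | "status" | "user">>;

// Keep only events that match every field supplied in the filter.
// An empty filter matches everything.
function filterEvents(events: ActionEvent[], filter: EventFilter): ActionEvent[] {
  const keys = Object.keys(filter) as (keyof EventFilter)[];
  return events.filter((event) =>
    keys.every((key) => filter[key] === undefined || event[key] === filter[key]),
  );
}
```

Calling `filterEvents(events, { tool: "refund", status: "blocked" })` would return only the blocked refund calls, which is usually the first view worth pinning.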

Anomaly alerts

Page when block rate, retry rate, approval rate, or error rate moves outside an agent's learned baseline.
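
How a baseline is learned is not specified here. One simple way to picture it is a z-score check against a trailing window of the same rate. The sketch below assumes that approach; `learnBaseline`, `isAnomalous`, and the default threshold of 3 standard deviations are illustrative choices, not the product's actual detector.

```typescript
interface Baseline {
  mean: number;
  stdDev: number;
}

// Summarize a trailing window of a rate (for example, hourly block rate).
function learnBaseline(history: number[]): Baseline {
  if (history.length === 0) throw new Error("history must not be empty");
  const mean = history.reduce((sum, x) => sum + x, 0) / history.length;
  const variance =
    history.reduce((sum, x) => sum + (x - mean) ** 2, 0) / history.length;
  return { mean, stdDev: Math.sqrt(variance) };
}

// Flag the current value when it sits more than `threshold` standard
// deviations away from the baseline mean, in either direction.
function isAnomalous(current: number, baseline: Baseline, threshold = 3): boolean {
  // A flat baseline has zero spread, so any change at all counts as a deviation.
  if (baseline.stdDev === 0) return current !== baseline.mean;
  return Math.abs(current - baseline.mean) / baseline.stdDev > threshold;
}
```

Checking both directions matters: a block rate that suddenly drops to zero can be as worrying as one that spikes.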

Per-agent reliability score

A single number per agent derived from policy decisions, tool outcomes, and human overrides. Up and to the right or it ships back to dev.
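
The exact formula behind the score is not given here. As a rough sketch, a composite like this can be a weighted sum of the underlying rates, scaled to 0–100. The inputs below follow the components listed in the metrics table (success, blocks, retries, approvals); the `WEIGHTS` values and the `reliabilityScore` function are assumptions chosen purely for illustration.

```typescript
// All rates are fractions in the range [0, 1].
interface AgentRates {
  successRate: number;
  blockRate: number;
  retryRate: number;
  approvalRate: number;
}

// Illustrative weights that sum to 1. These are assumptions, not a published formula.
const WEIGHTS = { success: 0.5, block: 0.2, retry: 0.15, approval: 0.15 };

// Higher success raises the score; blocks, retries, and approvals lower it.
function reliabilityScore(rates: AgentRates): number {
  const raw =
    WEIGHTS.success * rates.successRate +
    WEIGHTS.block * (1 - rates.blockRate) +
    WEIGHTS.retry * (1 - rates.retryRate) +
    WEIGHTS.approval * (1 - rates.approvalRate);
  // Clamp in case of out-of-range inputs, then scale to an integer 0-100.
  return Math.round(Math.min(1, Math.max(0, raw)) * 100);
}
```

Whatever weights you settle on, keep them fixed between releases so a change in the score reflects a change in the agent rather than a change in the formula.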

Latency & cost per tool

p50, p95, p99 latency and cost broken down by tool and agent. Find the one tool that's eating your budget.
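
For reference, the percentile breakdown can be sketched with the nearest-rank method over raw call records. The `ToolCall` shape and the `statsByTool` helper are hypothetical; a production system would more likely use streaming histograms than sort every sample.

```typescript
interface ToolCall {
  tool: string;
  latencyMs: number;
  costUsd: number;
}

interface ToolStats {
  p50: number;
  p95: number;
  p99: number;
  totalCostUsd: number;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("values must not be empty");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Group calls by tool, then compute latency percentiles and total cost per tool.
function statsByTool(calls: ToolCall[]): Record<string, ToolStats> {
  const grouped: Record<string, ToolCall[]> = {};
  for (const call of calls) {
    (grouped[call.tool] = grouped[call.tool] ?? []).push(call);
  }
  const stats: Record<string, ToolStats> = {};
  for (const tool of Object.keys(grouped)) {
    const latencies = grouped[tool].map((c) => c.latencyMs);
    stats[tool] = {
      p50: percentile(latencies, 50),
      p95: percentile(latencies, 95),
      p99: percentile(latencies, 99),
      totalCostUsd: grouped[tool].reduce((sum, c) => sum + c.costUsd, 0),
    };
  }
  return stats;
}
```

Sorting the resulting entries by `totalCostUsd` is a quick way to surface the tool that dominates spend.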

Policy decision analytics

What got blocked, why, and how often. Tighten policies that never trigger, and loosen the ones that block known-good calls.
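
A minimal sketch of the "how often" part: count blocks per policy and list the policies that never fired. The `PolicyDecision` shape and both helper names are hypothetical and exist only to illustrate the idea.

```typescript
interface PolicyDecision {
  policy: string;
  decision: "allow" | "block";
  reason?: string;
}

// Count blocks per policy. Seeding with the full policy list keeps
// policies that never blocked anything visible with a count of 0.
function blocksPerPolicy(
  decisions: PolicyDecision[],
  policies: string[],
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const name of policies) counts[name] = 0;
  for (const d of decisions) {
    if (d.decision === "block") counts[d.policy] = (counts[d.policy] ?? 0) + 1;
  }
  return counts;
}

// Policies with zero blocks in the period: candidates for review.
function neverTriggered(counts: Record<string, number>): string[] {
  return Object.keys(counts).filter((name) => counts[name] === 0);
}
```

Finding the over-tight policies takes one more input this sketch leaves out: a label, usually from a human reviewer, marking which blocked calls were actually fine.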

Approval queue depth

How fast humans clear the queue, how many actions time out waiting, and which agents over-rely on approvals.
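
These queue numbers can be sketched from a list of approval requests. The `ApprovalRequest` shape, its outcome labels, and `queueStats` are assumptions made for this example rather than a documented schema.

```typescript
interface ApprovalRequest {
  agent: string;
  requestedAt: number; // Unix epoch milliseconds
  resolvedAt?: number; // undefined while the request is still pending
  outcome: "approved" | "rejected" | "timed_out" | "pending";
}

interface QueueStats {
  depth: number; // requests still waiting on a human
  timedOut: number; // requests that expired before anyone acted
  medianClearMs: number | null; // null when nothing has been cleared yet
}

function queueStats(requests: ApprovalRequest[]): QueueStats {
  const depth = requests.filter((r) => r.outcome === "pending").length;
  const timedOut = requests.filter((r) => r.outcome === "timed_out").length;

  // Time-to-clear only counts requests a human actually decided.
  const clearTimes: number[] = [];
  for (const r of requests) {
    const decided = r.outcome === "approved" || r.outcome === "rejected";
    if (decided && r.resolvedAt !== undefined) {
      clearTimes.push(r.resolvedAt - r.requestedAt);
    }
  }
  clearTimes.sort((a, b) => a - b);

  let medianClearMs: number | null = null;
  if (clearTimes.length > 0) {
    const mid = Math.floor(clearTimes.length / 2);
    medianClearMs =
      clearTimes.length % 2 === 1
        ? clearTimes[mid]
        : (clearTimes[mid - 1] + clearTimes[mid]) / 2;
  }
  return { depth, timedOut, medianClearMs };
}
```

Running the same function on requests grouped by `agent` shows which agents lean on approvals the most.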

The metrics

Track these. Skip the rest.

| Metric | What it is | Why it matters |
| --- | --- | --- |
| Tool-call success rate | % of agent tool calls that returned a usable result | Top-line agent health |
| Policy block rate | % of calls blocked by SafeRun policies | Detect regressions and over-tight policies |
| Approval rate | % of calls escalated to a human | Trust calibration over time |
| Loop trip count | Times the loop breaker fired this period | Catches runaway agents early |
| Cost per resolved task | $ spent end-to-end per successful agent run | The real unit economics number |
| p95 tool latency | 95th percentile latency per tool, per agent | Find the slow tool dragging the run |
| Reliability score (0–100) | Composite of success, blocks, retries, approvals | One number for product reviews |

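
Several of these metrics are simple ratios over call and run records. The sketch below shows one way to compute them. The `CallRecord` and `RunRecord` shapes are hypothetical, each call is assumed to have exactly one outcome, and cost per resolved task is assumed to divide all spend (including failed runs) by the number of resolved runs.

```typescript
type CallOutcome = "success" | "error" | "blocked" | "escalated";

interface CallRecord {
  outcome: CallOutcome;
}

interface RunRecord {
  resolved: boolean; // did the run complete the user's task?
  costUsd: number; // end-to-end spend for the run
}

// Fraction of calls with the given outcome; 0 when there are no calls.
function rate(calls: CallRecord[], outcome: CallOutcome): number {
  if (calls.length === 0) return 0;
  return calls.filter((c) => c.outcome === outcome).length / calls.length;
}

// Total spend across all runs divided by the number of resolved runs.
// Returns null when nothing was resolved, since the ratio is undefined.
function costPerResolvedTask(runs: RunRecord[]): number | null {
  const resolvedCount = runs.filter((r) => r.resolved).length;
  if (resolvedCount === 0) return null;
  const totalCost = runs.reduce((sum, r) => sum + r.costUsd, 0);
  return totalCost / resolvedCount;
}
```

Here `rate(calls, "success")` gives the tool-call success rate, `rate(calls, "blocked")` the policy block rate, and `rate(calls, "escalated")` the approval rate. Counting failed runs in the numerator makes wasted spend show up in the cost figure instead of disappearing.
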
Three lines

Wire monitoring in once.

agent.ts

```typescript
import { monitor } from "@saferun/sdk";

monitor(agent, { service: "support-agent-v2", env: "prod" });
// Live stream, scores, and alerts. No new dashboard to wire up.
```


Runtime action-control for AI agents

Give production agents a checkpoint before they act.

Wrap risky tool calls, pause or block what shouldn't run, and replay the decision so teams can turn each near-miss into a rule.