AI agent monitoring, done at the action layer.
The signals, metrics, and tooling for monitoring agents in production — not just the LLMs underneath them.
LLM tracing isn't agent monitoring.
LLM observability tools are great at the model layer: token counts, prompt traces, eval scores. None of that tells you whether your agent just refunded the wrong customer, looped on a flaky tool, or quietly stopped picking the right tool after a model upgrade.
Agent monitoring lives one layer up — at the action. What did the agent do? Did the tool succeed? Did policy block it? Did a human have to step in? Those are the signals that correlate with user outcomes.
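
To make those four questions concrete, here is a minimal sketch of what one action-layer record could look like. The type and field names (`ActionRecord`, `policy`, `humanIntervened`, and so on) are illustrative assumptions, not the SafeRun SDK's actual schema.

```typescript
// Hypothetical record shape for a single agent action.
// All names here are assumptions made for illustration.
type PolicyDecision = "allowed" | "blocked" | "needs_approval";
type ToolOutcome = "success" | "error" | "timeout" | "not_run";

interface ActionRecord {
  agent: string;                  // which agent acted
  tool: string;                   // what it tried to do
  args: Record<string, unknown>;  // with which arguments
  policy: PolicyDecision;         // did policy block it?
  outcome: ToolOutcome;           // did the tool succeed?
  humanIntervened: boolean;       // did a human have to step in?
}

// Summarizes one record as a single line that answers the four questions.
function describeAction(a: ActionRecord): string {
  return [
    `${a.agent} called ${a.tool}`,
    `policy=${a.policy}`,
    `outcome=${a.outcome}`,
    a.humanIntervened ? "human stepped in" : "no human involved",
  ].join("; ");
}

const example: ActionRecord = {
  agent: "support-agent-v2",
  tool: "issue_refund",
  args: { orderId: "A-1001", amountUsd: 25 },
  policy: "needs_approval",
  outcome: "success",
  humanIntervened: true,
};

console.log(describeAction(example));
```

Keeping the policy decision, the tool outcome, and the human override on the same record is a design choice: it lets every dashboard view be a filter or aggregate over one stream.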
What you actually want on the dashboard.
Live action stream
Every tool call, argument, policy decision, and outcome streamed in real time. Filter by agent, tool, status, or user.
Anomaly alerts
Page when block rate, retry rate, approval rate, or error rate moves outside an agent's learned baseline.
Per-agent reliability score
A single number per agent derived from policy decisions, tool outcomes, and human overrides. If it isn't trending up, the agent goes back to dev.
Latency & cost per tool
p50, p95, p99 latency and cost broken down by tool and agent. Find the one tool that's eating your budget.
Policy decision analytics
What got blocked, why, and how often. Tighten policies that never trigger and loosen the ones that block known-good calls.
Approval queue depth
How fast humans clear the queue, how many actions time out waiting, and which agents over-rely on approvals.
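
The anomaly alerts described above compare a rate against an agent's baseline. One simple way to sketch that idea is a z-score check against recent history. This is an assumed technique for illustration only; the page does not say how baselines are learned, and `isAnomalous` and its threshold are hypothetical.

```typescript
// Sketch: flag a rate that sits far outside its recent history.
// A z-score baseline is an assumption, not the product's stated method.
function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Population standard deviation of the samples.
function stddev(xs: number[]): number {
  const m = mean(xs);
  const variance = xs.reduce((sum, x) => sum + (x - m) * (x - m), 0) / xs.length;
  return Math.sqrt(variance);
}

// Returns true when `current` is more than `zThreshold` standard
// deviations away from the mean of `history`.
function isAnomalous(history: number[], current: number, zThreshold = 3): boolean {
  if (history.length < 2) return false; // not enough data to form a baseline
  const m = mean(history);
  const sd = stddev(history);
  if (sd === 0) return current !== m;   // flat history: any change is notable
  return Math.abs(current - m) / sd > zThreshold;
}

// Hourly policy block rates for one agent (made-up sample values).
const blockRateHistory = [0.02, 0.03, 0.025, 0.02, 0.03];

console.log(isAnomalous(blockRateHistory, 0.03)); // within the usual range
console.log(isAnomalous(blockRateHistory, 0.12)); // well outside the usual range
```

The same check can run per agent for block rate, retry rate, approval rate, and error rate, each with its own history.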
Track these. Skip the rest.
| Metric | What it is | Why it matters |
|---|---|---|
| Tool-call success rate | % of agent tool calls that returned a usable result | Top-line agent health |
| Policy block rate | % of calls blocked by SafeRun policies | Detect regressions and over-tight policies |
| Approval rate | % of calls escalated to a human | Trust calibration over time |
| Loop trip count | Times the loop breaker fired this period | Catches runaway agents early |
| Cost per resolved task | $ spent end-to-end per successful agent run | The real unit economics number |
| p95 tool latency | 95th percentile latency per tool, per agent | Find the slow tool dragging the run |
| Reliability score (0–100) | Composite of success, blocks, retries, approvals | One number for product reviews |
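
As a rough illustration of how several of these metrics could be computed from raw call records, here is a minimal sketch. The `Call` shape, the `summarize` helper, and the sample data are hypothetical, and the percentile uses the nearest-rank method, which is one common convention rather than a documented choice.

```typescript
// Hypothetical per-call record; field names are assumptions.
interface Call {
  tool: string;
  policy: "allowed" | "blocked" | "needs_approval";
  succeeded: boolean; // returned a usable result
  latencyMs: number;  // 0 for calls that never ran
}

function pct(part: number, total: number): number {
  return total === 0 ? 0 : (100 * part) / total;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function summarize(calls: Call[], totalCostUsd: number, resolvedTasks: number) {
  // Blocked calls never executed, so they are excluded from latency.
  const executed = calls.filter((c) => c.policy !== "blocked");
  return {
    successRate: pct(calls.filter((c) => c.succeeded).length, calls.length),
    blockRate: pct(calls.filter((c) => c.policy === "blocked").length, calls.length),
    approvalRate: pct(calls.filter((c) => c.policy === "needs_approval").length, calls.length),
    p95LatencyMs: percentile(executed.map((c) => c.latencyMs), 95),
    costPerResolvedTask: resolvedTasks === 0 ? NaN : totalCostUsd / resolvedTasks,
  };
}

// Made-up sample: seven fast lookups, one slow failure, one block, one approval.
const calls: Call[] = [
  ...Array.from({ length: 7 }, (_, i): Call => ({
    tool: "lookup_order", policy: "allowed", succeeded: true, latencyMs: 100 + i * 10,
  })),
  { tool: "issue_refund", policy: "allowed", succeeded: false, latencyMs: 900 },
  { tool: "issue_refund", policy: "blocked", succeeded: false, latencyMs: 0 },
  { tool: "issue_refund", policy: "needs_approval", succeeded: true, latencyMs: 180 },
];

const summary = summarize(calls, 4, 5);
console.log(summary);
```

In this sample the single slow `issue_refund` call sets the p95, which is the kind of outlier the per-tool latency breakdown is meant to surface.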
Wire monitoring in once.
import { monitor } from "@saferun/sdk";
monitor(agent, { service: "support-agent-v2", env: "prod" });
// Live stream, scores, and alerts. No new dashboard to wire up.
Common questions about monitoring agents.
Give production agents a checkpoint before they act.
Wrap risky tool calls, pause or block what shouldn't run, and replay the decision so teams can turn each near-miss into a rule.
