LLM observability, extended to the whole agent.
Token traces tell you what the model said. SafeRun records what the agent did — every tool call, argument, return value, and policy decision — and lets you replay any failure step by step.
LLM tracing ends where the agent begins.
Classic LLM observability — LangSmith, Langfuse, Helicone, Arize Phoenix — does one thing well: it captures prompts, completions, tokens, and latency. That's enough when your product is a chat completion.
But agents are loops. They call tools, pass arguments, get results back, retry, branch, and sometimes do real damage. To debug them you need the action trace, not just the model trace — and you need to replay it.
What agent-grade observability looks like.
Full-trace capture
Every prompt, model response, tool call, argument, return value, latency, and cost — recorded at the action layer, not just the token layer.
Time-travel replay
Re-run any failed agent step with the exact context. Reproduce production bugs locally without copy-pasting JSON between Slack threads.
Live action stream
Watch agent runs as they happen. Filter by agent, tool, status, or user — find the bad run before the customer files a ticket.
Diffs across model versions
Compare runs across model upgrades. Catch the silent regression where GPT-5 picks the wrong tool 3% more often than GPT-4o did.
Inline policy decisions
Every block, allow, and approval recorded next to the action it touched. Observability and control in the same trace.
Anomaly + cost alerts
Page when block rate, retry rate, or per-run cost drifts off baseline. Stop runaway loops before they touch your AWS bill.
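The drift check behind that last alert is simple math: compare today's block rate and per-run cost against a recorded baseline and page when either drifts past a tolerance. The sketch below is illustrative only — `RunStats`, `shouldAlert`, and the 50% tolerance are assumptions for this page, not SafeRun's actual config surface.

```typescript
// Illustrative drift check -- field names and thresholds are
// assumptions, not SafeRun's real alerting API.
interface RunStats {
  blocked: number; // runs with at least one blocked action
  total: number;   // total runs in the window
  costUsd: number; // total spend in the window
}

// Trip when block rate or per-run cost drifts more than `tolerance`
// (a fraction, e.g. 0.5 = 50%) above the recorded baseline.
function shouldAlert(
  current: RunStats,
  baseline: RunStats,
  tolerance = 0.5
): boolean {
  const rate = (s: RunStats) => (s.total === 0 ? 0 : s.blocked / s.total);
  const cost = (s: RunStats) => (s.total === 0 ? 0 : s.costUsd / s.total);
  return (
    rate(current) > rate(baseline) * (1 + tolerance) ||
    cost(current) > cost(baseline) * (1 + tolerance)
  );
}

// Baseline: 2% block rate, $0.04/run. Current window: 9% block rate.
const baseline: RunStats = { blocked: 20, total: 1000, costUsd: 40 };
const current: RunStats = { blocked: 90, total: 1000, costUsd: 41 };
console.log(shouldAlert(current, baseline)); // true: block rate drifted off baseline
```

The same comparison works for retry rate or any per-run metric — the point is that the baseline is recorded, not guessed.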
LLM observability vs agent observability.
| Layer | LLM tracing | SafeRun (agent layer) |
|---|---|---|
| Prompts & completions | Yes | Yes |
| Token + cost per call | Yes | Yes |
| Tool calls + arguments | Partial | Yes |
| Tool return values | Rare | Yes |
| Step-by-step replay | No | Yes |
| Policy decisions inline | No | Yes |
| Human approvals | No | Yes |
| Loop + cost breakers | No | Yes |
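To make the table concrete: an action-layer trace records the tool name, exact arguments, return value, and status for every step, which is what makes step-by-step replay possible. The types and the in-memory `replayStep` below are a sketch of that idea under assumed names — the source only documents `observe`, so none of this is the published SDK surface.

```typescript
// Illustrative action-layer trace shape -- names are assumptions,
// not the real SafeRun SDK types.
type StepStatus = "ok" | "error" | "blocked";

interface RecordedStep {
  tool: string;                      // which tool the agent called
  args: Record<string, unknown>;     // exact arguments, as captured
  result?: unknown;                  // what the tool returned
  status: StepStatus;                // includes inline policy decisions
  latencyMs: number;
}

interface RecordedRun {
  runId: string;
  model: string;
  steps: RecordedStep[];
}

// "Time-travel" replay: re-run one recorded step against a local
// tool implementation using the exact captured arguments, then
// diff the replayed result against what production recorded.
function replayStep(
  step: RecordedStep,
  tools: Record<string, (args: Record<string, unknown>) => unknown>
): { recorded: unknown; replayed: unknown; matches: boolean } {
  const replayed = tools[step.tool](step.args);
  return {
    recorded: step.result,
    replayed,
    matches: JSON.stringify(replayed) === JSON.stringify(step.result),
  };
}

// Example: replay a captured production step locally.
const run: RecordedRun = {
  runId: "run_42",
  model: "gpt-5",
  steps: [
    {
      tool: "lookupOrder",
      args: { orderId: "A-1" },
      result: { status: "shipped" },
      status: "ok",
      latencyMs: 120,
    },
  ],
};

const localTools = {
  lookupOrder: (_args: Record<string, unknown>) => ({ status: "shipped" }),
};

const outcome = replayStep(run.steps[0], localTools);
console.log(outcome.matches); // true: local replay reproduces the recorded result
```

Because arguments and return values are captured verbatim, the replay needs no copy-pasted JSON — the recorded step is the reproduction case.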
Wire observability in once.
```typescript
import { observe } from "@saferun/sdk";

observe(agent, { service: "support-agent-v2", env: "prod" });
// Step-by-step traces, replay, and policy decisions in one timeline.
```

Common questions about LLM observability.
Ship agents your on-call won't dread.
Add SafeRun in three lines. Validate, block, and replay every risky tool call — before it touches production.
