The AI agent observability stack — what to measure, what to use
A practical companion to "How I think about agent observability". That one is the model; this one is the tool map.
Every team I talk to that's shipped one agent into production is now trying to figure out which observability tool to bring in. There's no shortage. LangSmith, LangFuse, Phoenix, Helicone, RAGAS, Promptfoo, Patronus, the various "Datadog for AI" startups, plus whatever your existing APM vendor is shipping this quarter. They all look like they do the same thing in the brochure. They don't.
The thing to internalise first is: agent observability isn't one problem. It's at least five, and most tools are good at one or two of them. Pick by what you actually need to see — not by who has the slickest dashboard.
The five things you might want to measure
Different things, different tools, different costs.
1. Traces and spans — what happened on this turn? The full call tree of an agent's reasoning: LLM calls, tool invocations, memory recalls, parallel branches. The thing that lets you replay a single user's bad request. Examples: LangSmith, LangFuse, Arize Phoenix, OpenLLMetry, Peekr. (See the sketch below for what one of these looks like as data.)
2. Token cost — who spent how many dollars? Per-user, per-tenant, per-feature attribution of token usage and inference dollars. Boring but load-bearing as soon as someone in finance asks. Examples: Helicone, LiteLLM, Vercel AI Gateway, and most of the tracing tools as a side effect.
3. Hallucination / faithfulness — did the output match the context? For RAG systems especially: is the model citing what it retrieved, or making things up? Usually needs a second LLM as judge or a deterministic rubric. Examples: RAGAS, TruLens, DeepEval, Patronus.
4. Pre-prod evals — will the change regress? Run a fixed test set against your prompt/model/agent before you ship the change. Closer to a unit test runner than a dashboard. Examples: Promptfoo, DeepEval, LangSmith's eval mode.
5. Drift and aggregate behaviour — how is the system behaving over time? Is the refusal rate climbing? Are retrieval ranks shifting? This is the dashboard-and-alerts layer, and it tends to be the most product-specific. Examples: most hosted platforms (LangSmith, LangFuse, Phoenix) once you have enough volume.
Most "AI observability" products focus on (1) and add (2) and (5) as natural extensions. (3) and (4) are often separate libraries you wire on top.
The categories, mapped
Rough taxonomy. Not exhaustive — just enough to know which category to read closely.
Tracing platforms (hosted)
- LangSmith — first-party tracing for LangChain/LangGraph apps; also works standalone. Tightest integration if you're already on the LangChain stack; rich UI; paid.
- LangFuse — open source, self-hostable, hosted tier available. Framework-neutral. Good middle ground when you want a real UI without a vendor lock-in.
- Arize Phoenix — open source, evaluator-friendly. Especially strong for RAG triage.
- Datadog LLM Observability / New Relic AI Monitoring — incumbents bolting LLM views onto existing APM. Useful if you're already paying them and want one pane of glass; weaker than the AI-native tools on agent-specific patterns.
Tracing libraries (DIY / lightweight)
- OpenLLMetry — OpenTelemetry-compatible instrumentation for LLM workflows. Use this if your org already runs OTEL and you want LLM traces in the same backend (Jaeger, Tempo, Honeycomb). See the sketch after this list.
- Peekr — zero-config JSONL tracing for OpenAI and Anthropic clients, in both Python and TypeScript. Writes spans to disk by default. Smallest possible first step; no hosted dependency. (I built this — see the previous post for why.)
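For a feel of the OTEL path: wrapping a model call in a span is a few lines with the vanilla OpenTelemetry API. A minimal sketch — the attribute names are my own convention, and OpenLLMetry automates this with its own semantics rather than requiring it by hand:

```python
# Hand-rolled OTEL span around an LLM call. Attribute names are my own
# convention for illustration; OpenLLMetry ships its own instrumentation.
from opentelemetry import trace

tracer = trace.get_tracer("agent")

def traced_completion(client, **kwargs):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model", kwargs.get("model", ""))
        resp = client.chat.completions.create(**kwargs)
        span.set_attribute("llm.tokens.prompt", resp.usage.prompt_tokens)
        span.set_attribute("llm.tokens.completion", resp.usage.completion_tokens)
        return resp
```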
LLM-as-proxy (cost + caching as the headline)
- Helicone — HTTP proxy in front of OpenAI/Anthropic. Drops in via a base-URL swap (sketch below); instantly gets you cost tracking, latency, and request logging.
- LiteLLM — model gateway + cost tracker; routes requests across providers under a unified API.
- Vercel AI Gateway — similar idea on the Vercel runtime; useful if you're already deployed there.
The proxy pattern is the simplest possible start for cost attribution, but it can't see anything that happens between LLM calls (tool use, retrieval, branching). You'll outgrow it the moment your agent gets non-trivial.
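Here's how small the swap is, as a sketch. The endpoint and header follow Helicone's documented pattern at the time of writing; verify against their current docs before shipping:

```python
# The proxy pattern: point the existing client at Helicone's
# OpenAI-compatible endpoint. URL and auth header follow Helicone's
# documented pattern; check current docs before relying on them.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)
# Every request through this client is now logged and cost-attributed.
# Nothing else in the app changes, which is the whole appeal.
```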
Eval / hallucination frameworks
- RAGAS — Python library of RAG-specific metrics: faithfulness, context precision, context recall, answer relevancy. Works against any RAG pipeline; doesn't lock you to a UI.
- TruLens — eval + feedback functions, more general than RAGAS.
- DeepEval — pytest-style assertion API for LLM outputs. Plays nicely with CI; example after this list.
- Promptfoo — CLI-first prompt eval. Great for the "compare two prompts on a fixed test set" workflow.
These are libraries, not dashboards. You either run them in CI or feed their outputs into a tracing platform.
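To show the CI shape, a sketch following DeepEval's documented pytest pattern. The metric, threshold, and the run_agent helper are illustrative choices, not prescriptions:

```python
# A pytest-style LLM assertion, per DeepEval's documented pattern.
# The threshold and the run_agent helper are illustrative.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def run_agent(question: str) -> str:
    ...  # your agent's entry point goes here

def test_refund_answer_is_relevant():
    case = LLMTestCase(
        input="What is your refund window?",
        actual_output=run_agent("What is your refund window?"),
    )
    assert_test(case, [AnswerRelevancyMetric(threshold=0.7)])
```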
Guardrails and runtime safety
- NeMo Guardrails (NVIDIA), Guardrails AI — runtime input/output validation. Less "observability" than "control," but they emit signals you'll want in your traces.
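The wiring worth doing early: whatever the guardrail decides, stamp the verdict onto the span so it's visible at triage time. A schematic sketch — the Verdict type and the check itself stand in for whichever guardrail library you actually run:

```python
# Schematic: record the guardrail verdict on the span. Verdict and
# validate() are stand-ins for NeMo Guardrails / Guardrails AI.
from dataclasses import dataclass

@dataclass
class Verdict:
    passed: bool
    rule: str = ""

def validate(text: str) -> Verdict:
    # Real validators run policy checks here; this stub flags one pattern.
    ok = "ssn" not in text.lower()
    return Verdict(passed=ok, rule="" if ok else "pii")

def guarded_reply(span: dict, draft: str) -> str:
    verdict = validate(draft)
    # The observability part: the verdict lands on the span, so blocked
    # outputs show up in the same traces as everything else.
    span["guardrail.passed"] = verdict.passed
    span["guardrail.rule"] = verdict.rule
    return draft if verdict.passed else "[blocked by guardrail]"
```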
Where to start
The honest, opinionated answer depends on where your team is. Three brackets.
Day 0 — you have one agent in prod and no observability at all
Start with a library, not a platform. Get spans into JSONL first; pick a UI later.
The cheapest path: install Peekr, wrap your OpenAI/Anthropic client, get spans flowing to disk in five minutes. No accounts, no procurement, no schemas to negotiate. Open the dashboard with peekr dashboard traces.jsonl -o report.html and you can already triage your worst trace.
If you specifically need cost attribution and nothing else, Helicone is a single base-URL swap.
Don't pick a hosted platform on day 0. You'll waste a week onboarding before you can answer "what does my worst request look like?"
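For scale: day-0 triage really is a dozen lines over the file. The field names here (latency_ms, span) are my assumptions about the schema, not a spec:

```python
# Find the slowest span in a JSONL trace file. Field names are
# assumptions about the schema, not a spec.
import json

with open("traces.jsonl") as f:
    spans = [json.loads(line) for line in f if line.strip()]

worst = max(spans, key=lambda s: s.get("latency_ms", 0))
print(worst.get("span"), worst.get("latency_ms"), "ms")
```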
Day 30 — multiple agents, a few thousand traces/day, you have an actual question
This is when you upgrade the backend, not the instrumentation. Pick one of:
- LangFuse — if you want a real UI and dashboards but don't want a hosted dependency. Self-host it next to your app.
- LangSmith — if you're on LangChain/LangGraph and want native integration; pay for the hosted tier.
- Arize Phoenix — if your concern is RAG quality more than generic tracing.
- Your existing APM (Datadog, New Relic) — only if your team is already living in it daily and most of your problem is plumbing-level latency, not agent semantics.
Whichever you pick, keep the JSONL writer too — a local copy of the traces is useful for offline replay even when you also ship them to a hosted backend.
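If you own the emit path, the tee is a few lines. Schematic: send_to_backend is a placeholder for whichever platform SDK you chose, not a real API:

```python
# Tee every span: one copy to local JSONL for grep and offline replay,
# one to the hosted backend. send_to_backend is a placeholder for your
# platform SDK's ingest call.
import json

def send_to_backend(span: dict) -> None:
    ...  # LangFuse / LangSmith / Phoenix client call goes here

def emit(span: dict, path: str = "traces.jsonl") -> None:
    with open(path, "a") as f:
        f.write(json.dumps(span) + "\n")  # local copy survives vendor churn
    send_to_backend(span)
```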
Day 90 — you need to catch hallucinations before they ship
Now add an eval layer. Two pieces:
1. A library (RAGAS, DeepEval, or TruLens) that runs metrics against a fixed test set in CI.
2. A subset of those metrics running online against a sampled fraction of production traces, with the results written back into your tracing backend (sketch below).
Without (2) you'll never see drift; without (1) you'll never catch regressions before they ship.
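Piece (2) is less work than it sounds: sample, score, write back. A schematic sketch in which score_faithfulness and attach_score are placeholders for your metric library and your backend's SDK:

```python
# Online eval: score a sampled fraction of production traces and write
# the result back to the tracing backend. score_faithfulness and
# attach_score are placeholders, not real APIs.
import random

SAMPLE_RATE = 0.02  # score ~2% of production traffic

def score_faithfulness(answer: str, contexts: list[str]) -> float:
    ...  # wire in a RAGAS / TruLens faithfulness metric here

def attach_score(trace_id: str, name: str, value: float) -> None:
    ...  # your backend's score/annotation API goes here

def maybe_score(trace: dict) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = score_faithfulness(trace["output"], trace["retrieved_chunks"])
    attach_score(trace["trace_id"], "faithfulness", score)
```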
The trap to avoid
The single most common failure mode I see: teams pick a hosted observability platform on day 1, spend three weeks integrating it, and discover that the "single pane of glass" doesn't actually answer the question they had — which was usually a very specific one ("why is this user's trace going wrong?") that a file-based tracer would have answered in five minutes.
The right order is: get traces somewhere — anywhere — first. Look at them. Notice what's missing. Then pick the platform that fills that gap.
The opposite order — pick the platform first, then fight the schema — is how teams end up paying for a SaaS observability bill twice the cost of their LLM bill, with worse triage than grep over JSONL.
Disclosure
I built Peekr, so my "start with a library" framing isn't neutral. I'd still give the same advice if I hadn't — it's been the right call on every team I've sat with — but you should weight the recommendation accordingly. The hosted platforms are all good; LangFuse and Phoenix in particular are work I respect and have learned from. Pick whichever helps you ship.
If you're evaluating these in 2026 and want a second opinion on what to pick for your specific stack, email me. Happy to weigh in for free; I learn from the conversations.