
How I think about agent observability

A draft. Companion piece to How I think about agent memory. Feedback welcome — ashwanijha04@gmail.com.

If you've ever shipped an LLM agent to production, you've had this conversation. The agent gave a user a wrong answer. Someone screenshots it. You open Datadog, hunt for the request, find a trace span that says POST /chat returned 200 in 8 seconds, and learn… nothing. The trace ends at the API boundary. Inside the agent — which model was called, what prompt, which tools fired, which memory was retrieved, how many tokens it cost — is a black box.

This is the gap Peekr tries to close. But before the tool, the mental model.

Why APM doesn't work for agents

Traditional application performance monitoring was built around a clean abstraction: requests are short, deterministic, and shaped like a tree of service calls. Datadog, New Relic, Honeycomb — they all assume one trace per user request, with spans that nest predictably.

Agents break almost every one of those assumptions:

  • They loop. A single user message might trigger ten LLM calls, three tool invocations, and a memory recall — all under one logical "turn." APM sees ten separate transactions and can't stitch them together.
  • They retry. Most agent frameworks retry on bad JSON, refusals, or tool failures. Naive instrumentation double-counts tokens, miscounts errors, and hides flakiness behind eventual success.
  • They cost money per call. A web request that takes 8 seconds is a latency problem. An agent call that takes 8 seconds and burns $0.42 in tokens is a latency and economics problem. APM tracks duration; it doesn't track dollars.
  • The "error rate" you care about is semantic. A 200 OK with a hallucinated answer is worse than a 500. APM has no concept of "the response was confidently wrong."
  • State lives outside the call. Memory, scratchpads, vector stores, prior conversations — the things that cause the agent's behavior — aren't in the request payload. You need a way to attach that state to the trace.

These aren't bolt-on extensions of normal APM. They're the workload.

What you actually need to see

After a couple of years of debugging agents in production at Amazon (Just Walk Out) and now at Property Finder, here's the minimum useful trace for one agent turn:

  1. A span tree that survives async boundaries. When an agent fans out (multiple tool calls in parallel, streaming responses, background memory writes), the spans need to stay attached to the same logical turn. In Node this means propagating context through AsyncLocalStorage; in Python, contextvars. Most "we added tracing" announcements quietly handle only the sequential case. A minimal sketch of the async-safe version follows this list.

  2. Inputs and outputs, truncated but real. Token counts tell you what it cost; raw text tells you what it said. Sampling helps with volume, but for any incident you'll want the actual prompt and the actual response, not a hash.

  3. Token accounting. tokens_input, tokens_output, tokens_total per call. Streaming makes this harder: OpenAI emits usage in a final chunk only if you set stream_options.include_usage, which most wrappers forget (see the streaming sketch after this list). Get this right or every per-user cost number you compute will be wrong.

  4. Status and error reasons. Not just HTTP codes. "Refused for safety," "JSON parse failure on tool call," "context length exceeded": each one has a different remediation, and aggregating them into a single "error" rate destroys the signal. A machine-readable sketch follows the list.

  5. Session and tenant attribution as first-class fields. Not buried in metadata. user_id, session_id, tenant_id — top-level on every span so you can answer "how much did acme.com spend last week" in one query.

  6. A retention class. Some spans contain PII or full conversation transcripts. They need different storage policies than the cheap token-count summary. Bake the policy hint into the span at write time, not at query time.
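
On point 1, here's a minimal sketch of the async-safe version in Node. This isn't Peekr's internals and the names are illustrative; the trick is that AsyncLocalStorage carries the turn context through parallel fan-out without threading it through every function signature.

import { AsyncLocalStorage } from "node:async_hooks";

type TurnContext = { trace_id: string; session_id: string };
const turnStore = new AsyncLocalStorage<TurnContext>();

async function callTool(name: string): Promise<void> {
  // Reads the ambient turn context even though it was never passed in.
  const ctx = turnStore.getStore();
  console.log(`tool=${name} trace_id=${ctx?.trace_id}`);
}

// Everything awaited inside the callback, including Promise.all fan-out,
// sees the same context, so parallel spans keep the same trace_id.
await turnStore.run({ trace_id: "t-9f2c", session_id: "sess-7" }, async () => {
  await Promise.all([callTool("search"), callTool("calculator")]);
});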

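And on point 3, the streaming pitfall with the OpenAI Node SDK. Without stream_options.include_usage, no chunk ever carries usage, and every streamed call silently counts as zero tokens:

import OpenAI from "openai";

const openai = new OpenAI();
const stream = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "hello" }],
  stream: true,
  stream_options: { include_usage: true }, // the line most wrappers forget
});

for await (const chunk of stream) {
  // Usage arrives exactly once, on a final chunk with an empty choices array.
  if (chunk.usage) {
    console.log(chunk.usage.prompt_tokens, chunk.usage.completion_tokens);
  }
}
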
If you have those six, you can debug almost anything. If you're missing any of them, expect to be guessing.
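
One more concrete note on point 4: keep the error reason machine-readable, not free text. A sketch of the shape I mean, with illustrative names rather than a fixed Peekr schema:

// Keep the reason enumerable so dashboards can group without regexes.
type SpanStatus =
  | { kind: "ok" }
  | { kind: "error"; reason: "safety_refusal" }
  | { kind: "error"; reason: "tool_json_parse_failure"; tool: string }
  | { kind: "error"; reason: "context_length_exceeded" }
  | { kind: "error"; reason: "rate_limited"; retry_after_ms?: number };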

The two-runtime problem

Most AI infra teams I talk to run a Python backend with an OpenAI/Anthropic client. Many of them also run a Next.js or Cloudflare Workers front-end that calls the same model directly — for streaming UX, low-latency edge calls, or because the framework they picked (Mastra, agent-graph, LangChain.js) is TypeScript-native.

Today you can pip-install-and-forget a Python-side tracer. There's no JS equivalent that produces traces in the same schema. So teams end up with one tracer in Python, a different tracer in JS, two query layers, two dashboards, and a permanent inability to ask "what was the full trace of this user's request?" because the request crossed runtimes.

This is the problem Peekr is built around. The wrappers are per-language (you can't avoid this — monkey-patching the OpenAI Node SDK requires Node) but the output schema is identical. A span written by the TS SDK and a span written by Python read into the same store, render in the same dashboard, and join on the same trace_id.

import { instrument, wrap, withSession } from "@peekr/sdk";
import OpenAI from "openai";

// Write spans to a local JSONL file; no backend required.
instrument({ jsonlPath: "./traces.jsonl" });

// Wrap the client so every call it makes emits a span automatically.
const openai = wrap(new OpenAI());

await withSession(
  { user_id: "alice", tenant_id: "acme", retention_class: "long" },
  async () => {
    // Everything inside the session inherits the user, tenant,
    // and retention tags as top-level span fields.
    await openai.chat.completions.create({ /* ... */ });
  },
);

Then on any machine with the Peekr Python CLI:

pip install peekr
peekr view --io traces.jsonl
peekr dashboard traces.jsonl -o report.html

Same store, same schema, indistinguishable from Python-produced traces.
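
For intuition, one span in that file looks roughly like this. The fields named earlier (trace_id, tenant_id, the token counts, retention_class) are the real requirements; the exact record shape below is illustrative, not the frozen Peekr schema:

{"trace_id":"t-9f2c","span_id":"s-01","parent_span_id":null,"name":"chat.completions.create","user_id":"alice","tenant_id":"acme","session_id":"sess-7","retention_class":"long","status":"ok","tokens_input":812,"tokens_output":164,"tokens_total":976,"duration_ms":2143}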

Why disk first, cloud later

The default Peekr exporter writes JSON-Lines to disk. Not a hosted service. Not a database. A file.

This is deliberate, and it's the thing people push back on most. The argument goes: "But what about scale? What about persistence? What about querying across machines?" All valid concerns. None of them matter on day one.

What matters on day one is: can you turn it on without filing a procurement ticket? Hosted observability for AI is currently expensive and slow to onboard; most teams stall out before they get a single trace. A pip install that writes to ./traces.jsonl and a peekr view command they can run in two minutes is the difference between "we have observability" and "we don't."

When the volume justifies it, swap the exporter:

import { instrument, HTTPExporter } from "@peekr/sdk";

const endpoint = "https://collector.example.com"; // wherever your backend lives
const apiKey = process.env.PEEKR_API_KEY;
instrument({ exporter: new HTTPExporter({ endpoint, apiKey }) });

…and the same schema flows to a hosted backend. Files first, cloud when it earns it.

What observability unlocks

This is the part most "add tracing!" announcements skip. Tracing is a means, not an end. The point of the schema is to make these queries cheap:

  • Hallucination triage. Pull every span where the user thumbs-down'd the response, join with the input prompt and the retrieved memory, group by which memory was top-ranked. Within a week of running this query, you'll find a small set of stale memories that are over-influencing answers — and now you can prune them.
  • Per-tenant cost reporting. SELECT tenant_id, SUM(tokens_total * price) is the entire query. If tenant_id is buried in metadata, this is a multi-hour project; if it's top-level, it's a one-liner.
  • Tool reliability. Group by tool name, status, and error reason. The "JSON parse failure" rate on your weakest tool is almost certainly higher than you think.
  • Memory hit rate. Pull every span tagged as a recall, look at whether the retrieved memories actually showed up in the model's response. Memory you wrote but never read isn't memory — it's storage.
  • The replay loop. Pick a real production trace, change one variable (a different prompt, a different model, a different retrieval rank), and re-run it offline. This is the workflow that turns observability into a development tool, not just a forensic tool. A sketch follows below.

Each of those is one or two SQL queries against the JSONL store. None of them are possible without the schema being right.
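
For example, the per-tenant cost report with DuckDB, which can query JSONL directly. The engine choice and the flat per-token price here are my assumptions; the store mandates neither:

SELECT tenant_id,
       SUM(tokens_total) * 0.000002 AS est_cost_usd  -- assumed blended price
FROM read_json_auto('traces.jsonl')
GROUP BY tenant_id
ORDER BY est_cost_usd DESC;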

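The replay loop is worth spelling out, since it's the one that changes how you develop. A minimal sketch, assuming the span records carry the captured request body in an input field (point 2 above); none of this is a built-in Peekr replay API:

import { readFileSync } from "node:fs";
import OpenAI from "openai";

// Load the JSONL store and pick the LLM-call span under investigation.
const spans = readFileSync("./traces.jsonl", "utf8")
  .trim()
  .split("\n")
  .map((line) => JSON.parse(line));
const span = spans.find(
  (s) => s.trace_id === "t-9f2c" && s.name === "chat.completions.create",
);
if (!span) throw new Error("trace not found");

// Re-run the recorded input with exactly one variable changed: the model.
const openai = new OpenAI();
const replay = await openai.chat.completions.create({
  ...span.input,        // assumed field: the captured request body
  model: "gpt-4o-mini", // the variable under test
});
console.log(replay.choices[0].message.content);
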
What I'd build next

Three things I'm working on next in Peekr — partly because I want them, partly because they're the things every team I talk to asks for:

  1. A diagnostic engine that suggests root causes. Most "AI observability dashboards" show you the data. None of them tell you what to look at first. I want the dashboard to surface "your hallucination rate spiked yesterday because retrieval rank dropped on these three memories" without me having to run the joins by hand.
  2. An evaluator harness. Plug in a hallucination check, a citation-accuracy check, a custom rubric — and run them as part of the regular trace pipeline. Eval shouldn't be a separate stack from observability; they're the same data.
  3. Tighter integration with Extremis. Every memory recall span should carry the recall reason (similarity, score, usage count, age). That makes the "stale memory ranked high" debugging loop trivial.

The library is open source. If you're debugging an agent in production right now and any of the above sounds familiar, please tell me what's broken in your setup — GitHub, issue tracker, or email. The schema is still young; every real workload makes it sharper.