← All writing
Paper Breakdown

Dapper: what Google's distributed tracing paper actually says

Reading Sigelman et al. (Google, 2010) while debugging a 400ms latency spike that touched nine services and left no useful logs.

The incident that prompted this: a request to our order placement service was intermittently slow, but only when a specific downstream inventory check was involved, and only during certain load windows. The logs were useless — every service logged "request received" and "request completed" with timings, but nothing connected them. You couldn't tell which of the 12 service calls in the critical path was slow for that specific request, because the request ID wasn't threaded through consistently, and even when it was, the timestamps across services weren't aligned. After three hours across five engineers, we found it: a distributed lock contention in the inventory service that only surfaced at >60% concurrent request load. We would have found it in 20 minutes with tracing.

Dapper — "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure", Sigelman et al., Google 2010 — is the paper that defined how distributed tracing works. Zipkin (Twitter, 2012), Jaeger (Uber, 2016), and OpenTelemetry (CNCF, 2019) all implement the Dapper model. The trace/span data model, the in-band context propagation through RPC headers, the root-sampling decision, the out-of-band collection pipeline — all of this comes from this paper. Understanding the original design choices explains why modern tracing systems have the limitations they do.

The problem Dapper is actually solving

Distributed systems have a specific observability problem that logs and metrics don't address: causality across process boundaries.

A single user request in a large system fans out to many services. Service A calls B and C in parallel; B calls D; C calls E and F; D calls G. The request completes when all sub-calls complete. If the overall latency is 400ms instead of the expected 50ms, the logs tell you each service's self-reported timing — but not which path caused the slowness, or how the waits compose. You want to know: what did this specific request do, across all services, in causal order, with timing for each hop?

The naive solution is to log a request ID and grep across service logs. This breaks when:

  • Request IDs aren't consistently propagated (one service generates a new ID)
  • The same logical request spawns async work that outlives the original request
  • Services have different clock offsets (NTP synchronization errors make timestamp ordering unreliable at millisecond granularity)
  • High fanout requests generate thousands of log lines that are expensive to correlate

Dapper's answer: instrument the RPC library to automatically propagate a trace context, collect timing data per span as a structured artifact (not a log line), and store spans in a dedicated backend keyed by trace ID.

The trace/span data model

The fundamental unit is a span: a named operation with a start timestamp, end timestamp, and a set of key-value annotations.

Every span carries:

  • Trace ID: a 64-bit integer, globally unique, the same for all spans in a request tree
  • Span ID: a 64-bit integer, unique within the trace
  • Parent Span ID: the span ID of the caller that created this span (empty for the root)
  • Span name: typically the RPC method name ("inventory.GetAvailability")
  • Timestamps: wall-clock start and end (from the local machine's clock)
  • Annotations: application-defined key-value pairs attached to the span

A trace is the complete tree of spans for a single request: root span at the top, child spans below it, each child's parent pointing to its caller.

The parent-child relationship is the mechanism that makes distributed tracing useful. When service A calls service B, A's instrumented RPC library creates a child span with A's current span ID as parent, passes the child span's trace ID and span ID in the RPC request headers, and B's library reads those headers to continue the trace. The result is a tree you can reconstruct from the spans alone — no centralized coordinator needed.

The paper uses an example with five spans for a frontend request that calls two backend services: one root, two intermediate, two leaf. This tree structure is the artifact you're looking at in Jaeger or Zipkin when you see the waterfall view — it's directly from the Dapper data model.

How context propagation actually works

The key insight is in-band propagation: trace context is carried inside the same RPC request that does the work.

When a thread makes an RPC call:

  1. The instrumented RPC library reads the current trace context from thread-local storage
  2. If no context exists (this is the root), create a new trace ID and root span
  3. Create a child span: new span ID, parent = current span ID, same trace ID
  4. Serialize the child span's trace ID and span ID into the RPC request headers (a few dozen bytes)
  5. Start the span timer, send the request, stop the timer when the response arrives
  6. Write the completed span to local disk

On the server side:

  1. The RPC library reads trace ID and parent span ID from incoming request headers
  2. Creates a new span with those values
  3. Stores it in thread-local storage for the duration of request handling
  4. Any further RPCs this handler makes will create children of this span

This is why you get distributed tracing "for free" when you instrument your RPC library: every service call propagates the context automatically, with no application code changes required. The paper emphasizes this as a design requirement — requiring every developer to manually instrument every function would guarantee incomplete coverage.

The thread-local storage mechanism is important. It means span context is implicitly passed through the call stack without modifying every function signature. The tradeoff appears when threads are handed off: if a request handler submits work to a thread pool, the thread-local context doesn't follow. The worker thread starts with no trace context, and any RPC calls it makes are orphaned — they generate spans with no parent, invisible in the trace tree. Dapper acknowledges this and requires explicit context handoff for async patterns. Modern tracing SDKs have baggage propagation APIs for this, but it remains a manual step that developers forget.

The sampling decision, and why it's made at the root

At Google's 2010 scale — billions of RPCs per day — recording every span at full fidelity would generate hundreds of terabytes of trace data daily. The paper reports Dapper processing roughly 1 petabyte per year of trace data, which implies significant sampling. The production default was 1 in 1024 requests sampled (approximately 0.1%).

The critical design choice: sampling is decided once at the trace root and propagated.

When the first service receives a request, it decides whether to sample. If yes, it sets a sampling flag in the trace context. Every downstream service reads this flag. If the trace is sampled, all spans in the trace are recorded — the sampling flag propagates exactly like the trace ID. If the trace is not sampled, no spans are recorded anywhere.

Why not sample per-span? Because orphaned spans are useless. If each service independently decides whether to sample its own spans, you get fragments: service A samples its span, service B doesn't, service C does. You have A's span and C's span but no way to connect them — C's parent (B) is missing. The trace is broken. Per-span sampling generates expensive incomplete data that can't be reconstructed into a coherent trace tree.

Root sampling means: all-or-nothing per trace. Either you have the complete tree or you have nothing. At 0.1% sampling rate on a system handling 1M requests/second, you're still getting 1000 complete, high-fidelity traces per second — enough to surface patterns.

The tradeoff: rare errors that occur in less than 0.1% of requests will be underrepresented in sampled traces. If your error rate is 0.01%, you're sampling roughly 1 in 10 errors. The paper discusses adaptive sampling — dynamically increasing the rate for specific services or error conditions — but the default was a fixed global rate because simplicity aids the "ubiquitous deployment" goal.

The out-of-band collection pipeline

Span data flows from application processes to Dapper's storage through a pipeline deliberately designed to be off the critical request path.

After a span completes, the instrumented library writes it to a local log file on the same machine. This is a low-latency, low-reliability write: the file is in a tmpfs-like location, the write is buffered, and data can be lost if the machine fails before flushing. The paper explicitly accepts some data loss in exchange for zero impact on request latency — a span write that blocks the request thread would violate the "low overhead" design goal.

A Dapper daemon runs on every production machine. It reads from the local log file, collects spans in batches, and pushes them to Dapper collectors via a secondary network path — not the same network handling production traffic. Collectors write spans to Bigtable, with one row per trace and one column family per span. BigTable's sparse column model is well-suited here: traces can have arbitrary numbers of spans, and you want to retrieve all spans for a trace in a single read.

The paper reports end-to-end latency from span completion to availability in the Bigtable-backed query system of approximately 15 minutes. This isn't real-time — you can't query a trace while the request is still in flight. For incident response, this is often acceptable; for real-time debugging of an active production incident, it isn't. Modern implementations (Jaeger, Honeycomb) have reduced this to seconds, but the architectural decision — write locally, collect out-of-band — persists.

Production overhead numbers

The paper measures three types of overhead:

Per-span annotation overhead: creating a root span costs ~204 nanoseconds. Creating a child span costs ~176 ns. These are measured on Google's 2010 hardware (roughly 2GHz x86); on modern hardware they'd be faster. At 1/1024 sampling, this cost is paid only 0.1% of the time.

Trace collection overhead: the Dapper daemon consumes less than 0.3% of a core on the machines tested. Local disk writes for span data are amortized through buffering and add negligible latency to the write path.

End-to-end latency impact: the paper reports "no statistically significant impact" on request latency for sampled and non-sampled traces. For non-sampled traces (99.9%), the overhead is a single comparison against the sampling flag and an early return. For sampled traces, the span creation overhead is on the order of microseconds in a request that likely takes milliseconds.

The caveat: these numbers are for the RPC instrumentation only. Application-level annotations — developers explicitly adding key-value pairs to spans (e.g., span.annotate("user_id", userId)) — have no built-in overhead budget. If an application annotates spans with expensive serialized objects or makes annotation writes in hot loops, the overhead isn't zero. The paper doesn't address application misuse.

What Dapper doesn't solve

Non-RPC causality. Dapper traces causality through RPC calls. If two services communicate via a shared database (service A writes a record, service B reads it and acts on it), there's no trace context in that communication path — the causal relationship is invisible. Same for event queues, shared memory, files, signals. A trace that ends at the database write and doesn't connect to the downstream consumer leaves a gap that's often the one you need most.

Modern tracing systems have added "messaging" span kinds and baggage injection into message headers (Kafka headers, SQS message attributes) to address this, but it requires explicit instrumentation of each messaging system — it doesn't happen automatically.

Async work and callbacks. Thread-local context propagation breaks when a request spawns async work. A request handler that submits a Runnable to an executor pool loses the trace context. The submitted work runs on a different thread with empty thread-local storage. Any RPCs in that work create new root spans with no parent — orphaned, invisible to the trace that triggered them.

The paper acknowledges this and recommends passing trace context explicitly to async work. In practice, teams forget, and you end up with traces that appear to complete quickly because the async tail is untraced. Production systems I've debugged had exactly this pattern: the synchronous part of a request was fast and well-traced; the async work that caused the real latency was invisible because no one had wired up context propagation through the job queue.

Clock skew. Span timestamps are taken from the local machine's clock. NTP synchronization maintains accuracy to within roughly 1-10 milliseconds. When a parent span's end time is before a child span's start time (due to clock drift between machines), the trace waterfall display shows the impossible: work completing before it started. Tools typically clamp these to zero-length or treat them as data quality issues. At microsecond-resolution traces, clock skew is the dominant source of measurement error.

Tail latency attribution. The most useful information in a trace — why was this specific request slow — is captured at whatever sampling rate you're running. At 0.1%, a request that's slow 0.5% of the time will be captured in roughly 1 in 200 such requests. You may not accumulate enough samples of the slow case to characterize it. Head-based sampling (decide at the root before you know if the request will be slow) systematically undersamples the interesting cases.

Tail-based sampling — decide whether to sample at the end of a trace, after you know the outcome — solves this but requires buffering complete traces before making the sampling decision, which means holding trace data in memory for the duration of each request. The paper doesn't implement this; modern systems like Refinery and Grafana Tempo's tail-sampling processor do, at the cost of memory and complexity.

When not to use distributed tracing as your primary debugging tool

For frequent, high-rate failures at low sampling. If 0.1% of requests fail and you're sampling 0.1%, you're capturing ~1 in 1000 failing requests as a traced sample. For a system handling 10K requests/second, that's 10 failing requests/second but only 1 sampled failing trace every 100 seconds. If the failure is sensitive to load or a specific data pattern, you may wait a long time for the right trace to appear. Error-specific sampling rate overrides help (sample 100% when the response code is 5xx), but require explicit instrumentation to trigger.

For long-running background jobs. A job that runs for 10 minutes and makes thousands of RPC calls generates a trace with thousands of spans. Rendering this in a trace visualization tool is slow; storing it is expensive; reading anything useful from it is nearly impossible. Dapper's model assumes request-scoped work with bounded fanout. Batch jobs, streaming pipelines, and long-lived connections need different observability primitives — metrics aggregated over time, log-based analysis, or sampling at a sub-job level.

When you need correlation across time, not within a request. Distributed tracing answers "what happened during this request." It doesn't answer "which requests are slow today compared to yesterday," "which users consistently experience high latency," or "did this deployment change the latency distribution." Those questions require aggregated metrics indexed by time, not per-request trees. Using traces to answer statistical questions means querying many individual traces and aggregating manually — expensive and slow. Use metrics for trends, traces for individual request diagnosis.

When your system doesn't use shared RPC infrastructure. If your services communicate through HTTP client libraries that each team has written independently, instrument one RPC library is not an option — you have to instrument all of them. If services use multiple protocols (gRPC, HTTP/1.1, Thrift, custom TCP), instrumenting all of them consistently is a significant engineering investment. The "ubiquitous deployment" property that makes Dapper effective at Google depended on everyone using the same RPC framework. In heterogeneous systems, you get partial tracing that may be more misleading than no tracing.

What Dapper actually gives you

The paper's central contribution isn't the data model or the collection pipeline — those are implementation details. It's the argument that tracing can be both comprehensive and low-overhead through sampling and library-level instrumentation.

Before Dapper, the engineering intuition was that tracing was expensive: you either sampled sparsely and got incomplete data, or sampled everything and crushed your performance. Dapper demonstrated that with a root-sampling model and out-of-band collection, you could instrument every service in a large production system at negligible cost, and the resulting 0.1% sample would still give you statistically sufficient coverage for most debugging tasks.

The second contribution is the span data model as a first-class artifact — not a log line, not a metric, but a structured tree with identity, timing, and parent-child relationships. Logs tell you what a service did. A trace tells you why — what caused the service to do work, in the context of the full request that triggered it. These are different questions, and they need different data structures.

For the incident I opened with: we installed Jaeger (the Dapper descendant), instrumented our gRPC clients and servers, set a 1% sampling rate (higher than Dapper's 0.1% because our volume was lower), and within one week had the inventory lock contention visible as a wide span in the trace waterfall — 340ms of a 400ms trace spent waiting on a distributed lock. The lock issue took two hours to fix. Finding it without tracing took three hours of manual log correlation.

The 15-year-old design still works. Understand where it breaks — non-RPC causality, async context propagation, tail-based sampling for rare errors — and you know exactly where your tracing has blind spots.


Dapper, a Large-Scale Distributed Systems Tracing Infrastructure — Sigelman, Barroso, Burrows, Stephenson, Plakal, Beaver, Jaspan, Shanbhag. Google Technical Report, 2010.

Related reading

  • Kafka: What the Original Paper Actually Says

    The original Kafka paper from 2011 had no replication. A broker failure made all unconsumed messages permanently unavailable. The paper treats this as a limitation to fix later, not a deal-breaker. Understanding why explains more about Kafka's design philosophy than any architecture diagram.

  • MapReduce: What the Google Paper Actually Says

    The 2004 Google paper that gave us Hadoop — and everything that replaced it — is worth reading not for the map/reduce abstraction itself, but for the fault tolerance model and the straggler insight. The failure modes are still the failure modes.

  • Pregel: What the Large-Scale Graph Processing Paper Actually Says

    PageRank in MapReduce is O(iterations × full dataset reloads). Pregel fixes this by keeping the graph in memory across iterations and replacing disk I/O with message passing. The 'think like a vertex' model is the insight — BSP is the implementation.

← All writing