Thoughts on Distributed Tracing

21 Dec 2015

A colleague recently asked whether we should spin up a distributed tracing tool, and I realized that I had never written down some of my experiences developing and running a production Zipkin deployment.

Full disclosure: I haven’t been involved with the Zipkin community lately, so most of these thoughts are about Zipkin circa 2013.

Why tracing?

Maybe you’re going through a Cambrian explosion of microservices in your organization or you’re carving up a monolithic code base into finer-grain units of isolation. Somehow you’ve ended up with a single external request fanning out to many services and tons of processes.

The opening paragraph of the Dapper paper¹ sums up what life looks like at the far end of the spectrum:

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools that aid in understanding system behavior and reasoning about performance issues are invaluable in such an environment.

Reasoning about the entire system’s behavior, whether safety, liveness, or performance, becomes more difficult as the width and depth of the service dependency tree grows, and distributed tracing tools can provide a global view of the complicated interactions and emergent behaviors² in these complex systems.

Zipkin

Zipkin is closely modeled off of the Dapper paper, so I’ll talk about the two somewhat interchangably.

When Twitter first started work on Zipkin (ca 2010/2011), the Monorail was serving the vast majority of production traffic. Site performance and reliability were top priorities given the abundance of Fail Whales³, and visibility into what was going on when processing requests could be helpful in debugging issues.

Data model

Zipkin models RPCs as a tree of relationships between clients and servers. Each RPC is capture as a Span, which is essentially made up of a bunch of Annotations. There are four special annotations that fully describe an RPC: client send, server receive, server send, and client receive. With this information we know which hosts were involved, the request latency, and any additional data the client or server may want to include.

  ┌─────────────────────────────────────┐
  │            Load Balancer            │
  └*──────────────────────────────────*─┘
   └─┐ Client Send     Client receive ▲
     │                              ┌─┘
     ▼  Server receive   Server send│
   ┌─*──────────────────────────────*─┐
   │               App                │
   └──────────────────────────────────┘

Spans are linked together by instrumenting the protocol⁴ to pass a few headers:

trace ID: unique to the entire request tree
span ID: unique to the particular RPC call
parent span ID: the span ID of the server’s caller
sampled flag: whether the trace has been sampled

These are logged as part of every Span so that we can reassemble the tree later on. Zipkin store and indexes the traces by various attributes (e.g. service and method names, trace duration, custom annotations) for ease of access.

Challenges

Sampling rate

In larger systems, you may not want (or have the luxury) to record traces for every request. Both Dapper and Zipkin bake sampling into the protocol via the sampled header and allow for the root node of the request tree to decide whether to sample (e.g. randomly generate a number and compare to a pre-determinated percentage).

Zipkin also adds a server side sampling rate to allow operators to adjust the percentage of traffic that hits the backing store. In our deployment, this was wired up to synchronize across all the collectors (varying sampling rates result in partial spans) and self-adjust if the overall traffic levels fluctuated. The latter was added as an optimization to keep the amount of load on the data store consistent across Twitter’s diurnal traffic patterns.

Sampling rates can cause a number of problems.

Bias

Sampling uniformly (like with a random number generator) treats all data equally. In a diverse topology, you’ll often have a small number of highly loaded services or routes (e.g. Tweets) that have many orders of magnitude more traffic than everything else; this isn’t an ideal environment for uniform sampling since most code paths will not be sampled, though would be useful.

Dapper alludes to implementing adaptive sampling “parameterized by… desired rate of sampled traces per unit time” to avoid the manual step of configuring low traffic flows with higher sampling rates; we never got around to implementing this in Finagle.

Live Debugging

Often, a developer loading a page in a development or test environment may way to look at the backing trace data. In Zipkin we added a debug protocol header that the user could include to force the trace through all layers of sampling and built out a browser extension⁵ to display trace data.

Trace linking

Traces are great for RPCs, but your data flows may include async processing (queues) that break the RPC paradigm. We never really got around to building in useful ways to associate traces.

Service homogeneity

Coordinating via protocol instrumentation was one of the key design choices in Dapper and Zipkin, and this made it difficult to integrate external and new services (e.g. M&A) as the organization grew. In practice, many of these foreign systems were simply not integrated.

Lack of message bus

We originally built Zipkin using Scribe as the primary means of transport; it was deployed on every production machine at Twitter and supported the best-effort durability semantics we needed. In hindsight, publishing the event stream to a unified message bus and building applications, such as indexing, debugging UIs, aggregation, on top as isolated use cases would have allowed us to move faster and experiment with new ideas.

Franklin Hu