
Finding Datacenter Software Tail Latency

Richard L. Sites

Recorded 16 October 2017 in Lausanne, Vaud, Switzerland

Event: IC Colloquia - EPFL IC School Colloquia


Datacenter computers are the other half of cell phones: the anonymous servers somewhere in the world that make every cell phone browser, app, and operation work. Unlike traditional throughput-oriented computing, datacenter software is measured by user-facing transaction latency. For a given service, a histogram of the latencies usually has a long tail of very slow responses, with the 99th percentile latency at 10x or more the median latency. The "interesting" slow transactions are slow only under live load during the busiest hour of the day; they are fast if run again. They cannot be reproduced during offline load testing, and their underlying causes remain a mystery for months or years, hurting overall datacenter capacity. As an industry, we have very poor tools for observing, and therefore fixing, the unknown sources of interference.
The talk discusses several low-overhead tools for identifying where all the transaction wallclock time goes in such complex software. Versions of these have been in production use at Google for a few years.
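
To make the tail-latency claim in the abstract concrete, here is a minimal sketch of how the ratio between 99th percentile and median latency might be computed from a set of samples. It is not one of the tools from the talk: the function names `percentile` and `tail_ratio` are hypothetical, the latency values are invented for illustration, and the nearest-rank percentile definition is an assumption (other definitions interpolate between samples).

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tail_ratio(samples):
    """Ratio of 99th percentile latency to median latency."""
    return percentile(samples, 99) / percentile(samples, 50)


# Invented data: 980 fast transactions and 20 slow ones, in milliseconds.
latencies_ms = [10.0] * 980 + [150.0] * 20

print("median ms:", percentile(latencies_ms, 50))
print("p99 ms:", percentile(latencies_ms, 99))
print("p99 / median:", tail_ratio(latencies_ms))
```

With these invented numbers only 2% of transactions are slow, yet the 99th percentile is 15 times the median, which is the shape of histogram the abstract describes. A summary like this shows that a tail exists; it says nothing about where the wallclock time of the slow transactions went.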
