CppCon 2025 Using Distributed Trace for End-to-End Latency Metrics -- Kusha Maharshi
Registration is now open for CppCon 2025! The conference starts on September 13 and will be held in person in Aurora, CO. To whet your appetite for this year’s conference, we’re posting some upcoming talks that you will be able to attend this year. Here’s another CppCon future talk we hope you will enjoy – and register today for CppCon 2025!
Using Distributed Trace for End-to-End Latency Metrics
Thursday, September 18 15:15 - 16:15 MDT
by Kusha Maharshi
Summary of the talk:
In this talk, we'll delve into how we utilized distributed trace to address a prevalent need at a large technology company that serves the finance industry: timing requests from point A to point Z in a complex system, where a whole alphabet's worth of steps occur in between. We'll show how tracing, an open source telemetry standard, helped us generate insights from tracking requests in a complicated, microservices-driven architecture. We'll discuss the challenges faced and lessons learned while building a C++ solution that turns trace data into end-to-end latency metrics. We hope to inspire attendees to apply these lessons to telemetry solutions tailored to their own firms' needs.
Distributed tracing was introduced within our company before it was a stable, open standard. We saw its potential and invested in solutions that utilized its rich, cross-service information. However, existing open source or commercial products didn't fit the complexity and scale of our trace data. Our engineers, incident responders, and managers wanted end-to-end latency metrics to observe complex workflows, so we built our own solution! The resulting metrics now drive service level objectives (SLOs), which set measurable targets for a system's quality and reliability. When these targets are not met, teams are alerted, prompting quick remediation. Furthermore, the traces corresponding to any degradation in system health can be used to pinpoint faulty components, and they also help during development and testing of new solutions.
From a technical point of view, trace data is represented as directed acyclic graphs (DAGs), and our challenge was processing more than 50 billion nodes daily, with deep fan-outs and fan-ins. These graph structures mirror real-world scenarios like queuing systems for high-volume messaging, order processing, or batched email notifications, and they present concurrency choke points at scale. In this talk, we'll break down the design of the C++ microservices we built to process large-scale streaming graphs into metrics, while also addressing scalability bottlenecks. We'll also highlight why we chose C++ and its libraries, powerful profiling tools, and efficient data structures. If you're into building low-latency, high-throughput distributed systems, want to build telemetry solutions that best suit your needs, or just enjoy geeking out over graphs, this talk is for you!
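To make the idea of turning a trace DAG into an end-to-end latency concrete, here is a minimal sketch. It is not the speaker's implementation: the `Span` struct, its field names, and the `end_to_end_latency_us` function are hypothetical, and it assumes a whole trace fits in memory, whereas the talk describes streaming graphs at far larger scale. A span with several parents models a fan-in. The function reports the time from the start of span `from` to the end of span `to`, but only if `to` is actually reachable from `from` in the graph.

```cpp
#include <cstdint>
#include <optional>
#include <queue>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical span record; the field names are illustrative assumptions.
struct Span {
    std::string id;
    std::vector<std::string> parents;  // more than one parent models a fan-in
    std::int64_t start_us;             // start timestamp, microseconds
    std::int64_t end_us;               // end timestamp, microseconds
};

// Returns end(to) - start(from) in microseconds when `to` is reachable from
// `from` by following parent -> child edges; returns std::nullopt otherwise.
std::optional<std::int64_t> end_to_end_latency_us(const std::vector<Span>& trace,
                                                  const std::string& from,
                                                  const std::string& to) {
    std::unordered_map<std::string, const Span*> by_id;
    std::unordered_map<std::string, std::vector<std::string>> children;
    for (const Span& s : trace) {
        by_id[s.id] = &s;
        for (const std::string& p : s.parents) {
            children[p].push_back(s.id);  // one parent with many children is a fan-out
        }
    }
    if (by_id.count(from) == 0 || by_id.count(to) == 0) {
        return std::nullopt;
    }

    // Breadth-first search from `from`; `seen` keeps each span from being
    // visited twice when several paths converge on it.
    std::unordered_set<std::string> seen{from};
    std::queue<std::string> frontier;
    frontier.push(from);
    while (!frontier.empty()) {
        const std::string cur = frontier.front();
        frontier.pop();
        if (cur == to) {
            return by_id[to]->end_us - by_id[from]->start_us;
        }
        const auto it = children.find(cur);
        if (it == children.end()) {
            continue;
        }
        for (const std::string& child : it->second) {
            if (seen.insert(child).second) {
                frontier.push(child);
            }
        }
    }
    return std::nullopt;
}
```

For a trace in which span A fans out to B and C, and both fan in to Z, calling `end_to_end_latency_us(trace, "A", "Z")` yields Z's end time minus A's start time. A production system would also need to handle late or missing spans and clock skew between hosts, which this sketch ignores.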
Kusha Maharshi is a Senior Software Engineer at Bloomberg, where she passionately works on distributed tracing and observability infrastructure. An avid public speaker, she loves breaking down complex technical challenges with clarity – and a dose of humor. Kusha holds a degree in Computer Science from Carnegie Mellon University, where she specialized in computer systems. Her favorite programming language? Assembly.