Once you reach a certain scale, monitoring by hand becomes impossible; there are simply far too many moving pieces to keep track of. Life360 reached that scale long ago and we've been upgrading our analysis and monitoring tool sets accordingly. The latest in a long line of improvements is to our tracing system, and LightStep is the tool of choice. Let's take a look at what, how, and why.
LightStep is an implementation of the OpenTracing protocol, which defines a common interface for tracing in distributed systems such as Life360's. There is a very readable white paper on the topic (which you absolutely should read, if for no other reason than to understand the whys and look at some well-put-together diagrams), but let's see if I can distill that paper down into a couple of paragraphs.
First, Some Basics
What is a Distributed Tracing System?
The main purpose of a distributed tracing system is to follow a given operation through all parts of the system under analysis - including mobile and web clients - collecting execution times, relationships between calls, and optional additional context about the operations. Each individual hop is called a "span", and the full collection of spans is called a "trace". Each span has zero or one parent spans, and zero or more child spans. The end result is a list of spans with detail about each one's operation and parentage.
Once the data is generated, it is sent to a collector which aggregates the data, resolves all the relationships into a tree, and then renders it into a usable format, typically a graph.
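To make that span-and-trace model concrete, here's a minimal sketch in plain Python of what a collector does when it resolves a flat list of spans into a tree. This is illustrative only - the field names and operation names are made up, not the OpenTracing wire format:

```python
from collections import defaultdict

# Each span records one hop: an operation name, timing, and its parent (if any).
# Field and operation names here are illustrative, not a real wire format.
spans = [
    {"span_id": 1, "parent_id": None, "op": "POST /v3/location", "duration_ms": 182},
    {"span_id": 2, "parent_id": 1,    "op": "php.save_location", "duration_ms": 140},
    {"span_id": 3, "parent_id": 2,    "op": "mysql.INSERT",      "duration_ms": 12},
    {"span_id": 4, "parent_id": 2,    "op": "geofence.check",    "duration_ms": 55},
]

def build_trace_tree(spans):
    """Resolve span parentage into a nested tree, as a collector would."""
    child_map = defaultdict(list)
    by_id = {s["span_id"]: s for s in spans}
    roots = []
    for s in spans:
        if s["parent_id"] is None:
            roots.append(s)
        else:
            child_map[s["parent_id"]].append(s)

    def attach(span):
        # Recursively hang each span's children off of it.
        span["children"] = [attach(by_id[c["span_id"]]) for c in child_map[span["span_id"]]]
        return span

    return [attach(r) for r in roots]

tree = build_trace_tree(spans)
```

Once the spans are in tree form, rendering the familiar waterfall graph is just a matter of walking the tree and drawing each span's duration under its parent.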
A portion of a sample trace graph © 2016 Life360
Once in this format, it becomes very easy to view the relationships between spans, the duration of a span, etc.
Why Would I Use a Distributed Tracing System?
First and foremost, it gives visibility into how a system is doing what it's doing. In an ecosystem where microservices are more and more prevalent, per-process tracing becomes nigh-impossible to rely on; there just isn't enough context or scope. If all the operations for one API call executed in the same programming language, in the same application, on the same server, then tracing would be easy. Unfortunately for those who want to trace, that is not reality; microservices and the specialization they offer are common now, so we must embrace our micro overlords!
OpenTracing has a slide from a recent talk which summarizes the "why" very neatly:
See? More clear than I could manage © 2016 OpenTracing.io
The square on the right is a visual example of tracing operations in a single process; the one on the left shows what it looks like to trace those same operations when they're broken up across multiple services.
Also, if you want more detail about the state of distributed tracing, OpenTracing has a nice blog post which goes into some of the problems with the technology and what they, OpenTracing, are looking to do about it in terms of a turn-key tracing mechanism. You can find that post over on Medium.
As for Life360, we're planning to use the data generated to perform in-depth analysis of a production system. To that end, we will use the collected call timings to define SLAs, configure alerts, monitor system health, and perform other real-time monitoring.
We also plan to use it to pinpoint slow spots in our application. While it's easy to say "an endpoint is slow" while using a mobile client, finding the source of that latency can be difficult. Using the data we'll gain from this tracing, we will be able to drill down to what operations are actually taking a long time, getting to the root cause of the issue. Sure, we can trim a few milliseconds here or there, but if we can drop tens of milliseconds by fixing one slow-running process nobody thought of, then that's where we want to focus our effort.
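As an illustration of how span timings turn into SLA numbers, here's a toy example of computing percentile latencies per operation from collected durations. The operation names and sample values are made up, and the nearest-rank percentile here is a simple stand-in for whatever aggregation the real collector performs:

```python
# Hypothetical per-operation span durations (ms) pulled from collected traces.
durations_ms = {
    "mysql.SELECT":   [4, 5, 5, 6, 7, 9, 12, 31],
    "geofence.check": [40, 44, 47, 51, 55, 60, 72, 210],
}

def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range 0..100."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# p50 tells you the typical case; p99 tells you what your unluckiest
# users see - a natural basis for an SLA threshold and an alert.
for op, samples in durations_ms.items():
    print(op, "p50:", percentile(samples, 50), "p99:", percentile(samples, 99))
```

The gap between p50 and p99 is exactly the kind of thing that points at a slow-running process nobody thought of.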
What Have We Used Thus Far?
We have been using NewRelic for some time, but there are a few issues we've encountered along the way.
With auto-scaling groups, it gets... a little weird; latency, throughput, and other reports end up changing drastically as servers are added or removed.
Tracing of our external services is practically non-existent; the only visibility we have is "called url http://foo" and, since we have... oh, about a dozen microservices running, each with their own data stores, external interface calls, etc., there is a lot which is unclear.
We only get a small number of traces, typically only when an operation is slow. This doesn't help much if we want to know what a trace should look like.
Tracing, while fairly easy to do, does not give a lot of room for... shall we say, improvisation. We get all of our function calls tracked through NewRelic, but that ends up becoming meaningless and very hard to read when we make hundreds of function calls to inconsequential helper functions in a request.
LightStep is still a fairly new company, but the people behind it have a little bit of experience with implementing and delivering distributed tracing. Like "they built this at Google and wrote the white paper on it" experience. Therefore it's safe to say they understand what it takes to build a tracing system with two major characteristics we at Life360 require: it must be very performant, and it must be able to cover all aspects of our system.
As for performance, the implementations themselves are very lightweight, using low-impact classes for tracing and high-throughput communications technologies like Thrift to send traces to the collector.
This performance also comes from the control I have over what I trace and where I trace it. As a direct example that I'll talk about at length later, we instrumented all our database calls using LightStep. I implemented this in one place, and thus I have traces for all SELECTs, INSERTs, etc. Now, for a typical database call in our framework, there are probably about 50 function calls which aren't as critical to track as the actual database operation. Using NewRelic, I didn't have a choice; I got tracing for all of them. With LightStep, I'm able to choose exactly where I want the tracing to be and what group of functionality it will wrap. This not only lessens the impact of tracing, but also makes my reports far more readable.
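The "instrument once, trace everything" pattern looks something like the sketch below. This uses a toy span context manager and an in-memory list standing in for the collector - it is not the real LightStep client, though the real OpenTracing APIs follow a similar start-span/finish-span shape. The `run_query` helper is a hypothetical single choke point for database calls:

```python
import time
from contextlib import contextmanager

finished_spans = []  # toy stand-in for the LightStep collector

@contextmanager
def span(operation_name, **tags):
    """Toy span: time the wrapped block and record it with its tags."""
    start = time.monotonic()
    try:
        yield
    finally:
        finished_spans.append({
            "op": operation_name,
            "duration_ms": (time.monotonic() - start) * 1000,
            "tags": tags,
        })

def run_query(sql, params=()):
    """Single choke point for database calls: instrument here once, and
    every SELECT, INSERT, etc. is traced with no per-call-site changes."""
    statement_type = sql.strip().split()[0].upper()  # SELECT, INSERT, ...
    with span("db.query", statement=statement_type):
        # ...execute against the database here; sleep stands in for real work.
        time.sleep(0.001)

run_query("SELECT id FROM users WHERE circle_id = %s", (42,))
run_query("INSERT INTO locations (user_id, lat, lon) VALUES (%s, %s, %s)", (1, 0.0, 0.0))
```

Note that the 50-odd helper functions inside `run_query`'s call path never appear in a trace - only the one span per database operation does, which is exactly the readability win described above.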
That brings me to another large difference from NewRelic: we're able to host our own data collectors rather than relying solely on LightStep's hosted service. This will give us more control over the communications layer and will dedicate 100% of the collector hardware to our own use. While not a requirement, it's certainly something we'll want to do sooner rather than later.
Finally, after having personally worked on this project in four different languages, the implementation is very straightforward, and I've been able to get an excellent level of support from everyone over there.
Okay, How Do I Do It?
We at Life360 have two major categories of API traffic: location updates and everything else. As of the time of this writing, approximately half of our traffic is location updates which, as you may remember from our What Powers Life360 article, works out to somewhere in the neighborhood of 400 million hits per day (we're above that now due to increasing growth, but you get the idea). Since we are primarily a location sharing company, and since we have a rather complex location stack, we decided to instrument that flow first with the hope of finding points for optimization, along with being able to truly visualize exactly what's going on when a user's location is saved.
In the next few paragraphs, I'll walk you through all the steps we went through to fully instrument our location save calls. Mind you, it's complex... but it's not really all that complex, considering what you get for the effort!
First Steps: Sign Up and Assessment
Once you've contacted LightStep, gotten an access token, and gained access to the repos, you can jump right in. Our location calls take the following path from a new location detected on the client through to a potential geofence violation push notif being sent to other circle members.
Simple diagram of how a location is saved. © 2016 Life360
As you can see, a significant portion of the time is going to be spent in our PHP monolith. Now, there are a bunch of things it does, including talking to various caches, talking to MySQL, talking to external services, accepting calls from the client, and so forth. Therefore, we can see that, depending on how we instrument PHP, we can probably get a lot of additional traces for free.
We'll also need to add tracing to the Python location service, as it performs a significant amount of logic on every single call; it will be good to see whether it's running as performantly as we'd like.
Since our location save has a chance of triggering a geofence violation push notif (albeit a small one; fewer than 1% of location saves result in a geofence violation), we'll also need to instrument our push notif service written in Go.
Finally, in order to fully trace the save and push notif send, we'll have to instrument both our Android and iOS clients to initiate a span when sending location, and to rejoin the span when accepting the geofence violation push notif.
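Initiating a span on one side and rejoining it on the other comes down to propagating trace context alongside the payload. The real OpenTracing clients provide inject/extract operations for this; the sketch below shows the idea with a plain dict as the carrier and made-up header names, so treat it as the shape of the solution rather than the actual wire format:

```python
import uuid

def inject(trace_context, carrier):
    """Write trace context into an outbound carrier, e.g. HTTP headers
    or the push notif payload. Header names here are illustrative."""
    carrier["x-trace-id"] = trace_context["trace_id"]
    carrier["x-parent-span-id"] = trace_context["span_id"]

def extract(carrier):
    """Read trace context on the receiving side so the new span joins
    the existing trace instead of starting a fresh one."""
    return {
        "trace_id": carrier["x-trace-id"],
        "parent_span_id": carrier["x-parent-span-id"],
        "span_id": uuid.uuid4().hex,  # new span, same trace
    }

# The mobile client starts the trace when it sends a location...
client_ctx = {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex}
headers = {}
inject(client_ctx, headers)

# ...and the receiving side (the API, and eventually the client that
# accepts the geofence violation push notif) rejoins the same trace.
server_ctx = extract(headers)
assert server_ctx["trace_id"] == client_ctx["trace_id"]
```

As long as every hop - PHP, Python, Go, Android, iOS - injects on the way out and extracts on the way in, all five languages end up contributing spans to one trace for one call.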
Five languages, one tracing system, one call. This is going to be fun! And, while I hate to do it, I'm going to end this post here and continue with implementation details in the next post. So, tune in again to see all the steps and tech considerations we took to add tracing to each of these languages!