Building an application which can accept locations from millions of user devices, store them, trigger push events based on previous and current state, and honor individual users' preferences for privacy is hard. Making that same system able to grow to double or triple the number of users while accepting even more locations per device is even harder. We have found ourselves in this situation; while we have had growth pains, our system is basically fully functional. However, while we have the time and are not in crisis, building for a far more sizeable throughput now makes the most sense. Here's how we're attacking exactly that problem.
In the last 6 years, Life360 has gone through several iterations of backend structure. Initially, we were completely PHP and MySQL based. About four years ago, we added a dedicated service for handling location, which we called GEOrge; it ran on Python and talked to Cassandra. About two years after that, we rewrote and upgraded that service, splitting it up, building a new framework in Python for it, and introducing asynchronous messaging between the different processing steps; we gave this system the unfortunate name of LOST, the Long Overdue microServices Technology. Since then, we've added the driving portion of our app and have been gradually pulling individual processing steps off into separate microservices, some using the LOST framework, some using Golang, some using Java, some using Erlang.
While we've built a system which is very resilient, can pretty easily handle over 20k requests per second, and can scale up quite a bit from our current traffic, we have been running into growing pains. For example, we've overrun Redis in terms of connections on a few different occasions, and at least once even the fairly short latency of a Redis query ended up backing up our flow, causing a bunch of issues. Long story short, we need to start looking to the future and planning for larger scale, especially with our great rate of user growth.
So what then?
Over the past year, we as a company have embraced using compiled languages over interpreted languages. This sounds like a no-brainer, and in retrospect, we probably should have done so earlier. However, hindsight is always 20/20, and there are always good reasons for not bringing in new tech, especially while we were still working to gain some breathing room with our existing stack. We've now found that we have the time to explore making large changes.
We started small, creating one proof-of-concept service in Java which does some basic processing on our entire location stream, scheduling future batch processes and firing off messages at the right time. Watching this single server handle our entire location stream while we've at times had to have ~60 servers in Python really proved our assumptions. Since then, we've introduced 5 more services in Java, taking our learnings from one service to the next, developing a set of patterns along the way.
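To make the "scheduling and firing off messages at the right time" idea concrete, here is a minimal sketch in Python (chosen for brevity; our actual service is in Java and is not shown here). It assumes a simple in-memory min-heap keyed by due time; the `DelayedScheduler` class, its method names, and the message fields are all hypothetical, invented for illustration only.

```python
import heapq
import itertools


class DelayedScheduler:
    """Hypothetical sketch: hold messages until their due time, then release them."""

    def __init__(self):
        self._heap = []                # entries are (due_time, seq, message)
        self._seq = itertools.count()  # tie-breaker: equal due times keep insertion order

    def schedule(self, due_time, message):
        """Queue a message to be fired at due_time (seconds, any monotonic clock)."""
        heapq.heappush(self._heap, (due_time, next(self._seq), message))

    def pop_due(self, now):
        """Return every message whose due time is <= now, earliest first."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            _, _, message = heapq.heappop(self._heap)
            due.append(message)
        return due


# Usage: schedule three messages out of order, then ask what is due at t=30.
scheduler = DelayedScheduler()
scheduler.schedule(30.0, {"user": "u1", "kind": "batch"})
scheduler.schedule(10.0, {"user": "u2", "kind": "alert"})
scheduler.schedule(60.0, {"user": "u3", "kind": "batch"})
fired = scheduler.pop_due(now=30.0)  # u2's message first, then u1's
```

A production version would also need persistence and a way to spread the schedule across machines; this sketch only shows the ordering logic.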
As of today, without going too far into the details, we have a flow that looks somewhat like this for our location path:
One interesting thing to note is that most of this used to be done all in Python... or not at all; we've just gradually added functionality - usually in Java, as you can see - and injected those services into the existing flow. The tech which ties it all together is NSQ, a distributed messaging platform which we've found to be very easy to use and grow with.
So, if we have something that works, why change it? Primarily because we can see an end to our system as we have it now. Even if we don't do a ground-up rewrite, we'll need to do some serious refactoring in order to make the various pieces work efficiently together at larger and larger scale. At some point, we'll end up outgrowing the bounds of what we feel we can reasonably do; we don't want to be unprepared for when this occurs. Therefore, we're looking at how we can rebuild our tech stack to plan for 10x scale.
After a lot of research, we've begun our implementation using Akka written in Scala, and will be using Kinesis for our messaging platform. This promises to bring a whole new level of performance to our system and will give us the opportunity to address our data models, our historical algorithmic choices (which we haven't changed in over 3 years!), and to determine which pieces of functionality we can make more native to our stream process.
First steps first: in order to do this, we've had to determine how to chunk out the work so we're not just releasing an entirely new system on one day. This has been challenging since the new system differs so vastly from the existing one. However, we've decided on a plan to run the two in parallel, which lets us compare how the two are running and gives us an opportunity to scale the new system as we build it; it's our goal to be able to run with a duplicate stream from production traffic even when all the new system does is ingest and immediately publish messages without doing any of the data lookups, etc.
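As a rough illustration of the parallel-run idea, the sketch below duplicates each message to an existing handler and a new "shadow" handler, always returns the existing handler's result, and simply records where the two disagree. Everything here is an assumption made for the example: the `ShadowRunner` class, the `existing_path` and `passthrough` functions, and the message fields are hypothetical, and the real systems communicate over a messaging platform rather than direct function calls.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ShadowRunner:
    """Hypothetical sketch: feed the same traffic to two handlers and compare."""
    primary: Callable[[dict], Any]  # existing system; its result is what callers get
    shadow: Callable[[dict], Any]   # new system; observed only
    seen: int = 0
    shadow_errors: int = 0
    mismatches: List[Tuple[dict, Any, Any]] = field(default_factory=list)

    def handle(self, message: dict) -> Any:
        self.seen += 1
        expected = self.primary(message)
        try:
            # Pass a copy so the shadow path cannot mutate the production message.
            actual = self.shadow(dict(message))
        except Exception:
            # A failure in the new system must never affect production traffic.
            self.shadow_errors += 1
            return expected
        if actual != expected:
            self.mismatches.append((message, expected, actual))
        return expected


def existing_path(message: dict) -> dict:
    # Stand-in for the current pipeline: normalizes latitude to 4 decimal places.
    return {"user": message["user"], "lat": round(message["lat"], 4)}


def passthrough(message: dict) -> dict:
    # Stand-in for the earliest new system: ingest and republish unchanged.
    return message


runner = ShadowRunner(primary=existing_path, shadow=passthrough)
runner.handle({"user": "u1", "lat": 37.7749})    # both paths agree
runner.handle({"user": "u2", "lat": 37.774912})  # passthrough skips the rounding
```

Mismatches are expected while the new path is still a passthrough; the counts become meaningful as functionality is moved over piece by piece.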
This promises to be one of the largest engineering undertakings at Life360 and we are all very excited to see this come to fruition, both for what it can do for our bottom line (we're looking at saving quite a few servers with this change!) and simply for the accomplishment of having switched over a high-throughput production system to a system designed from the ground up to be a better-in-every-way replacement. More updates to come, I'm sure!