Feeling inspired by a popular blog post by Instagram from a few years ago, I wanted to share what powers Life360 and allows us to support over 800 million API requests every day. To serve all of these requests, we use a lot of cool technologies in some interesting ways, which I will share with you in this article.
Freshness note: We change things fast here at Life360, so any part of this article may be out of date by the time you read this!
Intro to Life360's Servers
Our app relies on a lot of passive traffic in order to work correctly. Spread across our 75 million registered families, that creates a lot of API traffic for us to keep up with every day. At the time of this writing, we log roughly 9,000 requests every second, which adds up to over 800 million requests per day. All of this traffic takes a big toll on every part of our server ecosystem, from our load balancers to our databases.
We make extensive use of AWS and its services. It has served us very well over the last 6 years and, while there are other enticing options out there, we have no plans to change. We run around 150 EC2 servers spanning all aspects of our ecosystem. Because our traffic follows a sine-wave-like pattern, we make significant use of Auto Scaling groups to save money while still serving our customers well at peak. Most of the rest of our usage is in ElastiCache, S3, VPC and CloudWatch.
Our choice of operating system hardly deserves a section of its own these days. Of course we use Linux (Ubuntu Server, to be specific)!
We run a handful of AWS ELB instances that accept traffic from our mobile apps and API partners, then forward it to our front-end router instances. These boxes are dedicated to the blazingly fast HAProxy, which routes our traffic to the configured backend servers.
Many startups accrue lots of technical debt in their early stages in order to stay flexible and quickly build things that people will use. We are no different. We have a legacy monolithic PHP app built in CakePHP that most of our traffic passes through. To handle this traffic in a synchronous environment, we run 30-50 EC2 instances, each running around 150 php-fpm workers.
I believe that the aforementioned 800 million requests per day makes us one of the largest executors of PHP in the world, at least in terms of HTTP requests served per day. We do this with NGINX, php-fpm, PHP 5's built-in OPcache, Memcached, MySQL (which I'll get to later) and lots of testing. PHP's lack of native support for asynchronous programming makes it difficult for us to choose it going forward; however, we will likely run it for the foreseeable future.
As PHP runs in a synchronous environment, we need all of our client requests to complete very quickly. We achieve this in under 80ms on average, largely because we send long-running or logically asynchronous work over RPC to a separate deployment of the PHP platform that runs it in longer-lived processes. Most requests also route to one of our many microservices.
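To see why fast responses matter so much with synchronous workers, here is a rough back-of-envelope sketch using Little's law (requests in flight = arrival rate x time in system). The figures come from this article; treating all 9,000 requests per second as PHP traffic and using 80ms as the latency are simplifying assumptions, and the function names are purely illustrative.

```python
# Back-of-envelope capacity check with Little's law: L = lambda * W.
# Figures are taken from the article; how they are combined is an assumption.

REQUESTS_PER_SECOND = 9_000    # overall API request rate
AVG_LATENCY_SECONDS = 0.080    # "under 80ms on average", used as an upper bound
WORKERS_PER_INSTANCE = 150     # php-fpm workers per EC2 instance
INSTANCE_RANGE = (30, 50)      # size of the PHP fleet


def busy_workers(rate_per_s: float, latency_s: float) -> float:
    """Average number of requests in flight, i.e. workers busy at once."""
    return rate_per_s * latency_s


def total_workers(instances: int, workers_per_instance: int) -> int:
    """Total synchronous worker slots available across the fleet."""
    return instances * workers_per_instance


if __name__ == "__main__":
    in_flight = busy_workers(REQUESTS_PER_SECOND, AVG_LATENCY_SECONDS)
    print(f"average requests in flight: {in_flight:.0f}")
    for instances in INSTANCE_RANGE:
        slots = total_workers(instances, WORKERS_PER_INSTANCE)
        print(f"{instances} instances: {slots} workers, "
              f"{in_flight / slots:.1%} busy on average")
```

Under these assumptions only about 720 workers are busy on average, out of 4,500 to 7,500 available. The headroom absorbs traffic peaks and slow requests: since every in-flight request holds a worker for its whole duration, doubling the average latency would double the number of busy workers.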
Building out a microservice infrastructure takes time. We have built many microservices over the history of this company, running around 20 of them now. They are written mostly in Python and Go. We make particular use of the Tornado framework in Python and channels and goroutines in Go. Each one is a little different but we make huge efforts to keep them similar and under control.
Protocols and Queues
We use HTTP, REST, RPC and JSON in our server communications. We run in an AWS VPC, which keeps HTTP traffic very fast and makes it the least of our worries at this point. PubNub and MQTT are some awesome technologies that we use for server-to-phone communications and event publishing in a distributed system. We also make heavy use of Bitly's awesome NSQ, which we think very highly of. Internally, we send around 3 billion messages per day spanning 20 topics to our nsqd and nsqlookupd cluster.
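For readers who have not used NSQ, publishing a message can be as simple as an HTTP POST to nsqd's /pub endpoint. The sketch below is a minimal illustration and not our production publisher: the address (nsqd's default HTTP port, 4151), the topic name and the payload are all hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_pub_request(nsqd_http_addr: str, topic: str, body: bytes) -> urllib.request.Request:
    """Build a POST to nsqd's HTTP /pub endpoint for one message."""
    query = urllib.parse.urlencode({"topic": topic})
    url = f"http://{nsqd_http_addr}/pub?{query}"
    return urllib.request.Request(url, data=body, method="POST")


def publish(nsqd_http_addr: str, topic: str, body: bytes, timeout: float = 2.0) -> int:
    """Send the message and return the HTTP status code from nsqd."""
    request = build_pub_request(nsqd_http_addr, topic, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical topic and payload; requires an nsqd listening locally.
    message = json.dumps({"user_id": 123, "event": "example"}).encode("utf-8")
    print(publish("127.0.0.1:4151", "example_topic", message))
```

Consumers then subscribe to the topic on a channel of their own, so several independent services can each receive a copy of every message.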
Discovery and Configuration
Consul is a newer technology that we have deployed and are using more and more every day. We run a small cluster of Consul agents that gives us live configuration updates, service discovery, and distributed health checks.
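As an illustration of service discovery, Consul's HTTP API exposes /v1/health/service/<name>?passing, which returns only the instances whose health checks pass. The sketch below parses such a response into a list of addresses; the function name and the idea of feeding the result to a client-side load balancer are assumptions for the example, not a description of our setup.

```python
import json
from typing import List, Tuple


def healthy_endpoints(health_response: str) -> List[Tuple[str, int]]:
    """Extract (address, port) pairs from a Consul health API JSON response."""
    endpoints = []
    for entry in json.loads(health_response):
        service = entry["Service"]
        # When Service.Address is empty, the service is reachable
        # at the address of the node it is registered on.
        address = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


if __name__ == "__main__":
    # Hand-written sample shaped like a /v1/health/service/<name>?passing response.
    sample = json.dumps([
        {"Node": {"Address": "10.0.0.5"}, "Service": {"Address": "", "Port": 8080}},
        {"Node": {"Address": "10.0.0.6"}, "Service": {"Address": "10.0.1.6", "Port": 8081}},
    ])
    print(healthy_endpoints(sample))
```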
Instrumentation, Metrics and Alerting
We used to rely heavily on StatsD and Graphite for our monitoring needs. However, we recently hit Graphite's limits, as mentioned in this great interview we did with the nice people at Prometheus. Prometheus has proved immensely useful to us so far and we are migrating to it right now, along with its Alertmanager. We use Grafana extensively for viewing data, which is a requirement for many of us these days. To get insight into our monolithic PHP servers, New Relic has been invaluable in pulling off a heroic feat: showing us what over 50,000 lines of PHP code are doing in production. CloudWatch, PagerDuty and Amplitude are other important parts of our instrumentation system.
Chef sets up and deploys our code (although Terraform is encroaching quickly here). We deploy automatically to our QA environments on every merge to develop, but deploy to production manually.
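Unlike StatsD, where applications push metrics out, Prometheus scrapes a plain-text metrics page from each service over HTTP. The sketch below renders a counter in that text exposition format. It is a simplified illustration: the metric and label names are made up, label values are assumed to need no escaping, and a real service would normally use an official Prometheus client library instead.

```python
from typing import Dict, Iterable, Tuple


def render_counter(name: str, help_text: str,
                   samples: Iterable[Tuple[Dict[str, str], float]]) -> str:
    """Render one counter metric in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            pairs = ",".join(f'{key}="{val}"' for key, val in sorted(labels.items()))
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric; a service would serve this text at /metrics.
    print(render_counter(
        "example_requests_total",
        "Total requests handled.",
        [({"status": "200", "method": "GET"}, 42), ({"status": "500", "method": "GET"}, 1)],
    ))
```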
Databases have been a hot topic in the 2.5 years that I've been at Life360. We run a single-master, many-slave MySQL environment in production that serves over 5.5 billion queries per day. If you have used MySQL at scale, you know that this requires a LOT of effort. We also serve about 1 billion queries per day with Cassandra for our write-heavy, read-light location cluster. We run about 12-20 nodes in production at any given time and it is one of our favorite datastores.
In a scalable, highly available PHP environment, caching is an absolute necessity. We use a handful of Memcached instances via AWS ElastiCache, mostly as a caching layer in front of MySQL. This layer stores around 500 million rows at a time and supports over 7 billion commands per day. We also use Redis more and more for caching and other uses.
The following are really more than honorable mentions, but because they support our production system only indirectly, we are mentioning them separately. We are using Docker (and Docker Compose) for new server development. CircleCI does the continuous integration for our many GitHub repositories. We make use of other CI tools like Jenkins and Atlassian's Bamboo for scheduled, non-production-critical tasks, such as comprehensive automated endpoint testing.
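A common way to put a cache in front of a database is the cache-aside pattern: read from the cache first, fall back to the database on a miss, and store the result for next time. The sketch below shows the idea with an in-memory dict standing in for Memcached and a callback standing in for a MySQL query. The class and its names are hypothetical and are not taken from our codebase.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Cache-aside reads: check the cache, fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], Optional[dict]]) -> None:
        self._cache: Dict[str, dict] = {}   # stand-in for a Memcached client
        self._load_from_db = load_from_db   # stand-in for a MySQL query
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[dict]:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        row = self._load_from_db(key)
        if row is not None:
            # Populate the cache so the next read skips the database.
            self._cache[key] = row
        return row

    def invalidate(self, key: str) -> None:
        """Drop a cached row, typically right after the database row changes."""
        self._cache.pop(key, None)


if __name__ == "__main__":
    fake_db = {"user:1": {"name": "example"}}
    cache = CacheAside(fake_db.get)
    cache.get("user:1")   # miss: loaded from the fake database
    cache.get("user:1")   # hit: served from the cache
    print(cache.hits, cache.misses)
```

The hard part in practice is invalidation: every write path has to remember to drop or refresh the cached row, otherwise readers keep seeing stale data until the entry expires.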
We hope you liked this tour of our infrastructure! We couldn't mention all of the technologies we use in one article, so check out Life360's stackshare.io page for the full list!
And see www.life360.com for more about what we do.