In digital advertising, unlike stock markets, real-time bidding (RTB) markets are open 24/7, so Eyeview’s systems are always up and always bidding. Our software is constantly looking for the right users and then matching those people with the most relevant ads.
At the same time, we know that systems are imperfect, and sometimes things will fail. Individual servers fail, whole racks may lose power, and entire data centers could be brought down by storms or bugs (think Hurricane Sandy or an AWS outage like the one in 2012).
As we grow and mature as a company, we need our software to survive those failures with minimal impact to the business, which is why we need to worry about the high availability (HA) of our systems.
There are many different levels of HA out there, both in scope (host, cluster, system, region, etc.) and in solution (cold standbys, hot standbys, active-active, etc.), and the right choice is use-case specific. Entire-region failovers may or may not be possible without using multiple cloud providers: failing over to a different region may not be tolerable for applications that require low latency (ad tech RTB usually requires sub-50ms latencies, while stock trading requires even lower), or it may simply be cost-prohibitive if you need to replicate your entire tech stack. Based on our needs, we currently use one AWS region, US East, and for us, HA means being able to sustain anything up to a full AWS Availability Zone (AZ) failure.
We started with an active-active architecture that spans multiple AZs. This also let us treat each AZ as a separate EC2 Spot “market” for our c3.2xlarge instances: when one zone ran out of our desired instance type, or its price climbed too high, we could shift to another. For that reason alone, we ran in three zones. The diagram below shows a simplified version of our architecture: a cluster of protocol-aware real-time load balancers (RLBs), a cluster of bidders that implement our RTB bidding strategies, and a cluster of Solr servers that we use for geo queries. Each of these clusters spans multiple AZs.
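The per-AZ Spot “market” idea can be sketched as a simple selection over a price snapshot. Everything here is illustrative: the price figures, the bid cap, and the `cheapest_market` helper are assumptions for the sake of the example, not our actual bidding logic or real Spot prices.

```python
# Hypothetical sketch: pick the cheapest (AZ, instance type) "market" from a
# snapshot of Spot prices, skipping any market whose price exceeds our cap.
# Prices and the cap below are made-up numbers for illustration only.

MAX_BID = 0.30  # illustrative cap, USD/hour

def cheapest_market(spot_prices, max_bid=MAX_BID):
    """spot_prices: dict mapping (az, instance_type) -> current Spot price."""
    viable = {m: p for m, p in spot_prices.items() if p <= max_bid}
    if not viable:
        return None  # no market within budget; caller falls back elsewhere
    return min(viable, key=viable.get)

prices = {
    ("us-east-1a", "c3.2xlarge"): 0.42,  # this zone's price has spiked
    ("us-east-1b", "c3.2xlarge"): 0.21,
    ("us-east-1c", "c3.2xlarge"): 0.19,
}
print(cheapest_market(prices))  # -> ('us-east-1c', 'c3.2xlarge')
```

Running in three zones keeps this dictionary populated: a price spike in one AZ rarely coincides with spikes in the other two.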
This setup worked quite well for us initially. However, as we started pushing more data around, we saw our cross-AZ traffic costs skyrocket (AWS charges for cross-AZ traffic, while traffic within the same AZ is free). Another issue we noticed was that cross-AZ round-trip times are often above 1ms, and when you have low-latency requirements in a microservice environment, an extra millisecond on each network hop adds up quickly.
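The arithmetic behind that concern is simple but worth making concrete. The hop count and intra-AZ figure below are assumptions for illustration; only the ~1ms cross-AZ round trip comes from our observations.

```python
# Illustrative arithmetic: with ~1 ms of cross-AZ round-trip time per hop, a
# request that fans through several microservices burns a meaningful slice
# of a sub-50ms RTB budget. Hop count and intra-AZ RTT are assumed values.

CROSS_AZ_RTT_MS = 1.0  # cross-AZ round trips are often above 1 ms
SAME_AZ_RTT_MS = 0.1   # illustrative intra-AZ figure, not a measurement

def network_overhead_ms(hops, rtt_ms):
    """Total time spent on the wire for a chain of synchronous hops."""
    return hops * rtt_ms

hops = 3  # e.g. RLB -> bidder -> Solr, each a separate network hop
print(network_overhead_ms(hops, CROSS_AZ_RTT_MS))          # -> 3.0
print(f"{network_overhead_ms(hops, SAME_AZ_RTT_MS):.1f}")  # -> 0.3
```

Three milliseconds is over 6% of a 50ms budget spent purely on cross-AZ wire time, before any application work.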
To solve these problems, we decided that we had to stay vertical within AZs, which let us cut both latency and cross-AZ traffic costs. Looking forward, this also makes it easy for us to replicate the vertical “chain” in a new zone or region, since each chain becomes an independent entity. To achieve that, we had to make infrastructure changes, mostly to our load balancers, application discovery services and cluster managers, limiting the visibility of instances to a single zone. And to avoid issues with AWS Spot prices and availability within a given AZ, we diversified our Spot instance types to include multiple varieties with similar characteristics (c-, m- and r-families; see here for a list of instance types). This is reflected in our updated architecture: we now have one RLB cluster, one bidder cluster and one Solr cluster per zone, each scaling independently.
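The discovery change boils down to filtering service lookups by the caller's own zone. The sketch below is a hypothetical illustration: the registry structure, field names and `same_zone_instances` helper are assumptions, not our actual discovery service's API.

```python
# Hypothetical sketch of the "stay vertical" discovery rule: a service
# registry lookup returns only instances in the caller's own AZ, so an RLB
# in us-east-1a only ever routes to bidders and Solr nodes in us-east-1a.
# Registry shape and field names are invented for this example.

def same_zone_instances(registry, service, my_az):
    """Return only the instances of `service` in the caller's AZ."""
    return [i for i in registry.get(service, []) if i["az"] == my_az]

registry = {
    "bidder": [
        {"host": "10.0.1.11", "az": "us-east-1a"},
        {"host": "10.0.2.12", "az": "us-east-1b"},
    ],
    "solr": [
        {"host": "10.0.1.21", "az": "us-east-1a"},
    ],
}

print(same_zone_instances(registry, "bidder", "us-east-1a"))
# -> [{'host': '10.0.1.11', 'az': 'us-east-1a'}]
```

With this rule in place, each zone's RLB, bidder and Solr clusters form a self-contained chain, which is exactly what makes replicating the chain into a new zone straightforward.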
Some AWS services such as DynamoDB and S3 already replicate data across multiple AZs and provide free cross-AZ access so no changes were needed there. For other services that are not latency-critical or that don’t get a lot of traffic, we decided to continue using cross-AZ connections as we didn’t think that changing them was worth the effort (such as MongoDB and statsd).
This change saved us a lot of money and shaved off a few milliseconds from our response time while giving us peace of mind for future AZ failures. Together with our work on service elasticity, it allows us to seamlessly switch over to a single AZ in case of an issue. We feel this is currently the best way for us to do multi-AZ HA on AWS while also setting us up on the path to go beyond AZs.
The next step is figuring out how to handle a full AWS region failure while maintaining our sub-50ms response times.