Engineering / March 6, 2018

Large Scale Cloud Infrastructure Migration: Eyeview’s Journey into VPC

If you’ve read some of our previous tech blog posts, you know that at Eyeview we run our infrastructure entirely in the AWS cloud. As the cloud evolves, so do we. Back in the early days of AWS, resources were deployed in a single flat network shared among all customers. Then the AWS team introduced Virtual Private Cloud (VPC), which gave customers more control over their network and allowed them to launch instances in a logically isolated private network, separate from others in the AWS cloud. By the end of 2013, new accounts supported only VPC.

The production Eyeview account, however, dates back to 2009 and as of the beginning of 2017, almost all of our resources were still in the original (EC2-Classic) network. Since things were already working pretty well, moving to VPC was not originally a priority. The few components that were running in the new network were our new real-time user database cluster and our metrics collection instances. The rest of our applications and services were still being launched in EC2-Classic.

Moving everything into VPC would mean we’d have to design, implement and manage our own networks, including subnets, route tables, NAT gateways and so on – things that the Eyeview team never had to worry about before. A lot of changes to our CI/CD jobs and scripts that automate our deployments would be required. We would be launching all new database clusters and redeploying each of our applications into the VPC network.

Reasons for the move

So why would we take on such a big migration if everything was working well? Despite the risks and the level of effort involved, there were quite a few benefits to migrating to VPC. Some were nice-to-haves, while others would improve our system performance and cost efficiency and allow us to expand our business.

For starters, there were significant performance improvements we would only get by launching our instances in VPC, such as the ability to run the latest-generation instance types (c4, m4, r4 and i3 are available only in VPC). The new instances were faster, with more memory, larger disks and higher network throughput at a similar cost. In addition, we wanted to be able to launch our applications in newer AWS regions around the world, and the new regions only supported VPC. We’d also be able to separate our apps into public and private networks and follow security best practices. Last but not least, we saw an opportunity to upgrade and further automate a number of components, as well as convert our underlying infrastructure management to Infrastructure-as-Code (IAC).

Research and planning

With our goals in mind, the next step was to learn how to best achieve them. Where do we begin? What steps do we take to make sure we don’t impact the business or the developers as they deploy their code on a daily basis? Is there a particular timeframe that we should target for this project when any interruptions in service would not be as impactful to the business?

Our volume is typically lower in the first half of the year than in the second, so we knew we would want to start the migration right after the winter holidays.

We had to learn all about VPC setup and management. AWS re:Invent was a great place to start. The team made sure to attend the VPC-related talks including “Creating Your Virtual Data Center: VPC Fundamentals and Connectivity”, “Moving Mountains: Netflix’s Migration into VPC”, and “From One to Many: Evolving VPC Design”. Then we headed to the AWS docs and read everything related to VPC.

As part of the information gathering effort, we started drawing out what our data flow and network looked like. It was important to get a good understanding of our dependency graph before we started pulling out pieces and migrating them to the new network one by one. To visualize it, we first tried to import our infrastructure directly into web-based flowchart and diagramming software that integrated with AWS. We quickly found that the import feature was tedious and did not support all the AWS services we used, and that there was no way to tell instances apart, since each of the thousands of nodes showed only an instance ID. So we drew it ourselves. You can see an early stage of that diagram in Fig. 1.

Fig. 1. An early, simplified version of our network diagram

Armed with a network diagram and dependency graph, the next step was to plan the migration and prioritize infrastructure components and applications. We decided to start with the components that most services depended on (e.g. database clusters). Security and performance requirements drove the next few components, which were our front-facing web app and the Redshift database. Only then came our most critical apps: the different pieces of our real-time bidding system.

Time to move

At this point, we were ready to start making the necessary changes (i.e. start breaking things). We tested ClassicLink to ensure network communications worked in both directions between EC2-Classic and VPC and that the network speed was good. We began with the migration of our MongoDB cluster by adding nodes from inside the VPC to our existing replica set and removing the nodes running in EC2-Classic. And, just like that, we had migrated one of our central components.
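The replica set swap described above can be pictured as a series of small configuration changes. What follows is only a minimal sketch, not the tooling we actually ran: it assumes a configuration document shaped like the one MongoDB’s `replSetGetConfig` command returns, and the hostnames, the replica set name `rs0` and the helper names `add_member` and `remove_member` are all hypothetical.

```python
import copy


def add_member(config, host):
    """Return a copy of the replica set config with `host` added as a member."""
    new_config = copy.deepcopy(config)
    # Member _id values must be unique, so continue after the highest one in use.
    next_id = max(m["_id"] for m in new_config["members"]) + 1
    new_config["members"].append({"_id": next_id, "host": host})
    # A reconfiguration must carry a higher version than the current config.
    new_config["version"] += 1
    return new_config


def remove_member(config, host):
    """Return a copy of the replica set config without the member at `host`."""
    new_config = copy.deepcopy(config)
    new_config["members"] = [m for m in new_config["members"] if m["host"] != host]
    new_config["version"] += 1
    return new_config


# Hypothetical three-node replica set running in EC2-Classic.
classic_config = {
    "_id": "rs0",
    "version": 3,
    "members": [
        {"_id": 0, "host": "mongo-classic-1.example.internal:27017"},
        {"_id": 1, "host": "mongo-classic-2.example.internal:27017"},
        {"_id": 2, "host": "mongo-classic-3.example.internal:27017"},
    ],
}

# One membership change per step: grow into the VPC first, then shrink Classic.
with_vpc_node = add_member(classic_config, "mongo-vpc-1.example.internal:27017")
swapped = remove_member(with_vpc_node, "mongo-classic-1.example.internal:27017")
```

Each resulting document would then be submitted to the primary with the `replSetReconfig` command, for example through a driver’s admin command interface, once the newly added node has caught up. The helpers copy the config rather than mutating it so that every intermediate step can be reviewed before it is applied.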

Fig. 2. MongoDB cluster CPU load before and after migration to VPC

Then we refactored our deployment scripts and CI/CD jobs to provide the ability to launch and replace our application instances in both VPC and EC2-Classic. We migrated the less impactful applications and, once we felt comfortable with the process, we tackled the critical ones. For services which had ELBs (Elastic Load Balancers) in front of them, we used Amazon Route 53 with a weighted routing policy to send only a tiny percentage of traffic to the new ELB in VPC. We slowly increased the amount of traffic until finally moving all of it to VPC.
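To make the weighted routing step more concrete, here is a minimal sketch of how the change batches for such a ramp could be built. It is an illustration under assumptions rather than our production code: the record name, ELB DNS names, ramp percentages and the function `weighted_change_batch` are hypothetical, and it uses weighted CNAME records for brevity where alias records would be an equally valid choice.

```python
def weighted_change_batch(record_name, classic_elb_dns, vpc_elb_dns, vpc_percent, ttl=60):
    """Build a Route 53 change batch that splits traffic between two ELBs.

    Route 53 routes to each weighted record in proportion to its weight
    divided by the sum of the weights of all records with the same name and type.
    """
    if not 0 <= vpc_percent <= 100:
        raise ValueError("vpc_percent must be between 0 and 100")

    def upsert(set_identifier, target, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "SetIdentifier": set_identifier,  # distinguishes the weighted records
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {
        "Comment": f"Send {vpc_percent}% of traffic to the VPC ELB",
        "Changes": [
            upsert("ec2-classic", classic_elb_dns, 100 - vpc_percent),
            upsert("vpc", vpc_elb_dns, vpc_percent),
        ],
    }


# Illustrative ramp only; the real percentages depend on the service.
RAMP_STEPS = (1, 5, 25, 50, 100)
batches = [
    weighted_change_batch(
        "app.example.com.", "classic-elb.example.com", "vpc-elb.example.com", percent
    )
    for percent in RAMP_STEPS
]
```

Each batch could be passed as the `ChangeBatch` argument to boto3’s Route 53 `change_resource_record_sets` call, together with the hosted zone ID, pausing between steps to watch error rates and latency. A short TTL keeps the time needed for a weight change (or a rollback) to take effect small.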

Fig. 3. Route 53 weighted policy with ELBs
Fig. 4. Nodes in EC2-Classic vs VPC. 98% in 3.5 months - not bad!

Lessons learned

Overall the migration went well, with no impact to the business. That said, we did come out of this project with a number of lessons learned:

  • AWS has separate limits in VPC and EC2-Classic. The limits are tracked separately, so increases have to be requested again for VPC, and some of them are structured differently. For example, we had to change the way we tagged instances using security groups during our deployments, since we were now limited to 6 security groups per instance. Even that limit had been raised from the default of 5, at the cost of fewer rules per security group, since the two limits are tied together.
  • Private IP addresses and DNS resolution. We had to make sure all traffic between our offices and AWS was going via internal IPs (i.e. over our VPN gateway to the AWS VPC). This meant working with the engineering teams and updating some code and documentation. In addition, once in VPC, we had to make sure apps knew how to resolve the internal DNS names (e.g. ip-172-16-5-5.ec2.internal) when running local tests that needed to connect to services in AWS.
  • Some Boto API calls are different. During our migration to VPC we still relied heavily on security groups when launching instances and discovering components (we have since moved to Consul for service discovery), and we quickly learned that we had to update all our scripts to use the VPC security group API calls.
  • ClassicLink. We made sure to enable ClassicLink on all nodes at launch in EC2-Classic so they could communicate with instances in the VPC using private IP addresses.
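The API difference and the security group limit from the list above can be sketched in a few lines. This is a hedged example, not our deployment code: it assumes boto3’s `run_instances` parameters (`SecurityGroups` takes group names, while `SecurityGroupIds` and `SubnetId` are used for VPC launches), and the function `launch_kwargs`, the AMI ID, the subnet ID and the group identifiers are hypothetical.

```python
# Per-instance security group limit we were working with in VPC (see above).
MAX_GROUPS_PER_INSTANCE = 6


def launch_kwargs(ami_id, instance_type, *, group_names=None, group_ids=None,
                  subnet_id=None, max_groups=MAX_GROUPS_PER_INSTANCE):
    """Build keyword arguments for an EC2 run_instances call.

    EC2-Classic launches reference security groups by name; VPC launches
    reference them by ID and also need a subnet.
    """
    kwargs = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
    }
    if subnet_id is None:
        # EC2-Classic style launch: groups are identified by name.
        kwargs["SecurityGroups"] = list(group_names or [])
        return kwargs

    if not group_ids:
        raise ValueError("VPC launches must reference security groups by ID")
    if len(group_ids) > max_groups:
        raise ValueError(f"at most {max_groups} security groups per instance")
    kwargs["SubnetId"] = subnet_id
    kwargs["SecurityGroupIds"] = list(group_ids)
    return kwargs


# Hypothetical identifiers, for illustration only.
classic_launch = launch_kwargs("ami-12345678", "m3.large", group_names=["bidder", "prod"])
vpc_launch = launch_kwargs(
    "ami-12345678",
    "c4.large",
    group_ids=["sg-0a1b2c3d", "sg-4e5f6a7b"],
    subnet_id="subnet-1a2b3c4d",
)
```

Building the arguments in one place lets the same deployment script target either network, which is roughly what a dual-mode launch path needs, and it surfaces the group limit before the API call fails.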

Future improvements

We used this opportunity to update our deployment processes and refactor some of our automation code. We implemented additional monitoring and added metrics to our Grafana dashboards. We are in a much better place than we were a year ago in terms of performance, security, and ability to scale in the future. But there’s always more we can improve!

While we were working on this project, we evaluated a configuration management tool as a replacement for our deployment scripts. There’s an ongoing effort to replace our Python scripts with Ansible playbooks.

There are also a handful of legacy and one-off services still sitting in EC2-Classic. They are not critical, so moving each one is a cost-benefit decision, but we are working on migrating them one by one. It’s time to get “all the things” into VPC.

Fig. 5. Today a handful of mostly single-node or legacy services remain in EC2-Classic

In addition, we’ve already started using Terraform as our Infrastructure-as-Code (IAC) tool of choice in building our underlying infrastructure in other regions. The end goal is to rebuild our production region using Terraform and enforce any changes going forward through IAC.

Finally, we are reaching out to our vendors and third-party partners who are also in AWS to see if they would want to enable VPC Peering, so that data transfers between us never leave the AWS private network. This would reduce costs for both sides.

Conclusion

Infrastructure migrations at this scale are not uncommon as better technologies and more cost-effective ways to run applications become available. Businesses often face challenges such as rebuilding their infrastructure or completely redesigning their application stack. This does not have to be a hard and scary process.

In our case, a few important factors helped us execute fast and with minimal to no impact. Our service-oriented (or modular) architecture allowed us to modify one component at a time. In addition, in most cases we used our automated continuous deployment process, so an actual migration looked just like another deploy. We made an effort not only to keep our services available at all times, but also to provide a seamless experience for the rest of the engineers.

Thanks to this migration we are now reaping the benefits of increased performance, higher network throughput and ultimately better cost-effectiveness. Furthermore, our application deployment automation is more streamlined and our IT auditors are happier. Overall we are positioned better for any future move, be it to a new type of network in AWS, to another cloud provider (did someone say multi-cloud?), or even to our own data center.

 

Itso Slavchev Lead DevOps Engineer
