This blog post is part of Eyeview's Service Discovery series. Be sure to check out the first post, where we shared how to implement a no-cost service discovery solution with AWS.
Eyeview maintains microservice applications, and our fleet of running instances changes constantly and dynamically, making it impossible to keep all the working pieces in order without a well-structured service discovery solution in place. We deploy constantly and strive to maintain a “zero-configuration” setup, which makes service discovery a key player in our infrastructure by helping our apps connect with one another.
If an instance were to launch in an environment without a robust service discovery setup, it would be kind of like walking into a party where you don’t know anyone.
“Who am I?”
“What am I doing here?”
“Who am I supposed to talk to?”
Service discovery is the person who notices your confusion and comes over to help introduce you to everyone at the party.
For example, say you launch a Go application in your cloud environment. Service discovery helps it auto-discover the IP of the database it should connect to, just like your new friend would bring you into the social fold at the party.
Service discovery can be implemented in many ways, but it primarily splits into client-side discovery and server-side discovery. Since Eyeview runs a microservices environment, we primarily use client-side discovery.
As Eyeview’s operations expanded to include data housing, delivery, analytics and much more, the number of running applications grew, and this increase meant we needed a new service discovery solution to cope with the mounting challenges of our former infrastructure. Our infrastructure is based on Spot Fleets and Auto Scaling groups that scale with the number of running campaigns, which means the number of running instances can reach into the hundreds or thousands throughout the day. At over 1,000 instances, AWS started throttling our API requests because there were so many of them.
Choosing Consul as Our Service Discovery Solution
When we started looking for a service discovery solution, we laid out all of the necessities and optional features we might need in the future. While doing our research, we compared etcd, Netflix Eureka and Consul. After a short time, it was clear that Consul was the most turnkey solution for us as it is packed with features that could solve the majority of our (and other companies’) service discovery problems and more.
HashiCorp Consul offers ready-to-use service discovery, a key-value store, multi-datacenter support, health checks and a UI, whereas the other solutions offer only one or two of these features and would likely require installing third-party software.
Consul is a distributed, highly available (HA) system that uses a gossip protocol to manage membership in the cluster. Each agent (client) in the cluster can provide one or more services (e.g. web servers, databases), while servers are in charge of storing and syncing data. Agents need to communicate with servers in order to register their own services or to query for other services in the cluster.
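For example, an agent registers the services it provides through a service definition. Here is a minimal sketch; the service name, port and health endpoint are made-up placeholders, not part of our setup:

```json
{
  "service": {
    "name": "web",
    "port": 8080,
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}
```

Dropping a file like this into the agent's config directory registers the service with the local agent, which syncs it to the servers so other agents can discover it.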
Consul is also backed by a strong open-source community that provides amazing support, and it develops quickly, with helpful new features added on a monthly basis.
There were a few key points we needed to align on to deploy the new Consul architecture.
- Consul would be the backbone of our infrastructure, and we had to architect it so that it would suffer minimal loss of service in case of disaster, such as an AWS availability zone failure.
- We needed to have a seamless deploy process that enabled us to scale in and out without interfering with normal operations, and we needed Consul to automatically discover its peers (a service-discovery system that could discover itself).
- We had to get the sizing right, meaning the number of servers in the cluster. Consul’s official website recommends 3-5 servers in a cluster.
Essentially, we needed to build a highly available service that was easy to scale in and out and roll upgrades/config changes seamlessly.
Putting high availability (HA) and scale together, it was obvious that an AWS Auto Scaling group (ASG) was the way to go. EC2 Auto Scaling allows you to create a group of servers that scales based on criteria such as CPU or memory load, and you can also use it to maintain a number of healthy instances within a cluster with predefined health checks.
In our case, we used an ASG to maintain a cluster of three Consul servers. We set up a launch configuration with the c3.large instance type and an AMI containing a script that installs and runs Consul on launch. Once we had that going, we added an ELB to perform the health checks on the cluster.
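The boot script baked into the AMI looks roughly like the following sketch. The paths, tag names and values are hypothetical placeholders rather than our exact setup:

```shell
#!/usr/bin/env bash
# Hypothetical boot-time sketch: write a minimal Consul server config,
# then start the agent if the consul binary is present.
set -euo pipefail

# /etc/consul.d on a real instance; overridable here for illustration
CONFIG_DIR="${CONSUL_CONFIG_DIR:-/tmp/consul.d}"
mkdir -p "$CONFIG_DIR"

# Minimal server config: expect a 3-node cluster and auto-join peers by EC2 tag
cat > "$CONFIG_DIR/server.json" <<'EOF'
{
  "server": true,
  "bootstrap_expect": 3,
  "retry_join_ec2": {
    "tag_key": "consul-role",
    "tag_value": "server"
  }
}
EOF

# Start the agent (skipped when consul isn't installed, e.g. in a dry run)
if command -v consul >/dev/null 2>&1; then
  exec consul agent -config-dir="$CONFIG_DIR" -data-dir=/var/consul
fi
```

Because the config is tag-driven, every instance launched from the AMI comes up identical and finds its peers on its own.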
Keeping the Cluster Healthy
A Consul server cluster is only considered healthy as long as it has a functioning leader. To check this, we pointed the health check at Consul's built-in HTTP status endpoint: “HTTP:8500/v1/status/leader”
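In ELB terms, the health check looks something like this CloudFormation-style fragment (the interval and threshold values are illustrative, not our production numbers):

```json
"HealthCheck": {
  "Target": "HTTP:8500/v1/status/leader",
  "Interval": "15",
  "Timeout": "5",
  "HealthyThreshold": "2",
  "UnhealthyThreshold": "3"
}
```

The endpoint responds with the address of the current cluster leader as seen by that server.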
Additionally, here are a few useful configurations that we used in our setup:
This one is a big plus. Adding servers or agents to Consul clusters programmatically is much easier with retry_join_ec2. This setting scans the AWS API for instances with a predefined tag and automatically tries to join them. It comes in handy when you frequently upgrade or replace your server cluster.
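In the agent configuration it looks something like this; the region, tag key and tag value are example placeholders:

```json
{
  "retry_join_ec2": {
    "region": "us-east-1",
    "tag_key": "consul-role",
    "tag_value": "server"
  }
}
```

Note that the instance needs IAM permission to call ec2:DescribeInstances for the scan to work.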
This setting enables Consul servers to gracefully leave the cluster on a TERM signal. If a server leaves the cluster ungracefully, the cluster treats that server as “failed” and the record remains stale for a period of 72 hours (which can be reduced with reconnect_timeout). This is a great setting when using spot instances or ASGs, though a server isn’t always able to leave gracefully before termination.
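The setting in question is leave_on_terminate, a one-liner in the agent config:

```json
{
  "leave_on_terminate": true
}
```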
This setting came along in Consul 0.7, and it means that the cluster needs at least three servers in order to elect a leader. It helps prevent a split-brain scenario in which two servers in the cluster each consider themselves the leader.
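This is the bootstrap_expect setting; for our three-server cluster it looks like:

```json
{
  "server": true,
  "bootstrap_expect": 3
}
```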
Also introduced in a later version of Consul, this setting lets you modify how long a node that has failed or left the cluster remains visible as a stale record. By default, a node marked as “failed” (or one that left the cluster) stays in your cluster view for 72 hours. With this setting, we lowered that grace period to the minimum possible: once a node has been marked as “left” or “failed” for eight hours, Consul will “reap” it out of the cluster.
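In config form, this is reconnect_timeout (8h is the minimum value Consul accepts for it):

```json
{
  "reconnect_timeout": "8h"
}
```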
The open-source version of Consul provides a pretty solid UI that helps you view the services in the cluster. The built-in UI is nice but lacks some basic capabilities, which is why we decided to try Hashi-UI (a community-backed project).
How We Roll Out New Versions
Since we work with rolling deployments, launching a new version in an ASG can be somewhat tricky: you must refresh the instances within the group to get the new version. The first step is to choose an approach:
- Bake Consul into AWS AMI
- Pull Deployment
Once you make a decision, you will need to replace the instances in the ASG in order for them to pick up the new version. To do this, we wrote a script that terminates an instance and waits for the ASG to rebalance and become healthy again before terminating the next instance. This keeps the quorum intact and the Consul cluster in a healthy state. You can start with this script.
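A sketch of such a script might look like the following. It assumes the AWS CLI is configured, the ASG name is a placeholder, and a real version would also want a timeout around the wait loop:

```shell
#!/usr/bin/env bash
# Hypothetical rolling-replace sketch for a Consul server ASG.
set -euo pipefail

ASG_NAME="${1:-consul-servers}"  # placeholder group name

# Pure helper: the group is healthy when every desired instance is InService
asg_is_healthy() {
  local in_service=$1 desired=$2
  if [ "$in_service" -ge "$desired" ]; then echo yes; else echo no; fi
}

# Poll the ASG until the terminated instance has been replaced
wait_for_rebalance() {
  while :; do
    sleep 30
    local in_service desired
    in_service=$(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names "$ASG_NAME" \
      --query 'length(AutoScalingGroups[0].Instances[?LifecycleState==`InService`])' \
      --output text)
    desired=$(aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names "$ASG_NAME" \
      --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
    if [ "$(asg_is_healthy "$in_service" "$desired")" = "yes" ]; then
      break
    fi
  done
}

# Terminate one instance at a time, keeping capacity (and quorum) intact
rolling_replace() {
  local ids
  ids=$(aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names "$ASG_NAME" \
    --query 'AutoScalingGroups[0].Instances[].InstanceId' --output text)
  for id in $ids; do
    echo "Replacing $id"
    aws autoscaling terminate-instance-in-auto-scaling-group \
      --instance-id "$id" --no-should-decrement-desired-capacity
    wait_for_rebalance
  done
}

# Only run when explicitly asked, so the file can be sourced safely
if [ "${RUN_ROLLING_REPLACE:-0}" = "1" ]; then
  rolling_replace
fi
```

Terminating through the ASG API with --no-should-decrement-desired-capacity (rather than a plain EC2 terminate) keeps the desired capacity at three, so the group immediately launches a replacement server.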
Eyeview tech is never static, and we are always looking for ways to improve efficiency and cost. Having a state-of-the-art service-discovery system allows us to grow effortlessly within our current infrastructure and also expand into new regions (we’re currently expanding to the UK). It is also a great stepping stone for us to explore implementation of containers in the near future.