If you have read some of our previous posts, you know by now that Eyeview runs on a highly elastic system that deploys thousands of servers per day.
The engine that controls this massive deployment load has to be fast, resilient and scalable, but most importantly – reliable.
Our systems change constantly and are registered through our service discovery solution, so we depend heavily on deployments completing successfully and on time, and on being able to quickly detect a failed deploy and take action.
In the pre-configuration-management-tools era at Eyeview, we built a Python program to manage our deployments using a pull methodology.
In this blog post, we will show why Ansible was chosen as our configuration management tool for application deployment, and how we reversed its native push architecture to better fit our extreme scale-out system. In addition, we will show how to deal with a failed deploy when using pull and how to ensure all tasks are truly idempotent.
Though our Python code did a great job and helped us achieve what we needed at the time, we had to constantly chase our tail. Each version upgrade or system-wide change would cost us many hours of engineering work to get the task done.
With a well known CM tool, it’s easier to quickly deliver system-wide changes.
So… Why Ansible?
Well known – true for Ansible, but not exclusively: using a well-known CM tool is extremely helpful when transferring knowledge or on-boarding new engineers.
Readable – Ansible is written in Python and uses YAML as its playbook language; when used correctly, playbooks can easily replace much of a system's documentation.
Idempotent – tasks can run multiple times without changing the result of the operation, as long as the current state matches the desired state.
As our team is always on the lookout for new features and technologies, it was crucially important for us to choose a configuration management tool with an awesome community to keep it updated. For example, AWS recently released a new feature called Launch Templates, and Ansible's community has already developed a module that is planned for the next official version and can already be tested in the devel release.
Pull vs Push?
Taking a look at the continuous deployment world, there are two options to carry out a deployment: pull and push.
Pull – usually a master server waits for a call from the "pulling" instance; the instance fetches the code and configuration and configures itself independently. This is how other configuration management tools, like Puppet or Chef, work.
Push – the master server orchestrates the deployment and must remain connected to the instance until the deploy completes. With Ansible, the most convenient way to run "push" deployments is with Ansible Tower.
Both options are solid and can deliver great results (depending on the system requirements).
At Eyeview, we scale in and out pretty often, and most of our systems are launched through orchestration tools (Auto Scaling Groups, Spot Fleets, etc.), so we knew we needed a tool that could "pull" its deploy at launch time without depending on a master server.
Going down the "push" path would make us dependent on the pushing server(s) and leave us with a single point of failure, which is not acceptable nowadays.
On top of that, the "pull" methodology is much more scalable (clients are independent) and thus better suited to our system.
Even though Ansible is "push" based (and agentless), it ships with `ansible-pull`, which runs locally on each instance and pulls the latest Ansible code from a remote repo at boot time.
All the instance needs is Ansible installed and a connection to the remote repo – it's that simple!
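As a rough sketch, the instance's user-data can boil down to a single `ansible-pull` call. The repo URL, branch and playbook name below are hypothetical placeholders, not our actual setup:

```shell
#!/bin/bash
# EC2 user-data sketch: fetch the latest Ansible code and apply it locally at boot.
# The repo URL, branch and playbook name are placeholders.
ansible-pull \
  --url https://github.com/example/ansible-deploy.git \
  --checkout master \
  --directory /opt/ansible \
  local.yml
```

`ansible-pull` clones (or updates) the repo into the given directory and then runs the named playbook against localhost, so no master server ever needs to reach the instance.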
So not only are we able to implement a pull-based deploy, we are also not dependent on any master servers – a double win for us! And the icing on the cake: we can still use Ansible to push configuration changes ad hoc, or to deploy databases and other stateful applications.
The Golden (“common”) Role
Assuming you are familiar with Ansible's role structure, most setups have a "common" role – usually one that runs before any other role, or one that runs periodically across the board to verify all servers have the needed packages and configuration.
Here we took a slightly different approach with the common role: not only does it contain all of our common settings and dependencies, it is also in charge of running the local deploy process.
Each EC2 instance is launched with user-data that contains a set of variables in YAML format. During launch, the common role fires up and loads those variables from user-data; with them, the role knows which app and build should be installed on the instance.
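A minimal sketch of what that can look like in the common role – the variable names (`app_name`, `build_version`) are assumptions for illustration, not our exact schema:

```yaml
# The user-data passed at launch might look like (hypothetical names):
#   app_name: bidder
#   build_version: 1.42.0

- name: Fetch user-data from the EC2 instance metadata service
  uri:
    url: http://169.254.169.254/latest/user-data
    return_content: yes
  register: user_data

- name: Parse the YAML variables it contains
  set_fact:
    deploy_vars: "{{ user_data.content | from_yaml }}"

- name: Use the variables to drive the deploy
  debug:
    msg: "Installing {{ deploy_vars.app_name }} build {{ deploy_vars.build_version }}"
```

From here, later tasks in the role can branch on `deploy_vars` to pick the right app role and artifact.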
If any failure occurs, we send a notification to a dedicated Slack channel, which helps us keep track of failing tasks and react immediately when something breaks.
We have also implemented a deploy retry workflow – more on that later on.
Idempotency: Batteries not (always) Included
Although Ansible modules are built with idempotency in mind, it isn't achievable in every case.
For certain tasks or workflows, additional work is needed to ensure we don't override previously run tasks, or run a task when it isn't needed.
Take the AWS Logs agent installation, for example. Since it isn't available as an apt package, the installation has to be done with a set of commands. We baked the agent into our AMI, but we still want to make sure it is installed whenever our Ansible common role runs, so we add a task beforehand that checks whether the package is installed, registers the output, and conditionally runs a block of installation tasks.
This is just one of many use cases in which a "verify" task registers a conditional variable for a block of tasks.
Using this method, you can shave seconds off your playbook runs and avoid executing tasks that aren't needed.
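Sketched in Ansible terms, the "verify, then conditionally install" pattern looks roughly like this – the check command, paths and installer URL are assumptions for illustration:

```yaml
- name: Check whether the AWS Logs agent is already installed
  command: /usr/local/bin/aws-logs-agent --version   # hypothetical binary path
  register: awslogs_check
  failed_when: false     # a non-zero exit simply means "not installed"
  changed_when: false    # a read-only check should never report "changed"

- name: Install the agent only when the check came back non-zero
  when: awslogs_check.rc != 0
  block:
    - name: Download the installer
      get_url:
        url: https://example.com/awslogs-agent-setup.py   # placeholder URL
        dest: /tmp/awslogs-agent-setup.py

    - name: Run the installer
      command: python /tmp/awslogs-agent-setup.py --non-interactive
```

The `failed_when: false` / `changed_when: false` pair keeps the verify task itself idempotent, and the `when` on the block skips the whole installation in the common case where the AMI already has the agent baked in.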
OoPs… It failed…
Things don't always go as expected. Each deploy or run depends on other components such as remote repositories, APIs, etc., and most of the time a failed deploy is down to a one-time error that would succeed on a second try.
For example, after noticing we were getting daily API failures from AWS (limit throttling), we decided to use Ansible's block and rescue to give our deploy another shot – we found it goes all the way through 99% of the time! With this addition, plus per-task retries on tasks with a high failure risk, we have lowered our Ansible deploy failures to a near-nonexistent level. If a deploy still fails after the retries, we post a message to Slack with a log file attached and terminate the instance.
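A sketch of that retry-and-alert workflow – the deploy entry point, Slack token variable and message are placeholders, not our production code:

```yaml
- name: Deploy with a second shot on transient failures
  block:
    - name: First deploy attempt
      command: /opt/deploy/run.sh          # placeholder deploy entry point
  rescue:
    - name: Second attempt (transient AWS throttles usually pass here)
      command: /opt/deploy/run.sh
      register: retry_result
      ignore_errors: yes

    - name: Alert Slack when the retry failed too
      slack:
        token: "{{ slack_token }}"         # assumed to be defined elsewhere
        msg: "Deploy failed twice on {{ ansible_hostname }}"
      when: retry_result is failed

    - name: Fail the play so the instance gets terminated
      fail:
        msg: "Deploy failed after retry"
      when: retry_result is failed
```

The `rescue` section only runs when something in the `block` fails, so the happy path costs nothing; flaky individual tasks can additionally carry their own `retries`/`until` settings.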
This is how we “reversed” Ansible native behavior to cope with our extremely scaled out environment.
Over time we have added to and modified the common role to the point where 99.9% of deploys on EC2 instances succeed. And those that fail… are terminated immediately.
Looking forward, we plan to automate more infrastructure configurations with Ansible (VPC, IAM Roles and more…).