At Eyeview, as at any other startup, we look closely at our technology cost. As the VP of Architecture, it is my responsibility to forecast our technology cost for the upcoming year, expressed as a percentage of revenue, and to track spending to ensure we meet that forecast. In this post, I will share insights about cost analysis, tips for cost efficiency across the AWS ecosystem, and the challenges we faced along the way.
The Starting Point
If you want to get anywhere with cost analysis, you need to start by tagging. Cost allocation tags are keys and values you can set on your AWS resources so that you can slice and dice the cost later on. While some of those tags are generated automatically by AWS, user-defined tags are the focal point here. We generally use a single tag that represents the resource's logical “service”: in EC2 it is the application name, in DynamoDB it is the table name, and so on. Later, we group those tag values in logical ways for different types of reports.
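A quick way to check tagging discipline is to list the resources that are missing the tag. The sketch below is only an illustration, not our production tooling. It assumes the tag key is called `service` (a hypothetical name) and that each resource is a dictionary with a `Tags` list of `Key`/`Value` pairs, which is the shape the EC2 API uses for tags. The `Id` field and the sample data are made up for the example.

```python
from typing import Iterable, List

SERVICE_TAG_KEY = "service"  # hypothetical tag key; use whatever key you standardized on


def missing_service_tag(resources: Iterable[dict], tag_key: str = SERVICE_TAG_KEY) -> List[str]:
    """Return the IDs of resources that have no non-empty value for tag_key."""
    untagged = []
    for resource in resources:
        # Resources with no tags at all may omit the "Tags" field entirely.
        tags = {t["Key"]: t["Value"] for t in resource.get("Tags", [])}
        if not tags.get(tag_key):
            untagged.append(resource["Id"])
    return untagged


# Made-up sample data for illustration only.
sample = [
    {"Id": "i-001", "Tags": [{"Key": "service", "Value": "bidder"}]},
    {"Id": "i-002", "Tags": [{"Key": "Name", "Value": "scratch-box"}]},
    {"Id": "i-003"},
]
print(missing_service_tag(sample))
```

Anything this check reports will show up as untagged spend in your cost reports, so it is worth running before you start analyzing.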
The Basics of Cost Analysis
AWS itself provides tools to analyze your costs. The one Eyeview uses most is AWS Cost Explorer, which recently got a fresh look. Cost Explorer is a great ad hoc tool that allows slicing and dicing costs by application, service and usage type. Filtering by a cost allocation tag value and grouping by usage type is an insightful way of understanding what a service costs.
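The same filter-and-group view is also available programmatically through the Cost Explorer API. The sketch below builds the request parameters for the `GetCostAndUsage` operation. The tag key `service`, the tag value `bidder`, and the dates are assumptions chosen for illustration, and the helper name `build_cost_query` is hypothetical.

```python
from datetime import date


def build_cost_query(tag_key: str, tag_value: str, start: date, end: date) -> dict:
    """Build keyword arguments for the Cost Explorer GetCostAndUsage call.

    Filters to one cost allocation tag value and groups the result by usage type.
    """
    return {
        # The end date is exclusive in Cost Explorer.
        "TimePeriod": {"Start": start.isoformat(), "End": end.isoformat()},
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Tags": {"Key": tag_key, "Values": [tag_value]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


# "service" and "bidder" are hypothetical tag key/value names; the dates are examples.
query = build_cost_query("service", "bidder", date(2018, 1, 1), date(2018, 2, 1))
print(query)

# With boto3 installed and credentials configured, the request could be sent as:
#   import boto3
#   response = boto3.client("ce").get_cost_and_usage(**query)
```

Keeping the request as a plain dictionary makes it easy to reuse the same query for every service and compare the results side by side.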
Within Cost Explorer you will find other helpful tools, including reports on Reserved Instances (RI) coverage and utilization. Reserved Instances are a great way to save on EC2 costs, second only to Spot Instances. If you run on-demand EC2 instances and forecast that you will continue to use them over the next year, Reserved Instances are the way to go. The coverage report tells you how much of your on-demand workload is covered by Reserved Instance hours, while the utilization report tells you how much of your Reserved Instance hours are being utilized by on-demand instances. Combining the two reports provides insight into which instance types you should purchase.
AWS can also deliver in-depth cost reports directly to your S3 bucket, though we did not find the raw reports useful on their own and determined that other services can analyze them much better, which leads me to the next set of tools.
The last native AWS tool you probably want to take a closer look at is AWS Trusted Advisor, though you will need the Business support plan to use the cost optimization section. Trusted Advisor highlights inefficiencies in EC2 utilization and provides RI purchase recommendations.
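To make the difference between the two reports concrete, here is a small worked example. The numbers are made up, and the function names are hypothetical. It only illustrates the two ratios described above, not how AWS computes them internally.

```python
def ri_coverage(ri_covered_hours: float, total_running_hours: float) -> float:
    """Share of running instance hours that were billed against a reservation."""
    return ri_covered_hours / total_running_hours if total_running_hours else 0.0


def ri_utilization(ri_used_hours: float, ri_purchased_hours: float) -> float:
    """Share of purchased reservation hours that were applied to running instances."""
    return ri_used_hours / ri_purchased_hours if ri_purchased_hours else 0.0


# Made-up monthly numbers for one instance type (720 hours in a 30-day month).
running_hours = 10 * 720    # 10 instances running all month
purchased_hours = 6 * 720   # 6 reservations
used_hours = 6 * 720        # all 6 reservations matched a running instance

coverage = ri_coverage(used_hours, running_hours)
utilization = ri_utilization(used_hours, purchased_hours)
print(f"coverage={coverage:.0%} utilization={utilization:.0%}")
```

In this example the reservations are fully used but cover only part of the workload, which suggests there is room to purchase more for that instance type. Low utilization would point the other way: too many reservations for what is actually running.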
Taking Cost Analysis Further
While Cost Explorer is not bad, we reached a point where we needed more insight. The two things we needed were (a) team-level reports and (b) a financial view of COGS vs OPEX.
Team-level insights are important to us because every engineering team uses a different part of our AWS infrastructure with a different usage pattern. Our data team uses Kinesis Streams along with EC2 instances that run KCL to populate S3, while our bidding team uses the same streams to populate ad delivery data in DynamoDB. To understand how our architecture is evolving from a cost point of view, it helps to trace changes in the bill back to the team whose architecture drove them. This can help drive the next architectural decision.
An analysis of COGS vs OPEX is important from a financial point-of-view for two reasons. Firstly, it will allow you to understand what drives costs and whether the technology of the company is scalable from a cost aspect. Secondly, passing an audit often requires that all costs used directly in service of revenue (including those associated with AWS) are recorded as COGS. The balance of the costs should be shifted “below the line” into OPEX.
Those two items can be achieved in a few ways. One way is to create Cost Explorer reports with the right filters that contain all of the relevant tag values, but Cost Explorer has some limitations that make this less than ideal. Another method is to add another cost allocation tag for each of those two grouping definitions. That is OK but not very flexible: a resource cannot be split across more than one group, and tag changes only show up going forward, not retroactively.
One tool that we looked at, and maybe you should too, is Ice, an open source AWS cost analysis tool created by Netflix. We ended up finding it neither intuitive nor easy enough to use.
We then tried Stax, which proved very effective for us. Stax is a cloud management tool that gives us better visibility and insight into the cost and efficiency of our AWS workloads. You can define cost allocation rules and look at the results through different views.
An added bonus is the daily and weekly reports, which summarize how we are doing compared to past weeks and months.
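The idea of allocation rules kept outside of the tags themselves can be sketched in a few lines. This is not how Stax works internally. It is a hypothetical illustration in which each service tag value maps to a team and to a COGS/OPEX split, so that a resource can be shared between groups and the rules can be re-applied to past months. All names and numbers are made up.

```python
from collections import defaultdict

# Hypothetical rules: each service tag value maps to an owning team and a
# split between COGS and OPEX. The shares for a service should sum to 1.0.
ALLOCATION_RULES = {
    "bidder": {"team": "bidding", "split": {"COGS": 1.0}},
    "data-pipeline": {"team": "data", "split": {"COGS": 0.7, "OPEX": 0.3}},
    "ci": {"team": "platform", "split": {"OPEX": 1.0}},
}


def allocate(costs_by_service: dict, rules: dict = ALLOCATION_RULES):
    """Roll per-service costs up into per-team and COGS/OPEX totals."""
    by_team = defaultdict(float)
    by_category = defaultdict(float)
    for service, cost in costs_by_service.items():
        rule = rules.get(service)
        if rule is None:
            # Surface unmapped spend instead of silently dropping it.
            by_team["unallocated"] += cost
            by_category["unallocated"] += cost
            continue
        by_team[rule["team"]] += cost
        for category, share in rule["split"].items():
            by_category[category] += cost * share
    return dict(by_team), dict(by_category)


# Made-up monthly costs in dollars, keyed by service tag value.
monthly_costs = {"bidder": 1000.0, "data-pipeline": 500.0, "ci": 200.0, "old-poc": 50.0}
teams, categories = allocate(monthly_costs)
print(teams)
print(categories)
```

Because the rules live in one place rather than on the resources, changing a split only requires re-running the roll-up over historical cost data.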
Tips for Cost Efficiency
If you have reached this milestone, you should at least get something out of it. Some other items worth looking at in order to optimize costs on core services are:
- An obvious but sometimes forgotten item is to track your running instances and catch those that have been underutilized for some time or that are just lying there completely unused.
- Use auto scaling as much as you can. Auto Scaling groups are fairly easy to use and can help you provision only the right number of instances.
- Use new instance generations that are more cost efficient than the old ones. AWS usually prices these to incentivize dropping older infrastructure, so take advantage.
- Ask yourself if you really need on-demand instances. If so, make sure to watch your RI reports and purchase accordingly. If you do not, make use of Spot Instances and especially Spot Fleets. Spot Fleets are a neat feature that lets you draw on many instance type pools in order to get the cheapest capacity at any given moment. The more “diverse” your fleet is, the better cost and availability you get.
- Data Transfer: Try to get to a “vertical” architecture in which you do not transfer data from one Availability Zone to another. That will also help you achieve high availability.
- S3: Take advantage of the different storage tiers S3 has to offer. For backup and archived data, use Glacier. For data that is not accessed as much, use the Infrequent Access tier. If the data is unimportant, expire it. A new tool that helped us understand why our Infrequent Access data retrieval was high is S3 Storage Class Analysis.
- DynamoDB: Remember that DynamoDB charges you for both IO operations AND storage. Make sure to use the TTL feature of DynamoDB which helps both. Learn about how DynamoDB partitions data and implement best practices around it, in order to prevent hot keys that can inflate your provisioned capacity. Don’t forget to add auto-scaling to that mix and finally purchase some reserved capacity at lower cost.
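As one illustration of the tips above, the S3 advice (Infrequent Access, then Glacier, then expiry) maps directly onto a bucket lifecycle configuration. The sketch below builds such a configuration as a plain dictionary. The helper name, the `logs/` prefix, the bucket name and the day thresholds are all assumptions for the example, so pick values that match your own access patterns.

```python
def build_lifecycle_rules(prefix: str, ia_after_days: int,
                          glacier_after_days: int, expire_after_days: int) -> dict:
    """Build an S3 lifecycle configuration: Standard -> Infrequent Access -> Glacier -> delete."""
    if not (ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError("expected ia_after_days < glacier_after_days < expire_after_days")
    return {
        "Rules": [
            {
                "ID": f"tiering-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Hypothetical prefix and thresholds, for illustration only.
config = build_lifecycle_rules("logs/", ia_after_days=30,
                               glacier_after_days=90, expire_after_days=365)
print(config)

# With boto3 installed and credentials configured, it could be applied as:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket", LifecycleConfiguration=config)
```

Before settling on thresholds, check how often the data is actually read, since retrieval from the colder tiers is charged separately.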
Unless you keep a close eye on AWS costs, they can grow exponentially, displeasing your finance team (rightfully so). I encourage you to make AWS cost considerations an integral part of the engineering cycle, as important as resilience, scalability and availability. In the case of AWS, there is usually a strong correlation between a cost-efficient system and a well-architected one, so analyzing your cost drivers will help you understand the weaknesses in your architecture and make better decisions on how to improve it.