This October, I gave a talk at HashiConf 2018 where I shared 5 key lessons we learned at Gruntwork while creating and maintaining a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies. In this blog post, I’ll share with you the video and slides from the talk, as well as a condensed, written version of the 5 key lessons.
Although the industry is full of cutting-edge buzzwords (Kubernetes, microservices, service meshes, immutable infrastructure, big data, data lakes, etc.), the reality is that when you’re knee-deep in building infrastructure, it doesn’t feel cutting edge.
To me, DevOps feels more like this:
Building production-grade infrastructure is hard. And stressful. And time consuming. Very time consuming.
Here’s roughly how long you should expect your next infrastructure project to take, based on empirical data we’ve gathered while working with hundreds of different companies:
DevOps projects always take way longer than you expect. Always. Why is that?
Well, the first reason is Yak Shaving, as perfectly illustrated in this clip from Malcolm in the Middle:
The second reason is that building production-grade infrastructure (as in, the type of infrastructure you’d bet your company on) involves a thousand little details. The vast majority of developers don’t know what those details are, so when you’re estimating a project, you usually forget about a number of critical (and time-consuming) details.
To avoid this issue, every time you go to work on a new piece of infrastructure, go through the following checklist:
Not every single piece of infrastructure needs every single item on the list, but you should consciously and explicitly document which items you’ve implemented, which ones you’ve decided to skip, and why.
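One lightweight way to keep that record honest is to store the decisions next to the code and check them automatically. The sketch below is only an illustration, not something from the talk: the `ChecklistDecision` type, the `undocumented_items` helper, and the three checklist item names are all hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    IMPLEMENTED = "implemented"
    SKIPPED = "skipped"


@dataclass(frozen=True)
class ChecklistDecision:
    item: str          # checklist item name (hypothetical examples below)
    status: Status
    reason: str = ""   # must be filled in when the item is skipped


def undocumented_items(checklist, decisions):
    """Return items that have no decision, or were skipped without a reason."""
    by_item = {d.item: d for d in decisions}
    problems = []
    for item in checklist:
        decision = by_item.get(item)
        if decision is None:
            problems.append(item)  # nobody decided anything about this item
        elif decision.status is Status.SKIPPED and not decision.reason.strip():
            problems.append(item)  # skipped, but the "why" is missing
    return problems


# Hypothetical checklist items, for illustration only.
CHECKLIST = ["monitoring", "backups", "tests"]

DECISIONS = [
    ChecklistDecision("monitoring", Status.IMPLEMENTED),
    ChecklistDecision("backups", Status.SKIPPED, "stateless service, nothing to back up"),
]

if __name__ == "__main__":
    # Lists the items that still need an explicit, documented decision.
    print(undocumented_items(CHECKLIST, DECISIONS))
```

A check like this could run in CI so that a new piece of infrastructure cannot merge until every item is either implemented or skipped with a stated reason.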
As of 2018, here are the primary tools we use at Gruntwork to build and manage infrastructure:
Now, all of these tools are useful, but that’s not the real lesson here. The real lesson is that tools, by themselves, are not enough. You also need to change your team’s behavior.
In particular, the best tools in the world will not help your team one bit if your team isn’t bought into using those tools or if your team doesn’t have enough time to learn to use those tools. Therefore, the key takeaway is that using infrastructure as code is an investment: there’s an up-front cost to get going, but if you invest wisely, you’ll earn big dividends over the long term.
Infrastructure as code newbies often define all of their infrastructure for all of their environments (dev, stage, prod, etc.) in a single file or single set of files that are deployed as a unit. This is a Bad Idea.
Here are just a few of the downsides:
terraform plan takes 5–6 minutes to run!
terraform plan becomes useless, as no one bothers to look through thousands of lines of plan output. Moreover, code reviews become useless:
10 lines of code = 10 issues. 500 lines of code = "looks fine." Code reviews.
In short, you should build your code out of small, standalone, reusable, composable modules. This is not a new or controversial insight. You’ve heard it a thousand times before, albeit in slightly different domains:
“Do one thing and do it well” —Unix Philosophy
“The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.”—Clean Code
If your infrastructure code does not have automated tests, it’s broken. You just don’t know it yet. That said, testing infrastructure code is hard. You don’t really have “localhost” (e.g., you can’t deploy an AWS VPC on your laptop) and you don’t really have “unit tests” (e.g., you can’t isolate your Terraform code from the “outside” world as all Terraform does is talk to the outside world).
Therefore, to properly test your infrastructure code, you typically have to deploy it to a real environment, run real infrastructure, validate that it does what it should, and then tear it all down. For this style of testing, see Terratest, an open source library that includes tools for testing Terraform, Packer, and Docker code; working with AWS, GCP, and Kubernetes APIs; executing shell commands locally and on remote servers over SSH; and much more. What this means is that, with infrastructure testing, you have to slightly redefine terms:
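As a rough illustration of the deploy, validate, tear down cycle described above, here is a minimal sketch in Python. It is not Terratest and not code from the talk: `run_infra_test` and the fake callables in `demo` are hypothetical names, and a real test would replace the fakes with calls that actually create and destroy infrastructure.

```python
from typing import Callable, TypeVar

Env = TypeVar("Env")


def run_infra_test(
    deploy: Callable[[], Env],
    validate: Callable[[Env], None],
    teardown: Callable[[], None],
) -> None:
    """Deploy real infrastructure, validate it, and always tear it down."""
    try:
        env = deploy()    # may fail part-way and leave resources behind
        validate(env)     # should raise if the infrastructure misbehaves
    finally:
        teardown()        # runs on success, failed validation, or failed deploy


def demo(validation_passes: bool = True) -> list:
    """Run the harness against fakes and return the order of events."""
    events = []

    def fake_deploy():
        events.append("deploy")
        return {"url": "http://example.invalid"}  # placeholder output

    def fake_validate(env):
        events.append("validate")
        if not validation_passes:
            raise AssertionError("infrastructure did not behave as expected")

    def fake_teardown():
        events.append("teardown")

    try:
        run_infra_test(fake_deploy, fake_validate, fake_teardown)
    except AssertionError:
        events.append("failed")
    return events


if __name__ == "__main__":
    print(demo())
    print(demo(validation_passes=False))
```

The important design choice is that teardown sits in `finally`, so a failed validation (or a deploy that dies half-way) does not leave real, billable infrastructure running. Tests written in Go typically get the same shape by registering the destroy step with `defer` before applying.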
Note that the diagram is a pyramid, where we have lots of unit tests, a smaller number of integration tests, and a very small number of e2e tests. Why? Because of how long each type of test takes:
Cycle time with infrastructure tests is slow, especially as you go up the pyramid, so you’ll want to catch as many bugs as you can as low in the pyramid as you can. That means you should:
Let’s now put everything in this talk together. Here’s how you’ll be building and managing infrastructure from now on: