Update, November 17, 2016: We took this blog post series, expanded it, and turned it into a book called Terraform: Up & Running!
Update, July 8, 2019: We’ve updated this blog post series for Terraform 0.12 and released the 2nd edition of Terraform: Up & Running!
Update, Sep 28, 2022: We’ve updated this blog post series for Terraform 1.2 and released the 3rd edition of Terraform: Up & Running!
This is Part 6 of the Comprehensive Guide to Terraform series. In previous parts, you learned why we picked Terraform, the basic syntax and features of Terraform, how to manage Terraform state, how to create reusable infrastructure with Terraform modules, and Terraform tips & tricks such as loops and if-statements. In this final part of the series, you’ll learn best practices for using Terraform as a team.
As you’ve been reading this blog post series and working through the code samples, you’ve most likely been working by yourself. In the real world, you’ll most likely be working as part of a team, which introduces a number of new challenges. You may need to find a way to convince your team to use Terraform and other infrastructure-as-code (IaC) tools. You may need to deal with multiple people concurrently trying to understand, use, and modify the Terraform code you write. And you may need to figure out how to fit Terraform into the rest of your tech stack and make it a part of your company’s workflow.
In this blog post, I’ll dive into the key processes you need to put in place to make Terraform and IaC work for your team:
Let’s go through these topics one at a time.
You can find working sample code for the examples in this blog post in the Terraform: Up & Running code samples repo. This blog post corresponds to Chapter 10 of Terraform Up & Running, “How to Use Terraform as a Team,” so look for the code samples in the 10-terraform-team
folders.
In this section, I’ll introduce a typical workflow for taking application code (e.g., a Ruby on Rails or Java/Spring app) from development all the way to production. This workflow is reasonably well understood in the DevOps industry, so you’ll probably be familiar with parts of it. Later in this blog post, I’ll talk about a workflow for taking infrastructure code (e.g., Terraform modules) from development to production. This workflow is not nearly as well known in the industry, so it will be helpful to compare that workflow side by side with the application workflow to understand how to translate each application code step to an analogous infrastructure code step.
Here’s what the application code workflow looks like:
Let’s go through these steps one at a time.
All of your code should be in version control. No exceptions. It was the #1 item on the classic Joel Test when Joel Spolsky created it nearly 20 years ago, and the only things that have changed since then are that (a) with tools like GitHub, it’s easier than ever to use version control and (b) you can represent more and more things as code. This includes documentation (e.g., a README written in Markdown), application configuration (e.g., a config file written in YAML), specifications (e.g., test code written with RSpec), tests (e.g., automated tests written with JUnit), databases (e.g., schema migrations written in ActiveRecord), and of course, infrastructure.
As in the rest of this blog post series, I’m going to assume that you’re using Git for version control. For example, here is how you can check out the sample code repo for this blog post series and the book Terraform: Up & Running:
git clone https://github.com/brikis98/terraform-up-and-running-code.git
By default, this checks out the main
branch of your repo, but you’ll most likely do all of your work in a separate branch. Here’s how you can create a branch called example-feature
and switch to it by using the git checkout
command:
$ cd terraform-up-and-running-code
$ git checkout -b example-feature
Switched to a new branch 'example-feature'
Now that the code is on your computer, you can run it locally. For example, if you had a simple web server written in a general purpose programming language such as Ruby, you might run it as follows:
$ ruby web-server.rb
[2019-06-15 15:43:17] INFO WEBrick 1.3.1
[2019-06-15 15:43:17] INFO ruby 2.3.7 (2018-03-28)
[2019-06-15 15:43:17] INFO WEBrick::HTTPServer#start: port=8000
Now you can manually test it with curl
:
$ curl http://localhost:8000
Hello, World
Alternatively, you might have some automated tests you can run for the web server:
$ ruby web-server-test.rb
(...)
Finished in 0.633175 seconds.
--------------------------------------------
8 tests, 24 assertions, 0 failures, 0 errors
100% passed
--------------------------------------------
The key thing to notice is that both manual and automated tests for application code can run completely locally on your own computer. You’ll see later in this blog post that this is not true for the same part of the workflow for infrastructure changes.
Now that you can run the application code, you can start making changes. This is an iterative process where you make a change, re-run your manual or automated tests to see if the change worked, make another change, re-run the tests, and so on.
For example, you can change the output of your Ruby web server to “Hello, World v2”, restart the server, and see the result:
$ curl http://localhost:8000
Hello, World v2
You might also update and rerun the automated tests. The idea in this part of the workflow is to optimize the feedback loop so that the time between making a change and seeing whether it worked is minimized.
As you work, you should regularly be committing your code, with clear commit messages explaining the changes you’ve made:
$ git commit -m "Updated Hello, World text"
Eventually, the code and tests will work the way you want them to, so it’s time to submit your changes for a code review. You can do this with a separate code review tool (e.g., Phabricator or ReviewBoard) or, if you’re using GitHub, you can create a pull request. There are several different ways to create a pull request. One of the easiest is to git push your example-feature
branch back to origin (that is, back to GitHub itself), and GitHub will automatically print out a pull request URL in the log output:
$ git push origin example-feature
(...)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
remote:
remote: Create a pull request for 'example-feature' on GitHub:
remote: https://github.com///pull/new/example-feature
remote:
Open that URL in your browser, fill out the pull request title and description, and click create. Your team members will now be able to review your code changes:
You should set up commit hooks to run automated tests for every commit you push to your version control system. The most common way to do this is to use a continuous integration (CI) server, such as Jenkins, CircleCI, or GitHub Actions. Most popular CI servers have integrations built in specifically for GitHub, so not only does every commit automatically run tests, but the output of those tests shows up in the pull request itself:
You can see in the image above that CircleCI has run unit tests, integration tests, end-to-end tests, and some static analysis checks (in the form of security vulnerability scanning using a tool called snyk) against the code in the branch, and everything passed.
Your team members should review your code changes, looking for potential bugs, enforcing coding guidelines (more on this later), checking that the existing tests passed, and ensuring that you’ve added tests for any new behavior. If everything looks good, your code can be merged into the main
branch.
The next step is to release the code. If you’re using immutable infrastructure practices, releasing application code means packaging that code into a new, immutable, versioned artifact. Depending on how you want to package and deploy your application, the artifact can be a new Docker image, a new virtual machine image (e.g., new AMI), a new .jar file, a new .tar file, etc. Whatever format you pick, make sure the artifact is immutable (i.e., you never change it) and that it has a unique version number (so you can distinguish this artifact from all of the others).
For example, if you are packaging your application using Docker, you can store the version number in a Docker tag. You could use the ID of the commit (the sha1 hash) as the tag so that you can map the Docker image you’re deploying back to the exact code it contains:
$ commit_id=$(git rev-parse HEAD)
$ docker build -t brikis98/ruby-web-server:$commit_id .
The preceding code will build a new Docker image called brikis98/ruby-web-server
and tag it with the ID of the most recent commit, which will look something like 92e3c6380ba6d1e8c9134452ab6e26154e6ad849
. Later on, if you’re debugging an issue in a Docker image, you can see the exact code it contains by checking out the commit ID the Docker image has as a tag:
$ git checkout 92e3c6380ba6d1e8c9134452ab6e26154e6ad849
HEAD is now at 92e3c63 Updated Hello, World text
One downside to commit IDs is that they aren’t very readable or memorable. An alternative is to create a Git tag:
$ git tag -a "v0.0.4" -m "Update Hello, World text"
$ git push --follow-tags
A tag is a pointer to a specific Git commit but with a friendlier name. You can use this Git tag on your Docker images:
$ git_tag=$(git describe --tags)
$ docker build -t brikis98/ruby-web-server:$git_tag .
Thus, when you’re debugging, check out the code at a specific tag:
$ git checkout v0.0.4
Note: checking out 'v0.0.4'.
(...)
HEAD is now at 92e3c63 Updated Hello, World text
Now that you have a versioned artifact, it’s time to deploy it. There are many different ways to deploy application code, depending on the type of application, how you package it, how you wish to run it, your architecture, what tools you’re using, and so on. Here are a few of the key considerations:
Deployment tooling
There are many different tools that you can use to deploy your application, depending on how you package it and how you want to run it. Here are a few examples:
- Terraform: for some kinds of changes, Terraform itself is the deployment tool; for example, you can roll out a new VM image by updating the ami parameter in your Terraform code and running terraform apply.
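For instance, here’s a minimal sketch of what that kind of Terraform-driven deployment might look like (the resource and AMI ID below are illustrative, not taken from the sample repo):

resource "aws_instance" "app" {
  # To roll out a new release, point this at the newly built image
  # and run 'terraform apply'
  ami           = "ami-0fb653ca2d3203ac1"
  instance_type = "t2.micro"
}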
Deployment strategies
There are a number of different strategies that you can use for application deployment, depending on your requirements. Suppose that you have five copies of the old version of your app running, and you want to roll out a new version. Here are a few of the most common strategies you can use:
Deployment server
You should run the deployment from a CI server and not from a developer’s computer. This has the following benefits:
Promote artifacts across environments
If you’re using immutable infrastructure practices, the way to roll out new changes is to promote the exact same versioned artifact from one environment to another. For example, if you have dev, staging, and production environments, to roll out v0.0.4
of your app, you would do the following:
1. Deploy v0.0.4 of the app to dev.
2. Run your manual and automated tests in dev.
3. If v0.0.4 works well in dev, repeat steps 1 and 2 to deploy v0.0.4 to stage (this is known as promoting the artifact).
4. If v0.0.4 works well in stage, repeat steps 1 and 2 again to promote v0.0.4 to prod.

Because you’re running the exact same artifact everywhere, there’s a good chance that if it works in one environment, it will work in another. And if you do hit any issues, you can roll back anytime by deploying an older artifact version.
Now that you’ve seen the workflow for deploying application code, it’s time to dive into the workflow for deploying infrastructure code. In this section, when I say “infrastructure code,” I mean code written with any IaC tool (including, of course, Terraform) that you can use to deploy arbitrary infrastructure changes beyond a single application: for example, deploying databases, load balancers, network configurations, DNS settings, and so on.
Here’s what the infrastructure code workflow looks like:
On the surface, it looks identical to the application workflow, but under the hood, there are important differences. Deploying infrastructure code changes is more complicated, and the techniques are not as well understood, so being able to relate each step back to the analogous step from the application code workflow should make it easier to follow along. Let’s dive in.
Just as with your application code, all of your infrastructure code should be in version control. This means that you’ll use git clone to check out your code, just as before. However, version control for infrastructure code has a few extra requirements:
Live repo and modules repo
As discussed in How to create reusable infrastructure with Terraform modules, you will typically want at least two separate version control repositories for your Terraform code: one repo for modules and one repo for live infrastructure. The repository for modules is where you create your reusable, versioned modules, such as all the modules you built in the previous parts of this blog post series (the ASG module, the RDS module, etc). The repository for live infrastructure defines the live infrastructure you’ve deployed in each environment (dev
, stage
, prod
, etc.).
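To make the split concrete, a typical file in the live repo often contains little more than a module block pointing at a tagged release in the modules repo. Here’s a rough sketch (the URL follows the same placeholder style used later in this post, and the module path, version, and inputs are illustrative):

# live/stage/data-stores/mysql/main.tf (sketch)
module "mysql" {
  # Pin to an explicit, versioned release of the shared module
  source = "github.com//modules//data-stores/mysql?ref=v0.0.7"

  # Environment-specific inputs (other required inputs omitted)
  db_name = "example_stage"
}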
One pattern that works well is to have one infrastructure team in your company that specializes in creating reusable, robust, production-grade modules. This team can create remarkable leverage for your company by building a library of production-grade infrastructure modules; that is, each module has a composable API, is thoroughly documented (including executable documentation in the examples folder), has a comprehensive suite of automated tests, is versioned, and implements all of your company’s requirements from the production-grade infrastructure checklist (i.e., security, compliance, scalability, high availability, monitoring, and so on).
If you build such a library (or you buy one off the shelf), all the other teams at your company will be able to consume these modules, a bit like a service catalog, to deploy and manage their own infrastructure, without (a) each team having to spend months assembling that infrastructure from scratch or (b) the Ops team becoming a bottleneck because it must deploy and manage the infrastructure for every team. Instead, the Ops team can spend most of its time writing infrastructure code, and all of the other teams will be able to work independently, using these modules to get themselves up and running. And because every team is using the same canonical modules under the hood, as the company grows and requirements change, the Ops team can push out new versions of the modules to all teams, ensuring everything stays consistent and maintainable.
Or it will be maintainable, as long as you follow the Golden Rule of Terraform.
The Golden Rule of Terraform
Here’s a quick way to check the health of your Terraform code: go into your live repository, pick several folders at random, and run terraform plan
in each one. If the output is always “no changes,” that’s great, because it means that your infrastructure code matches what’s actually deployed. If the output sometimes shows a small diff, and you hear the occasional excuse from your team members (“Oh, right, I tweaked that one thing by hand and forgot to update the code”), your code doesn’t match reality, and you might soon be in trouble. If terraform plan
fails completely with weird errors, or every plan shows a gigantic diff, your Terraform code has no relation at all to reality and is likely useless.
The gold standard, or what you’re really aiming for, is what I call The Golden Rule of Terraform:
The main branch of the live repository should be a 1:1 representation of what’s actually deployed in production.
Let’s break this sentence down, starting at the end and working our way back:
“The main branch”: to know what’s actually deployed in production, you should only have to look at a single branch: main. This means that all changes that affect the production environment should go directly into main (you can create a separate branch, but only to create a pull request with the intention of merging that branch into main), and you should run terraform apply only for the production environment against the main branch. In the next section, I’ll explain why.

The trouble with branches
In How to manage Terraform state, you saw that you can use the locking mechanisms built into Terraform backends to ensure that if two team members are running terraform apply
at the same time on the same set of Terraform configurations, their changes do not overwrite each other. Unfortunately, this only solves part of the problem. Even though Terraform backends provide locking for Terraform state, they cannot help you with locking at the level of the Terraform code itself. In particular, if two team members are deploying the same code to the same environment but from different branches, you’ll run into conflicts that locking can’t prevent.
For example, suppose that one of your team members, Anna, makes some changes to the Terraform configurations for an app called “foo” that consists of a single EC2 Instance:
resource "aws_instance" "foo" {
ami = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
}
The app is getting a lot of traffic, so Anna decides to change the instance_type
from t2.micro
to t2.medium
:
resource "aws_instance" "foo" {
ami = data.aws_ami.ubuntu.id
instance_type = "t2.medium"
}
Here’s what Anna sees when she runs terraform plan
:
$ terraform plan
(...)
Terraform will perform the following actions:
# aws_instance.foo will be updated in-place
~ resource "aws_instance" "foo" {
ami = "ami-0fb653ca2d3203ac1"
id = "i-096430d595c80cb53"
instance_state = "running"
~ instance_type = "t2.micro" -> "t2.medium"
(...)
}
Plan: 0 to add, 1 to change, 0 to destroy.
Those changes look good, so she deploys them to staging.
In the meantime, Bill comes along and also starts making changes to the Terraform configurations for the same app but on a different branch. All Bill wants to do is to add a tag to the app:
resource "aws_instance" "foo" {
ami = data.aws_ami.ubuntu.id
instance_type = "t2.micro"
tags = {
Name = "foo"
}
}
Note that Anna’s changes are already deployed in staging, but because they are on a different branch, Bill’s code still has the instance_type
set to the old value of t2.micro
. Here’s what Bill sees when he runs the plan
command (the following log output is truncated for readability):
$ terraform plan
(...)
Terraform will perform the following actions:
# aws_instance.foo will be updated in-place
~ resource "aws_instance" "foo" {
ami = "ami-0fb653ca2d3203ac1"
id = "i-096430d595c80cb53"
instance_state = "running"
~ instance_type = "t2.medium" -> "t2.micro"
+ tags = {
+ "Name" = "foo"
}
(...)
}
Plan: 0 to add, 1 to change, 0 to destroy.
Uh oh, he’s about to undo Anna’s instance_type
change! If Anna is still testing in staging, she’ll be very confused when the server suddenly redeploys and starts behaving differently. The good news is that if Bill diligently reads the plan
output, he can spot the error before it affects Anna. Nevertheless, the point of the example is to highlight what happens when you deploy changes to a shared environment from different branches.
The locking from Terraform backends doesn’t help here, because the conflict has nothing to do with concurrent modifications to the state file; Bill and Anna might be applying their changes weeks apart, and the problem would be the same. The underlying cause is that branching and Terraform are a bad combination. Terraform is implicitly a mapping from Terraform code to infrastructure deployed in the real world. Because there’s only one real world, it doesn’t make much sense to have multiple branches of your Terraform code. So for any shared environment (e.g., stage, prod), always deploy from a single branch.
Now that you’ve got the code checked out onto your computer, the next step is to run it. The gotcha with Terraform is that, unlike application code, you don’t have “localhost”; for example, you can’t deploy an AWS ASG onto your own laptop. The only way to manually test Terraform code is to run it in a sandbox environment, such as an AWS account dedicated for developers (or better yet, one AWS account for each developer).
Once you have a sandbox environment, to test manually, you run terraform apply
:
$ terraform apply
(...)
Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
Outputs:
alb_dns_name = "hello-world-stage-xxx.us-east-2.elb.amazonaws.com"
And you verify the deployed infrastructure works by using tools such as curl
:
$ curl hello-world-stage-xxx.us-east-2.elb.amazonaws.com
Hello, World
You can write automated tests for your Terraform code using Terratest and Go, executing those tests by running go test
in a sandbox account dedicated to testing:
$ go test -v -timeout 30m
(...)
PASS
ok terraform-up-and-running 229.492s
Now that you can run your Terraform code, you can iteratively begin to make changes, just as with application code. Every time you make a change, you can rerun terraform apply
to deploy those changes and rerun curl
to see whether those changes worked:
$ curl hello-world-stage-xxx.us-east-2.elb.amazonaws.com
Hello, World v2
Or you can rerun go test
to make sure the tests are still passing:
$ go test -v -timeout 30m
(...)
PASS
ok terraform-up-and-running 229.492s
The only difference from application code is that infrastructure code tests typically take longer, so you’ll want to put more thought into how you can shorten the test cycle so that you can get feedback on your changes as quickly as possible. Check out Terratest’s test stages to see how to only re-run specific parts of your tests to dramatically shorten the feedback loop.
As you make changes, be sure to regularly commit your work:
$ git commit -m "Updated Hello, World text"
After your code is working the way you expect, you can create a pull request to get your code reviewed, just as you would with application code. Your team will review your code changes, looking for bugs as well as enforcing coding guidelines. Whenever you’re writing code as a team, regardless of what type of code you’re writing, you should define guidelines for everyone to follow. One of my favorite definitions of “clean code” comes from an interview I did with Nick Dellamaggiore for my earlier book Hello, Startup:
If I look at a single file and it’s written by 10 different engineers, it should be almost indistinguishable which part was written by which person. To me, that is clean code. The way you do that is through code reviews and publishing your style guide, your patterns, and your language idioms. Once you learn them, everybody is way more productive because you all know how to write code the same way. At that point, it’s more about what you’re writing and not how you write it. — Nick Dellamaggiore, Infrastructure Lead at Coursera
The Terraform coding guidelines that make sense for each team will be different, so here, I’ll list a few of the common ones that are useful for most teams:
Documentation
In some sense, Terraform code is, in and of itself, a form of documentation. It describes in a simple language exactly what infrastructure you deployed and how that infrastructure is configured. However, there is no such thing as self-documenting code. Although well-written code can tell you what it does, no programming language that I’m aware of (including Terraform) can tell you why it does it.
This is why all software, including IaC, needs documentation beyond the code itself. There are several types of documentation that you can consider and have your team members require as part of code reviews:
- Code documentation: within the code itself, you can use comments as a form of documentation (Terraform treats any text that starts with a hash, #, as a comment). Don’t use comments to explain what the code does; the code should do that itself. Only include comments to offer information that can’t be expressed in code, such as how the code is meant to be used or why the code uses a particular design choice. Terraform also allows every input and output variable to declare a description parameter, which is a great place to describe how that variable should be used.
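For example, here’s what a documented input variable might look like (the variable itself is just an illustration):

variable "instance_type" {
  description = "The type of EC2 Instance to run (e.g., t2.micro). Must be a type available in the chosen AWS Region."
  type        = string
  default     = "t2.micro"
}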
Automated Tests
Chapter 9 of Terraform: Up & Running is entirely focused on testing Terraform code (see also the talk Automated Testing for Terraform, Docker, Packer, Kubernetes, and More), so I won’t repeat any of that here, other than to say that infrastructure code without tests is broken. Therefore, one of the most important comments you can make in any code review is “How did you test this?”
File Layout
Your team should define conventions for where Terraform code is stored and the file layout you use. Because the file layout for Terraform also determines the way Terraform state is stored, you should be especially mindful of how file layout affects your ability to provide isolation guarantees, such as ensuring that changes in a staging environment cannot accidentally cause problems in production. In a code review, you might want to enforce the file layout described in How to manage Terraform state, which provides isolation between different environments (e.g., stage and prod) and different components (e.g., a network topology for the entire environment and a single app within that environment).
Style Guide
Every team should enforce a set of conventions about code style, including the use of whitespace, newlines, indentation, curly braces, variable naming, and so on. Although programmers love to debate spaces versus tabs and where the curly brace should go, the more important thing is that you are consistent throughout your codebase.
Terraform has a built-in fmt
command that can reformat code to a consistent style automatically:
$ terraform fmt
I recommend running this command as part of a commit hook to ensure that all code committed to version control uses a consistent style.
Just as with application code, your infrastructure code should have commit hooks that kick off automated tests in a CI server after every commit and show the results of those tests in the pull request. Chapter 9 of Terraform: Up & Running and the talk Automated Testing for Terraform, Docker, Packer, Kubernetes, and More cover how to write automated tests for infrastructure code. There’s one other critical type of test you should run: terraform plan
. The rule here is simple:
Always run plan before apply.
Terraform shows the plan
output automatically when you run apply
, so what this rule really means is that you should always pause and read the plan
output! You’d be amazed at the type of errors you can catch by taking 30 seconds to scan the “diff” you get as an output. A great way to encourage this behavior is by integrating plan
into your code review flow. For example, Atlantis is an open source tool that automatically runs terraform plan
on commits and adds the plan
output to pull requests as a comment:
Terraform Cloud and Terraform Enterprise, HashiCorp’s paid tools, both support running plan
automatically on pull requests as well.
After your team members have had a chance to review the code changes and plan
output and all the tests have passed, you can merge your changes into the main
branch and release the code. Similar to application code, you can use Git tags to create a versioned release:
$ git tag -a "v0.0.6" -m "Updated hello-world-example text"
$ git push --follow-tags
Whereas with application code you often have a separate artifact to deploy, such as a Docker image or VM image, Terraform natively supports downloading code from Git, so the repository at a specific tag is the immutable, versioned artifact you will be deploying.
Now that you have an immutable, versioned artifact, it’s time to deploy it. Here are a few of the key considerations for deploying Terraform code:
Deployment tooling
When deploying Terraform code, Terraform itself is the main tool that you use. However, there are a few other tools that you might find useful:
- Atlantis: the open source tool you saw earlier not only adds the plan output to your pull requests but also allows you to trigger a terraform apply when you add a special comment to your pull request. Although this provides a convenient web interface for Terraform deployments, be aware that it doesn’t support versioning, which can make maintenance and debugging for larger projects more difficult.
- Terraform Cloud and Terraform Enterprise: HashiCorp’s paid tools can run terraform plan and terraform apply as well as manage variables, secrets, and access permissions.

Deployment strategies
For most types of infrastructure changes, Terraform doesn’t offer any built-in deployment strategies: for example, there’s no way to do a blue-green deployment for a VPC change, and there’s no way to feature toggle a database change. You’re essentially limited to terraform apply
, which either works or it doesn’t. A small subset of changes do support deployment strategies, such as the zero-downtime rolling deployment with ASGs using Instance Refresh, but these are the exceptions and not the norm.
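For example, here’s a rough sketch of what opting in to Instance Refresh looks like on an ASG; the resource names, subnet variable, and settings below are illustrative rather than taken from the sample code:

variable "subnet_ids" {
  description = "The subnet IDs to deploy into (hypothetical input)"
  type        = list(string)
}

resource "aws_launch_template" "example" {
  image_id      = "ami-0fb653ca2d3203ac1"
  instance_type = "t2.micro"
}

resource "aws_autoscaling_group" "example" {
  min_size            = 2
  max_size            = 2
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = aws_launch_template.example.id
    version = aws_launch_template.example.latest_version
  }

  # Replace instances in rolling batches whenever the launch template
  # changes, instead of leaving the old instances running
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 90
    }
  }
}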
Due to these limitations, it’s critical to take into account what happens when a deployment goes wrong. With an application deployment, many types of errors are caught by the deployment strategy; for example, if the app fails to pass health checks, the load balancer will never send it live traffic, so users won’t be affected. Moreover, the rolling deployment or blue-green deployment strategy can automatically roll back to the previous version of the app in case of errors.
Terraform, on the other hand, does not roll back automatically in case of errors. In part, that’s because there is no reasonable way to roll back many types of infrastructure changes: for example, if an app deployment failed, it’s almost always safe to roll back to an older version of the app, but if the Terraform change you were deploying failed, and that change was to delete a database or terminate a server, you can’t easily roll that back!
Therefore, you should expect errors to happen and ensure you have a first-class way to deal with them:
- Retries: certain types of Terraform errors are transient and go away if you simply rerun terraform apply. The deployment tooling you use with Terraform should detect these known errors and automatically retry after a brief pause. Terragrunt has automatic retries on known errors as a built-in feature.
- Terraform state errors: occasionally, Terraform will fail to save the state after running terraform apply. For example, if you lose internet connectivity partway through an apply, not only will the apply fail, but Terraform won’t be able to write the updated state file to your remote backend (e.g., to Amazon S3). In these cases, Terraform will save the state file on disk in a file called errored.tfstate. Make sure that your CI server does not delete these files (e.g., as part of cleaning up the workspace after a build)! If you can still access this file after a failed deployment, as soon as internet connectivity is restored, you can push this file to your remote backend (e.g., to S3) using the state push command so that the state information isn’t lost: terraform state push errored.tfstate.
- Errors releasing locks: occasionally, Terraform will fail to release a lock; for example, if your CI server crashes in the middle of running terraform apply, the state will remain permanently locked. Anyone else who tries to run apply on the same module will get an error message saying the state is locked and showing the ID of the lock. If you’re absolutely sure this is an accidentally leftover lock, you can forcibly release it using the force-unlock command, passing it the ID of the lock from that error message: terraform force-unlock <LOCK_ID>.

Deployment server
Just as with your application code, all of your infrastructure code changes should be applied from a CI server and not from a developer’s computer. You can run terraform from Jenkins, CircleCI, GitHub Actions, Terraform Cloud, Terraform Enterprise, Atlantis, or any other reasonably secure automated platform. This gives you the same benefits as with application code: it forces you to fully automate your deployment process, it ensures deployment always happens from a consistent environment, and it gives you better control over who has permissions to access production environments.
That said, permissions to deploy infrastructure code are quite a bit trickier than for application code. With application code, you can usually give your CI server a minimal, fixed set of permissions to deploy your apps; for example, to deploy to an ASG, the CI server typically needs only a few specific ec2
and autoscaling
permissions. However, to be able to deploy arbitrary infrastructure code changes (e.g., your Terraform code might try to deploy a database or a VPC or an entirely new AWS account), the CI server needs arbitrary permissions — that is, admin permissions. And that’s a problem.
The reason it’s a problem is that CI servers are (a) notoriously hard to secure, (b) accessible to all the developers at your company, and (c) used to execute arbitrary code. Adding permanent admin permissions to this mix is just asking for trouble! You’d effectively be giving every single person on your team admin permissions and turning your CI server into a very high-value target for attackers.
There are a few things you can do to minimize this risk:
- Enforce an approval workflow: require that someone other than the person who made the change reviews and approves the plan output, as one final check that things look OK before apply runs. This ensures that every deployment, code change, and plan output has had at least two sets of eyes on it.
- Don’t give the CI server itself permanent admin credentials. Instead, keep the admin credentials in a separate, isolated worker that can be triggered only to run specific commands (e.g., only terraform plan and terraform apply), in specific repos (e.g., your live repo), in specific branches (e.g., the main branch), and so on. This way, even if an attacker gets access to your CI server, they still won’t have access to the admin credentials, and all they can do is request a deployment on some code that’s already in your version control system, which isn’t nearly as much of a catastrophe as leaking the admin credentials fully. Check out Gruntwork Pipelines for a real-world example of this worker pattern.

Promote artifacts across environments
Just as with application artifacts, you’ll want to promote your immutable, versioned infrastructure artifacts from environment to environment: for example, promote v0.0.6
from dev to stage to prod. The rule here is also simple:
Always test Terraform changes in pre-prod before prod.
Because everything is automated with Terraform anyway, it doesn’t cost you much extra effort to try a change in staging before production, but it will catch a huge number of errors. Testing in pre-prod is especially important because, as mentioned earlier in this post, Terraform does not roll back changes in case of errors. If you run terraform apply
and something goes wrong, you must fix it yourself. This is easier and less stressful to do if you catch the error in a pre-prod environment rather than prod.
The process for promoting Terraform code across environments is similar to the process of promoting application artifacts, except there is an extra approval step, as mentioned in the previous section, where you run terraform plan
and have someone manually review the output and approve the deployment. This step isn’t usually necessary for application deployments, as most application deployments are similar and relatively low risk. However, every infrastructure deployment can be completely different, and mistakes can be very costly (e.g., deleting a database), so having one last chance to look at the plan
output and review it is well worth the time.
Here’s what the process looks like for promoting, for instance, v0.0.6
of a Terraform module across the dev, stage, and prod environments:
1. Update the dev environment to v0.0.6, and run terraform plan.
2. Prompt someone to review and approve the plan; for example, send an automated message via Slack.
3. If the plan is approved, deploy v0.0.6 to dev by running terraform apply.
4. Run your manual and automated tests in dev.
5. If v0.0.6 works well in dev, repeat steps 1–4 to promote v0.0.6 to staging.
6. If v0.0.6 works well in staging, repeat steps 1–4 again to promote v0.0.6 to production.

One important issue to deal with is all the code duplication between environments in the live repo. For example, consider this live repo:
This live repo has a large number of regions, and within each region, a large number of modules, most of which are copied and pasted. Sure, each module has a main.tf that references a module in your modules repo, so it’s not as much copying and pasting as it could be, but even if all you’re doing is instantiating a single module, there is still a large amount of boilerplate that needs to be duplicated between each environment:
- The provider configuration
- The backend configuration
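For instance, the boilerplate that each environment ends up repeating looks something like this (the bucket and DynamoDB table names are left blank here, matching the placeholders used later in this post, and the key path is illustrative):

provider "aws" {
  region = "us-east-2"
}

terraform {
  backend "s3" {
    # These values must be repeated, and kept in sync, in every module
    bucket         = ""
    key            = "stage/data-stores/mysql/terraform.tfstate"
    region         = "us-east-2"
    encrypt        = true
    dynamodb_table = ""
  }
}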
This can add up to dozens or hundreds of lines of mostly identical code in each module, copied and pasted into each environment. To make this code more DRY, and to make it easier to promote Terraform code across environments, you can use the open source tool I mentioned earlier, Terragrunt. Terragrunt is a thin wrapper for Terraform, which means that you run all of the standard terraform commands, except you use terragrunt
as the binary:
$ terragrunt plan
$ terragrunt apply
$ terragrunt output
Terragrunt will run Terraform with the command you specify, but based on configuration you specify in a terragrunt.hcl file, you can get some extra behavior. In particular, Terragrunt allows you to define all of your Terraform code exactly once in the modules repo, whereas in the live repo, you will have only terragrunt.hcl files that provide a DRY way to configure and deploy each module in each environment. This will result in a live repo with far fewer files and lines of code:
To get started, install Terragrunt by following the instructions on the Terragrunt website. Next, add a provider configuration to modules/data-stores/mysql/main.tf and modules/services/hello-world-app/main.tf:
provider "aws" {
region = "us-east-2"
}
Commit these changes and release a new version of your modules repo:
$ git add modules/data-stores/mysql/main.tf
$ git add modules/services/hello-world-app/main.tf
$ git commit -m "Update mysql and hello-world-app for Terragrunt"
$ git tag -a "v0.0.7" -m "Update Hello, World text"
$ git push --follow-tags
Now, head over to the live repo, and delete all the .tf files. You’re going to replace all that copied and pasted Terraform code with a single terragrunt.hcl file for each module. For example, here’s terragrunt.hcl for live/stage/data-stores/mysql/terragrunt.hcl:
terraform {
source = "github.com//modules//data-stores/mysql?ref=v0.0.7"
}
inputs = {
db_name = "example_stage"
# Set the username using the TF_VAR_db_username env var
# Set the password using the TF_VAR_db_password env var
}
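For reference, those inputs correspond to input variables declared in the module itself. Here’s a sketch of what modules/data-stores/mysql/variables.tf might declare; the exact variables are an assumption based on the inputs and TF_VAR comments above:

variable "db_name" {
  description = "The name to use for the database"
  type        = string
}

variable "db_username" {
  description = "The username for the database; set via the TF_VAR_db_username environment variable"
  type        = string
  sensitive   = true
}

variable "db_password" {
  description = "The password for the database; set via the TF_VAR_db_password environment variable"
  type        = string
  sensitive   = true
}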
As you can see, terragrunt.hcl files use the same HashiCorp Configuration Language (HCL) syntax as Terraform itself. When you run terragrunt apply
and it finds the source
parameter in a terragrunt.hcl file, Terragrunt will do the following:
1. Check out the repo specified in source to a temporary folder. This supports the same URL syntax as the source parameter of Terraform modules, so you can use local file paths, Git URLs, versioned Git URLs (with a ref parameter, as in the preceding example), and so on.
2. Run terraform apply in the temporary folder, passing it the input variables that you’ve specified in the inputs = { … } block.

The benefit of this approach is that the code in the live repo is reduced to just a single terragrunt.hcl file per module, which contains only a pointer to the module to use (at a specific version), plus the input variables to set for that specific environment. That’s about as DRY as you can get.
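Since source also accepts local file paths, one handy trick while iterating is to point it at a local checkout of your modules repo so you don’t have to tag a new release for every little change (the relative path below is hypothetical):

terraform {
  # Use a local checkout of the modules repo while iterating;
  # switch back to a versioned Git URL before merging
  source = "../../../../modules/data-stores/mysql"
}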
Terragrunt also helps you keep your backend
configuration DRY. Instead of having to define the bucket
, key
, dynamodb_table
, and so on in every single module, you can define it in a single terragrunt.hcl file per environment. For example, create the following in live/stage/terragrunt.hcl:
remote_state {
backend = "s3"
generate = {
path = "backend.tf"
if_exists = "overwrite"
}
config = {
bucket = ""
key = "${path_relative_to_include()}/terraform.tfstate"
region = "us-east-2"
encrypt = true
dynamodb_table = ""
}
}
From this one remote_state
block, Terragrunt can generate the backend
configuration dynamically for each of your modules, writing the configuration in config to the file specified via the generate
param. Note that the key
value in config uses a Terragrunt built-in function called path_relative_to_include()
, which will return the relative path between this root terragrunt.hcl file and any child module that includes it. For example, to include this root file in live/stage/data-stores/mysql/terragrunt.hcl, add an include
block:
terraform {
source = "github.com//modules//data-stores/mysql?ref=v0.0.7"
}
include {
path = find_in_parent_folders()
}
inputs = {
db_name = "example_stage"
# Set the username using the TF_VAR_db_username env var
# Set the password using the TF_VAR_db_password env var
}
The include
block finds the root terragrunt.hcl using the Terragrunt built-in function find_in_parent_folders()
, automatically inheriting all the settings from that parent file, including the remote_state
configuration. The result is that this mysql
module will use all the same backend
settings as the root file, and the key
value will automatically resolve to data-stores/mysql/terraform.tfstate. This means that your Terraform state will be stored in the same folder structure as your live repo, which will make it easy to know which module produced which state files.
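In other words, for this module, Terragrunt will generate a backend.tf file that looks roughly like the following (the bucket and table are blank here only because they are blank in the root terragrunt.hcl shown above):

# backend.tf, generated by Terragrunt for live/stage/data-stores/mysql
terraform {
  backend "s3" {
    bucket         = ""
    key            = "data-stores/mysql/terraform.tfstate"
    region         = "us-east-2"
    encrypt        = true
    dynamodb_table = ""
  }
}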
To deploy this module, run terragrunt apply
:
$ terragrunt apply --terragrunt-log-level debug
DEBU[0001] Reading Terragrunt config file at terragrunt.hcl
DEBU[0001] Included config live/stage/terragrunt.hcl
DEBU[0001] Downloading configurations into .terragrunt-cache
DEBU[0001] Generated file backend.tf
DEBU[0013] Running command: terraform init
(...)
Initializing the backend...
Successfully configured the backend "s3"! Terraform will automatically use this backend unless the backend configuration changes.
(...)
DEBU[0024] Running command: terraform apply
(...)
Terraform will perform the following actions:
(...)
Plan: 5 to add, 0 to change, 0 to destroy.
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
(...)
Apply complete! Resources: 5 added, 0 changed, 0 destroyed.
Normally, Terragrunt only shows the log output from Terraform itself, but as I included --terragrunt-log-level debug,
, the preceding output shows what Terragrunt does under the hood:
1. Read the terragrunt.hcl file in the folder where you ran apply.
2. Download the Terraform code specified in the source URL into the .terragrunt-cache scratch folder.
3. Generate a backend.tf file with the backend configuration.
4. Detect that init has not been run and run it automatically (Terragrunt will even create your S3 bucket and DynamoDB table automatically if they don’t already exist).
5. Run apply to deploy changes.

Not bad for a couple of tiny terragrunt.hcl files!
You can now deploy the hello-world-app
module in staging by adding live/stage/services/hello-world-app/terragrunt.hcl and running terragrunt apply
:
terraform {
source = "github.com//modules//services/hello-world-app?ref=v0.0.7"
}
include {
path = find_in_parent_folders()
}
dependency "mysql" {
config_path = "../../data-stores/mysql"
}
inputs = {
environment = "stage"
ami = "ami-0fb653ca2d3203ac1"
min_size = 2
max_size = 2
enable_autoscaling = false
mysql_config = dependency.mysql.outputs
}
This terragrunt.hcl file uses the source
URL and inputs
just as you saw before and uses include
to pull in the settings from the root terragrunt.hcl file, so it will inherit the same backend
settings, except for the key
, which will be automatically set to services/hello-world-app/terraform.tfstate, just as you’d expect. The one new thing in this terragrunt.hcl file is the dependency
block:
dependency "mysql" {
config_path = "../../data-stores/mysql"
}
This is a Terragrunt feature that can be used to automatically read the output variables of another Terragrunt module, so you can pass them as input variables to the current module, as follows:
mysql_config = dependency.mysql.outputs
In other words, dependency
blocks are an alternative to using terraform_remote_state
data sources to pass data between modules. While terraform_remote_state
data sources have the advantage of being native to Terraform, the drawback is that they make your modules more tightly coupled together, as each module needs to know how other modules store state. Using Terragrunt dependency
blocks allows your modules to expose generic inputs like mysql_config
and vpc_id
, instead of using data sources, which makes the modules less tightly coupled and easier to test and reuse.
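For comparison, here’s roughly what the same wiring looks like with a terraform_remote_state data source (the bucket is left blank, as elsewhere in this post); notice how the consumer has to know exactly where the mysql module keeps its state:

data "terraform_remote_state" "mysql" {
  backend = "s3"

  config = {
    bucket = ""
    key    = "stage/data-stores/mysql/terraform.tfstate"
    region = "us-east-2"
  }
}

# The consumer then reads the mysql module's outputs through the data
# source, e.g., data.terraform_remote_state.mysql.outputs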
Once you’ve got hello-world-app
working in staging, create analogous terragrunt.hcl files in live/prod and promote the exact same v0.0.7
artifact to production by running terragrunt apply
in each module.
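For example, live/prod/services/hello-world-app/terragrunt.hcl might look nearly identical to the staging version, pointing at the exact same v0.0.7 release and differing only in its inputs (the prod values below are illustrative):

terraform {
  # Same module version as staging; only the inputs differ
  source = "github.com//modules//services/hello-world-app?ref=v0.0.7"
}

include {
  path = find_in_parent_folders()
}

dependency "mysql" {
  config_path = "../../data-stores/mysql"
}

inputs = {
  environment        = "prod"
  ami                = "ami-0fb653ca2d3203ac1"
  min_size           = 2
  max_size           = 10
  enable_autoscaling = true
  mysql_config       = dependency.mysql.outputs
}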
You’ve now seen how to take both application code and infrastructure code from development all the way through to production. The following table shows an overview of the two workflows side by side:
If you follow this process, you will be able to run application and infrastructure code in dev, test it, review it, package it into versioned, immutable artifacts, and promote those artifacts from environment to environment:
If you’ve made it to this point in the Comprehensive Guide to Terraform series, you now know just about everything you need to use Terraform in the real world:
Thank you for reading!
For an expanded version of this blog post series, pick up a copy of the book Terraform: Up & Running (3rd edition available now!). If you need help with Terraform, DevOps practices, or AWS at your company, feel free to reach out to us at Gruntwork.