Once a month, we send out a newsletter to all Gruntwork customers that describes all the updates we’ve made in the last month, news in the DevOps industry, and important security updates. Note that many of the links below go to private repos in the Gruntwork Infrastructure as Code Library and Reference Architecture that are only accessible to customers.
Hello Grunts,
In the last month, we integrated Kubernetes into the Gruntwork Reference Architecture, wrote a new blog post series on how to build an end-to-end production-grade architecture on AWS, updated Terratest, Terragrunt, and many of our modules to work with Terraform 0.12 (which is now officially out!), updated the Reference Architecture to use OpenJDK, fixed a security bug in the SQS module, and much more.
As always, if you have any questions or need help, email us at support@gruntwork.io!
Motivation: For the last few years, the Gruntwork Reference Architecture has supported Auto Scaling Groups (ASGs) and EC2 Container Service (ECS) as the primary ways to run workloads. As Kubernetes has grown in popularity, we got steadily more and more requests to add support for it as a first-class offering. Today, we’re excited to announce that we can now offer Kubernetes as a new option for running workloads in the Gruntwork Reference Architecture!
Solution: We’ve created a number of production-grade modules for running Kubernetes on AWS and integrated them into the Reference Architecture (including hooking them into monitoring, alerting, networking, CI/CD, and so on). Under the hood, we run the Kubernetes control plane on top of Amazon’s Elastic Kubernetes Service (EKS), so it’s a fully managed service. On top, we run Helm and Tiller to make it easy to deploy and manage workloads in your Kubernetes cluster. And in between, we’ve spent a lot of time configuring everything for high availability (see our Zero Downtime Server Updates For Your Kubernetes Cluster blog post series), scalability, security (including TLS auth, namespaces, and strict RBAC controls), and testing (see our blog post, Automated Testing for Kubernetes and Helm Charts using Terratest).
If you’re a Gruntwork customer, you can see an example of what the Reference Architecture integration looks like in our Acme Company examples below (and if you’re not a customer, sign up now to get access!):
Also check out the corresponding changes in the `infrastructure-live` repository to see how they are deployed.
What to do about it: We can deploy a Kubernetes-based Reference Architecture for you in about one day as part of the Gruntwork Subscription. Alternatively, if you’re already a subscriber, check out the links in the previous section to learn how to deploy Kubernetes into your existing `infrastructure-modules` and `infrastructure-live` repos. Let us know how it works for you, and if you have any comments or questions, contact us at support@gruntwork.io!
Motivation: Many people have asked us about the details of what it takes to go to production on AWS. We’ve captured these details in the Gruntwork Reference Architecture, but haven’t done a great job of explaining what those details include. Potential customers wanted to know the specific components of the architecture and how they were set up before purchasing. Existing customers wanted to know about some of the design choices we made.
Solution: We wrote a new blog post series, How to Build an End to End Production-Grade Architecture on AWS! This series is designed to build up to the Reference Architecture from the perspective of the various concerns you need to address when going to production on AWS. It covers both which infrastructure components to choose (e.g., Kubernetes, VPCs, KMS, Jenkins) and why those choices make sense.
What to do about it: Click on the links below and start reading!
Motivation: Terraform 0.12 final is now out (see the DevOps News section below), so we’ve been hard at work updating all of our modules and tooling to work with it.
Solution: Here are the latest updates:
Terragrunt v0.19.0 and above now supports Terraform 0.12! As a bonus, we’re now using HCL2 syntax with Terragrunt, which (a) makes your code cleaner and (b) allows you to use built-in Terragrunt functions everywhere in your Terragrunt configuration (see the sketch after this list). Make sure to read the migration guide for upgrade instructions. Also, check out Terragrunt: how to keep your Terraform code DRY and maintainable for an overview of how to use Terragrunt in 2019.
Terratest v0.15.8 and above now supports Terraform 0.12. See below for more info on Terratest updates.
Infrastructure as Code Library: we’ve updated a number of modules in the Infrastructure as Code Library (see the module version compatibility chart), but we still have quite a few more to go. Note that these are backwards incompatible releases, so the latest versions of our modules will no longer support Terraform 0.11.
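To make the new syntax concrete, here is a minimal sketch of a Terragrunt v0.19+ `terragrunt.hcl` file; the module source URL and the inputs shown are hypothetical placeholders:

```hcl
# terragrunt.hcl: Terragrunt v0.19+ uses native HCL2 syntax instead of the
# old terragrunt = { ... } block inside terraform.tfvars.

# Built-in functions like find_in_parent_folders() can now be used anywhere
# in the configuration.
include {
  path = find_in_parent_folders()
}

terraform {
  # Hypothetical module source, for illustration only.
  source = "git::git@github.com:acme/infrastructure-modules.git//networking/vpc?ref=v0.1.0"
}

# Inputs are passed to the underlying Terraform module as variables.
inputs = {
  aws_region = "us-east-1"
  cidr_block = "10.0.0.0/16"
}
```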
What to do about it: Since we are still in the process of upgrading all of our modules to work with Terraform 0.12, and since the upgrade process is backwards incompatible, for the time being, we recommend that you continue to use Terraform 0.11.x. Once everything is ready to go with Terraform 0.12.x, we’ll send out full upgrade instructions. We know you’re excited to upgrade, so we’re making every effort to have everything ready by the end of June, but take that as a good faith estimate, and be aware of the usual caveats about DevOps time estimates and yak shaving!
Motivation: We discovered that our `sqs` module had a very unsafe default configuration that allowed unauthenticated incoming requests from any IP.
Solution: We’ve updated the `sqs` module so that IP-based access is now disabled completely by default. Unless you intend to allow unauthenticated IP-based access, we strongly recommend updating to this new version. If you do need to allow IP-based access, set `apply_ip_queue_policy` to `true` and specify the IPs that should be able to access the queue via `allowed_cidr_blocks`.
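For those who do need IP-based access, here is a minimal sketch of the opt-in; the `modules/sqs` path, the queue name, and the CIDR block are placeholder assumptions, so check the repo for exact usage:

```hcl
# Hypothetical sketch: the module path, queue name, and CIDR block are
# placeholders; check the package-messaging repo for exact usage.
module "sqs" {
  source = "git::git@github.com:gruntwork-io/package-messaging.git//modules/sqs?ref=v0.2.0"

  name = "example-queue"

  # As of v0.2.0, IP-based access is disabled by default. Only opt in if you
  # really need unauthenticated access from specific IP ranges.
  apply_ip_queue_policy = true
  allowed_cidr_blocks   = ["203.0.113.0/24"]
}
```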
What to do about it: We strongly recommend updating to package-messaging v0.2.0 ASAP.
Motivation: Last month we discovered that Oracle changed their policies to require authentication for all downloads of their JDK, which broke our `install-oracle-jdk` module. As a solution, we introduced an `install-open-jdk` module and updated all our Java-based infrastructure packages (Kafka, ZooKeeper, ELK) to use it. However, customers were asking how to apply these changes to their Reference Architectures.
Solution: This month, we updated our Reference Architecture examples to point to the `install-open-jdk` module wherever they were referencing `install-oracle-jdk`. If you use Kafka, ZooKeeper, or ELK, you will want to apply the same update to your Packer templates.
What to do about it: Check out this commit for an example of the locations you will need to update.
Motivation: In our bash scripts for the Reference Architecture, we have been using `local readonly` to mark variables as locally scoped and immutable. However, this does not actually do what you would think it does: `local` treats `readonly` as just another variable name to declare, so `local readonly foo="bar"` creates an empty local variable named `readonly` plus a local, still-mutable `foo`.
Solution: We updated all our bash scripts in the Reference Architecture to replace the usage of `local readonly` with `local -r` (e.g., `local -r foo="bar"`). We also took care to mark read-only arrays using `local -r -a`.
What to do about it: Check out this commit for an example of the locations you will need to update.
Motivation: We needed to make a number of Terratest updates, including support for our ongoing work to upgrade to Terraform 0.12, improved GCP support, and new features to work around flaky tests.
Solution: We’ve made the following updates:
- You can now pass `-var-file` in the `packer` module to use JSON files as variable input. Check out our example usage.
- Fixed a bug in how we pass `-backend-config` parameters during `terraform init`: we were using a space as a separator, but Terraform requires using equals.
- Added `GetEc2InstanceIdsByFilters`, which provides an interface to retrieve EC2 instances by defining filters as a `map`. This release also introduced functionality for testing DynamoDB.
- Added support for `terragrunt`. Check out the release notes for more info.
- Updated the `terraform.OutputList` and `terraform.OutputMap` methods to work with Terraform 0.12.
- Added a `DoWithRetryableErrors` method that takes in a map of retryable errors and an action to execute, and if the action returns an error, retries it if the error or the action's stdout/stderr matches one of the retryable errors. Updated the `terraform` code to use this `DoWithRetryableErrors` method under the hood for retries. Added support for retryable errors for Packer builds via the new `RetryableErrors`, `MaxRetries`, and `TimeBetweenRetries` settings in `packer.Options`.
- `NewAuthenticatedSession` in `modules/aws` now supports returning credentials set by assuming a role. This can be done by setting the environment variable `TERRATEST_IAM_ROLE` to the ARN of the IAM role that should be assumed. When this env var is not set, it reverts to the old behavior of looking up credentials from the default location.
- `InitAndPlan` and `InitAndPlanE` now return the text output from `stdout` and `stderr`, instead of the exit code as an integer. The original versions that returned the exit code have been renamed to `InitAndPlanWithExitCode` and `InitAndPlanWithExitCodeE`. As part of this, we introduced `Plan` and `PlanE` functions, which can be used to just run `terraform plan`. These will return the `stdout` and `stderr` outputs.
- The `logging_service` and `monitoring_service` defaults were changed to use Stackdriver Kubernetes Engine Monitoring instead of the legacy Stackdriver support.
- Added a new `internal-load-balancer` module that can be used to create Internal TCP/UDP Load Balancers using internal forwarding rules.
- Added a new `helm revoke` command, which will remove access to Tiller from the specified RBAC entities.
- You can now set `create_resources = false` on the `logs/load-balancer-access-logs` module call to avoid creating the S3 bucket (see the sketch after this list).
- Updated the `logs/load-balancer-access-logs` module’s policy so that NLBs can write to the S3 bucket.
- Updated the `period` setting for the SQS alarm to use a minimum of 5 minutes rather than 1 minute, as SQS metrics are only collected once every 5 minutes, so trying to alert more often doesn't work.
- You can now configure the `iam-groups` module to not create the "access-all" group by setting the new input variable `should_create_iam_group_cross_account_access_all` to `false` (see the sketch after this list). This can help work around an AWS limitation where we exceed the max IAM policy length.
- You can now configure an SNS delivery topic for the `cloudtrail` module using a new `sns_delivery_topic` input variable.
- Moved the `elasticsearch-cluster-backup` and `elasticsearch-cluster-restore` modules over to using Node 8.10 as the runtime, as Node 6.10 has been deprecated. The runtime is now also configurable via the `lambda_runtime` input variable.
- The `attach-eni` script is now compatible with Ubuntu 18.04.
- `var.custom_tags` now propagates to EIP resources created in the VPCs.
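As a concrete illustration of two of the new settings above, here is a hedged sketch; the repo paths and versions are placeholder assumptions, and each module takes additional required arguments that are omitted here:

```hcl
# Hypothetical sketch: repo paths and versions are placeholders, and each
# module takes additional required arguments that are omitted here.

# logs/load-balancer-access-logs: skip creating the S3 bucket entirely.
module "lb_access_logs" {
  source = "git::git@github.com:gruntwork-io/module-aws-monitoring.git//modules/logs/load-balancer-access-logs?ref=<VERSION>"

  create_resources = false
}

# iam-groups: avoid exceeding the max IAM policy length by skipping the
# "access-all" group.
module "iam_groups" {
  source = "git::git@github.com:gruntwork-io/module-security.git//modules/iam-groups?ref=<VERSION>"

  should_create_iam_group_cross_account_access_all = false
}
```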
What happened: HashiCorp has released Terraform 0.12 final. They also followed up shortly after with 0.12.1, which fixes some important bugs.
Why it matters: Terraform 0.12 brings with it a number of powerful new features, but will also require a significant upgrade.
What to do about it: See the “Terraform 0.12 update” section above.
What happened: Amazon’s managed Kafka service, MSK, is now generally available in all AWS accounts.
Why it matters: Before, MSK was only available in “preview mode” to select accounts. The service is now a bit more mature and available everywhere as a managed way to run Apache Kafka (and Apache ZooKeeper).
What to do about it: Give MSK a shot and let us know what you think! We do not have a dedicated module for it, but you can try out the aws_msk_cluster resource to deploy it yourself.
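If you want to experiment, a minimal `aws_msk_cluster` configuration might look like the sketch below; the Kafka version, instance type, subnet IDs, and security group are illustrative placeholders, so check the resource docs for the full set of required arguments:

```hcl
# Minimal sketch of an MSK cluster; all values are illustrative placeholders.
resource "aws_msk_cluster" "example" {
  cluster_name           = "example-kafka"
  kafka_version          = "2.1.0"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type   = "kafka.m5.large"
    ebs_volume_size = 100

    # One subnet per AZ, plus a security group that allows client access.
    client_subnets  = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]
    security_groups = ["sg-dddd4444"]
  }
}
```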
What happened: AWS has added support for ENI trunking, which allows certain instance types to have a higher ENI limit for ECS Tasks in `awsvpc` networking mode.
Why it matters: When using `awsvpc` networking mode, each ECS Task gets its own IP address by way of an Elastic Network Interface (ENI). Under the hood, each ECS Task runs on an EC2 Instance, and those instances typically had very low limits on how many ENIs you could attach (e.g., only 1–2 until you got to really large instance types). That meant you would often run out of ENIs long before you ran out of CPU or memory resources. Now, if you enable the new `awsvpcTrunking` mode, certain instance types will allow you to attach 3–8x as many ENIs as before, allowing you to make much better use of your CPU and memory resources.
What to do about it: Check out the announcement blog post for instructions.
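If you manage account settings with Terraform, newer versions of the AWS provider also expose an `aws_ecs_account_setting_default` resource that can opt your account in; this is a minimal sketch, assuming your provider version includes that resource:

```hcl
# Minimal sketch: requires an AWS provider version that includes the
# aws_ecs_account_setting_default resource.
resource "aws_ecs_account_setting_default" "eni_trunking" {
  name  = "awsvpcTrunking"
  value = "enabled"
}
```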
What happened: AWS Lambda now allows you to use Node.js v10 as a runtime, while the older Node.js v6 runtime is now deprecated.
Why it matters: If you were using Node.js v6, you need to update immediately, as it will stop working soon. Node.js v10 includes a number of performance improvements and is generally a safe upgrade.
What to do about it: If you’re using package-lambda to manage your Lambda functions, update your `runtime` parameter to `nodejs10.x`.
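For example, if you're calling the lambda module from package-lambda, the change might look like this sketch; the module path, version, and the other arguments shown are placeholder assumptions:

```hcl
# Hypothetical sketch: module path, version, and most arguments are
# placeholders; the runtime change is the only point here.
module "my_function" {
  source = "git::git@github.com:gruntwork-io/package-lambda.git//modules/lambda?ref=<VERSION>"

  name        = "example-function"
  source_path = "${path.module}/src"
  handler     = "index.handler"

  # Switch from the deprecated nodejs6.10 runtime to nodejs10.x.
  runtime = "nodejs10.x"
}
```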
Below is a list of critical security updates that may impact your services. We notify Gruntwork customers of these vulnerabilities as soon as we know of them via the Gruntwork Security Alerts mailing list. It is up to you to scan this list and decide which of these apply and what to do about them, but most of these are severe vulnerabilities, and we recommend patching them ASAP.