Once a month, we send out a newsletter to all Gruntwork customers that describes all the updates we’ve made in the last month, news in the DevOps industry, and important security updates. Note that many of the links below go to private repos in the Gruntwork Infrastructure as Code Library and Reference Architecture that are only accessible to customers.

Hello Grunts,

In the last month, we hit a big milestone at Gruntwork: $1 million in annual recurring revenue! Then, we got right back to work and made a huge number of updates, including making major changes to our ELK code to work around NLB limitations, updating Terratest so it can take a “snapshot” of your configs and logs to make it easier to debug test failures, updating Terragrunt so it automatically retries on errors that are known to be transient, fixing the perpetual diff issue with S3 bucket lifecycle settings, adding support for Oracle Cloud Infrastructure to Terratest, and making many other fixes and improvements. In other news, you can now use Yubikeys with AWS, and the Oracle JDK now requires a paid support contract for production usage, so you may need to change JDKs soon.

As always, if you have any questions or need help, email us at support@gruntwork.io!

Gruntwork Updates

Gruntwork is now generating $1 million in annual recurring revenue

Motivation: Our mission is to make it 10x easier to understand, build, and deploy software. To do that at scale, we realized that we needed to build a sustainable company.

Solution: We created Gruntwork and began offering access to world-class infrastructure code, DevOps software, training, and support as a part of a subscription. This subscription is now bringing in over $1 million in annual recurring revenue (ARR). We are deeply grateful to our customers for making this possible.

What to do about it: Check out How we got to $1 million in annual recurring revenue with $0 in fundraising for all the details.

Major Release: ELK Package

Motivation: While using our ELK code over the last couple of months, we hit a few limitations with the NLB we had chosen as the load balancer for inter-cluster communication.

Solution: We replaced the NLB with an ALB for communication between clusters. However, since Filebeat can only communicate with Logstash over a pure TCP protocol, and the ALB only supports HTTP/HTTPS, we can’t use the ALB with Filebeat. To get around this issue, we came up with an auto-discovery mechanism that resides on the application server. It runs as a cron job that periodically looks up the Logstash EC2 instance IPs using the AWS APIs, updates the Filebeat configuration with the IPs of the returned instances, and restarts Filebeat to load the new configuration. We also rely on Filebeat’s built-in load balancing feature to distribute requests among the Logstash instances.
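The actual implementation lives in the ELK package, but to make the idea concrete, here is a minimal sketch of what such a cron job could look like in Go. The `Role: logstash` tag, Filebeat config path, port, and restart command are illustrative assumptions, not the module’s actual settings:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	// Look up all running EC2 instances tagged as Logstash servers (the tag is an assumption).
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("tag:Role"), Values: []*string{aws.String("logstash")}},
			{Name: aws.String("instance-state-name"), Values: []*string{aws.String("running")}},
		},
	})
	if err != nil {
		log.Fatalf("Failed to look up Logstash instances: %v", err)
	}

	// Collect the private IPs of the instances we found, using the standard Beats port.
	var hosts []string
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			if inst.PrivateIpAddress != nil {
				hosts = append(hosts, fmt.Sprintf("\"%s:5044\"", *inst.PrivateIpAddress))
			}
		}
	}

	// Render a minimal Filebeat output section pointing at those IPs. Filebeat's
	// built-in load balancing distributes requests across the listed hosts.
	config := fmt.Sprintf("output.logstash:\n  hosts: [%s]\n  loadbalance: true\n", strings.Join(hosts, ", "))
	if err := os.WriteFile("/etc/filebeat/filebeat.yml", []byte(config), 0644); err != nil {
		log.Fatalf("Failed to write Filebeat config: %v", err)
	}

	// Restart Filebeat so it picks up the new list of Logstash endpoints.
	if err := exec.Command("systemctl", "restart", "filebeat").Run(); err != nil {
		log.Fatalf("Failed to restart Filebeat: %v", err)
	}
}
```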

What to do about it: This is a significant backwards-incompatible change, and special care needs to be taken to ensure a smooth upgrade. The following steps are a good starting point:

  1. Remove your use of the nlb module and replace it with the alb module. See an example here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L436
  2. Replace your use of the load-balancer-target-group module with the newly added load-balancer-alb-target-group module. See an example of using the new module here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L71
  3. Finally, update the various target_group_arns arguments passed to the cluster modules. See an example here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L40
  4. If you’re using SSL with the ALB, you’ll also need to review the ALB upgrade notes in module-load-balancer, v0.12.0.

Terratest can now help take a snapshot of your config/logs

Motivation: When an infrastructure test fails, you typically need the logs and config files from your deployed apps and services to understand what went wrong. Currently, getting at this information is a bit of a pain: you’d need some way to run the tests, “pause” (i.e., not tear down) the infrastructure after a failure, SSH to individual instances, and then view the logs and config files to see what went wrong. This is hard to do, especially when your tests are running automatically on a CI server.

Solution: Terratest can now automate the task of taking a “snapshot” of your whole deployment by grabbing a copy of log files, config files, and any other files useful for debugging. If you configure your CI server correctly, you can make this “snapshot” easy to browse. For example, when one of our ELK automated tests fails, we can browse the snapshot in CircleCI to debug what went wrong.
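The example readme linked below walks through the actual snapshot API. As a rough sketch of the underlying idea, assuming Terratest’s ssh helpers and illustrative file paths, a test could copy the interesting files into a local folder that your CI server stores as build artifacts:

```go
package test

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/gruntwork-io/terratest/modules/ssh"
)

// snapshotFiles copies a few useful log/config files from a server into a local
// folder (e.g., one your CI server is configured to store as build artifacts).
// The host, remote file paths, and output dir are illustrative assumptions.
func snapshotFiles(t *testing.T, host ssh.Host, remotePaths []string, outputDir string) {
	if err := os.MkdirAll(outputDir, 0755); err != nil {
		t.Fatalf("Failed to create snapshot dir: %v", err)
	}

	for _, remotePath := range remotePaths {
		// Read the remote file over SSH; adjust the command (e.g., add sudo) as needed.
		contents, err := ssh.CheckSshCommandE(t, host, "cat "+remotePath)
		if err != nil {
			t.Logf("Could not read %s: %v", remotePath, err)
			continue
		}

		localPath := filepath.Join(outputDir, filepath.Base(remotePath))
		if err := os.WriteFile(localPath, []byte(contents), 0644); err != nil {
			t.Fatalf("Failed to write snapshot file %s: %v", localPath, err)
		}
	}
}
```

You’d call a helper like this when a test fails and point your CI system (e.g., a CircleCI store_artifacts step) at the output folder so the files show up in the build; the readme mentioned below shows the built-in way Terratest does it.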

What to do about it: Update your code to use Terratest v0.13.0 and then take a look at our example readme for a full walk-through of the functionality and how to use it.

Terragrunt will now automatically retry on transient errors

Motivation: Occasionally, when you run a command like terraform apply, you get a transient/intermittent error, such as a TLS handshake timeout or CloudWatch concurrency error. If you just re-run apply, the error goes away, but having to deal with these intermittent failures is frustrating, especially in CI environments, and especially when running many commands at once (e.g., via apply-all).

Solution: We’ve updated Terragrunt to automatically retry commands when you hit an error that is known to be transient! There’s nothing for you to do to enable it: if Terragrunt recognizes the error, it will automatically re-run the last command up to a configurable number of times (default is 3) with a configurable sleep between retries (default is 5 seconds). You can find the list of known transient errors in auto_retry_options.go. We will add support for specifying a custom list of retryable errors in the future (if you want this feature soon, PRs are very welcome!).
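Conceptually, the mechanism works like this: run the command, scan its output for error messages known to be transient, and if one matches, sleep and re-run. The sketch below illustrates that idea in Go with a couple of example error patterns and the default settings described above; it is not Terragrunt’s actual implementation:

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
	"time"
)

// A couple of example patterns; the real list lives in Terragrunt's auto_retry_options.go.
var retryableErrors = []*regexp.Regexp{
	regexp.MustCompile(`(?s).*net/http: TLS handshake timeout.*`),
	regexp.MustCompile(`(?s).*ThrottlingException.*`),
}

// runWithAutoRetry runs a command, and if its combined output matches a known
// transient error, sleeps and re-runs it, up to maxRetries additional times.
func runWithAutoRetry(name string, args []string, maxRetries int, sleep time.Duration) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		var out []byte
		out, err = exec.Command(name, args...).CombinedOutput()
		if err == nil {
			return nil
		}
		if !isRetryable(string(out)) {
			return err
		}
		fmt.Printf("Hit a known transient error, retrying in %s...\n", sleep)
		time.Sleep(sleep)
	}
	return err
}

func isRetryable(output string) bool {
	for _, pattern := range retryableErrors {
		if pattern.MatchString(output) {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative defaults matching the ones described above: 3 retries, 5 second sleep.
	if err := runWithAutoRetry("terraform", []string{"apply", "-auto-approve"}, 3, 5*time.Second); err != nil {
		fmt.Printf("Command failed after retries: %v\n", err)
	}
}
```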

What to do about it: Give Terragrunt v0.17.0 a shot and see if it makes your Terraform usage a little more stable and reliable. Check out the Auto Retry docs for more details, including how to configure retries and sleeps, and how to disable retry functionality if, for some reason, it doesn’t work with your use cases.

Fix perpetual diff errors with S3 buckets

Motivation: For a while, some of our modules that used S3 buckets with lifecycle settings would always show a diff when you ran plan, even though nothing had changed.

Solution: Thanks to the help of one of our customers, we believe we’ve figured out the cause: you should not set both the expired_object_delete_marker and days parameters in an expiration block. We’ve fixed this issue in our load-balancer-access-logs and cloudtrail modules.
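The same constraint holds at the S3 API level: an expiration action should set either a number of days or the expired-object-delete-marker flag, but not both. For illustration, here is a sketch (using the aws-sdk-go S3 client, with a placeholder bucket name and prefix) of a lifecycle rule that sets only the days value:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// Expire objects after 30 days. Note that this expiration action sets only Days;
	// setting ExpiredObjectDeleteMarker in the same action as Days is the combination
	// that leads to the conflicting settings and perpetual diffs described above.
	_, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
		Bucket: aws.String("my-access-logs-bucket"), // placeholder bucket name
		LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
			Rules: []*s3.LifecycleRule{
				{
					ID:     aws.String("expire-old-logs"),
					Status: aws.String("Enabled"),
					Filter: &s3.LifecycleRuleFilter{Prefix: aws.String("logs/")}, // placeholder prefix
					Expiration: &s3.LifecycleExpiration{
						Days: aws.Int64(30),
					},
				},
			},
		},
	})
	if err != nil {
		log.Fatalf("Failed to update lifecycle configuration: %v", err)
	}
}
```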

What to do about it: To pick up these fixes, update to module-aws-monitoring, v0.9.3 and module-security, v0.15.2.

Terratest now supports OCI

Motivation: Terratest is Gruntwork’s swiss army knife for infrastructure testing. Last month, we updated Terratest with support for testing infrastructure on Google Cloud Platform (GCP). This month, someone wanted to use Terratest to test infrastructure on Oracle Cloud Infrastructure (OCI).

Solution: Terratest now has initial support for OCI! Check out packer_oci_example_test.go for an example.

What to do about it: Grab Terratest v0.12.0 and take the oci package for a spin.

Jenkins backup cleanup fix

Motivation: There was a bug in how we configured the code that cleans up old backups for Jenkins in the Reference Architecture. As a result, backups wouldn’t be cleaned up, and more and more snapshots would pile up over time.

Solution: The fix requires tweaking the value of a single parameter, delete_older_than, from 15 to 15d, as shown in this commit in the Acme sample Reference Architecture.

What to do about it: If you’re using Jenkins with the Reference Architecture:

  1. Update your delete_older_than parameter as shown above.
  2. Publish a new version of your infrastructure-modules repo.
  3. Run terragrunt apply in your infrastructure-live repo to deploy the changes.

Package SAM updates

Motivation: There were several small bugs in package-sam, and there was no way to pass environment variables to the AWS SAM CLI when testing locally.

Solution: We implemented some bug fixes and also added support for passing environment variables to AWS SAM CLI through the Swagger file.

What to do about it: To pick up these fixes, update to package-sam, v0.1.7.

Gruntwork Houston updates

We’ve made a number of updates to Gruntwork Houston in the last month.

Are you interested in joining the Houston beta? Email us at info@gruntwork.io!

ELK updates

In addition to the NLB replacement mentioned at the top of this newsletter, we also made a number of other updates to package-elk in the last month.

Terragrunt updates

We made a number of other updates to Terragrunt in the last month.

Terratest updates

We made a number of other updates to Terratest in the last month.

Other open source updates

Other updates

DevOps News

AWS now supports Yubikey for MFA

What happened: AWS now supports the Yubikey as a Multi-Factor Auth device.

Why it matters: The Yubikey is a tiny hardware USB device that supports a range of security functionality, including generating one-time passwords that can be used for Multi-Factor Authentication (MFA). It’s easier to use and (arguably) more secure than other MFA options, such as using the Google Authenticator app on your phone.

The way it works is that you (or your company) buy a Yubikey and register it with (a) Yubico’s online service and (b) the online service you’re trying to log into, such as AWS. Then, whenever you log into that online service, it will ask you not only for a username and password, but also a Yubikey token. To enter the token, you simply click on the text field in your browser and push a button on the Yubikey itself, and it will automatically enter the token for you (the Yubikey behaves as a USB keyboard), without you having to take your phone out of your pocket or type anything in manually. The web service then checks your token with the Yubikey service and, if it’s valid, allows you to log in.

What to do about it: If you wish to start using a Yubikey with AWS, follow the instructions here.

As of version 11, Oracle JDK will no longer be free

What happened: Oracle has released Java 11, but the terms come with a catch: you may no longer use Oracle’s JDK for commercial or production purposes without a paid support contract from Oracle.

Why it matters: For many years, the Oracle JDK was the recommended JDK for most Java apps, as it was the best maintained, had all the bells and whistles, and gave you the option to purchase support from Oracle. While you can still use the Oracle JDK for developing, testing, prototyping, and learning, the support contract is now no longer optional for production or commercial usage.

What to do about it: If you don’t want to pay Oracle for a support contract, you need to move to one of the flavors of OpenJDK.

The good news is that OpenJDK is more or less identical to Oracle JDK these days, so this should not generally cause issues. We will be updating our code (namely, the JDK installer in package-zookeeper) to use one of the OpenJDK flavors in the future.

RDS now supports deletion protection

What happened: Amazon has added support for deletion protection for RDS and Aurora databases.

Why it matters: You can turn on deletion protection with a single click (or single line of code). Once enabled, if you try to delete a database with deletion protection, you get an error (the only way to delete such a database is to explicitly disable deletion protection). This provides an extra sanity check to help protect your production databases from accidental deletion (e.g., accidental terraform destroy).

What to do about it: You can enable deletion protection via the UI now. We’ll be exposing a flag to enable this feature in module-data-storage in the future (if you need it sooner, PRs are welcome!).
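If you want to enable it programmatically before that module flag exists, the underlying API is a single boolean on the database instance. Here is a sketch using aws-sdk-go (the instance identifier is a placeholder):

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/rds"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := rds.New(sess)

	// Enable deletion protection on an existing RDS instance. Deleting the
	// instance will now fail until this flag is explicitly turned off again.
	_, err := svc.ModifyDBInstance(&rds.ModifyDBInstanceInput{
		DBInstanceIdentifier: aws.String("my-production-db"), // placeholder identifier
		DeletionProtection:   aws.Bool(true),
		ApplyImmediately:     aws.Bool(true),
	})
	if err != nil {
		log.Fatalf("Failed to enable deletion protection: %v", err)
	}
}
```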

ElastiCache for Redis now supports read replicas for sharded Redis

What happened: Amazon has announced that ElastiCache for Redis now supports adding and removing read replica nodes for both sharded and non-sharded Redis clusters.

Why it matters: This makes it easier to scale your reads and improve availability for your Redis Cluster environments without requiring manual steps or needing to make application changes.

What to do about it: Check out the announcement blog post for the details.

Security Updates

Below is a list of critical security updates that may impact your services. We notify Gruntwork customers of these vulnerabilities as soon as we know of them via the Gruntwork Security Alerts mailing list. It is up to you to scan this list and decide which of these apply and what to do about them, but most of these are severe vulnerabilities, and we recommend patching them ASAP.

Jenkins