Once a month, we send out a newsletter to all Gruntwork customers that describes all the updates we’ve made in the last month, news in the DevOps industry, and important security updates. Note that many of the links below go to private repos in the Gruntwork Infrastructure as Code Library and Reference Architecture that are only accessible to customers.
Hello Grunts,
In the last month, we hit a big milestone at Gruntwork: $1 million in annual recurring revenue! Then, we got right back to work, and made a huge number of updates, including making major changes to our ELK code to work around NLB limitations, updating Terratest so it can take a “snapshot” of your configs and logs to make it easier to debug test failures, updating Terragrunt so it automatically retries on errors that are known to be transient, fixing the perpetual diffs issue with S3 bucket lifecycle settings, adding support for Oracle Cloud Infrastructure to Terratest, and a huge number of other fixes and improvements. In other news, you can now use Yubikeys with AWS and the Oracle JDK now requires a paid support contract for production usage, so you may need to change JDKs soon.
As always, if you have any questions or need help, email us at support@gruntwork.io!
Motivation: Our mission is to make it 10x easier to understand, build, and deploy software. To do that at scale, we realized that we needed to build a sustainable company.
Solution: We created Gruntwork and began offering access to world-class infrastructure code, DevOps software, training, and support as a part of a subscription. This subscription is now bringing in over $1 million in annual recurring revenue (ARR). We are deeply grateful to our customers for making this possible.
What to do about it: Check out How we got to $1 million in annual recurring revenue with $0 in fundraising for all the details.
Motivation: While using our ELK code over the last couple of months, we hit a few limitations with the NLB we had chosen as the load balancer for inter-cluster communication.
Solution: We replaced the NLB with an ALB for communication between clusters. However, since Filebeat can only communicate with Logstash on a pure TCP protocol, and the ALB only supports HTTP/HTTPS, we can’t use the ALB with Filebeat. To get around this issue, we came up with an auto discovery mechanism that resides on the application server. It runs as a cron job on the server, periodically looking up Logstash EC2 instance IPs using the AWS APIs, updating the Filebeat configuration with the IPs of the returned instances, and restarting Filebeat to load the new configuration. We also rely on Filebeat’s built-in load balancing feature to distribute requests among the Logstash instances.
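The discovery loop described above can be sketched as follows. This is a hypothetical illustration, not the actual Gruntwork script: in practice the instance IPs come from the AWS APIs (e.g., an ec2 describe-instances call filtered by the Logstash cluster's tags), and the cron job would finish by restarting Filebeat to load the new configuration.

```python
import re

# Hypothetical sketch of the auto discovery cron job described above; this is
# illustrative code, not the actual Gruntwork script. In production, the IPs
# would come from the AWS APIs (e.g., ec2 describe-instances filtered by the
# Logstash cluster's tags), and the job would end by restarting Filebeat
# (e.g., systemctl restart filebeat) to pick up the new configuration.

def render_filebeat_hosts(config_text, instance_ips, port=5044):
    """Rewrite the hosts list in Filebeat's Logstash output section."""
    hosts = ", ".join('"{}:{}"'.format(ip, port) for ip in instance_ips)
    return re.sub(r"hosts:\s*\[[^\]]*\]", "hosts: [{}]".format(hosts),
                  config_text)

if __name__ == "__main__":
    original = 'output.logstash:\n  hosts: ["10.0.0.1:5044"]\n'
    # Pretend these IPs came back from the AWS API lookup:
    print(render_filebeat_hosts(original, ["10.0.1.5", "10.0.2.7"]))
```

Filebeat's built-in load balancing then distributes requests across whatever hosts end up in that list.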
What to do about it: This is a hugely backwards-incompatible change, and special care needs to be taken to ensure a smooth upgrade. The following steps are a good starting point:

1. Remove the nlb module and replace it with an alb module. See example here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L436
2. Replace the load-balancer-target-group module with the newly added load-balancer-alb-target-group module. See an example of using the new module here: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L71
3. Update the target_group_arns arguments passed to the cluster modules: https://github.com/gruntwork-io/package-elk/blob/master/examples/elk-multi-cluster/main.tf#L40

Motivation: When an infrastructure test fails, to understand what went wrong, you typically need the logs and config files from your deployed apps and services. Currently, getting at this information is a bit of a pain: you'd need some way to run the tests, "pause" (i.e., not tear down) the infrastructure after a failure, SSH to individual instances, and then view the logs and config files to see what went wrong. This is hard to do, especially when your tests are running automatically on a CI server.
Solution: Terratest can now automate the task of taking a “snapshot” of your whole deployment by grabbing a copy of log files, config files, and any other files useful for debugging. If you configure your CI server correctly, you can make this “snapshot” easy to browse. For example, when one of our ELK automated tests fails, here is how we can use CircleCI to debug what went wrong:
What to do about it: Update your code to use Terratest v0.13.0 and then take a look at our example readme for a full walk-through of the functionality and how to use it.
Motivation: Occasionally, when you run a command like terraform apply, you get a transient/intermittent error, such as a TLS handshake timeout or CloudWatch concurrency error. If you just re-run apply, the error goes away, but having to deal with these intermittent failures is frustrating, especially in CI environments, and especially when running many commands at once (e.g., via apply-all).
Solution: We’ve updated Terragrunt to automatically retry commands when you hit an error that is known to be transient! There’s nothing for you to do to enable it: if Terragrunt recognizes the error, it will automatically re-run the last command up to a configurable number of times (default is 3) with a configurable sleep between retries (default is 5 seconds). You can find the list of known transient errors in auto_retry_options.go. We will add support for specifying a custom list of retryable errors in the future (if you want this feature soon, PRs are very welcome!).
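The retry behavior can be sketched as follows. This is an illustration in Python, not Terragrunt's actual Go implementation; the defaults mirror the ones described above (3 attempts, 5 seconds between retries), and the real list of retryable errors lives in auto_retry_options.go.

```python
import re
import time

# A minimal sketch of the auto-retry behavior described above -- an
# illustration, not Terragrunt's actual Go implementation. The second
# pattern below is an assumption for illustration only.
RETRYABLE_ERRORS = [
    r"TLS handshake timeout",     # example from the newsletter
    r"Client\.Timeout exceeded",  # assumption: illustrative pattern only
]

def run_with_retry(command, max_attempts=3, sleep_seconds=5):
    """Run command() until it succeeds, retrying known transient errors."""
    for attempt in range(1, max_attempts + 1):
        succeeded, output = command()
        if succeeded:
            return output
        if not any(re.search(p, output) for p in RETRYABLE_ERRORS):
            raise RuntimeError("non-retryable error: " + output)
        if attempt < max_attempts:
            time.sleep(sleep_seconds)
    raise RuntimeError("still failing after {} attempts".format(max_attempts))
```

For example, a command that fails once with a TLS handshake timeout and then succeeds would complete on the second attempt, with no intervention from you.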
What to do about it: Give Terragrunt v0.17.0 a shot and see if it makes your Terraform usage a little more stable and reliable. Check out the Auto Retry docs for more details, including how to configure retries and sleeps, and how to disable retry functionality if, for some reason, it doesn’t work with your use cases.
Motivation: For a while, some of our modules that used S3 buckets with lifecycle settings would always show a diff when you ran plan, even though nothing had changed.
Solution: Thanks to the help of one of our customers, we believe we've figured out the cause: you should not set both the expired_object_delete_marker and days parameters in an expiration block. We've fixed this issue in our load-balancer-access-logs and cloudtrail modules.
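To make the constraint concrete, here is a small validation helper. This is a hypothetical sketch, not code from any Gruntwork module: it simply rejects the parameter combination described above.

```python
def validate_expiration(expiration):
    """Reject S3 lifecycle expiration settings that combine
    expired_object_delete_marker with days (or date) -- the combination
    behind the perpetual plan diff described above."""
    has_marker = bool(expiration.get("expired_object_delete_marker"))
    has_time = "days" in expiration or "date" in expiration
    if has_marker and has_time:
        raise ValueError(
            "expiration may set expired_object_delete_marker OR days/date, "
            "not both")
    return expiration

# OK: clean up expired object delete markers only
validate_expiration({"expired_object_delete_marker": True})
# OK: expire objects after 30 days
validate_expiration({"days": 30})
```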
What to do about it: To pick up these fixes, update to module-aws-monitoring, v0.9.3 and module-security, v0.15.2.
Motivation: Terratest is Gruntwork’s swiss army knife for infrastructure testing. Last month, we updated Terratest with support for testing infrastructure on Google Cloud Platform (GCP). This month, someone wanted to use Terratest to test infrastructure on Oracle Cloud Infrastructure (OCI).
Solution: Terratest now has initial support for OCI! Check out packer_oci_example_test.go for an example.
What to do about it: Grab Terratest v0.12.0 and take the oci package for a spin.
Motivation: There was a bug in how we configured the code that cleans up old backups for Jenkins in the Reference Architecture. As a result, backups wouldn’t be cleaned up, and more and more snapshots would pile up over time.
Solution: The fix requires tweaking the value of a single parameter, delete_older_than, from 15 to 15d, as shown in this commit in the Acme sample Reference Architecture.
What to do about it: If you're using Jenkins with the Reference Architecture:

1. Update the delete_older_than parameter as shown above in your infrastructure-modules repo.
2. Run terragrunt apply in your infrastructure-live repo to deploy the changes.

Motivation: There were several small bugs and no way to pass environment variables to AWS SAM CLI while testing locally.
Solution: We implemented some bug fixes and also added support for passing environment variables to AWS SAM CLI through the Swagger file.
What to do about it: To pick up these fixes, update to package-sam, v0.1.7.
We’ve made a number of updates to Gruntwork Houston in the last month:
- houston-cli, v0.0.7: Added the ability to create and set up the houston configuration from the command line using the newly introduced houston configure command.
- houston-cli, v0.0.8: Improved help text output and fixed a bug so that houstonUrl in the config file allows trailing slashes.

Are you interested in joining the Houston beta? Email us at info@gruntwork.io!
In addition to the NLB replacement mentioned at the top of this newsletter, we also made a number of other updates to package-elk in the last month:
- Added iam_role_id as an output variable for the logstash-cluster module. This variable is useful for adding ssh-grunt IAM policies to this ASG.
- Added a missing = character to a Terraform local declaration. This had caused inconsistent behavior, with some customers reporting issues as a result while other tests ran and passed without issue.
- Added support in logstash-cluster for passing through allowed security groups for collectd and beats to the underlying logstash-security-group-rules module. This is very handy for specifying allowed security groups without having to have a second logstash-security-groups module.
- Added allowed_ssh_security_group_ids to the aws_launch_configuration resources in both the kibana and elastalert modules. Also added proper plumbing for allow_ssh_from_security_group_ids to be specified in the elastalert module and then be passed all the way through to the underlying elastalert-security-group-rules module.
- kibana-cluster will now create egress rules for the security group that it creates.
- Stabilized the ELK tests.
- Added better documentation and clarified examples in our AMI and example code READMEs.
- Renamed vars.tf to variables.tf.
We made a number of other updates to Terragrunt in the last month:
- Added support for force_path_style in the S3 config, as well as support for skipping S3 bucket versioning via the skip_bucket_versioning config.
- Terragrunt now honors the shared_credentials_file config for S3 backends, using it when creating S3 buckets and DynamoDB tables.
- You can now exclude directories from the xxx-all commands (e.g., apply-all) by using the --terragrunt-exclude-dir flag. This flag supports wildcard expressions and may be specified multiple times.
- Fixed the prevent_destroy flag so it works even when configs are inherited from a parent .tfvars file.
- When using extra_arguments, Terragrunt will no longer pass -var or -var-file arguments to Terraform when you call apply with a plan file.
- Released v0.16.13, which fixes a bug where -var and -var-file were still passed if you called apply with a plan file and other arguments in between (e.g., terragrunt apply <other args> <plan file>).

We made a number of other updates to Terratest in the last month:
- Added methods for running terraform plan and extracting the exit code, including InitAndPlan and PlanExitCode.
- Added ScpFileFrom and ScpDirFrom, which allow for the transfer of files from remote EC2 instances to the local machine. The main idea with these helper methods is to make it easy to tell terratest to grab all of the various log and config files from your app running on some remote machine in the case that a test is going to fail. We already had methods in terratest that would grab the contents of those files and return them as a string; the new methods expand upon that functionality and open up the possibility of easily grabbing and archiving all of the log and configuration files on your CI of choice.
- Added a WorkspaceSelectOrNew method that can be used to create and select Terraform workspaces at test time.

We also made a number of updates across our other modules and tools:

- Updates to the consul-cluster, consul-security-group-rules, and consul-client-security-group-rules modules.
- The vault-security-group-rules module now adds a self rule so that Vault servers can talk to each other via their API port.
- Support for the ebs_block_device parameter in the nomad-cluster module.
- Added a HelpPrinter function that will wrap help text at a specified line width, while preserving indentation in the output table. To use it, call entrypoint.NewApp() to construct the CLI app, which will take care of applying the modifications, or manually apply the changes yourself on the cli app. You can also modify the line width by changing entrypoint.HelpTextLineWidth (defaults to 80).
- Added https_listener_ports_and_acm_ssl_certs_num and https_listener_ports_and_ssl_certs_num to specify the length of the mappings between ports and their associated ACM and non-ACM certificates. This allows the values of the mappings to be dependent on dynamic resources. See: hashicorp/terraform#11482
- The ecs-service-with-discovery module now outputs the security group ID via the output variable ecs_task_security_group_id.
- Support for volumes in the ecs-service module using the new volumes parameter.
- Added wait_for to the lambda module. None of the resources in the module will be created until wait_for is resolved, which allows you to execute other steps (e.g., create a zip file) before this module runs. This is a workaround for the lack of depends_on for modules in Terraform.
- Fixed a concurrency issue with the boto3 library zip file in get-desired-capacity.py. We will now attempt to unzip the archive and catch any exception; if it is the exception related to our concurrency issue, we simply sleep for 5 seconds and try again.
- The cloudwatch-log-aggregation-scripts, cloudwatch-memory-disk-metrics-scripts, and syslog modules now support Amazon Linux 2.
- package-openvpn now uses bash-commons under the hood. The behavior is identical, but you must now install bash-commons before installing any of the package-openvpn modules.

What happened: AWS now supports the Yubikey as a Multi-Factor Auth device.
Why it matters: The Yubikey is a tiny hardware USB device that supports a range of security functionality, including generating one-time passwords that can be used for Multi-Factor Authentication (MFA). It’s easier to use and (arguably) more secure than other MFA options, such as using the Google Authenticator app on your phone.
The way it works is you (or your company) buy a Yubikey and register it with (a) Yubico’s online service and (b) the online service you’re trying to log into, such as AWS. Then, whenever you’re logging into your online service, it will ask you not only for a username and password, but also a Yubikey token. To enter the token, you simply click on the text field in your browser, push a button on the Yubikey itself, and it will automatically enter the token for you (the Yubikey behaves as a USB keyboard), without you having to take your phone out of your pocket or type anything in manually. The web service will then check your token with the Yubikey service, and if it’s valid, allow you to login.
What to do about it: If you wish to start using a Yubikey with AWS, follow the instructions here.
Motivation: Oracle has released Java 11, but the terms come with a catch: you may no longer use Oracle’s JDK for commercial or production purposes without a paid support contract from Oracle.
Why it matters: For many years, the Oracle JDK was the recommended JDK for most Java apps, as it was the best maintained, had all the bells and whistles, and gave you the option to purchase support from Oracle. While you can still use the Oracle JDK for developing, testing, prototyping, and learning, the support contract is now no longer optional for production or commercial usage.
What to do about it: If you don't want to pay Oracle for a support contract, you need to move to one of the flavors of OpenJDK.
The good news is that OpenJDK is more or less identical to Oracle JDK these days, so this should not generally cause issues. We will be updating our code (namely, the JDK installer in package-zookeeper) to use one of the OpenJDK flavors in the future.
What happened: Amazon has added support for deletion protection for RDS and Aurora databases.
Why it matters: You can turn on deletion protection with a single click (or single line of code). Once enabled, if you try to delete a database with deletion protection, you get an error (the only way to delete such a database is to explicitly disable deletion protection). This provides an extra sanity check to help protect your production databases from accidental deletion (e.g., accidental terraform destroy).
What to do about it: You can enable deletion protection via the UI now. We’ll be exposing a flag to enable this feature in module-data-storage in the future (if you need it sooner, PRs are welcome!).
What happened: Amazon has announced that ElastiCache for Redis now supports adding and removing read replica nodes for both sharded and non-sharded Redis clusters.
Why it matters: This makes it easier to scale your reads and improve availability for your Redis Cluster environments without requiring manual steps or needing to make application changes.
What to do about it: Check out the announcement blog post for the details.
Below is a list of critical security updates that may impact your services. We notify Gruntwork customers of these vulnerabilities as soon as we know of them via the Gruntwork Security Alerts mailing list. It is up to you to scan this list and decide which of these apply and what to do about them, but most of these are severe vulnerabilities, and we recommend patching them ASAP.