Once a month, we send out a newsletter to all Gruntwork customers that describes all the updates we’ve made in the last month, news in the DevOps industry, and important security updates. Note that many of the links below go to private repos in the Gruntwork Infrastructure as Code Library and Reference Architecture that are only accessible to customers.
Hello Grunts,
We’ve got lots of fun updates for you from the last month! We added a new “Infrastructure Module Cookbook” course to our DevOps Training Library, wrote two new blog post series, one on zero-downtime updates for Kubernetes clusters and one on automating HashiCorp Vault, updated our Kubernetes modules with support for securely configuring Helm/Tiller, refactored our Kubernetes modules into multiple repositories, made a number of fixes and improvements to our ZooKeeper and Kafka modules, added out-of-the-box CloudWatch Dashboards to the Gruntwork Reference Architecture, and much more.
As always, if you have any questions or need help, email us at support@gruntwork.io!
Motivation: Many customers were asking us for best practices on how to write reusable, production-grade, battle-tested infrastructure code.
Solution: We’ve added a new course called The Gruntwork Infrastructure Module Cookbook. It goes over all the lessons we’ve learned from building a library of over 300,000 lines of infrastructure code that’s used in production by hundreds of companies. Topics covered: the tools we use, how to build small, reusable modules, how to keep your infrastructure code DRY, how to write automated tests for infrastructure code, and how to version, release, and deploy infrastructure code.
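To illustrate one of those practices, here’s a minimal, hypothetical sketch of consuming a small, reusable module that’s pinned to a released version via a Git tag. The repo URL, module path, and inputs below are placeholders, not actual Gruntwork modules:

```hcl
# Hypothetical example: consume a versioned, reusable module by pinning its
# source to a Git tag, so you can release and roll out infrastructure changes
# in a controlled way. The repo, module path, and inputs are placeholders.
module "vpc" {
  source = "git::git@github.com:acme/infrastructure-modules.git//networking/vpc?ref=v0.1.0"

  # Keep modules small and configurable via input variables instead of copy/paste.
  name       = "example-vpc"
  cidr_block = "10.0.0.0/16"
}
```

Pinning each environment to a specific tag is what lets you promote a module change from dev to stage to prod one environment at a time.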
What to do about it: The new course is available as part of your Gruntwork Subscription! So, if you’re a subscriber, just log into https://training.gruntwork.io/ and you should already have access (if it’s not working for some reason, email us at support@gruntwork.io). If you’re not a subscriber, sign up now, and get access to all of the courses in the DevOps Training Library plus all the battle-tested infrastructure code in the Infrastructure as Code Library.
Motivation: Vault is an open source tool created by HashiCorp for securely storing secrets, such as database passwords, API keys, and TLS certs. Authenticating to Vault manually is easy, but many of our customers were struggling with figuring out how to automate Vault—especially automatically unsealing Vault (e.g., after an outage) and automatically authenticating to Vault from another server.
Solution: We’ve updated our Vault code for both AWS and GCP to make automation easier and wrote a 3-part blog post series that shows you how to implement this automation in your own deployments.
What to do about it: Start with the first part of the series, Auto-unsealing, and let us know what you think!
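To give a flavor of what auto-unsealing involves on AWS, here is a minimal sketch of a Vault server configuration that uses Vault’s awskms seal stanza. The storage backend, TLS paths, region, and KMS key below are placeholders; the blog post series shows how this is wired into the Vault modules:

```hcl
# Minimal sketch of a Vault server config (HCL) that enables auto-unseal via AWS KMS.
# All values below are placeholders.
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/opt/vault/tls/vault.crt.pem"
  tls_key_file  = "/opt/vault/tls/vault.key.pem"
}

# With this stanza, Vault uses the specified KMS key to encrypt and decrypt its
# master key, so it can unseal itself automatically after a restart or outage.
seal "awskms" {
  region     = "us-east-1"
  kms_key_id = "alias/vault-unseal-key"
}
```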
Motivation: Although Kubernetes has built-in support for rolling out updates to the Docker containers you’re running in your cluster, how do you update the cluster itself? Last month, we updated our Kubernetes modules with support for automated rolling deployment for EKS clusters, and we learned that following best practices for draining the worker instances wasn’t enough to achieve a zero-downtime rollout. There are many settings and considerations you have to think through on the resources you deploy to Kubernetes, such as graceful shutdowns, the pod shutdown lifecycle, and PodDisruptionBudgets.
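For example, a PodDisruptionBudget tells Kubernetes how many replicas of a service must stay up during voluntary disruptions such as draining a node. Here’s a minimal sketch using the Terraform Kubernetes provider; the names and labels are placeholders:

```hcl
# Minimal sketch: a PodDisruptionBudget that tells Kubernetes to keep at least
# two replicas of the "example-app" pods running while nodes are being drained
# during a cluster update. The names and labels are placeholders.
resource "kubernetes_pod_disruption_budget" "example_app" {
  metadata {
    name = "example-app-pdb"
  }

  spec {
    min_available = "2"

    selector {
      match_labels = {
        app = "example-app"
      }
    }
  }
}
```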
Solution: We decided to share our learnings in a four-part blog post series that covers how to deploy your containers on Kubernetes to achieve zero-downtime cluster updates.
What to do about it: Head on over to part 1 and share your feedback!
Motivation: As we started developing the Kubernetes modules, we realized that they were reusable across different Kubernetes clusters and across different clouds (e.g., AWS, GCP, Azure). In the old structure, with both the Kubernetes modules and the EKS modules in the same GitHub repository, the multi-cloud nature of the Kubernetes modules was not obvious. To make it clearer that the Kubernetes modules work with other flavors of Kubernetes and not just EKS, we decided to split the modules into multiple repositories.
Solution: We have split the repo that used to be called package-k8s into the following repositories:

- [terraform-aws-eks](https://github.com/gruntwork-io/terraform-aws-eks): This repo now holds the modules related to EKS (eks-cluster-control-plane, eks-cluster-workers, eks-k8s-role-mapping, eks-vpc-tags).
- [terraform-kubernetes-helm](https://github.com/gruntwork-io/terraform-kubernetes-helm): This repo now holds the modules related to setting up Helm on a Kubernetes cluster (k8s-namespace, k8s-service-account, k8s-helm-server).
- [helm-k8s-services](https://github.com/gruntwork-io/helm-k8s-services): This repo will hold all the Helm charts that can be used to package your application for deployment on Kubernetes via Helm (k8s-job, k8s-daemon-set, k8s-service).
- [kubergrunt](https://github.com/gruntwork-io/kubergrunt): This repo now holds the kubergrunt toolbox, which includes support for configuring kubectl to authenticate to an EKS cluster, securely configuring the Helm server (Tiller) and its TLS certificates, and securely configuring Helm clients and their TLS certificates.

What to do about it: If you are part of the Gruntwork Kubernetes beta, take a look at the migration notes in package-k8s. If you’re a Gruntwork Subscriber and would like to join the Kubernetes beta, email us at support@gruntwork.io and we’ll grant you access (and if you’re not a subscriber, sign up now)!
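If you’re migrating existing code, the main change in your Terraform configurations is updating module source URLs to point at the new repos. Here is a hypothetical before/after sketch; the module, inputs, and version ref are placeholders, so follow the migration notes for the actual steps:

```hcl
# Hypothetical sketch of updating a module's source URL after the repo split.
# The module path, inputs, and version ref are placeholders; see the migration
# notes in package-k8s for the real steps.
module "eks_cluster" {
  # Before: source = "git::git@github.com:gruntwork-io/package-k8s.git//modules/eks-cluster-control-plane?ref=v0.x.x"
  source = "git::git@github.com:gruntwork-io/terraform-aws-eks.git//modules/eks-cluster-control-plane?ref=v0.x.x"

  # ... module inputs unchanged ...
}
```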
Motivation: We found that deploying Tiller (the server component of Helm) while following all the security best practices can be challenging. By default, Tiller does not apply any security configurations, requiring you to navigate RBAC authorization, TLS, and overriding configs to make use of Secrets in order to follow the best practices for deploying Tiller.
Solution: We implemented a deployment tool in kubergrunt that enforces the recommended security best practices when deploying Tiller. We also wrote a guide to Helm, describing its security model and how kubergrunt helps you follow those best practices.
Finally, we implemented a full Terraform example in terraform-kubernetes-helm that wraps kubergrunt in Terraform to deploy Tiller on your Kubernetes cluster. This includes an example of how to create a Namespace and ServiceAccount with the minimal set of permissions for Tiller to operate.
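For illustration, here is a hypothetical sketch of what wiring up those modules might look like. The input and output names are placeholders rather than the modules’ actual interface, so refer to the example in terraform-kubernetes-helm for the real code:

```hcl
# Hypothetical sketch of creating a dedicated Namespace and ServiceAccount for
# Tiller using the k8s-namespace and k8s-service-account modules. The input and
# output names are placeholders; see the full example in terraform-kubernetes-helm
# for the modules' real interface.
module "tiller_namespace" {
  source = "git::git@github.com:gruntwork-io/terraform-kubernetes-helm.git//modules/k8s-namespace?ref=v0.2.1"

  name = "tiller"
}

module "tiller_service_account" {
  source = "git::git@github.com:gruntwork-io/terraform-kubernetes-helm.git//modules/k8s-service-account?ref=v0.2.1"

  name      = "tiller"
  namespace = "${module.tiller_namespace.name}"
}
```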
What to do about it: If you are a beta participant, check out v0.3.0 of kubergrunt and v0.2.1 of terraform-kubernetes-helm. Be sure to read the background guide on Helm, the command README for information on all the features of managing Tiller in kubergrunt, and our quickstart guide on deploying to minikube. The repos are currently in private beta, which is open to any Gruntwork subscriber. If you are interested in participating, email us at support@gruntwork.io and we’ll grant you access (and if you’re not a subscriber, sign up now)!
Motivation: In our Kafka and ZooKeeper packages, we use the server-group module’s support for zero-downtime rolling deployment and an Elastic Load Balancer (ELB) for health checks. In the previous versions, the ZooKeeper health check was done by attempting to establish a TCP connection to the ZooKeeper client port. During cluster launch and rolling deployment, even when the TCP connection was successful, one of the ZooKeeper nodes sometimes failed to re-join the cluster, leaving the cluster in an unstable state.
Solution: We introduced an improved way to perform health checks on the ZooKeeper nodes. By using our simple, open source [health-checker](https://github.com/gruntwork-io/health-checker) library with a custom health check script target, we’re now able to monitor the ZooKeeper cluster status much more precisely, especially during cluster launch and rolling upgrades.
We also rewrote big parts of the test suites for both packages, greatly increasing test coverage and reliability. Both suites now run the module and cluster tests using multiple Linux distributions — Amazon Linux, Ubuntu, and CentOS.
Additionally, we fixed a number of smaller issues, including adding num_* variables for security group rules to avoid terraform issues with count.
What to do about it: We recommend updating to package-zookeeper, v0.5.0 (update: see v0.5.0 for the upgrade instructions, but use v0.5.2 to pick up a couple important bug fixes) and package-kafka, v0.5.0. Also, check out the examples to see how you can make use of the improved health checking for ZooKeeper.
Motivation: We’ve always thought it would be great to have an easy way to get out-of-the-box dashboards for the CloudWatch metrics that are published as part of the standard Reference Architecture deployment.
Solution: We now automatically create a CloudWatch Dashboard in each environment (i.e., dev, stage, prod). Each dashboard contains a basic set of graphs that allow you to monitor, in real time, some of the key metrics for services in the Reference Architecture, including ECS Clusters, ECS Services, Application Load Balancers, and RDS databases. We will add support for more resources in the future.
What to do about it: All future deployments of the Reference Architecture will have this feature automatically included. However, if you’d like to add this to your existing deployment, here is a pull request that adds metric widgets and a dashboard module to the Acme Reference Architecture that you can use as a guide: https://github.com/gruntwork-io/infrastructure-modules-multi-account-acme/pull/11.
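Under the hood, a CloudWatch Dashboard is just a named JSON document of widgets. As a rough illustration (not the actual Reference Architecture code; the name, region, and metrics are placeholders), a single-widget dashboard in Terraform looks something like this:

```hcl
# Minimal, hypothetical sketch of a CloudWatch Dashboard with a single metric
# widget graphing ECS cluster CPU utilization. The name, region, and metric
# dimensions are placeholders; the Reference Architecture creates a much richer
# set of widgets per environment.
resource "aws_cloudwatch_dashboard" "example" {
  dashboard_name = "stage-dashboard"

  dashboard_body = <<EOF
{
  "widgets": [
    {
      "type": "metric",
      "x": 0,
      "y": 0,
      "width": 12,
      "height": 6,
      "properties": {
        "title": "ECS cluster CPU",
        "region": "us-east-1",
        "stat": "Average",
        "period": 300,
        "metrics": [
          ["AWS/ECS", "CPUUtilization", "ClusterName", "example-ecs-cluster"]
        ]
      }
    }
  ]
}
EOF
}
```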
Other fixes and improvements from the last month:

- Fixed a bug in the swappiness setting where we were missing a “p” and therefore weren’t setting the value correctly. Also fixed a libtinfo symlink issue that prevented Couchbase from starting on Amazon Linux 2.
- consul-cluster now allows you to make the IAM setup optional and use an externally created IAM role. run-consul now configures autorestart with unexpected so that graceful shutdown attempts won’t automatically restart Consul.
- Terraform would fail if the allowed_ssh_security_group_ids parameter referenced any resources, as the consul-cluster module was using that variable in a count parameter. The only available option is that you now MUST pass in a redundant variable called allowed_ssh_security_group_count that specifies the number of security group IDs in allowed_ssh_security_group_ids.
- Added support for running custom health check scripts via the --script argument.
- Added new functions for retrieving Kubernetes Pods (GetPod and GetPodE) as well as waiting for pods to become available (WaitUntilPodAvailable and WaitUntilPodAvailableE).
- test_structure.CopyTerraformFolderToTemp will now log the new temporary directory, so you can see which directory it used for the test. Also added new functions for retrieving Kubernetes Secrets (GetSecret, GetSecretE), ServiceAccounts (GetServiceAccount, GetServiceAccountE), and creating ServiceAccounts (CreateServiceAccount, CreateServiceAccountE).
- Added support to the k8s module for authenticating against GKE clusters.
- The ecs-service-with-discovery module now sets the default family name for the ECS Task Definition to ${var.service_name} rather than "${var.service_name}-task-definition" to be consistent with the other ECS modules.
- The ecs-service-with-discovery module now includes a new domain_name output variable that will be set to the fully-qualified domain name configured for the module (if any).
- The ecs-daemon-service module now exposes a deployment_minimum_healthy_percent parameter you can use to set the lower limit (as a percentage of the service's desiredCount) of the number of running tasks that must remain running and healthy in a service during a deployment.
- The ecs-daemon-service module now exposes setting pid_mode via a new variable: ecs_task_definition_pid_mode. This allows setting the process namespace to use for the containers in the task. The valid values are host and task.
- Added DEBIAN_FRONTEND=noninteractive to calls to apt-get in the fail2ban module so that the install doesn't hang during automated builds. Also switched from update-rc.d to systemctl to start fail2ban at boot on Ubuntu.
- You can now use the --java-args flag with the install-jenkins module to configure additional JVM args for Jenkins.
- The vpc-app-network-acls module now allows all outbound DNS (UDP, port 53) traffic by default. Most services need DNS, so it seems like a bug to not have exposed this properly before. Note that internal AWS DNS seems to work without this, but for other DNS systems, such as the one used by Kubernetes, this is an important fix.
- Set the create timeouts on the aws_route resources to 5 minutes to work around a Terraform bug.
- The server-group module now allows you to configure IOPS for your EBS volumes by specifying the iops attribute for each EBS volume you configure via the ebs_volumes parameter.
- The asg-rolling-deploy module now allows you to configure enhanced monitoring on the instances by specifying the enabled_metrics parameter.
- alb-alarms, alb-target-group-alarms, and rds-alarms now support directly setting the datapoints_to_alarm setting. You can read more about datapoints_to_alarm in the official AWS documentation.
- Fixed a bug in alarms/alb-target-group-alarms where the wrong variable was used for treating missing data on the high request count alarm.
- mount-ebs-volume now uses the UUID instead of the device name to mount volumes. With some OS and volume configurations, the device name can change after a reboot, so using the UUID ensures that the volume is always identified the same way.
- The aurora module now exposes the skip_final_snapshot parameter to allow you to skip a final snapshot when deleting a database.
- You can now customize the option group name and the monitoring role ARN path via the option_group_name and monitoring_role_arn_path parameters, respectively.
- gruntsam now supports passing through timeout configurations.
- Fixed an issue where gruntsam frequently fails with a ConflictException.
- Updated the alb module to return a blank 404 page, which is a more appropriate status code.

What happened: Amazon has launched a new service called AWS Backup that allows you to centrally configure backup policies for EBS volumes, RDS databases, DynamoDB tables, EFS file systems, and so on.
Why it matters: Before, you had to create separate, custom solutions to back up your data for each AWS service (e.g., through cron jobs). Moreover, you had to build your own solutions for monitoring these backups and making sure they never failed. Now, you have a single, centralized service where you can configure all your backup policies; it takes care of running the backups on a schedule and alerting you if anything goes wrong.
What to do about it: Check out the AWS Backup product page for more info. Note that Terraform support for configuring AWS Backup is not yet available. Monitor this issue for progress.
What happened: Amazon’s Network Load Balancer (NLB) now supports TLS Termination.
Why it matters: Before, if you were using the NLB, the services you had behind the NLB had to manage their own TLS certificates. Now, you can let the NLB handle it for you, including the built-in integration with the AWS Certificate Manager, which can provision free, auto-renewing TLS certs instantly.
What to do about it: Check out the announcement blog post for details. Note that Terraform does not yet support specifying TLS certs for the NLB. Watch this issue for progress.
What happened: Amazon has announced that its managed Kubernetes service, EKS, now meets the criteria for ISO and PCI DSS Level 1 compliance in AMER, EMEA, and APAC.
Why it matters: Companies subject to ISO and PCI compliance requirements can now run all their workloads on EKS! Our support for EKS is in beta, and if you’re interested in getting access, email us at support@gruntwork.io!
What to do about it: Check out the announcement post for more info.