Configuring AWS Autoscaling Event Notifications in Slack

One of the easiest ways of building resilience into a system running in AWS is to use an autoscaling group. Generally speaking, I use one for any service which is required to self-heal - even when aiming to maintain a steady number of instances, as is desirable when running servers for Consul and Nomad, as well as a whole host of other clustered systems. Unhealthy instances can simply be replaced, usually without operator intervention, and launch configurations can be used to simplify upgrading clustered software one instance at a time.

However, it is often useful to be able to easily track activity within the chat system of your choice. In this post, we'll look at how to use Terraform to deploy an AWS Lambda function which posts a message in Slack whenever a scaling operation happens - regardless of whether it was caused by an operator in the AWS console, API-driven changes, or automatic scaling for health.

Configuring Autoscaling Notifications

One of the features of AWS Autoscaling is the ability to deliver notifications to an SNS topic whenever a scaling event happens. We can configure notifications for the following types of events:

  • successful launch of a new instance
  • failed launch of a new instance
  • successful termination of a running instance
  • failed termination of a running instance
  • test notifications (more on these later)

In Terraform, a resource named aws_autoscaling_notification is used to configure notification delivery. We need to specify the notification event types we are interested in, the names of the autoscaling groups whose events we want, and the ARN of the SNS topic the notifications should be delivered to.

First though, we'll use the aws_sns_topic resource to configure the SNS topic for notifications to be delivered to:

resource "aws_sns_topic" "asg_slack_notify" {  
    name = "SlackNotify-ASG"
    display_name = "Autoscaling Notifications to Slack"
}

Then we can configure notifications to be delivered to the topic we created:

resource "aws_autoscaling_notification" "slack_notify" {  
    group_names = ["${var.asg_names}"]
    notifications  = [
        "autoscaling:EC2_INSTANCE_LAUNCH",
        "autoscaling:EC2_INSTANCE_TERMINATE",
        "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
        "autoscaling:EC2_INSTANCE_TERMINATE_ERROR",
        "autoscaling:TEST_NOTIFICATION"
    ]
    topic_arn = "${aws_sns_topic.asg_slack_notify.arn}"
}

For now, we're setting the autoscaling group names to the value of a variable named asg_names - we'll look more at how that gets populated later, when we talk about the overall structure of this module.
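
Inside the Lambda function, these notification types arrive as the Event field of each message. As a minimal sketch, they could be mapped to friendlier labels before being posted to chat. The describeEvent helper and its label wording are hypothetical and are not part of the function shown later in this post:

```javascript
// Hypothetical helper: maps the notification types configured above
// to short human-readable labels for use in chat messages.
var EVENT_LABELS = {
    'autoscaling:EC2_INSTANCE_LAUNCH': 'Instance launched',
    'autoscaling:EC2_INSTANCE_LAUNCH_ERROR': 'Instance launch failed',
    'autoscaling:EC2_INSTANCE_TERMINATE': 'Instance terminated',
    'autoscaling:EC2_INSTANCE_TERMINATE_ERROR': 'Instance termination failed',
    'autoscaling:TEST_NOTIFICATION': 'Test notification'
};

function describeEvent(eventType) {
    // Fall back to the raw type so that unknown events are still reported.
    return Object.prototype.hasOwnProperty.call(EVENT_LABELS, eventType)
        ? EVENT_LABELS[eventType]
        : eventType;
}

console.log(describeEvent('autoscaling:EC2_INSTANCE_LAUNCH'));
```

Falling back to the raw type means a notification type added to the Terraform configuration later is still reported, just without a friendly label.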

Sending Notifications to Slack

Lambda Function

Now we have notifications being delivered, we can write a Lambda function to extract the important information and use the Slack Webhooks API to send messages into the channel of our choice. I'm using JavaScript for this, but in principle you could use any of the supported Lambda platforms.

var https = require('https');  
var util = require('util');

exports.handler = function(event, context) {  
    try {
        var message = JSON.parse(event.Records[0].Sns.Message);

        var channel = process.env.SLACK_CHANNEL
        var username = process.env.SLACK_USERNAME
        var webhookId = process.env.SLACK_WEBHOOK

        var eventType = message.Event;
        var autoScaleGroupName = message.AutoScalingGroupName;
        var description = message.Description;
        var cause = message.Cause;

        var slackMessage = [
            "*Event*: " + eventType,
            "*Description*: " + description,
            "*Cause*: " + cause,
        ].join("\n");

        var postData = {
            channel: channel,
            username: username,
            text: "*" + autoScaleGroupName + "*",
            attachments: [{ text: slackMessage, mrkdwn_in: ["text"] }]
        };

        var options = {
            method: 'POST',
            hostname: 'hooks.slack.com',
            port: 443,
            path: '/services/' + webhookId
        };

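        var req = https.request(options, function(res) {
            res.setEncoding('utf8');
            // Consume the response body so that the 'end' event fires.
            res.on('data', function (chunk) {});
            // Signal completion once Slack's full response has been received.
            res.on('end', function () {
                context.done(null);
            });
        });

        req.on('error', function(e) {
            // Log before failing, so the error message is recorded.
            console.log('request error: ' + e.message);
            context.fail(e);
        });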

        req.write(util.format("%j", postData));
        req.end();
    } catch (e) {
        context.fail(e)
    }
};

This code is fairly self-explanatory - the important things to note are the variables which must be set in the function's environment: SLACK_CHANNEL, SLACK_USERNAME and SLACK_WEBHOOK.
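
Because the handler combines parsing, formatting and the HTTPS request, it is awkward to check the message format without actually posting to Slack. As a sketch only, the formatting step could be pulled out into a pure function. The name buildSlackPayload is hypothetical, and the sample message values are invented for illustration. It also substitutes a placeholder for missing fields, on the assumption that some messages (test notifications, for example) may not carry a Description or Cause:

```javascript
// Hypothetical refactoring of the handler's formatting step into a
// pure function, so that it can be exercised without calling Slack.
function buildSlackPayload(message, channel, username) {
    // Use a placeholder rather than printing "undefined" in the chat message.
    function field(value) {
        return (value === undefined || value === null) ? 'n/a' : value;
    }

    var text = [
        '*Event*: ' + field(message.Event),
        '*Description*: ' + field(message.Description),
        '*Cause*: ' + field(message.Cause)
    ].join('\n');

    return {
        channel: channel,
        username: username,
        text: '*' + field(message.AutoScalingGroupName) + '*',
        attachments: [{ text: text, mrkdwn_in: ['text'] }]
    };
}

// Illustrative message only: these values are made up.
var sample = {
    Event: 'autoscaling:EC2_INSTANCE_LAUNCH',
    AutoScalingGroupName: 'example-asg',
    Description: 'Launching a new EC2 instance',
    Cause: 'Capacity was increased by an operator.'
};

var payload = buildSlackPayload(sample, '#ops-staging', 'aws');
console.log(JSON.stringify(payload, null, 2));
```

The handler would then only need to parse the SNS message, call this function and send the result, which keeps the network code separate from the formatting.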

Packaging

Lambda requires the code that makes up a function to be packaged as a zip archive before it can be deployed. We can use Terraform's archive_file data source to do this:

data "archive_file" "notify_js" {  
    type = "zip"
    source_file = "../../lambda/asgSlackNotifications.js"
    output_path = "../../lambda/asgSlackNotifications.zip"
}

In this case we're not using any third party NPM modules, so simply archiving the JavaScript file itself is sufficient.

Creating the Lambda Function

Next, we can use the aws_lambda_function resource to create the lambda function itself, using the archive:

resource "aws_lambda_function" "slack_notify" {  
    depends_on = ["data.archive_file.notify_js"]

    function_name = "asgSlackNotifications"
    description = "Send notifications to Slack when Autoscaling events occur"

    runtime = "nodejs4.3"
    handler = "asgSlackNotifications.handler"

    role = "${aws_iam_role.slack_notify.arn}"

    filename = "${data.archive_file.notify_js.output_path}"
    source_code_hash = "${base64sha256(file(data.archive_file.notify_js.output_path))}"

    environment {
        variables {
            SLACK_CHANNEL = "${var.channel}"
            SLACK_USERNAME = "${var.username}"
            SLACK_WEBHOOK = "${var.asg_hook_id}"
        }
    }
}

There are a few interesting points about this resource:

  • The depends_on specification ensures that the archive file has been created before Terraform processes this resource - its value consists of the data prefix, the Terraform type and the name given to the data source.

  • Assigning a hash of the source code archive ensures that we will appropriately update the lambda function if the code package changes.

  • The environment variables we called out in the code above are set in the environment block. A future improvement would be to encrypt the ID of the webhook using KMS, and use the AWS kms:Decrypt operation in the Lambda function to obtain the value, so that it is not available in plain text to an operator looking at the console.

  • The handler must match the module name and function name in the source file, or invocations of the function will fail.

  • We assign an IAM role to the function by ARN. We'll look at the content of this next.
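
The KMS improvement mentioned in the list above is not implemented in this post. As a rough sketch only, assuming the encrypted webhook ID were supplied base64-encoded and the function's role were allowed kms:Decrypt, the lookup could be wrapped as shown here. The decryptWebhookId helper and the stand-in client are hypothetical. The KMS client is passed in as an argument so that the helper can be tried without AWS access; in the real function it would be an instance of the KMS client from the aws-sdk module:

```javascript
// Hypothetical helper: decrypts a base64-encoded ciphertext using a
// client object that exposes decrypt(params, callback), as the KMS
// client in the AWS SDK for JavaScript does.
// Assumes a Node runtime where Buffer.from(string, encoding) is available.
function decryptWebhookId(kms, encryptedBase64, callback) {
    var params = { CiphertextBlob: Buffer.from(encryptedBase64, 'base64') };
    kms.decrypt(params, function (err, data) {
        if (err) {
            return callback(err);
        }
        // The decrypted value is returned as a Buffer.
        callback(null, data.Plaintext.toString('ascii'));
    });
}

// Stand-in client for local experimentation. It simply echoes the
// ciphertext bytes back as the plaintext, which real KMS does not do.
var fakeKms = {
    decrypt: function (params, cb) {
        cb(null, { Plaintext: params.CiphertextBlob });
    }
};

// The webhook ID here is a made-up placeholder.
var encrypted = Buffer.from('T000/B000/XXXX').toString('base64');
var decrypted;
decryptWebhookId(fakeKms, encrypted, function (err, id) {
    decrypted = id;
    console.log(err || id);
});
```

In the handler, the decrypted value would take the place of the plain SLACK_WEBHOOK variable when building the request path.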

Creating an IAM Role

In order to create the role associated with the Lambda function, we need a couple of resources and data sources.

Using the aws_iam_policy_document data source in Terraform allows us to author policies using HCL rather than templating. Whether you choose to use this is something of a matter of preference, but we tend to find it substantially better than writing or templating JSON.

First, we'll look at the data source for the policy for who can assume the role:

data "aws_iam_policy_document" "assume_role" {  
    statement {
        effect = "Allow"
        actions = [
            "sts:AssumeRole",
        ]
        principals {
            type = "Service"
            identifiers = ["lambda.amazonaws.com"]
        }
    }
}

When reified at plan time, this will produce the following JSON policy text as the json attribute:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            }
        }
    ]
}

We can use the rendered JSON to create our role:

resource "aws_iam_role" "slack_notify" {  
    name = "SlackNotifications"
    assume_role_policy = "${data.aws_iam_policy_document.assume_role.json}"
}

Next, we can write the text of the policy, which allows writing logs to CloudWatch.

data "aws_iam_policy_document" "slack_notify" {  
    statement {
        sid = "CloudwatchLogs"
        effect = "Allow"
        actions = [
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:GetLogEvents",
            "logs:PutLogEvents"
        ]
        resources = ["arn:aws:logs:*:*:*"]
    }
}

Finally, we create an inline policy on the role, using the reified policy text:

resource "aws_iam_role_policy" "slack_notify" {  
    name = "SlackNotifications"
    role = "${aws_iam_role.slack_notify.id}"
    policy = "${data.aws_iam_policy_document.slack_notify.json}"
}

Subscribing to the SNS topic

Before we can subscribe a Lambda function to an SNS topic, we must first grant the SNS topic permission to invoke the function, via the lambda:InvokeFunction action. We can use the aws_lambda_permission resource to do so:

resource "aws_lambda_permission" "with_sns" {  
    statement_id = "AllowExecutionFromSNS"
    action = "lambda:InvokeFunction"
    function_name = "${aws_lambda_function.slack_notify.arn}"
    principal = "sns.amazonaws.com"
    source_arn = "${aws_sns_topic.asg_slack_notify.arn}"
}

Finally, we can create a subscription with an aws_sns_topic_subscription resource:

resource "aws_sns_topic_subscription" "lambda" {  
    depends_on = ["aws_lambda_permission.with_sns"]
    topic_arn = "${aws_sns_topic.asg_slack_notify.arn}"
    protocol = "lambda"
    endpoint = "${aws_lambda_function.slack_notify.arn}"
}

This is the final piece of the configuration puzzle needed to provision all our cloud resources. Before we can plan or apply it though, we need to talk a bit about module arrangement and instantiation.

The Composition Root Pattern

Many of the Terraform best practices discussed on the web today revolve around the idea of building an entire infrastructure with one command. I prefer a world of small, cohesive modules instead - where infrastructure is made up of many states representing individual components. I'll go into the rationale for this shortly, but first let's look at the layout of a component:

$ tree
.
├── README.md
└── terraform
    ├── environments
    │   ├── production
    │   │   └── main.tf
    │   └── staging
    │       └── main.tf
    ├── lambda
    │   └── asgSlackNotifications.js
    └── modules
        └── asg-notifications
            ├── iam.tf
            ├── interface.tf
            ├── lambda.tf
            └── notifications.tf

In this repository layout, we separate individual functional units into modules, and then use a composition root per individual environment - in this case staging and production.

We consider there to be a number of benefits to this approach versus the commonly seen terraform.tfvars-per-environment approach. They will be covered in a lot more depth in future posts, but the big reason for now is that composition roots which are Terraform configuration files rather than variables files can use data sources to obtain values to plug in to the modules, and additional modules can be composed on a per-environment basis as necessary.

Composition roots tend to have a pattern which includes the following elements:

  • Provider instantiation
  • Data sources to query module values
  • Module instantiation
  • Outputs to provide for use in other composition roots

Staging Environment

The composition root for our staging environment looks like this:

provider "aws" {  
    region = "us-west-2"
}

data "aws_autoscaling_groups" "all" {}

module "asg_notifications" {  
    source = "../../modules/asg-notifications"

    asg_names = "${data.aws_autoscaling_groups.all.names}"

    asg_hook_id = "<redacted>"
    channel = "#ops-staging"
    username = "aws"
}

Notice the hard-coded environment-specific variables such as the channel name, which would normally live within a .tfvars file. In this case we do not need to provide any outputs.

In the case of needing to replicate this in another environment, a separate composition root in a different subdirectory of environments would be used - for example environments/production/main.tf.

Planning and Applying

Now that the module and the composition root for our target environment are ready, we can run a plan and ensure all is as we expect. To do this, we use the terraform plan command, with the -out flag to ensure the plan is saved.

$ export AWS_ACCESS_KEY_ID=<redacted>
$ export AWS_SECRET_ACCESS_KEY=<redacted>
$ cd staging
$ terraform plan -out 001.plan

# Plan output removed for brevity

Plan: 10 to add, 0 to change, 0 to destroy.  

Finally, we can apply the plan using terraform apply 001.plan to create the resources and start receiving notifications!

Summary

In this post we've seen a few things which will feature more heavily in my future posts on this blog:

  • Managing SNS topics and subscriptions, Lambda functions and permissions
  • Configuring Autoscaling Notifications
  • Using aws_iam_policy_document to write IAM policies in HCL rather than JSON
  • The composition root pattern for Terraform.

If you want to run this for yourself, you'll need Terraform version 0.8.5 (for the aws_autoscaling_groups data source). If you're running in Terraform Enterprise and using the composition root pattern, be sure to set the TF_ATLAS_DIR environment variable to the root of the environment you are provisioning for.

In my next post, we'll be looking at using Terraform to build a high quality reusable VPC module, configuring a VPC and all of its accoutrements such as flow logging, VPC endpoints, NAT and routing.