AWS Lambda is a service (released April 9, 2015) that consumes events
from Kinesis, S3, DynamoDB, SNS, and more.
You can use it to make advanced materialized views out of DynamoDB tables,
react to uploaded images, or archive old content. In short, you write a
function (currently only in Node.js) and it is presented with JSON
containing information about the event's source and content.
Another way to run Node.js? Why bother?
-- Everyone
In a way, Lambda is a unique take on the Platform as a Service concept. A
typical PaaS might offer to serve your web app, but Lambda takes the
"serve" part out and replaces it with "reactively run". The instance your
Lambda function runs on isn't running all the time, and you can have as many
functions as you can trigger running at once. You could use it as a managed
replacement for Resque or another background job processor.
This post is a tour of the powerful ways you can use Lambda to react to events.
First, we'll tour a sample application I built that generates a static site
from markdown files in S3, then we'll examine more effective ways to use
Lambda.
Caveat Emptors
Before we get started, let's get a few things out of the way. Lambda as a service name is a bit annoying because it stomps over several other useful contexts for the word, but we'll suspend those for the moment.
Lambda also has several limitations at the time of this writing.
- Function runtime is limited to 60 seconds
- Node.js is the only supported language
- Maximum of 500MB (ephemeral) storage and 1GB memory
- Debugging involves a lot of waiting for CloudWatch logs to show up
- Only one Lambda trigger can exist per S3 bucket
Hugo-Lambda: Demo App
Being able to react to events without needing to constantly run (and pay for)
EC2 instances opens up new ways to use existing tools.
Hugo-lambda rebuilds a static site from source whenever a change
is uploaded to S3.
It's likely the cheapest hosted CMS around. Using S3 website
hosting for generated content, Route53 for DNS, and Lambda to
generate the site from source can host your entire site within the AWS free
tier. Even if you don't qualify for the free tier, the total cost for a site
updated daily would be less than $1 per month.
Every time new content is uploaded, hugo-lambda downloads your site templates,
themes, and content, runs hugo, and uploads the generated site (with the
correct storage ACLs) to the public bucket for your site.
Of course, if you're like me you don't get around to updating your blog daily,
but that's ok. As with all of AWS, you only pay for what you use. You're
charged only for time hugo-lambda actually spends generating your site instead
of paying to run WordPress, Drupal, or another CMS 24/7.
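To put a rough number on that claim: the build count and duration below are assumptions (not measurements from hugo-lambda), and the rates are Lambda's published on-demand prices.
// Back-of-envelope estimate for a site updated daily (assumed: 30 builds/month, 60s each, 1GB memory).
var gbSeconds   = 30 * 60 * 1.0;               // 1,800 GB-seconds per month
var computeCost = gbSeconds * 0.00001667;      // ~$0.03 at $0.00001667 per GB-second
var requestCost = (30 / 1000000) * 0.20;       // effectively $0.00 at $0.20 per million requests
console.log("Lambda: $" + (computeCost + requestCost).toFixed(2));
// Well inside Lambda's 400,000 GB-second monthly free tier; in practice the
// $0.50/month Route53 hosted zone is the biggest line item.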
Running Unsupported Languages
Over the last several years there has been a huge crop of excellent static site generators, led by Jekyll. I prefer hugo, and since it's written in Go it's distributed as a single static binary. When included with the Node.js dependencies for the function, hugo can be invoked as a subprocess using spawn.
var async = require('async');
var spawn = require('child_process').spawn;
exports.handler = function(event, context) {
  async.waterfall([
    // function to download content skipped for brevity
    function runHugo(next) {
      var child = spawn("./hugo", ["-v", "--source=/tmp", "--destination=/tmp/public"], {});
      child.on('close', function(code) {
        console.log("hugo exited with code: " + code);
        next(null);
      });
    },
    // function to upload finished site skipped for brevity
  ], function(err) {
    if (err) console.error("Failure because of: " + err);
    else console.log("Site generated successfully!");
    context.done();
  });
};
The above code is an abbreviated version of RunHugo.js from the
hugo-lambda project, but it can (almost) stand on its own.
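For reference, the upload step skipped above can be done with the aws-sdk module that Lambda already provides. This is a minimal sketch for a single file, not the project's actual upload code; the bucket name is a placeholder.
// Sketch: upload one generated file with a public-read ACL.
// "your-site-bucket" is a placeholder; the real project walks all of /tmp/public.
var fs = require('fs');
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

function uploadFile(localPath, key, done) {
  s3.putObject({
    Bucket: "your-site-bucket",
    Key: key,                          // e.g. "index.html"
    Body: fs.readFileSync(localPath),  // file produced by hugo under /tmp/public
    ACL: "public-read",                // required for S3 website hosting
    ContentType: "text/html"
  }, done);
}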
Handling Events
Lambda can take events from a variety of sources, but hugo-lambda only needs to
listen to S3 events. S3 is sort of the odd duck of Lambda notifications because
it doesn't show up in the list-event-sources API; instead, it's attached to
the bucket and is part of S3's get-bucket-notification API.
{
"CloudFunctionConfiguration": {
"InvocationRole": "arn:...InvokeRole-SQU198TLCHES",
"CloudFunction": "arn:...:function:HugoLambdaGenerate",
"Events": [
"s3:ObjectCreated:*"
],
"Id": "HugoLambdaGenerate-notification",
"Event": "s3:ObjectCreated:*"
},
}
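If you're wiring this up yourself, a configuration like the one above can be attached with the SDK's putBucketNotification call. The snippet below is a sketch of the shape only; the bucket name and ARNs are placeholders, not values from hugo-lambda.
// Sketch: attach the Lambda notification to the source bucket.
var AWS = require('aws-sdk');
var s3 = new AWS.S3();

s3.putBucketNotification({
  Bucket: "your-input-bucket",
  NotificationConfiguration: {
    CloudFunctionConfiguration: {
      Id: "HugoLambdaGenerate-notification",
      Events: ["s3:ObjectCreated:*"],
      CloudFunction: "arn:aws:lambda:us-east-1:123456789012:function:HugoLambdaGenerate",
      InvocationRole: "arn:aws:iam::123456789012:role/YourInvokeRole"
    }
  }
}, function(err) {
  if (err) console.error("Could not attach notification: " + err);
});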
Event sources like DynamoDB and Kinesis follow a similar format, requiring
an invocation role, function ARN, and source ARN. An Amazon Resource
Name (ARN) is a unique, namespaced identifier that lets you refer to resources
in configurations and API calls.
Is Lambda a Microservice Platform?
As an aside: if you haven't already, I really recommend reading Martin Fowler's definitive piece on Microservices.
Now, you may be thinking "small programs with limited state and transparent
scaling? That's just microservices, right?" There are certainly overlapping
advantages; let's see what matches up.
Componentization
Each Lambda function is an independent component, and they can be chained together by having the output of one trigger the next function (or group of functions). Because of this, they are easy to experiment with and play well with other data systems.
Smart endpoints and dumb pipes
In a lot of definitions of microservices, people take this to mean "uses RESTful HTTP interfaces between components". Lambda events follow a strict JSON format. Here's an abbreviated example of an S3 event for a new object.
{
"Records": [
{
"s3": {
"object": {
"eTag": "50ed8c18234b65e3baf1417eac1bb03f",
"size": 307,
"key": "content/posts/test.md"
},
"bucket": {
"arn": "arn:aws:s3:::some-bucket-name
"ownerIdentity": {
"principalId": "..."
},
"name": "some-bucket-name
},
"configurationId": "Kappa-HugoLambdaGenerate-notification",
"s3SchemaVersion": "1.0"
},
"eventVersion": "2.0",
"eventSource": "aws:s3",
"awsRegion": "us-east-1",
"eventTime": "2015-04-02T22:50:04.028Z",
"eventName": "ObjectCreated:Put",
"userIdentity": {
"principalId": "..."
}
}
]
}
That seems pretty simple; it even includes extra metadata about the object, like
its size and ETag (md5sum). The message format is one part of the pipe, the
other part is how messages are received. The event notification system is very
straightforward because it only needs the ARN of the sender (source) and
receiver (Lambda function) to successfully route messages. All the delivery
semantics are hidden completely.
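To show how little plumbing the receiving side needs, here's a minimal handler sketch (not from hugo-lambda) that pulls the bucket and key out of such an event; note that S3 URL-encodes object keys in notifications.
// Minimal sketch: read the bucket and key from an S3 event record.
exports.handler = function(event, context) {
  var record = event.Records[0];
  var bucket = record.s3.bucket.name;
  // S3 URL-encodes keys in event notifications (e.g. spaces become '+')
  var key = decodeURIComponent(record.s3.object.key.replace(/\+/g, " "));
  console.log("New object: s3://" + bucket + "/" + key);
  context.done();
};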
Decentralized Data Management
This is up to you. Of course, hugo-lambda is a case of highly centralized data
management as each function run needs all the site sources to do its job. The
best use cases for Lambda have events that contain all (or most) of the
information needed to process them. An example might be the event generated by an
image upload to be resized in Lambda, or a new document to be indexed.
Design for Failure
Lambda functions abstract away most failure modes, since instance- and availability-zone-level failures can be routed around by triggering functions to run elsewhere.
Hugo-Lambda Usage Patterns
Hugo-lambda is a great demo application, but not a great use of Lambda. In fact, it violates two pretty critical assumptions made by the service: Lambda is built on the idea that every event is independent and can be processed incrementally. Unfortunately, for a full static site (in my case a blog), this isn't true. Edits can be interdependent, and it isn't easy to tell what parts of the site are affected by a new post or partial template.
A new post can cause changes all over the site. The sidebar of every page, the
tag listing page (if the post has a new tag), the archives page, and more.
Without having these changes expressed when a new file is added to S3 it's
impossible to regenerate the site without downloading all the content and
templates first.
Improved Usage Patterns
The only way to really fix this would be to express the dependency tree between the site's inputs (templates, content, etc.) so that each hugo-lambda run only downloads the content affected by the change. This would further reduce costs and make each run that much faster.
A better use case for Lambda would be to have it roll up events into summary events, or into other indices. Let's walk through an example that makes better use of Lambda.
DynamoDB Event Roll-Ups
Let's take an online game as an example, where a list of top scores needs to be displayed. The Lambda function will roll up the stream of incoming scores into a "recent best" record that has the best scores in the past hour. You may even think of that record as a sort of materialized view put together by your Lambda function. This fits Lambda much better because each event (game play-through) is independent, the high score list doesn't need to be updated by the client, and the list is read often enough that it can't be computed on every read.
Problem Outline
Writes and reads both need to be quick for this case, because you don't want
users to wait after they finish a game to start the next one or wait to see the
high score list when the app opens. At the same time, you can afford to have
some latency between a game completing and the score being posted to the high
score list.
To solve this with Lambda, we can build a flow like:
- Game completes and writes information to DynamoDB
- Lambda function is invoked with the score event
- Lambda views the new scores and, if any of them beat the old scores, updates the list.
- If changed, the score list is stored in a well-known DynamoDB key in the same table to be read by everyone
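Concretely, the "recent best" record could look something like the item below. The key values and field names are a hypothetical illustration, not a format defined by DynamoDB or the game.
{
  "event_key": "recent-best",
  "start_timestamp": 0,
  "window_end": 1428075147,
  "top_scores": [
    {"handle": "EdScissorHands", "score": 924},
    {"handle": "...", "score": 890}
  ]
}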
Event Format
First, let's see what an item looks like. At the end of each game, a record like this is stored in the DynamoDB table.
{
"event_key": "{uid}",
"length_seconds": 55,
"start_timestamp": 1428071547,
"completed": true,
"score": 924,
"handle": "EdScissorHands",
"achievements": [
{"bonus_score": 200, "id": "{uid}", "shortname": "30 Kill Streak"},
...
]
}
The KeySchema is also important here. That looks like:
[
  {
    AttributeName: "event_key",
    KeyType: "HASH"
  },
  {
    AttributeName: "start_timestamp",
    KeyType: "RANGE"
  }
]
The hash key is the player's UID and the range key is the event's start
timestamp. This isn't a great key design, and you can learn more about shard
key design in this AWS Advent DynamoDB post or in the MongoDB docs, but that's way beyond this article's scope.
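One piece not shown here is how the score events reach Lambda at all: the table needs a stream enabled and the function needs an event source mapping pointing at that stream. The sketch below uses placeholder table and function names rather than a real deployment.
// Sketch: enable a stream on the table and point a Lambda event source mapping at it.
var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB();
var lambda = new AWS.Lambda();

dynamodb.updateTable({
  TableName: "scores-table",
  StreamSpecification: { StreamEnabled: true, StreamViewType: "NEW_IMAGE" }
}, function(err, data) {
  if (err) return console.error(err);
  lambda.createEventSourceMapping({
    FunctionName: "ProcessScores",
    EventSourceArn: data.TableDescription.LatestStreamArn,
    StartingPosition: "TRIM_HORIZON",
    BatchSize: 100                     // roll up to 100 completed games per invocation
  }, function(err) {
    if (err) console.error(err);
  });
});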
Function Roll-Up
// ProcessScores.js
var AWS = require('aws-sdk');
var async = require('async');

exports.handler = function(event, context) {
  var ddb = new AWS.DynamoDB();
  console.log("Event: %j", event);
  async.waterfall([
    function getScores(next) {
      ddb.getItem({
        // ... scores record info ...
      }, function(err, data) {
        // pass the scores to the next step
        next(err, data && data.Item);
      });
    },
    function readNew(scores, next) {
      var newScores = false;
      for (var i = 0; i < event.Records.length; ++i) {
        // for all the new scores, see if any of them beat the old scores
      }
      // if they do, update the "scores" item
      if (newScores) next(null, scores);
      else context.done(); // bail out if there is no change
    },
    function writeNew(scores, next) {
      ddb.putItem({
        Item: scores,
        TableName: "scores-table"
      }, function(err, data) {
        next(err);
      });
    }
  ], function(err) {
    if (err) console.error("Failure! " + err);
    context.done();
  });
};
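The readNew loop above is intentionally left empty. A hypothetical version of that comparison, assuming the records come from the table's stream (so new items arrive as typed attribute maps under dynamodb.NewImage) and the "recent best" item has already been unmarshalled into the plain shape sketched earlier, might look like this:
// Hypothetical comparison step; foldInNewScores and the top_scores shape are
// illustrations, not part of the real table design.
function foldInNewScores(event, scores) {
  var updated = false;
  for (var i = 0; i < event.Records.length; ++i) {
    var img = event.Records[i].dynamodb.NewImage;              // typed attribute map
    var candidate = { handle: img.handle.S, score: parseInt(img.score.N, 10) };
    var worst = scores.top_scores[scores.top_scores.length - 1];
    if (scores.top_scores.length < 10 || candidate.score > worst.score) {
      scores.top_scores.push(candidate);
      scores.top_scores.sort(function(a, b) { return b.score - a.score; }); // best first
      scores.top_scores = scores.top_scores.slice(0, 10);      // keep the ten best
      updated = true;
    }
  }
  return updated;
}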
The steps we outlined earlier translate easily into code for this example, and we can
even handle batches of writes (say, 100 completed games at a time) to reduce
the number of Lambda function calls that are made. Triggering this on every
single new score is just as simple, but it isn't necessary since a high score
list doesn't always need to be up-to-date.
Wrapping Up
Here we've seen two applications of Lambda to different problems, and learned
why some workloads make more sense for this new service. Most of the marketing
for Lambda centers around mobile apps and games, but there are plenty of other
places Lambda fits well.