AWS Lambda Retry Pattern

This article will examine the AWS Lambda Retry Policy, as well as various error-handling strategies and recommended practices.

Intro

Regardless of your level of experience with serverless services in AWS, the AWS Lambda retry mechanism can be very confusing.

In a distributed system, many services are often triggered by each asynchronous call. In a microservice architecture every node needs to be thought of as a black box. A reliable fallback system with automatic retries and a strong error-handling mechanism is essential when designing a distributed system.

AWS Retry policy

  • Synchronous events (like API Gateway) won’t trigger any automatic retry policy, so the application must implement its own fallback.
  • Asynchronous events (like SNS and S3): Lambda automatically retries them twice. It’s critical to save the event somewhere, for instance in a Dead Letter Queue (DLQ), in case all retry attempts fail.
  • Stream-based events, like DynamoDB Streams, are retried until the record expires or is processed successfully.
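
Since synchronous callers get no automatic retries, the fallback has to live in your own code. Here is a minimal sketch of a retry helper with exponential backoff. The name retryWithBackoff and its options are my own and not part of any AWS API, and the default values are arbitrary:

```javascript
// Hypothetical helper (not an AWS API): retries an async operation with exponential backoff.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff(operation, { maxAttempts = 3, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // The attempt number is passed in so the operation can log it if needed
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait baseDelayMs, then 2x, then 4x, ... before the next attempt
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller
  throw lastError;
}
```

A caller would wrap the flaky call, for example retryWithBackoff(() => callDownstreamService(payload)), where callDownstreamService is whatever function talks to the other service.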

When it’s worth retrying

It is not necessary to retry every request. Retrying is sometimes just a waste of time and resources, because a failed request may have no chance of succeeding on further attempts. It is your responsibility to recognize when that’s the case and avoid the retry. But how?

The easy way is just to set the function’s Maximum retry attempts value (which applies to asynchronous invocations) to 0.
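
If you prefer to do this from code rather than the console, the same setting is exposed as MaximumRetryAttempts in Lambda’s PutFunctionEventInvokeConfig API. The sketch below only builds the request parameters. The helper name buildNoRetryConfig and the function name passed to it are placeholders of mine:

```javascript
// Hypothetical helper: builds parameters for Lambda's PutFunctionEventInvokeConfig API.
// MaximumRetryAttempts only accepts 0, 1 or 2 for asynchronous invocations.
function buildNoRetryConfig(functionName, maxRetryAttempts = 0) {
  if (!Number.isInteger(maxRetryAttempts) || maxRetryAttempts < 0 || maxRetryAttempts > 2) {
    throw new RangeError('MaximumRetryAttempts must be 0, 1 or 2');
  }
  return {
    FunctionName: functionName, // placeholder: your function's name or ARN
    MaximumRetryAttempts: maxRetryAttempts,
  };
}
```

With the AWS SDK for JavaScript v3, these parameters would be passed to PutFunctionEventInvokeConfigCommand from @aws-sdk/client-lambda.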

However, incorporating a global error handler into your function is a more refined approach.
Here, Lambda returns a valid JSON object and won’t retry the request, because the handler has caught every potential exception:

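module.exports.handler = async (event) => {
	console.log('Received event:', JSON.stringify(event, null, 2));
	try {
		// Function logic
		return { ok: true };
	} catch (err) {
		console.error('[Handler] - Main function: ', err);
		// Send the error to third-party monitoring services
		// Returning (instead of rethrowing) marks the invocation as successful, so Lambda does not retry
		return { ok: false, error: err.message };
	}
};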
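
Swallowing every error switches retries off for all failures, including the ones that might succeed next time. A possible variation, sketched below, rethrows only the errors you consider transient and handles the rest. RetryableError and withErrorHandling are names I made up for the example, and deciding which errors count as retryable is up to your application:

```javascript
// Hypothetical marker class: throw this from your logic when a retry could help
class RetryableError extends Error {}

// Wraps the function logic so that only RetryableError reaches Lambda as a failure
function withErrorHandling(logic) {
  return async (event, context) => {
    try {
      return await logic(event, context);
    } catch (err) {
      if (err instanceof RetryableError) {
        // Rethrow: the invocation fails and the retry policy applies
        throw err;
      }
      // Anything else is treated as permanent: log it and return normally
      console.error('[Handler] - Non-retryable error:', err);
      return { ok: false, error: err.message };
    }
  };
}
```

The exported handler then becomes something like module.exports.handler = withErrorHandling(async (event) => { /* function logic */ }).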

The request ID

Every invocation gets its own request ID, and you will see the same ID again only when Lambda retries that invocation. In Node.js, the value is available in the context.awsRequestId property. To tell whether the current request is a retry, you therefore need to keep the request ID somewhere, for example in a simple NoSQL store such as DynamoDB or Redis. On every request, you look up the current request ID in that store; if it is already there, the current request is a retry.

DLQ

You can reroute failed events to an SNS topic or SQS queue using the dead letter queue. You can then choose to add another Lambda function that handles the unsuccessful events and forwards them to a notification system (e.g. sends a message to a Slack channel).

Using CloudFormation, Terraform or the Serverless Framework, you can set up the DLQ directly.
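
Here is a rough sketch of such a forwarding function, assuming the DLQ is an SQS queue that triggers it. The notify function is hypothetical (it could post to a Slack webhook, for instance), and I’m assuming the failure reason arrives in an ErrorMessage message attribute, so the code falls back to a default when it is missing:

```javascript
// Turns one SQS record from the DLQ into a human-readable line
function describeFailedRecord(record) {
  // Assumption: the failure reason is carried in the ErrorMessage message attribute
  const reason = record.messageAttributes?.ErrorMessage?.stringValue ?? 'unknown error';
  return `Failed event ${record.messageId} (${reason}): ${record.body}`;
}

// notify is a hypothetical async function that delivers the text somewhere (Slack, email, ...)
function createDlqHandler(notify) {
  return async (event) => {
    for (const record of event.Records) {
      await notify(describeFailedRecord(record));
    }
  };
}
```

Passing notify in as a parameter keeps the handler easy to test and lets you swap the notification channel without touching the parsing logic.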

Step Functions

With Step Functions, workflows may be represented as state machines. Some argue (for good reasons) that it is overly lengthy and laborious. But for the sake of completeness I’ll list the advantages of this solution, too.

Step Functions help in building a microservice-oriented architecture. Often, I see developers performing several actions within a single Lambda. This is the best recipe for what’s called the “Serverless Monolith”.

With the state machine technique, however, it makes more sense to execute each action in a separate state, each backed by its own Lambda.

With Step Functions, the developer can choose the number of retries and the wait time between them, as well as the transitions between states. Each task can have its own timeout, which is effectively unlimited unless you set one. If the task does not finish in the allotted time, a States.Timeout error is raised. Make sure the Task timeout matches the Lambda’s timeout.
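
To make this more concrete, here is a small state machine definition written as a JavaScript object, using the Retry, Catch and TimeoutSeconds fields of the Amazon States Language. The state names, the ARN placeholder and all the numbers are assumptions for the example, not recommended values:

```javascript
// Sketch of a state machine with one Lambda task (names, ARN and numbers are placeholders)
const stateMachineDefinition = {
  Comment: 'One Lambda task with retries, a timeout and a failure state',
  StartAt: 'ProcessOrder',
  States: {
    ProcessOrder: {
      Type: 'Task',
      Resource: 'arn:aws:lambda:REGION:ACCOUNT_ID:function:process-order', // placeholder ARN
      TimeoutSeconds: 30, // should match the Lambda function's own timeout
      Retry: [
        {
          ErrorEquals: ['States.Timeout', 'States.TaskFailed'],
          IntervalSeconds: 2, // wait before the first retry
          MaxAttempts: 3, // number of retries
          BackoffRate: 2.0, // the wait doubles after each retry
        },
      ],
      // Once the retries are exhausted, move to the failure state
      Catch: [{ ErrorEquals: ['States.ALL'], Next: 'NotifyFailure' }],
      End: true,
    },
    NotifyFailure: {
      Type: 'Fail',
      Error: 'ProcessOrderFailed',
      Cause: 'All retries were exhausted',
    },
  },
};

// The JSON form is what you would hand to your deployment tool
const definitionJson = JSON.stringify(stateMachineDefinition, null, 2);
```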

At the time of writing, only two kinds of events can start a Step Functions execution: API Gateway events and SDK calls. The most common approach is to write a proxy Lambda function that acts as the trigger. For instance, if you want SQS to trigger your state machine, the SQS queue triggers the proxy Lambda, which parses the SQS message and then calls the Step Functions StartExecution API.
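
A proxy Lambda along these lines could look like the sketch below. To keep it self-contained, the Step Functions client and the command class are passed in as parameters. The factory name createSqsProxyHandler is mine, and I’m assuming each SQS message body is a JSON document:

```javascript
// Hypothetical factory: builds an SQS-triggered handler that starts one execution per message
function createSqsProxyHandler({ sfnClient, StartExecutionCommand, stateMachineArn }) {
  return async (event) => {
    for (const record of event.Records) {
      // Parse first so a malformed body fails here rather than inside the workflow
      const payload = JSON.parse(record.body);
      const command = new StartExecutionCommand({
        stateMachineArn,
        input: JSON.stringify(payload), // StartExecution expects the input as a JSON string
      });
      await sfnClient.send(command);
    }
  };
}
```

In a real function you would wire it up with SFNClient and StartExecutionCommand from @aws-sdk/client-sfn, and read the state machine ARN from an environment variable of your choosing.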

Conclusion

Error handling in AWS Lambda can appear unclear and confusing at first.

In my opinion, the DLQ approach works well for simple jobs and small projects.

Step Functions can be a good choice in more complicated circumstances that require fine-grained control over the entire workflow and over the retry behaviour. Although this adds the cost of state transitions, you gain greater flexibility (control over the number of retries and the timeout).

As a final note, other methods, including the use of middleware (like Middy), can also be helpful in handling failures.