AWS Lambda monitoring - How to optimize function execution time and performance

20-Feb-2018 07:48 PM by Lakshmi Narayan J

Function as a service (FaaS) products like AWS Lambda, Azure Functions, and Google Cloud Functions have instigated a paradigm shift in the way Ops teams provision and manage their organization's infrastructure. With everyday administrative tasks like provisioning, patching, maintaining compliance, and configuring operating systems all being abstracted away, your Ops team only has one task to work on - writing world-class code.

But going serverless has its share of challenges, especially when it comes to monitoring. Ephemeral container models, coupled with the inability to access underlying servers, have rendered traditional methods of running agents/daemons to monitor infrastructure useless.

In this blog, we'll go over the metrics you need to keep an eye on as you start building your serverless Lambda functions.

Number of invocations

Request rate gives you an idea of how frequently your event source (trigger) is publishing events to invoke your Lambda function. Event sources can be anything from alerts and external HTTP requests to upstream services like S3, DynamoDB, and Kinesis, or even your very own custom application. Setting up alarms on the total number of requests across all your functions is especially important if you want to stay within Lambda's free tier limits. Although you can use Lambda's free tier indefinitely, it only includes one million free requests every month.
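As a rough sketch of such an alarm, the Python snippet below builds the parameters for a daily CloudWatch alarm on the Invocations metric in the AWS/Lambda namespace. The helper function, the alarm name, and the idea of spreading the one million free requests evenly over the month are assumptions made for illustration.

```python
FREE_TIER_REQUESTS_PER_MONTH = 1_000_000  # free tier figure quoted above


def invocation_alarm_params(days_in_month=30, alarm_name="lambda-daily-invocations"):
    """Build keyword arguments for a daily alarm on total Lambda invocations.

    Hypothetical helper: the alarm name and the even per-day split are assumptions.
    """
    daily_budget = FREE_TIER_REQUESTS_PER_MONTH // days_in_month
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/Lambda",
        "MetricName": "Invocations",
        # No Dimensions entry: assumed to aggregate across all functions in the region.
        "Statistic": "Sum",
        "Period": 86400,  # one day, in seconds
        "EvaluationPeriods": 1,
        "Threshold": float(daily_budget),
        "ComparisonOperator": "GreaterThanThreshold",
    }


params = invocation_alarm_params()
print(params["Threshold"])
```

With boto3 installed and credentials configured, a dictionary like this could be passed to boto3.client("cloudwatch").put_metric_alarm(**params), usually together with an alarm action such as an SNS topic.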

Duration

If you are running your application code as a back-end service, then execution time might not be your first concern. But if that code is being used to serve external HTTP requests and an end user is waiting on the other end of the line, then you have to keep your code's execution time to a minimum.

Possible reasons for slow code execution

Cold starts

Cold starts come into play in several scenarios, such as:

When your function is executed for the first time or after a long period of time.
When you update your application code to a newer version.
When you run out of warm containers or if AWS decides to swap containers.
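One common way to see how often these scenarios occur is to flag cold starts from inside the function. The sketch below relies on the fact that module-level code runs once per container; the variable names and the log format are assumptions for illustration.

```python
import time

# Module-level code runs once per container, so this flag is True only
# for the first invocation handled by a fresh container.
_is_cold_start = True
_container_started_at = time.time()


def lambda_handler(event, context):
    global _is_cold_start
    cold = _is_cold_start
    _is_cold_start = False
    container_age = time.time() - _container_started_at
    # Printed output ends up in the function's log group.
    print(f"cold_start={cold} container_age_s={container_age:.3f}")
    return {"cold_start": cold}
```

Counting the log lines with cold_start=True against the total number of invocations gives a rough cold start rate for the function.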

Error retries

With stream-based event sources like Kinesis, when your Lambda function is unable to process a record in the shard, it'll stop processing subsequent records in that shard and keep retrying until it succeeds or the record expires. If the error isn't handled properly, it adds to the delay between your event source and function.
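One possible way to handle such errors is to catch failures per record, so a single malformed record does not hold up everything behind it. This is only a sketch: handle_payload is a hypothetical stand-in for your own logic, and skipping a record means its data is dropped unless you store it somewhere else first.

```python
import base64
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def handle_payload(payload):
    """Hypothetical business logic; replace with your own processing."""
    if "id" not in payload:
        raise ValueError("payload has no 'id' field")
    return payload["id"]


def lambda_handler(event, context):
    processed, failed = [], []
    for record in event.get("Records", []):
        sequence = record["kinesis"]["sequenceNumber"]
        try:
            # Kinesis delivers each record's data base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            processed.append(handle_payload(payload))
        except (ValueError, KeyError, TypeError) as exc:
            # Log and skip the bad record instead of raising, so the batch
            # is not retried until the record expires.
            logger.error("Skipping record %s: %s", sequence, exc)
            failed.append(sequence)
    return {"processed": processed, "failed": failed}
```

Whether to skip, park, or retry a failed record depends on how much data loss your application can tolerate.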

Complexity

Next comes the complexity of your function. A straightforward function performing simple back-end processes generally takes very little time, whereas a complex function requiring external dependencies and libraries to load into memory could take a lot of time to run.

Application logic

Finally, there's the business logic of the function. Generally, a Lambda function is deployed as a glue service, connecting different components in your application architecture. Lambda functions can call any number of downstream services or external API endpoints, so issues in any one of those dependencies - network latency, non-responsiveness, throttles/retries - can add to a function's execution time.
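To find out which dependency is responsible for the delay, you can time each downstream call separately. The sketch below uses a small context manager for that; fetch_profile is a hypothetical downstream call that only sleeps, and the timings dictionary is an assumption for illustration.

```python
import time
from contextlib import contextmanager

timings_ms = {}


@contextmanager
def timed(name):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[name] = (time.perf_counter() - start) * 1000.0


def fetch_profile(user_id):
    """Hypothetical downstream call; stands in for an HTTP or SDK request."""
    time.sleep(0.01)
    return {"user_id": user_id}


def lambda_handler(event, context):
    with timed("fetch_profile"):
        profile = fetch_profile(event["user_id"])
    # Printed output ends up in the function's log group.
    print(f"downstream timings (ms): {timings_ms}")
    return profile
```

Because the timing is recorded in a finally clause, a call that raises an exception still gets its duration logged.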

Errors

Uncaught exceptions occurring as part of the Lambda runtime can cause your application code to exit, leading to function execution failure. To troubleshoot and debug your function, you can push the code-generated logs to the log group associated with your function.

You can also embed your code with logging statements to publish exception prints for specific methods. The logging calls you insert depend on the function's scripting language. You can either use common stdout or stderr statements or use specific methods like:

console.log() for Node.js
Console.Write() for C#
logging.debug() for Python
log.debug() for Java

But before you use any of those, please make sure the IAM role (execution role) assigned when you created the Lambda function has the necessary permission to publish function execution logs to CloudWatch Logs. If it doesn't have the right permissions, you'll see a permission denied error in your logs.
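For Python, a minimal sketch of such logging statements could look like the following. The parse_order function and the event fields are hypothetical and exist only to give the handler something that can fail.

```python
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)


def parse_order(event):
    """Hypothetical business logic used only for this example."""
    return {"order_id": event["order_id"], "quantity": int(event["quantity"])}


def lambda_handler(event, context):
    logger.debug("Received event: %s", event)
    try:
        return parse_order(event)
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("parse_order failed")
        # Re-raise so the invocation is still reported as an error.
        raise
```

Re-raising after logging keeps the failure visible in the function's error count while leaving the traceback in the log group for debugging.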

Throttles

If your function invocation requests become throttled, you should start worrying. Currently, Amazon Web Services has a default upper limit of 1,000 concurrent executions across all functions deployed in a particular region. That limit might seem large, but too many executions of a single function can quickly end up throttling invocation attempts of other critical functions deployed within the same region. Another key point to note is that the concurrent execution count will depend upon how your function is invoked.

For stream-based event sources, concurrency equals the number of shards: each shard drives one concurrent invocation. For event sources that aren't stream-based, multiply the number of invocations per second by the average function duration (in seconds) to get the current concurrency figure. Avoid throttles and stay within the concurrency limits by optimizing your function invocations.

For an event source that isn't stream-based, like S3, you can buffer the load by using a service like SQS.
For a stream-based event source like Kinesis, you can stay within limits by specifying the batch size and decreasing the number of shards.
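The concurrency arithmetic described above can be sketched in a few lines of Python. The request rate, duration, and shard count below are made-up numbers for illustration; only the limit of 1,000 comes from the text.

```python
ACCOUNT_CONCURRENCY_LIMIT = 1000  # default regional limit quoted above


def estimated_concurrency(invocations_per_second, avg_duration_seconds):
    """Concurrency for event sources that are not stream-based."""
    return invocations_per_second * avg_duration_seconds


def stream_concurrency(shard_count):
    """Stream-based sources: one concurrent invocation per shard."""
    return shard_count


# Illustrative workload (assumed): an HTTP-triggered function and a Kinesis consumer.
api_load = estimated_concurrency(invocations_per_second=200, avg_duration_seconds=1.5)
stream_load = stream_concurrency(shard_count=20)
total = api_load + stream_load

print(f"estimated concurrency: {total:.0f} of {ACCOUNT_CONCURRENCY_LIMIT}")
print(f"headroom: {ACCOUNT_CONCURRENCY_LIMIT - total:.0f}")
```

In this example, halving the function's duration would halve the first function's share of the limit, which is why duration and throttles are worth watching together.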

Keep tabs on your Lambda costs

For AWS Lambda, you are charged based on the number of requests (including test and error invocations) and function execution time. So, from the perspective of your finance team, these two metrics are essential.

Find the right balance

The resources used for execution (memory) and the time spent executing the function are inversely proportional to each other. You can reduce the run time of your application code by increasing the amount of allocated memory, but you'll end up paying more. Similarly, decreasing the function memory setting might help reduce costs, but it would also increase your execution time, and in the worst case, lead to timeouts or memory exceeded errors. For AWS account holders, it is essential to strike a balance between allocated memory and costs.
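Since compute is billed in GB-seconds (allocated memory multiplied by billed time), you can compare memory settings by measuring the duration at each setting. The sketch below does that comparison; the durations are made-up measurements, not benchmark results.

```python
def gb_seconds(memory_mb, billed_duration_ms, invocations=1):
    """Compute usage in GB-seconds: memory (GB) x time (s) x invocations."""
    return (memory_mb / 1024) * (billed_duration_ms / 1000) * invocations


# Illustrative measurements (assumed): billed duration in ms at each memory setting.
measurements = {128: 2400, 256: 1100, 512: 700}

usage = {
    memory: gb_seconds(memory, duration, invocations=1_000_000)
    for memory, duration in measurements.items()
}
for memory, value in usage.items():
    print(f"{memory} MB -> {value:,.0f} GB-seconds per million invocations")

cheapest = min(usage, key=usage.get)
print(f"lowest compute usage: {cheapest} MB")
```

With these assumed numbers the middle setting wins: going from 128 MB to 256 MB more than halves the duration, while going to 512 MB does not speed things up enough to pay for the extra memory.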

Each time your function is executed, it writes a log entry in the associated log group. In the above image, you can see Duration, Billed Duration, Memory Size, and Max Memory Used. Here, it's quite straightforward to see that the user has over-provisioned memory. (The function is only using 35 MB during execution, but it has 192 MB configured.) Look at your function's own log entries and compare Max Memory Used with Memory Size to determine whether your application needs more memory or whether you have over-provisioned.
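This comparison can be automated by parsing the REPORT line of each log entry. The sketch below assumes the usual REPORT line layout; the sample line is illustrative, with a made-up request ID and durations, and the memory figures taken from the example above.

```python
import re

# Illustrative REPORT line; the request ID and durations are made up.
sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 1120.35 ms\tBilled Duration: 1200 ms\t"
    "Memory Size: 192 MB\tMax Memory Used: 35 MB"
)

REPORT_PATTERN = re.compile(
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>\d+) ms\s+"
    r"Memory Size: (?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used: (?P<used_mb>\d+) MB"
)


def parse_report(line):
    """Return the four REPORT fields as floats, or None for other log lines."""
    match = REPORT_PATTERN.search(line)
    if match is None:
        return None
    return {key: float(value) for key, value in match.groupdict().items()}


report = parse_report(sample)
utilization = report["used_mb"] / report["memory_mb"]
print(f"memory utilization: {utilization:.0%}")
```

A consistently low utilization figure across many invocations suggests over-provisioning, while a figure close to 100% suggests the function needs more memory.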

Things get even more interesting when you compare Billed Duration with Duration. Imagine a scenario where your function's average invocation duration is around 1,120 ms and the average billed duration is 1,200 ms (billed duration is rounded up to the nearest 100 ms). Here you can try reducing your function's memory setting to see if you can reduce the gap between the two.
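The rounding in this scenario is easy to reproduce. The sketch below uses the 100 ms increment mentioned above; the function names are assumptions for illustration.

```python
import math

BILLING_INCREMENT_MS = 100  # rounding increment stated above


def billed_duration_ms(duration_ms):
    """Round a duration up to the next billing increment."""
    return math.ceil(duration_ms / BILLING_INCREMENT_MS) * BILLING_INCREMENT_MS


def billing_overhead_ms(duration_ms):
    """Time that is paid for but not actually used."""
    return billed_duration_ms(duration_ms) - duration_ms


print(billed_duration_ms(1120))   # 1200
print(billing_overhead_ms(1120))  # 80
```

The 80 ms gap is the part of each invocation you pay for without using it.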

Conclusion

To stay on top of everything that's going on within your Lambda environment, you really need systematic monitoring. The best way to accomplish this is by using a full-stack monitoring solution like Site24x7.

If you're new to Site24x7 and you haven't tried our AWS infrastructure monitoring capabilities yet, then all you need to do is go through an initial setup process. First, you'll need to enable access to your AWS account by creating a cross-account IAM role and granting read-only permissions. Once you're done, within minutes you'll see time-series graphs for each performance counter, with near real-time data available for analysis.