AWS Lambda Serverless, Automatic Rollbacks, and Wavefront Monitoring

April 3, 2017 Conor Beverland

Automatic rollbacks are a way to quickly revert to a previous version of your application and limit the impact on your end users. Typically, the application developer specifies the rollback conditions in a monitoring system, and if those conditions are met, the application is reverted to a previous version that is known to work.

Your application always needs to roll forward for enhancements. But it also needs to roll backwards when necessary. If there is state in a database or a cache, your application will have to be designed to take that into consideration and handle it correctly. While out of the scope of this blog post, there are a variety of ways to handle this cautiously, and rollbacks won’t help in every case. However, when used with care, they can be a valuable tool for limiting the impact of a bad code push on your users.

A Peek into Serverless

Here at Wavefront, we’re excited about the possibilities of serverless technology and all the implications around monitoring with it, and of it. A growing percentage of our customers are starting to use it too. If you’re not familiar with the concept of serverless, for the purpose of this discussion, we simply mean AWS Lambda. The key point of “serverless” is not that there are no servers, but that you don’t have to worry about them anymore, because they are handled by your cloud provider. All you focus on is the core logic of your function, and the cloud provider takes care of running it somewhere for you.

Quick side note: we recently introduced a turn-key monitoring suite for AWS cloud services, including AWS Lambda (see its prepackaged dashboard below). The suite also applies analytics with real-time AWS pricing to show you how much you can reduce your overall cloud services bill. If you’re attending the AWS Summit in San Francisco on April 18-19, come by the Wavefront booth for a product demo. Or we can get you started with a free trial immediately.

Automatic Rollback Plugin

Let’s get back to automatic rollback. In the following example, we have a test program that we’re deploying using Serverless.com’s framework. It includes an API Gateway endpoint that calls our test Lambda function. The function can be put into a mode where it returns a high error rate on the requests it’s processing, and it reports each success or error to the Wavefront Proxy.

module.exports.transaction = (event, context, callback) => {
  let status = processTransaction();
  if (status) {
    sendWavefrontMetric(1, 'demoapp.mymetric.success', callback);
  } else {
    sendWavefrontMetric(1, 'demoapp.mymetric.error', callback);
  }
};
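
The handler above relies on a sendWavefrontMetric helper that this post does not show. As a rough sketch only, here is what such a helper might look like, assuming the metric line format seen in the console output later in this post and a Wavefront Proxy reachable over TCP. The proxy host, the port default, the source name default, and the formatWavefrontLine helper are illustrative assumptions, not the demo's actual code:

```javascript
const net = require('net');

// Build one metric line: <metric> <value> <epochSeconds> source=<source>
function formatWavefrontLine(metric, value, epochSeconds, source) {
  return `${metric} ${value} ${epochSeconds} source=${source}`;
}

// Hypothetical sender: writes one line to a Wavefront Proxy over TCP.
// proxyHost, proxyPort, and source are assumptions; adjust for your setup.
function sendWavefrontMetric(value, metric, callback, options = {}) {
  const {
    proxyHost = 'localhost',
    proxyPort = 2878,
    source = 'demoapp-dev-transaction',
  } = options;
  const now = Math.floor(Date.now() / 1000);
  const line = formatWavefrontLine(metric, value, now, source);

  // Make sure the Lambda callback fires at most once.
  let done = false;
  const finish = (err, res) => {
    if (!done) {
      done = true;
      callback(err, res);
    }
  };

  const socket = net.connect(proxyPort, proxyHost, () => {
    // Send the line, close the socket, then answer the API Gateway request.
    socket.end(line + '\n', () =>
      finish(null, { statusCode: 200, body: `SENT to Wavefront: ${line}` }));
  });
  socket.on('error', (err) => finish(err));
}
```

Keeping the line formatting in its own pure function makes it easy to check the output without a running proxy.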

Our serverless.yaml configuration also specifies that the automatic rollback plugin should be used, and that it should trigger a rollback if the 5-minute moving average of the error rate goes over 20%:

plugins:
  - wavefront-serverless-rollback-plugin

custom:
  wavefrontDebugMode: false
  wavefrontForceDeploy: true
  wavefrontApiKey: "xyzxyzyx-f706-4265-9acd-efbce085e82f"
  wavefrontApiInstanceUrl: "https://try.wavefront.com"
  wavefrontRollbackAlertTriggerThreshold: 2
  wavefrontRollbackAlertCondition: 'mavg(5m, sum(msum(5m, rate(ts("demoapp.mymetric.error"))))/sum(msum(5m, rate(ts("demoapp.mymetric.*"))))) > .2'
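
To make the alert condition more concrete, the arithmetic it expresses can be approximated in a few lines of JavaScript. This is only a sketch of the idea (errors divided by all requests over a 5-sample window, smoothed with a 5-sample moving average, compared against 0.2). It assumes one sample per minute and is not how Wavefront evaluates queries internally; all function names are hypothetical:

```javascript
// Sum of the last `size` samples at each index (shorter at the start).
function movingSum(values, size) {
  return values.map((_, i) =>
    values.slice(Math.max(0, i - size + 1), i + 1).reduce((a, b) => a + b, 0));
}

// Mean of the last `size` samples at each index (shorter at the start).
function movingAverage(values, size) {
  return values.map((_, i) => {
    const recent = values.slice(Math.max(0, i - size + 1), i + 1);
    return recent.reduce((a, b) => a + b, 0) / recent.length;
  });
}

// errors[i] and total[i] are per-minute counts; total includes the errors,
// mirroring the "demoapp.mymetric.*" wildcard in the alert condition.
// Returns, for each minute, whether the smoothed error ratio exceeds the threshold.
function errorRatioBreaches(errors, total, threshold = 0.2, windowSize = 5) {
  const errorSums = movingSum(errors, windowSize);
  const totalSums = movingSum(total, windowSize);
  // Treat minutes with no traffic as a ratio of 0 to avoid dividing by zero.
  const ratios = errorSums.map((e, i) => (totalSums[i] === 0 ? 0 : e / totalSums[i]));
  return movingAverage(ratios, windowSize).map((r) => r > threshold);
}
```

For instance, a service that steadily fails 1 request in 10 stays under the threshold, while one that fails 5 in 10 breaches it at every sample.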

In the following run, we can see the automatic rollback happen after the Wavefront alert trigger fires.

$ watch -t -n0 curl -sq https://e42fywgyje.execute-api.us-east-1.amazonaws.com/dev/process
SENT to Wavefront: demoapp.mymetric.success 1 1490741766 source=demoapp-dev-transaction

As seen in the Wavefront graph above, the success rate (in percent) is high after the function is deployed as depicted by the orange line, with a tolerable amount of errors, shown by the blue line.

But then a newer faulty version is deployed that returns a higher rate of errors:

$ watch -t -n0 curl -sq https://e42fywgyje.execute-api.us-east-1.amazonaws.com/dev/process
SENT to Wavefront: demoapp.mymetric.error 1 1490743751 source=demoapp-dev-transaction

Seen above, the error rate keeps on climbing after the newer function has been deployed.

Thankfully, a Wavefront alert (configured below) is already set to catch this condition:

[Figure: alert definition]

So not long afterwards, the alert triggers a rollback to revert the faulty function to its previous version.

Now seen above, the success rate is coming back up after the alert triggered at the time marked by the highlighted event.

So, what happened here?

The Serverless.com plugin hooks into serverless’ deploy command. When the deploy command is invoked, it sets up an automated rollback as follows:

It downloads the service’s previous CloudFormation template to be used as the rollback target.
It uploads a rollback Lambda and sets up an API Gateway by which to trigger it.
It creates a Wavefront webhook and alert to trigger the rollback.
When the alert is triggered, it invokes the rollback Lambda via the API Gateway endpoint and executes the rollback using the previous version of the CloudFormation template.
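
The last step of that flow can be sketched as a small handler. This is a hypothetical illustration, not the plugin's source: loadPreviousTemplate and updateStack are made-up names for injected dependencies that, in a real deployment, would wrap the storage and CloudFormation calls. Injecting them keeps the sketch runnable without AWS:

```javascript
// Hypothetical factory for a rollback Lambda handler.
// Both dependencies are assumptions introduced for illustration:
//   loadPreviousTemplate(): resolves to the template saved at deploy time
//   updateStack(template): re-applies that template to the service's stack
function makeRollbackHandler({ loadPreviousTemplate, updateStack }) {
  return async function rollbackHandler(event) {
    // Fetch the rollback target that was stored during the deploy.
    const template = await loadPreviousTemplate();
    if (!template) {
      return { statusCode: 409, body: 'No previous template to roll back to' };
    }
    // Re-applying the previous template reverts the service to that version.
    await updateStack(template);
    return { statusCode: 200, body: 'Rollback started' };
  };
}
```

In this sketch the webhook created by the alert would simply call the API Gateway endpoint in front of the handler; the alert payload itself is not inspected.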

The code is published as an npm package and is also available here:

https://github.com/wavefrontHQ/wavefront-serverless-rollback-plugin

https://github.com/wavefrontHQ/wavefront-serverless-rollback-plugin-demo

Accuracy of Alerting Triggers

This plugin allows you to take advantage of Wavefront’s rich alerting expressions to specify when your application should be automatically rolled back. We feel that this flexibility and the ability to be accurate are really important to highlight. Alerting capabilities in most monitoring tools simply don’t have the range of analytics expressiveness to allow you to zero in on the exact conditions to alert on. Either they’re too broad, offering just simple percentages or even a single static value, or they’re too complex and add a level of indirection that requires time-consuming calibration. Alerting is a safety net. If it can’t be molded to your real-world scenarios, or it forces you to settle for false alerts or missed alerts, then there will be significant gaps in its ability to support you.

One example of a real-world problem is a service with sporadic volume, or simply one that has lulls in the early mornings, such as a web forum or a promo redemption service, where the error rate percentage is higher during the quiet periods. In the following example, we take the time of day into consideration. Between midnight and 2am, we allow for higher error rates since we know the data is noisier then.

if(between(hour("US/Pacific"), 0, 2),
  mavg(5m, sum(msum(5m, rate(ts("demoapp.mymetric.error"))))/sum(msum(5m, rate(ts("demoapp.mymetric.*"))))) > .3,
  mavg(5m, sum(msum(5m, rate(ts("demoapp.mymetric.error"))))/sum(msum(5m, rate(ts("demoapp.mymetric.*"))))) > .2)
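
The same time-of-day logic can be written out in JavaScript to show what the expression is doing. This is a sketch under assumptions: the quiet window is treated as inclusive of both hour bounds (the post does not say how between handles its edges), the function names are hypothetical, and the time zone is given as America/Los_Angeles, the IANA zone that US/Pacific points to:

```javascript
// Hour of day (0-23) for `date` in the given IANA time zone.
function hourInZone(date, timeZone) {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone,
    hour: 'numeric',
    hourCycle: 'h23',
  }).formatToParts(date);
  return Number(parts.find((p) => p.type === 'hour').value);
}

// Assumption: hours quietStart..quietEnd inclusive get the looser threshold.
function thresholdForHour(hour, quietStart = 0, quietEnd = 2) {
  return hour >= quietStart && hour <= quietEnd ? 0.3 : 0.2;
}

// True when the (already smoothed) error ratio exceeds the threshold
// that applies at this time of day.
function shouldRollBack(errorRatio, date, timeZone = 'America/Los_Angeles') {
  return errorRatio > thresholdForHour(hourInZone(date, timeZone));
}
```

With these thresholds, an error ratio of 0.25 triggers a rollback in the afternoon but is tolerated during the early-morning lull.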

So now no one is needlessly woken up at 2am, because you were able to tailor the alert to a specific real-world condition.

Rolling Back on Our Discussion

We hope this discussion has been of some use, and that the Serverless.com plugin is useful too. This is just one example of how Wavefront’s powerful, analytics-driven alerting can be integrated into the deployment process for an application and improve the end user experience by detecting a bad deployment before a human may be able to react. Automatic rollbacks are not appropriate for every deployment, of course; that depends on the code. When used with care, however, they’re a valuable tool for improving your end user experience when things go wrong.

Interested in learning more about how Wavefront can monitor your applications and cloud services on AWS, such as Lambda? Want to alert more intelligently using advanced analytics and dynamic queries to reduce alert noise, and automate more with confidence? Sign up now for a free hands-on demo and trial of the Wavefront service.

Thank you to Anggoro Dewanto at AccelByte Inc for developing the Serverless plugin.

The post AWS Lambda Serverless, Automatic Rollbacks, and Wavefront Monitoring appeared first on Wavefront by VMware.
