Intelligent Alert Design: Three Simple Tips for Increasing Alert Robustness

February 10, 2017 Parag Sanghavi

The wider adoption of microservices and containers is leading change in modern application delivery and monitoring approaches. With DevOps, where code in production is continuously updated, and application architecture is highly segmented, a metrics-driven approach coupled with analytics is critical to understanding and maintaining optimal application health. To avoid outages and performance slowdowns, DevOps, SREs and development teams must rely on intelligent alerting. This also means intelligent alerts have to work properly, despite potential metric flow disruptions. You design your cloud applications to be resistant to sub-system failures; critical alerts should have the same design robustness. False positives take their toll, no matter what the underlying cause.

Enter Wavefront! The Wavefront metrics platform offers a powerful yet intuitive way to alert analytically on all your metrics data – from application to infrastructure – no matter the shape, distribution or format. Drawing on our monitoring experience with large SaaS companies and digital enterprise leaders, here are a few simple tips for making your critical alerts even more robust:

Tip 1 – Alerts that account for delayed metrics

Network delays or slow processing of application metric data at the backend can disrupt alert processing. An alerting mechanism that is too sensitive to delayed metric data can trigger falsely: the alert fires while points are missing, then resolves once the delayed points are processed and the backfill data arrives. “Backfill data” means missing past data that is added afterwards, so that charts are complete with no voids and all formulas keep working. Adjusting the alerting query to account for delayed metric data points will prevent these false positives. Use the lag function to avoid this situation:

lag(30m, sum(ts("aws.elb.requestcount"))) < 0.3 * lag(1w, sum(ts("aws.elb.requestcount")))

The above example takes the summed “aws.elb.requestcount” value from 30 minutes ago and compares it with the value measured one week ago, checking whether the request count has dropped below 30% of that earlier value. Because the query evaluates data that is already 30 minutes old, delayed data points have time to catch up before they can falsely trigger the alert. The week-over-week comparison also takes the overall trend of the data into account.
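To make the comparison concrete, here is a minimal Python sketch of what the query above checks. It is a toy model, not Wavefront code: the `lagged_value` and `request_drop_alert` helpers, the dictionary of per-minute values, and the choice to stay quiet when a point is missing are all assumptions made for illustration.

```python
from typing import Mapping, Optional

WEEK_MINUTES = 7 * 24 * 60


def lagged_value(series: Mapping[int, float], now: int, lag_minutes: int) -> Optional[float]:
    """Return the value the series had `lag_minutes` before `now`, or None if absent."""
    return series.get(now - lag_minutes)


def request_drop_alert(
    series: Mapping[int, float],
    now: int,
    short_lag: int = 30,
    long_lag: int = WEEK_MINUTES,
    ratio: float = 0.3,
) -> bool:
    """Toy version of: lag(30m, x) < 0.3 * lag(1w, x)."""
    recent = lagged_value(series, now, short_lag)
    baseline = lagged_value(series, now, long_lag)
    if recent is None or baseline is None:
        # Assumption of this sketch: with nothing to compare, do not alert.
        return False
    return recent < ratio * baseline


# Hypothetical data: minute timestamps mapped to summed request counts.
now = 20_000
dropped = {now - 30: 250.0, now - WEEK_MINUTES: 1000.0}
healthy = {now - 30: 900.0, now - WEEK_MINUTES: 1000.0}

print(request_drop_alert(dropped, now))  # True: 250 < 0.3 * 1000
print(request_drop_alert(healthy, now))  # False: 900 >= 300
```

Points from the most recent 30 minutes never enter the comparison, so a late-arriving point inside that period cannot flip the result.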

As an alternative approach, it’s possible to set the “minutes-to-fire” threshold higher than the default two minutes. The right value depends on how often data points arrive, and it should cover the possible delays in the application metrics delivery pipeline. A higher threshold compensates for external delays in delivering metrics to the Wavefront Collector service.
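The effect of a higher threshold can be illustrated with a small Python sketch. This is a simplified model and an assumption on our part: it fires only when the condition held at every one of the last N evaluated minutes, and Wavefront's actual evaluation rules may differ. The `should_fire` name and the list of per-minute booleans are hypothetical.

```python
from typing import Sequence


def should_fire(condition_by_minute: Sequence[bool], minutes_to_fire: int) -> bool:
    """Fire only if the condition was true for each of the last N minutes.

    Simplified model for illustration; not Wavefront's implementation.
    """
    if minutes_to_fire <= 0:
        raise ValueError("minutes_to_fire must be positive")
    if len(condition_by_minute) < minutes_to_fire:
        return False  # not enough history yet
    return all(condition_by_minute[-minutes_to_fire:])


# A two-minute blip, for example caused by points that are still in transit.
history = [False, False, False, True, True]

print(should_fire(history, 2))  # True: the default threshold reacts to the blip
print(should_fire(history, 5))  # False: a longer threshold rides it out
```

The trade-off is latency: a real outage also takes longer to raise an alert when the threshold is higher.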

Tip 2 – Alerts that account for missing metric data points

Sometimes, your host or application can stop sending metrics. Use the `mcount()` function to compensate for that situation. It will count the number of reported points per time series in the last ‘X’ minutes. A general query could be:

mcount(5m, ts(my.metric)) = 0

Depending on the particular implementation and use case, see the following recommendations:

The interval associated with `mcount()` should be tailored to your set of data. If data is expected to be reported once a minute, then `mcount(30s,)` is not a good approach. Using `mcount(1m)` can also result in false positives due to delays. `mcount(5m,)` is a better choice, as it requires 5 minutes of “NO DATA” to trigger.

The `= 0` clause in the prior example query can also be tweaked. If you want to know when “NO DATA” at all is being reported, then it’s the right approach. However, if you expect data to be reported once a minute, and you’d like to know when it’s not consistently getting reported, then `mcount(5m, ts(my.metric)) <= 3` would work better. With this approach, the alert is triggered as soon as two of the five expected data points are missing in the 5-minute window.
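The difference between the `= 0` and `<= 3` variants can be sketched in Python. This is a toy model of a moving count and not the Wavefront implementation: the `mcount` function below, its window boundaries (points newer than `now - window` up to and including `now`), and the helper names are assumptions for illustration.

```python
from typing import Iterable


def mcount(window_minutes: int, timestamps: Iterable[int], now: int) -> int:
    """Count points reported in the window (now - window_minutes, now]."""
    start = now - window_minutes
    return sum(1 for t in timestamps if start < t <= now)


def no_data(timestamps: Iterable[int], now: int, window: int = 5) -> bool:
    """Toy version of: mcount(5m, ts(my.metric)) = 0"""
    return mcount(window, timestamps, now) == 0


def inconsistent_data(timestamps: Iterable[int], now: int,
                      window: int = 5, max_points: int = 3) -> bool:
    """Toy version of: mcount(5m, ts(my.metric)) <= 3"""
    return mcount(window, timestamps, now) <= max_points


# A metric expected once a minute, with minutes 8 and 9 missing.
reported = [1, 2, 3, 4, 5, 6, 7, 10]
now = 10

print(mcount(5, reported, now))          # 3 (minutes 6, 7 and 10)
print(no_data(reported, now))            # False: some data is still arriving
print(inconsistent_data(reported, now))  # True: two of five points are missing
```

With only one point missing in the window the count is 4, so the `<= 3` variant stays quiet.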

Tip 3 – Alerts that check metric data flows from Wavefront Proxy

The Wavefront Proxy is software that allows you to collect application and infrastructure metrics using your open-source agent of choice, such as collectd, statsd, telegraf and others. After collecting the data, the Wavefront Proxy pushes it to the hosted Wavefront Collector service. It’s essential to confirm that the Proxy is checking in with the Collector service to ensure that metrics data are actually being pushed to the cloud. To create this alert, use the following Wavefront time-series query:

default(1d, 2m, -1, sum(rate(ts(~agent.check-in)), sources)) = -1

This query uses the ~agent.check-in metric to verify that each Wavefront Proxy is reporting in. If a Proxy doesn’t report, the default function supplies a default value of -1 in place of the missing data, so the alert triggers when the value equals -1.
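The gap-filling idea behind this alert can be sketched in Python. This is a simplified model built on assumptions, not Wavefront's implementation: here the default value is applied once the gap since the last reported point is at least `delay` minutes and for at most `window` minutes afterwards. The `with_default` and `proxy_silent` helpers and the per-minute dictionary are hypothetical.

```python
from typing import Mapping, Optional


def with_default(series: Mapping[int, float], now: int,
                 window: int, delay: int, default: float) -> Optional[float]:
    """Return the reported value at `now`, or `default` inside a qualifying gap."""
    if now in series:
        return series[now]
    earlier = [t for t in series if t < now]
    if not earlier:
        return None  # the series never reported, so there is nothing to fill
    gap = now - max(earlier)
    if delay <= gap <= window:
        return default
    return None  # gap too short to matter yet, or older than the window


def proxy_silent(check_ins: Mapping[int, float], now: int,
                 window: int = 24 * 60, delay: int = 2,
                 default: float = -1.0) -> bool:
    """Toy version of: default(1d, 2m, -1, <check-in rate>) = -1"""
    return with_default(check_ins, now, window, delay, default) == default


# Hypothetical check-in rate, reported every minute from minute 0 to minute 10.
check_ins = {minute: 1.0 for minute in range(11)}

print(proxy_silent(check_ins, 10))  # False: the proxy reported this minute
print(proxy_silent(check_ins, 11))  # False: a one-minute gap is under the delay
print(proxy_silent(check_ins, 12))  # True: two minutes of silence
```

In this model the 2-minute delay keeps a single late check-in from firing the alert, and the 1-day window bounds how long a silent proxy keeps being flagged.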

To learn more about Wavefront’s query-driven smart alerting on metrics, refer to our documentation. Try the Wavefront platform today for free!

The post Intelligent Alert Design: Three Simple Tips for Increasing Alert Robustness appeared first on Wavefront by VMware.
