The Holberton School of Software Engineering recently hosted a meetup in San Francisco with presenters from the API company Algolia and from Wavefront. Wavefront’s CTO, Dev Nag, opened with an intro to data science in the context of operational monitoring and anomaly detection. Then Algolia’s CTO, Julien Lemoine, put Dev’s words into real-world context. He detailed how Algolia uses Wavefront analytics to optimize its Hosted Search API service for its 1,600 customers worldwide.
The video of Julien’s talk can be found here.
About Algolia’s Business
Launched in September 2013, Algolia provides a hosted search API that focuses on developer and user experiences. End users perceive this service as instantaneous, from anywhere. Algolia does this by replicating your search engine in different locations worldwide. Combined with a different way to handle ranking, you can have a good mix of textual relevance and business data without trying to do dark magic by tuning weights all day.
Algolia’s customers range from startups to Fortune 100 companies, including Arcteryx, DC Shoes, Crunchbase, Digital Ocean, Vevo, Medium, and Hacker News. In summary, here is Algolia in a few numbers:
- 1,600+ customers in 100+ countries
- 30B+ write operations per month
- 12B+ user-generated queries per month
Monitoring API is Critical for Algolia
Search performance, in terms of response time and availability, is a key differentiator for the Algolia service, and Algolia engineers have to manage it constantly so that it keeps improving as the service scales. Algolia has infrastructure in 36 data centers in 15 different regions (primarily operating on 500 bare metal servers), and they continue to increase their worldwide presence. They handle a large volume of API calls per month, both reads and writes, since their clients call them each time there is a data change.
For these reasons, monitoring is very important for Algolia. They initially used Server Density to manage their infrastructure and Datadog for the front end, but it quickly became apparent that they needed monitoring with much finer precision, greater scale (particularly under API load), and an analytics-driven approach to be more proactive. They started looking at a variety of open source tools, but each tool had scale issues with their data load and the variety of metrics they wanted to work with. Then, other companies with a similar focus on performance at scale recommended Wavefront to them. Upon using Wavefront (full deployment occurred in less than 2 weeks), they quickly realized it satisfied all of their needs. It soon became their one place to store all metrics. These include both internal (operational) and external (business) metrics, and Algolia currently pushes over 250K metric data points daily.
Key operational metrics include:
- CPU, RAM, disk usage, read/write IOs
- SSD state
- Indexing speed / queue size
- Network, inter-provider latency
- Number of crashes
- User agent
To Algolia, Wavefront is much more than a monitoring solution. As Julien described it, “Wavefront is a data science studio for DevOps and Ops, and it’s used by practically all of our engineers across a variety of use cases.”
Wavefront Use Case #1: Debug Any Problem
As Julien explained, “Wavefront is an essential tool for us to debug any problem. Wavefront templates are extremely powerful. Templates can also quickly visualize a specific part of the infrastructure. Having all the data in one place combined with the power of its query engine gives us a lot of options to view the data and quickly find the needle in the haystack.”
“We use Wavefront to analyze performance, particularly for analysis of the global service in aggregate. For some problems, it’s really important to apply analytics and visualize the sum of many time series data streams, such as when detecting and acting on DDoS attacks. As an example, the graph below shows the total number of queries in our system, comparing the current week with the prior one. Only a few DNS services are saturated, but they result in a delay of several seconds. The DNS resolution timeouts were caused by an unexpected bug at a provider, compounded by some exotic DNS servers’ behavior (IPv6 related) – together, this led to a difficult day! But Wavefront made it very clear to us what was happening, and we were able to reroute key customers earlier to minimize their disruption.”
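The kind of aggregate, week-over-week comparison Julien describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Wavefront's query language or Algolia's code: the function names `sum_series` and `week_over_week_ratio` are hypothetical, and the sketch assumes every series is sampled at the same timestamps.

```python
def sum_series(series_list):
    """Sum several equally sampled time series point by point.

    series_list: list of lists, one per source (e.g. per server), all the
    same length and aligned on the same timestamps (an assumption here).
    """
    return [sum(points) for points in zip(*series_list)]


def week_over_week_ratio(current_week, prior_week):
    """Ratio of the current total to the prior total at each sample.

    Returns None for samples where the prior week's value is zero,
    since the ratio is undefined there.
    """
    return [c / p if p else None for c, p in zip(current_week, prior_week)]


# Hypothetical per-server query counts for two servers, three samples each.
current_total = sum_series([[120, 130, 400], [110, 125, 390]])
prior_total = sum_series([[115, 128, 131], [112, 120, 127]])
ratios = week_over_week_ratio(current_total, prior_total)
```

A ratio that jumps well above 1 in the aggregate, while no single server looks unusual on its own, is the sort of signal that only shows up once the individual streams are summed.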
Wavefront Use Case #2: Define Intelligent Alerts
For Algolia, alerting is essential to being proactive and responsive. Algolia currently defines 38 different types of alerts within Wavefront, each classified and processed in two ways:
- Informative – useful for trending and forensics, logged over email; for example, a software crash.
- Critical – requires a clear action and notifies whoever is on call via PagerDuty; for example, indexing has stopped.
Julien explained, “Static, threshold-based alerts based on a single metric just don’t work for us, and this is another reason why we had to replace our previous monitoring tools. Wavefront has a great UI for creating truly intelligent, dynamic alerts. Its query language is the best out there, and we love how alert creation is so well integrated right within the dashboard, not as some separate tool within the platform.”
“As an example of the sophistication of alerts that Algolia needs to create, here is an alert (see figure) that looks at the evolution of jobs in the queue over time. This alert is defined to fire only when the number of jobs crosses 2 standard deviations, its maximum is higher than 3 standard deviations, and the number of jobs keeps growing. Not all of our alarms are this complex, but it is precisely this analytics-driven approach to alert creation that allows us to be more precise and actionable with alerts. We like that Wavefront gives us lots of flexibility to customize alerts.”
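The three-part condition described above can be sketched in plain Python. This is an illustration only, not Wavefront's query language or Algolia's actual alert definition. The function name `should_fire`, the split into a `baseline` window and a `recent` window, and the reading of each clause (the latest value is more than 2 standard deviations above the baseline mean, the recent maximum is more than 3 standard deviations above it, and the latest value is higher than the first value in the recent window) are all assumptions made for the example.

```python
from statistics import mean, pstdev


def should_fire(baseline, recent):
    """Return True only when all three conditions on the queue size hold.

    baseline: historical queue sizes used to estimate normal behaviour
    recent:   the most recent queue sizes, oldest first
    Both inputs and the interpretation of each condition are assumptions.
    """
    if len(baseline) < 2 or len(recent) < 2:
        return False
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return False  # flat baseline: no meaningful deviation scale

    latest_above_2_sigma = recent[-1] > mu + 2 * sigma
    max_above_3_sigma = max(recent) > mu + 3 * sigma
    still_growing = recent[-1] > recent[0]
    return latest_above_2_sigma and max_above_3_sigma and still_growing


# Hypothetical job-queue sizes.
normal = [10, 12, 11, 9, 10, 11, 10, 12]
print(should_fire(normal, [12, 15, 14]))  # spike that is still above where it started
print(should_fire(normal, [15, 14, 13]))  # spike that is already draining
```

Combining the conditions means a brief spike that is already draining does not fire, which is the point of moving beyond a static threshold on a single value.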
Wavefront Use Case #3: Enable Optimization Loop
Julien elaborated, “It is critical for us to continue to move away from being completely reactive to problems and instead become more proactive in avoiding problems altogether. To do this well, you need a culture that’s proactive, and you need the richness in measurements to know how you’re doing.”
Algolia engineers are now spending more time on improvements to optimize performance on both its multi-tenant and single-tenant environments. They use Wavefront to perform analytical comparisons to focus optimization work, and to see how resulting enhancements are performing.
“In our previous monitoring systems, we had to think ahead of time about what metrics to store and how it would be stored, and often we ended up not having the historical data we needed to do optimizations. With Wavefront, all the data is there and the visualizations are rich with the query engine. We can easily view aggregated system data in the past with the detail as if it’s happening right now. We can also correlate different data points and better understand their relationships. We can see all of this aggregated and correlated data on dashboards and change the views very quickly – it is very useful for optimization work at the system level.”
Wavefront Use Case #4: Engine for Monitoring API
As an API company, Algolia now offers customers access to its system metrics and analytic capabilities via a full monitoring API. They use Wavefront to create and expose this monitoring data to every customer. Algolia customers can query the metric data and the queries go directly to Wavefront via its API.
Julien explained, “Wavefront stores all the metric data and we wrote some code to share it with the customer portal using the Wavefront API. Customers can view metrics in real-time, and query the data as they see fit. The Wavefront API is complete and well documented, so doing this integration was easy.”
Every day, Algolia engineers have lots of questions about operational performance, and they don’t want to spend a huge amount of time trying to find the answers. Wavefront helps them find the answers faster and more easily. This is why Algolia sees Wavefront as much more than monitoring; it’s data science for DevOps and Ops.
Julien shared what’s next for Wavefront at Algolia, “Right now, 250K metric data points go into Wavefront daily. We continue to add more metrics and alerts every week. The next iteration is to bring in more business metrics into Wavefront from Salesforce.com so that even more business and operational metrics can be correlated.”
“We like that Wavefront does not compress the stored data, so it’s an excellent archive of our system performance to baseline and compare historically. We also like that we can control the precision of each metric, e.g. if we are suspicious of a specific infrastructure region, we can set a finer granularity of the metrics there, noting that our queries are <1 sec, so if they are >1 sec, we need a very fine granularity of the metrics to troubleshoot what is happening. This is something that Wavefront does very well.”
“What’s also next is that we have started to move into automation of problem handling, so that some problems can be remediated without human intervention. We have realized that Wavefront’s ability to create more intelligent and better quality alerts increases our confidence that the alert isn’t a false positive. In fact, with Wavefront, we have effectively eliminated all the alert noise, which then allows us to create new metrics that better characterize the state of the system. So we feel we’re ready to start with some of these automations.”
Algolia is just one example of the many SaaS leaders that rely on Wavefront to give them real competitive advantage. To learn more about the Wavefront service, visit Wavefront.com or get started with Wavefront.