Home · ExpediaGroup/adaptive-alerting Wiki · GitHub
Willie Wheeler edited this page Jun 19, 2018 · 4 revisions
Welcome to the Adaptive Alerting wiki. Please use the menu in the sidebar to have a look around.
We're creating AA with several goals in mind:
- The overarching goal is to reduce mean time to detect (MTTD) for production incidents. This is the average time between the onset of a production incident and somebody knowing about it, where "somebody" could be a person or an automated response system. There are different ways of thinking about what it means to "know about" an incident, but for us an alert that gets lost in a flood of alerts doesn't count, even if there's some sense in which the monitoring system "knew about" the incident. (AA can monitor things besides production incidents too. For example, it can monitor signals that predict production incidents. But fundamentally we're trying to keep the site up.)
- To support this, we need to monitor as many signals of health as possible. Otherwise we miss problems, and MTTD suffers. The signals or metrics in question can represent business-, application- or system-level concerns. Our working assumption is that in a large business there are many thousands if not millions of such concerns, and so AA needs to scale accordingly.
- For scalability, we must aggressively limit the number of false positives (i.e., spurious alerts). These draw attention away from the true positives and undermine the effectiveness of the monitoring system.
- Also for scalability, we must automate model selection and tuning. Typical users won't know the difference between (say) an EWMA-based anomaly detector and an LSTM-based anomaly detector, much less how to tune them. Multiply that by thousands or millions of metrics and it's clear that we have to automate this.
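To make the tuning burden concrete, here is a minimal sketch of what an EWMA-based anomaly detector might look like. It is an illustration only, not AA's implementation: the class name `EwmaDetector` and the parameters `alpha`, `threshold_sigmas` and `warmup` are hypothetical, chosen to show the kind of knobs a user would otherwise have to set by hand for every metric.

```python
import math


class EwmaDetector:
    """Illustrative EWMA-based anomaly detector (hypothetical, not AA's code)."""

    def __init__(self, alpha=0.2, threshold_sigmas=3.0, warmup=5):
        self.alpha = alpha                        # weight given to the newest point
        self.threshold_sigmas = threshold_sigmas  # width of the "normal" band
        self.warmup = warmup                      # points to see before flagging
        self.mean = None                          # running EWMA of the metric
        self.variance = 0.0                       # running EW variance
        self.count = 0                            # points observed so far

    def observe(self, value):
        """Score one point against the current estimate, then update it."""
        if self.mean is None:
            # The first point only seeds the running mean.
            self.mean = value
            self.count = 1
            return False

        deviation = value - self.mean
        std = math.sqrt(self.variance)
        is_anomaly = (
            self.count >= self.warmup
            and std > 0
            and abs(deviation) > self.threshold_sigmas * std
        )

        # Incremental exponentially weighted mean and variance updates.
        self.mean += self.alpha * deviation
        self.variance = (1 - self.alpha) * (
            self.variance + self.alpha * deviation ** 2
        )
        self.count += 1
        return is_anomaly


detector = EwmaDetector()
series = [10, 11, 9, 10, 11, 10, 9, 10, 50]
flags = [detector.observe(v) for v in series]
```

Even this small detector has three parameters whose good values depend on the metric's noise level and seasonality. A larger `alpha` adapts faster but forgets the baseline sooner, and a smaller `threshold_sigmas` catches more incidents at the cost of more false positives. Choosing these per metric is the work that has to be automated at scale.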