My reading notes on the paper "Opprentice: Towards Practical and Automatic Anomaly Detection Through Machine Learning". - Jeanva
ABSTRACT
However, even though dozens of anomaly detectors have been proposed over the years, deploying them to a given service remains a great challenge, requiring manually and iteratively tuning detector parameters and thresholds. This paper tackles this challenge through a novel approach based on supervised machine learning. Note: the paper mainly targets the problem of manually tuning detector parameters and thresholds.
With our proposed system, Opprentice (Operators' apprentice), operators' only manual work is to periodically label the anomalies in the performance data with a convenient tool. Note: training on the KPI stream still requires manual labeling, albeit with a convenient tool.
Then the features and the labels are used to train a random forest classifier to automatically select the appropriate detector-parameter combinations and the thresholds. Note: a random forest automatically selects the detector-parameter combinations and thresholds.
For three different service KPIs in a top global search engine, Opprentice can automatically satisfy or approximate a reasonable accuracy preference (recall ≥ 0.66 and precision ≥ 0.66). Note: given a preset accuracy goal, the system either satisfies it automatically or comes as close as it can.
Keywords
Anomaly Detection; Tuning Detectors; Machine Learning
1. INTRODUCTION
there exists no convenient method to automatically match operators' practical detection requirements with the capabilities of different detectors. Note: there is still no convenient way to fit detectors to practical requirements.
operators are used to specifying simple requirements for detection accuracy and manually spot-check anomalies occasionally. As a result, services either settle with simple static thresholds (e.g., Amazon CloudWatch Alarms [24]), intuitive to operators although unsatisfying in detection performance, or, after time-consuming manual tuning by algorithm designers, end up with a detector specifically tailored for the given service, which might not be directly applicable to other services. Note: operators tend to settle for simple static conditions, which yields mediocre detection, while hand-tuned detectors do not transfer to other services.
the first step for the anomaly detection practitioner is to collect the requirements from the service operators. This step encounters Definition Challenges: it is difficult to precisely define anomalies in reality. Note: anomalies themselves are hard to define precisely.
In addition, it is often impossible for the operators to quantitatively define anomalies. Note: operators can rarely quantify what an anomaly is.
Detector Challenges: In order to provide a reasonable detection accuracy, selecting the most suitable detector requires both the algorithm expertise and the domain knowledge about the given service KPI. Note: finding a good detector takes both algorithm expertise and domain knowledge.
Our approach relies on two key observations. First, it is straightforward for operators to visually inspect the time series data and label anomaly cases they identified. Note: operators naturally spot time-series anomalies on a plot.
The second key observation is that the anomaly severities measured by different detectors can naturally serve as the features in machine learning, so each detector can serve as a feature extractor. Note: the anomaly severities reported by different detectors can be used as learning features.
Specifically, multiple detectors are applied to the KPI data in parallel to extract features. Then the features and the labels are used to train a machine learning model, i.e., random forests [28], to automatically select the appropriate detector-parameter combinations and the thresholds. The training objective is to maximally satisfy the operators' accuracy preference. Note: operators supply the labels, the detectors supply the features, and a random forest is trained until it meets the operators' accuracy preference.
More importantly, Opprentice takes operators only tens of minutes to label data.
We believe this is the first anomaly detection framework that does not require manual detector selection, parameter configuration, or threshold tuning. Note: claimed as the first framework of its kind.
2. BACKGROUND
2.1 KPIs and KPI Anomalies
Beyond the physical meanings, the characteristics of these KPI data are also different.
Since we have to hide the absolute values, we use the coefficient of variation ($C_{v}$) to measure the dispersions, which equals the standard deviation divided by the mean.
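As a quick illustration (a minimal sketch; the toy numbers below are made up), $C_v$ can be computed like this:

```python
import numpy as np

# Toy KPI series; the values are made up for illustration.
kpi = np.array([100.0, 102.0, 98.0, 310.0, 101.0, 99.0])

# Coefficient of variation: dispersion relative to the mean, which stays
# meaningful even when the absolute values must be hidden.
cv = np.std(kpi) / np.mean(kpi)
print(f"C_v = {cv:.3f}")
```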
2.2 Problem and Goal
The KPI data labeled by operators are the so-called "ground truth". Note: the operators' labels constitute the ground truth.
The fundamental goal of anomaly detection is to be accurate, e.g., identifying more anomalies in the ground truth, and avoiding false alarms. Note: catch as many true anomalies as possible while minimizing false positives.
For example, the operators we worked with specified "recall ≥ 0.66 and precision ≥ 0.66" as the accuracy preference, which is considered as the quantitative goal of Opprentice in this paper. Note: the goal is quantified.
As anomalies are relatively few in the data, it is difficult for those detectors to achieve both high recall and precision.
precision and recall are often conflicting. The trade-off between them is often adjusted according to real demands. For example, busy operators are more sensitive to precision, as they do not want to be frequently disturbed by many false alarms. Note: busy operators prioritize precision.
On the other hand, operators would care more about recall if a KPI, e.g., revenue, is critical, even at the cost of a little lower precision. Note: for KPIs whose anomalies are costly, recall should be set high.
Opprentice has one qualitative goal: being automatic enough so that the operators would not be involved in selecting and combining suitable detectors, or tuning them. Note: the qualitative goal is that operators need not select, combine, or tune detectors at all.
3. OPPRENTICE OVERVIEW
3.1 Core Ideas
Opprentice approaches the above problem through supervised machine learning. Supervised machine learning can be used to automatically build a classification model from historical data, and then classify future data based on the model. Note: a supervised-learning approach.
we use existing basic anomaly detectors to quantify the anomalous level of the data from their own perspectives, respectively. Note: each detector measures the anomalous level from its own perspective.
The results of the detectors are used as the features of the data. The features and operators' labels together form the training set. Then a machine learning algorithm takes advantage of a certain technique to build a classification model. Note: the detectors' outputs serve as features.
Opprentice is the first framework that uses machine learning to automatically combine and tune existing detectors to satisfy operators' detection requirements (anomaly definitions and the detection accuracy preference). Furthermore, to the best of our knowledge, this is the first time that different detectors are modeled as the feature extractors in machine learning. Note: the first automatic combination and tuning of basic detectors.
3.2 Addressing Challenges in Machine Learning
When learning from such "imbalanced data", the classifier is biased towards the large (normal) class and ignores the small (anomaly) class. We solve this problem in §4.5 through adjusting the machine learning classification threshold (cThld henceforth). Note: class imbalance is handled by adjusting the classification threshold.
some of the features would be either irrelevant to the anomalies or redundant with each other. We solve this problem by using an ensemble learning algorithm, i.e., random forests. Note: irrelevant and redundant features are handled by the random forest.
4. DESIGN
4.1 Architecture
First, before the system starts up, operators specify an accuracy preference (recall ≥ x and precision ≥ y), which we assume does not change in this paper. This preference is later used to guide the automatic adjustment of the cThld. Note: the stated goal guides the threshold adjustment.
Second, the operators use a convenient tool to label anomalies in the historical data at the beginning and label the incoming data periodically (e.g., weekly). All the data are labeled only once. Note: a visual tool is used for labeling.
4.2 Labeling Tool
the data of the last day and the last week are also shown in light colors
4.3 Detectors
4.3.1 Detectors As Feature Extractors
we represent different detectors with a unified model: data point $\rightarrow$ severity $\rightarrow$ {1, 0}. Note: every detector fits this one unified model.
First, when a detector receives an incoming data point, it internally produces a non-negative value, called severity, to measure how anomalous that point is.
For example, Holt-Winters [6] uses the residual error (i.e., the absolute difference between the actual value and the forecast value of each data point) to measure the severity. Note: Holt-Winters uses the residual as the severity.
historical average [5] assumes the data follow a Gaussian distribution, and uses how many times of standard deviation the point is away from the mean as the severity. Note: historical average uses multiples of the standard deviation.
Afterwards, a detector further needs a threshold to translate the severity into a binary output, i.e., anomaly (1) or not (0). We call this threshold the severity threshold: sThld henceforth.
Since the severity describes the anomalous level of data, it is natural to deem the severity as the anomaly feature. Note: severity serves as the feature.
We call a detector with specific sampled parameters a (detector) configuration. Thus a configuration acts as a feature extractor. Note: one parameter configuration of a detector is one feature extractor.
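A minimal Python sketch of this unified model (the class and method names are mine, not the paper's): each configuration maps a data point to a severity; classic detection would threshold it with an sThld, whereas Opprentice keeps the raw severity as a feature.

```python
from abc import ABC, abstractmethod

class DetectorConfiguration(ABC):
    """A detector with fixed, sampled parameters: one feature extractor."""

    @abstractmethod
    def severity(self, point: float) -> float:
        """Non-negative score measuring how anomalous the point is."""

    def detect(self, point: float, sthld: float) -> int:
        """Traditional usage: threshold the severity into anomaly (1) or not (0).
        Opprentice skips this step and feeds the raw severity to the classifier."""
        return 1 if self.severity(point) >= sthld else 0
```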
4.3.2 Choosing Detectors
When choosing detectors, we have two general requirements. First, the detectors should fit the above model, or they should be able to measure the severities of data.
Second, since anomalies should be detected timely, we require that the detectors can be implemented in an online fashion. This requires that once a data point arrives, its severity should be calculated by the detectors without waiting for any subsequent data.
Note: detectors must work online; a point's severity is computed the moment it arrives.
4.3.3 Sampling Parameters
We have two strategies to sample the parameters of detectors. The first one is to sweep the parameter space.
For example, EWMA (Exponentially Weighted Moving Average) [11], a prediction based detector, has only one weight parameter $\alpha \in$ [0, 1]. As $\alpha$ goes up, the prediction relies more upon the recent data than the historical data. Note: the larger EWMA's $\alpha$, the more the prediction depends on recent data.
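A sketch of EWMA as a prediction-based detector, following the residual-as-severity pattern the paper describes for Holt-Winters (the exact update rule here is my assumption):

```python
class EWMADetector:
    """Prediction-based detector: severity is the absolute residual between
    the actual value and the EWMA forecast."""

    def __init__(self, alpha: float):
        self.alpha = alpha     # larger alpha: forecast leans on recent data
        self.forecast = None   # running EWMA prediction

    def severity(self, point: float) -> float:
        if self.forecast is None:   # first point: nothing to compare against
            self.forecast = point
            return 0.0
        residual = abs(point - self.forecast)
        # Update the forecast online; no future data is needed.
        self.forecast = self.alpha * point + (1 - self.alpha) * self.forecast
        return residual

# Sweeping the parameter space: one configuration per sampled alpha.
configs = [EWMADetector(a) for a in (0.1, 0.3, 0.5, 0.7, 0.9)]
```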
On the other hand, the parameters of some complex detectors, e.g., ARIMA (Autoregressive Integrated Moving Average) [10], can be less intuitive. To deal with such detectors, we estimate their "best" parameters from the data, and generate only one set of parameters, or one configuration for each detector. Besides, since the data characteristics can change over time, it is also necessary to update the parameter estimates periodically. Note: for complex detectors, the "best" parameters are estimated directly from the data.
4.4 Machine Learning Algorithm
4.4.1 Considerations and Choices
in our problem, there are redundant and irrelevant features, caused by using detectors without careful evaluation. A promising algorithm should be less-parametric and insensitive to its parameters, so that Opprentice can be easily applied to different data sets. Note: since the features are not carefully screened, the suitable algorithm is one that is insensitive to its own parameters.
4.4.2 Random Forest
Preliminaries: decision trees. A decision tree [41] is a popular learning algorithm as it is simple to understand and interpret. At a high level, it provides a tree model with various if-then rules to classify data.
The numbers on branches, e.g., 3 for time series decomposition, are the feature split points. Note: each split point carries a concrete numeric value.
In the decision tree, a feature is more important for classification if it is closer to the root. Note: the closer to the root, the more important the feature.
There are two major problems of decision trees. One is that the greedy feature selection at each step may not lead to a good final classifier; the other is that the fully grown tree is very sensitive to noisy data and features, and would not be general enough to classify future data, which is called overfitting. Note: two problems of decision trees: greedy selection is not globally optimal, and fully grown trees overfit.
A random forest is an ensemble classifier using many decision trees. Its main principle is that a group of weak learners (e.g., individual decision trees) can together form a strong learner. Note: a combination of weak classifiers forms a strong classifier.
First, each tree is trained on subsets sampled from the original training set. Second, instead of evaluating all the features at each level, the trees only consider a random subset of the features each time.
All the trees are fully grown in this way without pruning. The random forest then combines those trees by majority vote. Note: no pruning is needed, which removes yet another parameter (tree depth) to tune.
By default, the random forest uses 50% as the classification threshold (i.e., cThld). Note: the default threshold is 50% of the votes.
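A scikit-learn based sketch of this setup (the feature matrix, label vector, and forest size are placeholders of mine; the paper does not prescribe them): detector severities form the columns of X, operators' labels form y, and a custom cThld is applied to the trees' vote fraction instead of predict()'s default 0.5.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X: one row per data point, one column per detector configuration (severity).
# y: operators' labels, 1 = anomaly, 0 = normal. Both are placeholders here.
X = np.random.rand(1000, 14)
y = np.zeros(1000, dtype=int)
y[::20] = 1                                 # ~5% anomalies: imbalanced data

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict() would apply the default 0.5 vote threshold; instead, threshold
# the fraction of trees voting "anomaly" with a custom cThld.
cthld = 0.3
votes = forest.predict_proba(X)[:, 1]
predictions = (votes >= cthld).astype(int)
```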
4.5 Configuring cThlds
4.5.1 PC-Score: A Metric to Select Proper cThlds
We need to configure cThlds rather than using the default one (e.g., 0.5) for two reasons. Note: the cThld here is the random forest's vote-fraction threshold, 0.5 by default.
Configuring cThlds is a general method to trade off between precision and recall [31]. In consequence, we should configure the cThld of random forests properly to satisfy the operators' preference. Note: the operators' precision/recall trade-off determines the threshold.
PR curves are widely used to evaluate the accuracy of a binary classifier [45], especially when the data is imbalanced. A PR curve plots precision against recall for every possible cThld of a machine learning algorithm (or for every sThld of a basic detector). Note: the PR curve traces out the outcome of every possible threshold.
F-Score based method, which selects the point that maximizes F-Score = $\frac{2 \cdot precision \cdot recall}{precision + recall}$. Note: the F-Score combines the two metrics.
we develop a simple but effective accuracy metric based on F-Score, namely PC-Score (preference-centric score), to explicitly take operators' preference into account when deciding cThlds. Note: the PC-Score incorporates the operators' preference.
we calculate its PC-Score as follows (for a point with recall $r$ and precision $p$, given the preference "recall ≥ R and precision ≥ P"):
$$\text{PC-Score}(r, p) = \begin{cases} \frac{2rp}{r+p} + 1 & \text{if } r \ge R \text{ and } p \ge P \\ \frac{2rp}{r+p} & \text{otherwise} \end{cases}$$
In order to identify the point satisfying operators' preference (if one exists), we add an incentive constant of 1 to the F-Score if $r \ge R$ and $p \ge P$. Since the F-Score is no more than 1, this incentive constant ensures that the points satisfying the preference must have a PC-Score larger than others that do not. Note: the +1 guarantees that preference-satisfying points outrank all others; among them the best F-Score wins, and if no point satisfies the preference, the point with the largest F-Score is still chosen.
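A direct transcription of the PC-Score into Python, plus the selection rule (the candidate list below is fabricated for illustration):

```python
def pc_score(r: float, p: float, R: float, P: float) -> float:
    """F-Score plus an incentive constant of 1 when the point
    satisfies the operators' preference (r >= R and p >= P)."""
    if r + p == 0:
        return 0.0
    f = 2 * r * p / (r + p)
    return f + 1 if (r >= R and p >= P) else f

# Each candidate cThld yields a (recall, precision) point on the PR curve;
# pick the one maximizing the PC-Score under the preference R = P = 0.66.
candidates = [(0.2, 0.90, 0.50), (0.4, 0.70, 0.70), (0.6, 0.40, 0.95)]
best_cthld, r, p = max(candidates, key=lambda c: pc_score(c[1], c[2], 0.66, 0.66))
```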
4.5.2 EWMA Based cThld Prediction
These cThlds are the best ones we can configure for detecting those data, and are called best cThlds. However, in online detection, we need to predict cThlds for detecting future data. Note: online detection requires predicting future cThlds.
To this end, an alternative method is k-fold cross-validation. In each test (k tests in total), a classifier is trained using k-1 of the subsets and tested on the remaining one with a cThld candidate. The candidate that achieves the highest average PC-Score across the k tests is used for future detection. Note: k-fold cross-validation picks the candidate with the best average PC-Score.
the best cThlds can differ greatly over weeks. As a result, in the cross-validation, the cThld that achieves the highest average performance over all the historical data might not be similar to the best cThld of the future week. Note: however, the best cThlds turn out to differ greatly across weeks.
Hence, we adopt EWMA [11] to predict the cThld of the $i^{th}$ week (or the $i^{th}$ test set) based on the historical best cThlds:
$$cThld_{i}^{p} = \begin{cases} \alpha \cdot cThld_{i-1}^{b} + (1-\alpha) \cdot cThld_{i-1}^{p} & \text{if } i > 1 \\ \text{5-fold prediction} & \text{otherwise} \end{cases}$$
Note: EWMA predicts this week's cThld as a weighted sum of last week's best cThld and last week's predicted cThld.
$cThld_{i-1}^{b}$ is the best cThld of the $(i-1)^{th}$ week. $cThld_{i}^{p}$ is the predicted cThld of the $i^{th}$ week, and also the one used for detecting the $i^{th}$-week data.
We use $\alpha = 0.8$ in this paper to quickly catch up with the cThld variation.
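A sketch of this prediction rule (the first_prediction placeholder stands in for the 5-fold cross-validation result used when no history exists; its 0.5 default is my placeholder):

```python
def predict_cthlds(best_cthlds, alpha=0.8, first_prediction=0.5):
    """EWMA prediction of the cThld used to detect each week's data.
    best_cthlds[i-1] is the best cThld found in hindsight for week i."""
    predictions = [first_prediction]              # cThld_1^p
    for i in range(1, len(best_cthlds) + 1):
        # cThld_{i+1}^p = alpha * cThld_i^b + (1 - alpha) * cThld_i^p
        predictions.append(alpha * best_cthlds[i - 1]
                           + (1 - alpha) * predictions[i - 1])
    return predictions
```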
5. EVALUATION
5.1 Data sets
These data are labeled by the operators from the search engine using our labeling tool. 7.8%, 2.8%, and 7.4% of the data points are labeled as anomalies for PV, #SR, and SRT, respectively. Note: these are the anomaly ratios in the data.
5.2 Detector and Parameter Choices
Two of the detectors were already used by the search engine we studied before this study. One is namely "Diff", which simply measures anomaly severities using the differences between the current point and the point of the last slot, the point of the last day, and the point of the last week. Note: the "Diff" detector.
The other one, namely "MA of diff", measures severities using the moving average of the difference between the current point and the point of the last slot. This detector is designed to discover continuous jitters. Note: the "MA of diff" detector; see the sketch below.
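A rough sketch of these two detectors as I read the one-line descriptions (the window lengths and the use of absolute differences are my assumptions):

```python
import numpy as np

def diff_severities(series: np.ndarray, t: int, day: int, week: int) -> list:
    """'Diff': differences between point t and the points one slot,
    one day, and one week earlier (day/week given as slot counts)."""
    return [abs(series[t] - series[t - 1]),
            abs(series[t] - series[t - day]),
            abs(series[t] - series[t - week])]

def ma_of_diff_severity(series: np.ndarray, t: int, window: int) -> float:
    """'MA of diff': moving average of last-slot differences over `window`
    points, aimed at continuous jitters rather than single spikes."""
    return float(np.mean(np.abs(np.diff(series[t - window: t + 1]))))
```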
Among these detectors, there are two variants of detectors using MAD (Median Absolute Deviation) around the median, instead of the standard deviation around the mean, to measure anomaly severities. This patch can improve the robustness to missing data and outliers. Note: MAD is less sensitive to missing data and outliers.
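A sketch of the MAD variant (my own helper, not code from the paper): the severity is measured in MADs around the median instead of standard deviations around the mean.

```python
import numpy as np

def mad_severity(window: np.ndarray, point: float) -> float:
    """Deviation of `point` from the window's median, in units of the
    median absolute deviation; more robust to outliers and missing data
    than mean/standard-deviation based severities."""
    median = np.median(window)
    mad = np.median(np.abs(window - median))
    if mad == 0:   # constant window: any deviation at all is extreme
        return 0.0 if point == median else float("inf")
    return abs(point - median) / mad
```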
5.3 Accuracy of Random Forests
Alternatively, we use the area under the PR curve (AUCPR) [50] as the accuracy measure. The AUCPR is a single-number summary of the detection performance over all the possible thresholds.
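Computing AUCPR with scikit-learn (a sketch; the labels and scores below are fabricated):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

# Fabricated ground truth and anomaly scores; in Opprentice the scores would
# be the random forest's vote fractions (or a basic detector's severities).
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.1, 0.2, 0.9, 0.3, 0.6, 0.2, 0.1, 0.8])

precision, recall, _ = precision_recall_curve(y_true, scores)
aucpr = auc(recall, precision)   # one number summarizing all thresholds
```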
5.3.1 Random Forests vs. Basic Detectors and Static Combinations of Basic Detectors
The result shows that random forests significantly outperform the two static combination methods, and perform similarly to or even better than the most accurate basic detector for each KPI. Note: random forests beat the static combinations and roughly match the best single detector on every KPI.
5.3.2 Random Forests vs. Other Algorithms
We also compare random forests with several other machine learning algorithms: decision trees, logistic regression, linear support vector machines (SVMs), and naive Bayes.
The result demonstrates that random forests are quite robust to irrelevant and redundant features in practice. Note: the more features involved, the more apparent the random forests' advantage.
6. DISCUSSION
Anomaly detection, not troubleshooting. Sometimes, although the operators admit the anomalies in the KPI curve, they tend to ignore them as they know that the anomalies are caused by some normal activities as expected, such as service upgrades and predictable social events. Note: anomaly detection is not troubleshooting.
For example, the troubleshooting system may find that the anomalies are due to normal system upgrades and suggest that operators ignore them.
Detection across the same types of KPIs. Some KPIs are of the same type and operators often care about similar types of anomalies for them. Note that, in order to reuse the classifier for the data of different scales, the anomaly features extracted by basic detectors should be normalized. Note: after normalizing the features, the classifier can be reused across KPIs of the same type.
Dirty data. A well known problem is that detectors are often affected by "dirty data". Dirty data refer to anomalies or missing points in data, and they can contaminate detectors and cause errors of detectors. Note: the dirty-data problem.
We address this problem in three ways. (a) Some of our detectors, e.g., weighted MA and SVD, can generate anomaly features only using recent data. Thus, they can quickly get rid of the contamination of dirty data.
(b) We take advantage of MAD [3, 15] to make some detectors, such as TSD, more robust to dirty data. (c) Since Opprentice uses many detectors simultaneously, even if a few detectors are contaminated, Opprentice could still automatically select and work with the remaining detectors.
Learning limitations. Another issue is that a learning based approach is limited by the anomalies within a training set. For example, anomalies can be rare, and new types of anomalies might appear in the future [16]. We solve this problem by incrementally retraining the classifier to gather more anomaly cases and learn emerging types of anomalies. Note: anomalies are scarce, so incremental retraining keeps improving the classifier.