SLS Machine Learning Best Practices: Batch Time-Series Anomaly Detection

Date: 2019-11-19

1. Common Detection Scenarios

1.1 Scenario 1

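A cluster has N machines, and each machine reports M time-series metrics (CPU, memory, IO, traffic, and so on). Modeling each curve separately means hand-writing a lot of repetitive SQL and puts a heavy computational load on the platform. How can SQL be applied more effectively to meet this requirement?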

1.2 Scenario 2

After running anomaly detection on the N time series in the system, how can we quickly find out which of them actually contain anomalies?

2. Platform Experiments

2.1 Solution 1

For the problem described in Scenario 1, we assume the following data constraints. The data is stored in a Log Service LogStore with this structure:

timestamp : unix_time_stamp
machine: name1
metricName: cpu0
metricValue: 50
---
timestamp : unix_time_stamp
machine: name1
metricName: cpu1
metricValue: 50
---
timestamp : unix_time_stamp
machine: name1
metricName: mem
metricValue: 50
---
timestamp : unix_time_stamp
machine: name2
metricName: mem
metricValue: 60

From this LogStore, we first retrieve the time-series data for the N metrics, averaged per minute:

* | select timestamp - timestamp % 60 as time, machine, metricName, avg(metricValue) from log group by time, machine, metricName
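To make the bucketing arithmetic in this query concrete, here is a minimal Python sketch of the same idea: round each timestamp down to the start of its minute, then average the values per (time, machine, metricName) group. The function name `bucket_average` and the sample records are hypothetical and only illustrate the logic; they are not part of Log Service.

```python
from collections import defaultdict


def bucket_average(records, width=60):
    """Average metricValue per (time bucket, machine, metricName)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for r in records:
        # Same arithmetic as `timestamp - timestamp % 60` in the query.
        bucket = r["timestamp"] - r["timestamp"] % width
        key = (bucket, r["machine"], r["metricName"])
        sums[key][0] += r["metricValue"]
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical sample records, shaped like the LogStore entries above.
sample = [
    {"timestamp": 1574121605, "machine": "name1", "metricName": "cpu0", "metricValue": 40},
    {"timestamp": 1574121650, "machine": "name1", "metricName": "cpu0", "metricValue": 60},
    {"timestamp": 1574121665, "machine": "name1", "metricName": "cpu0", "metricValue": 70},
    {"timestamp": 1574121610, "machine": "name2", "metricName": "mem", "metricValue": 60},
]
averaged = bucket_average(sample)
```

The first two `cpu0` records fall into the same one-minute bucket and are averaged together, while the third starts a new bucket. This is what gives every curve a regular one-point-per-minute spacing before detection.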

Next, we run the batch time-series anomaly detection algorithm on this result and obtain detection results for all N metrics:

* | 
select machine, metricName, ts_predicate_arma(time, value, 5, 1, 1) as res from  ( 
    select
        timestamp - timestamp % 60 as time, 
        machine, metricName, 
        avg(metricValue) as value
    from log group by time, machine, metricName )
group by machine, metricName

This SQL returns results with the following structure:

| machine | metricName | [[time, src, pred, upper, lower, prob]] |
| ------- | ---------- | --------------------------------------- |

We then apply a matrix transpose to this result so that each field becomes its own array. The SQL is:

* | 
select 
    machine, metricName, 
    res[1] as ts, res[2] as ds, res[3] as preds, res[4] as uppers, res[5] as lowers, res[6] as probs
from ( select machine, metricName, array_transpose(ts_predicate_arma(time, value, 5, 1, 1)) as res from  ( 
    select
        timestamp - timestamp % 60 as time, 
        machine, metricName, 
        avg(metricValue) as value
    from log group by time, machine, metricName )
group by machine, metricName )

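After transposing the two-dimensional array, we split each of its rows into a separate column and obtain the expected result, in the following format: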

| machine | metricName | ts | ds | preds | uppers | lowers | probs |
| ------- | ---------- | -- | -- | ----- | ------ | ------ | ----- |
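The transpose step can be illustrated outside the platform. The sketch below is plain Python, not the implementation of `array_transpose`; the `transpose` helper and the sample values are made up for illustration. It shows how a list of `[time, src, pred, upper, lower, prob]` rows becomes six parallel arrays.

```python
def transpose(rows):
    """Turn a list of equal-length rows into a list of columns."""
    return [list(col) for col in zip(*rows)]


# Hypothetical detector output for one (machine, metricName) pair:
# each inner list is [time, src, pred, upper, lower, prob].
res = [
    [1574121600, 50.0, 49.0, 55.0, 43.0, 0.0],
    [1574121660, 52.0, 50.0, 56.0, 44.0, 0.0],
    [1574121720, 90.0, 51.0, 57.0, 45.0, 1.0],
]

# After transposing, each field is its own array.
ts, ds, preds, uppers, lowers, probs = transpose(res)
```

Note that the SQL indexes the transposed array from 1 (`res[1]` through `res[6]`), whereas Python lists are indexed from 0.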

2.2 Solution 2

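How can we quickly filter the batch detection results down to the curves that show a particular kind of anomaly? Log Service provides a filter operation for anomaly detection results: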

select ts_anomaly_filter(lineName, ts, ds, preds, probs, nWatch, anomalyType)

The anomalyType parameter takes the following values:

0: watch for all anomalies
1: watch for rising-edge anomalies
-1: watch for falling-edge anomalies

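The nWatch parameter is defined as follows:

It is the length of the window to inspect: the most recent nWatch observation points, counted back from the last valid observation in the actual time series.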
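One way to read these two parameters is sketched below in Python. This is not the implementation of `ts_anomaly_filter`: the 0.5 probability threshold, the rule that "rising" means the observed value is above the prediction, and the names `has_recent_anomaly` and `filter_lines` are all assumptions made for illustration.

```python
def has_recent_anomaly(ds, preds, probs, n_watch, anomaly_type, threshold=0.5):
    """Check whether any of the last n_watch points is a watched anomaly.

    Assumptions (not confirmed by the source): a point is anomalous when
    its probability exceeds `threshold`; a rising anomaly has the observed
    value above the prediction, a falling anomaly has it below.
    """
    if n_watch <= 0:
        return False
    recent = list(zip(ds, preds, probs))[-n_watch:]
    for observed, predicted, prob in recent:
        if prob <= threshold:
            continue
        if anomaly_type == 0:
            return True
        if anomaly_type == 1 and observed > predicted:
            return True
        if anomaly_type == -1 and observed < predicted:
            return True
    return False


def filter_lines(lines, n_watch, anomaly_type):
    """Return the names of the curves that pass the filter."""
    return [
        name
        for name, (ds, preds, probs) in lines.items()
        if has_recent_anomaly(ds, preds, probs, n_watch, anomaly_type)
    ]


# Hypothetical curves: name -> (ds, preds, probs).
lines = {
    "name1-cpu0": ([50, 52, 90], [49, 50, 51], [0.0, 0.0, 1.0]),  # spike at the end
    "name1-mem": ([50, 20, 51], [50, 50, 50], [0.0, 1.0, 0.0]),   # dip in the middle
    "name2-mem": ([60, 61, 60], [60, 60, 60], [0.0, 0.0, 0.0]),   # no anomaly
}
last_point_only = filter_lines(lines, 1, 0)
last_two_points = filter_lines(lines, 2, 0)
falling_only = filter_lines(lines, 3, -1)
```

With a window of one point only the curve that spikes at the very end is kept. Widening the window to two points also picks up the earlier dip, and restricting the type to -1 keeps only the dip.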

Example usage:

* | 
select 
    ts_anomaly_filter(lineName, ts, ds, preds, probs, cast(5 as bigint), cast(1 as bigint))
from
( select 
    concat(machine, '-', metricName) as lineName, 
    res[1] as ts, res[2] as ds, res[3] as preds, res[4] as uppers, res[5] as lowers, res[6] as probs
from ( select machine, metricName, array_transpose(ts_predicate_arma(time, value, 5, 1, 1)) as res from  ( 
    select
        timestamp - timestamp % 60 as time, 
        machine, metricName, 
        avg(metricValue) as value
    from log group by time, machine, metricName )
group by machine, metricName ) )

The query above returns a value of type Row. Its individual fields can be extracted as follows:

* | 
select 
    res.name, res.ts, res.ds, res.preds, res.probs 
from
    ( select 
        ts_anomaly_filter(lineName, ts, ds, preds, probs, cast(5 as bigint), cast(1 as bigint)) as res
    from
        ( select 
            concat(machine, '-', metricName) as lineName, 
            res[1] as ts, res[2] as ds, res[3] as preds, res[4] as uppers, res[5] as lowers, res[6] as probs
          from ( 
                select 
                    machine, metricName, array_transpose(ts_predicate_arma(time, value, 5, 1, 1)) as res 
                from  ( 
                    select
                        timestamp - timestamp % 60 as time, 
                        machine, metricName, avg(metricValue) as value
                    from log group by time, machine, metricName )
                group by machine, metricName ) ) )

With these steps, the results of batch anomaly detection can be filtered, which helps users configure alerts in bulk.