Elasticsearch Aggregations: Histogram Aggregation

Date: 2019-11-19

Tags: elasticsearch, aggregation, histogram. Category: log analysis

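Elasticsearch supports histogram aggregation. It automatically builds buckets over a numeric field, scans all documents, and places each document into the matching bucket. The numeric value can come from a field in the document or be generated by a script.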

Bucket assignment rule

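As an example, take a price field that holds the price of a product. We want one bucket for every interval of 5 and a count of how many documents (products) fall into each interval.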

If a product's price is 32, it is placed in the bucket with key 30. The key is computed as follows:

rem = value % interval
if (rem < 0) {
    rem += interval
}
bucket_key = value - rem

This calculation determines which bucket a document belongs to.
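The formula can be tried out locally with a small Python sketch. The function name `bucket_key` is hypothetical and not part of Elasticsearch. Python's own `%` operator never returns a negative remainder for a positive interval, so the sketch uses `math.fmod`, which keeps the sign of the dividend the way `%` does in Java or C, so that the correction step from the formula is actually exercised.

```python
import math


def bucket_key(value: int, interval: int) -> int:
    """Return the bucket key for an integer value, following the formula above."""
    # math.fmod keeps the sign of the dividend, like % in Java or C.
    rem = int(math.fmod(value, interval))
    if rem < 0:
        # Shift a negative remainder into the range [0, interval).
        rem += interval
    return value - rem


print(bucket_key(32, 5))   # a price of 32 with interval 5
print(bucket_key(-3, 5))   # a negative value with the same interval
```

With an interval of 5, the value 32 maps to key 30 and the value -3 maps to key -5.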

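There is a catch, though. The formula above is designed for integer data, so a floating-point field is first converted to an integer and then run through the same calculation. That works for positive numbers but gives wrong results for negative ones. For example, with an interval of 2, the value -4.5 is converted to -4 and lands in the -4 bucket, although -4.5 actually belongs in the -6 bucket.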
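The problem with negative floating-point values can be reproduced with a short sketch. It assumes an interval of 2, which is what the bucket keys -4 and -6 imply. Both function names are hypothetical. The floor-based variant is shown only as one way to obtain the expected key; it is not taken from the article.

```python
import math


def truncated_bucket_key(value: float, interval: int) -> int:
    """Convert to an integer first (truncating toward zero), then apply the formula."""
    as_int = int(value)  # -4.5 becomes -4
    rem = int(math.fmod(as_int, interval))
    if rem < 0:
        rem += interval
    return as_int - rem


def floor_bucket_key(value: float, interval: int) -> int:
    """Assumed alternative: round the quotient down before scaling back up."""
    return math.floor(value / interval) * interval


print(truncated_bucket_key(-4.5, 2))  # key produced after the integer conversion
print(floor_bucket_key(-4.5, 2))      # key of the interval that really contains -4.5
```

The first call yields -4 and the second yields -6, matching the discrepancy described above.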

Filtering with min_doc_count

The aggregation DSL looks like this:

{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50
            }
        }
    }
}

The response is:

{
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "key": 0,
                    "doc_count": 2
                },
                {
                    "key": 50,
                    "doc_count": 4
                },
                {
                    "key": 100,
                    "doc_count": 0
                },
                {
                    "key": 150,
                    "doc_count": 3
                }
            ]
        }
    }
}

In the response above, the range 100-150 contains no documents, yet its bucket is still returned with a count of 0. To hide buckets whose count is 0, set min_doc_count:

{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "min_doc_count" : 1
            }
        }
    }
}

The response then no longer contains empty buckets:

{
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "key": 0,
                    "doc_count": 2
                },
                {
                    "key": 50,
                    "doc_count": 4
                },
                {
                    "key": 150,
                    "doc_count": 3
                }
            ]
        }
    }
}
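The effect of min_doc_count can be imitated locally. The sketch below is a simulation, not Elasticsearch code: the `histogram` function and the sample prices are hypothetical, chosen so that the counts match the responses shown above.

```python
from collections import Counter


def histogram(values, interval, min_doc_count=0):
    """Count integer values per bucket and drop buckets below min_doc_count."""
    counts = Counter((v // interval) * interval for v in values)
    if not counts:
        return []
    # Walk every key between the smallest and largest bucket so gaps show up as 0.
    keys = range(min(counts), max(counts) + interval, interval)
    return [
        {"key": k, "doc_count": counts[k]}
        for k in keys
        if counts[k] >= min_doc_count
    ]


prices = [10, 20, 55, 60, 70, 90, 150, 160, 170]  # hypothetical sample data
print(histogram(prices, 50))                   # includes the empty bucket 100
print(histogram(prices, 50, min_doc_count=1))  # the empty bucket is dropped
```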

extended_bounds: specifying minimum and maximum bounds

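By default, the first bucket of a histogram aggregation is determined automatically from the data. With the price field, for instance, if no product is priced between 0 and 5, the 0 bucket does not appear. If the cheapest product costs 11, the first bucket is 10.
You can force a minimum and maximum by setting extended_bounds, but min_doc_count must be 0 (it must not be greater than 0). Otherwise the empty buckets are not returned even though the bounds are specified.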

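Also note that if extended_bounds.min is greater than the smallest value in the documents, the documents' minimum is used instead (the same applies to extended_bounds.max when it is smaller than the largest value).
For example, if extended_bounds.min and max are set to 40 and 50 but the documents contain values below 40, the buckets are still defined by the data in the documents.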
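A sketch of how extended_bounds might be used follows. The request body uses the `extended_bounds` and `min_doc_count` parameters of the histogram aggregation; the bound values 0 and 50 are illustrative. The `bucket_keys` function is a hypothetical local simulation of how the bounds widen the bucket range without ever narrowing it; it is not Elasticsearch code, and the sample prices are made up.

```python
import json
from collections import Counter

# Request body: min_doc_count is 0 so that the empty buckets are returned.
request_body = {
    "aggs": {
        "prices": {
            "histogram": {
                "field": "price",
                "interval": 5,
                "min_doc_count": 0,
                "extended_bounds": {"min": 0, "max": 50},
            }
        }
    }
}


def bucket_keys(values, interval, bounds_min, bounds_max):
    """Simulate the bucket range: bounds can extend it but never cut it."""
    counts = Counter((v // interval) * interval for v in values)
    low = min(min(counts), (bounds_min // interval) * interval)
    high = max(max(counts), (bounds_max // interval) * interval)
    return [(k, counts[k]) for k in range(low, high + interval, interval)]


print(json.dumps(request_body, indent=2))
print(bucket_keys([11, 14, 23], 5, 0, 30))   # bounds extend the range on both sides
print(bucket_keys([11, 14, 23], 5, 40, 50))  # min of 40 is ignored, first key stays 10
```

In the second call the documents contain values below 40, so the first bucket is still 10, as described above.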

Sorting with order

Sorting works much the same as in other aggregations. Buckets can be ordered by their key with _key:

{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "_key" : "desc" }
            }
        }
    }
}

Or by document count:

{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "_count" : "asc" }
            }
        }
    }
}

Or by a metric from a sub-aggregation:

{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "order" : { "price_stats.min" : "asc" }
            },
            "aggs" : {
                "price_stats" : { "stats" : { "field" : "price" } }
            }
        }
    }
}
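The three orderings can be pictured as plain sorts over the bucket list. The sketch below is a local illustration with hypothetical bucket data, including made-up price_stats.min values; Elasticsearch performs the ordering on the server side.

```python
# Hypothetical buckets, each carrying the min value of its price_stats sub-aggregation.
buckets = [
    {"key": 0, "doc_count": 2, "price_stats": {"min": 10}},
    {"key": 50, "doc_count": 4, "price_stats": {"min": 55}},
    {"key": 150, "doc_count": 3, "price_stats": {"min": 150}},
]

# "order": {"_key": "desc"}
by_key_desc = sorted(buckets, key=lambda b: b["key"], reverse=True)
# "order": {"_count": "asc"}
by_count_asc = sorted(buckets, key=lambda b: b["doc_count"])
# "order": {"price_stats.min": "asc"}
by_min_asc = sorted(buckets, key=lambda b: b["price_stats"]["min"])

print([b["key"] for b in by_key_desc])
print([b["key"] for b in by_count_asc])
print([b["key"] for b in by_min_asc])
```

Because the histogram and the sub-aggregation use the same price field, ordering by price_stats.min ascending gives the same sequence as ordering by key ascending; a metric on a different field would generally not.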

keyed: choosing the response format

By default the buckets are returned as an array, as shown above. To have them returned as an object keyed by bucket name, set keyed to true:

{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "keyed" : true
            }
        }
    }
}

The response then looks like this:

{
    "aggregations": {
        "prices": {
            "buckets": {
                "0": {
                    "key": 0,
                    "doc_count": 2
                },
                "50": {
                    "key": 50,
                    "doc_count": 4
                },
                "150": {
                    "key": 150,
                    "doc_count": 3
                }
            }
        }
    }
}
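The relationship between the two response shapes can be sketched as a simple conversion. The helper `to_keyed` is hypothetical, and it assumes the bucket name is the string form of the key, as in the response above.

```python
def to_keyed(buckets):
    """Turn an array of buckets into an object keyed by the bucket's key as a string."""
    return {str(b["key"]): b for b in buckets}


array_buckets = [
    {"key": 0, "doc_count": 2},
    {"key": 50, "doc_count": 4},
    {"key": 150, "doc_count": 3},
]

keyed_buckets = to_keyed(array_buckets)
print(keyed_buckets["50"]["doc_count"])  # direct lookup by bucket name
```

The keyed form makes it easy to look up one bucket directly, while the array form preserves an explicit order.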

Missing values

The value to use for documents that lack the field is set with the missing parameter:

{
    "aggs" : {
        "quantity" : {
             "histogram" : {
                 "field" : "quantity",
                 "interval": 10,
                 "missing": 0 
             }
         }
    }
}
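The effect of the missing parameter can be simulated as follows. This is a minimal sketch, not Elasticsearch code: the function `histogram_with_missing` and the sample quantities are hypothetical, and `None` stands in for a document that has no quantity field.

```python
from collections import Counter


def histogram_with_missing(values, interval, missing=None):
    """Bucket integer values; None entries use `missing`, or are skipped if it is unset."""
    counts = Counter()
    for v in values:
        if v is None:
            if missing is None:
                continue  # documents without a value are ignored by default
            v = missing
        counts[(v // interval) * interval] += 1
    return dict(sorted(counts.items()))


quantities = [3, None, 12, None, 25]  # hypothetical sample data
print(histogram_with_missing(quantities, 10))             # missing documents skipped
print(histogram_with_missing(quantities, 10, missing=0))  # missing documents counted as 0
```

With missing set to 0, the two documents without a quantity are counted in the 0 bucket instead of being left out.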