Search engines (Elasticsearch aggregation analysis)

Learning objectives

Master the query syntax for aggregation analysis.
Master the use of metric aggregations and bucket aggregations.

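Introduction to aggregation analysis

What is ES aggregation analysis?

Aggregation is an important database feature: it performs aggregate computations over the data set returned by a query, for example finding the maximum or minimum of a field (or of a computed expression), or computing a sum or an average. As both a search engine and a data store, ES provides equally powerful aggregation capabilities.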

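Computing indicators such as max, min, sum, and average over a data set is called a metric aggregation (metric) in ES.
Relational databases, besides offering aggregate functions, can also group the queried data with group by and then compute metrics on each group. In ES, group by is called bucketing, that is, a bucket aggregation (bucketing).

ES also provides matrix aggregations (matrix) and pipeline aggregations (pipeline), but these are still being refined.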

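Writing an ES aggregation query

In the search request body, define the aggregations under the aggregations node, using the following syntax:

"aggregations" : {
    "<aggregation_name>" : {         // "aggregations" can also be abbreviated as "aggs"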
        "<aggregation_type>" : {
            <aggregation_body>
        }
        [,"meta" : {  [<meta_data_body>] } ]?
        [,"aggregations" : { [<sub_aggregation>]+ } ]?
    }
    [,"<aggregation_name_2>" : { ... } ]*
}
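To make the grammar above concrete, here is a minimal Python sketch that assembles such a request body as a plain dictionary. The helper names `aggregation` and `search_body` are hypothetical (they are not part of any Elasticsearch client); the field names `age` and `balance` come from the bank examples in this article.

```python
import json


def aggregation(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation node: type + body, optional meta and sub-aggregations."""
    node = {agg_type: body}
    if meta is not None:
        node["meta"] = meta
    if sub_aggs:
        node["aggs"] = sub_aggs  # "aggs" is the short form of "aggregations"
    return node


def search_body(aggs, size=0):
    """Wrap named aggregations in a search body; size=0 suppresses the hits."""
    return {"size": size, "aggs": aggs}


# A terms aggregation on age with a max sub-aggregation on balance.
body = search_body({
    "age_terms": aggregation(
        "terms", {"field": "age"},
        sub_aggs={"max_balance": aggregation("max", {"field": "balance"})},
    )
})
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a POST /bank/_search request.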

Value sources for aggregations

The values an aggregation computes over can be taken from a field, or they can be the result of a script.

 

Metric aggregations

max  min  sum  avg

POST /bank/_search?
{
  "size": 0,
  "aggs": {
    "max_balance": {
      "max": {
        "field": "balance"
      }
    }
  }
}
The query above finds the maximum balance across all customers.
POST /bank/_search?
{
  "size": 2, 
  "query": {
    "match": {
      "age": 24
    }
  },
  "sort": [
    {
      "balance": {
        "order": "desc"
      }
    }
  ],
  "aggs": {
    "max_balance": {
      "max": {
        "field": "balance"
      }
    }
  }
}
The query above finds the maximum balance among customers aged 24.
POST /bank/_search?size=0
{
    "aggs" : {                   // values come from a script
        "avg_age" : {
            "avg" : {
                "script" : {    // average age of all customers
                    "source" : "doc.age.value"
                }
            }
        },
        "avg_age10" : {
            "avg" : {
                "script" : {
                    "source" : "doc.age.value + 10"
                }
            }
        }
    }
}
POST /bank/_search?size=0
{
  "aggs": {
    "sum_balance": {
      "sum": {
        "field": "balance",   // with a field specified, the script reads the field value through _value
        "script": {
            "source": "_value * 1.03"
        }
      }
    }
  }
}
POST /bank/_search?size=0
{
  "aggs": {
    "avg_age": {
      "avg": {
        "field": "age",    // "missing" supplies a value for documents that lack the field; if it is not set, those documents are ignored
        "missing": 18
      }
    }
  }
}
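The effect of "missing" on an average is easy to see on a toy data set. The following is a minimal in-memory sketch of the behavior described in the comment above, not how ES computes it internally; the function name `avg` and the sample documents are made up for illustration.

```python
def avg(docs, field, missing=None):
    """Average of `field`; documents without the field are skipped
    unless `missing` supplies a substitute value."""
    values = []
    for doc in docs:
        if doc.get(field) is not None:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)
    return sum(values) / len(values) if values else None


# Hypothetical sample: the third document has no age.
docs = [{"age": 20}, {"age": 30}, {"name": "no age recorded"}]
print(avg(docs, "age"))              # only two documents contribute
print(avg(docs, "age", missing=18))  # the third document counts as age 18
```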

Document count  count

POST /bank/_doc/_count
{
  "query": {
    "match": {
      "age" : 24
    }
  }
}

cardinality  count of distinct values

POST /bank/_search?size=0
{
  "aggs": {
    "age_count": {
      "cardinality": {
        "field": "age"
      }
    },
    "state_count": {
      "cardinality": {
        "field": "state.keyword"
      }
    }
  }
}
For state, use its keyword sub-field (state.keyword).

Value count  counts the values a field holds (for a single-valued field, the number of documents that have a value)

POST /bank/_search?size=0
{
    "aggs" : {
        "age_count" : { "value_count" : { "field" : "age" } }
    }
}
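The difference between value_count and cardinality can be sketched in a few lines of Python. This is an exact in-memory illustration with hypothetical function names and sample documents; in ES itself the cardinality aggregation returns an approximate count (it is based on the HyperLogLog++ algorithm).

```python
def value_count(docs, field):
    """Count every value of `field`, including repeats and array elements."""
    total = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        total += len(value) if isinstance(value, list) else 1
    return total


def cardinality(docs, field):
    """Count distinct values of `field` (exact here, approximate in ES)."""
    distinct = set()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue
        distinct.update(value if isinstance(value, list) else [value])
    return len(distinct)


# Hypothetical sample: two customers share age 24, one document has no age.
docs = [{"age": 24}, {"age": 24}, {"age": 31}, {"state": "TX"}]
print(value_count(docs, "age"), cardinality(docs, "age"))
```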

stats  returns five values: count, max, min, avg, sum

POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "stats": {
        "field": "age"
      }
    }
  }
}

Extended stats

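Extended statistics: four more results than stats, namely the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.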

POST /bank/_search?size=0
{
  "aggs": {
    "age_stats": {
      "extended_stats": {
        "field": "age"
      }
    }
  }
}
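The four extra results can be derived from the same running totals that stats uses. The sketch below assumes the population variance (sum of squares divided by count, minus the squared mean) and a width of two standard deviations for the bounds; the function name and the sample values are illustrative only.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute stats plus sum of squares, variance, std deviation and bounds."""
    count = len(values)
    total = sum(values)
    mean = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: E[x^2] - (E[x])^2, clamped against rounding noise.
    variance = max(sum_of_squares / count - mean * mean, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": mean,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": mean + sigma * std_deviation,
            "lower": mean - sigma * std_deviation,
        },
    }


result = extended_stats([20, 30, 40])  # hypothetical ages
print(result)
```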

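Percentiles  the values found at given percentiles

For the values of the specified field (or script), ES accumulates, from the smallest value to the largest, the share of documents at each value (as a percentage of all hit documents) and returns the value at each requested percentile. By default it returns the values at the percentiles [ 1, 5, 25, 50, 75, 95, 99 ]. In the result below, the middle entry can be read as: 50% of the documents have age <= 31, or conversely, documents with age <= 31 make up 50% of all hit documents.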

POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age"
      }
    }
  }
}
"aggregations": {
    "age_percents": {
      "values": {
        "1.0": 20,
        "5.0": 21,
        "25.0": 25,
        "50.0": 31,
        "75.0": 35,
        "95.0": 39,
        "99.0": 40
      }
    }
  }
POST /bank/_search?size=0
{
  "aggs": {
    "age_percents": {
      "percentiles": {
        "field": "age",
        "percents" : [95, 99, 99.9] 
      }
    }
  }
}
The request above specifies which percentiles to return.
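The "accumulate from smallest to largest" reading of a percentile can be sketched with the nearest-rank method. This is a simplified exact computation on a made-up list of ages; ES computes percentiles approximately (with the TDigest algorithm by default) and interpolates, so its results on real data can differ from this sketch.

```python
import math


def percentile(values, percent):
    """Nearest-rank percentile: the smallest value v such that at least
    `percent` percent of the values are <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


ages = [20, 22, 25, 28, 31, 33, 35, 38, 39, 40]  # hypothetical sample
for p in (1, 25, 50, 95):
    print(p, percentile(ages, p))
```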

Percentile ranks  the percentage of documents whose value is less than or equal to a given value

POST /bank/_search?size=0
{
  "aggs": {
    "age_perc_rank": {
      "percentile_ranks": {
        "field": "age",
        "values": [
          25,
          30
        ]
      }
    }
  }
}
"aggregations": {
    "age_perc_rank": {
      "values": {
        "25.0": 26.1,
        "30.0": 49.3
      }
    }
  }
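A percentile rank is the inverse question of a percentile: given a value, what share of the documents is at or below it. A minimal exact sketch follows, with a hypothetical function name and sample list; ES itself answers this approximately.

```python
def percentile_rank(values, threshold):
    """Percentage of values that are <= threshold."""
    if not values:
        return None
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


ages = [20, 22, 25, 28, 31, 33, 35, 38, 39, 40]  # hypothetical sample
print(percentile_rank(ages, 25))
print(percentile_rank(ages, 30))
```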

Geo Bounds aggregation  computes the bounding box of the geo points in the document set

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html

Geo Centroid aggregation  computes the coordinates of the centroid

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html

Bucket aggregations

Terms Aggregation  groups and aggregates by the terms of a field

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age"
      }
    }
  }
}
"aggregations": {
    "age_terms": {
      "doc_count_error_upper_bound": 0,    // upper bound on the doc-count error
      "sum_other_doc_count": 463,          // doc count of the terms that were not returned
      "buckets": [                         // by default, the top 10 buckets by doc count, highest first
        {
          "key": 31,
          "doc_count": 61
        },
        {
          "key": 39,
          "doc_count": 60
        },
        {
          "key": 26,
          "doc_count": 59
        },
        ….
       ]
    }
  }
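The shape of this response can be reproduced with a small in-memory sketch: count the documents per term, keep the top `size` buckets, and report the rest as `sum_other_doc_count`. The function name and sample data are hypothetical, and breaking ties by ascending key is an assumption of this sketch.

```python
from collections import Counter


def terms_agg(docs, field, size=10):
    """Group documents by the value of `field` and keep the top `size` buckets."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    # Highest doc count first; ties broken by ascending key (assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {
        "sum_other_doc_count": sum(n for _, n in ranked[size:]),
        "buckets": [{"key": k, "doc_count": n} for k, n in ranked[:size]],
    }


docs = [{"age": a} for a in [31, 31, 31, 39, 39, 26]]  # hypothetical sample
result = terms_agg(docs, "age", size=2)
print(result)
```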

size  how many buckets to return

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 20
      }
    }
  }
}
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 5,
        "shard_size":20    // shard_size: how many buckets each shard returns
      }
    }
  }
}
The default value of shard_size is:
single-shard index: = size
multiple shards: = size * 1.5 + 10
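The default rule above, and the reason shard_size exists at all, can be sketched as follows. Each shard reports only its own top terms, and the coordinating node merges them, so a shard_size that is too small can produce wrong counts and even a wrong winner. Both function names and the per-shard counts are hypothetical, and truncating size * 1.5 + 10 to an integer is an assumption of this sketch.

```python
from collections import Counter


def default_shard_size(size, number_of_shards):
    """Default shard_size as described above (integer truncation assumed)."""
    if number_of_shards == 1:
        return size
    return int(size * 1.5 + 10)


def merged_terms(shard_counts, size, shard_size):
    """Each shard reports its top `shard_size` terms; the coordinating node
    sums what it receives and keeps the top `size`."""
    merged = Counter()
    for counts in shard_counts:
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:shard_size]
        merged.update(dict(top))
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


# Hypothetical per-shard term counts; the true totals are a=6, b=8, c=8.
shards = [{"a": 5, "b": 4, "c": 3}, {"c": 5, "b": 4, "a": 1}]
print(default_shard_size(5, 1), default_shard_size(10, 5))
print(merged_terms(shards, size=1, shard_size=1))  # too small: wrong winner and count
print(merged_terms(shards, size=1, shard_size=3))  # large enough: exact
```

Requesting more terms per shard than the final size is what keeps the merged counts close to the true ones, at the cost of more data moved between nodes.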
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "size": 5,
        "shard_size":20,
        "show_term_doc_count_error": true   // show the doc-count error on each bucket
      }
    }
  }
}

order  how the buckets are sorted

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order" : { "_count" : "asc" }   // sort by doc count
      }
    }
  }
}
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order" : { "_key" : "asc" }   // sort by bucket key
      }
    }
  }
}

Getting metric values per bucket

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "max_balance": "asc"
        }
      },
      "aggs": {
        "max_balance": {
          "max": {
            "field": "balance"
          }
        },
        "min_balance": {
          "min": {
            "field": "balance"
          }
        }
      }
    }
  }
}

Sorting by a bucket metric value

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "max_balance": "asc"
        }
      },
      "aggs": {
        "max_balance": {
          "max": {
            "field": "balance"
          }
        }
      }
    }
  }
}
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "order": {
          "stats_balance.max": "asc"
        }
      },
      "aggs": {
        "stats_balance": {
          "stats": {
            "field": "balance"
          }
        }
      }
    }
  }
}
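What these requests ask for is the equivalent of a group by with per-group metrics, followed by a sort on one of those metrics. A minimal in-memory sketch, with a hypothetical function name and sample accounts:

```python
from collections import defaultdict


def terms_ordered_by_max(docs, group_field, metric_field, descending=False):
    """Bucket documents by `group_field`, compute max/min of `metric_field`
    per bucket, then order the buckets by the max."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "max_balance": max(values),
            "min_balance": min(values),
        }
        for key, values in groups.items()
    ]
    buckets.sort(key=lambda b: b["max_balance"], reverse=descending)
    return buckets


# Hypothetical accounts.
docs = [
    {"age": 24, "balance": 1000},
    {"age": 24, "balance": 5000},
    {"age": 31, "balance": 3000},
    {"age": 39, "balance": 200},
]
for bucket in terms_ordered_by_max(docs, "age", "balance"):
    print(bucket)
```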

Filtering buckets

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "min_doc_count": 60   // filter buckets by doc count
      }
    }
  }
}
POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "terms": {
        "field": "age",
        "include": [20,24]   // keep only the listed values
      }
    }
  }
}
GET /_search
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"     // match values with regular expressions
            }
        }
    }
}
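The include/exclude pair can be illustrated with a short filter over a list of terms. The sketch assumes each pattern must match the whole term and that exclude wins over include; it uses Python's `re` module, whose syntax is not identical to the regular-expression syntax ES uses, and the tag values are made up.

```python
import re


def filter_terms(terms, include=None, exclude=None):
    """Keep terms fully matching `include` and not fully matching `exclude`."""
    kept = []
    for term in terms:
        if include is not None and not re.fullmatch(include, term):
            continue
        if exclude is not None and re.fullmatch(exclude, term):
            continue  # exclude takes precedence over include
        kept.append(term)
    return kept


tags = ["sport", "water_sports", "motorsport", "music"]  # hypothetical tags
print(filter_terms(tags, include=".*sport.*", exclude="water_.*"))
```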
GET /_search
{
    "aggs" : {
        "JapaneseCars" : {
             "terms" : {
                 "field" : "make",
                 "include" : ["mazda", "honda"]
             }   
         },                               // exact value lists
        "ActiveCarManufacturers" : {
             "terms" : {
                 "field" : "make",
                 "exclude" : ["rover", "jensen"]
             }
         }
    }
}

Grouping by a script-computed value

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "script" : {
                    "source": "doc['genre'].value",
                    "lang": "painless"
                }
            }
        }
    }
}

Handling missing values

GET /_search
{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "missing": "N/A" 
             }
         }
    }
}

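filter Aggregation  aggregates over the documents that satisfy a filter query

From the documents hit by the query, select those that match the filter condition and aggregate over them.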

POST /bank/_search?size=0
{
  "aggs": {
    "age_terms": {
      "filter": {"match":{"gender":"F"}},
      "aggs": {
        "avg_age": {
          "avg": {
            "field": "age"
          }
        }
      }
    }
  }
}

Filters Aggregation  aggregates over several filter groups

PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}
GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "other_bucket_key": "other_messages",   // key for the bucket that collects all other documents
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}
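The bucket counts these two requests produce for the three log documents indexed above can be sketched in memory. The sketch approximates a match query by lowercasing the body and checking for the word among its tokens, which is a simplification of real text analysis; the function names are hypothetical.

```python
import re


def tokens(text):
    """Very rough stand-in for text analysis: lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def filters_agg(docs, filters, other_bucket_key=None):
    """Count documents per named filter; unmatched documents go to the
    other bucket when `other_bucket_key` is given."""
    buckets = {name: 0 for name in filters}
    if other_bucket_key is not None:
        buckets[other_bucket_key] = 0
    for doc in docs:
        words = tokens(doc["body"])
        matched = False
        for name, word in filters.items():
            if word in words:  # a document may fall into several buckets
                buckets[name] += 1
                matched = True
        if not matched and other_bucket_key is not None:
            buckets[other_bucket_key] += 1
    return buckets


logs = [
    {"body": "warning: page could not be rendered"},
    {"body": "authentication error"},
    {"body": "warning: connection timed out"},
]
filters = {"errors": "error", "warnings": "warning"}
print(filters_agg(logs, filters))
print(filters_agg(logs, filters, other_bucket_key="other_messages"))
```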

Range Aggregation  groups and aggregates by value ranges

POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "ranges": [
          {"to":25},
          {"from": 25,"to": 35},
          {"from": 35}
        ]
      },
      "aggs": {
        "bmax": {
          "max": {
            "field": "balance"
          }
        }
      }
    }
  }
}
POST /bank/_search?size=0
{
  "aggs": {
    "age_range": {
      "range": {
        "field": "age",
        "keyed": true, 
        "ranges": [
          {"to":25,"key": "Ld"},
          {"from": 25,"to": 35,"key": "Md"},
          {"from": 35,"key": "Od"}
        ]
      }
    }              // assign a key to each bucket
  }
}
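In a range aggregation each range includes its "from" value and excludes its "to" value, which is why 25 and 35 can appear as both an upper and a lower bound without any document being counted twice. A minimal sketch using the keyed ranges from the request above, with a hypothetical function name and sample ages:

```python
def range_agg(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = {}
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        buckets[r["key"]] = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
    return buckets


ranges = [
    {"to": 25, "key": "Ld"},
    {"from": 25, "to": 35, "key": "Md"},
    {"from": 35, "key": "Od"},
]
ages = [20, 24, 25, 34, 35, 40]  # hypothetical sample
print(range_agg(ages, ranges))
```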

Date Range Aggregation  groups and aggregates by date ranges

POST /sales/_search?size=0
{
    "aggs": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyyy",
                "ranges": [
                    { "to": "now-10M/M" }, 
                    { "from": "now-10M/M" } 
                ]
            }
        }
    }
}
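The date-math expression now-10M/M reads as "now, minus ten months, rounded down to the start of that month". The sketch below evaluates it for a given "now" and then splits dates into the two buckets of the request above. The function names and sample dates are hypothetical, and time zones are ignored.

```python
from datetime import datetime


def months_ago_rounded_to_month(now, months):
    """Evaluate now-<months>M/M: subtract months, then round down to month start."""
    index = now.year * 12 + (now.month - 1) - months
    year, month0 = divmod(index, 12)
    return datetime(year, month0 + 1, 1)


def date_range_buckets(dates, boundary):
    """Two buckets: {"to": boundary} is exclusive, {"from": boundary} is inclusive."""
    return {
        "to": sum(1 for d in dates if d < boundary),
        "from": sum(1 for d in dates if d >= boundary),
    }


now = datetime(2024, 3, 15, 9, 30)  # hypothetical current time
boundary = months_ago_rounded_to_month(now, 10)
dates = [datetime(2023, 4, 30), datetime(2023, 5, 1), datetime(2024, 1, 10)]
print(boundary)
print(date_range_buckets(dates, boundary))
```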

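Date Histogram Aggregation  date histogram (bar chart) aggregation

This aggregates by day, month, year, and so on. You can aggregate by year (1y), quarter (1q), month (1M), week (1w), day (1d), hour (1h), minute (1m), or second (1s) intervals, or by a custom time interval.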

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            }
        }
    }
}
POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "90m"
            }
        }
    }
}
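The two requests above differ in how a timestamp is mapped to its bucket: a calendar unit such as month snaps to the start of the calendar month, while a fixed interval such as 90m can be pictured as flooring the time elapsed since the epoch to a multiple of the interval. The sketch below works in UTC; the function names and sample timestamps are hypothetical, and the epoch-aligned flooring is an assumption of this sketch.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_bucket(ts):
    """Bucket key for a calendar-month interval: first instant of the month."""
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


def fixed_bucket(ts, interval):
    """Bucket key for a fixed interval: floor to a multiple of it since the epoch."""
    return EPOCH + ((ts - EPOCH) // interval) * interval


# Hypothetical sale timestamps.
sales = [
    datetime(2015, 1, 1, 1, 45, tzinfo=timezone.utc),
    datetime(2015, 1, 20, 8, 0, tzinfo=timezone.utc),
    datetime(2015, 2, 17, 10, 0, tzinfo=timezone.utc),
]
per_month = Counter(month_bucket(ts) for ts in sales)
for key in sorted(per_month):
    print(key.isoformat(), per_month[key])
print(fixed_bucket(sales[0], timedelta(minutes=90)).isoformat())
```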

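Missing Aggregation  bucket aggregation for missing values

Documents that lack a value for the specified field are collected into one bucket for aggregation.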

POST /bank/_search?size=0
{
    "aggs" : {
        "account_without_a_age" : {
            "missing" : { "field" : "age" }
        }
    }
}

Geo Distance Aggregation  groups and aggregates by geographic distance

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geodistance-aggregation.html
