Python Elasticsearch DSL: Query, Filter, and Aggregation Examples


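Elasticsearch Basic Concepts

Index: the logical area Elasticsearch uses to store data, similar to a database in a relational database system. An index can be spread across one or more shards, and each shard can have multiple replicas.

Document: the entity data stored in Elasticsearch, similar to a row in a table of a relational database.

A document consists of multiple fields. Fields with the same name in different documents must have the same type. A field can appear more than once in a document, meaning a single field can hold multiple values (it is multivalued).

Document type: to make querying easier, an index may hold several kinds of documents, known as document types. A document type is similar to a table in a relational database. Note, however, that fields with the same name in different documents must still have the same type.

Mapping: similar to a schema definition in a relational database. It stores the mapping information for fields, and different document types have different mappings.

The table below compares some Elasticsearch and relational database terms: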

Relational database	Elasticsearch
Database	Index
Table	Type
Row	Document
Column	Field
Schema	Mapping
Index	Everything is indexed
SQL	Query DSL
SELECT * FROM table…	GET http://…
UPDATE table SET	PUT http://…

Introduction to Using Python Elasticsearch DSL

Connecting to ES:

import elasticsearch

es = elasticsearch.Elasticsearch([{'host': '127.0.0.1', 'port': 9200}])

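Let's start with search. q is the search text; surrounding whitespace in q does not affect the results. size sets the number of hits to return, and from_ sets the starting offset. filter_path selects which parts of the response are returned; in the second call below, the final result contains only _id and _type.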

res_3 = es.search(index="bank", q="Holmes", size=1, from_=1)
res_4 = es.search(index="bank", q=" 39225 5686 ", size=1000, filter_path=['hits.hits._id', 'hits.hits._type'])

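Querying all data in a given index:

Here, index specifies the index to search. A string names a single index; a list names several indices, such as index=["bank", "banner", "country"]; a wildcard pattern selects every index that matches, such as index=["apple*"], which means all indices whose names start with apple.

A specific doc type can also be given in search.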

from elasticsearch_dsl import Search

s = Search(using=es, index="index-test").execute()
print(s.to_dict())
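To illustrate how a wildcard index pattern such as apple* selects indices, here is a minimal local sketch using Python's fnmatch module. It only mimics the matching on the client side; Elasticsearch resolves the pattern on the server, and fnmatch accepts a few pattern characters (such as ? and [...]) that are not assumed to behave the same way in Elasticsearch. The index names and the resolve_indices helper are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_indices(patterns, existing):
    """Return the existing index names matched by any pattern (hypothetical helper)."""
    if isinstance(patterns, str):
        patterns = [patterns]  # a single string names a single index or pattern
    return [name for name in existing
            if any(fnmatchcase(name, p) for p in patterns)]


# Hypothetical index names, for illustration only.
existing = ["apple-2019", "apple-2020", "bank", "banner", "country"]

print(resolve_indices("bank", existing))
print(resolve_indices(["bank", "banner", "country"], existing))
print(resolve_indices(["apple*"], existing))
```

The last call selects only the two names that start with apple, which is the behaviour the wildcard form of index relies on.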

Querying by a field; several query conditions can be chained:

s = Search(using=es, index="index-test").query("match", sip="192.168.1.1")
s = s.query("match", dip="192.168.1.2")
s = s.execute()

Multi-field query:

from elasticsearch_dsl.query import MultiMatch, Match

multi_match = MultiMatch(query='hello', fields=['title', 'content'])
s = Search(using=es, index="index-test").query(multi_match)
s = s.execute()

print(s.to_dict())

You can also run a multi-field query with a Q() object. fields is a list, and query is the value to search for.

from elasticsearch_dsl import Q

q = Q("multi_match", query="hello", fields=['title', 'content'])
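# Build a fresh Search here: after the earlier execute() call, s holds a response, not a Search
response = Search(using=es, index="index-test").query(q).execute()

print(response.to_dict())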

The first argument to Q() is the query type; it can also be bool.

q = Q('bool', must=[Q('match', title='hello'), Q('match', content='world')])
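# Build a fresh Search so the query is not applied to an earlier response object
response = Search(using=es, index="index-test").query(q).execute()

print(response.to_dict())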

Q() objects can also be combined with operators, which is another way of writing the bool query above.

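# | combines two queries into a bool query with "should" clauses
q = Q("match", title='python') | Q("match", title='django')
print(q.to_dict())
# {"bool": {"should": [...]}}
response = Search(using=es, index="index-test").query(q).execute()

# & combines two queries into a bool query with "must" clauses
q = Q("match", title='python') & Q("match", title='django')
print(q.to_dict())
# {"bool": {"must": [...]}}
response = Search(using=es, index="index-test").query(q).execute()

# ~ negates a query into a bool query with a "must_not" clause
q = ~Q("match", title="python")
print(q.to_dict())
# {"bool": {"must_not": [...]}}
response = Search(using=es, index="index-test").query(q).execute()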
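The operator combinations above can be sketched with a toy model that runs without an Elasticsearch server. This is not the elasticsearch_dsl implementation; ToyQuery and ToyBool are hypothetical classes that only show how |, & and ~ can map to the should, must and must_not clauses of a bool query through Python operator overloading.

```python
class ToyQuery:
    """Toy stand-in for a query object (hypothetical, not the library's class)."""

    def __init__(self, name, **params):
        self.name = name
        self.params = params

    def to_dict(self):
        return {self.name: self.params}

    def __or__(self, other):
        # a | b  ->  bool query where either clause may match
        return ToyBool(should=[self, other])

    def __and__(self, other):
        # a & b  ->  bool query where both clauses must match
        return ToyBool(must=[self, other])

    def __invert__(self):
        # ~a  ->  bool query that excludes matches of a
        return ToyBool(must_not=[self])


class ToyBool(ToyQuery):
    """Toy bool query holding lists of sub-queries per clause type."""

    def __init__(self, **clauses):
        super().__init__("bool", **clauses)

    def to_dict(self):
        return {"bool": {kind: [q.to_dict() for q in queries]
                         for kind, queries in self.params.items()}}


python = ToyQuery("match", title="python")
django = ToyQuery("match", title="django")

print((python | django).to_dict())
print((python & django).to_dict())
print((~python).to_dict())
```

Unlike this sketch, a real library may merge or flatten nested bool queries when operators are chained; the toy version simply nests them.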

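Filtering, here a range filter: range is the filter type, timestamp is the name of the field to filter on, gte means greater than or equal to, and lt means less than. Set these as needed.

On the difference between term and match: term is an exact match against the indexed terms and does not analyze the query value, whereas match analyzes the query text (tokenizing it) and returns a relevance score. For example, on an analyzed text field whose tokens were lowercased at index time, a term query that contains uppercase letters returns no hits, while match is effectively case-insensitive and returns the same results either way.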

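import time

s = Search(using=es, index="index-test")
# Range filter combined with a match query
s = s.filter("range", timestamp={"gte": 0, "lt": time.time()}).query("match", country="in")
# Ordinary (terms) filter
res_3 = s.filter("terms", balance_num=["39225", "5686"]).execute()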
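The term versus match distinction described above can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the analyze function is a toy analyzer that splits on non-alphanumeric characters and lowercases tokens, the documents are made up, and none of this is how Elasticsearch is implemented internally.

```python
import re


def analyze(text):
    """Toy analyzer (an assumption): split into alphanumeric tokens and lowercase them."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_index(docs, field):
    """Build a token -> set of document ids mapping for one field."""
    index = {}
    for doc_id, doc in docs.items():
        for token in analyze(doc[field]):
            index.setdefault(token, set()).add(doc_id)
    return index


def term_query(index, value):
    """Look the value up exactly as given, without analysis."""
    return sorted(index.get(value, set()))


def match_query(index, text):
    """Analyze the query text first, then return documents matching any token."""
    hits = set()
    for token in analyze(text):
        hits |= index.get(token, set())
    return sorted(hits)


# Made-up documents, for illustration only.
docs = {1: {"name": "Sherlock Holmes"}, 2: {"name": "John Watson"}}
index = build_index(docs, "name")

print(term_query(index, "Holmes"))   # uppercase H is not an indexed token
print(term_query(index, "holmes"))   # matches the lowercased token
print(match_query(index, "HOLMES"))  # query text is analyzed before lookup
```

In this model the term lookup misses as soon as the query value differs in case from the stored token, while the match lookup finds the document regardless of case.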

Other ways to write filters:

s = Search()
s = s.filter('terms', tags=['search', 'python'])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'terms': {'tags': ['search', 'python']}}]}}}

s = s.query('bool', filter=[Q('terms', tags=['search', 'python'])])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'terms': {'tags': ['search', 'python']}}]}}}
s = s.exclude('terms', tags=['search', 'python'])
# or
s = s.query('bool', filter=[~Q('terms', tags=['search', 'python'])])
print(s.to_dict())
# {'query': {'bool': {'filter': [{'bool': {'must_not': [{'terms': {'tags': ['search', 'python']}}]}}]}}}

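Aggregations can be chained after queries, filters, and other operations; they are added through aggs.

A bucket is a grouping. Its first argument is the name of the group, which you choose yourself; the second is the aggregation type; the third specifies the field.

metric works the same way. Metric types include sum, avg, max, min, and so on. Note that two types return all of these values at once: stats and extended_stats. The latter also returns variance and other values.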

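from elasticsearch_dsl import A

# Example 1: group by timestamp, with stats on click and request in each bucket
s.aggs.bucket("per_country", "terms", field="timestamp").metric("sum_click", "stats", field="click").metric("sum_request", "stats", field="request")

# Example 2: group by click.keyword, with stats on click in each bucket
s.aggs.bucket("per_age", "terms", field="click.keyword").metric("sum_click", "stats", field="click")

# Example 3: extended stats over the whole result set
s.aggs.metric("sum_age", "extended_stats", field="impression")

# Example 4: group by country.keyword only
s.aggs.bucket("per_age", "terms", field="country.keyword")

# Example 5: aggregate by ranges
a = A("range", field="account_number", ranges=[{"to": 10}, {"from": 11, "to": 21}])
s.aggs.bucket("account_ranges", a)  # attach the aggregation; the bucket name is arbitrary

res = s.execute()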

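Finally, you still need to call execute(). Note that the result of an s.aggs operation should not be assigned to a variable (for example, res = s.aggs is wrong): aggregations modify s in place, and the aggregation results are available in res after execution.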
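To make clear what a terms bucket combined with a stats metric computes, here is a minimal pure-Python sketch that groups documents locally and produces the same kind of values (count, min, max, avg, sum). The terms_with_stats helper, the sample documents, and the bucket ordering (by document count, then key) are assumptions for illustration; Elasticsearch computes these on the server.

```python
from collections import defaultdict


def terms_with_stats(docs, group_field, value_field):
    """Group docs by group_field and compute stats over value_field (hypothetical helper)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[group_field]].append(doc[value_field])

    buckets = []
    for key, values in groups.items():
        buckets.append({
            "key": key,
            "doc_count": len(values),
            "stats": {
                "count": len(values),
                "min": min(values),
                "max": max(values),
                "avg": sum(values) / len(values),
                "sum": sum(values),
            },
        })
    # Largest buckets first, ties broken by key (an assumed ordering for this sketch).
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets


# Made-up documents, for illustration only.
docs = [
    {"country": "in", "click": 3},
    {"country": "in", "click": 5},
    {"country": "us", "click": 10},
]

for bucket in terms_with_stats(docs, "country", "click"):
    print(bucket["key"], bucket["doc_count"], bucket["stats"])
```

Each bucket here corresponds to one group of the terms aggregation, and the nested stats dictionary corresponds to the values a single stats metric returns for that group.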

Sorting

s = Search().sort(
    'category',
    '-title',
    {"lines" : {"order" : "asc", "mode" : "avg"}}
)
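The sort arguments above mix three forms: a plain field name, a field name prefixed with - for descending order, and a full dict. The following sketch shows one way such arguments can be normalized into a sort list for a request body. normalize_sort is a hypothetical helper written for this illustration, not a function of elasticsearch_dsl.

```python
def normalize_sort(*keys):
    """Turn sort arguments into a sort list for a request body (hypothetical helper)."""
    result = []
    for key in keys:
        if isinstance(key, str) and key.startswith("-"):
            # "-title" is shorthand for sorting on title in descending order
            result.append({key[1:]: {"order": "desc"}})
        else:
            # plain field names and full dicts are passed through unchanged
            result.append(key)
    return result


sort_spec = normalize_sort(
    "category",
    "-title",
    {"lines": {"order": "asc", "mode": "avg"}},
)
print(sort_spec)
```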

Pagination

s = s[10:20]
# {"from": 10, "size": 10}
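The slice above maps to from and size by simple arithmetic: from is the slice start, and size is stop minus start. A minimal sketch of that arithmetic, plus a page-number convenience, follows. Both helpers are hypothetical names introduced here, and pages are assumed to be numbered from 1.

```python
def slice_to_paging(start, stop):
    """Convert slice bounds into from/size values (hypothetical helper)."""
    if start < 0 or stop < start:
        raise ValueError("expected 0 <= start <= stop")
    return {"from": start, "size": stop - start}


def page_to_slice(page, per_page):
    """Convert a 1-based page number into a slice (hypothetical helper)."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    start = (page - 1) * per_page
    return slice(start, start + per_page)


print(slice_to_paging(10, 20))

second_page = page_to_slice(2, 10)
print(second_page.start, second_page.stop)
```

With these helpers, the second page of ten results corresponds to the same slice as s[10:20].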

Some additional methods, for those who are interested:

s = Search()

# Set extra properties with the `.extra()` method
s = s.extra(explain=True)

# Set request parameters with `.params()`
s = s.params(search_type="count")

# To limit the returned fields, use the `source()` method
# only return the selected fields
s = s.source(['title', 'body'])
# don't return any fields, just the metadata
s = s.source(False)
# explicitly include/exclude fields
s = s.source(include=["title"], exclude=["user.*"])
# reset the field selection
s = s.source(None)

# Build a Search object from a dict
s = Search.from_dict({"query": {"match": {"title": "python"}}})

# Modify an existing query
s.update_from_dict({"query": {"match": {"title": "python"}}, "size": 42})
