Learning a Python library: elasticsearch-dsl

Date: 2020-01-12

1. Introduction

elasticsearch-dsl is a wrapper built on top of elasticsearch-py that provides a more convenient way to work with Elasticsearch.

2. Usage

The official elasticsearch-dsl documentation has six parts: Configuration, Search DSL, Persistence, Faceted Search, Update By Query, and API Documentation.

2.1 Configuration

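There are several ways to configure connections. The simplest and most effective is to set up a default connection, which is then used by every API call that is not given another connection explicitly.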

2.1.1 Default connection

The default connection is created with the connections.create_connection() method.

from elasticsearch_dsl import connections

connections.create_connection(hosts=['localhost'], timeout=20)

You can also name the connection with the alias parameter and refer to it by that alias later. The default alias is default.

from elasticsearch_dsl import connections

connections.create_connection(alias='my_new_connection', hosts=['localhost'], timeout=60)

2.1.2 Multiple clusters

Use configure to define several connections that point to different clusters.

from elasticsearch_dsl import connections

connections.configure(
    default={'hosts': 'localhost'},
    dev={
        'hosts': ['esdev1.example.com:9200'],
        'sniff_on_start': True
    }
)

You can also add a connection manually with add_connection.

2.1.2.4 Using aliases

The following example shows how to use a connection alias.

s = Search(using='qa')
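To make the alias mechanism more concrete, here is a minimal pure-Python sketch of an alias registry. The class MiniConnections, its storage of plain config dicts, and the host esqa1.example.com are all hypothetical and exist only for illustration; this is not how elasticsearch-dsl itself is implemented.

```python
class MiniConnections:
    """Toy registry mapping alias names to connection settings."""

    def __init__(self):
        self._configs = {}

    def configure(self, **aliases):
        # Replace all known aliases at once, e.g. default=..., dev=...
        self._configs = dict(aliases)

    def add_connection(self, alias, config):
        # Register (or overwrite) a single alias by hand.
        self._configs[alias] = config

    def get_connection(self, alias="default"):
        # Look the alias up; fail loudly if it was never configured.
        try:
            return self._configs[alias]
        except KeyError:
            raise KeyError(f"no connection configured for alias {alias!r}") from None


registry = MiniConnections()
registry.configure(
    default={"hosts": "localhost"},
    dev={"hosts": ["esdev1.example.com:9200"], "sniff_on_start": True},
)
# Hypothetical QA host, added after the initial configuration.
registry.add_connection("qa", {"hosts": ["esqa1.example.com:9200"]})

print(registry.get_connection())      # settings stored under "default"
print(registry.get_connection("qa"))  # settings stored under "qa"
```

The point of the sketch is only that an alias is a lookup key: code that says using='qa' does not need to know which hosts sit behind that name.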

2.1.3 Manual

If you do not want to provide a global connection, you can pass an instance of elasticsearch.Elasticsearch as the connection through the using parameter:

s = Search(using=Elasticsearch('localhost'))

You can also override the connection already associated with a search object:

s = s.using(Elasticsearch('otherhost:9200'))

2.2 Search DSL

2.2.1 The search object

The Search object represents the entire search request, including queries, filters, aggregations, sort, pagination, additional parameters, and the associated client.

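The API is designed to be chainable. With the exception of aggregations, the Search object is immutable: every change to the object produces a shallow copy that contains the change.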

When you instantiate a Search object, you can pass the low-level elasticsearch client as an argument.

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

client = Elasticsearch()

s = Search(using=client)

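Note

All methods return a copy of the object, which makes it safe to pass to outside code.

The API is chainable, allowing you to combine multiple method calls in a single statement: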

s = Search().using(client).query("match", title="python")
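The copy-on-change behaviour can be sketched in plain Python. MiniSearch below is a hypothetical toy class, not the library's Search; it only illustrates how a modifier method can return a changed copy so that the original object stays untouched, and it assumes repeated queries are combined under bool/must.

```python
import copy


class MiniSearch:
    """Toy request builder: every modifier returns a changed copy."""

    def __init__(self):
        self._queries = []

    def _clone(self):
        # Shallow-copy the object, then give the copy its own list so that
        # appending to it cannot affect the original.
        clone = copy.copy(self)
        clone._queries = list(self._queries)
        return clone

    def query(self, name, **params):
        clone = self._clone()
        clone._queries.append({name: params})
        return clone

    def to_dict(self):
        if not self._queries:
            return {}
        if len(self._queries) == 1:
            return {"query": self._queries[0]}
        # Assumption of this sketch: several queries are ANDed together.
        return {"query": {"bool": {"must": list(self._queries)}}}


base = MiniSearch()
python_only = base.query("match", title="python")
both = python_only.query("match", body="django")

print(base.to_dict())         # {}
print(python_only.to_dict())  # {'query': {'match': {'title': 'python'}}}
```

Because base is never modified, it can be handed to other code or reused as a starting point for unrelated searches.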

Call the execute method to send the request to Elasticsearch:

response = s.execute()

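If you only want to iterate over the hits returned by the search, you can iterate over the Search object directly; iterating runs the search for you if it has not been executed yet: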

for hit in s:
    print(hit.title)

You can serialize the Search object to a dict with to_dict(), which is handy for debugging.

print(s.to_dict())

2.2.1.1 Delete By Query

To delete the documents that match a query, call delete on the Search object instead of execute:

s = Search(index='i').query("match", title="python")
response = s.delete()

2.2.1.2 Queries

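The library provides classes for all Elasticsearch query types. All parameters are passed as keyword arguments, which are serialized and sent on to Elasticsearch. This means there is a clear one-to-one mapping between a raw query and its DSL equivalent.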

from elasticsearch_dsl.query import MultiMatch, Match

# {"multi_match": {"query": "python django", "fields": ["title", "body"]}}
MultiMatch(query='python django', fields=['title', 'body'])

# {"match": {"title": {"query": "web framework", "type": "phrase"}}}
Match(title={"query": "web framework", "type": "phrase"})

You can use the Q shortcut to build a query instance from named arguments or from a raw dict:

from elasticsearch_dsl import Q

Q("multi_match", query='python django', fields=['title', 'body'])
Q({"multi_match": {"query": "python django", "fields": ["title", "body"]}})

Add the query to the Search object with the .query() method:

q = Q("multi_match", query='python django', fields=['title', 'body'])
s = s.query(q)

The method also accepts the same arguments as Q.

s = s.query("multi_match", query='python django', fields=['title', 'body'])

2.2.1.2.1 Dotted fields

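Sometimes you want to refer to a field inside another field, either a multi-field (title.keyword) or a field in a structured JSON document such as address.city. For convenience, Q lets you use a double underscore '__' in place of the '.' in keyword arguments: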

s = Search()
s = s.filter('term', category__keyword='Python')
s = s.query('match', address__city='prague')

Alternatively, you can always fall back on Python's keyword-argument unpacking if you prefer:

s = Search()
s = s.filter('term', **{'category.keyword': 'Python'})
s = s.query('match', **{'address.city': 'prague'})
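The two spellings above end up naming the same field. A minimal sketch of the idea, using a hypothetical helper expand_dotted that is not part of the library:

```python
def expand_dotted(**kwargs):
    """Return kwargs with every '__' in a key replaced by '.'."""
    return {key.replace("__", "."): value for key, value in kwargs.items()}


print(expand_dotted(category__keyword="Python"))
# {'category.keyword': 'Python'}
print(expand_dotted(**{"address.city": "prague"}))
# {'address.city': 'prague'}
```

In this sketch a field whose real name contains a double underscore would be rewritten as well, which is one reason the unpacking form is useful to have.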

2.2.1.2.2 Query combination

Query objects can be combined with logical operators:

Q("match", title='python') | Q("match", title='django')
# {"bool": {"should": [...]}}

Q("match", title='python') & Q("match", title='django')
# {"bool": {"must": [...]}}

~Q("match", title="python")
# {"bool": {"must_not": [...]}}
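The operator syntax relies on Python's operator overloading. The sketch below uses a hypothetical MiniQ class that wraps raw query dicts; it only shows how |, & and ~ can be mapped to should, must and must_not. The real library does more than this, so treat it as an illustration of the mechanism rather than of its exact output.

```python
class MiniQ:
    """Toy query node wrapping a raw query dict."""

    def __init__(self, body):
        self.body = body

    def __or__(self, other):
        # a | b: at least one side should match
        return MiniQ({"bool": {"should": [self.body, other.body]}})

    def __and__(self, other):
        # a & b: both sides must match
        return MiniQ({"bool": {"must": [self.body, other.body]}})

    def __invert__(self):
        # ~a: the query must not match
        return MiniQ({"bool": {"must_not": [self.body]}})


python = MiniQ({"match": {"title": "python"}})
django = MiniQ({"match": {"title": "django"}})

print((python | django).body)
print((python & django).body)
print((~python).body)
```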

When .query() is called several times, the & operator is used internally:

s = s.query().query()
print(s.to_dict())
# {"query": {"bool": {...}}}

If you want precise control over the form of the query, build the combined query directly with Q:

q = Q('bool',
    must=[Q('match', title='python')],
    should=[Q(...), Q(...)],
    minimum_should_match=1
)
s = Search().query(q)

2.2.1.3 Filters

If you want to add a query in filter context, the filter() method makes that simple.

s = Search()
s = s.filter('terms', tags=['search', 'python'])

Behind the scenes this produces a bool query and places the given query in its filter branch, which is equivalent to:

s = Search()
s = s.query('bool', filter=[Q('terms', tags=['search', 'python'])])

If you want to use the post_filter element for faceted navigation, use the .post_filter() method. You can also use exclude() to exclude items from the query:

s = Search()
s = s.exclude('terms', tags=['search', 'python'])

2.2.1.4 Aggregations

You can define an aggregation with the A shortcut.

from elasticsearch_dsl import A

A('terms', field='tags')
# {"terms": {"field": "tags"}}

To nest aggregations, use the .bucket(), .metric() and .pipeline() methods.

a = A('terms', field='category')
# {'terms': {'field': 'category'}}

a.metric('clicks_per_category', 'sum', field='clicks')\
    .bucket('tags_per_category', 'terms', field='tags')
# {
#   'terms': {'field': 'category'},
#   'aggs': {
#     'clicks_per_category': {'sum': {'field': 'clicks'}},
#     'tags_per_category': {'terms': {'field': 'tags'}}
#   }
# }
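The chaining in this example works because, as far as I understand the library, .metric() returns the aggregation it was called on while .bucket() returns the newly created bucket. The hypothetical MiniAgg class below is a plain-Python sketch of that return-value convention, not the library's implementation.

```python
class MiniAgg:
    """Toy aggregation node illustrating the chaining behaviour."""

    def __init__(self, agg_type, **params):
        self.agg_type = agg_type
        self.params = params
        self.children = {}

    def metric(self, name, agg_type, **params):
        # A metric cannot hold sub-aggregations, so hand back the parent
        # and let the caller keep adding siblings.
        self.children[name] = MiniAgg(agg_type, **params)
        return self

    def bucket(self, name, agg_type, **params):
        # A bucket can be nested further, so hand back the new child.
        child = self.children[name] = MiniAgg(agg_type, **params)
        return child

    def to_dict(self):
        body = {self.agg_type: dict(self.params)}
        if self.children:
            body["aggs"] = {n: c.to_dict() for n, c in self.children.items()}
        return body


a = MiniAgg("terms", field="category")
tags = a.metric("clicks_per_category", "sum", field="clicks") \
        .bucket("tags_per_category", "terms", field="tags")

print(a.to_dict())
```

Both children end up as siblings under the category aggregation, and tags refers to the inner bucket, so further calls on tags would nest one level deeper.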

To add aggregations to a Search object, use the .aggs attribute, which acts as a top-level aggregation.

s = Search()
a = A('terms', field='category')
s.aggs.bucket('category_terms', a)
# {
#   'aggs': {
#     'category_terms': {
#       'terms': {
#         'field': 'category'
#       }
#     }
#   }
# }

Or:

s = Search()
s.aggs.bucket('articles_per_day', 'date_histogram', field='publish_date', interval='day')\
    .metric('clicks_per_day', 'sum', field='clicks')\
    .pipeline('moving_click_average', 'moving_avg', buckets_path='clicks_per_day')\
    .bucket('tags_per_day', 'terms', field='tags')

s.to_dict()
# {
#   "aggs": {
#     "articles_per_day": {
#       "date_histogram": { "interval": "day", "field": "publish_date" },
#       "aggs": {
#         "clicks_per_day": { "sum": { "field": "clicks" } },
#         "moving_click_average": { "moving_avg": { "buckets_path": "clicks_per_day" } },
#         "tags_per_day": { "terms": { "field": "tags" } }
#       }
#     }
#   }
# }

You can access an existing bucket by its name.

s = Search()

s.aggs.bucket('per_category', 'terms', field='category')
s.aggs['per_category'].metric('clicks_per_category', 'sum', field='clicks')
s.aggs['per_category'].bucket('tags_per_category', 'terms', field='tags')

2.2.1.5 Sorting

To specify the sort order, use the .sort() method.

s = Search().sort(
    'category',
    '-title',
    {"lines" : {"order" : "asc", "mode" : "avg"}}
)

Calling sort() with no arguments resets the sorting.

2.2.1.6 Pagination

To specify from and size, use the slicing API:

s = s[10:20]
# {"from": 10, "size": 10}
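The arithmetic behind the slice is simply from = start and size = stop - start. A minimal sketch with a hypothetical helper, slice_to_from_size, which is not a library function:

```python
def slice_to_from_size(window):
    """Translate a slice with non-negative bounds into from/size."""
    start = window.start or 0
    if start < 0 or (window.stop is not None and window.stop < 0):
        raise ValueError("negative indexing is not supported in this sketch")
    body = {"from": start}
    if window.stop is not None:
        body["size"] = max(0, window.stop - start)
    return body


print(slice_to_from_size(slice(10, 20)))   # {'from': 10, 'size': 10}
print(slice_to_from_size(slice(None, 5)))  # {'from': 0, 'size': 5}
```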

To access all documents matched by the query, use the scan() method, which relies on the scan/scroll Elasticsearch API:

for hit in s.scan():
    print(hit.title)

Note that in this case the results are not sorted.

2.2.1.7 Highlighting

To set common highlighting attributes, use the highlight_options() method:

s = s.highlight_options(order='score')

Use the highlight() method to enable highlighting for individual fields:

s = s.highlight('title')
# or, including parameters:
s = s.highlight('title', fragment_size=50)

The fragments in the response are then available on each result object as .meta.highlight.FIELD, which contains the list of fragments:

response = s.execute()
for hit in response:
    for fragment in hit.meta.highlight.title:
        print(fragment)

2.2.1.8 Suggestions

To specify a suggest request on your Search object, use the suggest() method:

# check for correct spelling
s = s.suggest('my_suggestion', 'pyhton', term={'field': 'title'})

2.2.1.9 Extra properties and parameters

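To set extra properties of the search request, use the .extra() method. It can be used to define keys in the body that cannot be set through a dedicated API method, such as explain or search_after.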

s = s.extra(explain=True)

To set query parameters, use the .params() method:

s = s.params(routing="42")

To limit the fields returned by Elasticsearch, use the source() method:

# only return the selected fields
s = s.source(['title', 'body'])
# don't return any fields, just the metadata
s = s.source(False)
# explicitly include/exclude fields
s = s.source(includes=["title"], excludes=["user.*"])
# reset the field selection
s = s.source(None)

2.2.1.10 Serialization and Deserialization

A Search object can be serialized into a dictionary with the .to_dict() method.

You can also create a Search object from a dict with the class method from_dict. This creates a new Search object and populates it with the data from the dict.

s = Search.from_dict({"query": {"match": {"title": "python"}}})

If you want to modify an existing Search object and override its properties, use the update_from_dict() method, which changes the instance in place.

s = Search(index='i')
s.update_from_dict({"query": {"match": {"title": "python"}}, "size": 42})

2.2.2 Response

You run your search by calling the execute method, which returns a Response object. The Response object lets you access any key of the response dictionary as an attribute.

print(response.success())
# True

print(response.took)
# 12

print(response.hits.total.relation)
# eq
print(response.hits.total.value)
# 142

print(response.suggest.my_suggestions)

If you want to inspect the contents of the response object, use the to_dict method to get the raw data.
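Attribute-style access to a nested dictionary can be sketched in a few lines. AttrView below is a hypothetical stand-in, not the library's Response class, and the sample data is made up to mirror the values printed above.

```python
class AttrView:
    """Read-only attribute-style view over a nested dict."""

    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so _data is safe.
        try:
            value = self._data[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested dicts so that chained access keeps working.
        return AttrView(value) if isinstance(value, dict) else value

    def to_dict(self):
        return self._data


# Made-up sample data for illustration only.
raw = {"took": 12, "hits": {"total": {"value": 142, "relation": "eq"}}}
response = AttrView(raw)

print(response.took)              # 12
print(response.hits.total.value)  # 142
```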

2.2.2.1 Hits

You can access the returned hits through the hits attribute, or simply iterate over the Response object.

response = s.execute()
print('Total %d hits found.' % response.hits.total.value)
for h in response:
    print(h.title, h.body)

2.2.2.2 Result

Each hit is wrapped in a class that gives attribute access to the keys of the returned dictionary. All metadata is stored in the meta attribute.

response = s.execute()
h = response.hits[0]
print('/%s/%s/%s returned with score %f' % (
    h.meta.index, h.meta.doc_type, h.meta.id, h.meta.score))

2.2.2.3 Aggregations

Aggregation results are available through the aggregations attribute:

for tag in response.aggregations.per_tag.buckets:
    print(tag.key, tag.max_lines.value)

2.2.3 MultiSearch

Use the MultiSearch class to run several searches at the same time; it uses the _msearch API:

from elasticsearch_dsl import MultiSearch, Search

ms = MultiSearch(index='blogs')

ms = ms.add(Search().filter('term', tags='python'))
ms = ms.add(Search().filter('term', tags='elasticsearch'))

responses = ms.execute()

for response in responses:
    print("Results for query %r." % response.search.query)
    for hit in response:
        print(hit.title)
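For reference, the _msearch endpoint takes a newline-delimited body in which each search is a header line followed by a body line. The sketch below builds such a payload by hand with the standard library; build_msearch_body is a hypothetical helper, and the two bodies are written out manually as an assumption of what the filter searches above would roughly serialize to.

```python
import json


def build_msearch_body(index, searches):
    """Serialize searches into the newline-delimited _msearch format."""
    lines = []
    for body in searches:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # search body line
    # The newline-delimited format must end with a newline.
    return "\n".join(lines) + "\n"


# Hand-written bodies, assumed equivalents of the two filter searches.
searches = [
    {"query": {"bool": {"filter": [{"term": {"tags": "python"}}]}}},
    {"query": {"bool": {"filter": [{"term": {"tags": "elasticsearch"}}]}}},
]
payload = build_msearch_body("blogs", searches)
print(payload)
```

MultiSearch hides this format from you; seeing it spelled out mainly helps when reading request logs.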

2.3 Persistence

You can use the DSL library to define your mappings and a basic persistence layer for your application.

2.3.1 Document

If you want a model-like wrapper around your documents, use the Document class. It can also be used to create all the necessary mappings and settings in Elasticsearch.

from datetime import datetime
from elasticsearch_dsl import Document, Date, Nested, Boolean, \
    analyzer, InnerDoc, Completion, Keyword, Text

html_strip = analyzer('html_strip',
    tokenizer="standard",
    filter=["standard", "lowercase", "stop", "snowball"],
    char_filter=["html_strip"]
)

class Comment(InnerDoc):
    author = Text(fields={'raw': Keyword()})
    content = Text(analyzer='snowball')
    created_at = Date()

    def age(self):
        return datetime.now() - self.created_at

class Post(Document):
    title = Text()
    title_suggest = Completion()
    created_at = Date()
    published = Boolean()
    category = Text(
        analyzer=html_strip,
        fields={'raw': Keyword()}
    )

    comments = Nested(Comment)

    class Index:
        name = 'blog'

    def add_comment(self, author, content):
        self.comments.append(
          Comment(author=author, content=content, created_at=datetime.now()))

    def save(self, **kwargs):
        self.created_at = datetime.now()
        return super().save(**kwargs)

2.3.1.1 Data types

When defining a Document, besides plain Python types you can use types such as InnerDoc and Range to represent non-simple data.

from datetime import datetime
from elasticsearch_dsl import Document, DateRange, Keyword, Range

class RoomBooking(Document):
    room = Keyword()
    dates = DateRange()


rb = RoomBooking(
  room='Conference Room II',
  dates=Range(
    gte=datetime(2018, 11, 17, 9, 0, 0),
    lt=datetime(2018, 11, 17, 10, 0, 0)
  )
)

# Range supports the in operator correctly:
datetime(2018, 11, 17, 9, 30, 0) in rb.dates # True

# you can also get the limits and whether they are inclusive or exclusive:
rb.dates.lower # datetime(2018, 11, 17, 9, 0, 0), True
rb.dates.upper # datetime(2018, 11, 17, 10, 0, 0), False

# empty range is unbounded
Range().lower # None, False

2.3.1.2 Note on dates

When instantiating a Date field, you can set the time zone explicitly with the default_timezone parameter.

class Post(Document):
    created_at = Date(default_timezone='UTC')

2.3.1.3 Document life cycle

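Before you use the Post document type for the first time, you need to create the mappings in Elasticsearch. You can do that through the Index object or create the mappings directly by calling the init() method.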

# create the mappings in Elasticsearch
Post.init()

All metadata fields can be accessed through the meta attribute.

post = Post(meta={'id': 42})

# prints 42
print(post.meta.id)

# override default index
post.meta.index = 'my-blog'

Use the get() method to retrieve an existing document:

# retrieve the document
first = Post.get(id=42)
# now we can call methods, change fields, ...
first.add_comment('me', 'This is nice!')
# and save the changes into the cluster again
first.save()

To delete a document, just call its delete() method:

first = Post.get(id=42)
first.delete()

2.3.1.4 Analysis

To specify an analyzer for a text field, just use the analyzer's name, either a built-in analyzer or one you define yourself.

2.3.1.5 Search

To search on this document type, use the search method.

# by calling .search we get back a standard Search object
s = Post.search()
# the search is already limited to the index and doc_type of our document
s = s.filter('term', published=True).query('match', title='first')


results = s.execute()

# when you execute the search the results are wrapped in your document class (Post)
for post in results:
    print(post.meta.score, post.title)

2.3.1.6 class Meta options

The Meta class holds the metadata you can define for your document, such as mapping.

2.3.1.7 class Index options

The Index class defines information about the index: its name, settings and other attributes.

2.3.1.8 Document Inheritance

2.3.2 Index

In typical scenarios, the Index class on a Document is sufficient for any operation. In a few cases, working with the Index object directly can be more useful.

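Index is a class that holds all the metadata related to an index in Elasticsearch, such as mappings and settings. It is most useful when defining mappings, because it makes it easy to create several mappings at once. This is especially handy when setting up your Elasticsearch objects as part of a migration.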

from elasticsearch_dsl import Index, Document, Text, analyzer

blogs = Index('blogs')

# define custom settings
blogs.settings(
    number_of_shards=1,
    number_of_replicas=0
)

# define aliases
blogs.aliases(
    old_blogs={}
)

# register a document with the index
blogs.document(Post)

# can also be used as class decorator when defining the Document
@blogs.document
class Post(Document):
    title = Text()

# You can attach custom analyzers to the index

html_strip = analyzer('html_strip',
    tokenizer="standard",
    filter=["standard", "lowercase", "stop", "snowball"],
    char_filter=["html_strip"]
)

blogs.analyzer(html_strip)

# delete the index, ignore if it doesn't exist
blogs.delete(ignore=404)

# create the index in elasticsearch
blogs.create()

You can also set up a template for your indices and use the clone() method to create specific copies:

blogs = Index('blogs', using='production')
blogs.settings(number_of_shards=2)
blogs.document(Post)

# create a copy of the index with different name
company_blogs = blogs.clone('company-blogs')

# create a different copy on different cluster
dev_blogs = blogs.clone('blogs', using='dev')
# and change its settings
dev_blogs.settings(number_of_shards=1)

2.3.2.1 Index Template

elasticsearch-dsl also provides an option to manage index templates in Elasticsearch with the IndexTemplate class, whose API is very similar to that of Index.

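Once an index template is saved in Elasticsearch, its contents are automatically applied to new indices that match the template pattern (existing indices are not affected), even if the index is created automatically when a document is indexed.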

from datetime import datetime

from elasticsearch_dsl import Document, Date, Text


class Log(Document):
    content = Text()
    timestamp = Date()

    class Index:
        name = "logs-*"
        settings = {
          "number_of_shards": 2
        }

    def save(self, **kwargs):
        # assign now if no timestamp given
        if not self.timestamp:
            self.timestamp = datetime.now()

        # override the index to go to the proper timeslot
        kwargs['index'] = self.timestamp.strftime('logs-%Y%m%d')
        return super().save(**kwargs)

# once, as part of application setup, during deploy/migrations:
logs = Log._index.as_template('logs', order=0)
logs.save()

# to perform search across all logs:
search = Log.search()
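The routing done in save() above comes down to formatting the timestamp into a daily index name that still matches the logs-* pattern. A small standard-library sketch, where index_for is a hypothetical helper and fnmatch is used only as a stand-in for the wildcard matching that Elasticsearch performs:

```python
from datetime import datetime
from fnmatch import fnmatch


def index_for(timestamp):
    """Build the daily index name used by the save() override."""
    return timestamp.strftime("logs-%Y%m%d")


name = index_for(datetime(2020, 1, 12, 8, 30))
print(name)                     # logs-20200112
print(fnmatch(name, "logs-*"))  # True
```

Because every daily index name matches the pattern, the template settings apply to each new index and a search on logs-* covers all of them.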

2.4 Faceted Search

This API is experimental and I have not used it, so it is skipped for now.

2.5 Update By Query

2.5.1 The Update By Query object

The Update By Query object lets you update the documents that match a query in a single pass, using the _update_by_query endpoint.

2.5.1.1 Serialization and Deserialization

The query object can be serialized into a dictionary with the .to_dict() method, and an object can be built from a dictionary with the class method from_dict().

ubq = UpdateByQuery.from_dict({"query": {"match": {"title": "python"}}})

2.5.1.2 Extra properties and parameters

Use the .extra() method to set extra properties:

ubq = ubq.extra(explain=True)

Use the .params() method to set query parameters:

ubq = ubq.params(routing="42")

2.5.2 Response

Call the .execute() method to run the query; it returns a Response object. The Response object lets you access any key of the response dictionary as an attribute.

response = ubq.execute()

print(response.success())
# True

print(response.took)
# 12

If you need to inspect the contents of the response object, use the to_dict() method to get its raw data.

2.6 API Documentation

The API Documentation describes the public classes and methods of the elasticsearch-dsl library in detail; consult it as a reference when you need it.

3. Summary

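1. Compared with using the elasticsearch client directly, elasticsearch-dsl offers a simpler way to work with Elasticsearch and reduces the complexity of building DSL queries. I recommend it.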

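2. The methods of elasticsearch-dsl still map onto Elasticsearch's RESTful API, so its API documentation is unclear in places. For example, which parameters can be passed when constructing an instance? The documentation only says that any keyword arguments are accepted and passed straight through to Elasticsearch, so to find out which parameters actually take effect you still need to consult the Elasticsearch REST API documentation.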