Querying Elasticsearch with Python and Exporting All the Data

Breaking down the task

Connect Python to Elasticsearch
Query Elasticsearch and print the results
Export the complete result set
Write all the results to a CSV file

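1. Connecting Python to Elasticsearch

The idea is much the same as connecting Python to Oracle or MySQL. You need the elasticsearch package; if you don't have it yet, install it with pip install elasticsearch. Once it is installed, from elasticsearch import Elasticsearch will no longer raise an error.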

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="http://192.168.21.33:9200/", http_auth=('abc','dataanalysis'))
print(es.info())

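Configure the connection with Elasticsearch(), passing the address of the server where Elasticsearch runs. If a username and password are required, supply them in the http_auth parameter (newer 8.x clients prefer basic_auth for the same purpose). If printing the connection info raises no error, the connection works.

2. Querying ES with a JSON request body

The request body has exactly the same format as the one used in Kibana. If you are not sure the body is written correctly, debug it in Kibana first and paste it in once it works.

As shown below, "_source": "title" restricts the response to the title field only.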

query_json = {
  "_source": "title",
  "query": {
    "bool": {
      "must": [
        {"match_phrase": {
          "content": "汽車"
        }},
        {"match_phrase": {
          "content": "房子"
        }}
      ]
    }
  }
}

query = es.search(index='mydata',body=query_json)
print(query)
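
Since the request body is an ordinary Python dict, it can also be assembled in code instead of typed out by hand. The following is a minimal sketch of that idea; the helper name build_phrase_query and its parameters are made up for this illustration and are not part of the elasticsearch package.

```python
def build_phrase_query(phrases, field="content", source="title"):
    """Build a bool/must query that requires every phrase to appear in `field`."""
    return {
        "_source": source,  # limit the returned fields, as in the body above
        "query": {
            "bool": {
                "must": [{"match_phrase": {field: phrase}} for phrase in phrases]
            }
        },
    }


# Produces the same structure as the hand-written query_json above.
query_json = build_phrase_query(["汽車", "房子"])
print(query_json)
```

The resulting dict can be passed to es.search() in the same way as the hand-written one.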

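Normally, if printing query raises no error, you will see the results. However, you will notice that only a handful come back. That is because Elasticsearch returns only 10 hits by default (the size parameter). If you want all the results, the next step is the key part.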
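
To see the gap between how many documents matched and how many were actually returned, compare hits.total with the length of hits.hits. The sketch below assumes the two response shapes I am aware of: a plain integer on older servers and a {"value": ..., "relation": ...} object on Elasticsearch 7 and later. The helper name total_hits and the sample responses are hand-written for illustration, not real server output.

```python
def total_hits(response):
    """Return the total match count from a search response as an int."""
    total = response["hits"]["total"]
    # Elasticsearch 7+ reports an object with a "value" key;
    # older versions report a plain integer.
    if isinstance(total, dict):
        return total["value"]
    return total


# Hand-written sample responses (illustrative only).
old_style = {"hits": {"total": 235, "hits": [{"_id": "1"}] * 10}}
new_style = {
    "hits": {"total": {"value": 235, "relation": "eq"}, "hits": [{"_id": "1"}] * 10}
}

for resp in (old_style, new_style):
    print(total_hits(resp), "matched,", len(resp["hits"]["hits"]), "returned")
```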

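3. Exporting all results with a scroll cursor

Key points:

First, use the cursor to load all the result data into memory
Then write the in-memory results to disk, that is, to a file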

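query = es.search(index='1485073708892', body=query_json, scroll='5m', size=100)

results = query['hits']['hits']  # first page of results
total = query['hits']['total']   # total hit count (an int on older servers, a dict with a 'value' key on ES 7+)
scroll_id = query['_scroll_id']  # cursor used to fetch the remaining pages

while True:
    # the scroll parameter must be given, otherwise the call raises an error
    page = es.scroll(scroll_id=scroll_id, scroll='5m')
    scroll_id = page['_scroll_id']  # the id can change between calls
    if not page['hits']['hits']:
        break  # an empty page means everything has been fetched
    results += page['hits']['hits']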

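When sending the search request, tell ES that you want a scroll cursor and set how many hits each batch returns.

Define a list variable results to hold the data. In the code you can start with an empty list, results = [], or seed it with the first page of results, results = query['hits']['hits'].

Then write a loop that loads the remaining pages into that in-memory variable.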
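
The paging loop described above can also be wrapped in a generator so that the caller decides whether to keep everything in memory. This is only a sketch under a few assumptions: it uses the same es.search() and es.scroll() calls as this article plus the client's clear_scroll() call, and it stops on the first empty page rather than computing a page count. FakeEs is a hypothetical in-memory stand-in so the example runs without a server; with a real cluster you would pass the es object created earlier. The elasticsearch package also ships a ready-made helper, elasticsearch.helpers.scan, that covers the same ground.

```python
def scroll_all(es, index, body, size=100, scroll="5m"):
    """Yield every hit for `body`, fetching `size` hits per request."""
    page = es.search(index=index, body=body, scroll=scroll, size=size)
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            page = es.scroll(scroll_id=scroll_id, scroll=scroll)
            scroll_id = page["_scroll_id"]  # the id may change between pages
    finally:
        # Release the server-side search context instead of waiting for the timeout.
        es.clear_scroll(scroll_id=scroll_id)


class FakeEs:
    """Hypothetical in-memory stand-in for the client, for demonstration only."""

    def __init__(self, docs):
        self.docs, self.pos, self.size, self.cleared = docs, 0, 0, False

    def _page(self):
        hits = self.docs[self.pos:self.pos + self.size]
        self.pos += self.size
        return {"_scroll_id": "fake-id", "hits": {"hits": hits}}

    def search(self, index, body, scroll, size):
        self.pos, self.size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


fake = FakeEs([{"_id": str(n), "_source": {"title": f"t{n}"}} for n in range(250)])
results = list(scroll_all(fake, "mydata", {"query": {"match_all": {}}}))
print(len(results))  # 250: two full pages of 100 plus a final page of 50
```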

4. Writing the results to a CSV file

import csv

with open('./data/event_title.csv','w',newline='',encoding='utf-8') as flow:
    csv_writer = csv.writer(flow)
    for res in results:
        # print(res)
        # one list item per column; csv.writer adds the delimiter and any quoting
        csv_writer.writerow([res['_id'], res['_source']['title']])
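
A slightly more defensive variant of the CSV step might add a header row and tolerate hits that have no title. The sketch below is one way to do that, not the only one. The function name write_hits and the sample hits are invented for illustration, and it writes to an in-memory buffer so it can be tried without touching the disk.

```python
import csv
import io


def write_hits(hits, fileobj):
    """Write one row per hit: the document id and its title."""
    writer = csv.writer(fileobj)
    writer.writerow(["id", "title"])  # header row
    for hit in hits:
        # .get() tolerates documents that lack a title field.
        writer.writerow([hit["_id"], hit["_source"].get("title", "")])


# Hand-written sample hits (illustrative only).
sample = [
    {"_id": "a1", "_source": {"title": "Cars, houses and prices"}},
    {"_id": "a2", "_source": {}},
]
buffer = io.StringIO()
write_hits(sample, buffer)
print(buffer.getvalue())
```

Note that the title containing a comma is quoted automatically, which is why each field is passed as its own list item. To write to a real file, pass a file object opened with newline='' and encoding='utf-8', as in the snippet above.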

Done! The full code is shown below:

import csv
from elasticsearch import Elasticsearch

# Client parameter reference: https://pypi.org/project/elasticsearch/
es = Elasticsearch(hosts="http://192.168.21.33:9200/", http_auth=('abc','dataanalysis'))
query_json = {
  "_source": "title",
  "query": {
    "bool": {
      "must": [
        {"match_phrase": {
          "content": "汽車"
        }},
        {"match_phrase": {
          "content": "房子"
        }}
      ]
    }
  }
}
query = es.search(index='1485073708892',body=query_json,scroll='5m',size=100)

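results = query['hits']['hits']  # first page of results
total = query['hits']['total']   # total hit count (an int on older servers, a dict with a 'value' key on ES 7+)
scroll_id = query['_scroll_id']  # cursor used to fetch the remaining pages

while True:
    # the scroll parameter must be given, otherwise the call raises an error
    page = es.scroll(scroll_id=scroll_id, scroll='5m')
    scroll_id = page['_scroll_id']  # the id can change between calls
    if not page['hits']['hits']:
        break  # an empty page means everything has been fetched
    results += page['hits']['hits']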


with open('./data/event_title.csv','w',newline='',encoding='utf-8') as flow:
    csv_writer = csv.writer(flow)
    for res in results:
        # print(res)
        # one list item per column; csv.writer adds the delimiter and any quoting
        csv_writer.writerow([res['_id'], res['_source']['title']])


print('done!')
# print(es.info())