Exporting all values of one field of an ES index to a file
For exporting data out of ES, I mainly found the following approaches; feel free to add more:
"The snapshot and restore module allows to create snapshots of individual indices or an entire cluster into a remote repository like shared file system, S3, or HDFS. These snapshots are great for backups because they can be restored relatively quickly but they are not archival because they can only be restored to versions of Elasticsearch that can read the index."
In short, it is a tool for taking an image of an ES cluster and restoring it quickly. It does not satisfy this task's requirement of exporting a single field, so I did not look into it further. Interested readers can explore it on their own.
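For completeness, the snapshot API is driven entirely over REST. Below is a minimal sketch in the same Python 2 / urllib2 style as the script further down; the host, the repository name my_backup, and the path /mount/backups are illustrative assumptions, not values from this post:

import json
import urllib2

ES = "http://88.88.88.88:9200"

# Register a shared-filesystem repository (hypothetical name and path).
body = json.dumps({"type": "fs", "settings": {"location": "/mount/backups"}})
req = urllib2.Request(ES + "/_snapshot/my_backup", body)
req.get_method = lambda: "PUT"
print urllib2.urlopen(req).read()

# Snapshot the whole cluster into it (an "indices" field in the body would limit the scope).
req = urllib2.Request(ES + "/_snapshot/my_backup/snapshot_1?wait_for_completion=true")
req.get_method = lambda: "PUT"
print urllib2.urlopen(req).read()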
Although Java is the programming language I use most, running Java programs on Linux for a one-off task is a real hassle. Here is a link to a Java-based ES export example for anyone interested: 「elasticsearch使用Java API批量數據導入和導出」 (bulk data import and export with the Elasticsearch Java API).
Back to the topic. The top Google result for 「elasticsearch導出數據」 ("elasticsearch export data") is a Python script, in the GitHub repo lein-wang/elasticsearch_migrate:
#!/usr/bin/python
#coding:utf-8
'''
Export and Import ElasticSearch Data.
Simple Example At __main__
@author: wgzh159@163.com
@modifier: lzkhit@163.com
@note: data consistency is not checked, please verify it yourself
'''
import json
import os
import sys
import time
import urllib2

reload(sys)
sys.setdefaultencoding('utf-8')

class exportEsData():
    size = 10000

    def __init__(self, url, index, type, target_index):
        self.url = url+"/"+index+"/"+type+"/_search"
        self.index = index
        self.type = type
        self.target_index = target_index  # replaces the original index in the output file name
        self.file_name = self.target_index+"_"+self.type+".json"

    def exportData(self):
        print("export data begin...\n")
        begin = time.time()
        try:
            os.remove(self.file_name)
        except:
            os.mknod(self.file_name)
        msg = urllib2.urlopen(self.url).read()
        #print(msg)
        obj = json.loads(msg)
        num = obj["hits"]["total"]
        start = 0
        end = num/self.size+1   # read `size` docs per bulk request
        while(start<end):
            try:
                msg = urllib2.urlopen(self.url+"?from="+str(start*self.size)+"&size="+str(self.size)).read()
                self.writeFile(msg)
                start = start+1
            except urllib2.HTTPError, e:
                print 'There was an error with the request'
                print e
                break
            print(start)
        print("export data end!!!\n total consuming time:"+str(time.time()-begin)+"s")

    def writeFile(self, msg):
        obj = json.loads(msg)
        vals = obj["hits"]["hits"]
        f = open(self.file_name, "a")
        cnt = 0
        try:
            for val in vals:
                val_json = val["_source"]["content"]
                f.write(str(val_json)+"\n")
                cnt += 1
        finally:
            print(cnt)
            f.flush()
            f.close()

class importEsData():
    def __init__(self, url, index, type):
        self.url = url
        self.index = index
        self.type = type
        self.file_name = self.index+"_"+self.type+".json"

    def importData(self):
        print("import data begin...\n")
        begin = time.time()
        try:
            s = os.path.getsize(self.file_name)
            f = open(self.file_name, "r")
            data = f.read(s)
            # pitfall: the _bulk API requires "\n"-terminated lines (see the note below)
            self.post(data)
        finally:
            f.close()
        print("import data end!!!\n total consuming time:"+str(time.time()-begin)+"s")

    def post(self, data):
        print data
        print self.url
        req = urllib2.Request(self.url, data)
        r = urllib2.urlopen(req)
        response = r.read()
        print response
        r.close()

if __name__ == '__main__':
    '''
    Export Data
    e.g.
                     URL                  index       type
    exportEsData("http://10.100.142.60:9200", "watchdog", "mexception").exportData()

    export file name: watchdog_mexception.json
    '''
    exportEsData("http://88.88.88.88:9200","mtnews","articles","corpus").exportData()

    '''
    Import Data

    *import file name: watchdog_test.json (important)
        the part before "_" is the elasticsearch index
        the part after "_" is the elasticsearch type
    e.g.
                     URL                  index     type
    importEsData("http://10.100.142.60:9200", "watchdog", "test").importData()
    '''
    #importEsData("http://10.100.142.60:9200","watchdog","test").importData()
    #importEsData("http://127.0.0.1:9200/_bulk","chat","CHAT").importData()
    #importEsData("http://127.0.0.1:9200/_bulk","chat","TOPIC").importData()
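The pitfall flagged in importData is the _bulk request format: every document needs an action line followed by a source line, each terminated by \n, and the final newline is mandatory. Note that exportData as written dumps only the raw content values, so its output is not directly POST-able to /_bulk. A sketch of a bulk-compatible loop for the body of writeFile (same names as the script, shown only for illustration):

# Each hit becomes two "\n"-terminated lines: an action line and a source line.
for val in vals:
    action = {"index": {"_index": self.target_index, "_type": self.type}}
    f.write(json.dumps(action) + "\n")
    f.write(json.dumps(val["_source"]) + "\n")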
With everything in place, I ran the script with python, and a problem appeared:
"urllib2.HTTPError: HTTP Error 500: Internal Server Error"
Moreover, judging from the script's doc-count progress output, no matter what the bulk size was (I tried 10/50/100/500/1000/5000/10000), it always got stuck at the 10,000th document, and then urllib2 threw the exception.
My colleague, Brother Huang, analyzed the problem and suggested a few possible causes:
First, I added a sleep inside the while loop and reduced the bulk size to lower the TPS hitting ES, but the HTTP 500 error still appeared exactly at the 10,000-document mark, so that was a dead end.
For the second possibility, we had to log in to the ES host and inspect its logs, which contained the following:
Caused by: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [11000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level parameter.]
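The log itself points at the better long-term fix: the scroll API pages through results with a cursor instead of a from+size window, so it never runs into max_result_window. A minimal sketch, again in Python 2 / urllib2 and assuming ES 2.x (where /_search/scroll accepts a JSON body); the host, index, and type reuse the illustrative values from the script's __main__:

import json
import urllib2

ES = "http://88.88.88.88:9200"

# Open the scroll: returns the first page of hits plus a cursor kept alive for 1 minute.
rsp = json.loads(urllib2.urlopen(ES + "/mtnews/articles/_search?scroll=1m&size=1000").read())

f = open("mtnews_articles.json", "a")
try:
    while rsp["hits"]["hits"]:
        for hit in rsp["hits"]["hits"]:
            f.write(json.dumps(hit["_source"]) + "\n")
        # Fetch the next page with the cursor; no from/size window is involved.
        body = json.dumps({"scroll": "1m", "scroll_id": rsp["_scroll_id"]})
        rsp = json.loads(urllib2.urlopen(urllib2.Request(ES + "/_search/scroll", body)).read())
finally:
    f.close()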
As the article 「urllib2中HTTP狀態碼含義」 (the meanings of HTTP status codes in urllib2) explains, "5XX: response codes beginning with '5' indicate that the server has detected an error on its side and cannot continue processing the request". So this was indeed a server-side problem.
Back to the point: with the problem pinpointed, a fix certainly exists; see 「ES報錯Result window is too large問題處理」 (handling the ES "Result window is too large" error). The settings of the index in question need to be updated as follows:
curl -XPUT http://88.88.88.88:9200/mtnews/_settings -d '{ "index" : { "max_result_window" : 10000000}}'
This raises the index.max_result_window parameter mentioned in the log (its default is 10000).
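If you prefer to stay in Python, the same settings change can be issued with urllib2, mirroring the curl call above (the host and the new limit are the same illustrative values):

import json
import urllib2

body = json.dumps({"index": {"max_result_window": 10000000}})
req = urllib2.Request("http://88.88.88.88:9200/mtnews/_settings", body)
req.get_method = lambda: "PUT"  # urllib2 only issues GET/POST by default, so force PUT
print urllib2.urlopen(req).read()  # expect {"acknowledged":true}

Keep in mind that raising max_result_window only pushes the limit out: deep from+size paging still costs ES memory in proportion to from + size, which is why the log recommends the scroll API for large exports like this one.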