Search Engine - Elasticsearch

Date: 2019-11-08


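Note: ES is an open-source Java project; install a JRE and NodeJS beforehand.

1. Introduction

Elasticsearch is an open-source search engine built on Apache Lucene and is widely regarded as one of the most advanced, best-performing, and most full-featured search engines available.

I. Terminology

Shard: where the nodes in a cluster store documents. Shards stored on different nodes can be used for data recovery. The more CPU, RAM, and I/O a shard can use, the faster indexing runs.

index: similar to a database; multiple indexes correspond to multiple databases.

type: similar to a table name.

mapping: the table schema.

doc (document): the data; one JSON record is one document.

ES JSON: the request body template of the ES API, used for indexing data. ES strictly defines its format, which differs between versions.

filter: ES evaluates search clauses in two contexts. In query context ES computes a relevance score for each matching document (slower); in filter context ES only checks whether a document matches, skips scoring, and can cache the result (faster).

aggs (aggregations): similar to GROUP BY in a database; multiple aggregations can be nested.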
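To make the filter and aggs terms concrete, here is a minimal sketch of a search request body built as a Python dict. The field names `age` and `isLeader` come from the student mapping shown later in this article; the aggregation names and the helper `build_search_body` are hypothetical, and the exact JSON that ES accepts may differ between versions.

```python
import json

def build_search_body(min_age):
    """Build a search body that filters without scoring and groups the hits."""
    return {
        "size": 0,  # return only aggregation results, no individual hits
        "query": {
            "bool": {
                # filter context: yes/no matching, no relevance score
                "filter": [{"range": {"age": {"gte": min_age}}}]
            }
        },
        "aggs": {
            # roughly comparable to GROUP BY isLeader
            "by_leader": {
                "terms": {"field": "isLeader"},
                # nested aggregation: average age inside each bucket
                "aggs": {"avg_age": {"avg": {"field": "age"}}},
            }
        },
    }

print(json.dumps(build_search_body(18), indent=2))
```

The printed JSON could be sent as the body of a `_search` request.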

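2. Installation and Configuration

The following is a single-node configuration:

1) Download the ES archive and extract it locally.

2) Open elasticsearch.yml under /ES/config/.

For readability, comments and unused settings have been removed.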

# ---------------------------------- Cluster -----------------------------------
cluster.name: elasticsearch # ES uses this name to place nodes into the cluster

# ------------------------------------ Node ------------------------------------
node.name: node-master # node name; must be changed for each node in a cluster

# ----------------------------------- Paths ------------------------------------
#path.data: /path/to/data
#path.logs: /path/to/logs

# ----------------------------------- Memory -----------------------------------
#bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 127.0.0.1 # IP address the node binds to
transport.tcp.port: 9301 # must be changed for each node in a cluster
http.port: 9401 # must be changed for each node in a cluster

# --------------------------------- Discovery ----------------------------------
#discovery.zen.ping.unicast.hosts: ["host1", "host2"] # list of master-eligible nodes
##########Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):##########
discovery.zen.minimum_master_nodes: 1 # at least 1 master-eligible node

# ---------------------------------- Gateway -----------------------------------
#gateway.recover_after_nodes: 3

# ---------------------------------- Various -----------------------------------
#action.destructive_requires_name: true
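The split-brain comment in the configuration gives the quorum formula (total number of master-eligible nodes / 2 + 1). A minimal sketch of that arithmetic, using integer division; the function name `minimum_master_nodes` is only an illustrative helper, not an ES API:

```python
def minimum_master_nodes(master_eligible):
    """Quorum size: a strict majority of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1

for n in (1, 2, 3, 5):
    print(n, "->", minimum_master_nodes(n))
```

For a single node the result is 1, which matches the setting above.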

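I. Commands

1) In a terminal, go to /ES/bin/ and run elasticsearch, or run elasticsearch -d to start it in the background.

2) A foreground instance can be stopped with Ctrl+C. For a background instance, use ps -ef | grep elastic or jps to find the process ID.

3) When shards in the cluster show up as red Unassigned, investigate and fix the problem (shard status and other operations are available through the ES plugin described below).

(1) View cluster information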

curl "localhost:9401/_nodes/process?pretty"

(2) Find the UNASSIGNED shards

curl -XGET localhost:9401/_cat/shards|grep UNASSIGNED

(3) Reroute each UNASSIGNED shard in turn

curl -XPOST 'localhost:9401/_cluster/reroute' -d '{
    "commands" : [ {
        "allocate" : {
            "index" : "graylog_83",
            "shard" : 1,
            "node" : "Auq82gfGQVWgOBw6S7ajRQ",
            "allow_primary" : true
        }
    }]
}'
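Steps (2) and (3) can be combined in a small script. The sketch below parses text in the default `_cat/shards` column layout (index, shard, prirep, state, ...) and builds one `allocate` command per UNASSIGNED shard, mirroring the reroute body above. The sample text and the helper name `build_reroute_body` are made up for illustration, and the `allocate` command form depends on the ES version in use.

```python
import json

def build_reroute_body(cat_shards_text, node_id):
    """Turn `_cat/shards` output into a `_cluster/reroute` request body."""
    commands = []
    for line in cat_shards_text.splitlines():
        cols = line.split()
        # Expected columns: index shard prirep state [docs store ip node]
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            commands.append({
                "allocate": {
                    "index": cols[0],
                    "shard": int(cols[1]),
                    "node": node_id,
                    # Copied from the example above; forcing a primary
                    # allocation can lose data, so use it with care.
                    "allow_primary": True,
                }
            })
    return {"commands": commands}

# Hypothetical sample output, for illustration only.
sample = """graylog_83 0 p STARTED 120 1.2mb 127.0.0.1 node-master
graylog_83 1 p UNASSIGNED
graylog_83 2 p UNASSIGNED"""

print(json.dumps(build_reroute_body(sample, "Auq82gfGQVWgOBw6S7ajRQ"), indent=2))
```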

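II. Installing ES Monitoring

1) Download the open-source project elasticsearch-head.

2) In the elasticsearch-head directory, run npm install grunt-cli to install the grunt client.

3) Open Gruntfile.js in the elasticsearch-head directory.

4) Run the monitoring plugin and check the result.

3. ES API

I. Creating an Index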

{
    "student": {
        "properties": {
            "no": {
                "type": "string",
                "fielddata": true,
                "index": "analyzed"
            },
            "name": {
                "type": "string",
                "index": "analyzed"
            },
            "age": {
                "type": "integer"
            },
            "birth": {
                "type": "date",
                "format": "yyyy-MM-dd"
            },
            "isLeader": {
                "type": "boolean"
            }
        }
    }
}

Then call the ES REST API to create the index and type.
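As a sketch of that REST call, the snippet below builds (but does not send) a PUT request that creates an index with the student mapping. The index name `school` is hypothetical, the host and port are taken from the single-node configuration above, and the `mappings` wrapper with a type name assumes an older ES release that still supports `string` fields and mapping types.

```python
import json
import urllib.request

STUDENT_MAPPING = {
    "student": {
        "properties": {
            "no": {"type": "string", "fielddata": True, "index": "analyzed"},
            "name": {"type": "string", "index": "analyzed"},
            "age": {"type": "integer"},
            "birth": {"type": "date", "format": "yyyy-MM-dd"},
            "isLeader": {"type": "boolean"},
        }
    }
}

def build_create_index_request(base_url, index, mapping):
    """Build a PUT /<index> request whose body carries the type mapping."""
    body = json.dumps({"mappings": mapping}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/" + index,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

req = build_create_index_request("http://localhost:9401", "school", STUDENT_MAPPING)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req).read()
```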

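The result is then visible in the ES monitoring plugin.

II. Bulk Processing

The bulk API allows multiple create, index, update, or delete requests to be performed in a single step.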

curl -XPOST "http://172.16.13.4:9401/_bulk?pretty" -d '
{"delete": {"_index": "megacorp", "_type": "employee", "_id": "2"}}
{"create": {"_index": "megacorp", "_type": "employee", "_id": "2"}}
{"name": "first"}
{"index": {"_index": "megacorp", "_type": "employee"}}
{"name": "second"}
'
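The bulk body is newline-delimited JSON: each action line is followed by its source document where one is required (delete takes none), and the body must end with a newline. A minimal sketch of assembling such a body in Python; the helper `build_bulk_body` is an illustrative assumption, not part of any ES client:

```python
import json

def build_bulk_body(operations):
    """Serialize (action, source) pairs into a newline-delimited bulk body.

    `source` is None for actions that carry no document, such as delete.
    """
    lines = []
    for action, source in operations:
        lines.append(json.dumps(action))
        if source is not None:
            lines.append(json.dumps(source))
    # The bulk API requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"

meta = {"_index": "megacorp", "_type": "employee", "_id": "2"}
body = build_bulk_body([
    ({"delete": meta}, None),
    ({"create": meta}, {"name": "first"}),
])
print(body, end="")
```

Building the body programmatically avoids the two easy mistakes with hand-written bulk requests: a missing source line after an action, and a missing final newline.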

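III. ES Analyzers

An analyzer combines three functions: character filters (strip HTML, convert special symbols), a tokenizer (splits the text into terms), and token filters (remove or modify tokens, such as dropping stop words). Built-in analyzers include the standard, simple, whitespace, and language analyzers. See the separate article on ES analyzers for details.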
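The three stages can be illustrated with a toy pipeline in plain Python. This is only a conceptual sketch, not how ES or Lucene implements analysis; the regular expressions, the stop-word list, and the function names are all assumptions chosen for the example.

```python
import re

STOP_WORDS = {"the", "a", "is"}  # hypothetical stop-word list

def char_filter(text):
    """Character filter: strip HTML tags and convert '&' to 'and'."""
    return re.sub(r"<[^>]+>", " ", text).replace("&", " and ")

def tokenizer(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)

def token_filter(tokens):
    """Token filter: lowercase tokens and drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text):
    """Run the three stages in order, as an analyzer would."""
    return token_filter(tokenizer(char_filter(text)))

print(analyze("<p>The Quick & Brown fox</p>"))
```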

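4. ES Cluster

The configuration is straightforward, so it is not explained in detail. The principle is much like a Redis cluster: nodes detect peer timeouts and vote to elect a master node.

#####################################Master node 1#####################################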
# ---------------------------------- Cluster -----------------------------------
cluster.name: alex-es

# ------------------------------------ Node ------------------------------------
node.name: node1
node.master: true
node.data: true

# ----------------------------------- Path ------------------------------------
path.data: /path/to/data
path.logs: /path/to/logs

# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 172.16.13.4
transport.tcp.port: 9301
transport.tcp.compress: true
http.port: 9401
http.max_content_length: 100mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"

# --------------------------------- Discovery ----------------------------------
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["172.16.13.4:9301", "172.16.13.4:9302"]

# ---------------------------------- Gateway -----------------------------------
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

#####################################Master node 2#####################################
# ---------------------------------- Cluster -----------------------------------
cluster.name: alex-es

# ------------------------------------ Node ------------------------------------
node.name: node2
node.master: true
node.data: true

# ----------------------------------- Path ------------------------------------
path.data: /path/to/data2
path.logs: /path/to/logs2

# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 172.16.13.4
transport.tcp.port: 9302
transport.tcp.compress: true
http.port: 9402
http.max_content_length: 100mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"

# --------------------------------- Discovery ----------------------------------
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["172.16.13.4:9301", "172.16.13.4:9302"]

# ---------------------------------- Gateway -----------------------------------
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

#####################################Data-only node######################################
# ---------------------------------- Cluster -----------------------------------
cluster.name: alex-es

# ------------------------------------ Node ------------------------------------
node.name: node3
node.master: false
node.data: true

# ----------------------------------- Path ------------------------------------
path.data: /path/to/data3
path.logs: /path/to/logs3

# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 172.16.13.4
transport.tcp.port: 9303
transport.tcp.compress: true
http.port: 9403
http.max_content_length: 100mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"

# --------------------------------- Discovery ----------------------------------
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["172.16.13.4:9301", "172.16.13.4:9302"]

# ---------------------------------- Gateway -----------------------------------
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

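The configuration above must not contain stray spaces (for example, before a setting name). Once everything is configured, start all the nodes; the cluster then shows up in ES-head.

5. ES Client Issues

Official clients are available for Python, Java, and other languages. They implement round-robin over an ES connection pool, as well as search, indexing, bulk, and other operations.

I have recently been querying ES concurrently from multiple processes. When the number of requests rises over a period of time, responses time out in several of the processes.

After investigation, the following possible causes were ruled out:

1) Java GC behavior (including concurrent GC, Full GC, G1, and so on), since different GC mechanisms affect ES performance
2) ES queue size
3) The process pool: the ES queries are issued asynchronously at essentially the same time, so this is not the problem
4) CPU, memory, and ES configuration tuning

Finally, packet capture on the server showed that some requests took a noticeable time to reach ES, and the delay tended to grow as the number of requests increased. That narrowed the problem down to how the ES client sends requests.

After some research, the likely cause is that the transport used by the ES client lengthens request time. I decided to replace the ES client with Python's pycurl. The code is below; ES round-robin can be implemented on top of it: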

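import io
import random

import pycurl


def es_pool():
    # Replace with the real "ip:port" addresses of the ES nodes.
    return ["ip:port", "ip:port"]


# Send a search request through curl
def curl_req(index='', rtype='', body=''):
    buf = io.BytesIO()  # pycurl delivers the response as bytes on Python 3
    host = random.choice(es_pool())  # pick a node uniformly at random
    # Skip empty path parts so the URL never contains '//'
    url = '/'.join([host] + [p for p in (index, rtype) if p] + ['_search'])

    c = pycurl.Curl()
    try:
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.POST, 1)
        # ES 6.0 and later reject request bodies without a Content-Type
        c.setopt(pycurl.HTTPHEADER, ['Content-Type: application/json'])
        c.setopt(pycurl.POSTFIELDS, body.encode('utf-8'))
        c.setopt(pycurl.WRITEFUNCTION, buf.write)
        c.perform()
    finally:
        c.close()  # release the handle even if the request fails
    return buf.getvalue().decode('utf-8')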
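The function above picks a host at random. If strict round-robin is preferred, one possible sketch uses `itertools.cycle` guarded by a lock. The class name `RoundRobinPool` and the host strings are hypothetical, and each process in a multi-process setup would hold its own independent rotation.

```python
import itertools
import threading

class RoundRobinPool:
    """Hand out hosts in a fixed rotation, safe to call from several threads."""

    def __init__(self, hosts):
        if not hosts:
            raise ValueError("hosts must not be empty")
        self._cycle = itertools.cycle(list(hosts))
        self._lock = threading.Lock()

    def next_host(self):
        # The lock keeps the rotation consistent under concurrent calls.
        with self._lock:
            return next(self._cycle)

pool = RoundRobinPool(["10.0.0.1:9401", "10.0.0.2:9402"])  # placeholder hosts
print([pool.next_host() for _ in range(3)])
```

A request function could call `pool.next_host()` where the random selection is done now.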
