ElasticSearch 2.x: Getting Started and Quick Practice

Date: 2019-11-16


This article is part of the author's series on best practices for crawlers and search engines.

Introduction

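ElasticSearch is an open-source search engine built on Apache Lucene(TM). In both the open-source and the proprietary world, Lucene can be regarded as the most advanced, best-performing, and most fully featured search engine library to date. But Lucene is only a library. To use it, you must develop in Java and integrate it directly into your application. Worse, Lucene is very complex, and you need a solid understanding of information retrieval to see how it works. ElasticSearch is also written in Java and uses Lucene at its core for all indexing and searching, but its aim is to hide Lucene's complexity behind a simple RESTful API and so make full-text search easy.
However, Elasticsearch is more than Lucene and full-text search. It can also be described as: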

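A distributed real-time document store in which every field is indexed and searchable
A distributed search engine with real-time analytics
Able to scale to hundreds of servers and handle petabytes of structured or unstructured data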

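Moreover, all of this functionality is packaged into a single service that your application can talk to through a simple RESTful API, through clients in various languages, or even from the command line. Getting started with Elasticsearch is easy. It ships with sensible defaults and hides complicated search theory from beginners. It works out of the box, and only a little learning is needed before using it in production. In ElasticSearch we often hear about concepts such as Index, Type, and Document. They map onto the familiar terms of a traditional relational database as follows: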

Relational DB -> Databases -> Tables -> Rows -> Columns
Elasticsearch -> Indices   -> Types  -> Documents -> Fields

Here we borrow a mind map from this article to outline what you should know about the whole ElasticSearch ecosystem:

Reference

Books & Tutorial

ElasticSearch: The Definitive Guide (Chinese edition)
elasticsearch-definitive-guide

Quick Start

Installation

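Download the latest precompiled release of ElasticSearch here, then simply unpack it and start it. The author is using ElasticSearch 2.3.5 at the time of writing; its directory layout is as follows: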

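home---The directory into which Elasticsearch is unpacked
　　bin---Scripts for starting ES
　　config---Contains elasticsearch.yml, the ES configuration file
　　data---Shard data of the current ES node; it can be copied directly to other nodes for use
　　logs---Log files
　　plugins---Commonly used plugins are stored here; any additional plugins can be placed here as well.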

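By default, ElasticSearch 2.x refuses to run an instance as the root user. You can start it as root with bin/elasticsearch -Des.insecure.allow.root=true. You can also pass bin/elasticsearch -f -Des.path.conf=/path/to/config/dir to read the .yml or .json configuration from a given directory.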

Some other common settings are listed below:

Setting	Description
`http.port`	A bind port range. Defaults to `9200-9300`.
`http.publish_port`	The port that HTTP clients should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the `http.port` is not directly addressable from the outside. Defaults to the actual port assigned via `http.port`.
`http.bind_host`	The host address to bind the HTTP service to. Defaults to `http.host` (if set) or `network.bind_host`.
`http.publish_host`	The host address to publish for HTTP clients to connect to. Defaults to `http.host` (if set) or `network.publish_host`.
`http.host`	Used to set both `http.bind_host` and `http.publish_host`. Defaults to `network.host`.
`http.max_content_length`	The max content of an HTTP request. Defaults to `100mb`. If set to greater than `Integer.MAX_VALUE`, it will be reset to 100mb.
`http.max_initial_line_length`	The max length of an HTTP URL. Defaults to `4kb`
`http.max_header_size`	The max size of allowed headers. Defaults to `8kB`
`http.compression`	Support for compression when possible (with Accept-Encoding). Defaults to `false`.
`http.compression_level`	Defines the compression level to use. Defaults to `6`.
`http.cors.enabled`	Enable or disable cross-origin resource sharing, i.e. whether a browser on another origin can do requests to Elasticsearch. Defaults to `false`.
`http.cors.allow-origin`	Which origins to allow. Defaults to no origins allowed. If you prepend and append a `/` to the value, it is treated as a regular expression, which lets you support both HTTP and HTTPS. For example, `/https?:\/\/localhost(:[0-9]+)?/` would return the request header appropriately in both cases. `*` is a valid value but is considered a security risk, as your elasticsearch instance is then open to cross origin requests from anywhere.
`http.cors.max-age`	Browsers send a "preflight" OPTIONS-request to determine CORS settings. `max-age` defines how long the result should be cached for. Defaults to `1728000` (20 days)
`http.cors.allow-methods`	Which methods to allow. Defaults to `OPTIONS, HEAD, GET, POST, PUT, DELETE`.
`http.cors.allow-headers`	Which headers to allow. Defaults to `X-Requested-With, Content-Type, Content-Length`.
`http.cors.allow-credentials`	Whether the `Access-Control-Allow-Credentials` header should be returned. Note: This header is only returned, when the setting is set to `true`. Defaults to `false`
`http.detailed_errors.enabled`	Enables or disables the output of detailed error messages and stack traces in response output. Note: When set to `false` and the `error_trace` request parameter is specified, an error will be returned; when `error_trace` is not specified, a simple message will be returned. Defaults to `true`.
`http.pipelining`	Enable or disable HTTP pipelining, defaults to `true`.
`http.pipelining.max_events`	The maximum number of events to be queued up in memory before a HTTP connection is closed, defaults to `10000`.
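
The regular-expression form of `http.cors.allow-origin` is easy to get wrong, so it can help to try a pattern before putting it into the configuration. The following is only a rough Python sketch, not how Elasticsearch itself evaluates the setting: the function name `origin_allowed` is made up for illustration, and it assumes that a slash-wrapped value must match the whole origin, that `*` allows everything, and that any other value is compared literally.

```python
import re

# The example value from the settings table. The surrounding slashes mark
# the value as a regular expression.
ALLOW_ORIGIN_SETTING = r"/https?:\/\/localhost(:[0-9]+)?/"


def origin_allowed(setting: str, origin: str) -> bool:
    """Return True if `origin` is accepted by an allow-origin style value.

    Assumptions of this sketch: "*" allows every origin, a value wrapped in
    slashes is a regex that must match the whole origin, and anything else
    is compared literally.
    """
    if setting == "*":
        return True
    if len(setting) > 2 and setting.startswith("/") and setting.endswith("/"):
        return re.fullmatch(setting[1:-1], origin) is not None
    return setting == origin


for origin in ("http://localhost:9200", "https://localhost", "http://example.com"):
    print(origin, origin_allowed(ALLOW_ORIGIN_SETTING, origin))
```

Under these assumptions the two localhost origins are accepted, with or without a port and over either scheme, while the unrelated host is rejected.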

REST API

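Once an ElasticSearch instance is running, we can interact with it through its built-in JSON-based REST API. We can use the curl tool that the official tutorials use, or slightly more elaborate common tools such as Fiddler or RESTClient. Here, though, we recommend Sense, a Chrome extension that offers a lot of auto-completion for ElasticSearch.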

When we access the root path directly, we get the following response:

{
   "name": "Mister Fear",
   "cluster_name": "elasticsearch",
   "version": {
      "number": "2.3.5",
      "build_hash": "90f439ff60a3c0f497f91663701e64ccd01edbb4",
      "build_timestamp": "2016-07-27T10:36:52Z",
      "build_snapshot": false,
      "lucene_version": "5.5.0"
   },
   "tagline": "You Know, for Search"
}

CRUD

Index: Creating and Updating Documents

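In ElasticSearch, the Index action corresponds to both Create and Update in CRUD. When we index a document that does not yet exist, a new document is created automatically from its type and ID; otherwise the existing document is modified. ElasticSearch uses PUT requests for the Index operation, and you need to supply the index name, the type name, and an optional ID, in the form http://localhost:9200/<index>/<type>/[<id>]. The index name is largely free-form (though it must be lowercase), and if the index does not yet exist, ElasticSearch creates it automatically. Type names follow much the same rules as index names, but a type carries more detail than an index: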

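Each type has its own independent ID space
Different types have different mappings, that is, different schemes for indexing their properties/fields
Where possible, restrict a search request to a single type or to a specific set of types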

A typical Index request looks like this:

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
}'
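
The same request can be put together from code. Below is a minimal Python sketch that only builds the request and does not send it, using the standard library. The helper name `build_index_request` and the `BASE_URL` constant are hypothetical, and the choice of POST when no ID is given reflects how I understand ID auto-generation to work rather than anything stated in this article.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the curl example


def build_index_request(index, doc_type, doc, doc_id=None):
    """Build (but do not send) an index request for <index>/<type>/[<id>]."""
    url = f"{BASE_URL}/{index}/{doc_type}"
    if doc_id is not None:
        url += f"/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    # Assumption: PUT is used with an explicit ID (as in the article),
    # POST when the ID is left for the server to generate.
    method = "PUT" if doc_id is not None else "POST"
    return urllib.request.Request(url, data=body, method=method)


request = build_index_request(
    "movies",
    "movie",
    {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972},
    doc_id=1,
)
print(request.get_method(), request.full_url)
# To actually send it against a running node:
# urllib.request.urlopen(request)
```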

After this request runs, ElasticSearch creates a document with ID 1 under the movie type in the movies index. You can also run the request in Sense, which gives a somewhat nicer experience:

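From the screenshot above we can see that the response to the PUT request includes whether the operation succeeded, the document ID, and other information. If we now run a default global search, we get the following result: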

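As we can see, the newly created document can already be found. Next we modify that document by adding some keyword properties. We can again use a PUT request, but this time we must include the ID of the document to modify: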

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

The response to this operation is very similar to the previous one, except that the _version value has changed:

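This property tracks how many times the document has been modified. It can be used for optimistic concurrency control: a write that carries a version number is rejected if it conflicts with the stored version (with external versioning, for example, ElasticSearch only accepts a version higher than the one already stored).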
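
To make the idea of version-based optimistic concurrency control more concrete, here is a toy in-memory model in Python. It is not ElasticSearch's implementation; the class `ToyVersionedStore`, the exception `VersionConflict`, and the `expected_version` parameter are all invented for this illustration. A writer states which version its change was based on, and a writer holding a stale version is rejected.

```python
class VersionConflict(Exception):
    """Raised when a write names a version that is not the current one."""


class ToyVersionedStore:
    """In-memory stand-in for a document store with per-document versions."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, document)

    def index(self, doc_id, doc, expected_version=None):
        """Store `doc` and return its new version.

        If `expected_version` is given, the write is applied only when it
        equals the version currently stored for `doc_id`.
        """
        current_version, _ = self._docs.get(doc_id, (0, None))
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, doc)
        return new_version


store = ToyVersionedStore()
v1 = store.index("1", {"title": "The Godfather"})
v2 = store.index("1", {"title": "The Godfather", "genres": ["Crime", "Drama"]},
                 expected_version=v1)
try:
    # A second writer that still believes the document is at version 1.
    store.index("1", {"title": "stale write"}, expected_version=v1)
    conflict = False
except VersionConflict:
    conflict = True
print(v1, v2, conflict)
```

The first two writes succeed and bump the version from 1 to 2, while the third one is refused because it was based on an outdated version.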

GET

The simplest way to fetch a document is to look it up by its ID. The standard request format is http://localhost:9200/<index>/<type>/<id>. Let us query the movie data inserted above:

curl -XGET "http://localhost:9200/movies/movie/1" -d''

The response likewise contains the version, the ID, and the source document.

Delete: Removing Documents

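Now let us delete one of the documents inserted above. To delete a document we again need to pass the index name, the type name, and the document ID, for example: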

curl -XDELETE "http://localhost:9200/movies/movie/1" -d''

After deleting the document, fetching it again with GET yields the following response:
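
Client code usually has to tell a successful lookup apart from a lookup of a deleted document. The Python sketch below shows one way this might be handled. The two dictionaries are hand-written stand-ins rather than captured output, and their shape (in particular the `found` flag and the `_source` key) is an assumption based on the fields this article mentions; the helper name `extract_source` is hypothetical.

```python
def extract_source(get_response):
    """Return the stored document, or None if the lookup found nothing."""
    if not get_response.get("found", False):
        return None
    return get_response["_source"]


# Hand-written stand-ins for a GET response before and after the delete.
# The field names are assumptions, not output copied from a real node.
before_delete = {
    "_index": "movies", "_type": "movie", "_id": "1", "_version": 2,
    "found": True,
    "_source": {"title": "The Godfather", "year": 1972},
}
after_delete = {"_index": "movies", "_type": "movie", "_id": "1", "found": False}

print(extract_source(before_delete))
print(extract_source(after_delete))
```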

Search

The most attractive feature of ElasticSearch is its fast and convenient search. We first create some test documents with the following commands:

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
    "title": "Lawrence of Arabia",
    "director": "David Lean",
    "year": 1962,
    "genres": ["Adventure", "Biography", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
    "title": "To Kill a Mockingbird",
    "director": "Robert Mulligan",
    "year": 1962,
    "genres": ["Crime", "Drama", "Mystery"]
}'

curl -XPUT "http://localhost:9200/movies/movie/4" -d'
{
    "title": "Apocalypse Now",
    "director": "Francis Ford Coppola",
    "year": 1979,
    "genres": ["Drama", "War"]
}'

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
    "title": "Kill Bill: Vol. 1",
    "director": "Quentin Tarantino",
    "year": 2003,
    "genres": ["Action", "Crime", "Thriller"]
}'

curl -XPUT "http://localhost:9200/movies/movie/6" -d'
{
    "title": "The Assassination of Jesse James by the Coward Robert Ford",
    "director": "Andrew Dominik",
    "year": 2007,
    "genres": ["Biography", "Crime", "Drama"]
}'
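
Documents like the ones above can also be loaded in a single call through the _bulk endpoint. The following Python sketch shows how such a request body might be assembled; it only builds the text and sends nothing. The helper name `build_bulk_body` is hypothetical, only two of the movies are included to keep it short, and the layout (one action line followed by one document line, each newline-terminated) follows the bulk format as I understand it.

```python
import json

# Two of the test documents from above, keyed by document ID.
movies = {
    "1": {"title": "The Godfather", "director": "Francis Ford Coppola",
          "year": 1972, "genres": ["Crime", "Drama"]},
    "2": {"title": "Lawrence of Arabia", "director": "David Lean",
          "year": 1962, "genres": ["Adventure", "Biography", "Drama"]},
}


def build_bulk_body(index, doc_type, docs_by_id):
    """Build a newline-delimited bulk body that indexes every document."""
    lines = []
    for doc_id, doc in docs_by_id.items():
        # Action line: what to do and where the document goes.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # Every line, including the last one, ends with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("movies", "movie", movies)
print(body)
```

The resulting text would be sent as the body of a POST request to the _bulk endpoint.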

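Note that ElasticSearch provides a general-purpose _bulk endpoint for creating multiple documents in a single request, but for simplicity we issue separate requests here. Searching in ElasticSearch is done mainly through the _search endpoint. Its standard request format is <index>/<type>/_search, where both index and type are optional. In other words, we can issue requests in the following ways: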

http://localhost:9200/_search... - search all indices and types
http://localhost:9200/movies/... - search all types in the movies index
http://localhost:9200/movies/... - search only documents of the movie type in the movies index
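
The three variants differ only in which optional path segments are present, which can be expressed in a few lines of Python. This is just an illustration of the <index>/<type>/_search format described above; the function `search_url` and its parameters are hypothetical, and the sketch deliberately refuses a type without an index to stay simple.

```python
def search_url(index=None, doc_type=None, base_url="http://localhost:9200"):
    """Build a _search URL following the <index>/<type>/_search format."""
    if doc_type is not None and index is None:
        raise ValueError("this sketch only supports a type together with an index")
    parts = [base_url]
    if index is not None:
        parts.append(index)
    if doc_type is not None:
        parts.append(doc_type)
    parts.append("_search")
    return "/".join(parts)


print(search_url())                   # all indices and types
print(search_url("movies"))           # all types in the movies index
print(search_url("movies", "movie"))  # only the movie type in the movies index
```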

Full-Text Search

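ElasticSearch's Query DSL offers a rich syntax with many powerful query types. Its core query string supports many options and is compiled by ElasticSearch into a number of simpler queries. The simplest query is a full-text search; for example, here we search for the keyword kill: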

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "kill"
        }
    }
}'

Executing this request may return a response like the following:

Searching Specific Fields

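In the simple full-text search above, every field of every document is searched. Often, though, we only need to search certain fields; for example, to find only the documents whose title contains the term ford: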

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "ford",
            "fields": ["title"]
        }
    }
}'
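
The two query bodies shown so far differ only in the optional fields list, so a client can build both from one small helper. A minimal Python sketch, with `query_string_body` being a hypothetical name and the structure copied from the curl examples above:

```python
import json


def query_string_body(text, fields=None):
    """Build a query_string search body, optionally limited to some fields."""
    query_string = {"query": text}
    if fields:
        # Only add the key when the caller wants to restrict the search.
        query_string["fields"] = list(fields)
    return {"query": {"query_string": query_string}}


# Full-text search across all fields, as in the kill example.
print(json.dumps(query_string_body("kill"), indent=4))
# Search restricted to the title field, as in the ford example.
print(json.dumps(query_string_body("ford", fields=["title"]), indent=4))
```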

In the full-text search, by contrast, fields was implicitly set to its default value of _all: