The Elasticsearch Search Engine

Date: 2019-11-12


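Elasticsearch is a distributed, scalable, real-time search and analytics engine built on top of the full-text search library Apache Lucene(TM). Elasticsearch is more than just Lucene, however: besides full-text search, it also provides the following:

A distributed real-time document store in which every field is indexed and searchable.
A distributed search engine with real-time analytics.
The ability to scale out to hundreds of servers and handle petabytes of structured or unstructured data.

Elasticsearch is a document-oriented database: each record is a document, and JSON is used as the document serialization format: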

{
    "name" :     "John",
    "sex" :      "Male",
    "age" :      25,
    "birthDate": "1990/05/01",
    "about" :    "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

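With a database such as MySQL, the natural approach would be to create a User table with the corresponding columns. In Elasticsearch, this record is a document; the document belongs to a type (here, User), and multiple types live inside an index. The following is a simplified mapping between Elasticsearch and relational database terminology:

Relational DB  ⇒ Databases ⇒ Tables ⇒ Rows      ⇒ Columns

Elasticsearch  ⇒ Indices   ⇒ Types  ⇒ Documents ⇒ Fields

An Elasticsearch cluster can contain multiple indices (databases), each of which contains multiple types (tables). Each type holds many documents (rows), and each document has many fields (columns). You can interact with Elasticsearch through the Java API or directly through the RESTful HTTP API. For example, to insert a record, send a simple HTTP request: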

PUT /megacorp/employee/1  
{
    "name" :     "John",
    "sex" :      "Male",
    "age" :      25,
    "about" :    "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}
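The same request can be prepared from a program. The following is a minimal sketch using only the Python standard library; the helper name build_index_request and the base URL http://localhost:9200 are assumptions made for illustration. It builds the request without sending it, so it can be inspected without a running node.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) the PUT request that stores one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_index_request(
    "http://localhost:9200",  # assumed address of a local node
    "megacorp",
    "employee",
    1,
    {
        "name": "John",
        "sex": "Male",
        "age": 25,
        "about": "I love to go rock climbing",
        "interests": ["sports", "music"],
    },
)
print(request.get_method(), request.full_url)
```

To actually send it against a running node, pass the object to urllib.request.urlopen(request).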

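Installing Elasticsearch. Official website: http://www.elasticsearch.org

Unzip the archive and double-click elasticsearch.bat; this script starts Elasticsearch. After a short wait, open a browser and go to http://localhost:9200. If a response like the following is displayed, Elasticsearch was installed successfully.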

http://127.0.0.1:9200/


{
  "name" : "node-1",
  "cluster_name" : "my-application",
  "version" : {
    "number" : "2.4.0",
    "build_hash" : "ce9f0c7394dee074091dd1bc4e9469251181fc55",
    "build_timestamp" : "2016-08-29T09:14:17Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.2"
  },
  "tagline" : "You Know, for Search"
}

Next, install the head plugin

Press Windows+R, type cmd to open a command prompt, change to the bin directory of Elasticsearch, and install the plugin with the plugin command:

cd D:\elasticsearch-2.4.0\bin
plugin install mobz/elasticsearch-head

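Open http://localhost:9200/_plugin/head/ in a local browser. If the head interface loads, the plugin was installed successfully.

Creating an index

The request body, a JSON document, contains all the details about this employee: his name is John Smith, he is 25 years old, and he enjoys rock climbing.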

PUT /megacorp/employee/1
{
    "first_name" : "John",
    "last_name" :  "Smith",
    "age" :        25,
    "about" :      "I love to go rock climbing",
    "interests": [ "sports", "music" ]
}

Note that the path /megacorp/employee/1 contains three pieces of information:

megacorp: the index name

employee: the type name

1: the ID of this particular employee
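To make the three-part structure concrete, here is a small sketch that splits such a path into its components. The function name parse_document_path is hypothetical and exists only for this illustration.

```python
def parse_document_path(path):
    """Split '/index/type/id' into its three components."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /index/type/id, got {path!r}")
    index_name, type_name, doc_id = parts
    return index_name, type_name, doc_id


index_name, type_name, doc_id = parse_document_path("/megacorp/employee/1")
print(index_name, type_name, doc_id)
```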

PUT /megacorp/employee/2
{
    "first_name" :  "Jane",
    "last_name" :   "Smith",
    "age" :         32,
    "about" :       "I like to collect rock albums",
    "interests":  [ "music" ]
}

PUT /megacorp/employee/3
{
    "first_name" :  "Douglas",
    "last_name" :   "Fir",
    "age" :         35,
    "about":        "I like to build cabinets",
    "interests":  [ "forestry" ]
}

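After adding these documents, check what has changed.

Retrieving a document

Simply issue an HTTP GET request and specify the address of the document: the index, type, and ID. These three pieces of information are enough to return the original JSON document: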

http://localhost:9200/megacorp/employee/1

GET /megacorp/employee/1

{
  "_index" :   "megacorp",
  "_type" :    "employee",
  "_id" :      "1",
  "_version" : 1,
  "found" :    true,
  "_source" :  {
      "first_name" :  "John",
      "last_name" :   "Smith",
      "age" :         25,
      "about" :       "I love to go rock climbing",
      "interests":  [ "sports", "music" ]
  }
}

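Similarly, you can use the DELETE verb to delete a document and the HEAD verb to check whether a document exists.

Simple search

Use _search to search for all employees: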

http://localhost:9200/megacorp/employee/_search

GET /megacorp/employee/_search

{
   "took":      6,
   "timed_out": false,
   "_shards": { ... },
   "hits": {
      "total":      3,
      "max_score":  1,
      "hits": [
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "3",
            "_score":         1,
            "_source": {
               "first_name":  "Douglas",
               "last_name":   "Fir",
               "age":         35,
               "about":       "I like to build cabinets",
               "interests": [ "forestry" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "1",
            "_score":         1,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            "_index":         "megacorp",
            "_type":          "employee",
            "_id":            "2",
            "_score":         1,
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}

Conditional queries

This is a query-string search.

Let's search for employees whose last name is Smith:

http://127.0.0.1:9200/megacorp/employee/_search?q=last_name:Smith

GET /megacorp/employee/_search?q=last_name:Smith

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.30685282,
      "hits": [
         {
            ...
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
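When the query string is assembled in code, the q parameter should be URL-encoded. The following sketch uses the Python standard library; the helper name build_query_string_url is an assumption for illustration, and the host is the one used in the example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_string_url(base_url, index, doc_type, field, value):
    """Return a _search URL carrying a field:value query-string search."""
    query = urlencode({"q": f"{field}:{value}"})
    return f"{base_url}/{index}/{doc_type}/_search?{query}"


url = build_query_string_url(
    "http://127.0.0.1:9200", "megacorp", "employee", "last_name", "Smith"
)
print(url)
# Decoding the query part gives back the original field:value expression.
print(parse_qs(urlsplit(url).query)["q"][0])
```

Encoding matters once the value contains spaces or other reserved characters, which would otherwise produce a malformed URL.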

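Query DSL

Elasticsearch provides a rich, flexible query language called the query DSL, which supports building more complex and robust queries. The following example uses a match query: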

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "last_name" : "Smith"
        }
    }
}

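More complex searches

Search again for employees whose last name is Smith, but this time only those older than 30. The query needs a small adjustment: it uses a filter, which executes structured queries efficiently.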

GET /megacorp/employee/_search
{
    "query" : {
        "bool": {
            "must": {
                "match" : {
                    "last_name" : "smith" 
                }
            },
            "filter": {
                "range" : {
                    "age" : { "gt" : 30 } 
                }
            }
        }
    }
}

The range filter finds documents whose age is greater than 30; gt stands for "greater than".

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.30685282,
      "hits": [
         {
            ...
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
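As a rough illustration of what this bool query selects, the sketch below builds the same request body in Python and emulates it over the three example employees held in memory. The helper names build_filtered_query and emulate are hypothetical, and the emulation ignores analysis details and scoring; it only mirrors the match-plus-range logic.

```python
import json

EMPLOYEES = [
    {"first_name": "John", "last_name": "Smith", "age": 25},
    {"first_name": "Jane", "last_name": "Smith", "age": 32},
    {"first_name": "Douglas", "last_name": "Fir", "age": 35},
]


def build_filtered_query(last_name, min_age):
    """Build the request body: match on last_name, filter on age > min_age."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"last_name": last_name}},
                "filter": {"range": {"age": {"gt": min_age}}},
            }
        }
    }


def emulate(employees, last_name, min_age):
    """In-memory stand-in: case-insensitive name match plus age filter."""
    return [
        e for e in employees
        if e["last_name"].lower() == last_name.lower() and e["age"] > min_age
    ]


print(json.dumps(build_filtered_query("smith", 30), indent=2))
print([e["first_name"] for e in emulate(EMPLOYEES, "smith", 30)])
```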

Range queries

A range query accepts both inclusive and exclusive bounds. The available options are:

gt: > greater than
lt: < less than
gte: >= greater than or equal to
lte: <= less than or equal to

This is similar to a range query in MySQL:

SELECT document
FROM   products
WHERE  price BETWEEN 20 AND 40
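SQL's BETWEEN is inclusive on both ends, so its counterpart in a range query combines gte and lte. The sketch below shows that translation and a tiny evaluator for the four operators; the names between and satisfies are hypothetical helpers, not part of any Elasticsearch client.

```python
import operator

# The four range operators and the comparisons they stand for.
RANGE_OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def between(field, low, high):
    """Range clause equivalent to SQL 'field BETWEEN low AND high'."""
    return {"range": {field: {"gte": low, "lte": high}}}


def satisfies(value, bounds):
    """Check a value against a dict of bounds such as {'gt': 30}."""
    return all(RANGE_OPERATORS[name](value, bound) for name, bound in bounds.items())


clause = between("price", 20, 40)
print(clause)
print(satisfies(20, clause["range"]["price"]))  # boundary value is included
print(satisfies(20, {"gt": 20}))                # exclusive bound rejects it
```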

Full-text search

Search for all employees who enjoy rock climbing:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

{
   ...
   "hits": {
      "total":      2,
      "max_score":  0.16273327,
      "hits": [
         {
            ...
            "_score":         0.16273327, 
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         },
         {
            ...
            "_score":         0.016878016, 
            "_source": {
               "first_name":  "Jane",
               "last_name":   "Smith",
               "age":         32,
               "about":       "I like to collect rock albums",
               "interests": [ "music" ]
            }
         }
      ]
   }
}
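Jane's document is returned even though she never mentions climbing, because match analyzes the query into individual terms and a document matches if it contains at least one of them; documents that match more terms score higher. The toy sketch below shows only that term-overlap idea. It is not the Lucene scoring formula, and tokenize and matching_terms are hypothetical helpers.

```python
import re

ABOUT = {
    "John": "I love to go rock climbing",
    "Jane": "I like to collect rock albums",
    "Douglas": "I like to build cabinets",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_terms(query, text):
    """Return the query terms that also occur in the text."""
    doc_terms = set(tokenize(text))
    return [term for term in tokenize(query) if term in doc_terms]


for name, about in ABOUT.items():
    print(name, matching_terms("rock climbing", about))
```

John matches both terms, Jane matches only "rock", and Douglas matches none, which mirrors the ordering of the hits above.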

Phrase search

To match an exact sequence of words (a phrase), adjust the match query slightly and use the match_phrase query:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    }
}

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            }
         }
      ]
   }
}
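The difference from match is that the terms must appear next to each other and in the same order. A simplified in-memory sketch of that check follows; contains_phrase is a hypothetical helper, and real phrase matching works on indexed term positions rather than on raw text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(phrase, text):
    """True if the phrase tokens occur consecutively, in order, in the text."""
    wanted = tokenize(phrase)
    tokens = tokenize(text)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))


print(contains_phrase("rock climbing", "I love to go rock climbing"))
print(contains_phrase("rock climbing", "I like to collect rock albums"))
```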

Highlighting search results

Run the previous query again with an additional highlight parameter:

GET /megacorp/employee/_search
{
    "query" : {
        "match_phrase" : {
            "about" : "rock climbing"
        }
    },
    "highlight": {
        "fields" : {
            "about" : {}
        }
    }
}

The result now contains an additional section called highlight. It holds the text fragments from the about field that matched, wrapped in HTML <em></em> tags:

{
   ...
   "hits": {
      "total":      1,
      "max_score":  0.23013961,
      "hits": [
         {
            ...
            "_score":         0.23013961,
            "_source": {
               "first_name":  "John",
               "last_name":   "Smith",
               "age":         25,
               "about":       "I love to go rock climbing",
               "interests": [ "sports", "music" ]
            },
            "highlight": {
               "about": [
                  "I love to go <em>rock</em> <em>climbing</em>" 
               ]
            }
         }
      ]
   }
}
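The wrapping itself is easy to picture. The following toy highlighter, a sketch with a hypothetical highlight function, wraps every word of the text that also appears in the query. The real highlighter works on analyzed terms and can return several fragments per field.

```python
import re


def highlight(query, text, pre="<em>", post="</em>"):
    """Wrap each word of text that also occurs in the query with pre/post tags."""
    terms = {word.lower() for word in re.findall(r"\w+", query)}

    def wrap(match):
        word = match.group(0)
        return f"{pre}{word}{post}" if word.lower() in terms else word

    return re.sub(r"\w+", wrap, text)


print(highlight("rock climbing", "I love to go rock climbing"))
```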

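Analytics and aggregations

Elasticsearch has a feature called aggregations, which lets us generate fine-grained analytics over the data. Aggregations are similar to GROUP BY in SQL but more powerful.

Aggregation types

Metrics aggregations apply simple operations such as avg or max to the filtered data set and produce a single value.

Bucket aggregations split the filtered data set into several smaller sets according to some criteria; metrics can then be applied to each of these smaller sets.

Let's find the most popular interests shared among all employees: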

GET /megacorp/employee/_search
{
  "aggs": {
    "all_interests": {
      "terms": { "field": "interests" }
    }
  }
}

{
   ...
   "hits": { ... },
   "aggregations": {
      "all_interests": {
         "buckets": [
            {
               "key":       "music",
               "doc_count": 2
            },
            {
               "key":       "forestry",
               "doc_count": 1
            },
            {
               "key":       "sports",
               "doc_count": 1
            }
         ]
      }
   }
}
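Conceptually, a terms aggregation counts how many documents contain each distinct value. The in-memory sketch below reproduces the buckets above for the three example employees; terms_buckets is a hypothetical helper, and the alphabetical tie-break is an assumption made so that the output is deterministic.

```python
from collections import Counter

EMPLOYEES = [
    {"first_name": "John", "interests": ["sports", "music"]},
    {"first_name": "Jane", "interests": ["music"]},
    {"first_name": "Douglas", "interests": ["forestry"]},
]


def terms_buckets(docs, field):
    """Count documents per distinct value of a multi-valued field."""
    counts = Counter(value for doc in docs for value in set(doc.get(field, [])))
    # Largest bucket first; ties broken alphabetically (illustrative choice).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered]


for bucket in terms_buckets(EMPLOYEES, "interests"):
    print(bucket)
```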

To find the most popular interests among employees named Smith, simply add the appropriate query alongside the aggregation:

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "last_name": "smith"
    }
  },
  "aggs": {
    "all_interests": {
      "terms": {
        "field": "interests"
      }
    }
  }
}

...
  "all_interests": {
     "buckets": [
        {
           "key": "music",
           "doc_count": 2
        },
        {
           "key": "sports",
           "doc_count": 1
        }
     ]
  }

Aggregations also support hierarchical rollups. For example, to find the average age of the employees who share each interest:

GET /megacorp/employee/_search
{
    "aggs" : {
        "all_interests" : {
            "terms" : { "field" : "interests" },
            "aggs" : {
                "avg_age" : {
                    "avg" : { "field" : "age" }
                }
            }
        }
    }
}

...
  "all_interests": {
     "buckets": [
        {
           "key": "music",
           "doc_count": 2,
           "avg_age": {
              "value": 28.5
           }
        },
        {
           "key": "forestry",
           "doc_count": 1,
           "avg_age": {
              "value": 35
           }
        },
        {
           "key": "sports",
           "doc_count": 1,
           "avg_age": {
              "value": 25
           }
        }
     ]
  }
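The nested aggregation amounts to grouping employees by interest and then averaging the ages inside each group. A minimal in-memory sketch, with the hypothetical helper average_age_by_interest, reproduces the values shown above.

```python
from collections import defaultdict

EMPLOYEES = [
    {"first_name": "John", "age": 25, "interests": ["sports", "music"]},
    {"first_name": "Jane", "age": 32, "interests": ["music"]},
    {"first_name": "Douglas", "age": 35, "interests": ["forestry"]},
]


def average_age_by_interest(employees):
    """Bucket employees by interest, then compute the mean age per bucket."""
    ages = defaultdict(list)
    for employee in employees:
        for interest in set(employee["interests"]):
            ages[interest].append(employee["age"])
    return {interest: sum(values) / len(values) for interest, values in ages.items()}


print(average_age_by_interest(EMPLOYEES))
```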

Mapping

// Create the index
PUT book

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "book"
}


// View the empty mapping

GET book/_mapping

 
{
  "book": {
    "mappings": {}
  }
}

// Insert a document
PUT book/it/1
{
  "bookId":1,
  "bookName":"Java程序設計",
  "publishDate":"2018-01-12"
}

// View the mapping again
GET book/_mapping

{
  "book": {
    "mappings": {
      "it": {
        "properties": {
          "bookId": {
            "type": "long"
          },
          "bookName": {
            "type": "string"
          },
          "publishDate": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          }
        }
      }
    }
  }
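}

"bookId": { "type": "long" }, "bookName": { "type": "string" }, "publishDate": { "type": "date" }: Elasticsearch automatically mapped each field to a matching data type (dynamic mapping), which is a very powerful feature.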

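To give a feel for how a type can be guessed from a JSON value, here is a deliberately simplified sketch. The functions infer_field_type and infer_mapping are hypothetical, and the rules are assumptions chosen to reproduce the mapping shown above; the real dynamic mapping has more rules and configurable date formats.

```python
from datetime import date


def infer_field_type(value):
    """Guess a field type from a JSON value (simplified illustration)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        try:
            date.fromisoformat(value)  # e.g. "2018-01-12"
            return "date"
        except ValueError:
            return "string"
    return "object"


def infer_mapping(document):
    """Build a properties-style mapping for a flat document."""
    return {name: {"type": infer_field_type(value)} for name, value in document.items()}


book = {"bookId": 1, "bookName": "Java程序設計", "publishDate": "2018-01-12"}
print(infer_mapping(book))
```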
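Geo-points

A geo-point is a single point on the surface of the Earth described by a latitude and longitude. Geo-points can be used to calculate the distance between two coordinates, to determine whether a coordinate falls inside an area, or in aggregations.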

PUT /attractions
{
  "mappings": {
    "restaurant": {
      "properties": {
        "name": {
          "type": "string"
        },
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT /attractions/restaurant/1
{
  "name":     "Chipotle Mexican Grill",
  "location": "40.715, -74.011" 
}

PUT /attractions/restaurant/2
{
  "name":     "Pala Pizza",
  "location": { 
    "lat":     40.722,
    "lon":    -73.989
  }
}

PUT /attractions/restaurant/3
{
  "name":     "Mini Munchies Pizza",
  "location": [ -73.983, 40.719 ]
}

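The three requests above write location in three different formats: a "lat,lon" string, an object with lat and lon keys, and an array in [lon, lat] order. Note that the array form reverses the order used by the string form. The sketch below normalizes all three into a (lat, lon) tuple; to_lat_lon is a hypothetical helper, and it does not handle geohash strings.

```python
def to_lat_lon(location):
    """Normalize a geo-point given as string, object, or array into (lat, lon)."""
    if isinstance(location, str):
        # String form: "lat,lon"
        lat, lon = (float(part) for part in location.split(","))
        return lat, lon
    if isinstance(location, dict):
        # Object form: {"lat": ..., "lon": ...}
        return float(location["lat"]), float(location["lon"])
    if isinstance(location, (list, tuple)):
        # Array form: [lon, lat], the reverse of the string form
        lon, lat = location
        return float(lat), float(lon)
    raise TypeError(f"unsupported geo-point format: {location!r}")


print(to_lat_lon("40.715, -74.011"))
print(to_lat_lon({"lat": 40.722, "lon": -73.989}))
print(to_lat_lon([-73.983, 40.719]))
```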
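Four geo-point filters are available to include or exclude documents:

geo_bounding_box: finds points that fall within the specified rectangle.

geo_distance: finds points within a given distance of the specified location.

geo_distance_range: finds points whose distance from the specified point lies between a given minimum and maximum.

geo_polygon: finds points that fall within a polygon. This filter is very expensive.

Retrieving all documents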

GET /attractions/restaurant/_search
{}

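Specify the top, bottom, left, and right bounds of a rectangle; the filter then only has to check whether the longitude lies between the left and right bounds and the latitude between the top and bottom bounds: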

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location": { 
            "top_left": {
              "lat":  40.8,
              "lon": -74.0
            },
            "bottom_right": {
              "lat":  40.7,
              "lon": -73.0
            }
          }
        }
      }
    }
  }
}
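The check performed by the bounding-box filter can be sketched in a few lines. The function in_bounding_box is a hypothetical helper that uses the same top_left and bottom_right corners as the query above; it does not handle boxes that cross the 180° meridian.

```python
RESTAURANTS = {
    "Chipotle Mexican Grill": (40.715, -74.011),
    "Pala Pizza": (40.722, -73.989),
    "Mini Munchies Pizza": (40.719, -73.983),
}


def in_bounding_box(lat, lon, top_left, bottom_right):
    """True if (lat, lon) lies inside the rectangle given by its two corners."""
    lat_ok = bottom_right["lat"] <= lat <= top_left["lat"]
    lon_ok = top_left["lon"] <= lon <= bottom_right["lon"]
    return lat_ok and lon_ok


top_left = {"lat": 40.8, "lon": -74.0}
bottom_right = {"lat": 40.7, "lon": -73.0}
for name, (lat, lon) in RESTAURANTS.items():
    print(name, in_bounding_box(lat, lon, top_left, bottom_right))
```

The first restaurant falls just outside the box because its longitude, -74.011, lies west of the left bound -74.0.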

The geo_distance filter draws a circle around the given location and finds all documents whose location field lies within 1km of that point:

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "1km", 
          "location": { 
            "lat":  40.715,
            "lon": -73.988
          }
        }
      }
    }
  }
}

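Several algorithms are available for calculating the distance between two points, each trading performance against accuracy:

arc: the slowest but most accurate calculation, which treats the world as a sphere. Its accuracy is still limited, because the world is not a perfect sphere.

plane: treats the Earth as flat. It is faster but less accurate; it is most accurate near the equator and becomes less accurate toward the poles.

sloppy_arc: so named because it uses Lucene's SloppyMath class. It trades accuracy for speed and uses the Haversine formula to calculate distance. It is four to five times as fast as arc, with 99.9% distance accuracy. This is the default calculation.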

{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance":      "1km",
          "distance_type": "plane", 
          "location": {
            "lat":  40.715,
            "lon": -73.988
          }
        }
      }
    }
  }
}
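To show the kind of trade-off involved, the sketch below compares a Haversine (spherical) distance with a flat-plane approximation for the example coordinates. These are textbook formulas with an assumed mean Earth radius of 6371 km, not the code Elasticsearch or Lucene actually runs; haversine_km and plane_km are hypothetical names.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance on a sphere, using the Haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def plane_km(lat1, lon1, lat2, lon2):
    """Flat-plane approximation: scale longitude by cos(latitude), then Pythagoras."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


center = (40.715, -73.988)
pala_pizza = (40.722, -73.989)
print(haversine_km(*center, *pala_pizza))
print(plane_km(*center, *pala_pizza))
```

Over distances of about a kilometre the two results are nearly identical; the flat approximation degrades over long distances and at high latitudes.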

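The only difference between the geo_distance and geo_distance_range filters is that the latter has a ring shape: it excludes documents that fall within the inner circle.

Instead of specifying a single distance from the center, specify a minimum distance (with gt or gte) and a maximum distance (with lt or lte), just as with the range filter. The following matches locations that are at least 1km and less than 2km from the center: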

{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance_range": {
          "gte":    "1km", 
          "lt":     "2km", 
          "location": {
            "lat":  40.715,
            "lon": -73.988
          }
        }
      }
    }
  }
}

// Search result
{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "attractions",
        "_type": "restaurant",
        "_id": "1",
        "_score": 1,
        "_source": {
          "name": "Chipotle Mexican Grill",
          "location": "40.715, -74.011"
        }
      }
    ]
  }
}

Sorting by distance

GET /attractions/restaurant/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "type":       "indexed",
          "location": {
            "top_left": {
              "lat":  40.8,
              "lon": -74.0
            },
            "bottom_right": {
              "lat":  40.4,
              "lon": -73.0
            }
          }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "location": { 
          "lat":  40.715,
          "lon": -73.998
        },
        "order":         "asc",
        "unit":          "km", 
        "distance_type": "plane" 
      }
    }
  ]
}

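unit writes the distance, in km, into the sort key of each returned result, and distance_type selects the fast but slightly less accurate plane calculation. Try changing "order": "asc" to desc to see the results in the reverse order.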

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": null,
    "hits": [
      {
        "_index": "attractions",
        "_type": "restaurant",
        "_id": "2",
        "_score": null,
        "_source": {
          "name": "Pala Pizza",
          "location": {
            "lat": 40.722,
            "lon": -73.989
          }
        },
        "sort": [
          1.2692283384165726
        ]
      },
      {
        "_index": "attractions",
        "_type": "restaurant",
        "_id": "3",
        "_score": null,
        "_source": {
          "name": "Mini Munchies Pizza",
          "location": [
            -73.983,
            40.719
          ]
        },
        "sort": [
          1.7281303268440118
        ]
      }
    ]
  }
}

Bulk importing data with bulk

vi bulktest.json

// Write the bulk operations into the file, for example:

{"index":{"_index":"zhouls","_type":"emp","_id":"10"}}
{ "name":"jack", "age" :18}
{"index":{"_index":"zhouls","_type":"emp","_id":"11"}}
{"name":"tom", "age":27}
{"update":{"_index":"zhouls","_type":"emp", "_id":"2"}}
{"doc":{"age" :22}}
{"delete":{"_index":"zhouls","_type":"emp","_id":"1"}}
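The bulk format is newline-delimited JSON: each action line is followed by its source or partial-document line where one is needed (delete has none), and every line, including the last, must end with a newline. The sketch below generates such a body; build_bulk_body and the (action, metadata, payload) tuple layout are assumptions made for this illustration.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, payload) tuples into a bulk request body."""
    lines = []
    for action, metadata, payload in operations:
        lines.append(json.dumps({action: metadata}))
        if payload is not None:  # delete actions carry no second line
            lines.append(json.dumps(payload))
    # The body must end with a trailing newline.
    return "\n".join(lines) + "\n"


operations = [
    ("index", {"_index": "zhouls", "_type": "emp", "_id": "10"}, {"name": "jack", "age": 18}),
    ("index", {"_index": "zhouls", "_type": "emp", "_id": "11"}, {"name": "tom", "age": 27}),
    ("update", {"_index": "zhouls", "_type": "emp", "_id": "2"}, {"doc": {"age": 22}}),
    ("delete", {"_index": "zhouls", "_type": "emp", "_id": "1"}, None),
]
body = build_bulk_body(operations)
print(body, end="")
```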


[root@xxxx elasticsearch-1.6.2]# curl -XPOST '127.0.0.1:9200/_bulk' --data-binary @bulktest.json;
{"took":18,"errors":true,"items":[{"index":{"_index":"zhouls","_type":"emp","_id":"10","_version":2,"status":200}},{"index":{"_index":"zhouls","_type":"emp","_id":"11","_version":2,"status":200}},{"update":{"_index":"zhouls","_type":"emp","_id":"2","status":404,"error":"DocumentMissingException[[zhouls][-1] [emp][2]: document missing]"}},{"delete":{"_index":"zhouls","_type":"emp","_id":"1","_version":1,"status":404,"found":false}}]}


[root@xxxx elasticsearch-1.6.2]# curl -XGET 'http://127.0.0.1:9200/zhouls/emp/1?pretty';
{
  "_index" : "zhouls",
  "_type" : "emp",
  "_id" : "1",
  "found" : false
}

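Paginated queries

Search results can be paginated with the from and size parameters. from is the offset of the first hit to return, and size is the number of documents to return. from defaults to 0 and size defaults to 10.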

POST megacorp/employee/_search

{
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "must_not": [],
      "should": []
    }
  },
  "from": 0,
  "size": 10,
  "sort": [],
  "aggs": {}
}
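Because from is a zero-based offset, a 1-based page number has to be converted before it is placed in the request. A minimal sketch follows; the function page_to_from_size is a hypothetical helper.

```python
def page_to_from_size(page, size=10):
    """Convert a 1-based page number into the from/size pair for a search body."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    return {"from": (page - 1) * size, "size": size}


search_body = {"query": {"match_all": {}}}
search_body.update(page_to_from_size(page=3, size=10))
print(search_body)
```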

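Some other commonly used queries:

prefix: prefix query

range: range query

term: exact-value query

missing: finds documents that do not contain a given field

exists: finds documents that contain a given field

regexp: regular-expression query

wildcard: query using standard shell-style wildcards

span_near: span query

query_string: query-string query

That concludes this introduction.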

References for installing plugins:

https://www.cnblogs.com/ljhdo/p/4887557.html

https://www.cnblogs.com/hts-technology/p/8477258.html
