Master analyzer configuration and testing
Master document management operations
Master routing rules
Understanding analyzers
Analyzer
In ES, an Analyzer is composed of the following three kinds of components (a combined sketch follows the list):
character filter: filters characters in the text, e.g. stripping HTML tag characters, then hands the result to the tokenizer. An analyzer may contain zero or more character filters; multiple filters are applied in their configured order.
tokenizer: splits the text into tokens. An analyzer must contain exactly one tokenizer.
token filter: filters the tokens produced by the tokenizer, e.g. lowercasing, stop-word removal, synonym handling. An analyzer may contain zero or more token filters, applied in their configured order.
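All three component types can also be combined ad hoc in a single _analyze test call. A minimal sketch (the component choices and sample text are illustrative, not from the original):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>The QUICK Fox</b>"
}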
How to test an analyzer
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Is this déja vu?"
}
Understand position and offset
{ "token": "The", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0 }, { "token": "quick", "start_offset": 4, "end_offset": 9, "type": "word", "position": 1 }
Built-in character filters
HTML Strip Character Filter
html_strip: strips HTML tags and decodes HTML entities such as &amp;.
Mapping Character Filter
mapping: replaces occurrences of specified strings in the text with given replacement strings.
Pattern Replace Character Filter
pattern_replace: performs regular-expression replacement.
HTML Strip Character Filter
Test:
POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": ["html_strip"],
  "text": "<p>I'm so <b>happy</b>!</p>"
}
Configure it in an index:
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}
Test:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I'm so <b>happy</b>!</p>"
}
escaped_tags specifies tags to leave untouched. If there are no exception tags to configure, no custom definition is needed here; just use html_strip directly in my_analyzer above.
Mapping character filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
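The reference page above has the details; as a quick hedged sketch (the index name my_mapping_index and the emoticon mappings are illustrative), a mapping char filter is configured like this:

PUT my_mapping_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_mappings"]
        }
      },
      "char_filter": {
        "my_mappings": {
          "type": "mapping",
          "mappings": [":) => _happy_", ":( => _sad_"]
        }
      }
    }
  }
}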
Pattern Replace Character Filter
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html
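Likewise, a hedged sketch of pattern_replace (the index name, pattern, and replacement are illustrative; this one joins dash-separated digit groups with underscores, e.g. 123-456-789 becomes 123_456_789):

PUT my_pattern_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["digits_to_underscore"]
        }
      },
      "char_filter": {
        "digits_to_underscore": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}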
Built-in tokenizers
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html
The integrated Chinese analyzer IKAnalyzer provides the tokenizers ik_smart and ik_max_word.
Testing a tokenizer
POST _analyze
{
  "tokenizer": "standard",
  "text": "張三說的確實在理"
}

POST _analyze
{
  "tokenizer": "ik_smart",
  "text": "張三說的確實在理"
}
Built-in token filters
ES has many built-in token filters; for details see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html
Lowercase Token Filter: lowercase — converts tokens to lowercase
Stop Token Filter: stop — stop-word filtering (see the sketch after this list)
Synonym Token Filter: synonym — synonym filtering
Note: the Chinese analyzer IKAnalyzer ships with its own stop-word filtering.
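A minimal hedged sketch of the stop filter named above (the index name my_stop_index and the stop-word list are illustrative):

PUT my_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "a", "an", "is"]
        }
      }
    }
  }
}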
Synonym Token Filter
PUT /test_index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_ik_synonym": {
            "tokenizer": "ik_smart",
            "filter": ["synonym"]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonym.txt"
          }
        }
      }
    }
  }
}

synonyms_path: specifies the synonym file (path relative to the config directory).
Synonym definition format
ES supports two synonym formats: Solr and WordNet.
Define the following synonyms in analysis/synonym.txt using the Solr format. The file must be UTF-8 encoded:
張三,李四
電飯煲,電飯鍋 => 電飯煲
電腦 => 計算機,computer
One group of synonyms per line; => means "normalize to".
Test:
POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "張三說的確實在理"
}

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "我想買個電飯鍋和一個電腦"
}
Use the results of these examples to understand how synonyms are processed.
Built-in analyzers
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
The integrated Chinese analyzer IKAnalyzer provides the analyzers ik_smart and ik_max_word.
Built-in and integrated analyzers can be used directly. If they do not meet our needs, we can define a custom analyzer by combining character filters, a tokenizer, and token filters ourselves.
Custom Analyzer
zero or more character filters
a tokenizer
zero or more token filters.
Configuration:
PUT my_index8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": ["html_strip"],
          "filter": ["synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}
Specifying an analyzer for a field
PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_ik_analyzer"
    }
  }
}

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "my_ik_analyzer",
      "search_analyzer": "other_analyzer"
    }
  }
}

PUT my_index8/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index8/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}
Defining a default analyzer for an index
PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": ["synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": { "type": "text" }
      }
    }
  }
}

PUT my_index10/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index10/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}
Order of analyzer selection
We can specify an analyzer per query, per field, or per index.
At index time, ES selects the analyzer in the following order:
First, the analyzer specified in the field's mapping definition.
If the field definition specifies no analyzer, the analyzer named default in the index settings.
If no default analyzer is defined in the index settings, the standard analyzer.
At search time, ES selects the analyzer in the following order (item 1 is illustrated after the list):
The analyzer defined in a full-text query.
The search_analyzer defined in the field mapping.
The analyzer defined in the field mapping.
An analyzer named default_search in the index settings.
An analyzer named default in the index settings.
The standard analyzer.
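For example, the first option — an analyzer set directly in a full-text query — might look like this hedged sketch (reusing my_index8 and ik_smart from the sections above):

GET my_index8/_search
{
  "query": {
    "match": {
      "title": {
        "query": "張三說的確實在理",
        "analyzer": "ik_smart"
      }
    }
  }
}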
Creating documents
Create or replace with a specified document id:
PUT twitter/_doc/1
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}

Create with an auto-generated document id:
POST twitter/_doc/
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}
{ "_index": "twitter", //所屬索引 "_type": "_doc", //所屬mapping type "_id": "p-D3ymMBl4RK_V6aWu_V", //文檔id "_version": 1, //文檔版本 "result": "created", "_shards": { //分片的寫入狀況 "total": 3, //所在分片有三個副本 "successful": 1, //1個副本上成功寫入 "failed": 0 //失敗副本數 }, "_seq_no": 0, //第幾回操做該文檔 "_primary_term": 3 //詞項數 }
Getting a single document
HEAD twitter/_doc/11            (check whether the document exists)
GET twitter/_doc/1              (get the document)
GET twitter/_doc/1?_source=false    (get without the _source)
GET twitter/_doc/1/_source      (get only the _source)
{ "_index": "twitter", "_type": "_doc", "_id": "1", "_version": 2, "found": true, "_source": { "id": 1, "user": "kimchy", "post_date": "2009-11-15T14:12:12", "message": "trying out Elasticsearch" }}
Retrieving stored fields:
PUT twitter11
{
  "mappings": {
    "_doc": {
      "properties": {
        "counter": {
          "type": "integer",
          "store": false
        },
        "tags": {
          "type": "keyword",
          "store": true
        }
      }
    }
  }
}

PUT twitter11/_doc/1
{
  "counter": 1,
  "tags": ["red"]
}

GET twitter11/_doc/1?stored_fields=tags,counter
Getting multiple documents: _mget
GET /_mget
{
  "docs": [
    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "1"
    },
    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "2",
      "stored_fields": ["field3", "field4"]
    }
  ]
}
GET /twitter/_mget
{
  "docs": [
    { "_type": "_doc", "_id": "1" },
    { "_type": "_doc", "_id": "2" }
  ]
}

GET /twitter/_doc/_mget
{
  "docs": [
    { "_id": "1" },
    { "_id": "2" }
  ]
}

GET /twitter/_doc/_mget
{
  "ids": ["1", "2"]
}
The request parameters _source and stored_fields can be used either on the URL or inside the request JSON body.
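A hedged sketch of per-document _source filtering inside the body (the field names are illustrative):

GET /twitter/_doc/_mget
{
  "docs": [
    { "_id": "1", "_source": false },
    { "_id": "2", "_source": ["user", "message"] }
  ]
}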
Deleting documents
DELETE twitter/_doc/1             (delete by document id)
DELETE twitter/_doc/1?version=1   (delete with version-based concurrency control)
{ "_shards" : { "total" : 2, "failed" : 0, "successful" : 2 }, "_index" : "twitter", "_type" : "_doc", "_id" : "1", "_version" : 2, "_primary_term": 1, "_seq_no": 5, "result": "deleted" }
Delete by query
POST twitter/_delete_by_query
{
  "query": {
    "match": {
      "message": "some message"
    }
  }
}
POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}

When documents hit version conflicts, do not abort the delete (record the conflicting documents and continue deleting the other documents that match the query).
Inspecting delete-by-query tasks via the task API
GET _tasks?detailed=true&actions=*/delete/byquery
GET /_tasks/taskId:1              (check the status of a specific task)
POST _tasks/task_id:1/_cancel     (cancel a task)
{ "nodes" : { "r1A2WoRbTwKZ516z6NEs5A" : { "name" : "r1A2WoR", "transport_address" : "127.0.0.1:9300", "host" : "127.0.0.1", "ip" : "127.0.0.1:9300", "attributes" : { "testattr" : "test", "portsfile" : "true" }, "tasks" : { "r1A2WoRbTwKZ516z6NEs5A:36619" : { "node" : "r1A2WoRbTwKZ516z6NEs5A", "id" : 36619, "type" : "transport", "action" : "indices:data/write/delete/byquery", "status" : { "total" : 6154, "updated" : 0, "created" : 0, "deleted" : 3500, "batches" : 36, "version_conflicts" : 0, "noops" : 0, "retries": 0, "throttled_millis": 0 }, "description" : "" } } } }}
Updating documents
Replace by document id:
PUT twitter/_doc/1
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}

Optimistic-locking concurrency control via version:
PUT twitter/_doc/1?version=1
{
  "id": 1,
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}
{ "_index": "twitter", "_type": "_doc", "_id": "1", "_version": 3, "result": "updated", "_shards": { "total": 3, "successful": 1, "failed": 0 }, "_seq_no": 2, "_primary_term": 3 }
Scripted updates: updating documents with a script
1. Prepare a document:
PUT uptest/_doc/1
{
  "counter": 1,
  "tags": ["red"]
}

2. Add 4 to document 1's counter:
POST uptest/_doc/1/_update
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "lang": "painless",
    "params": {
      "count": 4
    }
  }
}

3. Append an element to the array:
POST uptest/_doc/1/_update
{
  "script": {
    "source": "ctx._source.tags.add(params.tag)",
    "lang": "painless",
    "params": {
      "tag": "blue"
    }
  }
}

Script notes: Painless is a scripting language built into ES. ctx is the execution-context object (through which you can also access _index, _type, _id, _version, _routing and _now (the current timestamp)); params is the parameter map.
Note: scripted updates require the index's _source field to be enabled. An update executes as follows:
1. Fetch the original document.
2. Run the script against the original data from the _source field.
3. Delete the original indexed document.
4. Index the modified document.
This merely saves some network round trips and reduces the chance of a version conflict between the get and the index operations.
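Because of this get-then-reindex flow, concurrent writers can still conflict; the retry_on_conflict URL parameter makes the update retry automatically. A hedged sketch (the retry count of 3 is illustrative, not from the original notes):

POST uptest/_doc/1/_update?retry_on_conflict=3
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "lang": "painless",
    "params": { "count": 1 }
  }
}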
4. Add a field:
POST uptest/_doc/1/_update
{
  "script": "ctx._source.new_field = 'value_of_new_field'"
}

5. Remove a field:
POST uptest/_doc/1/_update
{
  "script": "ctx._source.remove('new_field')"
}

6. Conditionally delete, or do nothing:
POST uptest/_doc/1/_update
{
  "script": {
    "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
    "lang": "painless",
    "params": {
      "tag": "green"
    }
  }
}

7. Merge the given document fields into the existing document:
POST uptest/_doc/1/_update
{
  "doc": {
    "name": "new_name"
  }
}

8. Run step 7 again; the content is identical, so nothing needs to be done:
{
  "_index": "uptest",
  "_type": "_doc",
  "_id": "1",
  "_version": 4,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  }
}

9. Disable noop detection:
POST uptest/_doc/1/_update
{
  "doc": {
    "name": "new_name"
  },
  "detect_noop": false
}

10. upsert: if the target document exists, run the script to update it; if not, index the content of upsert as a new document:
POST uptest/_doc/1/_update
{
  "script": {
    "source": "ctx._source.counter += params.count",
    "lang": "painless",
    "params": {
      "count": 4
    }
  },
  "upsert": {
    "counter": 1
  }
}
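A related option for partial-document updates is doc_as_upsert, which uses the doc content itself as the new document when the target does not exist; a hedged sketch (document id 2 is illustrative):

POST uptest/_doc/2/_update
{
  "doc": { "name": "new_name" },
  "doc_as_upsert": true
}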
Update by query
Update the documents selected by a query:
POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
Bulk operations
The bulk API /_bulk lets us perform multiple index and delete operations in a single call, which can greatly speed up indexing. The bulk request body must be given in the following newline-delimited JSON format:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
action_and_meta_data: the action can be index, create, delete, or update; the metadata consists of _index, _type, and _id.
The request endpoint can be /_bulk, /{index}/_bulk, or /{index}/{type}/_bulk.
Bulk-indexing multiple documents with curl and a JSON file
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"
Reindex
The reindex API /_reindex lets us reindex (copy) the data of one index into another index. It requires the source index's _source to be enabled. The target index's settings and mappings are independent of the source index (reindex does not copy them).
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}
One issue to consider when reindexing: if the target index already contains documents from the source index, how should their versions be handled?
1. If version_type is not specified, or is set to internal, the target index's own versions are used; the reindex simply performs create and update operations.
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}
2. To use the source index's versions for version-controlled updates, set version_type to external. The reindex will then write documents that do not exist in the target and update documents whose target version is older.
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}
If you only want to copy documents that do not yet exist in the target index, set op_type to create. Existing documents will then trigger version conflicts (which abort the operation); set "conflicts": "proceed" to skip them and continue.
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}
You can also reindex just a subset of the source index, selecting the data you need via type or a query:
POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
You can pull data from multiple sources:
POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}
POST _reindex
{
  "size": 10000,    // limit the number of documents
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]    // choose which fields of the source documents to copy
  },
  "dest": {
    "index": "new_twitter"
  }
}

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {    // a script can transform the documents
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": { "company": "cat" }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"    // a routing value can be specified
  }
}

POST _reindex
{
  "source": {
    "remote": {    // copy from a remote source
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": { "test": "data" }
    }
  },
  "dest": {
    "index": "dest"
  }
}
Checking execution status via the task API
GET _tasks?detailed=true&actions=*reindex
?refresh
For index, update, and delete operations, add the refresh parameter if you want a refresh right after the operation so the change is immediately visible.
PUT /test/_doc/1?refresh
{"test": "test"}

PUT /test/_doc/2?refresh=true
{"test": "test"}
refresh parameter values
No value or =true: refresh immediately, so the change is visible to reads right away.
=false: equivalent to omitting the refresh parameter; rely on the periodic internal refresh.
=wait_for: register and wait for the next refresh; when the number of registered requests reaches the index.max_refresh_listeners setting (defaults to 1000), a refresh is triggered.
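A hedged sketch of the wait_for variant (index and payload as in the examples above); the request does not return until the document is visible to search:

PUT /test/_doc/3?refresh=wait_for
{"test": "test"}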
Cluster composition
Index creation flow
Node failures
Indexing documents
How documents are routed
Which shard should a document be stored on?
Deciding which shard a document is stored on is called document routing. ES determines each document's shard with the following calculation:
shard = hash(routing) % number_of_primary_shards
routing is the value used for the hash calculation; by default it is the document id. We can specify a different routing value via the routing parameter when indexing a document:
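As a worked illustration with made-up numbers: if hash(routing) came out as 2156394 and number_of_primary_shards is 5, then 2156394 % 5 = 4, so the document would go to primary shard 4. (The real hash is computed internally; these numbers are purely illustrative.)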
POST twitter/_doc?routing=kimchy
{
  "user": "kimchy",
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elasticsearch"
}
The routing parameter (which may have multiple values in search) can be used in index, delete, update, and search operations to target specific shards.
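For instance, a hedged sketch of a search restricted to the shard(s) for a given routing value (reusing the kimchy routing from above):

GET twitter/_search?routing=kimchy
{
  "query": {
    "match": { "user": "kimchy" }
  }
}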
Force a routing value to be required:
PUT my_index2
{
  "mappings": {
    "_doc": {
      "_routing": {
        "required": true
      }
    }
  }
}
Question: relational databases have partitioned tables; by choosing a partition, you reduce the amount of data an operation touches and improve efficiency. Can we do the same with an ES index?
Yes: by specifying routing values, one shard can hold one partition's data. For example, to partition data by department, use the department as the routing value.
Search
Search steps (say we search index s1):
1. node2 parses the query.
2. node2 forwards the query to the nodes holding index s1's shards/replicas (R1, R2, R0).
3. Each node executes the query and sends its results back to node2.
4. node2 merges the results and returns the response.