Search Engines: Elasticsearch Index Management (Part 2)

After completing this lesson, you should be able to:

Configure and test analyzers
Perform document management operations
Understand routing rules

Analyzers

Getting to know analyzers

Analyzer

In ES, an analyzer is built from the following three kinds of components:

character filter: filters/transforms the text at the character level before tokenization, e.g. stripping HTML tags. The processed text is then handed to the tokenizer. An analyzer may contain zero or more character filters; when there are several, they run in the configured order.
tokenizer: splits the text into tokens. An analyzer must contain exactly one tokenizer.
token filter: post-processes the tokens produced by the tokenizer, e.g. lowercasing, stop-word removal, synonym expansion. An analyzer may contain zero or more token filters, applied in the configured order.
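
The _analyze API can combine all three component types ad hoc, which makes it easy to see them working together. A minimal sketch (the sample text is made up): the HTML is stripped, the standard tokenizer splits the text, and the tokens are lowercased.

POST _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "<p>The QUICK Brown Fox</p>"
}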

How to test an analyzer

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}

Understanding position and offset

position is the token's ordinal position in the token stream; start_offset and end_offset are the character offsets of the token in the original text. For example, part of the _analyze output looks like this:

{
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "quick",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    }

Built-in character filters

    HTML Strip Character Filter
        html_strip: strips HTML tags and decodes HTML entities such as &amp;.
    Mapping Character Filter
        mapping: replaces occurrences of given strings in the text with specified replacements.
    Pattern Replace Character Filter
        pattern_replace: performs regular-expression replacement.

HTML Strip Character Filter

Test:

POST _analyze
{
  "tokenizer":      "keyword", 
  "char_filter":  [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

Configuring it in an index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

Test:

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

escaped_tags specifies tags that should be left intact. If there are no tags to exempt, there is no need to define a customized char filter here; just reference the built-in html_strip directly in my_analyzer above.
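
For reference, a minimal sketch of that simpler form, assuming a hypothetical index name my_index_html (no custom char filter is defined because no tags need to be escaped):

PUT my_index_html
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["html_strip"]
        }
      }
    }
  }
}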

Mapping character filter

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-mapping-charfilter.html
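
As a quick illustration (the mapping rules below are made-up examples; see the linked page for the full options), a mapping char filter can be defined inline in an _analyze request:

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": [
        ":) => happy",
        ":( => sad"
      ]
    }
  ],
  "text": "good morning :)"
}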

Pattern Replace Character Filter

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html
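
A minimal sketch along the lines of the linked page (the index name pattern_test_index and the filter name are made up): the pattern_replace char filter below rewrites hyphens between digits into underscores before tokenization.

PUT pattern_test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["digit_hyphen_filter"]
        }
      },
      "char_filter": {
        "digit_hyphen_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}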

Built-in tokenizers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html

The bundled Chinese analysis plugin IK Analyzer provides two tokenizers: ik_smart and ik_max_word.

Testing tokenizers

POST _analyze
{
  "tokenizer":      "standard", 
  "text": "張三說的確實在理"
}

POST _analyze
{
  "tokenizer":      "ik_smart", 
  "text": "張三說的確實在理"
}

Built-in token filters

ES ships with many built-in token filters; see: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenfilters.html

   Lowercase Token Filter: lowercase — converts tokens to lowercase
   Stop Token Filter: stop — stop-word filter
   Synonym Token Filter: synonym — synonym filter
Note: the IK Analyzer Chinese tokenizer comes with its own stop-word filtering.
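
A quick _analyze test combining two of these filters (the sample text is made up); lowercase runs first so that "The" becomes "the" and is then removed by the stop filter:

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop" ],
  "text": "The Quick Brown Foxes"
}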

Synonym Token Filter

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "my_ik_synonym" : {
                        "tokenizer" : "ik_smart",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}
synonyms_path: path to the synonym file (relative to the config directory)

Synonym definition format

ES supports two synonym formats: Solr and WordNet.

Define the following synonyms in analysis/synonym.txt using the Solr format. The file must be UTF-8 encoded.

張三,李四
電飯煲,電飯鍋 => 電飯煲   
電腦 => 計算機,computer

Each line defines one group of synonyms; => means "normalize to".

Test:

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "張三說的確實在理"
}

POST test_index/_analyze
{
  "analyzer": "my_ik_synonym",
  "text": "我想買個電飯鍋和一個電腦"
}

Use the results of these examples to understand how synonyms are handled.

Built-in analyzers

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

The bundled IK Analyzer plugin provides two analyzers: ik_smart and ik_max_word.

Built-in and plugin-provided analyzers can be used directly. If they do not meet our needs, we can combine character filters, a tokenizer, and token filters to define a custom analyzer.

Custom Analyzers

    zero or more character filters
    a tokenizer
    zero or more token filters.

Configuration:

PUT my_index8
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ik_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
             "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  }
}

Specifying an analyzer for a field

PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer"
    }
  }
}
PUT my_index8/_mapping/_doc
{
  "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_ik_analyzer",
        "search_analyzer": "other_analyzer" 
    }
  }
}
PUT my_index8/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index8/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}

Defining a default analyzer for an index

PUT /my_index10
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "ik_smart",
          "filter": [
            "synonym"
          ]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms_path": "analysis/synonym.txt"
        }
      }
    }
  },
"mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  }
}
PUT my_index10/_doc/1
{
  "title": "張三說的確實在理"
}

GET /my_index10/_search
{
  "query": {
    "term": {
      "title": "張三"
    }
  }
}

How ES chooses which analyzer to use

We can specify an analyzer per query, per field, and per index.

At index time, ES selects the analyzer in the following order:

   The analyzer specified in the field's mapping definition.
   If the field mapping does not specify one, the analyzer named default in the index settings.
   If no default analyzer is defined in the index settings, the standard analyzer.

At search time, ES selects the analyzer in the following order:

    The analyzer defined in a full-text query.
    The search_analyzer defined in the field mapping.
    The analyzer defined in the field mapping.
    An analyzer named default_search in the index settings.
    An analyzer named default in the index settings.
    The standard analyzer.
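
For example, the first rule (an analyzer defined directly in a full-text query) can be exercised as in the sketch below, reusing the my_index8 index and my_ik_analyzer defined earlier:

GET my_index8/_search
{
  "query": {
    "match": {
      "title": {
        "query": "張三",
        "analyzer": "my_ik_analyzer"
      }
    }
  }
}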

Document Management

Creating documents

PUT twitter/_doc/1            create or update with an explicit document id
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
POST twitter/_doc/            create with an auto-generated document id
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
{
  "_index": "twitter",              //index the document belongs to
  "_type": "_doc",                  //mapping type it belongs to
  "_id": "p-D3ymMBl4RK_V6aWu_V",    //document id
  "_version": 1,                    //document version
  "result": "created",
  "_shards": {                      //how the write went on the shard copies
    "total": 3,                     //the shard has 3 copies in total
    "successful": 1,                //written successfully on 1 copy
    "failed": 0                     //number of copies that failed
  },
  "_seq_no": 0,                     //sequence number of this operation on the shard
  "_primary_term": 3                //primary term (incremented each time the primary shard changes)
}

Getting a single document

HEAD twitter/_doc/11      check whether the document exists
GET twitter/_doc/1
GET twitter/_doc/1?_source=false
GET twitter/_doc/1/_source
{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 2,
  "found": true,
  "_source": {
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elasticsearch"
  }
}
PUT twitter11
{                                   //define which fields are stored
   "mappings": {
      "_doc": {
         "properties": {
            "counter": {
               "type": "integer",
               "store": false
            },
            "tags": {
               "type": "keyword",
               "store": true
            }
         }
      }
   }
}

PUT twitter11/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

GET twitter11/_doc/1?stored_fields=tags,counter

Getting multiple documents with _mget

GET /_mget
{
    "docs" : [
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_index" : "twitter",
            "_type" : "_doc",
            "_id" : "2"
            "stored_fields" : ["field3", "field4"]
        }
    ]
}
GET /twitter/_mget
{
    "docs" : [
        {
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}
GET /twitter/_doc/_mget
{
    "docs" : [
        {
            "_id" : "1"
        },
        {
            "_id" : "2"
        }
    ]
}
GET /twitter/_doc/_mget
{
    "ids" : ["1", "2"]
}

The request parameters _source and stored_fields can be passed on the URL or inside the request JSON body.
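
For instance, a minimal sketch of per-document _source filtering inside the _mget body:

GET /twitter/_doc/_mget
{
    "docs" : [
        {
            "_id" : "1",
            "_source" : ["user", "message"]
        },
        {
            "_id" : "2",
            "_source" : false
        }
    ]
}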

Deleting documents

DELETE twitter/_doc/1    delete by document id

DELETE twitter/_doc/1?version=1    delete with version-based concurrency control

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "1",
    "_version" : 2,
    "_primary_term": 1,
    "_seq_no": 5,
    "result": "deleted"
}

Delete by query

POST twitter/_delete_by_query
{
  "query": { 
    "match": {
      "message": "some message"
    }
  }
}
POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}
conflicts=proceed: when a document has a version conflict, do not abort the delete; record the conflicting document and continue deleting the other documents that match the query.

Inspecting delete-by-query tasks with the task API

GET _tasks?detailed=true&actions=*/delete/byquery
GET /_tasks/taskId:1    get the status of a specific task
POST _tasks/task_id:1/_cancel    cancel a task
{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/delete/byquery",
          "status" : {    
            "total" : 6154,
            "updated" : 0,
            "created" : 0,
            "deleted" : 3500,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": 0,
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}

Updating documents

PUT twitter/_doc/1      update by document id
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
PUT twitter/_doc/1?version=1     optimistic concurrency control for updates
{
    "id": 1,
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}
{
  "_index": "twitter",
  "_type": "_doc",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 3,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 3
}

Scripted update: updating documents with a script

PUT uptest/_doc/1       1. Prepare a document
{
    "counter" : 1,
    "tags" : ["red"]
}
POST uptest/_doc/1/_update     2. Add 4 to document 1's counter
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}
POST uptest/_doc/1/_update     3. Add an element to the tags array
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

Script notes: Painless is a scripting language built into ES. ctx is the execution context object (through it you can also access _index, _type, _id, _version, _routing and _now (the current timestamp)); params is the map of parameters passed to the script.

Scripted update: updating documents with a script (continued)

Note: scripted updates require the index's _source field to be enabled. An update runs as follows:
1. Fetch the original document.
2. Run the script against the original data from the _source field.
3. Delete the old version of the document from the index.
4. Index the modified document.
The update API merely saves some network round trips and reduces the window for version conflicts between the get and the index steps.

POST uptest/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}
4. Add a field
POST uptest/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}
5. Remove a field
POST uptest/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}
6. Conditionally delete the document, or do nothing
POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}
7. Merge the passed-in document fields into the existing document
{
  "_index": "uptest",
  "_type": "_doc",
  "_id": "1",
  "_version": 4,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  }
}
8. Running step 7 again with identical content results in a no-op
POST uptest/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}
9. Disable noop detection
POST uptest/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}
10. Upsert: if the document to be updated exists, run the script to update it; if it does not exist, index the content of upsert as a new document.
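
A closely related option, shown here as a hedged sketch, is doc_as_upsert: the partial doc from step 7 is indexed as-is as a new document when the target does not exist.

POST uptest/_doc/2/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "doc_as_upsert" : true
}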

Update by query

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}
Updates all documents that match the query.
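
As with delete-by-query, conflicts=proceed makes version conflicts get recorded and skipped instead of aborting the whole operation; the script is optional, and without one the matching documents are simply re-indexed as they are. A minimal sketch:

POST twitter/_update_by_query?conflicts=proceed
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}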

Bulk operations

The bulk API /_bulk lets us perform multiple index and delete operations in a single call, which can greatly speed up indexing. The request body must be given as newline-delimited JSON in the following structure:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n
POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

action_and_meta_data: the action can be index, create, delete, or update; the metadata consists of _index, _type, and _id.

The request endpoint can be /_bulk, /{index}/_bulk, or /{index}/{type}/_bulk.

Bulk-indexing multiple documents with curl and a JSON file

curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_doc/_bulk?pretty&refresh" --data-binary "@accounts.json"
curl "localhost:9200/_cat/indices?v"

Reindex

The reindex API /_reindex lets us reindex (copy) data from one index into another. It requires the source index's _source field to be enabled. The destination index's settings and mappings are independent of the source index and are not copied over.

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

One question to consider when reindexing: if documents from the source already exist in the destination index, how should their versions be handled?
1. If version_type is not specified, or is set to internal, the destination index's own versioning is used; during the reindex, documents are simply created or overwritten.

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}

2. To use the source index's versions for optimistic version control, set version_type to external. The reindex will then create documents that do not exist in the destination and update documents whose destination version is older than the source version.

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}

If you only want to copy documents that do not already exist in the destination index, set op_type to create. Documents that already exist will then cause version conflicts (which abort the operation by default); set "conflicts": "proceed" to skip them and continue.

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

You can also reindex only part of the source index, selecting the data you need by type or by query.

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

Data can be pulled from multiple sources:

POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": {
    "index": "all_together"
  }
}
POST _reindex
{
  "size": 10000,         能夠限定文檔數量
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{                       choose which fields of the source documents to copy
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {               //能夠用script來改變文檔
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}
POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {               //能夠指定路由值
    "index": "dest",
    "routing": "=cat"
  }
}

 

POST _reindex
{
  "source": {            //從遠程源複製
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

Checking execution status via the task API

GET _tasks?detailed=true&actions=*reindex

?refresh

For index, update, and delete operations, add the refresh parameter if you want the change to be visible to search immediately after the operation completes.

PUT /test/_doc/1?refresh
{"test": "test"}
PUT /test/_doc/2?refresh=true
{"test": "test"}

refresh parameter values

No value, or =true: refresh the affected shards immediately so the change is visible to search right away.
=false: same as not passing refresh; visibility follows the normal periodic refresh.
=wait_for: wait for the change to become visible via a refresh before returning; when the number of waiting requests reaches index.max_refresh_listeners (defaults to 1000), a refresh is forced.
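
For example, a write that blocks until a refresh has made it visible (document id 3 is made up):

PUT /test/_doc/3?refresh=wait_for
{"test": "test"}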

 

Routing in Detail

Cluster composition

 

Index creation flow

 

Node failure

 

Indexing documents

How documents are routed

Which shard should a document be stored on?
Deciding which shard a document is stored on is called document routing. ES computes the target shard for each document as follows:

shard = hash(routing) % number_of_primary_shards

routing is the value that gets hashed; by default it is the document id. We can supply a different routing value via the routing parameter when indexing a document:

POST twitter/_doc?routing=kimchy
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

The routing parameter (it may take multiple values for searches) can be used in index, delete, update, and search operations to target specific shards.

PUT my_index2
{
  "mappings": {
    "_doc": {
      "_routing": {
        "required": true 
      }
    }
  }
}
Making a routing value mandatory

Think about it: relational databases have partitioned tables; by selecting a partition, an operation touches less data and runs faster. Can we do the same with an ES index?

Yes: by choosing routing values, one shard can hold one partition of the data. For example, to partition data by department, use the department as the routing value.
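
A minimal sketch of that idea, reusing my_index2 from above (the department value and field names are made up): index each document with its department as the routing value, then pass the same routing value when searching so that only the relevant shard is queried.

PUT my_index2/_doc/1?routing=sales
{
  "title": "Q3 report",
  "dept": "sales"
}

GET my_index2/_search?routing=sales
{
  "query": {
    "term": { "dept": "sales" }
  }
}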

Search

Search flow, using index s1 as an example: 1. node2 parses the query. 2. node2 forwards the query to the nodes holding index s1's shards/replicas (R1, R2, R0). 3. Each node executes the query and sends its results back to node2. 4. node2 merges the results and returns the response.
