ES Series 4: Commonly Used ES 6.3 APIs (Document APIs)

Date: 2019-11-07

Tags: series, es6.3, commonly used, api, documents

1.Index API: create and index a document

PUT twitter/tweet/1
{
     "user" : "kimchy",
     "post_date" : "2009-11-15T14:12:12",
     "message" : "trying out Elasticsearch"
}

Official documentation reference: Index API.
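
As a rough illustration, the same request can be prepared with Python's standard library. This is only a sketch: the helper name `build_index_request` and the base URL `http://localhost:9200` are assumptions, and the request is built but not sent.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build (but do not send) a PUT request that indexes one document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical local node; adjust the base URL to your cluster.
request = build_index_request(
    "http://localhost:9200", "twitter", "tweet", 1,
    {
        "user": "kimchy",
        "post_date": "2009-11-15T14:12:12",
        "message": "trying out Elasticsearch",
    },
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the URL and body easy to inspect before anything touches the cluster.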

2.Get API: retrieve a document

curl -XGET 'http://localhost:9200/twitter/tweet/1'

Official documentation reference: Get API.

3.DELETE API: delete a document

$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1'

Official documentation reference: Delete API.

4.UPDATE API: update a document

PUT test/type1/1
{
    "counter" : 1,
    "tags" : ["red"]
}

Official documentation reference: Update API.

5.Multi Get API: fetch several documents in one request

GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "type",
            "_id" : "2"
        }
    ]
}

Official documentation reference: Multi Get API.
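
When the ids come from application code, the `docs` array can be generated rather than typed by hand. A minimal sketch, assuming all documents live in one index and type; `build_mget_body` is a hypothetical helper name, not part of any client library.

```python
import json


def build_mget_body(index, doc_type, ids):
    """Return the _mget request body for several ids in one index/type."""
    return {
        "docs": [
            {"_index": index, "_type": doc_type, "_id": str(doc_id)}
            for doc_id in ids
        ]
    }


mget_body = build_mget_body("test", "type", [1, 2])
print(json.dumps(mget_body, indent=2))
```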

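6.Bulk API: batch operations (index, create, update, delete)

1.Bulk operations from a local file

$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/blog/user/_bulk --data-binary @requests

The requests file contains the following (the last line must end with a newline):

{"index":{"_id":"25"}}
{"name":"黎明","id":25}
{"index":{"_id":"26"}}
{"name":"小明","id":26}
{"index":{"_id":"27"}}
{"name":"雄安","id":27}
{"index":{"_id":"28"}}
{"name":"笑話","id":28}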
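
The file format above (one action line followed by one source line, newline-terminated) is easy to get wrong by hand. The following sketch generates such a payload; the helper name `build_bulk_payload` and the sample records are assumptions for illustration.

```python
import json


def build_bulk_payload(records, id_field="id"):
    """Serialize records as bulk NDJSON: one action line, then one source line."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_id": str(record[id_field])}}))
        lines.append(json.dumps(record))
    # The bulk endpoint requires the payload to end with a newline.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload([
    {"name": "cddd", "id": 17},
    {"name": "cddd", "id": 18},
])
print(payload, end="")
```

Writing `payload` to a file gives something that can be passed to curl with `--data-binary @file`, as in the command above.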

2.REST request with an inline body

curl -H "Content-Type: application/json" -XPOST 'http://47.52.199.51:9200/book/english/_bulk' -d'
{"index":{"_id":"17"}}
{"name":"cddd","id":17}
{"index":{"_id":"18"}}
{"name":"cddd","id":18}
{"index":{"_id":"19"}}
{"name":"cddd","id":19}
{"index":{"_id":"20"}}
{"name":"cddd","id":20}
'

Official documentation reference: Bulk API.

7.DELETE By Query API: delete the documents matching a query

POST /book/_delete_by_query
{
    "query":{
        "match":{
            "name": "yangxioa"
        }
    }
}

7.1.Delete all documents

POST /book/_delete_by_query
{
    "query":{
        "match_all":{}
    }
}

7.2.Routing support (routing=XXX restricts the operation to the shards matching that routing value)

POST twitter/_delete_by_query?routing=1
{
  "query": {
    "range" : {
        "age" : {
           "gte" : 10
        }
    }
  }
}

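Response:

{
  "took" : 147, // milliseconds from start to end of the whole operation
  "timed_out": false, // set to true if any request executed during the delete by query timed out
  "total": 119, // number of documents that were successfully processed
  "deleted": 119, // number of documents that were successfully deleted
  "batches": 1,  // number of scroll responses pulled back by the delete by query
  "version_conflicts": 0, // number of version conflicts the delete by query hit
  "noops": 0, // always zero for delete by query; it exists only so that delete by query, update by query, and the reindex API return responses with the same structure
  "retries": { // retries attempted by the delete by query: bulk is the number of bulk actions retried, search is the number of search actions retried
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0, // milliseconds the request slept to conform to requests_per_second
  "requests_per_second": -1.0, // number of requests per second effectively executed during the delete by query
  "throttled_until_millis": 0, // always zero in a delete by query response; it only has meaning when using the Task API, where it indicates the next time (in milliseconds since the epoch) a throttled request will be executed again to conform to requests_per_second
  "failures" : [ ]
  // Array of failures if there were any unrecoverable errors during the process. If it is non-empty, the request aborted because of those failures.
  // Delete by query is implemented using batches, and any failure causes the entire process to abort, but all failures in the current batch are collected into the array. You can use the conflicts option to prevent the operation from aborting on version conflicts.
}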
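
A caller usually only needs a few of these fields. The sketch below condenses a response into a short status dict; `summarize_by_query` is a hypothetical helper, and the sample values are copied from the response above.

```python
def summarize_by_query(response):
    """Condense a delete-by-query response into the fields callers check most."""
    retries = response["retries"]
    return {
        # Treat the run as clean only if nothing timed out and nothing failed.
        "ok": not response["timed_out"] and not response["failures"],
        "deleted": response["deleted"],
        "version_conflicts": response["version_conflicts"],
        "retries": retries["bulk"] + retries["search"],
    }


sample_response = {
    "took": 147, "timed_out": False, "total": 119, "deleted": 119,
    "batches": 1, "version_conflicts": 0, "noops": 0,
    "retries": {"bulk": 0, "search": 0},
    "throttled_millis": 0, "requests_per_second": -1.0,
    "throttled_until_millis": 0, "failures": [],
}
summary = summarize_by_query(sample_response)
print(summary)
```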

Official documentation reference: Delete By Query API.

8.Update API

8.1.Scripted update

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless", // script language
        "params" : {
            "count" : 4
        }
    }
}

8.2.Add a field

POST test/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}

8.3.Remove a field

POST test/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}

8.4.Conditional operation: delete the document if tags contains the given tag, otherwise do nothing

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}

8.5.Partial document update

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}

8.6.upsert: update the document if it exists, insert it otherwise

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}
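
The upsert behaviour can be pictured with a plain dictionary standing in for the index. This is a local simulation under that assumption, not a call to Elasticsearch; `scripted_upsert` is a hypothetical name.

```python
def scripted_upsert(store, doc_id, count, upsert):
    """Mimic the request above against a dict acting as the index."""
    if doc_id not in store:
        # Document missing: the upsert body is inserted as-is, the script is not run.
        store[doc_id] = dict(upsert)
    else:
        # Document present: the script runs against the existing source.
        store[doc_id]["counter"] += count
    return dict(store[doc_id])


index = {}
first = scripted_upsert(index, "1", count=4, upsert={"counter": 1})
second = scripted_upsert(index, "1", count=4, upsert={"counter": 1})
print(first, second)
```

The first call inserts `{"counter": 1}`; only the second call applies the increment.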

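Official documentation reference: Update API (scripted updates).

9.UPDATE BY QUERY API: update the documents matching a query

9.1.Update in place (reindex every document)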

POST twitter/_update_by_query?conflicts=proceed

{
  "took" : 147,
  "timed_out": false,
  "updated": 120,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 120,
  "failures" : [ ]
}

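Elasticsearch applies optimistic concurrency control internally: it first reads the version of each document to be updated, then checks at update time that the version still matches.
All update and query failures cause _update_by_query to abort and are returned in the failures element of the response. Updates that have already been performed remain in place. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures returned by the failing bulk request are reported in the failures element, so there may be quite a few failed entities.

If you only want to count version conflicts rather than have them abort _update_by_query, set conflicts=proceed in the URL or "conflicts": "proceed" in the request body. With this setting the operation continues after a conflict, and version_conflicts reports the number of conflicts.

9.2.Update documents matching a query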
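
The difference between aborting and proceeding on conflicts can be sketched with a toy model. This is a deliberate simplification (one document at a time rather than scroll batches), and every name in it is hypothetical.

```python
def update_by_query(docs, seen_versions, conflicts="abort"):
    """Simulate version checks.

    docs maps id -> current version; seen_versions maps id -> the version
    recorded when the query ran.
    """
    result = {"updated": 0, "version_conflicts": 0, "aborted": False}
    for doc_id, seen in seen_versions.items():
        if docs[doc_id] != seen:
            result["version_conflicts"] += 1
            if conflicts != "proceed":
                result["aborted"] = True
                break  # earlier updates are kept; nothing is rolled back
            continue
        docs[doc_id] += 1  # a successful update bumps the version
        result["updated"] += 1
    return result


seen = {"a": 1, "b": 2, "c": 1}
# Document "b" was modified (version 3) after the query recorded version 2.
aborted = update_by_query({"a": 1, "b": 3, "c": 1}, seen)
proceeded = update_by_query({"a": 1, "b": 3, "c": 1}, seen, conflicts="proceed")
print(aborted)
print(proceeded)
```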

POST twitter/_update_by_query?conflicts=proceed
{
  "query": { 
    "term": {
      "user": "kimchy"
    }
  }
}

9.3.Update matching documents with a script

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}

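You can also do all of this on multiple indices and multiple types at once, just like the search API:

POST twitter,blog/_doc,post/_update_by_query

If you provide routing, the routing value is copied to the scroll query, limiting the process to the shards that match that routing value:

POST twitter/_update_by_query?routing=1

By default, _update_by_query uses scroll batches of 1000. You can change the batch size with the scroll_size URL parameter: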

POST twitter/_update_by_query?scroll_size=100
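
The URL parameters shown in this section can be combined. A small sketch for assembling such a path with the standard library; `by_query_path` is a hypothetical helper name.

```python
from urllib.parse import urlencode


def by_query_path(index, action="_update_by_query", **params):
    """Build the request path, appending only the parameters that are set."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    path = f"{index}/{action}"
    return f"{path}?{query}" if query else path


path = by_query_path("twitter", routing=1, scroll_size=100, conflicts="proceed")
print(path)
```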

9.4.Use the Task API to get the status of all running by-query requests

GET _tasks?detailed=true&actions=*byquery

Result:

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/update/byquery",
          "status" : {    
            "total" : 6154,
            "updated" : 3500,
            "created" : 0,
            "deleted" : 0,
            "batches" : 4,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": {
              "bulk": 0,
              "search": 0
            },
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}
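
Progress can be estimated from the status object by comparing the sum of updated, created, and deleted with total. The following sketch does that for a parsed response; `by_query_progress` is a hypothetical helper, and the sample is a trimmed copy of the result above.

```python
def by_query_progress(tasks_response):
    """Return a list of (task_id, action, fraction_done) for each task."""
    progress = []
    for node in tasks_response["nodes"].values():
        for task_id, task in node["tasks"].items():
            status = task["status"]
            done = status["updated"] + status["created"] + status["deleted"]
            total = status["total"]
            fraction = done / total if total else 0.0
            progress.append((task_id, task["action"], fraction))
    return progress


sample_tasks = {
    "nodes": {
        "r1A2WoRbTwKZ516z6NEs5A": {
            "tasks": {
                "r1A2WoRbTwKZ516z6NEs5A:36619": {
                    "action": "indices:data/write/update/byquery",
                    "status": {"total": 6154, "updated": 3500,
                               "created": 0, "deleted": 0},
                }
            }
        }
    }
}
progress = by_query_progress(sample_tasks)
for task_id, action, fraction in progress:
    print(f"{task_id} {action} {fraction:.1%}")
```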

With the task ID you can look up the task directly:

GET /_tasks/taskId:1

Any update by query can be cancelled using the Task Cancel API:

POST _tasks/task_id:1/_cancel

Manual slicing:

POST twitter/_update_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}
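
With manual slicing, one request is sent per slice id from 0 to max - 1. A minimal sketch that generates those request bodies; `sliced_bodies` is a hypothetical helper name.

```python
import json


def sliced_bodies(max_slices, script_source):
    """Return one update-by-query body per slice id (0 .. max_slices - 1)."""
    return [
        {
            "slice": {"id": slice_id, "max": max_slices},
            "script": {"source": script_source},
        }
        for slice_id in range(max_slices)
    ]


bodies = sliced_bodies(2, "ctx._source['extra'] = 'test'")
for body in bodies:
    print(json.dumps(body))
```

Each body would be posted to `twitter/_update_by_query` as its own request, so the slices can run in parallel.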

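Official documentation reference: Update By Query API.

10.Reindex API: reindexing

10.1.Copy an entire index

In its most basic form, _reindex simply copies documents from one index to another. The following copies the documents in the twitter index into the new_twitter index (provided both use the same mapping type):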

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

10.2.Copy only the documents matching a query

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

10.3.Copy documents from several indices

POST _reindex
{
  "source": {
    "index": ["book", "blog"],
    "type": ["english", "user"]
  },
  "dest": {
    "index": "book1"
  }
 }

ES 6.3 supports only one type per index, so the request above did not succeed. It fails with:

"reason": "Rejecting mapping update to [book1] as the final mapping would have more than 1 type: [english, user]"

10.4.Controlling version handling (version_type)

POST _reindex
{
  "source": {
    "index": ["book"],
    "type": ["english"]
  },
  "dest": {
    "index": "book1",
    "version_type":"external"
  }
 }

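"external": the source version overwrites the version in dest; if the source version is less than or equal to the dest version, a conflict is reported. "internal": dest keeps its own version numbering, which simply auto-increments.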
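
The two version types can be modelled with a dictionary that maps document ids to versions in the destination. This is a local simulation of the rule described above, not Elasticsearch code; `reindex_doc` is a hypothetical name.

```python
def reindex_doc(dest, doc_id, source_version, version_type="internal"):
    """Apply one reindexed document to dest (a dict of id -> version)."""
    current = dest.get(doc_id)
    if version_type == "external":
        # Conflict unless the source version is strictly newer.
        if current is not None and source_version <= current:
            return "conflict"
        dest[doc_id] = source_version
        return "indexed"
    # internal: overwrite and let the destination version auto-increment.
    dest[doc_id] = 1 if current is None else current + 1
    return "indexed"


dest = {"a": 5}
older = reindex_doc(dest, "a", 3, "external")   # source is older than dest
newer = reindex_doc(dest, "a", 7, "external")   # source is newer than dest
version_after_external = dest["a"]
internal = reindex_doc(dest, "a", 3)            # internal ignores the source version
print(older, newer, version_after_external, internal, dest["a"])
```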

10.5.Copy only documents that do not exist yet; existing documents are reported as conflicts

POST _reindex
 {
  "source": {
    "index": ["book"],
    "type": ["english"]
  },
  "dest": {
    "index": "book1",
    "op_type": "create"
  }
 }

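By default, version conflicts abort the _reindex process, but you can have them counted instead by setting "conflicts": "proceed" in the request body.

10.6.Copy a limited number of documents in sorted order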

POST _reindex
{
  "size": 10,
  "source": {
    "index": ["book"],
    "sort": { "name": "desc" }
  },
  "dest": {
    "index": "book1",
    "op_type": "create"
  }
}

If sorting is rejected with the error: Fielddata is disabled on text fields by...

Sorting and aggregations on text fields rely on a separate data structure (fielddata) cached in memory, which has to be enabled explicitly:

PUT book/_mapping/english
{
  "properties": {
    "name": {
      "type":     "text",
      "fielddata": true
    }
  }
}

10.7.Copy only some fields

POST _reindex
{
  "source": {
    "index": "book",
    "_source": ["age", "name"]
  },
  "dest": {
    "index": "book1"
  }
}

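10.8.Filter, modify metadata, then copy

POST _reindex
{
  "size": 2,
  "source": {
    "index": "book",
    "_source": ["age", "name"]
  },
  "dest": {
    "index": "book1",
    "routing": "=age" // sets the routing to the literal text "age"
  },
  "script": {
    "source": "if (ctx._source.age == 12) {ctx._source.age++}",
    "lang": "painless"
  }
}

Just as in _update_by_query, you can set ctx.op to change the operation that is executed on the destination index:

noop: set ctx.op = "noop" if your script decides that the document does not have to be indexed in the destination index. This no-op is reported in the noop counter of the response body.
delete: set ctx.op = "delete" if your script decides that the document must be deleted from the destination index.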

10.9.Reindex from a remote cluster

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

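10.10.View reindex tasks

GET _tasks?detailed=true&actions=*reindex

Official documentation reference: Reindex API.

11.Term Vectors API: term information and statistics

11.1. Basic term information

# term_freq: how often the term occurs in the field

# position: the position of the term within the field

# start_offset: the character offset at which the term starts

# end_offset: the character offset at which the term ends

11.2 Term statistics

If term statistics are enabled, that is, term_statistics is set to true, the following statistics are returned:

# doc_freq: the number of documents that contain the term

# ttf: short for total term frequency, how often the term occurs across all documents

11.3Field statistics

If field statistics are enabled, that is, field_statistics is set to true, the following statistics are returned:

# sum_doc_freq: the sum of the document frequencies of all terms in the field

# doc_count: how many documents contain the field

# sum_ttf: short for sum of total term frequencies, the sum of the total term frequencies of all terms in the field

Term statistics and field statistics are not exact: deleted documents are not taken into account.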
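
To make these definitions concrete, the statistics can be computed by hand for a tiny corpus. The sketch below assumes a lowercase whitespace tokenizer and treats each string as the value of one field in one document; `field_statistics` is a hypothetical helper.

```python
from collections import Counter


def field_statistics(docs):
    """Compute term and field statistics for one field across documents."""
    tokenized = [text.lower().split() for text in docs if text]
    # ttf: occurrences of each term across all documents.
    ttf = Counter(term for tokens in tokenized for term in tokens)
    # doc_freq: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    return {
        "doc_count": len(tokenized),
        "sum_doc_freq": sum(doc_freq.values()),
        "sum_ttf": sum(ttf.values()),
        "doc_freq": dict(doc_freq),
        "ttf": dict(ttf),
    }


stats = field_statistics(["twitter test test test", "Another twitter test"])
print(stats["doc_count"], stats["sum_doc_freq"], stats["sum_ttf"])
```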

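11.5 Ways of collecting term information

Term information can be collected in two ways: index-time (read from what is already stored in the index) and query-time (generated on the fly).

11.6 index-time

This has to be configured in the mapping; the term and document statistics are then generated directly when documents are indexed.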

PUT /website
{
   "mappings": {
       "article":{
           "properties":{
               "text":{
                   "type": "text",
                   "term_vector": "with_positions_offsets",
                   "store": "true",
                   "analyzer" : "fulltext"
                }
            }
        }
    },
   "settings": {
       "analysis": {
           "analyzer": {
               "fulltext":{
                   "type": "custom",
                   "tokenizer": "whitespace",
                   "filter": [
                       "lowercase",
                       "type_as_payload"
                   ]
               }
            }
        }
    }
}

11.7 query-time

Nothing was configured in the mapping beforehand; the statistics are generated when the request is made.

POST /ecommerce/music/1/_termvectors
{
   "fields":["desc"],
   "offsets":true,
   "payloads":true,
   "positions":true,
   "term_statistics":true,
   "field_statistics" : true
}

11.8 Specifying an analyzer manually when generating term vectors

The per_field_analyzer parameter sets the analyzer used to tokenize the text of a given field.

POST /ecommerce/music/1/_termvectors
{
   "fields":["desc"],
   "offsets":true,
   "payloads":true,
   "positions":true,
   "term_statistics":true,
   "field_statistics" : true,
   "per_field_analyzer":{
       "desc":"standard"
    }
}

11.9 Generating term vectors on the fly for a document supplied in the request

POST book/english/_termvectors
{
  "doc" : {
    "name" : "hellow word",
    "text" : "twitter test test test"
  },
  "fields": ["name"],
  "per_field_analyzer" : {
    "name":"standard"
  }
}

Response:

{
  "_index": "book",
  "_type": "english",
  "_version": 0,
  "found": true,
  "took": 1,
  "term_vectors": {
    "name": {
      "field_statistics": {
        "sum_doc_freq": 632,
        "doc_count": 30,
        "sum_ttf": 991
      },
      "terms": {
        "hellow": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 0,
              "start_offset": 0,
              "end_offset": 6
            }
          ]
        },
        "word": {
          "term_freq": 1,
          "tokens": [
            {
              "position": 1,
              "start_offset": 7,
              "end_offset": 11
            }
          ]
        }
      }
    }
  }
}
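
The positions and offsets in this response can be reproduced locally. The sketch below assumes simple whitespace splitting with lowercasing, which happens to match what the standard analyzer produces for this input; `term_vector` is a hypothetical helper.

```python
import re


def term_vector(text):
    """Return term -> {"term_freq", "tokens"} using whitespace tokenization."""
    terms = {}
    for position, match in enumerate(re.finditer(r"\S+", text)):
        entry = terms.setdefault(
            match.group().lower(), {"term_freq": 0, "tokens": []}
        )
        entry["term_freq"] += 1
        entry["tokens"].append({
            "position": position,
            "start_offset": match.start(),
            "end_offset": match.end(),
        })
    return terms


vector = term_vector("hellow word")
print(vector)
```

Note that end_offset is exclusive: "hellow" spans characters 0 to 5, so it ends at offset 6.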

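11.10 Filtering terms by their statistics

The terms returned can be filtered by their statistics, for example to drop terms whose frequency is too low. The request below returns at most 10 terms for the field, keeps only terms that occur at least 2 times in the field, and requires each term to appear in at least 1 document:

POST /ecommerce/music/1/_termvectors
{
   "fields":["desc"],
   "offsets":true,
   "payloads":true,
   "positions":true,
   "term_statistics":true,
   "field_statistics" : true,
   "filter":{
       "max_num_terms":10, // maximum number of terms returned
       "min_term_freq" : 2, // ignore terms occurring fewer times than this in the source document
       "min_doc_freq" : 1  // ignore terms that appear in fewer documents than this
    }
}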

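11.11 Term filter parameters

max_num_terms: maximum number of terms returned per field. Defaults to 25.

min_term_freq: ignore words with less than this frequency in the source document. Defaults to 1.
max_term_freq: ignore words with more than this frequency in the source document. Defaults to unbounded.

min_doc_freq: ignore terms that do not occur in at least this many documents. Defaults to 1.
max_doc_freq: ignore words that occur in more than this many documents. Defaults to unbounded.

min_word_length: minimum word length; shorter words are ignored. Defaults to 0.
max_word_length: maximum word length; longer words are ignored. Defaults to unbounded (0).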
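
How these thresholds interact can be sketched locally. This is an approximation under stated assumptions: the real API ranks the surviving terms by tf-idf score, whereas this sketch ranks by term_freq only, and `filter_terms` and the sample data are hypothetical.

```python
def filter_terms(terms, max_num_terms=25, min_term_freq=1, max_term_freq=None,
                 min_doc_freq=1, max_doc_freq=None,
                 min_word_length=0, max_word_length=0):
    """terms maps term -> {"term_freq": int, "doc_freq": int}."""
    kept = []
    for term, stats in terms.items():
        tf, df = stats["term_freq"], stats["doc_freq"]
        if tf < min_term_freq or (max_term_freq is not None and tf > max_term_freq):
            continue
        if df < min_doc_freq or (max_doc_freq is not None and df > max_doc_freq):
            continue
        if len(term) < min_word_length:
            continue
        if max_word_length and len(term) > max_word_length:  # 0 means unbounded
            continue
        kept.append(term)
    # Simplification: rank by term_freq instead of the tf-idf score.
    kept.sort(key=lambda t: terms[t]["term_freq"], reverse=True)
    return kept[:max_num_terms]


sample_terms = {
    "test": {"term_freq": 3, "doc_freq": 2},
    "twitter": {"term_freq": 1, "doc_freq": 2},
    "a": {"term_freq": 5, "doc_freq": 9},
    "rare": {"term_freq": 2, "doc_freq": 1},
}
selected = filter_terms(sample_terms, max_num_terms=10, min_term_freq=2, min_doc_freq=1)
print(selected)
```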

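Official documentation reference: Term Vector API.

12 Multi termvectors API: term vectors for several documents at once

As with the single-document API, term information can be collected at index-time (read from what is already stored in the index) or at query-time (generated on the fly).

12.1 index-time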

POST /_mtermvectors
{
   "docs": [
      {
         "_index": "twitter",
         "_type": "_doc",
         "_id": "2",
         "term_statistics": true
      },
      {
         "_index": "twitter",
         "_type": "_doc",
         "_id": "1",
         "fields": [
            "message"
         ]
      }
   ]
}

Specifying the index in the URL:

POST /twitter/_mtermvectors
{
   "docs": [
      {
         "_type": "_doc",
         "_id": "2",
         "fields": [
            "message"
         ],
         "term_statistics": true
      },
      {
         "_type": "_doc",
         "_id": "1"
      }
   ]
}

Specifying the index and type in the URL:

POST /twitter/_doc/_mtermvectors
{
   "docs": [
      {
         "_id": "2",
         "fields": [
            "message"
         ],
         "term_statistics": true
      },
      {
         "_id": "1"
      }
   ]
}

If all documents share the same index, type, and parameters:

POST /twitter/_doc/_mtermvectors
{
    "ids" : ["1", "2"],
    "parameters": {
        "fields": [
                "message"
        ],
        "term_statistics": true
    }
}

12.2 Generating term vectors on the fly for several documents

POST _mtermvectors
{
   "docs": [
      {
         "_index": "book",
         "_type": "english",
         "doc" : {
            "name" : "John Doe",
            "message" : "twitter test test test"
         },
          "fields": ["name"],
          "per_field_analyzer" : {
          "name":"standard"
         }
      },
      {
         "_index": "book",
         "_type": "english",
         "doc" : {
           "name" : "Jane Doe",
           "message" : "Another twitter test ..."
         },
          "fields": ["name"],
          "per_field_analyzer" : {
          "name":"standard"
         }
      }
   ]
}

Response:

{
  "docs": [
    {
      "_index": "book",
      "_type": "english",
      "_version": 0,
      "found": true,
      "took": 2,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 632,
            "doc_count": 30,
            "sum_ttf": 991
          },
          "terms": {
            "doe": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 5,
                  "end_offset": 8
                }
              ]
            },
            "john": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 4
                }
              ]
            }
          }
        }
      }
    },
    {
      "_index": "book",
      "_type": "english",
      "_version": 0,
      "found": true,
      "took": 0,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 632,
            "doc_count": 30,
            "sum_ttf": 991
          },
          "terms": {
            "doe": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 1,
                  "start_offset": 5,
                  "end_offset": 8
                }
              ]
            },
            "jane": {
              "term_freq": 1,
              "tokens": [
                {
                  "position": 0,
                  "start_offset": 0,
                  "end_offset": 4
                }
              ]
            }
          }
        }
      }
    }
  ]
}

12.3.Return term statistics across all documents in the index

POST book/_search
{
    "size" : 0,
    "aggs" : {
        "messages" : {
            "terms" : { "size" : 10, "field" : "name" }
        }
    }
}

Official documentation reference: Multi termvectors API.

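13.?refresh

Elasticsearch index data ends up on disk, but it gets there in stages because I/O is expensive:

Data is first written to an in-memory buffer, where it is not yet searchable.
After 1s by default, a refresh writes it into a Lucene segment, at which point it becomes searchable.
Only a flush commits it to disk.

Because this process can be interrupted at any point and lose data, every operation is also recorded in the translog; if a step fails, the operations are replayed after the server restarts, so the write is not lost. The translog itself is buffered in memory first. In async durability mode it is fsynced to disk every 5 seconds by default (index.translog.sync_interval), whereas the default request mode fsyncs it before each request is acknowledged.

The Index, Update, Delete, and Bulk operations accept a refresh parameter with the following values: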

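`13.1.refresh=true`

Refresh the relevant shards (primary and replica) immediately after the operation, so that the updated data is visible to search right away.

`13.2.refresh=wait_for`

Wait for the changes to become visible through a refresh before responding. This does not force an immediate refresh; the request waits for the next periodic refresh, which happens every index.refresh_interval (1 second by default). The interval can be changed dynamically in the index settings, and a refresh can also be triggered manually with POST /twitter/_refresh.

`13.3.refresh=false`

The default value. The request returns immediately without triggering a refresh; the changes become visible at some point after the response, whenever the next refresh happens on the server.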

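Which refresh setting should you choose?

Compared with true, wait_for lets a certain amount of work accumulate before each refresh.
true is inefficient, because every immediate refresh produces a very small segment, and these small segments are later merged into larger, more efficient ones. The cost of true is therefore paid at index time (creating the small segments), at search time (searching them), and at merge time (merging them into larger segments).
Do not set refresh=wait_for on every single-document request; batch the changes with bulk and set refresh=wait_for on that single request instead.
If index.refresh_interval is -1, refreshing is disabled, and a request with refresh=wait_for waits indefinitely until some other action causes a refresh. If index.refresh_interval is set lower than the default (1s), for example 200 ms, requests with refresh=wait_for return sooner, but inefficient segments are still generated.
refresh=wait_for only affects the request it is set on, whereas refresh=true, by forcing a refresh immediately, also affects other ongoing requests. Use refresh=wait_for when you want to keep the impact as small as possible.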
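
The segment cost of refresh=true can be pictured with a toy model of a shard. This is a conceptual simulation only (it ignores merging, replicas, and the blocking behaviour of wait_for), and `ShardModel` is a hypothetical class.

```python
class ShardModel:
    """Toy shard: an in-memory buffer plus a list of searchable segments."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.segments = []  # each refresh of a non-empty buffer adds one segment

    def refresh(self):
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def index(self, doc, refresh="false"):
        self.buffer.append(doc)
        if refresh == "true":
            self.refresh()  # forced refresh: one tiny segment per request

    def search(self):
        return [doc for segment in self.segments for doc in segment]


eager = ShardModel()
for doc in ("a", "b", "c"):
    eager.index(doc, refresh="true")

lazy = ShardModel()
for doc in ("a", "b", "c"):
    lazy.index(doc)                 # refresh=false: nothing searchable yet
visible_before = lazy.search()
lazy.refresh()                      # stands in for the periodic refresh

print(len(eager.segments), len(lazy.segments), visible_before, lazy.search())
```

Three forced refreshes leave three single-document segments, while one periodic refresh produces a single segment holding all three documents.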

Official documentation reference: Refresh API.
