Elasticsearch gotchas: notes on changing a mapping field type


This post records what it took to change the type of a field in an Elasticsearch mapping.


Our team uses Elasticsearch as a classification, search and analysis service for logs, with a _mapping similar to the following:

{
    "settings" : {
        "number_of_shards" : 20
    },
    "mappings" : {
      "client" : {
        "properties" : {
          "ip" : {
            "type" : "long"
          },
          "cost" : {
            "type" : "long"
          }
        }
      }
    }
}

 
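Here is the problem: an IP address such as "127.0.0.1" in the log output cannot be converted to long by Elasticsearch (it fails with java.lang.NumberFormatException), so the field has to become either string or ip (Elasticsearch supports both; see mapping-core-types for the data types) to get the behaviour we want.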
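
The failure is easy to reproduce outside Elasticsearch, because a dotted-quad string is simply not a number. The sketch below uses only the Python standard library. It shows the parse failure and, as an alternative this post does not pursue, how an IPv4 address could be packed into an integer if the field had to stay long. The helper name ipv4_to_long is made up for illustration.

```python
import ipaddress


def ipv4_to_long(text):
    """Pack a dotted-quad IPv4 string into its 32-bit integer value.

    Hypothetical helper, not something the original setup used.
    """
    return int(ipaddress.IPv4Address(text))


# A dotted-quad string cannot be parsed as a plain integer.
try:
    int("127.0.0.1")
    parsed_as_number = True
except ValueError:
    parsed_as_number = False

print(parsed_as_number)           # False
print(ipv4_to_long("127.0.0.1"))  # 2130706433
```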

The goal is clear: change the field type of ip in the mapping.
After a look around elasticsearch.org, it seemed a simple mapping update would do:

curl -XPUT localhost:8301/store/client/_mapping -d '
{
    "client" : {
        "properties" : {
            "local_ip" : {"type" : "string", "store" : "yes"}    
        }
    }
}
'


The error response:

{"error":"MergeMappingException[Merge failed with failures {[mapper [local_ip] of different type, current_type [long], merged_type [string]]}]","status":400}
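
A conflict like this can be spotted before sending the update by comparing the "properties" of the current mapping with the ones you want. The following is a minimal sketch in plain Python that works on dictionaries only and does not talk to a cluster. The function name type_conflicts and the extra field trace are assumptions for illustration.

```python
def type_conflicts(current_props, desired_props):
    """Return (field, current_type, desired_type) for fields whose type would change."""
    conflicts = []
    for field, desired in sorted(desired_props.items()):
        current = current_props.get(field)
        if current is None:
            continue  # a field that does not exist yet can simply be added
        if current.get("type") != desired.get("type"):
            conflicts.append((field, current.get("type"), desired.get("type")))
    return conflicts


# Illustrative data: "trace" is a made-up new field.
current = {"local_ip": {"type": "long"}, "cost": {"type": "long"}}
desired = {"local_ip": {"type": "string", "store": "yes"}, "trace": {"type": "string"}}

print(type_conflicts(current, desired))  # [('local_ip', 'long', 'string')]
```

An empty result means the update only adds fields. Anything else means a new index and a reindex, as the rest of this post found out the hard way.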



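Seriously? Converting long to string fails (at the product level, Elasticsearch really ought to support a lossless conversion like this). No luck.
I Googled for similar cases
and, in one thread, found a definitive answer from an Elasticsearch developer: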


  "You can't change existing mapping type, you need to create a new index with the correct mapping and index the data again."

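Think about it for a moment; this is a bit of a trap. Whether the cause is Elasticsearch itself or the underlying Lucene, having to reindex every field of every existing document just to change one field is an outrageous approach, yet the Elasticsearch developers see nothing unreasonable about it.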

Browsing the Elasticsearch site, I found this explanation:
(http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/)
the problem — why you can’t change mappings

You can only find that which is stored in your index. In order to make your data searchable, your database needs to know what type of data each field contains and how it should be indexed. If you switch a field type from e.g. a string to a date, all of the data for that field that you already have indexed becomes useless. One way or another, you need to reindex that field.

...
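OK, that paragraph is reasonable: if I change the type of a field, that field needs to be reindexed. Any database would have to do the same. No argument there.
Reading on to "reindexing your data", though, it turns out to be pretty weak: their "reindexing your data" does not reindex just the modified field. It creates a brand-new index and reindexes every field. Outrageous.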

Ranting aside, there is no way around it, so let's do it their way.
First, create a new index:

curl -XPUT localhost:8305/store_v2 -d '
{
    "settings" : {
        "number_of_shards" : 20
    },
    "mappings" : {
      "client" : {
        "properties" : {
          "ip" : {
            "type" : "string"
          },
          "cost" : {
            "type" : "long"
          }
        }
      }
    }
}
'


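Wait: now that there is a new index, will the client code that writes to Elasticsearch need changing? A quick look shows there is a solution: create an alias (much like a C++ reference) and access the data through it, which decouples clients from the index behind it. Reading that, I breathed a sigh of relief.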

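The remaining problem: this is a live production service that cannot be taken down, so I need a way to copy the data into my new index.
The Elasticsearch site says: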
  pull the documents in from your old index, using a scrolled search and index them into the new index using the bulk API. Many of the client APIs provide a reindex() method which will do all of this for you. Once you are done, you can delete the old index.
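The first sentence sounds lovely, but after searching around I found no concrete example anywhere, not even via Google. So how exactly am I supposed to move the data?
The second sentence, the client APIs, looked like the only workable route.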

I know Python best, so I went straight for pyes. After installing a big pile of dependencies, I finally got it running:

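    import pyes

    # Connect to the cluster (address redacted).
    conn = pyes.es.ES("http://10.xx.xx.xx:8305/")
    # Match every document in the old index.
    search = pyes.query.MatchAllQuery().search(bulk_read=1000)
    # Scan the old index; the model callback returns each raw hit unchanged.
    hits = conn.search(search, 'store_v1', 'client', scan=True, scroll="30m", model=lambda _,hit: hit)
    for hit in hits:
        # Queue each document for the new index under its original _id.
        conn.index(hit['_source'], 'store_v2', 'client', hit['_id'], bulk=True)
    # Flush once the loop is done.
    conn.flush()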
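
For reference, the "scrolled search plus bulk API" route from the quoted documentation comes down to turning each hit into two lines of a newline-delimited _bulk request body: an action line and the document source. Below is a minimal sketch of that serialization using only the standard library. It does not contact a cluster and is not a description of what pyes does internally. The function name bulk_index_body and the sample hits are assumptions, and the _type key follows the typed mappings used in this post.

```python
import json


def bulk_index_body(hits, index, doc_type):
    """Serialize search hits into the newline-delimited body of a _bulk request."""
    lines = []
    for hit in hits:
        # Action line: index this document under its original _id.
        action = {"index": {"_index": index, "_type": doc_type, "_id": hit["_id"]}}
        lines.append(json.dumps(action))
        # Source line: the document itself, unchanged.
        lines.append(json.dumps(hit["_source"]))
    if not lines:
        return ""
    # The bulk body must end with a newline after the last line.
    return "\n".join(lines) + "\n"


# Made-up hits shaped like scroll results.
sample_hits = [
    {"_id": "1", "_source": {"ip": "127.0.0.1", "cost": 12}},
    {"_id": "2", "_source": {"ip": "10.0.0.7", "cost": 30}},
]
body = bulk_index_body(sample_hits, "store_v2", "client")
print(body)
```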

 
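After a bit over an hour, the new index was basically in line with the old one. As for documents written right around the moment the copy finished, I did not bother; our accuracy requirements are not that strict, so good enough.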
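
If the gap did matter, one rough way to find documents that were missed is to compare the sets of document ids from both indices. This is only a sketch and was not part of the original migration. It assumes the ids have already been collected into plain Python iterables (for example with another scan), and the name missing_ids and the sample ids are made up.

```python
def missing_ids(old_ids, new_ids):
    """Return ids present in the old index but absent from the new one, sorted."""
    return sorted(set(old_ids) - set(new_ids))


# Made-up ids: "c" arrived in the old index after the copy had passed it.
old_index_ids = ["a", "b", "c"]
new_index_ids = ["a", "b"]

print(missing_ids(old_index_ids, new_index_ids))  # ['c']
```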

Next, repoint the alias (if you were not already going through an alias before changing the mapping, prepare to cry):

curl -XPOST localhost:8305/_aliases -d '
{
    "actions": [
        { "remove": {
            "alias": "store",
            "index": "store_v1"
        }},
        { "add": {
            "alias": "store",
            "index": "store_v2"
        }}
    ]
}
'
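
The same request body can be generated from code, which helps when the index names change with every migration. A minimal sketch using only the standard library follows. The function name alias_swap_body is an assumption, and the output is meant to be sent to the _aliases endpoint exactly as in the curl call above. Putting remove and add in one request matters, because the alias then moves in a single step and clients never see it missing.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the _aliases request body that moves an alias from one index to another."""
    actions = [
        {"remove": {"alias": alias, "index": old_index}},
        {"add": {"alias": alias, "index": new_index}},
    ]
    return json.dumps({"actions": actions}, indent=4)


swap = alias_swap_body("store", "store_v1", "store_v2")
print(swap)
```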

 
Ta-da, the new index is now catching up on data.

Once it has caught up, delete the old index:

curl -XDELETE localhost:8303/store_v1

 

And that's it!

Something this simple, and Elasticsearch manages to make it this complicated. Truly impressive...
