The ID auto-generation algorithm used by ES bulk index writes

How a bulk request is processed:

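1. Iterate over all the requests and preprocess each one. This mainly means resolving the routing (if the mapping defines one) and the timestamp (if the request carries none, the current time is used). If no id is specified and action.bulk.action.allow_id_generation is set to true, a base64UUID is generated automatically and used as the id, and the request's opType is set to CREATE, because a document whose id is generated by ES is by default a create-document operation rather than an update-document one. (Note: frustratingly, in the latest ES code I pulled from GitHub, the id auto-generation path no longer sets opType, so it now appears to follow the same logic as a request with an explicit id; see https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/action/index/IndexRequest.java).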

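2. Create a shardId --> Operation map, then iterate over all the requests again to find the shardId each request should be sent to. The lookup works like this: if the request has a routing value, that value is used directly; otherwise the id is used, and the chosen value is hashed. The hash function defaults to Murmur3, though you can pick a different one through the index.legacy.routing.hash.type setting. The target shard is then chosen with: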

return MathUtils.mod(hash, indexMetaData.getNumberOfShards());

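That is, the hash is taken modulo the total number of shards to obtain the shardId. The shardId becomes the map key, and the value is a collection of BulkItemRequest objects, each built from a request and its index in the iteration. (The index is kept because the bulk response reports the result of every individual request.) In short, all of this simply groups the requests by shard (for load balancing).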
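The grouping in step 2 can be sketched roughly as follows. This is a minimal illustration, not the ES implementation: the class and field names (ShardGroupingSketch, Request, Item) are hypothetical, and the hash is a small FNV-1a placeholder rather than the Murmur3 function the text says ES uses by default. Math.floorMod stands in for MathUtils.mod so the result is never negative.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ShardGroupingSketch {
    /** Hypothetical stand-in for an index request: only the fields routing needs. */
    static final class Request {
        final String id;
        final String routing; // may be null
        Request(String id, String routing) { this.id = id; this.routing = routing; }
    }

    /** Pairs a request with its position in the bulk, in the spirit of BulkItemRequest. */
    static final class Item {
        final int position;
        final Request request;
        Item(int position, Request request) { this.position = position; this.request = request; }
    }

    // Placeholder hash (32-bit FNV-1a). Assumption: swapped in for brevity; not Murmur3.
    static int hash(String value) {
        int h = 0x811c9dc5;
        for (byte b : value.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Use the routing value when present, otherwise the id; then hash modulo shard count.
    static int shardId(Request request, int numberOfShards) {
        String key = request.routing != null ? request.routing : request.id;
        return Math.floorMod(hash(key), numberOfShards);
    }

    // Group requests by target shard, remembering each request's original position.
    static Map<Integer, List<Item>> groupByShard(List<Request> requests, int numberOfShards) {
        Map<Integer, List<Item>> byShard = new TreeMap<>();
        for (int i = 0; i < requests.size(); i++) {
            Request r = requests.get(i);
            byShard.computeIfAbsent(shardId(r, numberOfShards), k -> new ArrayList<>())
                   .add(new Item(i, r));
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<Request> bulk = Arrays.asList(
            new Request("doc-1", null),
            new Request("doc-2", "user-42"),
            new Request("doc-3", "user-42"));
        groupByShard(bulk, 5).forEach((shard, items) ->
            System.out.println("shard " + shard + " <- " + items.size() + " request(s)"));
    }
}
```

Requests that share a routing value always land in the same group, which is why the second and third requests in the usage example end up on one shard.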

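3. Iterate over the map built in the previous step and create a bulkShardRequest for each group, carrying the configured consistencyLevel and timeout. The primary shard is then looked up in the cluster state: if the primary is on the local node the request is executed directly, otherwise it is forwarded to the node that holds that shard.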
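The local-versus-forward decision in step 3 can be pictured with a small sketch. Everything here is hypothetical: the cluster state is reduced to a plain map from shardId to the node holding the primary, and the method only returns a description of what would happen instead of executing anything.

```java
import java.util.HashMap;
import java.util.Map;

final class PrimaryDispatchSketch {
    // Decide where a per-shard bulk request should run.
    // primaryNodeByShard is an assumed, simplified view of the cluster state.
    static String dispatch(int shardId, Map<Integer, String> primaryNodeByShard, String localNodeId) {
        String primaryNode = primaryNodeByShard.get(shardId);
        if (primaryNode == null) {
            throw new IllegalStateException("no primary assigned for shard " + shardId);
        }
        return primaryNode.equals(localNodeId)
            ? "execute locally"
            : "forward to " + primaryNode;
    }

    public static void main(String[] args) {
        Map<Integer, String> primaries = new HashMap<>();
        primaries.put(0, "node-a");
        primaries.put(1, "node-b");
        System.out.println(dispatch(0, primaries, "node-a"));
        System.out.println(dispatch(1, primaries, "node-a"));
    }
}
```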

 

The ID generation algorithm from step 1 above:


For ES 1.7.1, the relevant class is org.elasticsearch.action.index.IndexRequest

void org.elasticsearch.action.index.IndexRequest.process(MetaData metaData, @Nullable MappingMetaData mappingMd, boolean allowIdGeneration, String concreteIndex) throws ElasticsearchException
{
............
    // generate id if not already provided and id generation is allowed
    if (allowIdGeneration) {
        if (id == null) {
            id(Strings.base64UUID());
            // since we generate the id, change it to CREATE
            opType(IndexRequest.OpType.CREATE);
            autoGeneratedId = true;
        }
    }
............
}

 

IndexRequest org.elasticsearch.action.index.IndexRequest.id(String id)

Sets the id of the indexed document. If not set, will be automatically generated.
Parameters:
id


String org.elasticsearch.common.Strings.base64UUID()

Generates a time-based UUID (similar to Flake IDs), which is preferred when generating an ID to be indexed into a Lucene index as primary key. The id is opaque and the implementation is free to change at any time!

/** Generates a time-based UUID (similar to Flake IDs), which is preferred when generating an ID to be indexed into a Lucene index as
* primary key. The id is opaque and the implementation is free to change at any time! */
public static String base64UUID() {
    return TIME_UUID_GENERATOR.getBase64UUID();
}
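
To give a feel for what a time-based, Flake-like ID looks like, here is a minimal sketch. It is not the ES generator: the Javadoc above says the real id is opaque and may change at any time. The layout (6 bytes of timestamp, 6 random bytes standing in for a node identifier, 3 bytes of sequence number) and the class name FlakeLikeIdSketch are assumptions made for illustration only.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.concurrent.atomic.AtomicInteger;

final class FlakeLikeIdSketch {
    // Assumption: a random 6-byte value stands in for a per-node identifier.
    private static final byte[] NODE_ID = new byte[6];
    // The sequence starts at a random value and increments once per generated id.
    private static final AtomicInteger SEQUENCE = new AtomicInteger(new SecureRandom().nextInt());

    static {
        new SecureRandom().nextBytes(NODE_ID);
    }

    // Build a 15-byte id: timestamp (6 bytes) + node id (6 bytes) + sequence (3 bytes).
    static String nextId(long timestampMillis) {
        int seq = SEQUENCE.incrementAndGet() & 0xffffff;
        byte[] bytes = new byte[15];
        for (int i = 0; i < 6; i++) {
            // Big-endian: most significant timestamp byte first.
            bytes[i] = (byte) (timestampMillis >>> (8 * (5 - i)));
        }
        System.arraycopy(NODE_ID, 0, bytes, 6, 6);
        bytes[12] = (byte) (seq >>> 16);
        bytes[13] = (byte) (seq >>> 8);
        bytes[14] = (byte) seq;
        // URL-safe base64 without padding: 15 bytes become 20 characters.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    static String nextId() {
        return nextId(System.currentTimeMillis());
    }

    public static void main(String[] args) {
        System.out.println(nextId());
        System.out.println(nextId());
    }
}
```

Ids generated close together in time share a long common prefix, unlike fully random UUIDs. In this sketch the sequence wraps after 2^24 ids, so uniqueness within a single millisecond holds only below that rate.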

 

 

References:

https://discuss.elastic.co/t/generate-id/28536/2

https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing 

https://github.com/elastic/elasticsearch/pull/7531/files The history of this change in ES can be seen here: ES originally used randomBase64UUID and later switched to Flake-like IDs for performance reasons.

http://xbib.org/elasticsearch/2.1.1/apidocs/org/elasticsearch/common/Strings.html

http://www.opscoder.info/es_indexprocess1.html contains a detailed explanation of bulk inserts
