The Kafka Indexing Service generates segments according to the topic's partitions. Suppose a topic has 12 partitions and the query granularity is 1 hour: up to 216 segments can then be produced in a single day. The official docs recommend a segment size of 500-700 MB, yet some of our segments were only a few tens of KB, which is clearly unreasonable.
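As a rough sketch of where numbers like that come from (assuming, as a simplification, at most one segment per Kafka partition per segment interval; real counts also depend on taskCount, taskDuration and handoff behaviour), the daily segment count can be bounded like this:

```python
# Rough upper bound on segments produced per day by the Kafka Indexing Service.
# Simplifying assumption: at most one segment per Kafka partition per segment
# interval. Actual counts also depend on taskCount, taskDuration and handoff,
# which is why the figure observed in this post (216) sits below the bound.
def max_segments_per_day(partitions: int, segment_granularity_hours: int) -> int:
    intervals_per_day = 24 // segment_granularity_hours
    return partitions * intervals_per_day

print(max_segments_per_day(partitions=12, segment_granularity_hours=1))  # 288
```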
The merge example provided in the official docs (reproduced below) did not run successfully when we first tried it; the spec that eventually worked, shown further down, was arrived at through trial and error.
{ "type" : "index_hadoop", "spec" : { "dataSchema" : { "dataSource" : "wikipedia", "parser" : { "type" : "hadoopyString", "parseSpec" : { "format" : "json", "timestampSpec" : { "column" : "timestamp", "format" : "auto" }, "dimensionsSpec" : { "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"], "dimensionExclusions" : [], "spatialDimensions" : [] } } }, "metricsSpec" : [ { "type" : "count", "name" : "count" }, { "type" : "doubleSum", "name" : "added", "fieldName" : "added" }, { "type" : "doubleSum", "name" : "deleted", "fieldName" : "deleted" }, { "type" : "doubleSum", "name" : "delta", "fieldName" : "delta" } ], "granularitySpec" : { "type" : "uniform", "segmentGranularity" : "DAY", "queryGranularity" : "NONE", "intervals" : [ "2013-08-31/2013-09-01" ] } }, "ioConfig" : { "type" : "hadoop", "inputSpec":{ "type":"dataSource", "ingestionSpec":{ "dataSource":"wikipedia", "intervals":[ "2013-08-31/2013-09-01" ] } }, "tuningConfig" : { "type": "hadoop" } } } }
"inputSpec":{ "type":"dataSource", "ingestionSpec":{ "dataSource":"wikipedia", "intervals":[ "2013-08-31/2013-09-01" ] }
{ "type":"index_hadoop", "spec":{ "dataSchema":{ "dataSource":"test", "parser":{ "type":"hadoopyString", "parseSpec":{ "format":"json", "timestampSpec":{ "column":"timeStamp", "format":"auto" }, "dimensionsSpec": { "dimensions": [ "test_id", "test_id" ], "dimensionExclusions": [ "timeStamp", "value" ] } } }, "metricsSpec": [ { "type": "count", "name": "count" } ], "granularitySpec":{ "type":"uniform", "segmentGranularity":"MONTH", "queryGranularity": "HOUR", "intervals":[ "2017-12-01/2017-12-31" ] } }, "ioConfig":{ "type":"hadoop", "inputSpec":{ "type":"dataSource", "ingestionSpec":{ "dataSource":"test", "intervals":[ "2017-12-01/2017-12-31" ] } } }, "tuningConfig":{ "type":"hadoop", "maxRowsInMemory":500000, "partitionsSpec":{ "type":"hashed", "targetPartitionSize":5000000 }, "numBackgroundPersistThreads":1, "jobProperties":{ "mapreduce.job.local.dir":"/home/ant/druid/druid-0.11.0/var/mapred", "mapreduce.cluster.local.dir":"/home/ant/druid/druid-0.11.0/var/mapred", "mapred.job.map.memory.mb":2300, "mapreduce.reduce.memory.mb":2300 } } } }
The `ingestionSpec` here again describes the data being loaded: the existing segments of the `test` datasource over the given interval.
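One design note on the `tuningConfig`: the hashed `partitionsSpec` targets roughly `targetPartitionSize` rows per output segment, so each merged interval ends up with a handful of large segments instead of hundreds of tiny ones. A minimal sketch of that relationship (the 60M-row figure below is hypothetical):

```python
import math

# The hashed partitionsSpec targets about targetPartitionSize rows per output
# segment, so the number of shards written for an interval is roughly
# ceil(rows_in_interval / targetPartitionSize).
def expected_shards(rows_in_interval: int, target_partition_size: int = 5_000_000) -> int:
    return max(1, math.ceil(rows_in_interval / target_partition_size))

# Hypothetical month with 60M rows -> about 12 merged segments.
print(expected_shards(60_000_000))  # 12
```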
The merge task is submitted to Druid over HTTP:

| Item | Value |
| --- | --- |
| URL | the Overlord's task endpoint, `http://<OVERLORD_IP>:8090/druid/indexer/v1/task` (8090 is the default Overlord port) |
| HTTP method | POST |
| Content-Type (header parameter) | application/json |
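For illustration, a minimal sketch of submitting the spec with Python (the Overlord host and the `merge_task.json` filename are assumptions of this example):

```python
import json
import requests

# Assumed Overlord location; adjust to your deployment.
OVERLORD_URL = "http://localhost:8090/druid/indexer/v1/task"

# The index_hadoop spec shown above, saved to a file.
with open("merge_task.json") as f:
    spec = json.load(f)

# requests sets the Content-Type: application/json header for us.
resp = requests.post(OVERLORD_URL, json=spec)
resp.raise_for_status()
print(resp.json())  # e.g. {"task": "index_hadoop_test_..."}
```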
Druid also ships a built-in merge-task mechanism, but we still recommend doing the merge directly as a Hadoop batch job, as shown above.
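Once a task is submitted, its progress can be tracked through the Overlord's standard status API; a minimal sketch, with the same assumed host/port as above:

```python
import time
import requests

# Poll the Overlord until the task leaves the RUNNING state.
def wait_for_task(task_id: str, overlord: str = "http://localhost:8090") -> str:
    while True:
        resp = requests.get(f"{overlord}/druid/indexer/v1/task/{task_id}/status")
        resp.raise_for_status()
        state = resp.json()["status"]["status"]  # RUNNING / SUCCESS / FAILED
        if state != "RUNNING":
            return state
        time.sleep(30)
```

References from the official docs: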
http://druid.io/docs/latest/ingestion/batch-ingestion.html
http://druid.io/docs/latest/ingestion/update-existing-data.html