Druid Hadoop-based Batch Ingestion

Background

The number of segments the Kafka Indexing Service generates is determined by the topic's partitions. If a topic has 12 partitions and segments are created at an hourly granularity, a single day can produce up to 12 × 24 = 288 segments. The official docs recommend a segment size of 500-700 MB, yet some of these segments are only a few tens of KB, which is clearly unreasonable.

Merging

The merge example provided on the official site did not run successfully at the time; the spec below is what eventually worked after some trial and error:

{
  "type" : "index_hadoop",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "parser" : {
        "type" : "hadoopyString",
        "parseSpec" : {
          "format" : "json",
          "timestampSpec" : {
            "column" : "timestamp",
            "format" : "auto"
          },
          "dimensionsSpec" : {
            "dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
            "dimensionExclusions" : [],
            "spatialDimensions" : []
          }
        }
      },
      "metricsSpec" : [
        {
          "type" : "count",
          "name" : "count"
        },
        {
          "type" : "doubleSum",
          "name" : "added",
          "fieldName" : "added"
        },
        {
          "type" : "doubleSum",
          "name" : "deleted",
          "fieldName" : "deleted"
        },
        {
          "type" : "doubleSum",
          "name" : "delta",
          "fieldName" : "delta"
        }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "DAY",
        "queryGranularity" : "NONE",
        "intervals" : [ "2013-08-31/2013-09-01" ]
      }
    },
    "ioConfig" : {
      "type" : "hadoop",
     "inputSpec":{
                "type":"dataSource",
                "ingestionSpec":{
                    "dataSource":"wikipedia",
                    "intervals":[
                        "2013-08-31/2013-09-01"
                    ]
                }
            },
    "tuningConfig" : {
      "type": "hadoop"
    }
  }
}
}

Explanation

"inputSpec":{
                "type":"dataSource",
                "ingestionSpec":{
                    "dataSource":"wikipedia",
                    "intervals":[
                        "2013-08-31/2013-09-01"
                    ]
                }

設置Hadoop 任務工做目錄,默認經過/tmp,若是臨時目錄可用空間比較小,則會致使任務沒法正常執行

{
    "type" : "index_hadoop",
    "spec" : {
        "dataSchema" : {
            "dataSource" : "test",
            "parser" : {
                "type" : "hadoopyString",
                "parseSpec" : {
                    "format" : "json",
                    "timestampSpec" : {
                        "column" : "timeStamp",
                        "format" : "auto"
                    },
                    "dimensionsSpec" : {
                        "dimensions" : [
                            "test_id"
                        ],
                        "dimensionExclusions" : [
                            "timeStamp",
                            "value"
                        ]
                    }
                }
            },
            "metricsSpec" : [
                {
                    "type" : "count",
                    "name" : "count"
                }
            ],
            "granularitySpec" : {
                "type" : "uniform",
                "segmentGranularity" : "MONTH",
                "queryGranularity" : "HOUR",
                "intervals" : [
                    "2017-12-01/2017-12-31"
                ]
            }
        },
        "ioConfig" : {
            "type" : "hadoop",
            "inputSpec" : {
                "type" : "dataSource",
                "ingestionSpec" : {
                    "dataSource" : "test",
                    "intervals" : [
                        "2017-12-01/2017-12-31"
                    ]
                }
            }
        },
        "tuningConfig" : {
            "type" : "hadoop",
            "maxRowsInMemory" : 500000,
            "partitionsSpec" : {
                "type" : "hashed",
                "targetPartitionSize" : 5000000
            },
            "numBackgroundPersistThreads" : 1,
            "jobProperties" : {
                "mapreduce.job.local.dir" : "/home/ant/druid/druid-0.11.0/var/mapred",
                "mapreduce.cluster.local.dir" : "/home/ant/druid/druid-0.11.0/var/mapred",
                "mapreduce.map.memory.mb" : 2300,
                "mapreduce.reduce.memory.mb" : 2300
            }
        }
    }
}

This spec reindexes the already-loaded data; the jobProperties entries point the MapReduce local directories at a path with enough free space and raise the map and reduce memory limits.

Submitting
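
A task spec like the ones above is POSTed to the Overlord's task endpoint. A minimal example, assuming the Overlord listens on localhost at its default port 8090 and the spec was saved as merge_task.json (both names are illustrative, adjust for your deployment):

curl -X POST -H 'Content-Type: application/json' -d @merge_task.json http://localhost:8090/druid/indexer/v1/task

The Overlord replies with the task id; progress can be followed in the Overlord console or via GET /druid/indexer/v1/task/{taskId}/status, and once the task succeeds the merged segments show up in the Coordinator console (port 8081 by default).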

Other solutions

Druid itself also ships a built-in merge task type, but it is still preferable to do the rewrite through a Hadoop reindexing task as above; a rough sketch of the built-in task follows.
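
For reference, a schematic outline of the built-in merge task from the Druid 0.11-era task API. The "segments" array must contain the full segment descriptors as returned by the Coordinator metadata API and is elided here, so treat this as a shape reference rather than a runnable spec:

{
    "type" : "merge",
    "dataSource" : "wikipedia",
    "aggregations" : [],
    "segments" : [ ... ]
}

Having to list every segment descriptor by hand is exactly what makes this approach unwieldy when there are hundreds of small segments, which is why the Hadoop-based rewrite is recommended instead.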

References

http://druid.io/docs/latest/ingestion/batch-ingestion.html

http://druid.io/docs/latest/ingestion/update-existing-data.html
