HdfsReader provides the ability to read data stored on a distributed file system. Under the hood, HdfsReader fetches file data from the distributed file system, converts it into the DataX transport protocol, and hands it to a Writer.
HdfsReader currently supports files in textfile (text), orcfile (orc), rcfile (rc), sequence file (seq) and plain logical two-dimensional table (csv) formats, and the file contents must represent a logical two-dimensional table.
HdfsReader requires JDK 1.7 or later.
HdfsReader reads file data from the Hadoop distributed file system (HDFS) and converts it into the DataX protocol. textfile is the default storage format used when creating Hive tables; the data is not compressed, and a textfile essentially stores the data on HDFS as plain text, so for DataX the HdfsReader implementation is closely analogous to TxtFileReader. orcfile, whose full name is Optimized Row Columnar file, is an optimization of RCFile; according to the official documentation, this file format provides an efficient way to store Hive data. HdfsReader uses the OrcSerde class provided by Hive to read and parse orcfile data. HdfsReader currently supports the following features:
1. Supports textfile, orcfile, rcfile, sequence file and csv files, and requires the file contents to represent a logical two-dimensional table.
2. Supports reading multiple data types (represented as String), column pruning, and constant columns.
3. Supports recursive reads and the file wildcards "*" and "?".
4. Supports orcfile compression; currently the SNAPPY and ZLIB codecs are supported.
5. Multiple files can be read concurrently.
6. Supports sequence file compression; currently the lzo codec is supported.
7. Supported compression formats for csv files: gzip, bz2, zip, lzo, lzo_deflate, snappy.
8. The plugin currently bundles Hive 1.1.1 and Hadoop 2.7.1 (Apache, built against JDK 1.7); it has been verified to work in test environments with Hadoop 2.5.0, Hadoop 2.6.0 and Hive 1.2.0; other versions still need further testing.
9. Supports Kerberos authentication (note: if Kerberos authentication is required, the Hadoop version of the user's cluster must match the Hadoop version used by hdfsreader; if the cluster version is higher than hdfsreader's Hadoop version, Kerberos authentication is not guaranteed to work).
Not yet supported:
1. Concurrent multi-threaded reading of a single file, which would require a split algorithm within a single file; planned for a later phase.
2. HDFS HA is not yet supported.
The job JSON is as follows:
{ "job": { "setting": { "speed": { "channel": 3 } }, "content": [{ "reader": { "name": "hdfsreader", "parameter": { "path": "/user/hive/warehouse/test/*", "defaultFS": "hdfs://192.168.1.121:8020", "column": [{ "index": 0, "type": "long" }, { "index": 1, "type": "string" }, { "type": "string", "value": "hello" } ], "fileType": "text", "encoding": "UTF-8", "fieldDelimiter": "," } }, "writer": { "name": "streamwriter", "parameter": { "print": true } } }] } }
Run the job:
FengZhendeMacBook-Pro:bin FengZhen$ ./datax.py /Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/1.reader_all.json DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. 2018-11-18 17:28:30.540 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl 2018-11-18 17:28:30.551 [main] INFO Engine - the machine info => osInfo: Oracle Corporation 1.8 25.162-b12 jvmInfo: Mac OS X x86_64 10.13.4 cpu num: 4 totalPhysicalMemory: -0.00G freePhysicalMemory: -0.00G maxFileDescriptorCount: -1 currentOpenFileDescriptorCount: -1 GC Names [PS MarkSweep, PS Scavenge] MEMORY_NAME | allocation_size | init_size PS Eden Space | 256.00MB | 256.00MB Code Cache | 240.00MB | 2.44MB Compressed Class Space | 1,024.00MB | 0.00MB PS Survivor Space | 42.50MB | 42.50MB PS Old Gen | 683.00MB | 683.00MB Metaspace | -0.00MB | 0.00MB 2018-11-18 17:28:30.572 [main] INFO Engine - { "content":[ { "reader":{ "name":"hdfsreader", "parameter":{ "column":[ { "index":0, "type":"long" }, { "index":1, "type":"string" }, { "type":"string", "value":"hello" } ], "defaultFS":"hdfs://192.168.1.121:8020", "encoding":"UTF-8", "fieldDelimiter":",", "fileType":"text", "path":"/user/hive/warehouse/test/*" } }, "writer":{ "name":"streamwriter", "parameter":{ "print":true } } } ], "setting":{ "speed":{ "channel":3 } } } 2018-11-18 17:28:30.601 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null 2018-11-18 17:28:30.605 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0 2018-11-18 17:28:30.605 [main] INFO JobContainer - DataX jobContainer starts job. 2018-11-18 17:28:30.609 [main] INFO JobContainer - Set jobId = 0 2018-11-18 17:28:30.650 [job-0] INFO HdfsReader$Job - init() begin... 2018-11-18 17:28:31.318 [job-0] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":[]} 2018-11-18 17:28:31.318 [job-0] INFO HdfsReader$Job - init() ok and end... 2018-11-18 17:28:31.326 [job-0] INFO JobContainer - jobContainer starts to do prepare ... 2018-11-18 17:28:31.327 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] do prepare work . 2018-11-18 17:28:31.327 [job-0] INFO HdfsReader$Job - prepare(), start to getAllFiles... 2018-11-18 17:28:31.327 [job-0] INFO HdfsReader$Job - get HDFS all files in path = [/user/hive/warehouse/test/*] Nov 18, 2018 5:28:31 PM org.apache.hadoop.util.NativeCodeLoader <clinit> 警告: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-11-18 17:28:33.323 [job-0] INFO HdfsReader$Job - [hdfs://192.168.1.121:8020/user/hive/warehouse/test/data]是[text]類型的文件, 將該文件加入source files列表 2018-11-18 17:28:33.327 [job-0] INFO HdfsReader$Job - 您即將讀取的文件數爲: [1], 列表爲: [hdfs://192.168.1.121:8020/user/hive/warehouse/test/data] 2018-11-18 17:28:33.328 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work . 2018-11-18 17:28:33.328 [job-0] INFO JobContainer - jobContainer starts to do split ... 2018-11-18 17:28:33.329 [job-0] INFO JobContainer - Job set Channel-Number to 3 channels. 2018-11-18 17:28:33.329 [job-0] INFO HdfsReader$Job - split() begin... 2018-11-18 17:28:33.330 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] splits to [1] tasks. 2018-11-18 17:28:33.331 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [1] tasks. 2018-11-18 17:28:33.347 [job-0] INFO JobContainer - jobContainer starts to do schedule ... 2018-11-18 17:28:33.356 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups. 
2018-11-18 17:28:33.359 [job-0] INFO JobContainer - Running by standalone Mode. 2018-11-18 17:28:33.388 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks. 2018-11-18 17:28:33.396 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated. 2018-11-18 17:28:33.397 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated. 2018-11-18 17:28:33.419 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started 2018-11-18 17:28:33.516 [0-0-0-reader] INFO HdfsReader$Job - hadoopConfig details:{"finalParameters":["mapreduce.job.end-notification.max.retry.interval","mapreduce.job.end-notification.max.attempts"]} 2018-11-18 17:28:33.517 [0-0-0-reader] INFO Reader$Task - read start 2018-11-18 17:28:33.518 [0-0-0-reader] INFO Reader$Task - reading file : [hdfs://192.168.1.121:8020/user/hive/warehouse/test/data] 2018-11-18 17:28:33.790 [0-0-0-reader] INFO UnstructuredStorageReaderUtil - CsvReader使用默認值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":",","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值爲[null] 2018-11-18 17:28:33.845 [0-0-0-reader] INFO Reader$Task - end read source files... 1 張三 hello 2 李四 hello 2018-11-18 17:28:34.134 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[715]ms 2018-11-18 17:28:34.137 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks. 2018-11-18 17:28:43.434 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 16 bytes | Speed 1B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.425s | Percentage 100.00% 2018-11-18 17:28:43.435 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks. 2018-11-18 17:28:43.436 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work. 2018-11-18 17:28:43.436 [job-0] INFO JobContainer - DataX Reader.Job [hdfsreader] do post work. 2018-11-18 17:28:43.437 [job-0] INFO JobContainer - DataX jobId [0] completed successfully. 2018-11-18 17:28:43.438 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /Users/FengZhen/Desktop/Hadoop/dataX/datax/hook 2018-11-18 17:28:43.446 [job-0] INFO JobContainer - [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime PS MarkSweep | 1 | 1 | 1 | 0.038s | 0.038s | 0.038s PS Scavenge | 1 | 1 | 1 | 0.020s | 0.020s | 0.020s 2018-11-18 17:28:43.446 [job-0] INFO JobContainer - PerfTrace not enable! 2018-11-18 17:28:43.447 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 16 bytes | Speed 1B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.425s | Percentage 100.00% 2018-11-18 17:28:43.448 [job-0] INFO JobContainer - 任務啓動時刻 : 2018-11-18 17:28:30 任務結束時刻 : 2018-11-18 17:28:43 任務總計耗時 : 12s 任務平均流量 : 1B/s 記錄寫入速度 : 0rec/s 讀出記錄總數 : 2 讀寫失敗總數 : 0
--path
Description: the path(s) of the files to read. To read multiple files you can use the wildcard "*"; multiple paths are also supported.
When a single HDFS file is specified, HdfsReader can currently only extract data with a single thread. Multi-threaded concurrent reading of a single (uncompressed) file is planned for a later phase.
When multiple HDFS files are specified, HdfsReader extracts data with multiple threads; the degree of concurrency is determined by the channel count.
When a wildcard is specified, HdfsReader tries to enumerate all matching files. For example, /* reads every file under the / directory, and /bazhen/* reads every file under the bazhen directory. HdfsReader currently only supports "*" and "?" as file wildcards.
Note in particular that DataX treats all files synchronized in one job as a single table; the user must ensure that all files conform to the same schema and that DataX has read permission on them.
Required: yes
Default: none
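As a minimal sketch (the table directory below is only an assumed example), a reader that picks up every file under one directory with the "*" wildcard could look like:
"reader": {
    "name": "hdfsreader",
    "parameter": {
        "path": "/user/hive/warehouse/mytable01/*",
        "defaultFS": "hdfs://192.168.1.121:8020",
        "fileType": "text",
        "column": ["*"],
        "fieldDelimiter": ","
    }
}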
--defaultFS
Description: the namenode address of the Hadoop HDFS file system.
HdfsReader already supports Kerberos authentication; if authentication is required, configure the kerberos parameters described below.
Required: yes
Default: none
--fileType
Description: the file type; only "text", "orc", "rc", "seq" and "csv" are accepted.
text: the textfile format
orc: the orcfile format
rc: the rcfile format
seq: the sequence file format
csv: a plain HDFS file (logical two-dimensional table)
Note in particular that although HdfsReader can automatically detect whether a file is an orcfile, a textfile or some other type, this option is still mandatory: HdfsReader only reads files of the configured type and ignores files of other formats under the path.
Also note that because textfile and orcfile are completely different file formats, HdfsReader parses them differently. As a result, Hive's complex composite types (such as map, array, struct and union) are rendered slightly differently when converted to DataX's String type. Taking map as an example:
an orcfile map parsed by hdfsreader and converted to the DataX string type becomes "{job=80, team=60, person=70}"
a textfile map parsed by hdfsreader and converted to the DataX string type becomes "job:80,team:60,person:70"
As these conversion results show, the data itself does not change, but its textual representation differs slightly. Therefore, if the fields to be synchronized under the configured path are composite types in Hive, it is recommended to stick to a single, consistent file format.
If you need a uniform representation of parsed composite types, we recommend converting textfile tables into orcfile tables on the Hive client.
Required: yes
Default: none
--column
Description: the list of fields to read. type specifies the type of the source data, index specifies which column of the text the field comes from (starting at 0), and value marks a constant column: instead of being read from the source file, the column is generated from the given value.
By default all data can be read as String, configured as:
"column": ["*"]
You can also specify the column information explicitly, for example:
"column": [
    {
        "type": "long",
        "index": 0          // take an integer field from the first column of the file
    },
    {
        "type": "string",
        "value": "alibaba"  // HdfsReader generates the constant string "alibaba" as this field
    }
]
When column information is specified, type is mandatory and exactly one of index/value must be given.
Required: yes
Default: read all columns as string
--fieldDelimiter
Description: the field delimiter used when reading.
Note that HdfsReader needs a field delimiter when reading textfile data; if none is specified it defaults to ','. When reading orcfile, no delimiter needs to be specified.
Required: no
Default: ,
--encoding
Description: the encoding of the files being read.
Required: no
Default: utf-8
--nullFormat
Description: there is no standard string for expressing null (a null pointer) in text files, so DataX provides nullFormat to define which strings are treated as null.
For example, if nullFormat:"\N" is configured, DataX treats a source value of "\N" as a null field.
Required: no
Default: none
--haveKerberos
Description: whether Kerberos authentication is enabled; defaults to false.
If set to true, the kerberosKeytabFilePath and kerberosPrincipal options become mandatory.
Required: mandatory when haveKerberos is true
Default: false
--kerberosKeytabFilePath
Description: absolute path to the keytab file used for Kerberos authentication.
Required: no
Default: none
--kerberosPrincipal
Description: the Kerberos principal, e.g. xxxx/hadoopclient@xxx.xxx
Required: mandatory when haveKerberos is true
Default: none
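When Kerberos is enabled, the three options are used together. A hedged sketch (the keytab path and principal below are purely illustrative):
"parameter": {
    "path": "/user/hive/warehouse/test/*",
    "defaultFS": "hdfs://192.168.1.121:8020",
    "fileType": "text",
    "column": ["*"],
    "fieldDelimiter": ",",
    "haveKerberos": true,
    "kerberosKeytabFilePath": "/etc/security/keytabs/datax.keytab",
    "kerberosPrincipal": "datax/hadoopclient@EXAMPLE.COM"
}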
--compress
Description: the compression codec of the files when fileType is csv. Currently gzip, bz2, zip, lzo, lzo_deflate, hadoop-snappy and framing-snappy are supported. Note that lzo comes in two variants, lzo and lzo_deflate, so take care to configure the right one. Also, because snappy has no single standard stream format, DataX currently supports only the two most common ones: hadoop-snappy (the snappy stream format used on Hadoop) and framing-snappy (the snappy stream format recommended by Google). This option is not needed for the orc file type.
Required: no
Default: none
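For illustration only (the path is an assumption), reading gzip-compressed csv files could be configured roughly like this:
"parameter": {
    "path": "/user/hive/warehouse/csv_table/*",
    "defaultFS": "hdfs://192.168.1.121:8020",
    "fileType": "csv",
    "column": ["*"],
    "fieldDelimiter": ",",
    "compress": "gzip"
}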
--hadoopConfig
Description: hadoopConfig can carry advanced Hadoop settings, for example an HA configuration:
"hadoopConfig":{
"dfs.nameservices": "testDfs",
"dfs.ha.namenodes.testDfs": "namenode1,namenode2",
"dfs.namenode.rpc-address.aliDfs.namenode1": "",
"dfs.namenode.rpc-address.aliDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
Required: no
Default: none
--csvReaderConfig
Description: parameters for reading CSV files, given as a Map. CSV files are read with CsvReader, which has many options; any option left unset falls back to its default value.
Required: no
Default: none
Common configuration:
"csvReaderConfig":{
"safetySwitch": false,
"skipEmptyRecords": false,
"useTextQualifier": false
}
All options and their default values are listed below; in the csvReaderConfig map the field names must be spelled exactly as shown:
boolean caseSensitive = true;
char textQualifier = 34;
boolean trimWhitespace = true;
boolean useTextQualifier = true; // whether to use the CSV text qualifier (escape character)
char delimiter = 44; // field delimiter
char recordDelimiter = 0;
char comment = 35;
boolean useComments = false;
int escapeMode = 1;
boolean safetySwitch = true; // whether to limit a single column to 100000 characters
boolean skipEmptyRecords = true; // whether to skip empty lines
boolean captureRawRecord = true;
Type conversion
Because the metadata of textfile and orcfile tables is maintained by Hive and stored in Hive's own metastore database (for example MySQL), and HdfsReader currently cannot query the Hive metastore, the user must specify the data type for each column; if column is configured as "*", every column is converted to String by default. HdfsReader suggests converting Hive data onto the DataX internal types Long, Double, String, Boolean and Date, where:
--Long means an integer represented as a string in the HDFS text file, e.g. "123456789".
--Double means a floating-point number represented as a string, e.g. "3.1415".
--Boolean means a boolean represented as a string, e.g. "true" or "false", case-insensitive.
--Date means a date represented as a string, e.g. "2014-12-31".
Special note:
--Hive's TIMESTAMP type can be precise to the nanosecond, so a TIMESTAMP stored in a textfile or orcfile looks like "2015-08-21 22:40:47.397898389". If such a column is converted to DataX's Date type, the nanosecond part is lost; to keep it, configure the conversion type as DataX's String instead.
Reading by partition
When creating a table, Hive lets you declare partitions, for example partition(day="20150820", hour="09"). On HDFS this adds the directories /20150820 and /09 under the table directory, with /20150820 being the parent of /09. Since every partition maps onto a directory in this way, reading all data of one partition of a table only requires setting the path value in the job JSON accordingly.
For example, to read all data of table mytable01 for the partition day=20150820, configure:
"path": "/user/hive/warehouse/mytable01/20150820/*"
HdfsWriter writes TEXTFile and ORCFile files to a specified path on HDFS; the file contents can then be mapped to a Hive table.
(1) HdfsWriter currently only supports the textfile and orcfile formats, and the file contents must represent a logical two-dimensional table;
(2) since HDFS is a file system with no notion of schema, writing only a subset of columns is not supported;
(3) only the following Hive data types are currently supported: numeric: TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE; string: STRING, VARCHAR, CHAR; boolean: BOOLEAN; date/time: DATE, TIMESTAMP. Not yet supported: decimal, binary, arrays, maps, structs, union;
(4) for partitioned Hive tables, only a single partition can be written per job;
(5) for textfile, the user must ensure that the delimiter used when writing to HDFS matches the delimiter defined when the Hive table was created, so that the written data lines up with the Hive table columns;
(6) HdfsWriter works as follows: based on the user-specified path it first creates a temporary directory that does not yet exist on HDFS (named path_random), writes the data read from the reader into that temporary directory, then, once everything is written, moves the files from the temporary directory into the target directory (making sure file names do not collide), and finally deletes the temporary directory. If the connection to HDFS is lost part-way through (for example due to a network interruption), the user has to delete the already written files and the temporary directory manually;
(7) the plugin currently bundles Hive 1.1.1 and Hadoop 2.7.1 (Apache, built against JDK 1.7); writing has been verified in test environments with Hadoop 2.5.0, Hadoop 2.6.0 and Hive 1.2.0; other versions still need further testing;
(8) HdfsWriter supports Kerberos authentication (note: if Kerberos authentication is required, the Hadoop version of the user's cluster must match the plugin's Hadoop version; if the cluster version is higher, Kerberos authentication is not guaranteed to work).
The job JSON is as follows:
{ "setting": {}, "job": { "setting": { "speed": { "channel": 2 } }, "content": [{ "reader": { "name": "txtfilereader", "parameter": { "path": ["/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt"], "encoding": "UTF-8", "column": [{ "index": 0, "type": "long" }, { "index": 1, "type": "long" }, { "index": 2, "type": "long" }, { "index": 3, "type": "long" }, { "index": 4, "type": "DOUBLE" }, { "index": 5, "type": "DOUBLE" }, { "index": 6, "type": "STRING" }, { "index": 7, "type": "STRING" }, { "index": 8, "type": "STRING" }, { "index": 9, "type": "BOOLEAN" }, { "index": 10, "type": "date" }, { "index": 11, "type": "date" } ], "fieldDelimiter": "`" } }, "writer": { "name": "hdfswriter", "parameter": { "defaultFS": "hdfs://192.168.1.121:8020", "fileType": "orc", "path": "/user/hive/warehouse/hdfswriter.db/orc_table", "fileName": "hdfswriter", "column": [{ "name": "col1", "type": "TINYINT" }, { "name": "col2", "type": "SMALLINT" }, { "name": "col3", "type": "INT" }, { "name": "col4", "type": "BIGINT" }, { "name": "col5", "type": "FLOAT" }, { "name": "col6", "type": "DOUBLE" }, { "name": "col7", "type": "STRING" }, { "name": "col8", "type": "VARCHAR" }, { "name": "col9", "type": "CHAR" }, { "name": "col10", "type": "BOOLEAN" }, { "name": "col11", "type": "date" }, { "name": "col12", "type": "TIMESTAMP" } ], "writeMode": "append", "fieldDelimiter": "`", "compress": "NONE" } } }] } }
--defaultFS
Description: the namenode address of the Hadoop HDFS file system, in the form hdfs://ip:port, e.g. hdfs://127.0.0.1:9000
Required: yes
Default: none
--fileType
Description: the file type; only "text" and "orc" are accepted.
text: the textfile format
orc: the orcfile format
Required: yes
Default: none
--path
Description: the target path on the Hadoop HDFS file system. HdfsWriter writes multiple files under this path according to the configured concurrency. To map the data to a Hive table, set this to the table's storage path on HDFS. For example, if the Hive warehouse is stored at /user/hive/warehouse/ and the table hello was created in database test, the corresponding path is /user/hive/warehouse/test.db/hello
Required: yes
Default: none
--fileName
Description: the file name used by HdfsWriter when writing; at run time a random suffix is appended to it to form the actual file name written by each thread.
Required: yes
Default: none
--column
Description: the fields to be written; writing only a subset of columns is not supported. To map the data to a Hive table, the name and type of every column of the table must be listed: name is the column name and type is the column type.
You can specify the column information as follows:
"column":
[
{
"name": "userName",
"type": "string"
},
{
"name": "age",
"type": "long"
}
]
Required: yes
Default: none
--writeMode
Description: how hdfswriter cleans up before writing:
♣ append: no cleanup before writing; DataX hdfswriter writes directly using fileName and guarantees that file names do not collide.
♣ nonConflict: if files prefixed with fileName already exist in the directory, the job fails with an error.
Required: yes
Default: none
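For example, switching the writer of the job above from append to nonConflict makes the job fail immediately if files prefixed with hdfswriter already exist in the target directory. A sketch of the relevant part (the mandatory column list is omitted here for brevity):
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "hdfs://192.168.1.121:8020",
        "fileType": "orc",
        "path": "/user/hive/warehouse/hdfswriter.db/orc_table",
        "fileName": "hdfswriter",
        "writeMode": "nonConflict",
        "fieldDelimiter": "`",
        "compress": "NONE"
    }
}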
--fieldDelimiter
Description: the field delimiter used by hdfswriter when writing; it must match the field delimiter of the Hive table that was created, otherwise the data cannot be queried from the Hive table.
Required: yes
Default: none
--compress
Description: the compression codec for the HDFS files; leaving it unset means no compression. text files support gzip and bzip2; orc files support NONE and SNAPPY (the SnappyCodec must be installed by the user).
Required: no
Default: no compression
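As a sketch, writing the text-format table created below with gzip compression could look like this (the mandatory column list is again omitted, and the path assumes the default warehouse location):
"writer": {
    "name": "hdfswriter",
    "parameter": {
        "defaultFS": "hdfs://192.168.1.121:8020",
        "fileType": "text",
        "path": "/user/hive/warehouse/hdfswriter.db/text_table",
        "fileName": "hdfswriter",
        "writeMode": "append",
        "fieldDelimiter": "`",
        "compress": "gzip"
    }
}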
--hadoopConfig
Description: hadoopConfig can carry advanced Hadoop settings, for example an HA configuration:
"hadoopConfig":{
"dfs.nameservices": "testDfs",
"dfs.ha.namenodes.testDfs": "namenode1,namenode2",
"dfs.namenode.rpc-address.aliDfs.namenode1": "",
"dfs.namenode.rpc-address.aliDfs.namenode2": "",
"dfs.client.failover.proxy.provider.testDfs": "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
}
Required: no
Default: none
--encoding
Description: the encoding used when writing files.
Required: no
Default: utf-8 (change with caution)
--haveKerberos
Description: whether Kerberos authentication is enabled; defaults to false.
If set to true, the kerberosKeytabFilePath and kerberosPrincipal options become mandatory.
Required: mandatory when haveKerberos is true
Default: false
--kerberosKeytabFilePath
Description: absolute path to the keytab file used for Kerberos authentication.
Required: no
Default: none
--kerberosPrincipal
Description: the Kerberos principal, e.g. xxxx/hadoopclient@xxx.xxx
Required: mandatory when haveKerberos is true
Default: none
HdfsWriter currently supports most Hive types; please check the types you use carefully. HdfsWriter converts the DataX internal types Long, Double, String, Boolean and Date into the Hive types listed in point (3) above.
Data preparation
data.txt
1`2`3`4`5`6`7`8`9`true`2018-11-18 09:15:30`2018-11-18 09:15:30
13`14`15`16`17`18`19`20`21`false`2018-11-18 09:16:30`2018-11-18 09:15:30
Create the table (text format):
create database IF NOT EXISTS hdfswriter;
use hdfswriter;
create table text_table(
    col1 TINYINT,
    col2 SMALLINT,
    col3 INT,
    col4 BIGINT,
    col5 FLOAT,
    col6 DOUBLE,
    col7 STRING,
    col8 VARCHAR(10),
    col9 CHAR(10),
    col10 BOOLEAN,
    col11 date,
    col12 TIMESTAMP
)
row format delimited fields terminated by "`"
STORED AS TEXTFILE;
And in ORC format:
create database IF NOT EXISTS hdfswriter;
use hdfswriter;
create table orc_table(
    col1 TINYINT,
    col2 SMALLINT,
    col3 INT,
    col4 BIGINT,
    col5 FLOAT,
    col6 DOUBLE,
    col7 STRING,
    col8 VARCHAR(10),
    col9 CHAR(10),
    col10 BOOLEAN,
    col11 date,
    col12 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '`'
STORED AS ORC;
Run the job JSON:
FengZhendeMacBook-Pro:bin FengZhen$ ./datax.py /Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/2.writer.json DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. 2018-11-18 21:08:16.212 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl 2018-11-18 21:08:16.232 [main] INFO Engine - the machine info => osInfo: Oracle Corporation 1.8 25.162-b12 jvmInfo: Mac OS X x86_64 10.13.4 cpu num: 4 totalPhysicalMemory: -0.00G freePhysicalMemory: -0.00G maxFileDescriptorCount: -1 currentOpenFileDescriptorCount: -1 GC Names [PS MarkSweep, PS Scavenge] MEMORY_NAME | allocation_size | init_size PS Eden Space | 256.00MB | 256.00MB Code Cache | 240.00MB | 2.44MB Compressed Class Space | 1,024.00MB | 0.00MB PS Survivor Space | 42.50MB | 42.50MB PS Old Gen | 683.00MB | 683.00MB Metaspace | -0.00MB | 0.00MB 2018-11-18 21:08:16.287 [main] INFO Engine - { "content":[ { "reader":{ "name":"txtfilereader", "parameter":{ "column":[ { "index":0, "type":"long" }, { "index":1, "type":"long" }, { "index":2, "type":"long" }, { "index":3, "type":"long" }, { "index":4, "type":"DOUBLE" }, { "index":5, "type":"DOUBLE" }, { "index":6, "type":"STRING" }, { "index":7, "type":"STRING" }, { "index":8, "type":"STRING" }, { "index":9, "type":"BOOLEAN" }, { "index":10, "type":"date" }, { "index":11, "type":"date" } ], "encoding":"UTF-8", "fieldDelimiter":"`", "path":[ "/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt" ] } }, "writer":{ "name":"hdfswriter", "parameter":{ "column":[ { "name":"col1", "type":"TINYINT" }, { "name":"col2", "type":"SMALLINT" }, { "name":"col3", "type":"INT" }, { "name":"col4", "type":"BIGINT" }, { "name":"col5", "type":"FLOAT" }, { "name":"col6", "type":"DOUBLE" }, { "name":"col7", "type":"STRING" }, { "name":"col8", "type":"VARCHAR" }, { "name":"col9", "type":"CHAR" }, { "name":"col10", "type":"BOOLEAN" }, { "name":"col11", "type":"date" }, { "name":"col12", "type":"TIMESTAMP" } ], "compress":"NONE", "defaultFS":"hdfs://192.168.1.121:8020", "fieldDelimiter":"`", "fileName":"hdfswriter", "fileType":"orc", "path":"/user/hive/warehouse/hdfswriter.db/orc_table", "writeMode":"append" } } } ], "setting":{ "speed":{ "channel":2 } } } 2018-11-18 21:08:16.456 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null 2018-11-18 21:08:16.460 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0 2018-11-18 21:08:16.460 [main] INFO JobContainer - DataX jobContainer starts job. 2018-11-18 21:08:16.464 [main] INFO JobContainer - Set jobId = 0 Nov 18, 2018 9:08:17 PM org.apache.hadoop.util.NativeCodeLoader <clinit> 警告: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-11-18 21:08:18.679 [job-0] INFO JobContainer - jobContainer starts to do prepare ... 2018-11-18 21:08:18.680 [job-0] INFO JobContainer - DataX Reader.Job [txtfilereader] do prepare work . 2018-11-18 21:08:18.682 [job-0] INFO TxtFileReader$Job - add file [/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt] as a candidate to be read. 2018-11-18 21:08:18.682 [job-0] INFO TxtFileReader$Job - 您即將讀取的文件數爲: [1] 2018-11-18 21:08:18.683 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do prepare work . 2018-11-18 21:08:18.933 [job-0] INFO HdfsWriter$Job - 因爲您配置了writeMode append, 寫入前不作清理工做, [/user/hive/warehouse/hdfswriter.db/orc_table] 目錄下寫入相應文件名前綴 [hdfswriter] 的文件 2018-11-18 21:08:18.934 [job-0] INFO JobContainer - jobContainer starts to do split ... 
2018-11-18 21:08:18.934 [job-0] INFO JobContainer - Job set Channel-Number to 2 channels. 2018-11-18 21:08:18.937 [job-0] INFO JobContainer - DataX Reader.Job [txtfilereader] splits to [1] tasks. 2018-11-18 21:08:18.938 [job-0] INFO HdfsWriter$Job - begin do split... 2018-11-18 21:08:18.995 [job-0] INFO HdfsWriter$Job - splited write file name:[hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__015cbc4e_94e7_4693_a543_6290deb25115/hdfswriter__0c580bd3_2265_4be7_8def_e6e024066c68] 2018-11-18 21:08:18.995 [job-0] INFO HdfsWriter$Job - end do split. 2018-11-18 21:08:18.995 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] splits to [1] tasks. 2018-11-18 21:08:19.026 [job-0] INFO JobContainer - jobContainer starts to do schedule ... 2018-11-18 21:08:19.050 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups. 2018-11-18 21:08:19.064 [job-0] INFO JobContainer - Running by standalone Mode. 2018-11-18 21:08:19.076 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks. 2018-11-18 21:08:19.112 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated. 2018-11-18 21:08:19.113 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated. 2018-11-18 21:08:19.133 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started 2018-11-18 21:08:19.133 [0-0-0-reader] INFO TxtFileReader$Task - reading file : [/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt] 2018-11-18 21:08:19.199 [0-0-0-writer] INFO HdfsWriter$Task - begin do write... 2018-11-18 21:08:19.200 [0-0-0-writer] INFO HdfsWriter$Task - write to file : [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__015cbc4e_94e7_4693_a543_6290deb25115/hdfswriter__0c580bd3_2265_4be7_8def_e6e024066c68] 2018-11-18 21:08:19.342 [0-0-0-reader] INFO UnstructuredStorageReaderUtil - CsvReader使用默認值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":"`","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值爲[null] 2018-11-18 21:08:19.355 [0-0-0-reader] ERROR StdoutPluginCollector - 髒數據: {"message":"類型轉換錯誤, 沒法將[10] 轉換爲[BOOLEAN]","record":[{"byteSize":1,"index":0,"rawData":1,"type":"LONG"},{"byteSize":1,"index":1,"rawData":2,"type":"LONG"},{"byteSize":1,"index":2,"rawData":3,"type":"LONG"},{"byteSize":1,"index":3,"rawData":4,"type":"LONG"},{"byteSize":1,"index":4,"rawData":"5","type":"DOUBLE"},{"byteSize":1,"index":5,"rawData":"6","type":"DOUBLE"},{"byteSize":1,"index":6,"rawData":"7","type":"STRING"},{"byteSize":1,"index":7,"rawData":"8","type":"STRING"},{"byteSize":1,"index":8,"rawData":"9","type":"STRING"}],"type":"reader"} 2018-11-18 21:08:19.358 [0-0-0-reader] ERROR StdoutPluginCollector - 髒數據: {"message":"類型轉換錯誤, 沒法將[22] 轉換爲[BOOLEAN]","record":[{"byteSize":2,"index":0,"rawData":13,"type":"LONG"},{"byteSize":2,"index":1,"rawData":14,"type":"LONG"},{"byteSize":2,"index":2,"rawData":15,"type":"LONG"},{"byteSize":2,"index":3,"rawData":16,"type":"LONG"},{"byteSize":2,"index":4,"rawData":"17","type":"DOUBLE"},{"byteSize":2,"index":5,"rawData":"18","type":"DOUBLE"},{"byteSize":2,"index":6,"rawData":"19","type":"STRING"},{"byteSize":2,"index":7,"rawData":"20","type":"STRING"},{"byteSize":2,"index":8,"rawData":"21","type":"STRING"}],"type":"reader"} 2018-11-18 21:08:22.031 [0-0-0-writer] INFO 
HdfsWriter$Task - end do write 2018-11-18 21:08:22.115 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[2987]ms 2018-11-18 21:08:22.116 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks. 2018-11-18 21:08:29.150 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 27 bytes | Speed 2B/s, 0 records/s | Error 2 records, 27 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00% 2018-11-18 21:08:29.151 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks. 2018-11-18 21:08:29.152 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do post work. 2018-11-18 21:08:29.152 [job-0] INFO HdfsWriter$Job - start rename file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__015cbc4e_94e7_4693_a543_6290deb25115/hdfswriter__0c580bd3_2265_4be7_8def_e6e024066c68] to file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table/hdfswriter__0c580bd3_2265_4be7_8def_e6e024066c68]. 2018-11-18 21:08:29.184 [job-0] INFO HdfsWriter$Job - finish rename file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__015cbc4e_94e7_4693_a543_6290deb25115/hdfswriter__0c580bd3_2265_4be7_8def_e6e024066c68] to file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table/hdfswriter__0c580bd3_2265_4be7_8def_e6e024066c68]. 2018-11-18 21:08:29.184 [job-0] INFO HdfsWriter$Job - start delete tmp dir [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__015cbc4e_94e7_4693_a543_6290deb25115] . 2018-11-18 21:08:29.217 [job-0] INFO HdfsWriter$Job - finish delete tmp dir [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__015cbc4e_94e7_4693_a543_6290deb25115] . 2018-11-18 21:08:29.218 [job-0] INFO JobContainer - DataX Reader.Job [txtfilereader] do post work. 2018-11-18 21:08:29.218 [job-0] INFO JobContainer - DataX jobId [0] completed successfully. 2018-11-18 21:08:29.219 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /Users/FengZhen/Desktop/Hadoop/dataX/datax/hook 2018-11-18 21:08:29.330 [job-0] INFO JobContainer - [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime PS MarkSweep | 1 | 1 | 1 | 0.042s | 0.042s | 0.042s PS Scavenge | 1 | 1 | 1 | 0.020s | 0.020s | 0.020s 2018-11-18 21:08:29.330 [job-0] INFO JobContainer - PerfTrace not enable! 2018-11-18 21:08:29.330 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 27 bytes | Speed 0B/s, 0 records/s | Error 2 records, 27 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00% 2018-11-18 21:08:29.331 [job-0] INFO JobContainer - 任務啓動時刻 : 2018-11-18 21:08:16 任務結束時刻 : 2018-11-18 21:08:29 任務總計耗時 : 12s 任務平均流量 : 0B/s 記錄寫入速度 : 0rec/s 讀出記錄總數 : 2 讀寫失敗總數 : 2
Because the data contained values that did not match the declared types, the data format was corrected and the job was executed again:
FengZhendeMacBook-Pro:bin FengZhen$ ./datax.py /Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/2.writer_orc.json DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. 2018-11-18 21:16:18.386 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl 2018-11-18 21:16:18.401 [main] INFO Engine - the machine info => osInfo: Oracle Corporation 1.8 25.162-b12 jvmInfo: Mac OS X x86_64 10.13.4 cpu num: 4 totalPhysicalMemory: -0.00G freePhysicalMemory: -0.00G maxFileDescriptorCount: -1 currentOpenFileDescriptorCount: -1 GC Names [PS MarkSweep, PS Scavenge] MEMORY_NAME | allocation_size | init_size PS Eden Space | 256.00MB | 256.00MB Code Cache | 240.00MB | 2.44MB Compressed Class Space | 1,024.00MB | 0.00MB PS Survivor Space | 42.50MB | 42.50MB PS Old Gen | 683.00MB | 683.00MB Metaspace | -0.00MB | 0.00MB 2018-11-18 21:16:18.452 [main] INFO Engine - { "content":[ { "reader":{ "name":"txtfilereader", "parameter":{ "column":[ { "index":0, "type":"long" }, { "index":1, "type":"long" }, { "index":2, "type":"long" }, { "index":3, "type":"long" }, { "index":4, "type":"DOUBLE" }, { "index":5, "type":"DOUBLE" }, { "index":6, "type":"STRING" }, { "index":7, "type":"STRING" }, { "index":8, "type":"STRING" }, { "index":9, "type":"BOOLEAN" }, { "index":10, "type":"date" }, { "index":11, "type":"date" } ], "encoding":"UTF-8", "fieldDelimiter":"`", "path":[ "/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt" ] } }, "writer":{ "name":"hdfswriter", "parameter":{ "column":[ { "name":"col1", "type":"TINYINT" }, { "name":"col2", "type":"SMALLINT" }, { "name":"col3", "type":"INT" }, { "name":"col4", "type":"BIGINT" }, { "name":"col5", "type":"FLOAT" }, { "name":"col6", "type":"DOUBLE" }, { "name":"col7", "type":"STRING" }, { "name":"col8", "type":"VARCHAR" }, { "name":"col9", "type":"CHAR" }, { "name":"col10", "type":"BOOLEAN" }, { "name":"col11", "type":"date" }, { "name":"col12", "type":"TIMESTAMP" } ], "compress":"NONE", "defaultFS":"hdfs://192.168.1.121:8020", "fieldDelimiter":"`", "fileName":"hdfswriter", "fileType":"orc", "path":"/user/hive/warehouse/hdfswriter.db/orc_table", "writeMode":"append" } } } ], "setting":{ "speed":{ "channel":2 } } } 2018-11-18 21:16:18.521 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null 2018-11-18 21:16:18.540 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0 2018-11-18 21:16:18.541 [main] INFO JobContainer - DataX jobContainer starts job. 2018-11-18 21:16:18.577 [main] INFO JobContainer - Set jobId = 0 Nov 18, 2018 9:16:19 PM org.apache.hadoop.util.NativeCodeLoader <clinit> 警告: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2018-11-18 21:16:20.377 [job-0] INFO JobContainer - jobContainer starts to do prepare ... 2018-11-18 21:16:20.378 [job-0] INFO JobContainer - DataX Reader.Job [txtfilereader] do prepare work . 2018-11-18 21:16:20.379 [job-0] INFO TxtFileReader$Job - add file [/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt] as a candidate to be read. 2018-11-18 21:16:20.379 [job-0] INFO TxtFileReader$Job - 您即將讀取的文件數爲: [1] 2018-11-18 21:16:20.380 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do prepare work . 
2018-11-18 21:16:21.428 [job-0] INFO HdfsWriter$Job - 因爲您配置了writeMode append, 寫入前不作清理工做, [/user/hive/warehouse/hdfswriter.db/orc_table] 目錄下寫入相應文件名前綴 [hdfswriter] 的文件 2018-11-18 21:16:21.428 [job-0] INFO JobContainer - jobContainer starts to do split ... 2018-11-18 21:16:21.429 [job-0] INFO JobContainer - Job set Channel-Number to 2 channels. 2018-11-18 21:16:21.430 [job-0] INFO JobContainer - DataX Reader.Job [txtfilereader] splits to [1] tasks. 2018-11-18 21:16:21.430 [job-0] INFO HdfsWriter$Job - begin do split... 2018-11-18 21:16:21.454 [job-0] INFO HdfsWriter$Job - splited write file name:[hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__c122e384_c22a_467a_b392_4e1042b2d033/hdfswriter__f55b962f_b945_401a_88fa_c1970d374dd1] 2018-11-18 21:16:21.454 [job-0] INFO HdfsWriter$Job - end do split. 2018-11-18 21:16:21.454 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] splits to [1] tasks. 2018-11-18 21:16:21.488 [job-0] INFO JobContainer - jobContainer starts to do schedule ... 2018-11-18 21:16:21.497 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups. 2018-11-18 21:16:21.505 [job-0] INFO JobContainer - Running by standalone Mode. 2018-11-18 21:16:21.519 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks. 2018-11-18 21:16:21.533 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated. 2018-11-18 21:16:21.534 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated. 2018-11-18 21:16:21.554 [0-0-0-reader] INFO TxtFileReader$Task - reading file : [/Users/FengZhen/Desktop/Hadoop/dataX/json/HDFS/data.txt] 2018-11-18 21:16:21.559 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started 2018-11-18 21:16:21.621 [0-0-0-writer] INFO HdfsWriter$Task - begin do write... 2018-11-18 21:16:21.622 [0-0-0-writer] INFO HdfsWriter$Task - write to file : [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__c122e384_c22a_467a_b392_4e1042b2d033/hdfswriter__f55b962f_b945_401a_88fa_c1970d374dd1] 2018-11-18 21:16:21.747 [0-0-0-reader] INFO UnstructuredStorageReaderUtil - CsvReader使用默認值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":"`","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值爲[null] 2018-11-18 21:16:22.444 [0-0-0-writer] INFO HdfsWriter$Task - end do write 2018-11-18 21:16:22.476 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[926]ms 2018-11-18 21:16:22.477 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks. 2018-11-18 21:16:31.572 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 61 bytes | Speed 6B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00% 2018-11-18 21:16:31.573 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks. 2018-11-18 21:16:31.573 [job-0] INFO JobContainer - DataX Writer.Job [hdfswriter] do post work. 
2018-11-18 21:16:31.576 [job-0] INFO HdfsWriter$Job - start rename file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__c122e384_c22a_467a_b392_4e1042b2d033/hdfswriter__f55b962f_b945_401a_88fa_c1970d374dd1] to file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table/hdfswriter__f55b962f_b945_401a_88fa_c1970d374dd1]. 2018-11-18 21:16:31.626 [job-0] INFO HdfsWriter$Job - finish rename file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__c122e384_c22a_467a_b392_4e1042b2d033/hdfswriter__f55b962f_b945_401a_88fa_c1970d374dd1] to file [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table/hdfswriter__f55b962f_b945_401a_88fa_c1970d374dd1]. 2018-11-18 21:16:31.627 [job-0] INFO HdfsWriter$Job - start delete tmp dir [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__c122e384_c22a_467a_b392_4e1042b2d033] . 2018-11-18 21:16:31.648 [job-0] INFO HdfsWriter$Job - finish delete tmp dir [hdfs://192.168.1.121:8020/user/hive/warehouse/hdfswriter.db/orc_table__c122e384_c22a_467a_b392_4e1042b2d033] . 2018-11-18 21:16:31.649 [job-0] INFO JobContainer - DataX Reader.Job [txtfilereader] do post work. 2018-11-18 21:16:31.649 [job-0] INFO JobContainer - DataX jobId [0] completed successfully. 2018-11-18 21:16:31.653 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /Users/FengZhen/Desktop/Hadoop/dataX/datax/hook 2018-11-18 21:16:31.766 [job-0] INFO JobContainer - [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime PS MarkSweep | 1 | 1 | 1 | 0.051s | 0.051s | 0.051s PS Scavenge | 1 | 1 | 1 | 0.023s | 0.023s | 0.023s 2018-11-18 21:16:31.766 [job-0] INFO JobContainer - PerfTrace not enable! 2018-11-18 21:16:31.766 [job-0] INFO StandAloneJobContainerCommunicator - Total 2 records, 61 bytes | Speed 6B/s, 0 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.000s | Percentage 100.00% 2018-11-18 21:16:31.773 [job-0] INFO JobContainer - 任務啓動時刻 : 2018-11-18 21:16:18 任務結束時刻 : 2018-11-18 21:16:31 任務總計耗時 : 13s 任務平均流量 : 6B/s 記錄寫入速度 : 0rec/s 讀出記錄總數 : 2 讀寫失敗總數 : 0