DataXhtml
DataX 是阿里巴巴集團內被普遍使用的離線數據同步工具/平臺,實現包括 MySQL、SQL Server、Oracle、PostgreSQL、HDFS、Hive、HBase、OTS、ODPS 等各類異構數據源之間高效的數據同步功能。
Featuresjava
DataX自己做爲數據同步框架,將不一樣數據源的同步抽象爲從源頭數據源讀取數據的Reader插件,以及向目標端寫入數據的Writer插件,理論上DataX框架能夠支持任意數據源類型的數據同步工做。同時DataX插件體系做爲一套生態系統, 每接入一套新數據源該新加入的數據源便可實現和現有的數據源互通。
System Requirementspython
Linux JDK(1.8以上,推薦1.8) Python(推薦Python2.6.X) Apache Maven 3.X(Compile DataX)
Quick Startmysql
工具部署linux
方法一、直接下DataX工具包:DataX下載地址,下載後解壓至本地某個目錄,進入bin目錄,便可運行同步做業git
$ cd {YOUR_DATAX_HOME}/bin $ python datax.py {YOUR_JOB.json}
方法二、下載DataX源碼,本身編譯:DataX源碼github
①.安裝JDKsql
tar xvf jdk-8u151-linux-x64.tar.gz mv jdk1.8.0_151 vim /etc/profile.d/jdk.sh export JAVA_HOME=/usr/local/jdk1.8.0_151 export JAVA_BIN=$JAVA_HOME/bin export PATH=$PATH:$JAVA_BIN export CLASSPATH=$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar source /etc/profile.d/jdk.sh
檢測安裝是否成功apache
[root@oracle ~]# java -version java version "1.8.0_151" Java(TM) SE Runtime Environment (build 1.8.0_151-b12) Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
②.查看Python版本,若是不知足則須要自行安裝json
[root@oracle ~]# python -V Python 2.6.6
③.安裝Maven
開始安裝配置
tar xvf apache-maven-3.6.1-bin.tar.gz mv apache-maven-3.6.1-bin.tar ../maven vim /etc/profile M2_HOME=/usr/local/maven export PATH=${M2_HOME}/bin:/u01/mysql/bin:${PATH}
驗證Maven是否安裝成功
[root@oracle src]# mvn -v Apache Maven 3.6.1 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00) Maven home: /usr/local/maven Java version: 1.8.0_151, vendor: Oracle Corporation Java home: /usr/local/jdk1.8.0_151/jre Default locale: en_US, platform encoding: UTF-8 OS name: "linux", version: "3.10.0-862.el7.x86_64", arch: "amd64", family: "unix"
系統需求已配置完成,開始源碼安裝DataX,下載方法二選其一
下載地址:https://github.com/alibaba/DataX git clone git@github.com:alibaba/DataX.git
開始源碼安裝
unzip DataX-master.zip mv DataX-master ../ cd ../DataX-master mvn -U clean package assembly:assembly -Dmaven.test.skip=true
該過程很是很是的漫長,須要等待,打包成功,最終顯示以下
[INFO] BUILD SUCCESS [INFO] ----------------------------------------------------------------- [INFO] Total time: 08:12 min [INFO] Finished at: 2015-12-13T16:26:48+08:00 [INFO] Final Memory: 133M/960M [INFO] -----------------------------------------------------------------
打包成功後的DataX包位於{DataX_source_code_home}/target/datax/datax/,結構以下:
cd /usr/local/Datax-master [root@oracle DataX-master]# ls -a ./target/datax/datax/ . .. bin conf job lib log log_perf plugin script tmp
配置示例:從stream讀取數據並打印到控制檯,
第一步。建立創業的配置文件(json格式),能夠經過命令查看配置模板:python datax.py -r {YOUR_READER} -w {YOUR_WRITER}
cd /usr/local/DataX-master/target/datax/datax/bin ./datax.py -r streamreader -w streamwriter DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. Please refer to the streamreader document: https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md Please refer to the streamwriter document: https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md Please save the following configuration as a json file and use python {DATAX_HOME}/bin/datax.py {JSON_FILE_NAME}.json to run the job. { "job": { "content": [ { "reader": { "name": "streamreader", "parameter": { "column": [], "sliceRecordCount": "" } }, "writer": { "name": "streamwriter", "parameter": { "encoding": "", "print": true } } } ], "setting": { "speed": { "channel": "" } } } }
根據模板配置本身所需的json,具體以下
{ "job": { "content": [ { "reader": { "name": "oraclereader", "parameter": { "column": ["zybh","xmbh","jsbh","xmmc","sfje","fylx","bz","sjc","zllx"], "connection": [ { "jdbcUrl": ["jdbc:oracle:thin:@192.168.11.91:1521:orcl"], "table": ["JDZLJCXM"] } ], "password": "jgzdwffz", "username": "bjxxjgxt", } }, "writer": { "name": "mysqlwriter", "parameter": { "column": ["zybh","xmbh","jsbh","xmmc","sfje","fylx","bz","sjc","zllx"], "connection": [ { "jdbcUrl": "jdbc:mysql://192.168.11.75:3336/prison_practical_platform", "table": ["JDZLJCXM"] } ], "password": "root", "username": "root", } } } ], "setting": { "speed": { "channel": "5" } } } }
第二步。啓動DataX
cd /usr/local/DataX-master/target/datax/datax/bin ./datax.py ../job/oracle11mysql8.json
同步結束,相關日誌以下:
DataX (DATAX-OPENSOURCE-3.0), From Alibaba ! Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved. 2019-05-24 11:59:06.065 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl 2019-05-24 11:59:06.075 [main] INFO Engine - the machine info => osInfo: Oracle Corporation 1.8 25.161-b12 jvmInfo: Linux amd64 3.10.0-862.el7.x86_64 cpu num: 1 totalPhysicalMemory: -0.00G freePhysicalMemory: -0.00G maxFileDescriptorCount: -1 currentOpenFileDescriptorCount: -1 GC Names [Copy, MarkSweepCompact] MEMORY_NAME | allocation_size | init_size Eden Space | 273.06MB | 273.06MB Code Cache | 240.00MB | 2.44MB Survivor Space | 34.13MB | 34.13MB Compressed Class Space | 1,024.00MB | 0.00MB Metaspace | -0.00MB | 0.00MB Tenured Gen | 682.69MB | 682.69MB 2019-05-24 11:59:06.093 [main] INFO Engine - { "content":[ { "reader":{ "name":"oraclereader", "parameter":{ "column":[ "zybh", "xmbh", "jsbh", "xmmc", "sfje", "fylx", "bz", "sjc", "zllx" ], "connection":[ { "jdbcUrl":[ "jdbc:oracle:thin:@192.168.11.91:1521:orcl" ], "table":[ "JDZLJCXM" ] } ], "password":"********", "username":"bjxxjgxt" } }, "writer":{ "name":"mysqlwriter", "parameter":{ "column":[ "zybh", "xmbh", "jsbh", "xmmc", "sfje", "fylx", "bz", "sjc", "zllx" ], "connection":[ { "jdbcUrl":"jdbc:mysql://192.168.11.75:3336/prison_practical_platform", "table":[ "JDZLJCXM" ] } ], "password":"****", "username":"root" } } } ], "setting":{ "speed":{ "channel":"5" } } } 2019-05-24 11:59:06.111 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null 2019-05-24 11:59:06.121 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0 2019-05-24 11:59:06.121 [main] INFO JobContainer - DataX jobContainer starts job. 2019-05-24 11:59:06.122 [main] INFO JobContainer - Set jobId = 0 2019-05-24 11:59:06.488 [job-0] INFO OriginalConfPretreatmentUtil - Available jdbcUrl:jdbc:oracle:thin:@192.168.11.91:1521:orcl. 2019-05-24 11:59:06.605 [job-0] INFO OriginalConfPretreatmentUtil - table:[JDZLJCXM] has columns:[ZYBH,XMBH,JSBH,XMMC,SFJE,FYLX,BZ,SJC,ZLLX]. 2019-05-24 11:59:06.952 [job-0] INFO OriginalConfPretreatmentUtil - table:[JDZLJCXM] all columns:[ ZYBH,XMBH,JSBH,XMMC,SFJE,FYLX,BZ,SJC,ZLLX ]. 2019-05-24 09:48:44.768 [job-0] INFO OriginalConfPretreatmentUtil - Write data [ INSERT INTO %s (zybh,xmbh,jsbh,xmmc,sfje,fylx,bz,sjc,zllx) VALUES(?,?,?,?,?,?,?,?,?) ], which jdbcUrl like:[jdbc:mysql://192.168.11.75:3336/prison_practical_platform?yearIsDateType=false&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true] 2019-05-24 09:48:44.768 [job-0] INFO JobContainer - jobContainer starts to do prepare ... 2019-05-24 09:48:44.769 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] do prepare work . 2019-05-24 09:48:44.769 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do prepare work . 2019-05-24 09:48:44.769 [job-0] INFO JobContainer - jobContainer starts to do split ... 2019-05-24 09:48:44.769 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels. 2019-05-24 09:48:44.772 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] splits to [1] tasks. 2019-05-24 09:48:44.772 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] splits to [1] tasks. 2019-05-24 09:48:44.820 [job-0] INFO JobContainer - jobContainer starts to do schedule ... 2019-05-24 09:48:44.830 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups. 2019-05-24 09:48:44.832 [job-0] INFO JobContainer - Running by standalone Mode. 2019-05-24 09:48:44.866 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks. 2019-05-24 09:48:44.876 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated. 2019-05-24 09:48:44.877 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated. 2019-05-24 09:48:44.898 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started 2019-05-24 09:48:44.901 [0-0-0-reader] INFO CommonRdbmsReader$Task - Begin to read record by Sql: [select zybh,xmbh,jsbh,xmmc,sfje,fylx,bz,sjc,zllx from JDZLJCXM ] jdbcUrl:[jdbc:oracle:thin:@192.168.11.91:1521:orcl]. 2019-05-24 09:48:45.050 [0-0-0-reader] INFO CommonRdbmsReader$Task - Finished read record by Sql: [select zybh,xmbh,jsbh,xmmc,sfje,fylx,bz,sjc,zllx from JDZLJCXM ] jdbcUrl:[jdbc:oracle:thin:@192.168.11.91:1521:orcl]. 2019-05-24 09:48:46.419 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[1538]ms 2019-05-24 09:48:46.420 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks. 2019-05-24 09:48:54.873 [job-0] INFO StandAloneJobContainerCommunicator - Total 1554 records, 79079 bytes | Speed 7.72KB/s, 155 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.020s | All Task WaitReaderTime 0.124s | Percentage 100.00% 2019-05-24 09:48:54.874 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks. 2019-05-24 09:48:54.874 [job-0] INFO JobContainer - DataX Writer.Job [mysqlwriter] do post work. 2019-05-24 09:48:54.874 [job-0] INFO JobContainer - DataX Reader.Job [oraclereader] do post work. 2019-05-24 09:48:54.874 [job-0] INFO JobContainer - DataX jobId [0] completed successfully. 2019-05-24 09:48:54.875 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /usr/local/DataX-master/target/datax/datax/hook 2019-05-24 09:48:54.875 [job-0] INFO JobContainer - [total cpu info] => averageCpu | maxDeltaCpu | minDeltaCpu -1.00% | -1.00% | -1.00% [total gc info] => NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime Copy | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s MarkSweepCompact | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s 2019-05-24 09:48:54.875 [job-0] INFO JobContainer - PerfTrace not enable! 2019-05-24 09:48:54.876 [job-0] INFO StandAloneJobContainerCommunicator - Total 1554 records, 79079 bytes | Speed 7.72KB/s, 155 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.020s | All Task WaitReaderTime 0.124s | Percentage 100.00% 2019-05-24 09:48:54.876 [job-0] INFO JobContainer - 任務啓動時刻 : 2019-05-24 09:48:43 任務結束時刻 : 2019-05-24 09:48:54 任務總計耗時 : 11s 任務平均流量 : 7.72KB/s 記錄寫入速度 : 155rec/s 讀出記錄總數 : 1554 讀寫失敗總數 : 0
在配置過程當中出現的問題,錯誤以下
[ERROR] Failed to execute goal on project otsstreamreader: Could not resolve dependencies for project com.alibaba.datax:otsstreamreader:jar:1.0.0-SNAPSHOT: Could not find artifact com.aliyun.openservices:tablestore-streamclient:jar:1.0.0-SNAPSHOT -> [Help 1]
該錯誤是因爲快照版本不一致,因爲ots基本不會被用到,直接把pom.xml中的<module>ots</module>去掉,也能夠更改版本otsstreamreader中的默認版本爲0.0.1,改成1.0.0
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-jar-plugin:2.4:jar (default-jar) on project ocswriter: Error assembling JAR: /Users/FengZhen/Desktop/Hadoop/DataX/源碼/DataX/ocswriter/pom.xml isn't a file. -> [Help 1]
將ocs註釋掉,從新打包便可。
參考來源:https://github.com/alibaba/DataX/blob/master/userGuid.md