1. Official documentation on incremental import
2. Testing Sqoop's incremental import
In production, incremental imports usually need to run on a regular schedule, for example once a week. Since the import is executed over and over, writing out the full command for every run is tedious. Sqoop provides a convenient tool for exactly this: saved jobs (sqoop job).
The test uses --incremental to run in lastmodified mode. --check-column sets the column to check, here LASTMODIFIED: whenever a row's value in that column is updated, or a new row is inserted, the row is picked up by the import. --last-value sets the initial value '2014/8/27 13:00:00', which serves as the lower bound for the first import; from the second run onward, Sqoop automatically updates it to the upper bound of the previous import.
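The interaction of these three options can be sketched as a small script that assembles the incremental arguments before handing them to sqoop. This is only an illustration of how the flags fit together; the timestamp is the one used in this test, and the commented-out invocation uses placeholder connection details:

```shell
#!/bin/sh
# Sketch of the three incremental-import options described above.
# The initial lower bound for the very first run:
LAST_VALUE='2014/8/27 13:00:00'

# Assemble the incremental-mode arguments. After each successful run,
# sqoop itself replaces LAST_VALUE with the previous upper bound,
# so this value only matters for the first import.
ARGS="--incremental lastmodified --check-column LASTMODIFIED --last-value '${LAST_VALUE}'"

# The real invocation would look like (connection details are placeholders):
#   sqoop import --connect jdbc:oracle:thin:@host:1521/orcl \
#       --table oracletablename $ARGS
echo "$ARGS"
```

The point of the sketch is that only --last-value changes between runs, which is why a saved job (which persists that value for you) beats re-typing the command.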
Test setup: we implement the routine incremental import by creating a sqoop job. First, create a test table oracletablename in the relational database (Oracle) and insert two rows:
select * from oracletablename;

id  name  lastmodified
1   張三  2015-10-10 17:52:20.0
2   李四  2015-10-10 17:52:20.0
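The table above can be recreated from a short DDL script. The column types below are guesses reconstructed from the rows shown (the article never gives the real DDL), so treat them as assumptions; in practice you would feed this text to sqlplus against the test database:

```shell
#!/bin/sh
# Reconstruction of the test-table setup. Column types are assumptions
# inferred from the sample rows above, not the article's actual DDL.
# In practice, pipe this into sqlplus, e.g.:
#   printf '%s\n' "$ddl" | sqlplus DATACENTER/clear@192.168.27.235:1521/orcl
ddl=$(cat <<'SQL'
CREATE TABLE oracletablename (
    id           NUMBER,
    name         VARCHAR2(50),
    lastmodified TIMESTAMP
);
INSERT INTO oracletablename VALUES (1, '張三', TIMESTAMP '2015-10-10 17:52:20');
INSERT INTO oracletablename VALUES (2, '李四', TIMESTAMP '2015-10-10 17:52:20');
COMMIT;
SQL
)
printf '%s\n' "$ddl"
```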
(1) Create the sqoop job
sqoop job --create jobname -- import --connect jdbc:oracle:thin:@192.168.27.235:1521/orcl --username DATACENTER --password clear --table oracletablename --hive-import --hive-table hivetablename --incremental lastmodified --check-column LASTMODIFIED --last-value '2014/8/27 13:00:00'
Notes:
1) Do not specify -m in this job: with -m set, the import writes intermediate results to HDFS, and the next time you execute the job it fails with an "output directory already exists" error.
2) The Hive table hivetablename above must already exist. To make sure it exists before the first import, you can import the structure of oracletablename into Hive with the following command:
sqoop create-hive-table --connect jdbc:oracle:thin:@//192.168.27.235:1521/ORCL --username DATACENTER --password clear --table tablename
Once this completes, Hive contains a table with the same name and the same schema.
(2) Inspect and execute the job
With the job created as above, you can verify that it was created successfully using the following commands:
sqoop job --list    # list all jobs
sqoop job --show jobname    # show the details of jobname
sqoop job --delete jobname    # delete jobname
sqoop job --exec jobname    # execute jobname
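Since the whole point of a saved job is periodic execution (the article mentions running once a week), the --exec call is typically wrapped in a small script and scheduled with cron. A minimal sketch; the job name matches the one created above, but the log path and crontab line are illustrative assumptions. SQOOP_CMD defaults to echo here so the sketch is a harmless dry run; in production you would set it to the real sqoop binary:

```shell
#!/bin/sh
# run_incremental.sh -- weekly wrapper for the saved job, meant for cron.
# The log path is a hypothetical choice for this sketch.
JOBNAME="jobname"
LOGFILE="/tmp/sqoop-${JOBNAME}.log"

# SQOOP_CMD defaults to "echo" so this sketch is a dry run;
# in production: SQOOP_CMD=sqoop (or the full path to the sqoop binary).
SQOOP_CMD="${SQOOP_CMD:-echo}"

# Execute the saved job; sqoop itself updates the stored --last-value
# after a successful run, so the script needs no bookkeeping of its own.
"$SQOOP_CMD" job --exec "$JOBNAME" > "$LOGFILE" 2>&1
echo "exit status: $?"

# Example crontab entry (every Monday at 02:00), using the real binary:
#   0 2 * * 1 SQOOP_CMD=sqoop /path/to/run_incremental.sh
```

Because the saved job carries the updated last-value between runs, the cron wrapper stays completely stateless.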
(3) After the job finishes, check whether the Hive table contains data; barring surprises, it certainly will.
During execution, we can see the corresponding log output:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/10/12 15:59:37 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 15:59:37 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TEMP2 t WHERE 1=0
15/10/12 15:59:37 INFO tool.ImportTool: Incremental import based on column LASTMODIFIED
15/10/12 15:59:37 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2014/8/27 13:00:00', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 15:59:37 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 15:59:37 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 15:59:37 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 15:59:37 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 15:59:37 INFO mapreduce.ImportJobBase: Beginning import of TEMP2
15/10/12 15:59:37 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/10/12 15:59:37 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 15:59:37 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/10/12 15:59:37 INFO client.RMProxy: Connecting to ResourceManager at hadoop3/192.168.27.233:8032
15/10/12 15:59:42 INFO db.DBInputFormat: Using read committed transaction isolation
15/10/12 15:59:42 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM TEMP2 WHERE ( LASTMODIFIED >= TO_TIMESTAMP('2014/8/27 13:00:00', 'YYYY-MM-DD HH24:MI:SS.FF') AND LASTMODIFIED < TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF') )
15/10/12 15:59:42 INFO mapreduce.JobSubmitter: number of splits:4
Note: the bounding query in the log above shows exactly how sqoop performs the import, and that the value we set with --last-value becomes the lower bound.
(4) In Oracle, add a row to oracletablename
id  name  lastmodified
1   張三  2015-10-10 17:52:20.0
2   李四  2015-10-10 17:52:20.0
3   李四  2015-10-12 16:01:23.0
(5) Now perform the incremental import
That is, execute the job once more: sqoop job --exec jobname
The log from this run reads as follows:
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
15/10/12 16:02:17 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 16:02:17 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM TEMP2 t WHERE 1=0
15/10/12 16:02:17 INFO tool.ImportTool: Incremental import based on column LASTMODIFIED
15/10/12 16:02:17 INFO tool.ImportTool: Lower bound value: TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 16:02:17 INFO tool.ImportTool: Upper bound value: TO_TIMESTAMP('2015-10-12 16:02:15.0', 'YYYY-MM-DD HH24:MI:SS.FF')
15/10/12 16:02:17 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 16:02:17 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 16:02:17 WARN manager.OracleManager: The table TEMP2 contains a multi-column primary key. Sqoop will default to the column ID only for this job.
15/10/12 16:02:17 INFO mapreduce.ImportJobBase: Beginning import of TEMP2
15/10/12 16:02:17 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
15/10/12 16:02:17 INFO manager.OracleManager: Time zone has been set to GMT
15/10/12 16:02:17 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
15/10/12 16:02:17 INFO client.RMProxy: Connecting to ResourceManager at hadoop3/192.168.27.233:8032
15/10/12 16:02:23 INFO db.DBInputFormat: Using read committed transaction isolation
15/10/12 16:02:23 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(ID), MAX(ID) FROM TEMP2 WHERE ( LASTMODIFIED >= TO_TIMESTAMP('2015-10-12 15:59:35.0', 'YYYY-MM-DD HH24:MI:SS.FF') AND LASTMODIFIED < TO_TIMESTAMP('2015-10-12 16:02:15.0', 'YYYY-MM-DD HH24:MI:SS.FF') )
15/10/12 16:02:23 WARN db.BigDecimalSplitter: Set BigDecimal splitSize to MIN_INCREMENT
15/10/12 16:02:23 INFO mapreduce.JobSubmitter: number of splits:1
Note: the log shows that --last-value has been updated automatically to the upper bound of the previous run; just compare it with the upper bound from the previous import.
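One way to confirm the updated bound without rerunning the job is to read it back from the saved job itself: sqoop job --show prints the stored properties, among them incremental.last.value (the property name as observed in Sqoop 1.4.x output; treat it as an assumption for your version). A sketch that parses that line; the sample text stands in for the real command's output:

```shell
#!/bin/sh
# Extract the stored last-value from `sqoop job --show jobname` output.
# The sample line below stands in for the real command's output;
# in practice you would pipe:  sqoop job --show jobname | sed -n ...
show_output='incremental.last.value = 2015-10-12 16:02:15.0'

last_value=$(printf '%s\n' "$show_output" \
    | sed -n 's/^incremental\.last\.value = //p')
echo "stored last-value: $last_value"
```

Comparing this value with the "Upper bound value" line of the previous run's log is a quick sanity check that the saved job advanced its window correctly.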