Importing data into Hive with Sqoop

Date: 2019-11-18



1.1 The hive-import parameter

Adding --hive-import to the command is enough to import data into Hive. Run as shown, however, the following command fails with the error below:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person -m 1 --hive-import

16/07/22 02:22:58 ERROR tool.ImportTool: Encountered IOException running import job: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://192.168.223.129:9000/user/root/person already exists
    at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
    at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:562)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:432)

The import fails because a directory named person already exists under the user's home directory on HDFS (/user/root).

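The underlying reason is that a Hive import runs in stages: Sqoop first imports the data into a directory on HDFS, then loads it from there into Hive, and finally deletes that directory. If the directory already exists when the job starts, the import fails.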

1.2 Using the target-dir parameter to specify a temporary directory

To fix the problem above, either delete the existing person directory or use --target-dir to point the import at a temporary directory:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person -m 1 --hive-import --target-dir temp

Once the import finishes, the table is visible in Hive:

hive> select * from person;
OK
1    zhangsan
2    LISI
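
The other remedy mentioned above, deleting the leftover directory, could be sketched as follows. The helper name clean_staging_dir is hypothetical (it is not part of Sqoop or Hadoop), and the sketch assumes the hdfs command is on the PATH and that the directory holds nothing worth keeping.

```shell
# clean_staging_dir is a hypothetical helper name, not part of Sqoop or Hadoop.
# It removes a leftover staging directory only if it actually exists.
clean_staging_dir() {
  dir="$1"
  # `hdfs dfs -test -d` exits 0 when the path exists and is a directory.
  if hdfs dfs -test -d "$dir"; then
    hdfs dfs -rm -r "$dir"
  else
    echo "nothing to remove: $dir"
  fi
}
```

Calling clean_staging_dir /user/root/person before re-running the command from 1.1 should clear the directory named in the FileAlreadyExistsException. Checking first keeps the script from failing when the directory is already gone.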

1.3 The hive-overwrite parameter

If the command above is run more than once, each run adds another copy of the table's data.

After three runs, the Hive table contains:

hive> select * from person;
OK
1    zhangsan
2    LISI
1    zhangsan
2    LISI
1    zhangsan
2    LISI
Time taken: 2.079 seconds, Fetched: 6 row(s)

On HDFS this shows up as:

hive> dfs -ls /user/hive/warehouse/person;
Found 3 items
-rwxrwxrwt   3 18232184201 supergroup         18 2016-07-22 17:48 /user/hive/warehouse/person/part-m-00000
-rwxrwxrwt   3 18232184201 supergroup         18 2016-07-22 17:51 /user/hive/warehouse/person/part-m-00000_copy_1
-rwxrwxrwt   3 18232184201 supergroup         18 2016-07-22 17:52 /user/hive/warehouse/person/part-m-00000_copy_2

To replace the table's existing data instead of adding to it, use the --hive-overwrite parameter:

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person --hive-import --target-dir temp -m 1 --hive-overwrite
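
One way to check whether copies have piled up, and whether an overwrite cleared them, is to count the part files under the table's warehouse directory. This is a minimal sketch: count_part_files is a hypothetical helper name, and it assumes the hdfs command is on the PATH and that the table lives under /user/hive/warehouse as in the listing above.

```shell
# count_part_files is a hypothetical helper name.
# It counts the part-m-* files that Sqoop's map tasks wrote under a table directory.
count_part_files() {
  # The "Found N items" header line does not match, so only data files are counted.
  hdfs dfs -ls "$1" | grep -c 'part-m-'
}
```

With the listing shown earlier, count_part_files /user/hive/warehouse/person would report 3. After an import with --hive-overwrite and -m 1, the count would be expected to drop back to 1.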

1.4 The fields-terminated-by parameter

When data is imported from MySQL into HDFS, the default field delimiter is a comma.

When data is imported into Hive, the default is Hive's own default field delimiter for tables, \u0001 (Ctrl-A):

Storage Desc Params:          
    field.delim             \u0001              
    line.delim              \n                  
    serialization.format    \u0001
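
Because \u0001 (Ctrl-A) is a non-printing character, rows read straight from the warehouse files look as if the fields were run together. A minimal sketch for making the delimiter visible is shown below; show_ctrl_a is a hypothetical helper name, and the two sample rows are made up to match the person table used above.

```shell
# show_ctrl_a is a hypothetical helper name.
# It reads rows on stdin and replaces Hive's default field delimiter
# (Ctrl-A, octal 001) with a visible comma.
show_ctrl_a() {
  tr '\001' ','
}

# Two sample rows shaped like the person table, joined with Ctrl-A.
printf '1\001zhangsan\n2\001LISI\n' | show_ctrl_a
```

On a cluster the same filter could be applied to real data, for example hdfs dfs -cat /user/hive/warehouse/person/part-m-00000 | show_ctrl_a.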

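To change the default delimiter, use the --fields-terminated-by parameter.

The parameter sets the table's delimiter only when the Hive table is first created by an import.

Now drop the table in Hive and import it again: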

sqoop import --connect jdbc:mysql://localhost:3306/test --username root --password 123456 --table person -m 1 --hive-import --fields-terminated-by "|"

Checking the Hive table's delimiters again:

Storage Desc Params:          
    field.delim             |                   
    line.delim              \n                  
    serialization.format    |
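
The Storage Desc Params block shown above is part of what Hive prints for DESCRIBE FORMATTED. A minimal sketch for pulling out just the delimiter rows from the command line follows; show_delims is a hypothetical helper name, and it assumes the hive CLI is on the PATH.

```shell
# show_delims is a hypothetical helper name.
# It prints only the delimiter rows from Hive's DESCRIBE FORMATTED output.
show_delims() {
  # $1 is the table name; hive -e runs a single statement and exits.
  hive -e "DESCRIBE FORMATTED $1;" | grep -E 'field\.delim|line\.delim'
}
```

Running show_delims person before and after the re-import gives a quick way to compare the delimiters without reading the full table description.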