1. Install Hadoop
hadoop:
sqoop2.x:
http://my.oschina.net/u/204498/blog/518941
2. Install Sqoop 1.x
1. Choose the Sqoop build that matches your Hadoop version
[hadoop@hftclclw0001 ~]$ pwd
/home/hadoop
[hadoop@hftclclw0001 ~]$ wget
[hadoop@hftclclw0001 ~]$ tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
[hadoop@hftclclw0001 ~]$ cd sqoop-1.4.6.bin__hadoop-2.0.4-alpha/conf
[hadoop@hftclclw0001 conf]$ ls -al
total 44
drwx------ 2 hadoop root 4096 Nov 25 04:32 .
drwx------ 9 hadoop root 4096 Nov 25 04:20 ..
-rw------- 1 hadoop root  818 Apr 27  2015 .gitignore
-rw------- 1 hadoop root 3895 Apr 27  2015 oraoop-site-template.xml
-rw------- 1 hadoop root 1404 Apr 27  2015 sqoop-env-template.cmd
-rwx------ 1 hadoop root 1345 Apr 27  2015 sqoop-env-template.sh
-rw------- 1 hadoop root 5531 Apr 27  2015 sqoop-site-template.xml
-rw------- 1 hadoop root 5531 Apr 27  2015 sqoop-site.xml
[hadoop@hftclclw0001 conf]$ cp sqoop-env-template.sh sqoop-env.sh
[hadoop@hftclclw0001 conf]$ vim sqoop-env.sh

export HADOOP_COMMON_HOME=/home/hadoop/hadoop-2.7.1

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/home/hadoop/hadoop-2.7.1/share/hadoop/mapreduce

#set the path to where bin/hbase is available => optional; configure it only when you use HBase
#export HBASE_HOME=/home/hadoop/hbase-1.0.1.1

#Set the path to where bin/hive is available => optional; configure it only when you use Hive
#export HIVE_HOME=/home/hadoop/apache-hive-1.2.1-bin

#Set the path for where zookeper config dir is
#export ZOOCFGDIR=

# => A full JDK must be installed; the JRE I had installed earlier caused problems at run time
export JAVA_HOME=/usr/java/latest
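The JAVA_HOME note above can be checked before the first import. The sketch below assumes that a JDK can be told apart from a JRE by the presence of an executable bin/javac. Sqoop compiles generated Java classes during an import, which is the likely reason a JRE alone is not enough. The helper name is_jdk is hypothetical and not part of Sqoop.

```shell
#!/bin/sh
# is_jdk DIR: succeed if DIR looks like a JDK, i.e. it contains an executable bin/javac.
# Hypothetical helper for illustration; not part of Sqoop or Hadoop.
is_jdk() {
    [ -n "$1" ] && [ -x "$1/bin/javac" ]
}

# Report on the JAVA_HOME currently set in the environment.
if is_jdk "${JAVA_HOME:-}"; then
    echo "JAVA_HOME looks like a JDK: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no bin/javac (JRE only?): ${JAVA_HOME:-<unset>}"
fi
```

Running this after sourcing conf/sqoop-env.sh shows whether the configured path is usable.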
2. Add the matching JDBC driver (I am using MySQL)
[hadoop@hftclclw0001 lib]$ pwd
/home/hadoop/sqoop-1.4.6.bin__hadoop-2.0.4-alpha/lib
[hadoop@hftclclw0001 lib]$ ls -al | grep mysql
-rw------- 1 hadoop root 848401 Nov 3 06:41 mysql-connector-java-5.1.25-bin.jar
3. Sqoop 1.x syntax
1. Install MySQL (configure the appropriate repo first)
[root@hftclclw0001 opt]# yum install mysql-server mysql mysql-client
2. Start the service, test it, and set a password for the root user
[root@hftclclw0001 opt]# service mysqld start
[root@hftclclw0001 opt]# netstat -apn|grep 3306
tcp        0      0 0.0.0.0:3306        0.0.0.0:*        LISTEN      24540/mysqld
[root@hftclclw0001 opt]# mysql -u root -p
Enter password:
mysql>
3. Prepare the test data
I use the MySQL sample data from the Apache Sqoop Cookbook:
https://github.com/jarcec/Apache-Sqoop-Cookbook
Using the MySQL scripts from that GitHub repository, create the sqoop user and the sqoop database, add the corresponding tables, and grant the sqoop user the required privileges.

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| performance_schema |
| sqoop              |
+--------------------+
4 rows in set (0.00 sec)

mysql> use sqoop
mysql> show tables;
+-----------------+
| Tables_in_sqoop |
+-----------------+
| cities          |
| countries       |
| normcities      |
| staging_cities  |
| visits          |
+-----------------+
5 rows in set (0.00 sec)
Chapter 2: Importing Data
sqoop list:
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop-list-tables --connect jdbc:mysql://{ip}:{port}/sqoop \
> --username sqoop \
> --password sqoop
...
...
cities
countries
normcities
staging_cities
visits
=> these are the tables created in MySQL earlier
sqoop import: importing a whole table (transferring an entire table)
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop import \
> --connect jdbc:mysql://{ip}:{port}/sqoop \
> --username sqoop \
> --password sqoop \
> --table cities
...
=> This launches a MapReduce job that reads from MySQL and writes the rows to files in HDFS. By default the files go into a directory named after the table under the current user's HDFS home directory.

The table has three records, and three part files were generated:
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -ls /user/hadoop/cities
-rw-r--r-- 3 hadoop supergroup  0 2015-11-25 05:29 /user/hadoop/cities/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 16 2015-11-25 05:29 /user/hadoop/cities/part-m-00000
-rw-r--r-- 3 hadoop supergroup 22 2015-11-25 05:29 /user/hadoop/cities/part-m-00001
-rw-r--r-- 3 hadoop supergroup 16 2015-11-25 05:29 /user/hadoop/cities/part-m-00002
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -cat cities/part-m-00000
1,USA,Palo Alto
sqoop import: choosing the output path (specifying a target directory)
--target-dir: the directory must not already exist (this option is for single-table imports)
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop import \
> --connect jdbc:mysql://{ip}:{port}/sqoop \
> --username sqoop \
> --password sqoop \
> --table cities \
> --target-dir /tmp/cities
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -ls /tmp/cities
-rw-r--r-- 3 hadoop supergroup  0 2015-11-25 05:29 /user/hadoop/cities/_SUCCESS
-rw-r--r-- 3 hadoop supergroup 16 2015-11-25 05:29 /user/hadoop/cities/part-m-00000
-rw-r--r-- 3 hadoop supergroup 22 2015-11-25 05:29 /user/hadoop/cities/part-m-00001
-rw-r--r-- 3 hadoop supergroup 16 2015-11-25 05:29 /user/hadoop/cities/part-m-00002
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -cat cities/part-m-00000
1,USA,Palo Alto

When importing several tables, use --warehouse-dir instead: Sqoop creates one subdirectory per table, named after the table, under the directory you specify.
sqoop import: SQL with a WHERE condition, i.e. a subset (importing only a subset of data)
mysql> select * from sqoop.cities;
+----+----------------+-----------+
| id | country        | city      |
+----+----------------+-----------+
|  1 | USA            | Palo Alto |
|  2 | Czech Republic | Brno      |
|  3 | USA            | Sunnyvale |
+----+----------------+-----------+
3 rows in set (0.00 sec)

mysql> select * from sqoop.cities where country = 'USA';
+----+---------+-----------+
| id | country | city      |
+----+---------+-----------+
|  1 | USA     | Palo Alto |
|  3 | USA     | Sunnyvale |
+----+---------+-----------+
2 rows in set (0.00 sec)

[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -rmr /tmp/cities
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop import \
> --connect jdbc:mysql://{ip}:{port}/sqoop \
> --username sqoop \
> --password sqoop \
> --table cities \
> --where "country = 'USA'" \
> --target-dir /tmp/cities
sqoop import: (protecting your password)
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop import \
> --connect jdbc:mysql://{ip}:{port}/sqoop \
> --username sqoop \
> --table cities \
> --where "country = 'USA'" \
> --target-dir /tmp/cities \
> -P                                   => prompt for the password on the command line
> --password-file my-sqoop-password    => or, instead of -P, read the password from a file

sqoop import: (Using a File Format Other Than CSV)
By default Sqoop writes CSV text files with comma-separated fields. Other formats can be selected with:
--as-sequencefile
--as-avrodatafile

sqoop import: (Compressing imported data)
--compress
--compression-codec org.apache.hadoop.io.compress.BZip2Codec    => selects the compression codec
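A password file for --password-file can be prepared as in the sketch below. Sqoop uses the whole file content as the password, including any trailing newline that an editor adds, so the file is written with printf '%s'. The helper name write_password_file is hypothetical, and owner-only permissions are an assumption about what you want for a credentials file.

```shell
#!/bin/sh
# write_password_file FILE PASSWORD
# Store PASSWORD in FILE without a trailing newline, readable only by the owner.
# Hypothetical helper for illustration; not part of Sqoop.
write_password_file() {
    file=$1
    password=$2
    rm -f "$file"                          # allow re-running over an old read-only file
    (
        umask 077                          # create the file with owner-only permissions
        printf '%s' "$password" > "$file"  # printf '%s' writes no trailing newline
    )
    chmod 400 "$file"                      # owner may read, nobody may write
}
```

For example, write_password_file my-sqoop-password sqoop creates the file named in the command above. Depending on the cluster configuration, Sqoop may resolve the path against HDFS rather than the local file system, so the file may need to be uploaded first.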
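sqoop import: (speeding up transfers)
By default the MapReduce job's InputFormat reads the data through JDBC, which is slow. With --direct, Sqoop instead uses native tools provided by the database, such as mysqldump for MySQL.
--direct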
Chapter 3: Incremental Import
mysql> select * from sqoop.visits;
+----+----------+---------------------+
| id | city     | last_update_date    |
+----+----------+---------------------+
|  1 | Freemont | 1983-05-22 01:01:01 |
|  2 | Jicin    | 1987-02-02 02:02:02 |
+----+----------+---------------------+
2 rows in set (0.00 sec)

importing only new data
The table has an integer primary key, id. We import the rows with id > 1.
--check-column => the column to check
--last-value   => the value that column had at the end of the previous import; only rows with a greater value are imported this time

[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop import \
> --connect jdbc:mysql://{ip}:{port}/sqoop \
> --username sqoop \
> --password sqoop \
> --table visits \
> --target-dir /tmp/visits \
> --incremental append \    => the incremental mode is append
> --check-column id \       => append mode needs a monotonically increasing key
> --last-value 1            => import starts from id > 1

Note that the run prints the log below. It tells you that the next incremental import should use --last-value 2 (the last record imported this time) and suggests using sqoop job --create for recurring incremental imports like this one.
15/11/25 06:05:28 INFO tool.ImportTool: Incremental import complete! To run another incremental import of all data following this import, supply the following arguments:
15/11/25 06:05:28 INFO tool.ImportTool: --incremental append
15/11/25 06:05:28 INFO tool.ImportTool: --check-column id
15/11/25 06:05:28 INFO tool.ImportTool: --last-value 2
15/11/25 06:05:28 INFO tool.ImportTool: (Consider saving this with 'sqoop job --create')

[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -ls /tmp/visits
-rw-r--r-- 3 hadoop supergroup 30 2015-11-25 06:05 /tmp/visits/part-m-00000
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ hadoop dfs -cat /tmp/visits/part-m-00000
2,Jicin,1987-02-02 02:02:02.0

incrementally importing mutable data
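The row selection done by --incremental append can be illustrated locally without a cluster. The sketch below is only a stand-in for what Sqoop does against the database: it keeps the CSV rows whose first field (playing the role of the check column) is greater than the given last value. The helper name incremental_rows is hypothetical.

```shell
#!/bin/sh
# incremental_rows LAST_VALUE < rows.csv
# Print the rows whose first CSV field is numerically greater than LAST_VALUE.
# Illustration only; hypothetical helper, not a Sqoop command.
incremental_rows() {
    awk -F',' -v last="$1" '$1 + 0 > last + 0 { print }'
}

# Sample rows taken from the visits table; with last value 1 only id 2 is kept.
printf '%s\n' \
    '1,Freemont,1983-05-22 01:01:01' \
    '2,Jicin,1987-02-02 02:02:02' |
    incremental_rows 1
```

With the two sample rows this prints only the Jicin row, matching the single record found in /tmp/visits/part-m-00000.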
Sqoop Job:
http://shiyanjun.cn/archives/621.html
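When we use Sqoop 1.x to sync data between an RDBMS and Hadoop/Hive in --incremental append mode, we have to keep track of --last-value. Without a saved job, every run of the sync script would have to parse the new --last-value out of the previous run's log and reset the script's parameter before the sync could be correct. A saved Sqoop job records the value for us.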
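To make the manual bookkeeping concrete, here is a minimal sketch of the log parsing a script would otherwise need. It assumes the import log contains a line ending in "--last-value N", as in the output shown earlier, and that the value contains no spaces (a numeric id, not a timestamp). The helper name next_last_value is hypothetical.

```shell
#!/bin/sh
# next_last_value < import.log
# Print the value from the last "--last-value N" line of a Sqoop import log.
# Hypothetical helper; assumes the value itself contains no spaces.
next_last_value() {
    sed -n 's/.*--last-value \([^ ]*\) *$/\1/p' | tail -n 1
}

# Sample lines copied from the log format shown earlier.
printf '%s\n' \
    '15/11/25 06:05:28 INFO tool.ImportTool: --incremental append' \
    '15/11/25 06:05:28 INFO tool.ImportTool: --check-column id' \
    '15/11/25 06:05:28 INFO tool.ImportTool: --last-value 2' |
    next_last_value
```

A saved job removes the need for this kind of scraping, which is fragile because it depends on the exact log wording.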
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop job \
> --create visits-sync-job \    => create the job; the job id is visits-sync-job
> -- \
> import \
> --connect jdbc:mysql://10.224.243.124:3306/sqoop \
> --username sqoop \
> --password sqoop \
> --table visits \
> --incremental append \
> --check-column id \
> --last-value 1

[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop job --list
15/11/25 06:40:00 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Available jobs:
visits-sync-job

[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop job --show visits-sync-job
15/11/25 06:40:10 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
Job: visits-sync-job
Tool: import
...
...
incremental.last.value = 1
...

[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop job --exec visits-sync-job
Enter password:

After the job has run, show it again:
[hadoop@hftclclw0001 sqoop-1.4.6.bin__hadoop-2.0.4-alpha]$ ./bin/sqoop job --show visits-sync-job
15/11/25 06:44:52 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password:
incremental.last.value = 2    => the last value has been recorded; the next run reads it and continues from there
Chapter 4: Free-Form Query Import
sqoop import: (importing data from two tables)
mysql> select * from sqoop.cities;
+----+----------------+-----------+
| id | country        | city      |
+----+----------------+-----------+
|  1 | USA            | Palo Alto |
|  2 | Czech Republic | Brno      |
|  3 | USA            | Sunnyvale |
+----+----------------+-----------+
3 rows in set (0.00 sec)

mysql> select * from sqoop.countries;
+------------+----------------+
| country_id | country        |
+------------+----------------+
|          1 | USA            |
|          2 | Czech Republic |
+------------+----------------+
2 rows in set (0.00 sec)