1. Importing relational data into the HDFS file system
I. Data preparation
Here we use MySQL as the example and import some of its tables into HDFS.
I have a sakila database on this MySQL server:
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| sakila             |
| sqoopdb            |
| sys                |
+--------------------+
7 rows in set (0.01 sec)
The database contains the following tables:
mysql> show tables
-> ;
+----------------------------+
| Tables_in_sakila           |
+----------------------------+
| actor                      |
| actor_info                 |
| address                    |
| category                   |
| city                       |
| country                    |
| customer                   |
| customer_list              |
| film                       |
| film_actor                 |
| film_category              |
| film_list                  |
| film_text                  |
| inventory                  |
| language                   |
| nicer_but_slower_film_list |
| payment                    |
| rental                     |
| sales_by_film_category     |
| sales_by_store             |
| staff                      |
| staff_list                 |
| store                      |
+----------------------------+
Among these is an actor table with 200 rows; the data looks like this (a quick row-count check follows the sample):
+----------+-------------+--------------+---------------------+
| actor_id | first_name  | last_name    | last_update         |
+----------+-------------+--------------+---------------------+
|        1 | PENELOPE    | GUINESS      | 2006-02-15 04:34:33 |
|        2 | NICK        | WAHLBERG     | 2006-02-15 04:34:33 |
|        3 | ED          | CHASE        | 2006-02-15 04:34:33 |
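Before the import it is worth confirming the source row count; with the stock sakila data this returns the 200 rows mentioned above:
mysql> select count(*) from actor;
+----------+
| count(*) |
+----------+
|      200 |
+----------+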
We will now use the sqoop tool to import the actor table into the HDFS cluster.
II. Preparation before running the import
(1) Sqoop is a data-migration tool that performs imports and exports by running built-in MapReduce programs, so it depends on a running Hadoop cluster. Start the HDFS and YARN daemons first:
[root@hadoop-server01 ~]# start-dfs.sh
[root@hadoop-server01 ~]# start-yarn.sh
[root@hadoop-server01 ~]# jps
2802 SecondaryNameNode
3042 NodeManager
2534 NameNode
2653 DataNode
2943 ResourceManager
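Besides jps, the daemons can optionally be sanity-checked from the command line, for example to confirm that the DataNode and NodeManager have registered (output omitted here):
[root@hadoop-server01 ~]# hdfs dfsadmin -report
[root@hadoop-server01 ~]# yarn node -list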
(2) A sqoop import connects to the relational database, so before running it the database's JDBC driver jar must be placed in the lib directory of the sqoop installation. Using MySQL as the example:
[root@hadoop-server01 lib]# ll /usr/local/apps/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/lib/mysql-connector-java-5.0.8-bin.jar
-rw-r--r--. 1 root root 540852 Jul 9 05:36 /usr/local/apps/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/lib/mysql-connector-java-5.0.8-bin.jar
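If the driver jar is not in place yet, copying it into sqoop's lib directory is all that is required; the source path of the jar below is illustrative:
[root@hadoop-server01 ~]# cp /root/mysql-connector-java-5.0.8-bin.jar /usr/local/apps/sqoop-1.4.4.bin__hadoop-2.0.4-alpha/lib/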
三、執行導入操做
使用如下命令執行關係型數據導入HDFS操做
sqoop import --connect jdbc:mysql://ip:port/dbname --username username--password passwd --table tablename
參數說明:
--import 導入命令
--connect 指定鏈接關係型數據庫鏈接字符串,不一樣數據庫類型鏈接方式有差別
--username 鏈接關係型數據庫用戶名
--password 鏈接關係型數據庫的密碼
--table 指定要導入HDFS的表
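Beyond the required parameters, a few optional flags are often useful: --target-dir sets the HDFS output directory explicitly, -m (or --num-mappers) sets the number of map tasks, and --fields-terminated-by changes the field delimiter. A sketch in the same template form; the directory and delimiter values are only illustrative:
sqoop import --connect jdbc:mysql://ip:port/dbname --username username --password passwd --table tablename --target-dir /user/root/tablename_import -m 2 --fields-terminated-by '\t'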
下面正式執行導入操做:
[root@hadoop-server01 lib]# sqoop import --connect jdbc:mysql://hadoop-server03:3306/sakila --username sqoop --password hive#2018 --table actor
18/07/09 05:37:36 INFO client.RMProxy: Connecting to ResourceManager at hadoop-server01/192.168.1.201:8032
18/07/09 05:37:39 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`actor_id`), MAX(`actor_id`) FROM `actor`
18/07/09 05:37:39 INFO mapreduce.JobSubmitter: number of splits:4
18/07/09 05:37:39 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1531139617936_0002
18/07/09 05:37:40 INFO impl.YarnClientImpl: Submitted application application_1531139617936_0002
18/07/09 05:37:40 INFO mapreduce.Job: The url to track the job: http://hadoop-server01:8088/proxy/application_1531139617936_0002/
18/07/09 05:37:40 INFO mapreduce.Job: Running job: job_1531139617936_0002
....................................
The output above shows that the sqoop command is translated into a MapReduce job and submitted to the Hadoop cluster for execution, confirming that sqoop performs its data migration by generating MapReduce programs.
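While the job is running, its progress can also be followed through the tracking URL printed in the log, or queried from the command line using the application id from the same log:
[root@hadoop-server01 lib]# yarn application -status application_1531139617936_0002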
IV. Verifying the data imported into HDFS
[root@hadoop-server01 lib]# hadoop fs -ls /user/root/actor
Found 5 items
-rw-r--r-- 1 root supergroup 0 2018-07-09 05:37 /user/root/actor/_SUCCESS
-rw-r--r-- 1 root supergroup 1911 2018-07-09 05:37 /user/root/actor/part-m-00000
-rw-r--r-- 1 root supergroup 1929 2018-07-09 05:37 /user/root/actor/part-m-00001
-rw-r--r-- 1 root supergroup 1978 2018-07-09 05:37 /user/root/actor/part-m-00002
-rw-r--r-- 1 root supergroup 1981 2018-07-09 05:37 /user/root/actor/part-m-00003
The imported data is stored under /user/root/actor; by default sqoop writes to /user/<current OS user>/<table name>.
Four files, part-m-00000 through part-m-00003, were produced, which means the MapReduce job generated by sqoop split the source table into 4 input splits, with each map task writing one file. The "m" in part-m indicates a map-only job: the map tasks write their output directly and no reduce tasks are involved.
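The split column defaults to the table's primary key (actor_id here, which is why the log above shows a BoundingValsQuery over MIN and MAX of actor_id). For a table without a primary key, or to change the parallelism, the split column and mapper count can be given explicitly; the values below are illustrative:
sqoop import --connect jdbc:mysql://ip:port/dbname --username username --password passwd --table tablename --split-by columnname -m 2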
Next, verify that the imported data matches the actor table in the source database:
[root@hadoop-server01 lib]# hadoop fs -cat /user/root/actor/part-m-00000
1,PENELOPE,GUINESS,2006-02-15 04:34:33.0
2,NICK,WAHLBERG,2006-02-15 04:34:33.0
3,ED,CHASE,2006-02-15 04:34:33.0
...... (intermediate rows omitted)
49,ANNE,CRONYN,2006-02-15 04:34:33.0
50,NATALIE,HOPKINS,2006-02-15 04:34:33.0
[root@hadoop-server01 lib]# hadoop fs -cat /user/root/actor/part-m-00001
51,GARY,PHOENIX,2006-02-15 04:34:33.0
52,CARMEN,HUNT,2006-02-15 04:34:33.0
53,MENA,TEMPLE,2006-02-15 04:34:33.0
...... (intermediate rows omitted)
99,JIM,MOSTEL,2006-02-15 04:34:33.0
100,SPENCER,DEPP,2006-02-15 04:34:33.0
[root@hadoop-server01 lib]# hadoop fs -cat /user/root/actor/part-m-00002
101,SUSAN,DAVIS,2006-02-15 04:34:33.0
102,WALTER,TORN,2006-02-15 04:34:33.0
103,MATTHEW,LEIGH,2006-02-15 04:34:33.0
...... (intermediate rows omitted)
149,RUSSELL,TEMPLE,2006-02-15 04:34:33.0
150,JAYNE,NOLTE,2006-02-15 04:34:33.0
[root@hadoop-server01 lib]# hadoop fs -cat /user/root/actor/part-m-00003
151,GEOFFREY,HESTON,2006-02-15 04:34:33.0
152,BEN,HARRIS,2006-02-15 04:34:33.0
...... (intermediate rows omitted)
199,JULIA,FAWCETT,2006-02-15 04:34:33.0
200,THORA,TEMPLE,2006-02-15 04:34:33.0
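Rather than inspecting each file, the total can be confirmed in one step by counting lines across all part files; for this import it should print 200:
[root@hadoop-server01 lib]# hadoop fs -cat /user/root/actor/part-m-* | wc -l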
The four part files together contain exactly 200 rows, matching the row count in the source database, and the imported records use a comma as the default field delimiter.
This completes the basic walkthrough of importing data from a relational database into HDFS with sqoop.