Sqoop數據遷移工具的使用

時間 2019-12-04

標籤 sqoop 數據遷移工具使用简体版

原文原文鏈接

文章做者：foochanephp

原文連接：foochane.cn/article/201…html

1 sqoop簡單介紹

sqoop是apache旗下一款「Hadoop和關係數據庫服務器之間傳送數據」的工具。用於數據的導入和導出。java

導入數據：MySQL，Oracle導入數據到Hadoop的HDFS、HIVE、HBASE等數據存儲系統；
導出數據：從Hadoop的文件系統中導出數據到關係數據庫mysql等

sqoop的工做機制是將導入或導出命令翻譯成mapreduce程序來實現，在翻譯出的mapreduce中主要是對inputformat和outputformat進行定製。mysql

2 sqoop安裝

安裝sqoop前要先安裝好java環境和hadoop環境。git

sqoop只是一個工具，安裝在那個節點均可以，只要有java環境和hadoop環境，而且能鏈接到對應數據庫便可。sql

2.1 下載並解壓

下載地址：mirror.bit.edu.cn/apache/sqoo… 下載：sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz數據庫

解壓到安裝目錄下apache

2.2 修改配置文件

將sqoop-env-template.sh複製一份重命名爲sqoop-env.sh文件，在sqoop-env.sh文件中添加以下內容：bash

export HADOOP_COMMON_HOME=/usr/local/bigdata/hadoop-2.7.1
export HADOOP_MAPRED_HOME=/usr/local/bigdata/hadoop-2.7.1
export HIVE_HOME=/usr/local/bigdata/hive-2.3.5
複製代碼

2.3 安裝mysql的jdbc啓動

將 mysql-connector-java-5.1.45.jar 拷貝到sqoop的lib目錄下。服務器

$ sudo apt-get install libmysql-java #以前已經裝過了
$ ln -s /usr/share/java/mysql-connector-java-5.1.45.jar /usr/local/bigdata/sqoop-1.4.7/lib

複製代碼

也能夠本身手動複製 mysql-connector-java-5.1.45.jar。

2.4 驗證sqoop

查看sqoop版本

$ bin/sqoop-version
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/06/30 03:03:07 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 20
複製代碼

會出現幾個警告，暫時先無論。

驗證sqoop到mysql業務庫之間的連通性：

$ bin/sqoop-list-databases --connect jdbc:mysql://Master:3306 --username hiveuser --password 123456
$ bin/sqoop-list-tables --connect jdbc:mysql://Master:3306/metastore --username hiveuser --password 123456

複製代碼

3 sqoop數據導入

3.1 從MySql導數據到HDFS

先在mysql中，建表插入測試數據；

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for `emp`
-- ----------------------------
DROP TABLE IF EXISTS `emp`;
CREATE TABLE `emp` (
  `id` int(11) DEFAULT NULL,
  `name` varchar(100) DEFAULT NULL,
  `deg` varchar(100) DEFAULT NULL,
  `salary` int(11) DEFAULT NULL,
  `dept` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- ----------------------------
-- Records of emp
-- ----------------------------
INSERT INTO `emp` VALUES ('1201', 'gopal', 'manager', '50000', 'TP');
INSERT INTO `emp` VALUES ('1202', 'manisha', 'Proof reader', '50000', 'TP');
INSERT INTO `emp` VALUES ('1203', 'khalil', 'php dev', '30000', 'AC');
INSERT INTO `emp` VALUES ('1204', 'prasanth', 'php dev', '30000', 'AC');
INSERT INTO `emp` VALUES ('1205', 'kranthi', 'admin', '20000', 'TP');

-- ----------------------------
-- Table structure for `emp_add`
-- ----------------------------
DROP TABLE IF EXISTS `emp_add`;
CREATE TABLE `emp_add` (
  `id` int(11) DEFAULT NULL,
  `hno` varchar(100) DEFAULT NULL,
  `street` varchar(100) DEFAULT NULL,
  `city` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- ----------------------------
-- Records of emp_add
-- ----------------------------
INSERT INTO `emp_add` VALUES ('1201', '288A', 'vgiri', 'jublee');
INSERT INTO `emp_add` VALUES ('1202', '108I', 'aoc', 'sec-bad');
INSERT INTO `emp_add` VALUES ('1203', '144Z', 'pgutta', 'hyd');
INSERT INTO `emp_add` VALUES ('1204', '78B', 'old city', 'sec-bad');
INSERT INTO `emp_add` VALUES ('1205', '720X', 'hitec', 'sec-bad');

-- ----------------------------
-- Table structure for `emp_conn`
-- ----------------------------
DROP TABLE IF EXISTS `emp_conn`;
CREATE TABLE `emp_conn` (
  `id` int(100) DEFAULT NULL,
  `phno` varchar(100) DEFAULT NULL,
  `email` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

-- ----------------------------
-- Records of emp_conn
-- ----------------------------
INSERT INTO `emp_conn` VALUES ('1201', '2356742', 'gopal@tp.com');
INSERT INTO `emp_conn` VALUES ('1202', '1661663', 'manisha@tp.com');
INSERT INTO `emp_conn` VALUES ('1203', '8887776', 'khalil@ac.com');
INSERT INTO `emp_conn` VALUES ('1204', '9988774', 'prasanth@ac.com');
INSERT INTO `emp_conn` VALUES ('1205', '1231231', 'kranthi@tp.com');

複製代碼

導入：

bin/sqoop import   \
--connect jdbc:mysql://Master:3306/test   \
--username root  \
--password root   \
--target-dir /sqooptest \
--fields-terminated-by ',' \
--table emp   \
--m 2 \
--split-by id;
複製代碼

--connect:指定數據庫
--username：指定用戶名
--password：指定密碼
--table:指定要導入的表
--target-dir:指定hdfs的目錄
--fields-terminated-by：指定文件分割符
--m: 指定maptask個數，若是大於1，必須指定split-by參數，如指定爲2，最後生產的文件會是兩個
--split-by:指定分片的字段

若是表的數據量不是很大就不用指定設置--m參數了

注意導入前前啓動hdfs和yarn，而且提交的yarn上運行，而不是在本地運行。

示例：

$ bin/sqoop import   --connect jdbc:mysql://Master:3306/test   --username hadoop  --password 123456   --target-dir /sqooptest --fields-terminated-by ',' --table emp  --m 1  --split-by id;
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /usr/local/bigdata/sqoop-1.4.7/bin/../../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
19/06/30 05:00:43 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
19/06/30 05:00:43 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
19/06/30 05:00:44 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
19/06/30 05:00:44 INFO tool.CodeGenTool: Beginning code generation
Sun Jun 30 05:00:45 UTC 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
19/06/30 05:00:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
19/06/30 05:00:46 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `emp` AS t LIMIT 1
19/06/30 05:00:46 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/local/bigdata/hadoop-2.7.1
Note: /tmp/sqoop-hadoop/compile/cd17c36add75dfe67edd3facf7538def/emp.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
19/06/30 05:00:56 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-hadoop/compile/cd17c36add75dfe67edd3facf7538def/emp.jar
19/06/30 05:00:56 WARN manager.MySQLManager: It looks like you are importing from mysql.
19/06/30 05:00:56 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
19/06/30 05:00:56 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
19/06/30 05:00:56 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
19/06/30 05:00:56 INFO mapreduce.ImportJobBase: Beginning import of emp
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/bigdata/hadoop-2.7.1/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/bigdata/hbase-2.0.5/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
19/06/30 05:00:58 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
19/06/30 05:01:06 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
19/06/30 05:01:07 INFO client.RMProxy: Connecting to ResourceManager at Master/192.168.233.200:8032
Sun Jun 30 05:01:55 UTC 2019 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
19/06/30 05:01:56 INFO db.DBInputFormat: Using read commited transaction isolation
19/06/30 05:01:56 INFO mapreduce.JobSubmitter: number of splits:1
19/06/30 05:01:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1561868076549_0002
19/06/30 05:02:05 INFO impl.YarnClientImpl: Submitted application application_1561868076549_0002
19/06/30 05:02:06 INFO mapreduce.Job: The url to track the job: http://Master:8088/proxy/application_1561868076549_0002/
19/06/30 05:02:06 INFO mapreduce.Job: Running job: job_1561868076549_0002
19/06/30 05:02:47 INFO mapreduce.Job: Job job_1561868076549_0002 running in uber mode : false
19/06/30 05:02:48 INFO mapreduce.Job:  map 0% reduce 0%
19/06/30 05:03:35 INFO mapreduce.Job:  map 100% reduce 0%
19/06/30 05:03:36 INFO mapreduce.Job: Job job_1561868076549_0002 completed successfully
19/06/30 05:03:37 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=135030
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=87
                HDFS: Number of bytes written=151
                HDFS: Number of read operations=4
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=42476
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=42476
                Total vcore-seconds taken by all map tasks=42476
                Total megabyte-seconds taken by all map tasks=43495424
        Map-Reduce Framework
                Map input records=5
                Map output records=5
                Input split bytes=87
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=250
                CPU time spent (ms)=2700
                Physical memory (bytes) snapshot=108883968
                Virtual memory (bytes) snapshot=1934733312
                Total committed heap usage (bytes)=18415616
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=151
19/06/30 05:03:37 INFO mapreduce.ImportJobBase: Transferred 151 bytes in 150.6495 seconds (1.0023 bytes/sec)
19/06/30 05:03:37 INFO mapreduce.ImportJobBase: Retrieved 5 records.
複製代碼

查看是否導入成功：

$ hdfs dfs -cat /sqooptest/part-m-*
1201,gopal,manager,50000,TP
1202,manisha,Proof reader,50000,TP
1203,khalil,php dev,30000,AC
1204,prasanth,php dev,30000,AC
1205,kranthi,admin,20000,TP
複製代碼

3.2 從MySql導數據到Hive

命令：

bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop  \
--password 123456  \
--table emp  \
--hive-import  \
--split-by id  \
--m 1;
複製代碼

導入到hive，須要添加--hive-import參數，不用指定--target-dir其餘參數跟導入到hdfs上同樣。

3.3 導入表數據子集

有時候咱們並不須要，導入數據表中的所有數據，sqoop也支持導入數據表的部分數據。

這是可使用Sqoop的where語句。where子句的一個子集。它執行在各自的數據庫服務器相應的SQL查詢，並將結果存儲在HDFS的目標目錄。

where子句的語法以下:

--where <condition>

複製代碼

下面的命令用來導入emp_add表數據的子集。居住城市爲：sec-bad

bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--where "city ='sec-bad'" \
--target-dir /wherequery \
--table emp_add \
 --m 1
複製代碼

另外也可使用select語句：

bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--target-dir /wherequery2 \
--query 'select id,name,deg from emp WHERE id>1207 and $CONDITIONS' \
--split-by id \
--fields-terminated-by '\t' \
--m 2
複製代碼

3.4 增量導入

增量導入是僅導入新添加的表中的行的技術。

sqoop支持兩種增量MySql導入到hive的模式，一種是append，即經過指定一個遞增的列。另種是能夠根據時間戳。

3.4.1 append

指定以下參數：

--incremental append  
--check-column num_id 
--last-value 0 
複製代碼

--check-column 表示指定遞增的字段，--last-value指定上一次到入的位置

如：

bin/sqoop import \
--connect jdbc:mysql://Master:3306/test \
--username hadoop \
--password 123456 \
--table emp --m 1 \
--incremental append \
--check-column id \
--last-value 1208
複製代碼

3.4.2 根據時間戳

命令中添加以下參數：

--incremental lastmodified 
--check-column created 
--last-value '2012-02-01 11:0:00' 
複製代碼

就是隻導入created 比2012-02-01 11:0:00更大的數據。

4 Sqoop的數據導出

將數據從HDFS把文件導出到RDBMS數據庫,導出前目標表必須存在於目標數據庫中。默認操做是從將文件中的數據使用INSERT語句插入到表中。更新模式下，是生成UPDATE語句更新表數據語法

$ sqoop export (generic-args) (export-args) 

複製代碼

導入過程

一、首先須要手動建立mysql中的目標表

mysql> USE db;
mysql> CREATE TABLE employee ( 
   id INT NOT NULL PRIMARY KEY, 
   name VARCHAR(20), 
   deg VARCHAR(20),
   salary INT,
   dept VARCHAR(10));

複製代碼

二、執行導出命令

bin/sqoop export \
--connect jdbc:mysql://Master:3306/test \
--username root \
--password root \
--table employee \
--export-dir /user/hadoop/emp/
複製代碼

三、驗證表mysql命令行。

mysql>select * from employee;

複製代碼

若是給定的數據存儲成功，那麼能夠找到數據在以下的employee表。

+------+--------------+-------------+-------------------+--------+
| Id   | Name         | Designation | Salary            | Dept   |
+------+--------------+-------------+-------------------+--------+
| 1201 | gopal        | manager     | 50000             | TP     |
| 1202 | manisha      | preader     | 50000             | TP     |
| 1203 | kalil        | php dev     | 30000               | AC     |
| 1204 | prasanth     | php dev     | 30000             | AC     |
| 1205 | kranthi      | admin       | 20000             | TP     |
| 1206 | satish p     | grp des     | 20000             | GR     |
+------+--------------+-------------+-------------------+--------+
複製代碼

相關標籤/搜索

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。