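Using Hadoop for analytics and data processing requires loading data into clusters and processing it together with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications running on large clusters, can be challenging. Users must consider details such as ensuring data consistency, the consumption of production system resources, and data preparation for provisioning downstream pipelines. Transferring data using scripts is inefficient and time-consuming. Accessing data on external systems directly from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load from cluster nodes.
This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can load data from an external system onto HDFS and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture that supports plugins providing connectivity to new external systems.
Running Sqoop looks very simple, but what happens under the covers? The dataset being transferred is sliced into partitions, and a map-only job is launched with individual mappers each responsible for transferring one slice. Each record is handled in a type-safe manner because Sqoop uses the database metadata to infer the data types.
In the rest of this post we walk through an example that shows the various ways you can use Sqoop. The goal is to give an overview of Sqoop operation rather than to go into the details of advanced functionality.
The following command imports all data from a table called ORDERS in a MySQL database into the cluster: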
---
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password ****
---
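In this command the various options specified are as follows:
The import is done in two steps, as depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the metadata for the data being imported. In the second step, Sqoop submits a map-only job to the Hadoop cluster. This job performs the actual data transfer using the metadata captured in the first step.
Figure 1: Sqoop Import Overview
The imported data is saved in a directory on HDFS. As with most aspects of Sqoop operation, the user can specify an alternative directory in which to store the imported data.
By default these files contain comma-delimited fields, with newlines separating records. You can easily override the format in which data is copied by explicitly specifying the field separator and record terminator characters.
Sqoop also supports different data formats for importing data. For example, you can import data in Avro format simply by specifying the option --as-avrodatafile with the import command.
Sqoop provides many other options that can be used to further tune the import operation to suit your specific requirements.
In most cases, importing data into Hive is the same as running the import task and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct data type mapping and other details such as the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate table metadata and invokes the necessary commands to load the table or partition. All of this is done by simply specifying the option --hive-import on the command line.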
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --hive-import
----
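When you run a Hive import, Sqoop converts the data from the native data types of the external datastore into the corresponding types in Hive. Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported contains newlines or other Hive delimiter characters, Sqoop allows you to remove those characters so that the data is populated correctly in Hive.
Once the import is complete, you can view and operate on the table just like any other table in Hive.
You can use Sqoop to populate data in a particular column family of an HBase table. Much like the Hive import, this is done by specifying additional options that name the HBase table and column family to populate. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.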
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--hbase-create-table --hbase-table ORDERS --column-family mysql
----
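In this command the various options specified are as follows:
The rest of the options are the same as for a regular import operation.
In some cases, data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing the example above, if the data generated by the Hadoop pipeline corresponded to the ORDERS table in a database somewhere, you could populate it using the following command: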
----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--export-dir /user/arvind/ORDERS
----
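In this command the various options specified are as follows:
The export is done in two steps, as depicted in Figure 2 below. The first step is to introspect the database for metadata, and the second step is the data transfer. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
Figure 2: Sqoop Export Overview
Some connectors support staging tables, which help isolate production tables from corruption if a job fails for any reason. The map tasks first populate the staging table, and its contents are merged into the target table once all of the data has been delivered.
Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities or that do not support native JDBC. Connectors are plugin components based on Sqoop's extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to transfer data efficiently between Hadoop and the external store supported by the connector.
By default Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. It also includes fast-path connectors for MySQL and PostgreSQL. Fast-path connectors are specialized connectors that transfer data in batches with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database accessible via JDBC.
Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouses to NoSQL datastores.
In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries. We encourage you to try out Sqoop and give us your feedback.
More information regarding Sqoop can be found at: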
Project Website: http://incubator.apache.org/sqoop
Wiki: https://cwiki.apache.org/confluence/display/SQOOP
Project Status: http://incubator.apache.org/projects/sqoop.html
Mailing Lists: https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists
The original text follows:
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from MapReduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, and data preparation for provisioning downstream pipelines. Transferring data using scripts is inefficient and time-consuming. Directly accessing data residing on external systems from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.
This is where Apache Sqoop fits in. Apache Sqoop is currently undergoing incubation at the Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
Sqoop allows easy import and export of data from structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. Using Sqoop, you can provision the data from an external system onto HDFS, and populate tables in Hive and HBase. Sqoop integrates with Oozie, allowing you to schedule and automate import and export tasks. Sqoop uses a connector-based architecture which supports plugins that provide connectivity to new external systems.
What happens underneath the covers when you run Sqoop is very straightforward. The dataset being transferred is sliced up into different partitions and a map-only job is launched with individual mappers responsible for transferring a slice of this dataset. Each record of the data is handled in a type safe manner since Sqoop uses the database metadata to infer the data types.
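As a rough sketch of how this partitioning can be steered from the command line: the import tool accepts `--split-by` to name the column used to slice the table and `--num-mappers` to set how many map tasks run in parallel. The column name `ORDER_ID` and the count of 8 are assumptions for illustration. The small `run` helper is also an assumption of this sketch; it only prints the command unless `RUN=1` is set, so it can be tried on a machine without Sqoop. `-P` prompts for the password on the console.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# ORDER_ID is a hypothetical numeric key column; 8 mappers is an arbitrary choice.
run sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test -P \
  --split-by ORDER_ID --num-mappers 8
```

Each mapper then reads one range of the split column, so a column with evenly distributed values tends to give more balanced slices.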
In the rest of this post we will walk through an example that shows the various ways you can use Sqoop. The goal of this post is to give an overview of Sqoop operation without going into much detail or advanced functionality.
The following command is used to import all data from a table called ORDERS from a MySQL database:
---
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password ****
---
In this command the various options specified are as follows:
The import is done in two steps as depicted in Figure 1 below. In the first step, Sqoop introspects the database to gather the necessary metadata for the data being imported. The second step is a map-only Hadoop job that Sqoop submits to the cluster. It is this job that does the actual data transfer using the metadata captured in the previous step.
Figure 1: Sqoop Import Overview
The imported data is saved in a directory on HDFS based on the table being imported. As is the case with most aspects of Sqoop operation, the user can specify any alternative directory where the files should be populated.
By default these files contain comma delimited fields, with new lines separating different records. You can easily override the format in which data is copied over by explicitly specifying the field separator and record terminator characters.
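The directory and delimiter overrides described above might look like the following sketch. `--target-dir`, `--fields-terminated-by`, and `--lines-terminated-by` are import options; the HDFS path `/user/test/orders_tsv` and the choice of tab-separated output are assumptions made for illustration, as is the `run` dry-run helper, which prints the command unless `RUN=1` is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# /user/test/orders_tsv is a hypothetical HDFS directory.
# Fields are separated by tabs instead of commas; records still end in a newline.
run sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test -P \
  --target-dir /user/test/orders_tsv \
  --fields-terminated-by '\t' --lines-terminated-by '\n'
```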
Sqoop also supports different data formats for importing data. For example, you can easily import data in Avro data format by simply specifying the option --as-avrodatafile with the import command.
There are many other options that Sqoop provides which can be used to further tune the import operation to suit your specific requirements.
In most cases, importing data into Hive is the same as running the import task and then using Hive to create and load a certain table or partition. Doing this manually requires that you know the correct type mapping between the data and other details like the serialization format and delimiters. Sqoop takes care of populating the Hive metastore with the appropriate metadata for the table and also invokes the necessary commands to load the table or partition as the case may be. All of this is done by simply specifying the option --hive-import with the import command.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** --hive-import
----
When you run a Hive import, Sqoop converts the data from the native datatypes within the external datastore into the corresponding types within Hive. Sqoop automatically chooses the native delimiter set used by Hive. If the data being imported has new line or other Hive delimiter characters in it, Sqoop allows you to remove such characters and get the data correctly populated for consumption in Hive.
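One way to remove the offending characters, sketched here under the assumption that your Sqoop release provides the `--hive-drop-import-delims` option, is to add that option to the Hive import. It drops newline, carriage return, and Ctrl-A characters from string fields. The `run` helper is an illustrative dry-run wrapper that prints the command unless `RUN=1` is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Same Hive import as above, with Hive delimiter characters stripped
# from string fields so that rows are not split incorrectly.
run sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test -P \
  --hive-import --hive-drop-import-delims
```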
Once the import is complete, you can see and operate on the table just like any other table in Hive.
You can use Sqoop to populate data in a particular column family within the HBase table. Much like the Hive import, this can be done by specifying the additional options that relate to the HBase table and column family being populated. All data imported into HBase is converted to its string representation and inserted as UTF-8 bytes.
----
$ sqoop import --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--hbase-create-table --hbase-table ORDERS --column-family mysql
----
In this command the various options specified are as follows:
The rest of the options are the same as for a regular import operation.
In some cases data processed by Hadoop pipelines may be needed in production systems to help run additional critical business functions. Sqoop can be used to export such data into external datastores as necessary. Continuing our example from above - if data generated by the pipeline on Hadoop corresponded to the ORDERS table in a database somewhere, you could populate it using the following command:
----
$ sqoop export --connect jdbc:mysql://localhost/acmedb \
--table ORDERS --username test --password **** \
--export-dir /user/arvind/ORDERS
----
In this command the various options specified are as follows:
Export is done in two steps as depicted in Figure 2. The first step is to introspect the database for metadata, followed by the second step of transferring the data. Sqoop divides the input dataset into splits and then uses individual map tasks to push the splits to the database. Each map task performs this transfer over many transactions in order to ensure optimal throughput and minimal resource utilization.
Figure 2: Sqoop Export Overview
Some connectors support staging tables that help isolate production tables from possible corruption in case of job failures due to any reason. Staging tables are first populated by the map tasks and then merged into the target table once all of the data has been delivered.
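A staged export could be requested along these lines, as a sketch rather than a definitive recipe. `--staging-table` and `--clear-staging-table` are export options; the table name `ORDERS_STAGE` is hypothetical and is assumed to exist already with the same structure as `ORDERS`. The `run` helper is an illustrative dry-run wrapper that prints the command unless `RUN=1` is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# ORDERS_STAGE is a hypothetical, pre-created table with the same schema as ORDERS.
# --clear-staging-table empties it before the export starts.
run sqoop export --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test -P \
  --export-dir /user/arvind/ORDERS \
  --staging-table ORDERS_STAGE --clear-staging-table
```

Because rows reach `ORDERS` only after every map task has finished, a failed job leaves the production table untouched.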
Using specialized connectors, Sqoop can connect with external systems that have optimized import and export facilities, or do not support native JDBC. Connectors are plugin components based on Sqoop’s extension framework and can be added to any existing Sqoop installation. Once a connector is installed, Sqoop can use it to efficiently transfer data between Hadoop and the external store supported by the connector.
By default Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server and DB2. It also includes fast-path connectors for MySQL and PostgreSQL databases. Fast-path connectors are specialized connectors that use database specific batch tools to transfer data with high throughput. Sqoop also includes a generic JDBC connector that can be used to connect to any database that is accessible via JDBC.
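To illustrate the difference between these connector types, the sketch below shows two invocations. The first adds `--direct` to request the MySQL fast-path connector. The second uses `--driver` to fall back to the generic JDBC connector; its JDBC URL and driver class `com.example.jdbc.Driver` are placeholders, not a real product. The `run` helper is an illustrative dry-run wrapper that prints each command unless `RUN=1` is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi; }

# Fast-path import from MySQL using the database's own bulk tooling.
run sqoop import --connect jdbc:mysql://localhost/acmedb \
  --table ORDERS --username test -P --direct

# Generic JDBC import; the URL scheme and driver class are hypothetical placeholders.
run sqoop import --connect jdbc:exampledb://dbhost/acmedb \
  --driver com.example.jdbc.Driver \
  --table ORDERS --username test -P
```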
Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop. These range from specialized connectors for enterprise data warehouse systems to NoSQL datastores.
In this post you saw how easy it is to transfer large datasets between Hadoop and external datastores such as relational databases. Beyond this, Sqoop offers many advanced features such as different data formats, compression, and working with queries instead of tables. We encourage you to try out Sqoop and give us your feedback.
More information regarding Sqoop can be found at:
Project Website: http://incubator.apache.org/sqoop
Wiki: https://cwiki.apache.org/confluence/display/SQOOP
Project Status: http://incubator.apache.org/projects/sqoop.html
Mailing Lists: https://cwiki.apache.org/confluence/display/SQOOP/Mailing+Lists