Using Sqoop

With Sqoop, you can import data from a relational database system or a mainframe into HDFS. The input to the import process is either a database table or mainframe datasets. For databases, Sqoop will read the table row by row into HDFS. For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS. The output of this import process is a set of files containing a copy of the imported table or datasets. Because the import process is performed in parallel, the output is spread across multiple files. These files may be delimited text files (for example, with commas or tabs separating each field), or binary Avro or SequenceFiles containing serialized record data.
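For example, a minimal import invocation might look like the sketch below; the JDBC URL, username, and table name (a MySQL database named corp with an employees table) are hypothetical placeholders, not values from this document:

    # Import the employees table row by row into HDFS, splitting the work
    # across four parallel map tasks; the output lands in multiple part files.
    sqoop import \
        --connect jdbc:mysql://localhost:3306/corp \
        --username sqoop_user -P \
        --table employees \
        --num-mappers 4 \
        --target-dir /user/sqoop/employees

    # For binary output, add --as-avrodatafile or --as-sequencefile
    # instead of the default delimited text.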


A by-product of the import process is a generated Java class which can encapsulate one row of the imported table. This class is used during the import process by Sqoop itself. The Java source code for this class is also provided to you, for use in subsequent MapReduce processing of the data. This class can serialize and deserialize data to and from the SequenceFile format. It can also parse the delimited-text form of a record. These abilities allow you to quickly develop MapReduce applications that use the HDFS-stored records in your processing pipeline. You are also free to parse the delimited record data yourself, using any other tools you prefer.
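If you want this generated class without running a full import (for example, to compile it into your own MapReduce job), the codegen tool can produce it on its own. A minimal sketch, again with a hypothetical connection string and table name:

    # Generate the record class for the employees table; no data is moved.
    sqoop codegen \
        --connect jdbc:mysql://localhost:3306/corp \
        --username sqoop_user -P \
        --table employees \
        --outdir /tmp/sqoop-src    # directory for the generated .java source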


After manipulating the imported records (for example, with MapReduce or Hive) you may have a result data set which you can then export back to the relational database. Sqoop’s export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new rows in a target database table, for consumption by external applications or users.
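A matching export might look like the sketch below; it assumes the result set was written as comma-delimited text under a hypothetical HDFS path, and that the target table daily_summary already exists in the database:

    # Read delimited files from HDFS in parallel, parse them into records,
    # and insert them as new rows in the target table.
    sqoop export \
        --connect jdbc:mysql://localhost:3306/corp \
        --username sqoop_user -P \
        --table daily_summary \
        --export-dir /user/sqoop/daily_summary \
        --input-fields-terminated-by ','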


Sqoop includes some other commands which allow you to inspect the database you are working with. For example, you can list the available database schemas (with the sqoop-list-databases tool) and tables within a schema (with the sqoop-list-tables tool). Sqoop also includes a primitive SQL execution shell (the sqoop-eval tool).
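For example (connection details are hypothetical):

    # List the schemas visible on the server.
    sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username sqoop_user -P

    # List the tables within one schema.
    sqoop list-tables --connect jdbc:mysql://localhost:3306/corp --username sqoop_user -P

    # Run an ad-hoc SQL statement and print the result.
    sqoop eval --connect jdbc:mysql://localhost:3306/corp --username sqoop_user -P \
        --query "SELECT COUNT(*) FROM employees"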


Most aspects of the import, code generation, and export processes can be customized. For databases, you can control the specific row range or columns imported. You can specify particular delimiters and escape characters for the file-based representation of the data, as well as the file format used. You can also control the class or package names used in generated code. Subsequent sections of this document explain how to specify these and other aspects of Sqoop’s operation.
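A sketch combining several of these options; the column list, WHERE clause, and class name are illustrative assumptions, not values from this document:

    # Import only selected columns and rows, tab-delimited, with a backslash
    # escape character, and control the name of the generated record class.
    sqoop import \
        --connect jdbc:mysql://localhost:3306/corp \
        --username sqoop_user -P \
        --table employees \
        --columns "id,name,salary" \
        --where "hire_date >= '2015-01-01'" \
        --fields-terminated-by '\t' \
        --escaped-by \\ \
        --class-name com.example.EmployeeRecord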
