Hive Tutorial Reading Notes

Hive Tutorial

1. Concepts

1.1 What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).

1.2 What Hive Is NOT

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

1.3 Getting Started

For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.

Books about Hive lists some books that may also be helpful for getting started with Hive.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.

1.4 Data Units

In the order of granularity - Hive data is organized into:

  • Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.

  • Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could comprise the following columns (schema):

    • timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.
    • userid—which is of BIGINT type that identifies the user who viewed the page.
    • page_url—which is of STRING type that captures the location of the page.
    • referer_url—which is of STRING type that captures the location of the page from where the user arrived at the current page.
    • IP—which is of STRING type that captures the IP address from where the page request was made.

  • Partitions: Each Table can have one or more partition keys which determine how the data is stored. Partitions—apart from being storage units—also allow the user to efficiently identify the rows that satisfy a specified criterion; for example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience; it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns; they are not part of the data itself but are derived on load.

  • Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_view table. These can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
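
As an illustration (a sketch that is not part of the original tutorial, using the hypothetical date_partition and country_partition columns mentioned above), a query that restricts its WHERE clause to partition columns lets Hive read only the matching partition instead of scanning the whole table:

SELECT pv.page_url, pv.userid
FROM page_views pv
WHERE pv.date_partition = '2009-12-23'    -- partition column: prunes to a single date
  AND pv.country_partition = 'US';        -- partition column: prunes to a single country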

1.5 Type System

Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.

1.5.1 Primitive Types

Types are associated with the columns in the tables. The following Primitive types are supported:

  • Integers

    • TINYINT—1 byte integer
    • SMALLINT—2 byte integer
    • INT—4 byte integer
    • BIGINT—8 byte integer
  • Boolean type

    • BOOLEAN—TRUE/FALSE
  • Floating point numbers

    • FLOAT—single precision
    • DOUBLE—Double precision
  • Fixed point numbers

    • DECIMAL—a fixed point value of user defined scale and precision
  • String types

    • STRING—sequence of characters in a specified character set
    • VARCHAR—sequence of characters in a specified character set with a maximum length
    • CHAR—sequence of characters in a specified character set with a defined length
  • Date and time types

    • TIMESTAMP — A date and time without a timezone ("LocalDateTime" semantics)
    • TIMESTAMP WITH LOCAL TIME ZONE — A point in time measured down to nanoseconds ("Instant" semantics)
    • DATE—a date
  • Binary types

    • BINARY—a sequence of bytes

The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):

  • Type
    • Primitive Type
      • Number
        • DOUBLE
          • FLOAT
            • BIGINT
              • INT
                • SMALLINT
                  • TINYINT
          • STRING
      • BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.

Explicit type conversion can be done using the cast operator as shown in the #Built In Functions section below.
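
As a small illustration (an assumed example, not from the original text): suppose a column amount is declared as STRING in a hypothetical sales table. It can still be compared with numbers because STRING is implicitly converted to DOUBLE according to the hierarchy above, while cast performs an explicit conversion:

-- Implicit conversion: the STRING column amount is promoted to DOUBLE
-- before the comparison is evaluated.
SELECT * FROM sales WHERE amount > 100.0;

-- Explicit conversion with the cast operator.
SELECT cast(amount AS DOUBLE) + 1.5 FROM sales;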

1.5.2 Complex Types

Complex Types can be built up from primitive types and other composite types using:

  • Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.

  • Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].

  • Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:

  • gender—which is a STRING.
  • active—which is a BOOLEAN.
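
A short sketch (table and column names are assumed, not from the original) showing how such nested types are declared and accessed together:

CREATE TABLE user_profiles (
  userid     BIGINT,
  info       STRUCT<gender:STRING, active:BOOLEAN>,  -- struct, accessed with '.'
  properties MAP<STRING, STRING>,                    -- map, accessed with ['key']
  friends    ARRAY<BIGINT>                           -- array, accessed with [index]
);

SELECT info.gender,              -- struct field
       properties['page type'],  -- map value by key
       friends[0]                -- first array element (zero-based)
FROM user_profiles;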

1.5.3 Timestamp

Timestamps have been the source of much confusion, so we try to document the intended semantics of Hive.

Timestamp ("LocalDateTime" semantics)

Java's "LocalDateTime" timestamps record a date and time as year, month, date, hour, minute, and seconds without a timezone. These timestamps always have those same values regardless of the local time zone.

For example, the timestamp value of "2014-12-12 12:34:56" is decomposed into year, month, day, hour, minute and seconds fields, but with no time zone information available. It does not correspond to any specific instant. It will always be the same value regardless of the local time zone. Unless your application uses UTC consistently, timestamp with local time zone is strongly preferred over timestamp for most applications. When users say an event is at 10:00, it is always in reference to a certain timezone and means a point in time, rather than 10:00 in an arbitrary time zone.

Timestamp with local time zone ("Instant" semantics)

Java's "Instant" timestamps define a point in time that remains constant regardless of where the data is read. Thus, the timestamp will be adjusted by the local time zone to match the original point in time.

Type                            Value in America/Los_Angeles  Value in America/New_York
timestamp                       2014-12-12 12:34:56           2014-12-12 12:34:56
timestamp with local time zone  2014-12-12 12:34:56           2014-12-12 15:34:56
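
A minimal sketch of the difference (assuming a Hive version that supports both types; TIMESTAMP WITH LOCAL TIME ZONE was added in Hive 3.x, and the table and session settings below are hypothetical):

CREATE TABLE events (
  ts     TIMESTAMP,                       -- "LocalDateTime" semantics
  ts_ltz TIMESTAMP WITH LOCAL TIME ZONE   -- "Instant" semantics
);

-- Written from a session whose local time zone is America/Los_Angeles:
INSERT INTO events VALUES ('2014-12-12 12:34:56', '2014-12-12 12:34:56');

-- Read back from a session in America/New_York: ts still displays 12:34:56,
-- while ts_ltz displays 15:34:56 for the same stored instant.
SELECT ts, ts_ltz FROM events;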

Comparisons with other tools

For the comparison table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

1.6 Built In Operators and Functions

The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Case-insensitive. All Hive keywords are case-insensitive, including the names of Hive operators and functions.

1.6.1 Built In Operators

Relational Operators: The following operators compare the passed operands and generate a TRUE or FALSE value, depending on whether the comparison between the operands holds or not. For the operator table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

Arithmetic Operators: The following operators support various common arithmetic operations on the operands. All of them return number types. For the operator table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

Logical Operators: The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands. For the operator table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

Operators on Complex Types: The following operators provide mechanisms to access elements in Complex Types. For the operator table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
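
A combined sketch (not from the original; it reuses the page_view columns above and assumes properties and friends are complex columns) showing relational, logical, and complex-type operators in one WHERE clause:

SELECT pv.userid
FROM page_view pv
WHERE pv.ip IS NOT NULL                        -- relational: null check
  AND (pv.viewTime >= 100 OR pv.userid = 0)    -- relational operators combined with OR
  AND pv.properties['page type'] = 'sports'    -- M['key'] on a map column
  AND pv.friends[0] <> pv.userid;              -- A[n] on an array column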

1.6.2 Built In Functions

Hive supports the following built in functions (function list in source code: FunctionRegistry.java). For the function table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions

The following built in aggregate functions are supported in Hive. For the function table, see the original document: https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-BuiltInOperatorsandFunctions
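
For illustration (an assumed example, not from the original; run the SHOW FUNCTIONS and DESCRIBE FUNCTION commands above for the authoritative list), a query mixing a few common built in and aggregate functions:

SELECT upper(pv.page_url)      AS url_upper,  -- string function
       length(pv.referrer_url) AS ref_len,    -- string function
       unix_timestamp()        AS now_epoch   -- date function
FROM page_view pv;

SELECT pv.ip, count(1) AS views, count(DISTINCT pv.userid) AS users
FROM page_view pv
GROUP BY pv.ip;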

1.7 Language Capabilities

Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:

  • Ability to filter rows from a table using a WHERE clause.
  • Ability to select certain columns from the table using a SELECT clause.
  • Ability to do equi-joins between two tables.
  • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
  • Ability to store the results of a query into another table.
  • Ability to download the contents of a table to a local (for example, NFS) directory.
  • Ability to store the results of a query in a hadoop dfs directory.
  • Ability to manage tables and partitions (create, drop and alter).
  • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.

2. Usage and Examples

NOTE: Many of the following examples are out of date. More up to date information can be found in the LanguageManual.

The following examples highlight some salient features of the system. A detailed set of query test cases can be found at Hive Query Test Cases and the corresponding results can be found at Query Test Case Results.

  • Creating, Showing, Altering, and Dropping Tables
  • Loading Data
  • Querying and Inserting Data

2.1 Creating, Showing, Altering, and Dropping Tables

See Hive Data Definition Language for detailed information about creating, showing, altering, and dropping tables.

2.1.1 Creating Tables

An example statement that would create the page_view table mentioned above would be like:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;

In this example, the columns of the table are specified with the corresponding types. Comments can be attached both at the column level as well as at the table level. Additionally, the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. When specified in this way, the data in the files is assumed to be delimited with ASCII 001(ctrl-A) as the field delimiter and newline as the row delimiter.

The field delimiter can be parametrized if the data is not in the above format as illustrated in the following example:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
STORED AS SEQUENCEFILE;

The row delimiter currently cannot be changed since it is not determined by Hive but by Hadoop delimiters.

It is also a good idea to bucket the tables on certain columns so that efficient sampling queries can be executed against the data set. If bucketing is absent, random sampling can still be done on the table but it is not efficient as the query has to scan all the data. The following example illustrates the case of the page_view table that is bucketed on the userid column:

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In the example above, the table is clustered by a hash function of userid into 32 buckets. Within each bucket the data is sorted in increasing order of viewTime. Such an organization allows the user to do efficient sampling on the clustered column—in this case userid. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries with greater efficiency.

CREATE TABLE page_view(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

In this example, the columns that comprise the table row are specified in a similar way as the definition of types. Comments can be attached both at the column level as well as at the table level. Additionally, the partitioned by clause defines the partitioning columns which are different from the data columns and are actually not stored with the data. The CLUSTERED BY clause specifies which column to use for bucketing as well as how many buckets to create. The delimited row format specifies how the rows are stored in the hive table. In the case of the delimited format, this specifies how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated. STORED AS SEQUENCEFILE indicates that this data is stored in a binary format (using hadoop SequenceFiles) on hdfs. The values shown for the ROW FORMAT and STORED AS clauses in the above example represent the system defaults.

Table names and column names are case insensitive.

2.1.2 Browsing Tables and Partitions

SHOW TABLES;

To list existing tables in the warehouse; there are many of these, likely more than you want to browse.

SHOW TABLES 'page.*';

To list tables with prefix 'page'. The pattern follows Java regular expression syntax (so the period is a wildcard).

SHOW PARTITIONS page_view;

To list partitions of a table. If the table is not a partitioned table then an error is thrown.

DESCRIBE page_view;

To list columns and column types of table.

DESCRIBE EXTENDED page_view;

To list columns and all other properties of a table. This prints a lot of information, and not in a pretty format. Usually used for debugging.

DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');

To list columns and all other properties of a partition. This also prints a lot of information, which is usually used for debugging.

2.1.3 Altering Tables

To rename existing table to a new name. If a table with new name already exists then an error is returned:

ALTER TABLE old_table_name RENAME TO new_table_name;

To rename the columns of an existing table. Be sure to use the same column types, and to include an entry for each preexisting column:

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

To add columns to an existing table:

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Note that a change in the schema (such as the adding of the columns) preserves the schema for the old partitions of the table in case it is a partitioned table. All the queries that access these columns and run over the old partitions implicitly return a null value or the specified default values for these columns.

In the later versions, we can make the behavior of assuming certain values, as opposed to throwing an error in case the column is not found in a particular partition, configurable.
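
A sketch of that behavior (the session_id column is hypothetical and not part of the original example):

ALTER TABLE page_view ADD COLUMNS (session_id STRING COMMENT 'added after old partitions were written');

-- Rows read from partitions loaded before the schema change return NULL
-- for session_id; newly loaded partitions can populate it.
SELECT dt, country, session_id
FROM page_view
WHERE dt = '2008-06-08';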

2.1.4 Dropping Tables and Partitions

Dropping tables is fairly trivial. A drop on the table would implicitly drop any indexes (this is a future feature) that would have been built on the table. The associated command is:

DROP TABLE pv_users;

To drop a partition, alter the table to drop the partition:

ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')

Note that any data for this table or partitions will be dropped and may not be recoverable.

2.2 Loading Data

There are multiple ways to load data into Hive tables. The user can create an external table that points to a specified location within HDFS. In this particular usage, the user can copy a file into the specified location using the HDFS put or copy commands and create a table pointing to this location with all the relevant row format information. Once this is done, the user can transform the data and insert them into any other Hive table. For example, if the file /tmp/pv_2008-06-08.txt contains comma separated page views served on 2008-06-08, and this needs to be loaded into the page_view table in the appropriate partition, the following sequence of commands can achieve this:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
 
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
 
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip
WHERE pvs.country = 'US';

This code results in an error due to a LINES TERMINATED BY limitation:

FAILED: SemanticException 6:67 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''12''

See [HIVE-5999](https://issues.apache.org/jira/browse/HIVE-5999) - Allow other characters for LINES TERMINATED BY, and [HIVE-11996](https://issues.apache.org/jira/browse/HIVE-11996) - Row Delimiter other than '\n' throws error in Hive.
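
A possible workaround (a sketch, not from the original tutorial) is to keep the default newline row delimiter and override only the field delimiter, since only '\n' is accepted for rows:

CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
                page_url STRING, referrer_url STRING,
                ip STRING COMMENT 'IP Address of the User',
                country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- same as ASCII 44; rows end with the default '\n'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';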

In the example above, nulls are inserted for the array and map types in the destination tables but potentially these can also come from the external table if the proper row formats are specified.

This method is useful if there is already legacy data in HDFS on which the user wants to put some metadata so that the data can be queried and manipulated using Hive.

Additionally, the system also supports syntax that can load the data from a file in the local files system directly into a Hive table where the input data format is the same as the table format. If /tmp/pv_2008-06-08_us.txt already contains the data for US, then we do not need any additional filtering as shown in the previous example. The load in this case can be done using the following syntax:

LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

The path argument can take a directory (in which case all the files in the directory are loaded), a single file name, or a wildcard (in which case all the matching files are uploaded). If the argument is a directory, it cannot contain subdirectories. Similarly, the wildcard must match file names only.
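
For example (hypothetical paths, shown only to illustrate the accepted argument forms; the partition spec must name the table's partition columns):

-- A single file:
LOAD DATA INPATH '/user/data/staging/pv_2008-06-08_us.txt'
  INTO TABLE page_view PARTITION(date='2008-06-08', country='US');

-- A directory: every file directly inside it is loaded (no subdirectories allowed):
LOAD DATA INPATH '/user/data/staging/us_views/'
  INTO TABLE page_view PARTITION(date='2008-06-08', country='US');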

In the case that the input file /tmp/pv_2008-06-08_us.txt is very large, the user may decide to do a parallel load of the data (using tools that are external to Hive). Once the file is in HDFS - the following syntax can be used to load the data into a Hive table:

LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')

It is assumed that the array and map fields in the input.txt files are null fields for these examples.

See Hive Data Manipulation Language for more information about loading data into Hive tables, and see External Tables for another example of creating an external table.

2.3 Querying and Inserting Data

The Hive query operations are documented in Select, and the insert operations are documented in Inserting data into Hive Tables from queries and Writing data into the filesystem from queries.

2.3.1 Simple Query

For all the active users, one can use the query of the following form:

INSERT OVERWRITE TABLE user_active
SELECT user.*
FROM user
WHERE user.active = 1;

Note that unlike SQL, we always insert the results into a table. We will illustrate later how the user can inspect these results and even dump them to a local file. You can also run the following query in Beeline or the Hive CLI:

SELECT user.*
FROM user
WHERE user.active = 1;

This will be internally rewritten to some temporary file and displayed to the Hive client side.

2.3.2 Partition Based Query

What partitions to use in a query is determined automatically by the system on the basis of where clause conditions on partition columns. For example, in order to get all the page_views in the month of 03/2008 referred from domain xyz.com, one could write the following query:

INSERT OVERWRITE TABLE xyz_com_page_views
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31' AND
      page_views.referrer_url like '%xyz.com';

Note that page_views.date is used here because the table (above) was defined with PARTITIONED BY(date DATETIME, country STRING) ; if you name your partition something different, don't expect .date to do what you think!

2.3.3 Joins

In order to get a demographic breakdown (by gender) of page_view of 2008-03-03 one would need to join the page_view table and the user table on the userid column. This can be accomplished with a join as shown in the following query:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to do outer joins the user can qualify the join with LEFT OUTER, RIGHT OUTER or FULL OUTER keywords in order to indicate the kind of outer join (left preserved, right preserved or both sides preserved). For example, in order to do a full outer join in the query above, the corresponding syntax would look like the following query:

INSERT OVERWRITE TABLE pv_users
SELECT pv.*, u.gender, u.age
FROM user u FULL OUTER JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.

INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';

In order to join more than one table, the user can use the following syntax:

INSERT OVERWRITE TABLE pv_friends
SELECT pv.*, u.gender, u.age, f.friends
FROM page_view pv JOIN user u ON (pv.userid = u.id) JOIN friend_list f ON (u.id = f.uid)
WHERE pv.date = '2008-03-03';

Note that Hive only supports equi-joins. Also it is best to put the largest table on the rightmost side of the join to get the best performance.

2.3.4 Aggregations

In order to count the number of distinct users by gender one could write the following query:

INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count (DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

Multiple aggregations can be done at the same time; however, no two aggregations can have different DISTINCT columns. For example, while the following is possible:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(*), sum(DISTINCT pv_users.userid)
FROM pv_users
GROUP BY pv_users.gender;

however, the following query is not allowed:

INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip)
FROM pv_users
GROUP BY pv_users.gender;

2.3.5 Multi Table/File Inserts

The output of the aggregations or simple selects can be further sent into multiple tables or even to hadoop dfs files (which can then be manipulated using hdfs utilities). For example, if along with the gender breakdown, one needed to find the breakdown of unique page views by age, one could accomplish that with the following query:

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
 
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

The first insert clause sends the results of the first group by to a Hive table while the second one sends the results to a hadoop dfs file.

2.3.6 Dynamic-Partition Insert

In the previous examples, the user has to know which partition to insert into, and only one partition can be inserted in one insert statement. If you want to load into multiple partitions, you have to use a multi-insert statement as illustrated below.

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';

In order to load data into all country partitions in a particular day, you have to add an insert statement for each country in the input data. This is very inconvenient since you have to have prior knowledge of the list of countries that exist in the input data and create the partitions beforehand. If the list changes for another day, you have to modify your insert DML as well as the partition creation DDLs. It is also inefficient since each insert statement may be turned into a MapReduce job.

Dynamic-partition insert (or multi-partition insert) is designed to solve this problem by dynamically determining which partitions should be created and populated while scanning the input table. This is a newly added feature that is only available from version 0.6.0. In the dynamic partition insert, the input column values are evaluated to determine which partition this row should be inserted into. If that partition has not been created, it will create that partition automatically. Using this feature you need only one insert statement to create and populate all necessary partitions. In addition, since there is only one insert statement, there is only one corresponding MapReduce job. This significantly improves performance and reduces the Hadoop cluster workload compared to the multiple insert case.

Below is an example of loading data to all country partitions using one insert statement:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country

There are several syntactic differences from the multi-insert statement:

  • country appears in the PARTITION specification, but with no value associated. In this case, country is a dynamic partition column. On the other hand, dt has a value associated with it, which means it is a static partition column. If a column is a dynamic partition column, its value will be coming from the input column. Currently we only allow dynamic partition columns to be the last column(s) in the partition clause because the partition column order indicates its hierarchical order (meaning dt is the root partition, and country is the child partition). You cannot specify a partition clause with (dt, country='US') because that means you need to update all partitions with any date whose country sub-partition is 'US'.

  • An additional pvs.country column is added in the select statement. This is the corresponding input column for the dynamic partition column. Note that you do not need to add an input column for the static partition column because its value is already known in the PARTITION clause. Note that the dynamic partition values are selected by ordering, not name, and taken as the last columns from the select clause.

Semantics of the dynamic partition insert statement:

  • When there are already non-empty partitions existing for the dynamic partition columns (for example, country='CA' exists under some ds root partition), it will be overwritten if the dynamic partition insert saw the same value (say 'CA') in the input data. This is in line with the 'insert overwrite' semantics. However, if the partition value 'CA' does not appear in the input data, the existing partition will not be overwritten.

  • Since a Hive partition corresponds to a directory in HDFS, the partition value has to conform to the HDFS path format (URI in Java). Any character having a special meaning in URI (for example, '%', ':', '/', '#') will be escaped with '%' followed by 2 bytes of its ASCII value.

  • If the input column is a type different than STRING, its value will be first converted to STRING to be used to construct the HDFS path.

  • If the input column value is NULL or empty string, the row will be put into a special partition, whose name is controlled by the hive parameter hive.exec.default.partition.name. The default value is __HIVE_DEFAULT_PARTITION__. Basically this partition will contain all "bad" rows whose value are not valid partition names. The caveat of this approach is that the bad value will be lost and is replaced by __HIVE_DEFAULT_PARTITION__ if you select them in Hive. JIRA HIVE-1309 is a solution to let the user specify "bad file" to retain the input partition column values as well.

  • Dynamic partition insert could potentially be a resource hog in that it could generate a large number of partitions in a short time. To get yourself buckled, we define three parameters:

    • hive.exec.max.dynamic.partitions.pernode (default value being 100) is the maximum number of dynamic partitions that can be created by each mapper or reducer. If one mapper or reducer creates more than that threshold, a fatal error will be raised from the mapper/reducer (through a counter) and the whole job will be killed.

    • hive.exec.max.dynamic.partitions (default value being 1000) is the total number of dynamic partitions that could be created by one DML. If each mapper/reducer did not exceed the limit but the total number of dynamic partitions does, then an exception is raised at the end of the job before the intermediate data are moved to the final destination.

    • hive.exec.max.created.files (default value being 100000) is the maximum total number of files created by all mappers and reducers. This is implemented by updating a Hadoop counter by each mapper/reducer whenever a new file is created. If the total number exceeds hive.exec.max.created.files, a fatal error will be thrown and the job will be killed.

Another situation we want to protect against with dynamic partition insert is that the user may accidentally specify all partitions to be dynamic partitions without specifying one static partition, while the original intention is to just overwrite the sub-partitions of one root partition. We define another parameter hive.exec.dynamic.partition.mode=strict to prevent the all-dynamic partition case. In the strict mode, you have to specify at least one static partition. The default mode is strict. In addition, we have a parameter hive.exec.dynamic.partition=true/false to control whether to allow dynamic partition at all. The default value is false prior to Hive 0.9.0 and true in Hive 0.9.0 and later.
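
A typical session setup before running a dynamic-partition insert might look like the following sketch (the first two SET statements change the defaults described above; the three limits are shown at their default values and are only illustrative):

SET hive.exec.dynamic.partition = true;              -- enable dynamic partition insert
SET hive.exec.dynamic.partition.mode = nonstrict;    -- allow all partition columns to be dynamic
SET hive.exec.max.dynamic.partitions.pernode = 100;
SET hive.exec.max.dynamic.partitions = 1000;
SET hive.exec.max.created.files = 100000;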

In Hive 0.6, dynamic partition insert does not work with hive.merge.mapfiles=true or hive.merge.mapredfiles=true, so it internally turns off the merge parameters. Merging files in dynamic partition inserts are supported in Hive 0.7 (see JIRA HIVE-1307 for details).

Troubleshooting and best practices:

As stated above, if there are too many dynamic partitions created by a particular mapper/reducer, a fatal error could be raised and the job will be killed. The error message looks something like:

beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country;
...
2010-05-07 11:10:19,816 Stage-1 map = 0%,  reduce = 0%
[Fatal Error] Operator FS_28 (id=41): fatal error. Killing the job.
Ended Job = job_201005052204_28178 with errors
...

The problem with this is that one mapper will take a random set of rows and it is very likely that the number of distinct (dt, country) pairs will exceed the limit of hive.exec.max.dynamic.partitions.pernode. One way around it is to group the rows by the dynamic partition columns in the mapper and distribute them to the reducers where the dynamic partitions will be created. In this case the number of distinct dynamic partitions will be significantly reduced. The above example query could be rewritten to:

beeline> set hive.exec.dynamic.partition.mode=nonstrict;
beeline> FROM page_view_stg pvs
      INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
             SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
                    from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
             DISTRIBUTE BY ds, country;

This query will generate a MapReduce job rather than Map-only job. The SELECT-clause will be converted to a plan to the mappers and the output will be distributed to the reducers based on the value of (ds, country) pairs. The INSERT-clause will be converted to the plan in the reducer which writes to the dynamic partitions.

Additional documentation:

2.3.7 Inserting into Local Files

In certain situations you would want to write the output into a local file so that you could load it into an excel spreadsheet. This can be accomplished with the following command:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_gender_sum'
SELECT pv_gender_sum.*
FROM pv_gender_sum;

2.3.8 Sampling

The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the columns that are specified in the CLUSTERED BY clause of the CREATE TABLE statement. In the following example we choose the 3rd bucket out of the 32 buckets of the pv_gender_sum table:

INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

y has to be a multiple or divisor of the number of buckets in that table as specified at the table creation time. The buckets chosen are determined by whether bucket_number modulo y is equal to x. So in the above example the following tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 16)

would pick out the 3rd and 19th buckets. The buckets are numbered starting from 0.

On the other hand the tablesample clause

TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

would pick out half of the 3rd bucket.
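
Put into a complete query (a sketch, not from the original), the ON form reads only the required buckets of the bucketed column:

SELECT s.userid, s.page_url
FROM page_view TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid) s
WHERE s.dt = '2008-06-08';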

2.3.9 Union All

The language also supports union all, for example, if we suppose there are two different tables that track which user has published a video and which user has published a comment, the following query joins the results of a union all with the user table to create a single annotated stream for all the video publishing and comment publishing events:

INSERT OVERWRITE TABLE actions_users
SELECT u.id, actions.date
FROM (
    SELECT av.uid AS uid
    FROM action_video av
    WHERE av.date = '2008-06-03'
 
    UNION ALL
 
    SELECT ac.uid AS uid
    FROM action_comment ac
    WHERE ac.date = '2008-06-03'
    ) actions JOIN users u ON(u.id = actions.uid);

2.3.10 Array Operations

Array columns in tables can be declared as follows:

CREATE TABLE array_table (int_array_column ARRAY<INT>);

Assuming that pv.friends is of the type ARRAY<INT> (i.e. it is an array of integers), the user can get a specific element in the array by its index as shown in the following command:

SELECT pv.friends[2]
FROM page_views pv;

The select expression gets the third item in the pv.friends array.

The user can also get the length of the array using the size function as shown below:

SELECT pv.userid, size(pv.friends)
FROM page_view pv;

2.3.11 Map (Associative Arrays) Operations

Maps provide collections similar to associative arrays. Such structures can only be created programmatically currently. We will be extending this soon. For the purpose of the current example assume that pv.properties is of the type map<String, String> i.e. it is an associative array from strings to string. Accordingly, the following query:

INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type']
FROM page_views pv;

can be used to select the 'page type' property from the page_views table.

Similar to arrays, the size function can also be used to get the number of elements in a map as shown in the following query:

SELECT size(pv.properties)
FROM page_view pv;

2.3.12 Custom Map/Reduce Scripts

Users can also plug in their own custom mappers and reducers in the data stream by using features natively supported in the Hive language. For example, in order to run a custom mapper script - map_script - and a custom reducer script - reduce_script - the user can issue the following command which uses the TRANSFORM clause to embed the mapper and the reducer scripts.

Note that columns will be transformed to string and delimited by TAB before feeding to the user script, and the standard output of the user script will be treated as TAB-separated string columns. User scripts can output debug information to standard error which will be shown on the task detail page on hadoop.

FROM (
     FROM pv_users
     MAP pv_users.userid, pv_users.date
     USING 'map_script'
     AS dt, uid
     CLUSTER BY dt) map_output
 
 INSERT OVERWRITE TABLE pv_users_reduced
     REDUCE map_output.dt, map_output.uid
     USING 'reduce_script'
     AS date, count;

Sample map script (weekday_mapper.py):

import sys
import datetime
 
for line in sys.stdin:
  line = line.strip()
  userid, unixtime = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print(','.join([userid, str(weekday)]))

Of course, both MAP and REDUCE are "syntactic sugar" for the more general select transform. The inner query could also have been written as such:

SELECT TRANSFORM(pv_users.userid, pv_users.date) USING 'map_script' AS dt, uid CLUSTER BY dt FROM pv_users;

Schema-less map/reduce: If there is no "AS" clause after "USING map_script", Hive assumes the output of the script contains 2 parts: key which is before the first tab, and value which is the rest after the first tab. Note that this is different from specifying "AS key, value" because in that case value will only contains the portion between the first tab and the second tab if there are multiple tabs.

In this way, we allow users to migrate old map/reduce scripts without knowing the schema of the map output. User still needs to know the reduce output schema because that has to match what is in the table that we are inserting to.

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key) map_output
 
INSERT OVERWRITE TABLE pv_users_reduced
    REDUCE map_output.key, map_output.value
    USING 'reduce_script'
    AS date, count;

Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.

FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    AS c1, c2, c3
    DISTRIBUTE BY c2
    SORT BY c2, c1) map_output
 
INSERT OVERWRITE TABLE pv_users_reduced
 
    REDUCE map_output.c1, map_output.c2, map_output.c3
    USING 'reduce_script'
    AS date, count;

2.3.13 Co-Groups

Amongst the user community using map/reduce, cogroup is a fairly common operation wherein the data from multiple tables are sent to a custom reducer such that the rows are grouped by the values of certain columns on the tables. With the UNION ALL operator and the CLUSTER BY specification, this can be achieved in the Hive query language in the following way. Suppose we wanted to cogroup the rows from the actions_video and action_comments table on the uid column and send them to the 'reduce_script' custom reducer, the following syntax can be used by the user:

FROM (
     FROM (
             FROM action_video av
             SELECT av.uid AS uid, av.id AS id, av.date AS date
 
            UNION ALL
 
             FROM action_comment ac
             SELECT ac.uid AS uid, ac.id AS id, ac.date AS date
     ) union_actions
     SELECT union_actions.uid, union_actions.id, union_actions.date
     CLUSTER BY union_actions.uid) map
 
 INSERT OVERWRITE TABLE actions_reduced
     SELECT TRANSFORM(map.uid, map.id, map.date) USING 'reduce_script' AS (uid, id, reduced_val);