DML: Load, Insert, Update, Delete (reading notes on the official documentation)

Original documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML

There are multiple ways to modify data in Hive:

  • LOAD

  • INSERT

    • into Hive tables from queries

    • into directories from queries

    • into Hive tables from SQL

  • UPDATE

  • DELETE

  • MERGE

EXPORT and IMPORT commands are also available (as of Hive 0.8).

1. Loading files into tables

Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move datafiles into locations corresponding to Hive tables.

1.1 Syntax

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
 
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] [INPUTFORMAT 'inputformat' SERDE 'serde'] (3.0 or later)

1.2 Synopsis

Load operations prior to Hive 3.0 are pure copy/move operations that move datafiles into locations corresponding to Hive tables.

filepath can be:

  • a relative path, such as project/data1

  • an absolute path, such as /user/hive/project/data1

  • a full URI with scheme and (optionally) an authority, such as hdfs://namenode:9000/user/hive/project/data1

The target being loaded to can be a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns.

filepath can refer to a file (in which case Hive will move the file into the table) or it can be a directory (in which case Hive will move all the files within that directory into the table). In either case, filepath addresses a set of files.

If the keyword LOCAL is specified, then:

  • the load command will look for filepath in the local file system. If a relative path is specified, it will be interpreted relative to the user's current working directory. The user can specify a full URI for local files as well - for example: file:///user/hive/project/data1

  • the load command will try to copy all the files addressed by filepath to the target filesystem. The target file system is inferred by looking at the location attribute of the table. The copied data files will then be moved to the table.

  • Note: if you run this command against a HiveServer2 instance, then the local path refers to a path on the HiveServer2 instance. HiveServer2 must have the proper permissions to access that file.
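
A minimal sketch of a local load (the path and table name are illustrative, not from the original):

-- copy a file from HiveServer2's local file system into the table
LOAD DATA LOCAL INPATH '/tmp/data1' INTO TABLE tab1;

-- the same load written with an explicit local URI
LOAD DATA LOCAL INPATH 'file:///tmp/data1' INTO TABLE tab1;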

If the keyword LOCAL is not specified, then Hive will either use the full URI of filepath, if one is specified, or will apply the following rules:

  • If scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variable fs.default.name that specifies the Namenode URI.

  • If the path is not absolute, then Hive will interpret it relative to /user/<username>.

  • Hive will move the files addressed by filepath into the table (or partition).

If the OVERWRITE keyword is used then the contents of the target table (or partition) will be deleted and replaced by the files referred to by filepath; otherwise the files referred by filepath will be added to the table.
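
For instance, a hedged sketch of the difference (paths and table name are illustrative):

-- replaces the current contents of the target table
LOAD DATA INPATH '/user/hive/staging/data1' OVERWRITE INTO TABLE tab1;

-- appends to whatever the table already holds
LOAD DATA INPATH '/user/hive/staging/data2' INTO TABLE tab1;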

Additional load operations are supported by Hive 3.0 onwards, as Hive internally rewrites the load into an INSERT AS SELECT.

If the table has partitions but the load command does not specify them, the load will be converted into an INSERT AS SELECT, assuming that the last set of columns are partition columns. It will throw an error if the file does not conform to the expected schema.

If the table is bucketed then the following rules apply:

  • In strict mode: launches an INSERT AS SELECT job.

  • In non-strict mode: if the file names conform to the naming convention (if the file belongs to bucket 0, it should be named 000000_0 or 000000_0_copy_1, or if it belongs to bucket 2 the names should be like 000002_0 or 000002_0_copy_3, etc.), then it will be a pure copy/move operation; otherwise it will launch an INSERT AS SELECT job.

filepath can contain subdirectories, provided each file conforms to the schema.

inputformat can be any Hive input format such as text, ORC, etc.

serde can be the associated Hive SERDE.

Both inputformat and serde are case sensitive.
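
As an illustration of the Hive 3.0 form, a hedged sketch (the path is illustrative; the input format and SerDe class names are the standard Hadoop/Hive text classes, shown only as an example):

LOAD DATA INPATH '/user/hive/staging/data1' INTO TABLE tab1
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';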

Example of such a schema:

CREATE TABLE tab1 (col1 int, col2 int) PARTITIONED BY (col3 int) STORED AS ORC;

LOAD DATA LOCAL INPATH 'filepath' INTO TABLE tab1;

Here, partition information is missing, which would otherwise give an error; however, if the file(s) located at filepath conform to the table schema such that each row ends with the partition column(s), then the load will be rewritten into an INSERT AS SELECT job.

The uncompressed data should look like this:

(1,2,3), (2,3,4), (4,5,3) etc.

1.3 Notes

  • filepath cannot contain subdirectories (except for Hive 3.0 or later, as described above).

  • If the keyword LOCAL is not given, filepath must refer to files within the same filesystem as the table's (or partition's) location.

  • Hive does some minimal checks to make sure that the files being loaded match the target table. Currently it checks that if the table is stored in sequencefile format, the files being loaded are also sequencefiles, and vice versa.

  • A bug that prevented loading a file when its name includes the "+" character is fixed in release 0.13.0 (HIVE-6048).

  • Please read CompressedStorage if your datafile is compressed.

2. Inserting data into Hive Tables from queries

Query results can be inserted into tables by using the insert clause.

2.1 Syntax

-- Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
 
-- Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
 
-- Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

2.2 Synopsis

INSERT OVERWRITE will overwrite any existing data in the table or partition:

  • unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0).

  • As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true") the previous data of the table is not moved to Trash when an INSERT OVERWRITE query is run against the table. This functionality is applicable only for managed tables and is turned off when the "auto.purge" property is unset or set to false.
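
For example, a minimal sketch of enabling this behavior on an existing managed table (table names are illustrative):

ALTER TABLE page_view SET TBLPROPERTIES ("auto.purge"="true");

-- subsequent INSERT OVERWRITE runs on page_view skip the move to Trash
INSERT OVERWRITE TABLE page_view SELECT * FROM page_view_stg;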

INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8.)

As of Hive 0.13.0, a table can be made immutable by creating it with TBLPROPERTIES ("immutable"="true"). The default is "immutable"="false".

INSERT INTO behavior into an immutable table is disallowed if any data is already present, although INSERT INTO still works if the immutable table is empty. The behavior of INSERT OVERWRITE is not affected by the "immutable" table property.

An immutable table is protected against accidental updates due to a script loading data into it being run multiple times by mistake. The first insert into an immutable table succeeds and successive inserts fail, resulting in only one set of data in the table, instead of silently succeeding with multiple copies of the data in the table.
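
A short sketch of the immutable-table behavior (table and query are hypothetical):

CREATE TABLE daily_totals (dt STRING, total BIGINT)
  TBLPROPERTIES ("immutable"="true");

-- succeeds while the table is empty
INSERT INTO TABLE daily_totals SELECT dt, COUNT(*) FROM page_view GROUP BY dt;

-- running the same INSERT INTO again now fails;
-- INSERT OVERWRITE would still be allowed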

Inserts can be done to a table or a partition. If the table is partitioned, then one must specify a specific partition of the table by specifying values for all of the partitioning columns. If hive.typecheck.on.insert is set to true, these values are validated, converted and normalized to conform to their column types (Hive 0.12.0 onward).

Multiple insert clauses (also known as Multi Table Insert) can be specified in the same query.

The output of each of the select statements is written to the chosen table (or partition). Currently the OVERWRITE keyword is mandatory and implies that the contents of the chosen table or partition are replaced with the output of corresponding select statement.
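
As an illustration of multiple insert clauses, a hedged sketch (the target tables are hypothetical):

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view_us SELECT pvs.viewTime, pvs.userid, pvs.page_url WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view_other SELECT pvs.viewTime, pvs.userid, pvs.page_url WHERE pvs.country != 'US';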

The output format and serialization class is determined by the table's metadata (as specified via DDL commands on the table).

As of Hive 0.14, if a table has an OutputFormat that implements AcidOutputFormat and the system is configured to use a transaction manager that implements ACID, then INSERT OVERWRITE will be disabled for that table. This is to avoid users unintentionally overwriting transaction history. The same functionality can be achieved by using TRUNCATE TABLE (for non-partitioned tables) or DROP PARTITION followed by INSERT INTO.

As of Hive 1.1.0 the TABLE keyword is optional.

As of Hive 1.2.0 each INSERT INTO T can take a column list like INSERT INTO T (z, x, c1). See Description of HIVE-9481 for examples.
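
A brief sketch of the column-list form (hypothetical table; per HIVE-9481, columns left out of the list receive NULL):

CREATE TABLE T (z STRING, x INT, c1 INT);

-- only z and x are supplied; c1 is set to NULL
INSERT INTO T (z, x) VALUES ('key', 1);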

2.3 Notes

Multi Table Inserts minimize the number of data scans required. Hive can insert data into multiple tables by scanning the input data just once (and applying different query operators) to the input data.

Starting with Hive 0.13.0, the select statement can include one or more common table expressions (CTEs) as shown in the SELECT syntax. For an example, see Common Table Expression:
create table s1 like src;
with q1 as ( select key, value from src where key = '5')
from q1
insert overwrite table s1
select *;

2.4 Dynamic Partition Inserts

Version information: This information reflects the situation in Hive 0.12; dynamic partition inserts were added in Hive 0.6.

In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause. As of Hive 3.0.0 (HIVE-19083) there is no need to specify dynamic partition columns. Hive will automatically generate partition specification if it is not specified.

Dynamic partition inserts are disabled by default prior to Hive 0.9.0 and enabled by default in Hive 0.9.0 and later. These are the relevant configuration properties for dynamic partition inserts:

Configuration Property                     Default  Note
hive.exec.dynamic.partition                true     Needs to be set to true to enable dynamic partition inserts
hive.exec.dynamic.partition.mode           strict   In strict mode, the user must specify at least one static partition in case the user accidentally overwrites all partitions; in nonstrict mode all partitions are allowed to be dynamic
hive.exec.max.dynamic.partitions.pernode   100      Maximum number of dynamic partitions allowed to be created in each mapper/reducer node
hive.exec.max.dynamic.partitions           1000     Maximum number of dynamic partitions allowed to be created in total
hive.exec.max.created.files                100000   Maximum number of HDFS files created by all mappers/reducers in a MapReduce job
hive.error.on.empty.partition              false    Whether to throw an exception if dynamic partition insert generates empty results
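
A typical session-level setup for a fully dynamic insert, shown as a minimal sketch using the properties above:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;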

2.4.1 Example

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

Here the country partition will be dynamically created by the last column from the SELECT clause (i.e. pvs.cnt). Note that the name is not used. In nonstrict mode the dt partition could also be dynamically created.

2.4.2 Additional Documentation

3. Writing data into the filesystem from queries

Query results can be inserted into filesystem directories by using a slight variation of the syntax above:

3.1 Syntax

-- Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...
 
-- Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
 
  
-- row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

3.2 Synopsis

  • Directory can be a full URI. If scheme or authority are not specified, Hive will use the scheme and authority from the hadoop configuration variable fs.default.name that specifies the Namenode URI.

  • If the LOCAL keyword is used, Hive will write data to the directory on the local file system.

  • Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.
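
A hedged sketch of a local-directory export with a custom delimiter (the path is illustrative; ROW FORMAT requires Hive 0.11.0 or later):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/pv_export'
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  SELECT * FROM page_view;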

3.3 Notes

  • INSERT OVERWRITE statements to directories, local directories, and tables (or partitions) can all be used together within the same query.

  • INSERT OVERWRITE statements to HDFS filesystem directories are the best way to extract large amounts of data from Hive. Hive can write to HDFS directories in parallel from within a map-reduce job.

  • The directory is, as you would expect, OVERWRITten; in other words, if the specified path exists, it is clobbered and replaced with the output.

  • As of Hive 0.11.0 the separator used can be specified; in earlier versions it was always the ^A character (\001). However, custom separators are only supported for LOCAL writes in Hive versions 0.11.0 to 1.1.0; this bug is fixed in version 1.2.0 (see HIVE-5672).

  • In Hive 0.14, inserts into ACID compliant tables will deactivate vectorization for the duration of the select and insert. This will be done automatically. ACID tables that have data inserted into them can still be queried using vectorization.

4. Inserting values into tables from SQL

The INSERT...VALUES statement can be used to insert data into tables directly from SQL.

Version Information: INSERT...VALUES is available starting in Hive 0.14.

4.1 Syntax

-- Standard Syntax:
INSERT INTO TABLE tablename [PARTITION (partcol1[=val1], partcol2[=val2] ...)] VALUES values_row [, values_row ...]
	  
-- Where values_row is:
( value [, value ...] )

-- where a value is either null or any valid SQL literal

4.2 Synopsis

  • Each row listed in the VALUES clause is inserted into table tablename.

  • Values must be provided for every column in the table. The standard SQL syntax that allows the user to insert values into only some columns is not yet supported. To mimic the standard SQL, nulls can be provided for columns the user does not wish to assign a value to.

  • Dynamic partitioning is supported in the same way as for INSERT...SELECT.

  • If the table being inserted into supports ACID and a transaction manager that supports ACID is in use, this operation will be auto-committed upon successful completion.

  • Hive does not support literals for complex types (array, map, struct, union), so it is not possible to use them in INSERT INTO...VALUES clauses. This means that the user cannot insert data into a complex datatype column using the INSERT INTO...VALUES clause.
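
A common workaround, sketched here with a hypothetical table, is to construct the complex value with functions such as named_struct in an INSERT...SELECT instead of VALUES:

CREATE TABLE profiles (userid VARCHAR(64), address STRUCT<city:STRING, zip:STRING>) STORED AS ORC;

-- VALUES cannot express the struct, but a FROM-less SELECT can build it
INSERT INTO TABLE profiles
  SELECT 'jsmith', named_struct('city', 'Boston', 'zip', '02101');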

4.3 Examples

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2))
  CLUSTERED BY (age) INTO 2 BUCKETS STORED AS ORC;
 
INSERT INTO TABLE students
  VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
 
 
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, came_from STRING)
  PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS STORED AS ORC;
 
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23')
  VALUES ('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
 
-- dynamic partition insert
INSERT INTO TABLE pageviews PARTITION (datestamp)
  VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');
  
INSERT INTO TABLE pageviews
  VALUES ('tjohnson', 'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null, '2014-09-21');

5. Update

Version Information: UPDATE is available starting in Hive 0.14. Updates can only be performed on tables that support ACID. See Hive Transactions for details.

5.1 Syntax

UPDATE tablename SET column = value [, column = value ...] [WHERE expression]

5.2 Synopsis

  • The referenced column must be a column of the table being updated.

  • The value assigned must be an expression that Hive supports in the select clause. Thus arithmetic operators, UDFs, casts, literals, etc. are supported. Subqueries are not supported.

  • Only rows that match the WHERE clause will be updated.

  • Partitioning columns cannot be updated.

  • Bucketing columns cannot be updated.

  • In Hive 0.14, upon successful completion of this operation the changes will be auto-committed.
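
For instance, a minimal sketch reusing the students table from section 4.3 (assuming it was created with ACID support, e.g. TBLPROPERTIES ("transactional"="true")):

-- gpa is neither a partitioning nor a bucketing column, so it can be SET
UPDATE students SET gpa = 3.50 WHERE name = 'fred flintstone';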

5.3 Notes

  • Vectorization will be turned off for update operations. This is automatic and requires no action on the part of the user. Non-update operations are not affected. Updated tables can still be queried using vectorization.

  • In version 0.14 it is recommended that you set hive.optimize.sort.dynamic.partition=false when doing updates, as this produces more efficient execution plans.

6. Delete

Version Information: DELETE is available starting in Hive 0.14. Deletes can only be performed on tables that support ACID. See Hive Transactions for details.

6.1 Syntax

DELETE FROM tablename [WHERE expression]

6.2 Synopsis

  • Only rows that match the WHERE clause will be deleted.

  • In Hive 0.14, upon successful completion of this operation the changes will be auto-committed.
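
A short sketch, again assuming an ACID-enabled students table (the predicate is illustrative):

-- deletes only the matching rows; omitting WHERE would delete every row
DELETE FROM students WHERE gpa < 1.00;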

6.3 Notes

  • Vectorization will be turned off for delete operations. This is automatic and requires no action on the part of the user. Non-delete operations are not affected. Tables with deleted data can still be queried using vectorization.

  • In version 0.14 it is recommended that you set hive.optimize.sort.dynamic.partition=false when doing deletes, as this produces more efficient execution plans.

7. Merge

Version Information: MERGE is available starting in Hive 2.2. Merge can only be performed on tables that support ACID. See Hive Transactions for details.

7.1 Syntax

MERGE INTO <target table> AS T USING <source expression/table> AS S
ON <boolean expression1>
WHEN MATCHED [AND <boolean expression2>] THEN UPDATE SET <set clause list>
WHEN MATCHED [AND <boolean expression3>] THEN DELETE
WHEN NOT MATCHED [AND <boolean expression4>] THEN INSERT VALUES<value list>

7.2 Synopsis

Merge allows actions to be performed on a target table based on the results of a join with a source table.

In Hive 2.2, upon successful completion of this operation the changes will be auto-committed.

7.3 Performance Note

SQL Standard requires that an error is raised if the ON clause is such that more than 1 row in source matches a row in target. This check is computationally expensive and may affect the overall runtime of a MERGE statement significantly. hive.merge.cardinality.check=false may be used to disable the check at your own risk. If the check is disabled, but the statement has such a cross join effect, it may lead to data corruption.

7.4 Notes

  • 1, 2, or 3 WHEN clauses may be present; at most 1 of each type: UPDATE/DELETE/INSERT.

  • WHEN NOT MATCHED must be the last WHEN clause.

  • If both UPDATE and DELETE clauses are present, the first one in the statement must include [AND <boolean expression>].

  • Vectorization will be turned off for merge operations. This is automatic and requires no action on the part of the user. Non-delete operations are not affected. Tables with deleted data can still be queried using vectorization.

7.5 Examples

A worked example follows.

First, enable ACID support:

set hive.support.concurrency = true;
set hive.enforce.bucketing = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1;
set hive.auto.convert.join=false;
set hive.merge.cardinality.check=false;

Then create two tables, one as the target of the merge and one as the source.

Note that the target table must be bucketed, have transactions enabled, and be stored in ORC format.

CREATE TABLE transactions(
 ID int,
 TranValue string,
 last_update_user string)
PARTITIONED BY (tran_date string)
CLUSTERED BY (ID) into 5 buckets 
STORED AS ORC TBLPROPERTIES ('transactional'='true');

CREATE TABLE merge_source(
 ID int,
 TranValue string,
 tran_date string)
STORED AS ORC;

Then populate the target and source tables with some data:

INSERT INTO transactions PARTITION (tran_date) VALUES
(1, 'value_01', 'creation', '20170410'),
(2, 'value_02', 'creation', '20170410'),
(3, 'value_03', 'creation', '20170410'),
(4, 'value_04', 'creation', '20170410'),
(5, 'value_05', 'creation', '20170413'),
(6, 'value_06', 'creation', '20170413'),
(7, 'value_07', 'creation', '20170413'),
(8, 'value_08', 'creation', '20170413'),
(9, 'value_09', 'creation', '20170413'),
(10, 'value_10','creation', '20170413');

INSERT INTO merge_source VALUES 
(1, 'value_01', '20170410'),
(4, NULL, '20170410'),
(7, 'value_77777', '20170413'),
(8, NULL, '20170413'),
(8, 'value_08', '20170415'),
(11, 'value_11', '20170415');

When we examine the two tables, we expect that after the merge, row 1 stays the same, row 4 is deleted (implying a business rule: a null value means delete), row 7 is updated, and row 11 is inserted.

Row 8 involves moving a row from one partition to another. MERGE currently does not support changing a partition value dynamically; this has to be done as a delete from the old partition and an insert into the new partition. In a real use case, the source table needs to be built according to this criterion.

Then create the MERGE statement as shown below. Note that not all three WHEN clauses have to be present; two or even just one WHEN clause also works. We tag the data with different last_update_user values.

MERGE INTO transactions AS T 
USING merge_source AS S
ON T.ID = S.ID and T.tran_date = S.tran_date
WHEN MATCHED AND (T.TranValue != S.TranValue AND S.TranValue IS NOT NULL) THEN UPDATE SET TranValue = S.TranValue, last_update_user = 'merge_update'
WHEN MATCHED AND S.TranValue IS NULL THEN DELETE
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.TranValue, 'merge_insert', S.tran_date);

As part of the UPDATE clause, the SET value statement must not include the target table decorator "T.", otherwise you will get a SQL compile error.

Once the merge completes, re-examining the data shows it was merged as expected: row 1 is unchanged; row 4 is deleted; row 7 is updated; row 11 is inserted; and row 8 has been moved to a new partition.
