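Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides SQL, which lets users do ad-hoc querying, summarization, and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as User Defined Functions (UDFs).
Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.
For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.
Books about Hive lists some books that may also be helpful for getting started with Hive.
In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.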
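In order of granularity, Hive data is organized into databases, tables, partitions, and buckets. For example, a table recording page views could have the following columns:
timestamp: of INT type, corresponding to the UNIX timestamp of when the page was viewed.
userid: of BIGINT type, identifying the user who viewed the page.
page_url: of STRING type, capturing the location of the page.
referer_url: of STRING type, capturing the location of the page from which the user arrived at the current page.
IP: of STRING type, capturing the IP address from which the page request was made.
Note that tables need not be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.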
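To make the pruning idea concrete, here is a small Python sketch (not Hive code). It assumes a page-view table partitioned by a date string `dt`; the `PageView` and `PartitionedTable` classes, the `dt` partition key, and the sample rows are all hypothetical. A query that filters on the partition key only has to read the matching partition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PageView:
    # Columns mirror the example above; Python types stand in for Hive types.
    timestamp: int     # INT: UNIX timestamp of the view
    userid: int        # BIGINT
    page_url: str      # STRING
    referer_url: str   # STRING
    ip: str            # STRING


class PartitionedTable:
    """Toy table that stores rows grouped by a partition key (assumed: dt)."""

    def __init__(self):
        self.partitions = defaultdict(list)
        self.rows_scanned = 0

    def insert(self, dt, row):
        self.partitions[dt].append(row)

    def select(self, dt, predicate=lambda row: True):
        # Only the requested partition is read; all others are pruned.
        rows = self.partitions.get(dt, [])
        self.rows_scanned += len(rows)
        return [row for row in rows if predicate(row)]


table = PartitionedTable()
table.insert("2008-06-08", PageView(1212883200, 1, "/home", "/", "10.0.0.1"))
table.insert("2008-06-08", PageView(1212883260, 2, "/cart", "/home", "10.0.0.2"))
table.insert("2008-06-09", PageView(1212969600, 1, "/home", "/", "10.0.0.1"))

hits = table.select("2008-06-08", lambda r: r.page_url == "/home")
```

Although the table holds three rows, the query touches only the two rows stored under the requested partition, which is the effect that partitioning aims for at scale.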
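Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.
The types are organized in the following hierarchy (where the parent is a supertype of all of its children):
This type hierarchy defines how types are implicitly converted in the query language. Implicit conversion is allowed from a child type to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy also allows the implicit conversion of STRING to DOUBLE.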
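The conversion rule can be sketched in Python as follows. This is an illustration rather than Hive's implementation: the ordering of the numeric types in `NUMERIC_ORDER` is an assumption based on Hive's standard numeric types, and the function names are hypothetical.

```python
# Assumed widening order, narrowest first (an assumption, not taken from the text).
NUMERIC_ORDER = ["TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE"]


def can_implicitly_convert(source, target):
    """Return True if `source` may be implicitly converted to `target`."""
    if source == target:
        return True
    # Special case stated in the text: STRING converts to DOUBLE.
    if source == "STRING" and target == "DOUBLE":
        return True
    if source in NUMERIC_ORDER and target in NUMERIC_ORDER:
        # A child (narrower type) converts to any ancestor (wider type).
        return NUMERIC_ORDER.index(source) < NUMERIC_ORDER.index(target)
    return False


def common_parent(left, right):
    """Return the wider of two types, or raise if neither converts to the other."""
    if can_implicitly_convert(left, right):
        return right
    if can_implicitly_convert(right, left):
        return left
    raise TypeError(f"no common parent for {left} and {right}")
```

Under these assumptions, `can_implicitly_convert("INT", "BIGINT")` holds while the reverse direction does not, and `common_parent("FLOAT", "INT")` yields `"FLOAT"`.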
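Explicit type conversion can be done using the cast operator, as shown in the Built-in Functions section below.
Complex types can be built up from primitive types and other composite types using structs, maps (key-value pairs), and arrays.
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields: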
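As a rough illustration of how nesting composes, the Python helpers below build Hive-style type strings (`STRUCT<...>`, `ARRAY<...>`, `MAP<...>`). The helper names and the `emails` and `visits` fields are hypothetical and chosen only to show several levels of nesting.

```python
def array(element_type):
    """Type string for an array of `element_type`."""
    return f"ARRAY<{element_type}>"


def map_of(key_type, value_type):
    """Type string for a map from `key_type` to `value_type`."""
    return f"MAP<{key_type}, {value_type}>"


def struct(**fields):
    """Type string for a struct with the given named fields."""
    body = ", ".join(f"{name}:{type_}" for name, type_ in fields.items())
    return f"STRUCT<{body}>"


# A hypothetical User type mixing primitive and nested complex types.
user_type = struct(
    gender="STRING",
    active="BOOLEAN",
    emails=array("STRING"),
    visits=map_of("STRING", array("BIGINT")),
)
```

Because each helper accepts any type string, including one produced by another helper, the nesting depth is unbounded.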
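The operators and functions listed below are not necessarily up to date (Hive Operators and UDFs has more current information). In Beeline or the Hive CLI, use these commands to show the latest documentation: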
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
Case-insensitive
All Hive keywords are case-insensitive, including the names of Hive operators and functions.
| Relational Operator | Operand types | Description |
| --- | --- | --- |
| A = B | all primitive types | TRUE if expression A is equivalent to expression B; otherwise FALSE |
| A != B | all primitive types | TRUE if expression A is not equivalent to expression B; otherwise FALSE |
| A < B | all primitive types | TRUE if expression A is less than expression B; otherwise FALSE |
| A <= B | all primitive types | TRUE if expression A is less than or equal to expression B; otherwise FALSE |
| A > B | all primitive types | TRUE if expression A is greater than expression B; otherwise FALSE |
| A >= B | all primitive types | TRUE if expression A is greater than or equal to expression B; otherwise FALSE |
| A IS NULL | all types | TRUE if expression A evaluates to NULL; otherwise FALSE |
| A IS NOT NULL | all types | FALSE if expression A evaluates to NULL; otherwise TRUE |
| A LIKE B | strings | TRUE if string A matches the SQL simple regular expression B; otherwise FALSE. The comparison is done character by character. The _ character in B matches any single character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). |
| A RLIKE B | strings | NULL if A or B is NULL; TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax); otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to TRUE and so does 'foobar' rlike '^f.*r$'. |
| A REGEXP B | strings | Same as RLIKE |
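The difference between LIKE (whole-string match with `_` and `%` wildcards) and RLIKE (regular-expression search over any substring) can be sketched in Python. This is only an approximation: Python's `re` module stands in for Java regular expressions, escape characters in LIKE patterns are not handled, and the function names `sql_like` and `rlike` are hypothetical.

```python
import re


def sql_like(a, b):
    """Emulate A LIKE B: `_` matches one character, `%` matches any run."""
    parts = []
    for ch in b:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            # Every other character must match literally.
            parts.append(re.escape(ch))
    # LIKE has to match the whole string, hence fullmatch.
    return re.fullmatch("".join(parts), a, flags=re.DOTALL) is not None


def rlike(a, b):
    """Emulate A RLIKE B: None (NULL) if either side is None, else a substring search."""
    if a is None or b is None:
        return None
    return re.search(b, a) is not None
```

With these definitions, `sql_like('foobar', 'foo')` is false because the pattern must cover the whole string, while `sql_like('foobar', 'foo___')`, `sql_like('foobar', 'foo%')`, and `rlike('foobar', 'foo')` are all true.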
| Arithmetic Operators | Operand types | Description |
| --- | --- | --- |
| A + B | all number types | Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float. |
| A - B | all number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A * B | all number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy. |
| A / B | all number types | Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division. |
| A % B | all number types | Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A & B | all number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A \| B | all number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| A ^ B | all number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
| ~A | all number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
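The overflow note for multiplication can be illustrated with a short Python sketch. It assumes that fixed-width integers wrap around in two's complement (as Java primitives do) and that INT is 32 bits wide and BIGINT 64 bits wide; the helper names `wrap` and `multiply` are hypothetical. The bitwise operators are shown on plain Python integers.

```python
def wrap(value, bits):
    """Interpret `value` as a signed two's-complement integer of `bits` bits."""
    value &= (1 << bits) - 1
    if value >= (1 << (bits - 1)):
        value -= 1 << bits
    return value


def multiply(a, b, bits):
    """Multiply two integers, wrapping the result to the given width."""
    return wrap(a * b, bits)


# The product 10,000,000,000 does not fit in 32 bits, so the narrow result wraps.
narrow = multiply(100_000, 100_000, bits=32)
# Widening one operand (the equivalent of a cast to BIGINT) keeps the exact product.
wide = multiply(100_000, 100_000, bits=64)

# Bitwise operators on 12 (0b1100) and 10 (0b1010).
bit_and = 12 & 10   # 0b1000
bit_or = 12 | 10    # 0b1110
bit_xor = 12 ^ 10   # 0b0110
bit_not = ~12       # all bits flipped
remainder = 17 % 5
```

Under these assumptions the 32-bit product silently becomes a different number, which is why casting an operand to a wider type before multiplying matters.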
| Logical Operators | Operand types | Description |
| --- | --- | --- |
| A AND B | boolean | TRUE if both A and B are TRUE; otherwise FALSE |
| A && B | boolean | Same as A AND B |
| A OR B | boolean | TRUE if either A or B or both are TRUE; otherwise FALSE |
| A \|\| B | boolean | Same as A OR B |
| NOT A | boolean | TRUE if A is FALSE; otherwise FALSE |
| !A | boolean | Same as NOT A |
| Operator | Operand types | Description |
| --- | --- | --- |
| A[n] | A is an Array and n is an int | Returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'. |
| M[key] | M is a Map<K, V> and key has type K | Returns the value corresponding to the key in the map. |
| S.x | S is a struct | Returns the x field of S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |
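For readers more familiar with general-purpose languages, the three access operators correspond closely to list indexing, dictionary lookup, and attribute access. The Python sketch below is only an analogy; the map contents and the `Foobar` tuple are hypothetical sample data.

```python
from collections import namedtuple

# A[n]: zero-based array indexing.
A = ["foo", "bar"]
first = A[0]
second = A[1]

# M[key]: lookup of a value by key (hypothetical map contents).
M = {"f": "foo", "b": "bar"}
value = M["f"]

# S.x: access to a named field of a struct.
Foobar = namedtuple("Foobar", ["foo", "bar"])
S = Foobar(foo=1, bar=2)
field = S.foo
```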
| Return Type | Function Name (Signature) | Description |
| --- | --- | --- |
| BIGINT | round(double a) | Returns the rounded BIGINT value of the double |
| BIGINT | floor(double a) | Returns the maximum BIGINT value that is less than or equal to the double |
| BIGINT | ceil(double a) | Returns the minimum BIGINT value that is greater than or equal to the double |
| double | rand(), rand(int seed) | Returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic. |
| string | concat(string A, string B,...) | Returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them. |
| string | substr(string A, int start) | Returns the substring of A starting from the start position until the end of string A. For example, substr('foobar', 4) results in 'bar'. |
| string | substr(string A, int start, int length) | Returns the substring of A starting from the start position with the given length. |
| string | upper(string A) | Returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR'. |
| string | ucase(string A) | Same as upper |
| string | lower(string A) | Returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'. |
| string | lcase(string A) | Same as lower |
| string | trim(string A) | Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar'. |
| string | ltrim(string A) | Returns the string resulting from trimming spaces from the beginning (left-hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. |
| string | rtrim(string A) | Returns the string resulting from trimming spaces from the end (right-hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. |
| string | regexp_replace(string A, string B, string C) | Returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo\|ar', '') returns 'fb'. |
| int | size(Map<K, V>) | Returns the number of elements in the map type |
| int | size(Array<T>) | Returns the number of elements in the array type |
| value of <type> | cast(<expr> as <type>) | Converts the result of the expression expr to <type>. For example, cast('1' as BIGINT) converts the string '1' to its integral representation. A null is returned if the conversion does not succeed. |
| string | from_unixtime(int unixtime) | Converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00" |
| string | to_date(string timestamp) | Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
| int | year(string date) | Returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
| int | month(string date) | Returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 |
| int | day(string date) | Returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 |
| string | get_json_object(string json_string, string path) | Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. Returns null if the input JSON string is invalid. |
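A few of the functions above have details that are easy to get wrong, such as 1-based string positions, NULL on a failed cast, and time zone handling. The Python sketch below mimics that behavior for illustration only. The function names mirror the table, but the implementations are assumptions: `from_unixtime` defaults to UTC rather than the system time zone so that results are reproducible, `cast_bigint` covers only the BIGINT case, and `get_json_object` supports only simple dotted paths such as `$.a.b`.

```python
import json
import re
from datetime import datetime, timezone


def substr(a, start, length=None):
    """Substring with a 1-based start position and an optional length."""
    begin = start - 1
    end = None if length is None else begin + length
    return a[begin:end]


def regexp_replace(a, b, c):
    """Replace every substring of `a` that matches the regex `b` with `c`."""
    return re.sub(b, c, a)


def cast_bigint(expr):
    """Convert to an integer; return None (NULL) if the conversion fails."""
    try:
        return int(expr)
    except (TypeError, ValueError):
        return None


def from_unixtime(unixtime, tz=timezone.utc):
    """Format seconds since the epoch as 'YYYY-MM-DD HH:MM:SS' (UTC assumed)."""
    return datetime.fromtimestamp(unixtime, tz).strftime("%Y-%m-%d %H:%M:%S")


def to_date(timestamp):
    """Date part of a 'YYYY-MM-DD HH:MM:SS' string."""
    return timestamp.split(" ")[0]


def year(date):
    return int(date[0:4])


def month(date):
    return int(date[5:7])


def day(date):
    return int(date[8:10])


def get_json_object(json_string, path):
    """Follow a dotted path; return None for invalid JSON or a missing key."""
    try:
        node = json.loads(json_string)
    except (TypeError, ValueError):
        return None
    keys = [key for key in path.lstrip("$").split(".") if key]
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    # Strings are returned as-is; other values are re-serialized as JSON.
    return node if isinstance(node, str) else json.dumps(node)
```

For instance, `substr('foobar', 4)` gives `'bar'`, `cast_bigint('abc')` gives `None`, and `get_json_object('{"store": {"name": "acme"}}', '$.store.name')` gives `'acme'`.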
| Return Type | Aggregation Function Name (Signature) | Description |
| --- | --- | --- |
| BIGINT | count(*), count(expr), count(DISTINCT expr[, expr...]) | count(*): returns the total number of retrieved rows, including rows containing NULL values. count(expr): returns the number of rows for which the supplied expression is non-NULL. count(DISTINCT expr[, expr]): returns the number of rows for which the supplied expression(s) are unique and non-NULL. |
| DOUBLE | sum(col), sum(DISTINCT col) | Returns the sum of the elements in the group or the sum of the distinct values of the column in the group |
| DOUBLE | avg(col), avg(DISTINCT col) | Returns the average of the elements in the group or the average of the distinct values of the column in the group |
| DOUBLE | min(col) | Returns the minimum value of the column in the group |
| DOUBLE | max(col) | Returns the maximum value of the column in the group |
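The three forms of count differ only in how they treat NULL and duplicate values. The Python sketch below uses `None` for NULL and a plain list for a column; the function names and sample data are hypothetical, and `sum_distinct` assumes that NULL values are skipped, as is usual for SQL aggregates.

```python
def count_star(values):
    """count(*): every row counts, including rows holding NULL."""
    return len(values)


def count_expr(values):
    """count(expr): only rows with a non-NULL value count."""
    return sum(1 for value in values if value is not None)


def count_distinct(values):
    """count(DISTINCT expr): unique non-NULL values."""
    return len({value for value in values if value is not None})


def sum_distinct(values):
    """sum(DISTINCT col): each distinct non-NULL value is added once."""
    return sum({value for value in values if value is not None})


column = [3, None, 3, 5]
total_rows = count_star(column)
non_null_rows = count_expr(column)
distinct_values = count_distinct(column)
distinct_sum = sum_distinct(column)
```

For the sample column, the four results are 4, 3, 2, and 8 respectively.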
Hive's SQL provides the basic SQL operations, which work on tables or partitions. These operations are: