Hive Tutorial, Part 1 (Introductory Guide to Hive)

Tutorial

Hive Tutorial


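Concepts

What Is Hive?

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying, and analysis of large volumes of data. It provides SQL, which lets users do ad-hoc querying, summarization, and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as User Defined Functions (UDFs).

What Hive Is Not

Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.

Getting Started

For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.

Books about Hive lists some books that may also be helpful for getting started with Hive.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.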

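Data Units

In order of granularity, Hive data is organized into:

  • Databases: Namespaces that avoid naming conflicts for tables, views, partitions, columns, and so on. Databases can also be used to enforce security for a user or group of users.
  • Tables: Homogeneous units of data that share the same schema. An example is a page_views table, where each row could comprise the following columns (schema):
    • timestamp: an INT that corresponds to the UNIX timestamp of when the page was viewed.
    • userid: a BIGINT that identifies the user who viewed the page.
    • page_url: a STRING that captures the location of the page.
    • referer_url: a STRING that captures the location of the page from which the user arrived at the current page.
    • IP: a STRING that captures the IP address from which the page request was made.
  • Partitions: Each table can have one or more partition keys, which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy specified criteria; for example, a date_partition of type STRING and a country_partition of type STRING. Each unique combination of partition key values defines a partition of the table. For example, all "US" data from "2009-12-23" is one partition of the page_views table. Therefore, if you analyze only the "US" data for 2009-12-23, you can run the query on just the relevant partition, which speeds up the analysis significantly. Note, however, that a partition named 2009-12-23 does not necessarily contain all, or only, the data from that date; partitions are named after dates for convenience, and it is the user's job to guarantee the relationship between partition name and data content. Partition columns are virtual columns: they are not part of the data itself but are derived on load.
  • Buckets (or Clusters): Data in each partition may in turn be divided into buckets based on the value of a hash function of some column of the table. For example, the page_views table may be bucketed by userid, which is one of its columns other than the partition columns. Buckets can be used to efficiently sample the data.

Note that tables do not have to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.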
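
As a minimal sketch of the data units described above, the following HiveQL declares a page_views table that is both partitioned and bucketed, then reads a single partition and a single bucket. The column name viewtime (used here instead of timestamp to avoid clashing with the TIMESTAMP type name), the bucket count of 32, and the sampled bucket number are illustrative assumptions, not values taken from this tutorial.

```sql
-- Hypothetical DDL for the page_views example; 32 buckets is an arbitrary choice.
CREATE TABLE page_views (
  viewtime     INT,      -- UNIX timestamp of the page view (assumed column name)
  userid       BIGINT,
  page_url     STRING,
  referer_url  STRING,
  ip           STRING
)
PARTITIONED BY (date_partition STRING, country_partition STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- Filtering on the partition columns lets Hive read only the matching partition.
SELECT page_url, COUNT(*) AS views
FROM page_views
WHERE date_partition = '2009-12-23'
  AND country_partition = 'US'
GROUP BY page_url;

-- Bucketing on userid makes it possible to sample one bucket instead of the whole table.
SELECT *
FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);
```

The partition columns are declared in PARTITIONED BY rather than in the main column list, which matches the point above that they are virtual columns and not part of the stored rows.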

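Type System

Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.

Primitive Types

  • Types are associated with the columns in the tables. The following primitive types are supported:
  • Integers
    • TINYINT: 1-byte integer
    • SMALLINT: 2-byte integer
    • INT: 4-byte integer
    • BIGINT: 8-byte integer
  • Boolean type
    • BOOLEAN: TRUE/FALSE
  • Floating point numbers
    • FLOAT: single precision
    • DOUBLE: double precision
  • Fixed point numbers
    • DECIMAL: a fixed point value of user-defined scale and precision
  • String types
    • STRING: sequence of characters in a specified character set
    • VARCHAR: sequence of characters in a specified character set with a maximum length
    • CHAR: sequence of characters in a specified character set with a defined length
  • Date and time types
    • TIMESTAMP: a specific point in time, up to nanosecond precision
    • DATE: a date
  • Binary types
    • BINARY: a sequence of bytes

The types are organized in the following hierarchy (where the parent is a supertype of all its children):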

  • Type
    • Primitive Type
      • Number
        • DOUBLE
          • FLOAT
            • BIGINT
              • INT
                • SMALLINT
                  • TINYINT
          • STRING
      • BOOLEAN

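This type hierarchy defines how types are implicitly converted in the query language. Implicit conversion is allowed from a child type to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.

Explicit type conversion can be done using the cast operator, as shown in the Built In Functions section below.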
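
The conversion rules above can be sketched with a few standalone queries. This is an illustration only; it assumes a Hive version that accepts SELECT without a FROM clause, and the literal values are arbitrary.

```sql
-- Implicit conversion: the INT operand is widened to DOUBLE, its ancestor in the hierarchy.
SELECT CAST(1 AS INT) + CAST(2.5 AS DOUBLE);

-- Explicit conversion with cast: the string '1' becomes a BIGINT.
SELECT CAST('1' AS BIGINT);

-- A conversion that cannot succeed returns NULL rather than raising an error.
SELECT CAST('abc' AS INT);
```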

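Complex Types

Complex types can be built up from primitive types and other composite types using:

  • Structs: The elements within the type can be accessed using DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
  • Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
  • Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using [n] notation, where n is a zero-based index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields: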

  • gender—which is a STRING.
  • active—which is a BOOLEAN.

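Built In Operators and Functions

The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation: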

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Case-insensitive

All Hive keywords are case-insensitive, including the names of Hive operators and functions.

Built In Operators

  • Relational Operators: The following operators compare the passed operands and generate a TRUE or FALSE value, depending on whether the comparison between the operands holds.

Relational Operator   Operand types         Description
A = B                 all primitive types   TRUE if expression A is equivalent to expression B; otherwise FALSE
A != B                all primitive types   TRUE if expression A is not equivalent to expression B; otherwise FALSE
A < B                 all primitive types   TRUE if expression A is less than expression B; otherwise FALSE
A <= B                all primitive types   TRUE if expression A is less than or equal to expression B; otherwise FALSE
A > B                 all primitive types   TRUE if expression A is greater than expression B; otherwise FALSE
A >= B                all primitive types   TRUE if expression A is greater than or equal to expression B; otherwise FALSE
A IS NULL             all types             TRUE if expression A evaluates to NULL; otherwise FALSE
A IS NOT NULL         all types             FALSE if expression A evaluates to NULL; otherwise TRUE
A LIKE B              strings               TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'
A RLIKE B             strings               NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to TRUE and so does 'foobar' rlike '^f.*r$'.
A REGEXP B            strings               Same as RLIKE
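
The difference between LIKE and RLIKE is easiest to see side by side. The query below simply restates the examples from the table as one runnable statement, assuming a Hive version that accepts SELECT without a FROM clause.

```sql
SELECT
  'foobar' LIKE 'foo',       -- FALSE: LIKE must match the whole string
  'foobar' LIKE 'foo___',    -- TRUE: each _ matches exactly one character
  'foobar' LIKE 'foo%',      -- TRUE: % matches any number of characters
  'foobar' RLIKE 'foo',      -- TRUE: RLIKE succeeds if any substring matches
  'foobar' RLIKE '^f.*r$';   -- TRUE: anchors force a match over the whole string
```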

  • Arithmetic Operators: The following operators support various common arithmetic operations on the operands. All of them return number types.

Arithmetic Operators  Operand types     Description
A + B                 all number types  Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
A - B                 all number types  Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B                 all number types  Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B                 all number types  Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
A % B                 all number types  Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B                 all number types  Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B                 all number types  Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B                 all number types  Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A                    all number types  Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

  • Logical Operators: The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

Logical Operators  Operand types  Description
A AND B            boolean        TRUE if both A and B are TRUE, otherwise FALSE
A && B             boolean        Same as A AND B
A OR B             boolean        TRUE if either A or B or both are TRUE, otherwise FALSE
A || B             boolean        Same as A OR B
NOT A              boolean        TRUE if A is FALSE, otherwise FALSE
!A                 boolean        Same as NOT A

  • Operators on Complex Types: The following operators provide mechanisms to access elements in complex types.

Operator  Operand types                        Description
A[n]      A is an Array and n is an int        returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'
M[key]    M is a Map<K, V> and key has type K  returns the value corresponding to the key in the map. For example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'
S.x       S is a struct                        returns the x field of S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.

Built In Functions

Return Type      Function Name (Signature)                         Description
BIGINT           round(double a)                                   returns the rounded BIGINT value of the double
BIGINT           floor(double a)                                   returns the maximum BIGINT value that is equal to or less than the double
BIGINT           ceil(double a)                                    returns the minimum BIGINT value that is equal to or greater than the double
double           rand(), rand(int seed)                            returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic.
string           concat(string A, string B, ...)                   returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string           substr(string A, int start)                       returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar'
string           substr(string A, int start, int length)           returns the substring of A starting from start position with the given length. For example, substr('foobar', 4, 2) results in 'ba'
string           upper(string A)                                   returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR'
string           ucase(string A)                                   Same as upper
string           lower(string A)                                   returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'
string           lcase(string A)                                   Same as lower
string           trim(string A)                                    returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar'
string           ltrim(string A)                                   returns the string resulting from trimming spaces from the beginning (left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '
string           rtrim(string A)                                   returns the string resulting from trimming spaces from the end (right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'
string           regexp_replace(string A, string B, string C)      returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'
int              size(Map<K, V>)                                   returns the number of elements in the map type
int              size(Array<T>)                                    returns the number of elements in the array type
value of <type>  cast(<expr> as <type>)                            converts the results of the expression expr to <type>. For example, cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed.
string           from_unixtime(int unixtime)                       converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00"
string           to_date(string timestamp)                         returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
int              year(string date)                                 returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
int              month(string date)                                returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
int              day(string date)                                  returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string           get_json_object(string json_string, string path)  extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It will return null if the input json string is invalid.
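
Several of the functions above can be tried in a single query. The calls and the results in the comments are taken from the examples in the table; the JSON document passed to get_json_object is a made-up input, and the query assumes a Hive version that accepts SELECT without a FROM clause.

```sql
SELECT
  concat('foo', 'bar'),                        -- 'foobar'
  substr('foobar', 4),                         -- 'bar' (positions start at 1)
  substr('foobar', 4, 2),                      -- 'ba'
  upper('fOoBaR'),                             -- 'FOOBAR'
  trim(' foobar '),                            -- 'foobar'
  regexp_replace('foobar', 'oo|ar', ''),       -- 'fb'
  to_date('1970-01-01 00:00:00'),              -- 1970-01-01
  year('1970-01-01'),                          -- 1970
  get_json_object('{"a": {"b": 1}}', '$.a.b'); -- extracts the value stored under a.b
```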

  • The following built in aggregate functions are supported in Hive:

Return Type  Aggregation Function Name (Signature)                   Description
BIGINT       count(*), count(expr), count(DISTINCT expr[, expr...])  count(*): returns the total number of retrieved rows, including rows containing NULL values; count(expr): returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr...]): returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE       sum(col), sum(DISTINCT col)                             returns the sum of the elements in the group or the sum of the distinct values of the column in the group
DOUBLE       avg(col), avg(DISTINCT col)                             returns the average of the elements in the group or the average of the distinct values of the column in the group
DOUBLE       min(col)                                                returns the minimum value of the column in the group
DOUBLE       max(col)                                                returns the maximum value of the column in the group
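
As an illustration of how the three forms of count differ, the sketch below aggregates a page_views table with the columns described under Data Units. The column names viewtime and country_partition are assumptions made for this example.

```sql
SELECT
  country_partition,
  COUNT(*)               AS total_rows,      -- every row, including rows with NULLs
  COUNT(referer_url)     AS with_referer,    -- only rows where referer_url is non-NULL
  COUNT(DISTINCT userid) AS distinct_users,  -- unique non-NULL userid values
  MIN(viewtime)          AS first_view,
  MAX(viewtime)          AS last_view
FROM page_views
GROUP BY country_partition;
```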

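Language Capabilities

Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:

    • Ability to filter rows from a table using a WHERE clause.
    • Ability to select certain columns from the table using a SELECT clause.
    • Ability to do equi-joins between two tables.
    • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
    • Ability to store the results of a query into another table.
    • Ability to download the contents of a table to a local directory.
    • Ability to store the results of a query in a hadoop dfs directory.
    • Ability to manage tables and partitions (create, drop and alter).
    • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.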
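
Several of these capabilities can be sketched in HiveQL as follows. The tables user_info and page_counts, the partition column names, and both output paths are hypothetical and would need to exist (or be adjusted) in a real deployment.

```sql
-- Filter rows with WHERE and pick columns with SELECT.
SELECT userid, page_url
FROM page_views
WHERE country_partition = 'US';

-- Equi-join between two tables (user_info is a hypothetical table).
SELECT pv.page_url, u.gender
FROM page_views pv
JOIN user_info u ON (pv.userid = u.userid);

-- Store the results of a query in another table (page_counts is assumed to exist).
INSERT OVERWRITE TABLE page_counts
SELECT page_url, COUNT(*)
FROM page_views
GROUP BY page_url;

-- Write table contents to a local directory (hypothetical path).
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/page_views_us'
SELECT * FROM page_views WHERE country_partition = 'US';

-- Write query results to a directory in the Hadoop file system (hypothetical path).
INSERT OVERWRITE DIRECTORY '/user/hive/output/page_counts'
SELECT page_url, COUNT(*) FROM page_views GROUP BY page_url;

-- Manage partitions: add one, then drop it again.
ALTER TABLE page_views ADD PARTITION (date_partition = '2009-12-24', country_partition = 'US');
ALTER TABLE page_views DROP PARTITION (date_partition = '2009-12-24', country_partition = 'US');
```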

The original text follows.


 

Tutorial

Hive Tutorial


Concepts

What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).  

What Hive Is NOT

Hive is not designed for online transaction processing.  It is best used for traditional data warehousing tasks.

Getting Started

For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.

Books about Hive lists some books that may also be helpful for getting started with Hive.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.

Data Units

In order of granularity, Hive data is organized into:

  • Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on.  Databases can also be used to enforce security for a user or group of users.
  • Tables: Homogeneous units of data which have the same schema. An example of a table could be the page_views table, where each row could comprise the following columns (schema):
    • timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.
    • userid —which is of BIGINT type that identifies the user who viewed the page.
    • page_url—which is of STRING type that captures the location of the page.
    • referer_url—which is of STRING that captures the location of the page from where the user arrived at the current page.
    • IP—which is of STRING type that captures the IP address from where the page request was made.
  • Partitions: Each Table can have one or more partition Keys which determine how the data is stored. Partitions, apart from being storage units, also allow the user to efficiently identify the rows that satisfy specified criteria; for example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note, however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience, and it is the user's job to guarantee the relationship between partition name and data content. Partition columns are virtual columns; they are not part of the data itself but are derived on load.
  • Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example, the page_views table may be bucketed by userid, which is one of the columns, other than the partition columns, of the page_views table. These can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.

Type System

Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.

Primitive Types

  • Types are associated with the columns in the tables. The following Primitive types are supported:
  • Integers
    • TINYINT—1 byte integer
    • SMALLINT—2 byte integer
    • INT—4 byte integer
    • BIGINT—8 byte integer
  • Boolean type
    • BOOLEAN—TRUE/FALSE
  • Floating point numbers
    • FLOAT—single precision
    • DOUBLE—Double precision
  • Fixed point numbers
    • DECIMAL—a fixed point value of user defined scale and precision
  • String types
    • STRING—sequence of characters in a specified character set
    • VARCHAR—sequence of characters in a specified character set with a maximum length
    • CHAR—sequence of characters in a specified character set with a defined length
  • Date and time types
    • TIMESTAMP— a specific point in time, up to nanosecond precision
    • DATE—a date
  • Binary types
    • BINARY—a sequence of bytes

The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):

  • Type
    • Primitive Type
      • Number
        • DOUBLE
          • FLOAT
            • BIGINT
              • INT
                • SMALLINT
                  • TINYINT
          • STRING
      • BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.

Explicit type conversion can be done using the cast operator as shown in the Built In Functions section below.

Complex Types

Complex Types can be built up from primitive types and other composite types using:

  • Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
  • Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group']
  • Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:

  • gender—which is a STRING.
  • active—which is a BOOLEAN.

Built In Operators and Functions

The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Case-insensitive

All Hive keywords are case-insensitive, including the names of Hive operators and functions.

Built In Operators

  • Relational Operators—The following operators compare the passed operands and generate a TRUE or FALSE value, depending on whether the comparison between the operands holds or not.

Relational Operator   Operand types         Description
A = B                 all primitive types   TRUE if expression A is equivalent to expression B; otherwise FALSE
A != B                all primitive types   TRUE if expression A is not equivalent to expression B; otherwise FALSE
A < B                 all primitive types   TRUE if expression A is less than expression B; otherwise FALSE
A <= B                all primitive types   TRUE if expression A is less than or equal to expression B; otherwise FALSE
A > B                 all primitive types   TRUE if expression A is greater than expression B; otherwise FALSE
A >= B                all primitive types   TRUE if expression A is greater than or equal to expression B; otherwise FALSE
A IS NULL             all types             TRUE if expression A evaluates to NULL; otherwise FALSE
A IS NOT NULL         all types             FALSE if expression A evaluates to NULL; otherwise TRUE
A LIKE B              strings               TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (\% matches one % character). If the data contains a semicolon and you want to search for it, it needs to be escaped: columnValue LIKE 'a\;b'
A RLIKE B             strings               NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to TRUE and so does 'foobar' rlike '^f.*r$'.
A REGEXP B            strings               Same as RLIKE

  • Arithmetic Operators—The following operators support various common arithmetic operations on the operands. All of them return number types.

Arithmetic Operators  Operand types     Description
A + B                 all number types  Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int will result in a float.
A - B                 all number types  Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A * B                 all number types  Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy.
A / B                 all number types  Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.
A % B                 all number types  Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A & B                 all number types  Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A | B                 all number types  Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
A ^ B                 all number types  Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands.
~A                    all number types  Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

  • Logical Operators — The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

Logical Operators  Operand types  Description
A AND B            boolean        TRUE if both A and B are TRUE, otherwise FALSE
A && B             boolean        Same as A AND B
A OR B             boolean        TRUE if either A or B or both are TRUE, otherwise FALSE
A || B             boolean        Same as A OR B
NOT A              boolean        TRUE if A is FALSE, otherwise FALSE
!A                 boolean        Same as NOT A

  • Operators on Complex Types—The following operators provide mechanisms to access elements in Complex Types

Operator  Operand types                        Description
A[n]      A is an Array and n is an int        returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'
M[key]    M is a Map<K, V> and key has type K  returns the value corresponding to the key in the map. For example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'
S.x       S is a struct                        returns the x field of S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct.

Built In Functions

Return Type      Function Name (Signature)                         Description
BIGINT           round(double a)                                   returns the rounded BIGINT value of the double
BIGINT           floor(double a)                                   returns the maximum BIGINT value that is equal to or less than the double
BIGINT           ceil(double a)                                    returns the minimum BIGINT value that is equal to or greater than the double
double           rand(), rand(int seed)                            returns a random number (that changes from row to row). Specifying the seed will make sure the generated random number sequence is deterministic.
string           concat(string A, string B, ...)                   returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them.
string           substr(string A, int start)                       returns the substring of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar'
string           substr(string A, int start, int length)           returns the substring of A starting from start position with the given length. For example, substr('foobar', 4, 2) results in 'ba'
string           upper(string A)                                   returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR'
string           ucase(string A)                                   Same as upper
string           lower(string A)                                   returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'
string           lcase(string A)                                   Same as lower
string           trim(string A)                                    returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar'
string           ltrim(string A)                                   returns the string resulting from trimming spaces from the beginning (left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '
string           rtrim(string A)                                   returns the string resulting from trimming spaces from the end (right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'
string           regexp_replace(string A, string B, string C)      returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb'
int              size(Map<K, V>)                                   returns the number of elements in the map type
int              size(Array<T>)                                    returns the number of elements in the array type
value of <type>  cast(<expr> as <type>)                            converts the results of the expression expr to <type>. For example, cast('1' as BIGINT) will convert the string '1' to its integral representation. A null is returned if the conversion does not succeed.
string           from_unixtime(int unixtime)                       converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the format of "1970-01-01 00:00:00"
string           to_date(string timestamp)                         returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01"
int              year(string date)                                 returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970
int              month(string date)                                returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11
int              day(string date)                                  returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1
string           get_json_object(string json_string, string path)  extracts a json object from a json string based on the json path specified, and returns the json string of the extracted json object. It will return null if the input json string is invalid.

  • The following built in aggregate functions are supported in Hive:

Return Type  Aggregation Function Name (Signature)                   Description
BIGINT       count(*), count(expr), count(DISTINCT expr[, expr...])  count(*): returns the total number of retrieved rows, including rows containing NULL values; count(expr): returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr...]): returns the number of rows for which the supplied expression(s) are unique and non-NULL.
DOUBLE       sum(col), sum(DISTINCT col)                             returns the sum of the elements in the group or the sum of the distinct values of the column in the group
DOUBLE       avg(col), avg(DISTINCT col)                             returns the average of the elements in the group or the average of the distinct values of the column in the group
DOUBLE       min(col)                                                returns the minimum value of the column in the group
DOUBLE       max(col)                                                returns the maximum value of the column in the group

Language Capabilities

Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:

    • Ability to filter rows from a table using a WHERE clause.
    • Ability to select certain columns from the table using a SELECT clause.
    • Ability to do equi-joins between two tables.
    • Ability to evaluate aggregations on multiple "group by" columns for the data stored in a table.
    • Ability to store the results of a query into another table.
    • Ability to download the contents of a table to a local (for example, nfs) directory.
    • Ability to store the results of a query in a hadoop dfs directory.
    • Ability to manage tables and partitions (create, drop and alter).
    • Ability to plug in custom scripts in the language of choice for custom map/reduce jobs.