Hive conditional functions (IF, COALESCE, CASE)

Date: 2019-11-16

Tags: hive, conditional functions, coalesce, case. Category: Hadoop

CONDITIONAL FUNCTIONS IN HIVE

Hive supports three types of conditional functions. These functions are listed below:

IF( Test Condition, True Value, False Value )
The IF function evaluates the "Test Condition". If the "Test Condition" is true, it returns the "True Value"; otherwise, it returns the "False Value".
Example: IF(1=1, 'working', 'not working') returns 'working'

COALESCE( value1,value2,... )

The COALESCE function returns the first non-NULL value from the list of values. If all the values in the list are NULL, it returns NULL.
Example: COALESCE(NULL,NULL,5,NULL,4) returns 5
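
The two functions above can be modelled outside Hive. The following is a minimal Python sketch, not Hive code: the helper names hive_if and coalesce are hypothetical, Python's None stands in for NULL, and a NULL test condition is assumed to behave like false.

```python
def hive_if(test_condition, true_value, false_value):
    """Model of IF(): only a condition that is actually true picks true_value.

    A NULL (None) condition is treated like false, so false_value is returned.
    """
    return true_value if test_condition is True else false_value


def coalesce(*values):
    """Model of COALESCE(): return the first value that is not NULL (None)."""
    for value in values:
        if value is not None:
            return value
    return None  # every argument was NULL


print(hive_if(1 == 1, "working", "not working"))
print(coalesce(None, None, 5, None, 4))
```

Note that coalesce skips only None: a value such as 0 or an empty string is not NULL and is returned as-is.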

CASE Statement

The syntax for the case statement is:
CASE [ expression ]
WHEN condition1 THEN result1
WHEN condition2 THEN result2
...
WHEN conditionn THEN resultn
ELSE result
END

Here expression is optional. It is the value that you are comparing to the list of conditions (i.e., condition1, condition2, ... conditionn).

All the conditions must be of the same datatype. Conditions are evaluated in the order listed. Once a condition is found to be true, the case statement returns the result and does not evaluate the conditions any further.

All the results must be of the same datatype. This is the value returned once a condition is found to be true.

If no condition is found to be true, the case statement returns the value in the ELSE clause. If the ELSE clause is omitted and no condition is found to be true, the case statement returns NULL.

Example:

CASE Fruit
WHEN 'APPLE' THEN 'The owner is APPLE'
WHEN 'ORANGE' THEN 'The owner is ORANGE'
ELSE 'It is another Fruit'
END

The other form of CASE is:

CASE
WHEN Fruit = 'APPLE' THEN 'The owner is APPLE'
WHEN Fruit = 'ORANGE' THEN 'The owner is ORANGE'
ELSE 'It is another Fruit'
END
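
The evaluation order of the two CASE forms can be sketched in Python. This is an illustrative model, not Hive code: simple_case, searched_case and the two describe helpers are hypothetical names, and None stands in for NULL.

```python
def simple_case(expression, branches, default=None):
    """Model of CASE expression WHEN value THEN result ... [ELSE default] END."""
    for value, result in branches:
        # NULL never equals anything, so a None expression matches no branch.
        if expression is not None and expression == value:
            return result  # first match wins; later branches are not checked
    return default  # ELSE value, or NULL (None) when ELSE is omitted


def searched_case(branches, default=None):
    """Model of CASE WHEN condition THEN result ... [ELSE default] END."""
    for condition, result in branches:
        if condition is True:
            return result
    return default


def describe(fruit):
    # Simple form: one expression compared against each WHEN value.
    return simple_case(
        fruit,
        [("APPLE", "The owner is APPLE"), ("ORANGE", "The owner is ORANGE")],
        default="It is another Fruit",
    )


def describe_searched(fruit):
    # Searched form: each WHEN carries its own boolean condition.
    # (This Python list evaluates every condition up front; only the
    # selection of the first true branch is modelled.)
    return searched_case(
        [(fruit == "APPLE", "The owner is APPLE"),
         (fruit == "ORANGE", "The owner is ORANGE")],
        default="It is another Fruit",
    )


print(describe("APPLE"))
print(describe_searched("PEAR"))
```

Both forms give the same answer for the Fruit example; the searched form is the more general one because its conditions need not all test the same expression.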

---------------------------------------------- Regular expressions

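1. Hive CLI (command line)
Hive command-line options:
-d k=v (define a variable), -e "" (run the quoted query string), -f filename (run the queries in a file), -h host, -p port, -v (echo the executed HQL to the console)

Hive interactive mode:
set; prints the configuration variables overridden by the user or by Hive; set -v; prints all Hadoop and Hive variables
set k=v; sets a variable, for example set mapred.reduce.tasks=32; no error is reported if k does not exist
! shell command; runs a shell command from interactive mode, for example: ! echo aa;
dfs command; runs a hadoop fs command from interactive mode, with the same syntax as hadoop fs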

Data types:
Integers
    TINYINT - 1 byte integer
    SMALLINT - 2 byte integer
    INT - 4 byte integer
    BIGINT - 8 byte integer
Boolean type
    BOOLEAN - TRUE/FALSE
Floating point numbers
    FLOAT - single precision
    DOUBLE - Double precision
String type
    STRING - sequence of characters in a specified character set
Complex Types
Structs: for a column c of type STRUCT {a INT; b INT}, field a is accessed as c.a
    struct_type : STRUCT < col_name : data_type [COMMENT col_comment], ...>
Maps: elements are accessed by key, e.g. M['group']
Arrays: all elements must have the same type; for A = ['a', 'b', 'c'], A[1] returns 'b' (indexes start at 0)
Union
    union_type : UNIONTYPE < data_type, data_type, ... >
    SELECT create_union(0, key), create_union(if(key<100, 0, 1), 2.0, value), create_union(1, "a", struct(2, "b")) FROM src LIMIT 2;

TIMESTAMP (Note: only available starting with Hive 0.8.0)
BINARY (Note: only available starting with Hive 0.8.0)
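Operators (built-in operators):
Relational Operators
A LIKE B: uses the _ and % wildcards, the same as in SQL
A RLIKE B: regular-expression match, for example 'foo' RLIKE 'f.*' returns TRUE; the same as A REGEXP B

Arithmetic Operators
A % B (remainder), A & B (bitwise AND), A | B (bitwise OR), A ^ B (bitwise XOR), ~A (bitwise NOT)
Logical Operators
A AND B, A OR B, !A, NOT A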

Operators on Complex Types
A[n] A is an Array and n is an int
M[key] M is a Map<K, V> and key has type K
S.x S is a struct
Functions:
round(double a) BIGINT
floor(double a) BIGINT
ceil(double a) BIGINT
rand(), rand(int seed) double
concat(string A, string B,...)
substr(string A, int start, int length)
upper(string A)
lower(string A)
trim(string A)
ltrim(string A)
rtrim(string A)
regexp_replace(string A, string B, string C)
size(Map<K,V>) returns the number of elements in the map type
size(Array<T>)
cast(<expr> as <type>) the same as in MySQL
from_unixtime(int unixtime)
to_date(string timestamp) to_date("1970-01-01 00:00:00") = "1970-01-01"
year(string date) year("1970-01-01") = 1970
month(string date) month("1970-11-01") = 11
day(string date) day("1970-11-01") = 1, equivalent to dayofmonth()
hour()/minute()/second()
weekofyear(string date)

get_json_object(string json_string, string path)

aggregate functions
count(*), count(expr), count(DISTINCT expr[, expr...])
sum(col), sum(DISTINCT col)
avg(col), avg(DISTINCT col)
min(col)
max(col)

Creating tables:
    CREATE [EXTERNAL] TABLE [IF NOT EXISTS] page_view(viewTime INT COMMENT '', userid BIGINT,
                    page_url STRING, referrer_url STRING,
                    friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                    ip STRING COMMENT 'IP Address of the User')
    COMMENT 'This is the page view table'
    PARTITIONED BY(dt STRING COMMENT '', country STRING)
    CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS -- the table is split into 32 buckets by userid; within each bucket, rows are sorted by viewTime
    ROW FORMAT DELIMITED
            FIELDS TERMINATED BY '1'
            COLLECTION ITEMS TERMINATED BY '2'
            MAP KEYS TERMINATED BY '3'
   STORED AS SEQUENCEFILE/TEXTFILE/RCFILE/INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
   [LOCATION hdfs_path]
   [AS select_statement];


   Notes:
1. The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
2. Tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within each bucket via SORTED BY columns. This can improve performance on certain kinds of queries.
3. Table names and column names are case insensitive, but SerDe and property names are case sensitive.
4. The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table, only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
How to insert data into a bucketed table:
set hive.enforce.bucketing = true;
FROM user_id
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25';

Altering tables: Alter Table/Partition Statements
Altering partitions:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] PARTITION partition_spec [LOCATION 'location2'] ...
Note: in older Hive versions (before 0.8) this statement cannot add several partitions at once; add one partition per statement there.
partition_spec: (partition_col = partition_col_value, partition_col = partition_col_value, ...)
ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec, PARTITION partition_spec, ...
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION "new location"
Alter Table/Partition Protections
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP;
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;
Altering columns:
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Alter Table Properties (add your own metadata to a table)
ALTER TABLE table_name SET TBLPROPERTIES table_properties
Alter Table (Un)Archive
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
Views:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...

DROP VIEW [IF EXISTS] view_name
ALTER VIEW view_name SET TBLPROPERTIES table_properties
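Notes:
1. Views are purely logical; materialized views are not currently supported.
2. If a view's base table is dropped, the view's schema is not changed, and queries against the view fail.
3. Views are read-only.
Functions:
add jar ... adds a jar to Hive's classpath (add files ... adds files to the distributed cache)
CREATE TEMPORARY FUNCTION function_name AS class_name (the class must be on the classpath)
DROP TEMPORARY FUNCTION [IF EXISTS] function_name
Indexes: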
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[
   [ ROW FORMAT ...] STORED AS ...
   | STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]

DROP INDEX [IF EXISTS] index_name ON table_name

For details, see: https://cwiki.apache.org/confluence/display/Hive/IndexDev#CREATE_INDEX
Show/Describe Statements
show databases/tables [like 'RegExp'], where the pattern is a regular expression
show partitions tableName [PARTITION(ds='2010-03-03')]
show tblproperties tableName (as of Hive 0.10.0)
SHOW FUNCTIONS "a.*"
SHOW [FORMATTED] (INDEX|INDEXES) ON table_with_index [(FROM|IN) db_name]
SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] lists all columns, including partition columns (as of Hive 0.10)
desc database xl_netdisk_ods;
Loading data:
     1. Loading data into a table:
Standard syntax:
   LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
   Without OVERWRITE, existing data is kept as long as file names do not collide; files with colliding names are replaced.
   INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
   INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

Notes:
1. INSERT OVERWRITE will overwrite any existing data in the table or partition unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0).
2. INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8.)
     2. Writing data to a directory:
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...

Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

     3. Examples:
   LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
   LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
   INSERT OVERWRITE TABLE user_active SELECT user.* FROM user WHERE user.active = 1;

In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN as illustrated by the following example.

    INSERT OVERWRITE TABLE pv_users
    SELECT u.*
    FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id AND pv.date = '2008-03-03');

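However, in the Hive versions covered by these notes, a single SELECT cannot contain DISTINCT aggregations on different columns, so the following query is not allowed:

    INSERT OVERWRITE TABLE pv_gender_agg
    SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) -- only one DISTINCT column is allowed per SELECT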
    FROM pv_users
    GROUP BY pv_users.gender;

Replacing IN/EXISTS subqueries in Hive:
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
can be rewritten as:
SELECT a.key, a.value FROM a LEFT SEMI JOIN b ON (a.key = b.key);
Left semi join: returns each row of the first (top) input for which the second (bottom) input has a matching row. If the join has no predicate, every row counts as a match.
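
The difference between a left semi join and an ordinary inner join is easiest to see on a right-hand table with duplicate keys. The following Python sketch models both on lists of dictionaries; the function names and the sample rows are made up for illustration, and None stands in for NULL.

```python
def left_semi_join(left_rows, right_rows, key):
    """Return each left row once if at least one right row has the same key."""
    right_keys = {row[key] for row in right_rows if row[key] is not None}
    return [row for row in left_rows if row[key] in right_keys]


def inner_join(left_rows, right_rows, key):
    """Plain inner join for comparison: one output row per matching pair."""
    return [
        {**left, **right}
        for left in left_rows
        for right in right_rows
        if left[key] is not None and left[key] == right[key]
    ]


a = [{"key": 1, "value": "x"}, {"key": 2, "value": "y"}, {"key": 3, "value": "z"}]
b = [{"key": 1}, {"key": 1}, {"key": 3}]  # key 1 appears twice

semi = left_semi_join(a, b, "key")
inner = inner_join(a, b, "key")
print(len(semi), len(inner))
```

The semi join returns the rows for keys 1 and 3 exactly once each, which is what the IN subquery means, while the inner join repeats the row for key 1 because b contains it twice.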
Multi Table/File Inserts
    FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender

    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;

    FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';
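Dynamic-partition insert (partitions are created automatically, so partition values need not be known in advance, and rows can be written to many partitions without running multiple jobs; available since Hive 0.6)
FROM page_view_stg pvs
    INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) -- no value is given for country; its partitions are created automatically and the rows inserted
   SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
Dynamic partition columns must come last in the partition spec, that is, only the trailing sub-partitions can be dynamic; (dt, country='US') is not allowed. Dynamic partition columns must be supplied by the SELECT list, while static partition columns are not. If a partition that receives rows already exists, its data is overwritten; existing partitions that receive no rows are left untouched. If a partition column value is NULL or the empty string, the row goes to the default partition __HIVE_DEFAULT_PARTITION__ and is treated as bad data.
Notes: by default, each node (mapper or reducer) may create at most 100 partitions (hive.exec.max.dynamic.partitions.pernode), a DML statement may create at most 1000 partitions in total (hive.exec.max.dynamic.partitions) and at most 100,000 files (hive.exec.max.created.files); all three limits can be changed through configuration. By default Hive does not allow every partition column to be dynamic; set hive.exec.dynamic.partition.mode=nonstrict to lift this restriction.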

hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
DISTRIBUTE BY ds, country; -- if a single mapper or reducer would create more than 100 partitions, DISTRIBUTE BY spreads the partitions across different reducers
This query will generate a MapReduce job rather than Map-only job. The SELECT-clause will be converted to a plan to the mappers and the output will be distributed to the reducers based on the value of (ds, country) pairs. The INSERT-clause will be converted to the plan in the reducer which writes to the dynamic partitions.
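
How rows are routed to partitions in a dynamic-partition insert can be sketched in Python. This is a simplified model under stated assumptions, not Hive's implementation: route_rows and the sample rows are hypothetical, dt is the static partition value, and the last field of each row plays the role of the dynamic country column.

```python
from collections import defaultdict

DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"


def route_rows(rows, static_dt):
    """Group rows by target partition; the last field of each row is country."""
    partitions = defaultdict(list)
    for row in rows:
        *data, country = row
        # NULL or empty partition values fall into the default partition.
        if country is None or country == "":
            country = DEFAULT_PARTITION
        partitions[(static_dt, country)].append(tuple(data))
    return dict(partitions)


staged = [
    (1, "u1", "US"),
    (2, "u2", "CA"),
    (3, "u3", "US"),
    (4, "u4", None),
]
result = route_rows(staged, "2008-06-08")
for partition, partition_rows in result.items():
    print(partition, partition_rows)
```

Only partitions that actually receive rows appear in the result, which mirrors the rule that an existing partition is overwritten only when the input contains its value.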
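Queries:
Notes:
1. SORT BY vs. ORDER BY: SORT BY only guarantees order within each reducer, while ORDER BY guarantees a total order. The total order comes at the cost of performance, because a single reducer is used.
2. Sorting follows the column's type; if numeric values are stored as strings, cast them to a numeric type before sorting.
3. To send all rows with the same key to the same reducer, use CLUSTER BY ..., or DISTRIBUTE BY ... SORT BY ...; the former is shorthand for the latter. Usually the DISTRIBUTE BY columns are a prefix of the SORT BY columns.
4. Partitions are used as in MySQL: the relevant partitions are selected automatically. Partition predicates can be used in WHERE and in the ON clause of a JOIN.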
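
Notes 1 and 2 can be illustrated with a small Python model. It is a sketch under assumptions: sort_by, order_by and partition_of are hypothetical names, and partition_of stands in for whatever rule (for example a DISTRIBUTE BY hash) assigns a row to a reducer.

```python
def sort_by(rows, num_reducers, partition_of):
    """Model of SORT BY: each reducer sorts only the rows it received."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[partition_of(row) % num_reducers].append(row)
    return [sorted(part) for part in reducers]


def order_by(rows):
    """Model of ORDER BY: a single reducer sorts everything."""
    return sorted(rows)


rows = [5, 2, 9, 4, 7, 1]
per_reducer = sort_by(rows, 2, partition_of=lambda r: r)
# Reading the reducer outputs one after another gives no total order.
concatenated = [r for part in per_reducer for r in part]
print(per_reducer, concatenated, order_by(rows))

# Note 2: numbers stored as strings sort lexicographically until cast.
as_strings = sorted(["10", "9", "2"])
as_numbers = sorted(int(s) for s in ["10", "9", "2"])
print(as_strings, as_numbers)
```

Each reducer's output is sorted, but the concatenation is not; that gap is exactly what ORDER BY closes by funnelling all rows through one reducer.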

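Joins:
1. Multiple joins can run as a single map/reduce job. Hive converts joins over multiple tables into a single map/reduce job if for every table the same column is used in the join clauses, e.g. SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
2. In every map/reduce stage of a join, the earlier tables are buffered and the last table is streamed through the reducers; when there are several jobs, the previous result is buffered and the next table is streamed. To change which table is streamed, give a hint,
for example: SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1). Without the hint the last table, c, is streamed; with it, table a is streamed instead.
3. Joins occur BEFORE WHERE clauses. Be careful with left, right and full outer joins: a condition placed in ON gives a different result from the same condition placed in WHERE. Once the order of JOIN and WHERE and the meaning of an outer join are clear, the results are easy to understand.
The ON conditions are applied first, then the join is performed, and WHERE is applied last. In an outer join, filters in ON mainly affect the secondary table; they do not remove rows of the primary table, because every primary-table row is output. Filters on the primary table therefore belong in WHERE. If
the primary table contains many rows that do not need to be joined, use a subquery to filter them out first.
4. A join can be completed with mappers only, subject to restrictions. SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key: for every mapper of a, b is read entirely into memory, so the join needs no reduce phase. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
If the joined tables are all large but bucketed on the join columns, and the number of buckets in one is a multiple of the number in the other, a map join is also possible.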
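
Point 3 above, the different effect of a filter in ON and in WHERE for an outer join, can be checked with a small Python model. The helper left_outer_join and the users/views rows are hypothetical; None plays the role of the NULL padding that an outer join produces.

```python
def left_outer_join(left, right, on):
    """Keep every left row; pair it with matching right rows or with None."""
    out = []
    for left_row in left:
        matches = [r for r in right if on(left_row, r)]
        if matches:
            out.extend((left_row, r) for r in matches)
        else:
            out.append((left_row, None))  # no match: pad with NULL
    return out


users = [{"id": 1}, {"id": 2}]
views = [
    {"userid": 1, "dt": "2008-03-03"},
    {"userid": 2, "dt": "2008-03-04"},
]

# Date filter inside ON: user 2 has no matching view but is still output.
in_on = left_outer_join(
    users, views,
    lambda u, v: u["id"] == v["userid"] and v["dt"] == "2008-03-03",
)

# Date filter in WHERE: it runs after the join and drops user 2's row.
joined = left_outer_join(users, views, lambda u, v: u["id"] == v["userid"])
in_where = [
    (u, v) for u, v in joined if v is not None and v["dt"] == "2008-03-03"
]

print(len(in_on), len(in_where))
```

With the filter in ON the primary table keeps both users; with the filter in WHERE only the user with a matching view survives, so the outer join effectively behaves like an inner join.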
Lateral view: turns the elements of a collection column into rows
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
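
What a lateral view over explode() produces can be modelled in Python. This is an illustrative sketch: lateral_view_explode is a hypothetical helper, and the sample rows reuse the pageid and adid_list names that appear in the UDTF notes of this document.

```python
def lateral_view_explode(rows, array_column, alias):
    """Pair each base row with every element of its array column."""
    out = []
    for row in rows:
        # A row whose array is empty or NULL produces no output rows
        # (the behaviour of a plain LATERAL VIEW without OUTER).
        for element in row[array_column] or []:
            out.append({**row, alias: element})
    return out


page_ads = [
    {"pageid": "front_page", "adid_list": [1, 2, 3]},
    {"pageid": "contact_page", "adid_list": []},
]
exploded = lateral_view_explode(page_ads, "adid_list", "adid")
for row in exploded:
    print(row["pageid"], row["adid"])
```

Each output row still carries the base row's columns next to the new alias column, which is what allows other columns to be selected alongside the exploded values.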
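Sampling data: TABLESAMPLE
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname]) divides the data into y buckets and returns bucket x. If the table was not bucketed when it was created, the whole table is scanned and bucketed on the fly. colname may be rand().
block_sample: TABLESAMPLE (n PERCENT) returns n percent of the data size, not n percent of the rows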
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;
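
Bucket sampling can be pictured as "hash the column, take the remainder, keep one remainder class". The Python sketch below models that idea only; bucket_sample and bucket_hash are hypothetical names, the identity hash on integer user ids is an assumption made for readability, and Hive's real hash function depends on the column type.

```python
def bucket_sample(rows, x, y, bucket_hash):
    """Model of TABLESAMPLE(BUCKET x OUT OF y ON col): keep bucket x of y."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    # Buckets are numbered from 1, remainders from 0.
    return [row for row in rows if bucket_hash(row) % y == x - 1]


userids = list(range(1, 65))
sample = bucket_sample(userids, 3, 32, bucket_hash=lambda userid: userid)
print(sample)
```

With 64 evenly spread ids and 32 buckets, each bucket holds two ids, so the sample is roughly 1/32 of the data.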
Virtual columns:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
Viewing a query's execution plan:
explain [extended] query

Generating sample data:
INSERT OVERWRITE TABLE pv_gender_sum_sample
    SELECT pv_gender_sum.*
    FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32 [ON userid]);
UNION ALL: omitted
Array Operations
Array columns in tables can only be created programmatically currently.
SELECT pv.friends[2] FROM page_views pv;
SELECT pv.userid, size(pv.friends) FROM page_view pv;
Map (Associative Arrays) Operations
Maps provide collections similar to associative arrays. Such structures can only be created programmatically currently.
INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type'] FROM page_views pv;
SELECT size(pv.properties) FROM page_view pv;

Difference between DISTRIBUTE BY and CLUSTER BY:
Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different
Altering Tables
ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
Dropping Tables and Partitions
ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')

Hive built-in functions:
Listing functions: show functions; desc function [extended] fname;

Operators:
A [NOT] BETWEEN B AND C (as of version 0.9.0)
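Functions
1. Mathematical functions:
round(double a[, int d]): int/double; pow(a, b); sqrt(); bin(a): binary representation; hex(): hexadecimal representation; conv(BIGINT num, int from_base, int to_base)
abs(); pmod(int a, int b): returns the positive remainder, whereas a % b returns a negative remainder when the remainder is negative; degrees(double a): converts radians to degrees; radians(double a); e(); pi(); sign(): sign function
std(), stddev()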
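
The difference between a % b and pmod(a, b) described above can be reproduced in Python. This is a sketch under assumptions: java_style_mod and pmod are hypothetical helpers, the divisor b is assumed to be positive, and Python's own % is not used directly for a % b because it already returns a non-negative result for a positive divisor.

```python
def java_style_mod(a, b):
    """Remainder that takes the sign of the dividend a."""
    remainder = abs(a) % abs(b)
    return -remainder if a < 0 else remainder


def pmod(a, b):
    """Model of pmod(a, b) for b > 0: shift a negative remainder upward."""
    return (java_style_mod(a, b) + b) % b


print(java_style_mod(-7, 3), pmod(-7, 3))
print(java_style_mod(7, 3), pmod(7, 3))
```

For non-negative a the two agree; they differ only when a is negative, which is where pmod is useful, for example when mapping hash values to bucket numbers.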
2.Collection Functions
size(Map<K,V>), size(Array<T>), map_keys(Map<K,V>), map_values(Map<K,V>), array_contains(Array<T>, value), sort_array(Array<T>): sorts in natural order (as of version 0.9.0)
array(n0, n1...) - Creates an array with the given elements
3.Type Conversion Functions
cast(expr as <type>)
4.Date Functions
from_unixtime(bigint unixtime[, string format])
unix_timestamp(string date, string pattern)
weekofyear(string date),
datediff(string enddate, string startdate),
date_add(string startdate, int days),
date_sub(string startdate, int days)
from_utc_timestamp(timestamp, string timezone)
to_utc_timestamp(timestamp, string timezone)
date_format

5.Conditional Functions
if(boolean testCondition, T valueTrue, T valueFalseOrNull)
COALESCE(a1, a2, ...) - Returns the first non-null argument; returns NULL if all arguments are NULL. Pronounced [kəʊəles]. It can be used in place of ifnull.
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END When a = b, returns c; when a = d, returns e; else returns f
CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END When a = true, returns b; when c = true, returns d; else returns e

6.String Functions
concat_ws(string SEP, string A, string B...) concatenates with a custom separator
find_in_set(string str, string strList) find_in_set('ab', 'abc,b,ab,c,def') returns 3
format_number(number x, int d) Formats the number X to a format like '#,###,###.##', rounded to D decimal places(as of Hive 0.10.0)
get_json_object(string json_string, string path), where the root object is named $
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.
in_file(string str, string filename)
instr(string str, string substr) ,locate(string substr, string str[, int pos])
lpad(string str, int len, string pad) lpad('a',3,'b'):bba，rpad(),ltrim(),rtrim(),trim()
ngrams(array<array<string>>, int N, int K, int pf)
parse_url(string urlString, string partToExtract [, string keyToExtract]) Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') : 'facebook.com',parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.
parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL.
SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a;
printf(String format, Obj... args) (as of Hive 0.9.0)
regexp_extract('foothebar', 'foo(.*?)(bar)', n) extracts the n-th capture group, for example regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
repeat(string str, int n) Repeat str n times
reverse(string A)
sentences(string str, string lang, string locale) :array<array<string>>
space(int n)
split(string str, string pat): array. Splits str around pat (pat is a regular expression)
str_to_map(text[, delimiter1, delimiter2]) Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2.
substr(string|binary A, int start, int len) or substring(string|binary A, int start, int len)
translate(string input, string from, string to)
upper(),lower()
groupconcat()
map_keys()
map_values()
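
Several of the string-function examples listed above can be reproduced with Python's standard library. The block below is a sketch for cross-checking the documented results, not Hive code: find_in_set and str_to_map are hypothetical re-implementations, the delimiters are treated as literal strings, and the regular expression and URL are the ones quoted in the list.

```python
import re
from urllib.parse import urlparse, parse_qs


def find_in_set(s, str_list):
    """1-based position of s in a comma-separated list, 0 if absent."""
    if "," in s:
        return 0  # a search string containing a comma can never match
    items = str_list.split(",")
    return items.index(s) + 1 if s in items else 0


def str_to_map(text, delimiter1=",", delimiter2="="):
    """Split text into pairs on delimiter1, then each pair on delimiter2."""
    result = {}
    for pair in text.split(delimiter1):
        key, _, value = pair.partition(delimiter2)
        result[key] = value
    return result


url = "http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1"
parts = urlparse(url)
second_group = re.search(r"foo(.*?)(bar)", "foothebar").group(2)

print(find_in_set("ab", "abc,b,ab,c,def"))
print(second_group)
print(parts.hostname, parse_qs(parts.query)["k1"][0])
print(str_to_map("a=1,b=2"))
```

The lazy group (.*?) in the regular expression captures 'the', so group 2 is the literal 'bar'; the URL pieces correspond to the HOST and QUERY:k1 parts of parse_url.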

Misc. Functions
varies java_method(class, method[, arg1[, arg2..]])Synonym for reflect (as of Hive 0.9.0)
varies reflect(class, method[, arg1[, arg2..]]) Use this UDF to call Java methods by matching the argument signature (uses reflection). (as of Hive 0.7.0)

XPath Functions (extract data from XML)
xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string

Built-in Aggregate Functions (UDAF)
max(),min(),count(),avg(),sum()
double variance(col), var_pop(col) Returns the variance of a numeric column in the group
double var_samp(col) Returns the unbiased sample variance of a numeric column in the group
double stddev_pop(col) Returns the standard deviation of a numeric column in the group
double stddev_samp(col) Returns the unbiased sample standard deviation of a numeric column in the group
double covar_pop(col1, col2) Returns the population covariance of a pair of numeric columns in the group
double covar_samp(col1, col2) Returns the sample covariance of a pair of numeric columns in the group

Built-in Table-Generating Functions (UDTF)
Array Type explode(array<TYPE> a) For each element in a, explode() generates a row containing that element
    No other expressions are allowed in SELECT
SELECT pageid, explode(adid_list) AS myCol... is not supported
    UDTF's can't be nested
SELECT explode(explode(adid_list)) AS myCol... is not supported
    GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
stack(INT n, v_1, v_2, ..., v_k) Breaks up v_1, ..., v_k into n rows. Each row will have k/n columns. n must be constant.
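
The row shapes produced by explode() and stack() can be sketched in Python. The functions below are hypothetical models; the stack model only covers the case where the number of values k is a multiple of n, and each output row is represented as a tuple.

```python
def explode(array):
    """Model of explode(): one single-column row per array element."""
    return [(element,) for element in array]


def stack(n, *values):
    """Model of stack(n, v_1, ..., v_k): n rows of k/n columns each."""
    if n <= 0 or len(values) % n != 0:
        raise ValueError("this sketch needs k to be a multiple of n")
    width = len(values) // n
    return [tuple(values[i * width:(i + 1) * width]) for i in range(n)]


print(explode([10, 20, 30]))
print(stack(2, "A", 10, "B", 20))
```

Here stack(2, 'A', 10, 'B', 20) yields two rows of two columns, filling each row left to right before moving on to the next.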
