A Small Collection of Hive Functions

(1) explode() function (used together with LATERAL VIEW):

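explode(array) takes an array-typed argument. Its effect is the opposite of collect_set: it expands a single array column into multiple rows (and, for a map, into two columns).
explode(ARRAY): generates one row for each element of the array;
explode(MAP): generates one row for each key-value pair of the map, with the key in one column and the value in another;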
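The row-generating behaviour described above can be modeled in a few lines of Python. This is only a sketch of the semantics, not Hive's implementation; the helper name explode is reused for readability, and the tuple-per-row representation is an assumption of the sketch.

```python
def explode(value):
    """Model of explode(): one output row per array element or map entry."""
    if isinstance(value, dict):
        # explode(MAP): every row has two columns, key and value
        return [(key, val) for key, val in value.items()]
    # explode(ARRAY): every row has a single column
    return [(item,) for item in value]


print(explode(["A", "B", "C"]))
print(explode({"A": 10, "B": 20}))
```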

Restrictions when explode is used directly in the SELECT list:

1. No other expressions are allowed in SELECT
   SELECT pageid, explode(adid_list) AS myCol... is not supported;

2. UDTFs can't be nested
   SELECT explode(explode(adid_list)) AS myCol... is not supported;

3. GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
   SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported;

2. Lateral view
LATERAL VIEW lifts the restrictions above. The syntax is:

fromClause: FROM baseTable (lateralView)*
lateralView: LATERAL VIEW explode(expression) tableAlias AS columnAlias (',' columnAlias)*

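Explanation:
A lateral view is meant to be used together with UDTFs such as explode. It places the rows generated by the UDTF into a virtual table, and that virtual table is then joined with each input row of the base table, so that columns outside the UDTF can be selected alongside its output.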

Example: the table is named pageAds
SELECT pageid, adid
FROM pageAds LATERAL VIEW explode(adid_list) adTable AS adid;

3. Multiple lateral views
A FROM clause can be followed by several LATERAL VIEW clauses.

Example:
Table name: baseTable

With a single lateral view after FROM:
SELECT myCol1, col2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1;

With multiple lateral views:
SELECT myCol1, myCol2 FROM baseTable
LATERAL VIEW explode(col1) myTable1 AS myCol1
LATERAL VIEW explode(col2) myTable2 AS myCol2;

4. Outer lateral views
If the array column is empty but the source row must still be returned, use the OUTER keyword.

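For example: select * from src LATERAL VIEW explode(array()) C AS a limit 10;
The array in this statement is an empty list, so the result is empty whether or not src contains rows.

Whereas: select * from src LATERAL VIEW OUTER explode(array()) C AS a limit 10;
returns one row per row of src (up to the limit), with column a set to NULL.
For example: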

238 val_238 NULL
86 val_86 NULL
311 val_311 NULL
27 val_27 NULL
165 val_165 NULL
409 val_409 NULL
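The join between each input row and its exploded rows, with and without OUTER, can be sketched in Python as follows. This is an illustrative model only; the function name lateral_view_explode, the array_of callback and the sample rows are hypothetical and do not come from Hive.

```python
def lateral_view_explode(rows, array_of, outer=False):
    """Join every input row with the rows produced by exploding array_of(row)."""
    result = []
    for row in rows:
        items = array_of(row)
        if not items and outer:
            # LATERAL VIEW OUTER keeps the source row; the new column is NULL
            result.append(row + (None,))
        for item in items:
            result.append(row + (item,))
    return result


# Hypothetical pageAds rows: (pageid, adid_list)
page_ads = [("front_page", [1, 2, 3]), ("contact_page", [])]
get_ads = lambda row: row[1]

plain = [(r[0], r[2]) for r in lateral_view_explode(page_ads, get_ads)]
outer = [(r[0], r[2]) for r in lateral_view_explode(page_ads, get_ads, outer=True)]
print(plain)
print(outer)
```

Without outer, contact_page disappears because its array yields no rows; with outer it survives with None standing in for NULL.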

1. Column to rows

1.1 Problem:
How do we turn
a       b       1,2,3
c       d       4,5,6
into:
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

1.2 Raw data:

test.txt
a b 1,2,3
c d 4,5,6

1.3 Solution 1:

drop table test_jzl_20140701_test;
Create the table:
create table test_jzl_20140701_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by ' '
stored as textfile;

Load the data:
load data local inpath '/home/jiangzl/shell/test.txt' into table test_jzl_20140701_test;

View all rows in the table:
select * from test_jzl_20140701_test;  
a       b       1,2,3
c       d       4,5,6

Split col3 into an array and expand each element into its own row:
select col1,col2,name 
from test_jzl_20140701_test  
lateral view explode(split(col3,',')) col3 as name;
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

Solution 2:

drop table test_jzl_20140701_test1;
Create the table:
create table test_jzl_20140701_test1
(
col1 string,
col2 string,
col3 array<int>
)
row format delimited 
fields terminated by ' '
collection items terminated by ','   -- delimiter between array elements
stored as textfile;

Load the data:
load data local inpath '/home/jiangzl/shell/test.txt' into table test_jzl_20140701_test1;

View all rows in the table:
select * from test_jzl_20140701_test1; 
a       b       [1,2,3]
c       d       [4,5,6]

Expand each element of the array into its own row:
select col1,col2,name 
from test_jzl_20140701_test1  
lateral view explode(col3) col3 as name;
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6

1.4 Additional notes:

select * from test_jzl_20140701_test; 
a       b       1,2,3
c       d       4,5,6

select t.list[0],t.list[1],t.list[2] 
from (
      select (split(col3,',')) list from test_jzl_20140701_test
     )t;

OK
1       2       3
4       5       6

-- Get the length of the array
select size(split(col3,',')) list from test_jzl_20140701_test;
3
3

Column to rows, part 2: explode(array):

select explode(array('A','B','C'));
A
B
C
select explode(array('A','B','C')) as col;
col
A
B
C
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
A
B
C
select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;
col
A
B
C

explode(map):

select explode(map('A',10,'B',20,'C',30));
A	10
B	20
C	30
select explode(map('A',10,'B',20,'C',30)) as (key,value);
key value
A	10
B	20
C	30
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
A	10
B	20
C	30
select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
key value
A	10
B	20
C	30

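Newly added function: inline(). Like explode, inline can expand a single column into multiple rows and columns.

Argument form: inline(ARRAY<STRUCT[,STRUCT]>). Each struct in the array becomes a row and each struct field becomes a column.
inline is usually used together with lateral view.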

select t1.col1 as name,t1.col2 as sub1
from employees 
lateral view inline(array(struct(name,subordinates[0]))) t1

inline with several structs in the array:

select t1.col1 as name,t1.col2 as sub
from employees 
lateral view inline(array(struct(name,subordinates[0]),
                          struct(name,subordinates[1]))) t1
where t1.col2 is not null

The columns produced by inline can also be given aliases:

select t1.name,t1.sub
from employees 
lateral view inline(array(struct(name,subordinates[0]),
                          struct(name,subordinates[1]))) t1 as name,sub
where t1.sub is not null

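(2) collect_set() / collect_list() functions;

collect_set / collect_list:
collect_set(col) accepts only primitive types and takes a single column. It aggregates the values of that column, removing duplicates, and produces an array-typed field.
collect_list returns array< ? >, where ? is the type of the column.
PS: collect_list() keeps duplicates; collect_set() removes them.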

 

2. Rows to column:

2.1 Problem:
In Hive, how do we turn
a       b       1
a       b       2
a       b       3
c       d       4
c       d       5
c       d       6
into:
a       b       1,2,3
c       d       4,5,6

2.2 Raw data:

test.txt
a       b       1 
a       b       2 
a       b       3 
c       d       4 
c       d       5 
c       d       6

2.3 Solution 1:

drop table tmp_jiangzl_test;
Create the table:
create table tmp_jiangzl_test
(
col1 string,
col2 string,
col3 string
)
row format delimited fields terminated by '\t'
stored as textfile;
Load the data:
load data local inpath '/home/jiangzl/shell/test.txt' into table tmp_jiangzl_test;
Query:
select col1,col2,concat_ws(',',collect_set(col3)) 
from tmp_jiangzl_test  
group by col1,col2;
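As a rough Python model of what the query above computes: group the rows by (col1, col2), gather col3 into a list, then join it with commas. It is a sketch of the semantics rather than Hive's implementation; the collect helper and the sample rows are hypothetical, and Hive does not guarantee the element order that this sketch happens to preserve.

```python
def collect(rows, distinct):
    """Group (col1, col2, col3) rows by (col1, col2) and gather col3 values."""
    groups = {}
    for col1, col2, col3 in rows:
        values = groups.setdefault((col1, col2), [])
        if col3 is None:
            continue  # both functions skip NULL values
        if distinct and col3 in values:
            continue  # collect_set drops duplicates, collect_list keeps them
        values.append(col3)
    return groups


rows = [("a", "b", "1"), ("a", "b", "2"), ("a", "b", "2"),
        ("c", "d", None), ("c", "d", "4")]

for (col1, col2), values in collect(rows, distinct=True).items():
    # ",".join plays the role of concat_ws(',', ...)
    print(col1, col2, ",".join(values))
```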

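Notes on collect_list() / collect_set():

1. We used concat_ws above, but that function only supports string and array< string >, so a column that is not a string has to be cast to string first. Since collect_list returns array< ? >, where ? is the type of the column, the cast goes inside the aggregate: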

select id,concat_ws(',',collect_list(cast (name as string))) from table group by id

In this example, concat_ws joins all the strings into one, separated by commas.
This achieves the rows-to-column conversion and produces the output we need.

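2. When the argument column of collect_list()/collect_set() contains NULLs, the functions skip them and collect only the non-NULL values, so a NULL check is needed if the NULLs must be kept. See the following question and answer: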

I am trying to collect a column with NULLs along with some values in that column...But collect_list ignores the NULLs and collects only the ones with values in it. Is there a way to retrieve the NULLs along with other values ?

SELECT col1, col2, collect_list(col3) as col3
FROM (SELECT * FROM table_1 ORDER BY col1, col2, col3) t
GROUP BY col1, col2;

Actual col3 values:

0.9
NULL
NULL
0.7
0.6

Resulting col3 values:

[0.9, 0.7, 0.6]

I was hoping that there is a hive solution that looks like this [0.9, NULL, NULL, 0.7, 0.6] after applying the collect_list.

Answer:

This function works like this, but I've found the following workaround. Add a case when statement to your query to check and keep NULLs.

SELECT col1,
    col2,
    collect_list(CASE WHEN col3 IS NULL THEN 'NULL' ELSE col3 END) as col3
-- or:
--  collect_list(coalesce(col3, 'NULL')) as col3
FROM (SELECT * FROM table_1 ORDER BY col1, col2, col3) t
GROUP BY col1, col2;

Now, because you had a string element ('NULL') the whole result set is an array of strings. At the end just convert the array of strings to an array of double values.

 

(3) Date functions;

(3.1) unix_timestamp(): Gets the current Unix timestamp in seconds, using the default time zone.
Return type: bigint;

select unix_timestamp();
1510302062

unix_timestamp(string date): Converts a time string in the format yyyy-MM-dd HH:mm:ss to a Unix timestamp in seconds. The description says it returns 0 on failure, although the second query below returned NULL: unix_timestamp('2009-03-20 11:30:01') = 1237573801;
Return type: bigint;

select unix_timestamp('2017-11-10 16:24:00');
1510302240
-- With a string that does not match the default format:
select unix_timestamp('2017-11-10');
NULL

unix_timestamp(string date, string pattern): Converts a time string with the given pattern to a Unix timestamp, return 0 if fail: unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400;
Return type: bigint;

select unix_timestamp('2017-11-10','yyyy-MM-dd');
1510243200

select unix_timestamp('2017-11-10 12:30:10','yyyy-MM-dd HH:mm:ss');
1510288210
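The conversions above can be cross-checked with a small Python model. The UTC+8 session time zone is an assumption inferred from the outputs shown, the patterns are Python strptime patterns rather than the Java-style patterns Hive uses, and returning None on a mismatch mirrors the NULL seen above.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the session time zone is UTC+8, inferred from the outputs above.
SESSION_TZ = timezone(timedelta(hours=8))


def unix_timestamp(text, pattern="%Y-%m-%d %H:%M:%S"):
    """Parse text with a strptime pattern; return epoch seconds, or None on mismatch."""
    try:
        parsed = datetime.strptime(text, pattern)
    except ValueError:
        return None
    return int(parsed.replace(tzinfo=SESSION_TZ).timestamp())


print(unix_timestamp("2017-11-10 16:24:00"))
print(unix_timestamp("2017-11-10"))
print(unix_timestamp("2017-11-10", "%Y-%m-%d"))
```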

(3.2) from_unixtime(bigint unixtime[, string format]): Converts the number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format '1970-01-01 00:00:00';
Return type: string;

select from_unixtime(1510243200),from_unixtime(1510243200,'yyyy-MM-dd');
       2017-11-10 00:00:00       2017-11-10

select from_unixtime(1510288210),from_unixtime(1510288210,'yyyy-MM-dd'),from_unixtime(1510288210,'yyyy-MM-dd HH:mm:ss');
       2017-11-10 12:30:10        2017-11-10                                  2017-11-10 12:30:10

(3.3) to_date(string timestamp): Returns the date part of a timestamp string:
      to_date('1970-01-01 00:00:00') = '1970-01-01';
Return type: string before Hive 2.1.0, date from Hive 2.1.0 onward;

select to_date('2017-11-10 16:48:20');
2017-11-10

(3.4) year(string date): Returns the year part of a date or a timestamp string:
      year('1970-01-01 00:00:00') = 1970, year('1970-01-01') = 1970;
Return type: int;

select year("2017-11-10"),year("2017-11-10 16:54:30");
       2017               2017

(3.5) month(string date): Returns the month part of a date or a timestamp string:
      month('1970-11-01 00:00:00') = 11, month('1970-11-01') = 11;
Return type: int;

select month("2017-11-10"),month("2017-11-10 16:54:30");
       11                  11

(3.6) day(string date), dayofmonth(date): Returns the day part of a date or a timestamp string:
      day('1970-11-01 00:00:00') = 1, day('1970-11-01') = 1;
Return type: int;

select day("2017-11-10 16:54:30"),day("2017-11-10");
       10                         10
select dayofmonth("2017-11-10 17:02:30"),dayofmonth("2017-11-10");
       10                                10

(3.7) hour(string date): Returns the hour of the timestamp:
      hour('2009-07-30 12:58:59') = 12, hour('12:58:59') = 12;
Return type: int;

select hour("2017-11-10 17:02:30"),hour("17:02:30");
       17                          17

(3.8) minute(string date): Returns the minute of the timestamp; return type: int;

select minute("2017-11-10 17:02:30"),minute("17:02:30");
       2                              2

(3.9) second(string date): Returns the second of the timestamp; return type: int;

select second("2017-11-10 17:02:30"),second("17:02:30");
       30                             30

(4.0) weekofyear(string date): Returns the week number of a timestamp string:
      weekofyear('1970-11-01 00:00:00') = 44, weekofyear('1970-11-01') = 44;
Return type: int;

select weekofyear("2017-11-10 17:02:30"),weekofyear("2017-11-10");
       45                                 45

(4.1) datediff(string enddate, string startdate): Returns the number of days from startdate to enddate:
      datediff('2009-03-01', '2009-02-27') = 2;
Return type: int;

select datediff("2017-11-11","2017-11-10");
       1

(4.2) date_add(date/timestamp/string startdate, tinyint/smallint/int days):
      Adds a number of days to startdate: date_add('2008-12-31', 1) = '2009-01-01';
Return type: string before Hive 2.1.0, date from Hive 2.1.0 onward;

select date_add("2017-11-10",1);
2017-11-11

(4.3) date_sub(date/timestamp/string startdate, tinyint/smallint/int days):
      Subtracts a number of days from startdate: date_sub('2008-12-31', 1) = '2008-12-30';
Return type: string before Hive 2.1.0, date from Hive 2.1.0 onward;

select date_sub("2017-11-11",1);
2017-11-10

(4.4) from_utc_timestamp({any primitive type}*, string timezone) (author's note: not easy to understand):
     Assumes the given timestamp is UTC and converts it to the given time zone (as of Hive 0.8.0).
   * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint,
     float/double and decimal. Fractional values are considered as seconds.
     Integer values are considered as milliseconds.
     E.g.: from_utc_timestamp(2592000.0,'PST');  --> 1970-01-31 00:00:00;
           from_utc_timestamp(2592000000,'PST');  --> 1970-01-31 00:00:00;
     and   from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST'); --> 1970-01-30 08:00:00
Return type: timestamp;

select from_utc_timestamp(2592000.0,'PST');
1970-01-31 00:00:00
select from_utc_timestamp(2592000000,'PST');
1970-01-31 00:00:00
select from_utc_timestamp(timestamp '1970-01-30 16:00:00','PST');
1970-01-30 08:00:00

(4.5) to_utc_timestamp({any primitive type}*, string timezone) (author's note: not easy to understand):
     Assumes the given timestamp is in the given time zone and converts it to UTC (as of Hive 0.8.0).
   * timestamp is a primitive type, including timestamp/date, tinyint/smallint/int/bigint,
     float/double and decimal. Fractional values are considered as seconds.
     Integer values are considered as milliseconds.
     E.g.: to_utc_timestamp(2592000.0,'PST');   --> 1970-01-31 16:00:00
           to_utc_timestamp(2592000000,'PST'); --> 1970-01-31 16:00:00
     and   to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST');  --> 1970-01-31 00:00:00
Return type: timestamp;

select to_utc_timestamp(2592000.0,'PST');
1970-01-31 16:00:00
select to_utc_timestamp(2592000000,'PST');
1970-01-31 16:00:00
select to_utc_timestamp(timestamp '1970-01-30 16:00:00','PST');
1970-01-31 00:00:00

(4.6) current_date(): Returns the current date at the start of query evaluation (as of Hive 1.2.0).
      All calls of current_date within the same query return the same value.
Return type: date;

select current_date();
2017-11-10

(4.7)current_timestamp(): Returns the current timestamp at the start of query evaluation (as of Hive 1.2.0). All calls of current_timestamp within the same query return the same value.
Return type: timestamp;

select current_timestamp();
2017-11-10 17:49:59.992

(4.8) add_months(string start_date, int num_months): Returns the date that is num_months after start_date (as of Hive 1.1.0). start_date is a string, date or timestamp. num_months is an integer. The time part of start_date is ignored. If start_date is the last day of the month or if the resulting month has fewer days than the day component of start_date, then the result is the last day of the resulting month. Otherwise, the result has the same day component as start_date.
Return type: string;

select add_months('2017-10-01',1);
2017-11-01
select add_months('2017-10-31',1);
2017-11-30
select add_months('2017-02-28',1);
2017-03-31
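The last-day rule in the description is easy to misread, so here is a Python sketch of it. It models the rule exactly as stated above and is not Hive's source code; the function name is reused only for readability.

```python
import calendar
from datetime import date


def add_months(start, num_months):
    """Model of add_months as described above, including the last-day rule."""
    month_index = start.year * 12 + (start.month - 1) + num_months
    year, month = divmod(month_index, 12)
    month += 1
    last_of_start = calendar.monthrange(start.year, start.month)[1]
    last_of_result = calendar.monthrange(year, month)[1]
    if start.day == last_of_start or start.day > last_of_result:
        day = last_of_result  # snap to the last day of the resulting month
    else:
        day = start.day
    return date(year, month, day)


print(add_months(date(2017, 10, 31), 1))
print(add_months(date(2017, 2, 28), 1))
```

The second call shows why 2017-02-28 maps to 2017-03-31: the start date is the last day of February, so the result snaps to the last day of March.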

(4.9) last_day(string date): Returns the last day of the month which the date belongs to
      (as of Hive 1.1.0). date is a string in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'.
      The time part of date is ignored;
Return type: string;

select last_day('2017-11-10'),last_day("2017-11-10 17:48:20");
       2017-11-30	          2017-11-30

(5.0) next_day(string start_date, string day_of_week): Returns the first date which is later than start_date and named as day_of_week (as of Hive 1.2.0). start_date is a string/date/timestamp. day_of_week is 2 letters, 3 letters or full name of the day of the week (e.g. Mo, tue, FRIDAY). The time part of start_date is ignored. Example: next_day('2015-01-14', 'TU') = 2015-01-20.
Return type: string;

select next_day('2017-11-10','Wed');
2017-11-15

(5.1) trunc(string date, string format): Returns date truncated to the unit specified by the format (as of Hive 1.2.0). Supported formats: MONTH/MON/MM, YEAR/YYYY/YY.
      Example: trunc('2015-03-17', 'MM') = 2015-03-01.
Return type: string;

select trunc('2017-11-10','YYYY');
2017-01-01

select trunc('2017-11-10','MM');
2017-11-01

(5.2) months_between(date1, date2): Returns the number of months between dates date1 and date2 (as of Hive 1.2.0). If date1 is later than date2, then the result is positive. If date1 is earlier than date2, then the result is negative. If date1 and date2 are either the same days of the month or both last days of months, then the result is always an integer. Otherwise the UDF calculates the fractional portion of the result based on a 31-day month and considers the difference in time components of date1 and date2. date1 and date2 can be date, timestamp or string in the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') = 3.94959677;
Return type: double;

select months_between('2017-10-01','2017-11-01');
-1.0
select months_between('2017-11-01','2017-10-01');
1.0
select months_between('2017-11-01','2017-11-01');
0.0
select months_between('2017-11-02','2017-11-01');
0.03225806
select months_between('2017-10-31','2017-11-01');
-0.03225806
select months_between('2017-11-30','2017-10-31');
1.0
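The fractional part is the least obvious piece, so the following Python sketch spells out the arithmetic implied by the description: whole months plus the day-and-time difference divided by a 31-day month. It is a model written from that description, not taken from Hive's source.

```python
import calendar
from datetime import datetime


def months_between(d1, d2):
    """Model of months_between following the rules in the description above."""
    whole = (d1.year - d2.year) * 12 + (d1.month - d2.month)
    last1 = calendar.monthrange(d1.year, d1.month)[1]
    last2 = calendar.monthrange(d2.year, d2.month)[1]
    if d1.day == d2.day or (d1.day == last1 and d2.day == last2):
        return float(whole)  # same day of month, or both month ends

    def seconds_into_month(d):
        return d.day * 86400 + d.hour * 3600 + d.minute * 60 + d.second

    # Fraction of a 31-day month, including the time-of-day difference
    fraction = (seconds_into_month(d1) - seconds_into_month(d2)) / (31 * 86400)
    return round(whole + fraction, 8)


print(months_between(datetime(2017, 11, 2), datetime(2017, 11, 1)))
print(months_between(datetime(1997, 2, 28, 10, 30), datetime(1996, 10, 30)))
```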


(5.3) date_format(date/timestamp/string ts, string fmt): Converts a date/timestamp/string to a string in the format specified by the date format fmt (as of Hive 1.2.0).
Supported formats are Java SimpleDateFormat
formats: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html.
The second argument fmt should be constant.
Example: date_format('2015-04-08', 'y') = '2015'.
date_format can be used to implement other UDFs,
e.g.: dayname(date) is date_format(date, 'EEEE');  --> the full English name of the weekday of date;
dayofyear(date) is date_format(date, 'D'); --> the day of the year on which date falls;
Return type: string;


 

(4)Mathematical Functions:

The following built-in mathematical functions are supported in Hive; most return NULL when the argument(s) are NULL:
(4.1) round(double a): Returns the rounded BIGINT value of the double;
Return type: bigint;

select round(5.5),round(5.4);
       6           5

(4.2) round(double a, int d): Returns the double rounded to d decimal places;
Return type: double;

select round(5.52,1),round(5.52,2),round(5.52,3);
        5.5           5.52          5.52

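(4.2)+New add: bround(double a): banker's rounding (round half to even): 1 to 4 round down, 6 to 9 round up, and an exact 5 rounds down when the preceding digit is even and up when it is odd;
Return type: double;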

select bround(8.25,1),bround(8.35,1);
       8.2            8.4

(4.2)+New Add2: bround(double a, int d): banker's rounding to d decimal places;

select bround(8.225,2),bround(8.235,2);
       8.22            8.24
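The half-to-even rule can be reproduced in Python with the decimal module. This is a sketch of the rounding rule only; taking the input as a string is a choice made here so that values such as 8.235 are represented exactly, which is not how Hive's double-based bround receives its argument.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def bround(text, places=0):
    """Round a decimal string half-to-even to the given number of places."""
    quantum = Decimal(1).scaleb(-places)  # places=2 -> Decimal('0.01')
    return Decimal(text).quantize(quantum, rounding=ROUND_HALF_EVEN)


print(bround("8.25", 1), bround("8.35", 1))
print(bround("8.225", 2), bround("8.235", 2))
```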

(4.3) floor(double a): Returns the maximum BIGINT value that is equal to or less than the double;
Return type: bigint;

select floor(5.0),floor(5.2),floor(5.9);
       5           5          5

(4.4) ceil(double a), ceiling(double a):
Returns the minimum BIGINT value that is equal to or greater than the double;
Return type: bigint;

select ceil(5.0),ceil(5.01),ceil(5.99);
       5          6          6

(4.5) rand(), rand(int seed):
Returns a random number (that changes from row to row) that is distributed uniformly from 0 to 1. Specifying the seed makes the generated random number sequence deterministic.
Return type: double;

select 1,rand();
Run 1: 1	0.16656453881197997
Run 2: 1	0.9930518648319836
Run 3: 1	0.4714714776339659
Run 4: 1	0.4895444194153318

select 1,rand(2);
Run 1: 1	0.7311469360199058
Run 2: 1	0.7311469360199058

(4.6) exp(double a), exp(decimal a): Returns e^a where e is the base of the natural logarithm;
Return type: double;

select exp(0.1);
1.1051709180756477

select exp(2);
7.38905609893065

select exp(2.1);
8.166169912567652

(4.7) ln(double a), ln(DECIMAL a): Returns the natural logarithm of the argument;
Return type: double;

select ln(2);
0.6931471805599453

select ln(2.3);
0.8329091229351039

(4.8) log10(double a), log10(DECIMAL a): Returns the base-10 logarithm of the argument;
Return type: double;

(4.9) log2(double a), log2(DECIMAL a): Returns the base-2 logarithm of the argument;
Return type: double;

(5.0) log(double base, double a), log(DECIMAL base, DECIMAL a):
      Returns the logarithm of a in the given base;
Return type: double;

select log(2,4);
2.0

select log(2.1,8.2);
2.835999790572199

(5.1) pow(double a, double p), power(double a, double p): Returns a^p, i.e. a raised to the power p;

(5.2) sqrt(double a), sqrt(DECIMAL a): Returns the square root of a;
Return type: double;

select sqrt(100),sqrt(0.01);
       10         0.1

(5.3) bin(BIGINT a): Returns the number in binary format, i.e. converts a decimal number to its binary representation;
Return type: string;

select bin(10);
       1010

(5.4) hex(BIGINT a), hex(string a), hex(BINARY a): If the argument is an int, hex returns the number as a string in hex format. Otherwise, if the argument is a string, it converts each character into its hex representation and returns the resulting string.
Return type: string;

select hex(10);
A (or a)
select hex("A");
41
select hex("a");
61
select hex('ab');
6162

(5.5) unhex(string a): Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts it to the character represented by that number.
Return type: binary;

select unhex('616263'),unhex(616263);
       abc             abc

(5.6) conv(BIGINT num, int from_base, int to_base), conv(STRING num, int from_base, int to_base):
Converts the number num from base from_base to base to_base;
Return type: string;

select conv(17,10,16);  -- convert 17 from base 10 to base 16
11
select conv(17,10,2);   -- convert 17 from base 10 to base 2
10001
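For readers who want to check such conversions by hand, here is a Python sketch of base conversion by repeated division. It handles only non-negative values and bases up to 36, both of which are simplifying assumptions of this sketch rather than documented limits of Hive's conv.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def conv(num, from_base, to_base):
    """Convert a non-negative number written in from_base into to_base."""
    value = int(str(num), from_base)  # parse the digits in the source base
    if value == 0:
        return "0"
    out = []
    while value > 0:
        value, digit = divmod(value, to_base)
        out.append(DIGITS[digit])
    return "".join(reversed(out))


print(conv(17, 10, 16))
print(conv(17, 10, 2))
```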

(5.7) abs(double a): Returns the absolute value of a;
Return type: double;

select abs(-3.9);
3.9
select abs(3.9);
3.9

(5.8) pmod(int a, int b), pmod(double a, double b):
     Returns the positive value of a mod b, i.e. the non-negative remainder of a divided by b;
Return type: int or double, matching the arguments;

select pmod(9,4);
1
select pmod(-9,4);
3
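The difference between a plain remainder and pmod shows up only for negative inputs. The sketch below uses the common formula ((a mod b) + b) mod b with a truncating remainder; that this is the formula behind Hive's pmod is an assumption here, and java_mod is a hypothetical helper for integer arguments only.

```python
import math


def java_mod(a, b):
    """Remainder that takes the sign of the dividend (truncated division)."""
    return int(math.fmod(a, b))


def pmod(a, b):
    """Positive modulus: shift a negative remainder into the range [0, b)."""
    return java_mod(java_mod(a, b) + b, b)


print(java_mod(-9, 4), pmod(-9, 4))
print(java_mod(9, 4), pmod(9, 4))
```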

(5.9) sin(double a), sin(DECIMAL a): Returns the sine of a (a is in radians);
Return type: double;

select sin(0.8);
0.7173560908995228

(6.0) asin(double a), asin(DECIMAL a): Returns the arc sine of a if -1<=a<=1, or NULL otherwise;
Return type: double;

select asin(0.7173560908995228);
0.8

(6.1) cos(double a), cos(DECIMAL a): Returns the cosine of a (a is in radians);
Return type: double;

select cos(0.9);
0.6216099682706644

(6.2) acos(double a), acos(DECIMAL a): Returns the arc cosine of a if -1<=a<=1, or NULL otherwise;
Return type: double;

select acos(0.6216099682706644);
0.9

(6.3) tan(double a), tan(DECIMAL a): Returns the tangent of a (a is in radians);
Return type: double;

select tan(1);
1.5574077246549023

(6.4) atan(double a), atan(DECIMAL a): Returns the arctangent of a;
Return type: double;

select atan(1.5574077246549023);
1.0

(6.5) degrees(double a), degrees(DECIMAL a):
     Converts the value of a from radians to degrees;
Return type: double;

select degrees(30);
1718.8733853924698

(6.6) radians(double a):
     Converts the value of a from degrees to radians;
Return type: double;

select radians(30);
0.5235987755982988

(6.7) positive(int a), positive(double a): Returns a unchanged (it does not take the absolute value);
Return type: int or double;

select positive(-10),positive(10);
        -10           10

(6.8) negative(int a), negative(double a): Returns -a;
Return type: int or double;

select negative(-10),negative(10);
       10             -10

(6.9) sign(double a), sign(DECIMAL a): Returns the sign of a:
     1.0 if a is positive, -1.0 if a is negative, and 0.0 otherwise;
Return type: double;

select sign(-10),sign(0),sign(10);
       -1.0       0.0     1.0

(7.0) e(): Returns the value of e, the mathematical constant;
Return type: double;

select e();
2.718281828459045

(7.1) pi(): Returns the value of pi, the mathematical constant;
Return type: double;

select pi();
3.141592653589793

(7.2) factorial(int a): Returns the factorial of a;
Return type: bigint;

select factorial(2);  -- equivalent to 2*1
2
select factorial(3);  -- equivalent to 3*2*1
6
select factorial(4);  -- equivalent to 4*3*2*1
24

(7.3) cbrt(double a): Returns the cube root of a;
Return type: double;

select cbrt(8),cbrt(27);
       2        3

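(7.4) shiftleft(tinyint/smallint/int a, int b): bitwise left shift. The notes and examples that follow are quoted from the Impala documentation (see "Added in" below), where small literals are treated as 8-bit values;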
    Purpose: Shifts an integer value left by a specified number of bits. As the most significant 
    bit is taken out of the original value, it is discarded and the least significant bit becomes 0. 
    In computer science terms, this operation is a "logical shift".
Usage notes:
The final value has either the same number of 1 bits as the original value, or fewer. Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces a result of zero.
Specifying a second argument of zero leaves the original value unchanged. Shifting any value by 0 returns the original value. Shifting any value by 1 is the same as multiplying it by 2, as long as the value is small enough; larger values eventually become negative when shifted, as the sign bit is set. Starting with the value 1 and shifting it left by N positions gives the same result as 2 to the Nth power, or pow(2,N).

Return type: Same as the input value
Added in: CDH 5.5.0 (Impala 2.3.0)

select shiftleft(1,0); /* 00000001 -> 00000001 */
+-----------------+
| shiftleft(1, 0) |
+-----------------+
| 1               |
+-----------------+

select shiftleft(1,3); /* 00000001 -> 00001000 */
+-----------------+
| shiftleft(1, 3) |
+-----------------+
| 8               |
+-----------------+

select shiftleft(8,2); /* 00001000 -> 00100000 */
+-----------------+
| shiftleft(8, 2) |
+-----------------+
| 32              |
+-----------------+

select shiftleft(127,1); /* 01111111 -> 11111110 */
+-------------------+
| shiftleft(127, 1) |
+-------------------+
| -2                |
+-------------------+

select shiftleft(127,5); /* 01111111 -> 11100000 */
+-------------------+
| shiftleft(127, 5) |
+-------------------+
| -32               |
+-------------------+

select shiftleft(-1,4); /* 11111111 -> 11110000 */
+------------------+
| shiftleft(-1, 4) |
+------------------+
| -16              |
+------------------+

(7.5) shiftright(tinyint/smallint/int a, int b): bitwise right shift;
Purpose: Shifts an integer value right by a specified number of bits. As the least significant bit is taken out of the original value, it is discarded and the most significant bit becomes 0. In computer science terms, this operation is a "logical shift".
Usage notes:
Therefore, the final value has either the same number of 1 bits as the original value, or fewer. Shifting an 8-bit value by 8 positions, a 16-bit value by 16 positions, and so on produces a result of zero.
Specifying a second argument of zero leaves the original value unchanged. Shifting any value by 0 returns the original value. Shifting any positive value right by 1 is the same as dividing it by 2. Negative values become positive when shifted right.
Return type: Same as the input value
Added in: CDH 5.5.0 (Impala 2.3.0)

select shiftright(16,0); /* 00010000 -> 00010000 */
+-------------------+
| shiftright(16, 0) |
+-------------------+
| 16                |
+-------------------+

select shiftright(16,4); /* 00010000 -> 00000001 */
+-------------------+
| shiftright(16, 4) |
+-------------------+
| 1                 |
+-------------------+

select shiftright(16,5); /* 00010000 -> 00000000 */
+-------------------+
| shiftright(16, 5) |
+-------------------+
| 0                 |
+-------------------+

select shiftright(-1,1); /* 11111111 -> 01111111 */
+-------------------+
| shiftright(-1, 1) |
+-------------------+
| 127               |
+-------------------+

select shiftright(-1,5); /* 11111111 -> 00000111 */
+-------------------+
| shiftright(-1, 5) |
+-------------------+
| 7                 |
+-------------------+
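The 8-bit results in the tables above, including the wrap to negative values, can be reproduced with a short Python model. The default width of 8 bits matches the examples only; the to_signed helper is hypothetical, and other engines or wider integer types will give different results for the same inputs.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def shiftleft(a, b, bits=8):
    # Bits pushed past the most significant position are discarded
    return to_signed(a << b, bits)


def shiftright(a, b, bits=8):
    # Logical shift: the vacated high bits are filled with 0
    return to_signed((a & ((1 << bits) - 1)) >> b, bits)


print(shiftleft(127, 1), shiftleft(-1, 4))
print(shiftright(-1, 1), shiftright(16, 4))
```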

(7.6) shiftrightunsigned(tinyint/smallint/int a, int b): unsigned bitwise right shift (>>>);
      shiftrightunsigned(bigint a, int b)
Return type: int or bigint;

select shiftrightunsigned(16,4);
1

(7.7) greatest(v1,v2,v3,...): Returns the greatest value of the list of values (as of Hive 1.1.0).
      Fixed to return NULL when one or more arguments are NULL, and strict type restriction relaxed,
      consistent with ">" operator (as of Hive 2.0.0).

select greatest(2,3,1);
3
select greatest(2,3,NULL);
NULL
select greatest('1','2','3');
3
select greatest('1','2','a');
a
select greatest('1','aa','a');
aa
select greatest('zz','aa','a');
zz
select greatest('1a','aa','a');
aa

(7.8) least(v1,v2,v3,...): Returns the least value of the list of values (as of Hive 1.1.0).
     Fixed to return NULL when one or more arguments are NULL, and strict type restriction
     relaxed, consistent with "<" operator (as of Hive 2.0.0).

select least(1,2,3); --> 1 
select least(-100,2,-3); --> -100
select least('1','2','3'); --> 1
select least('1','2','a'); --> 1
select least('b','aa','a'); --> a

(5)String Functions;

(5.1) ascii(string str): Returns the numeric value of the first character of str; return type: int;

select ascii('a'); --> 97 
select ascii('ab'); --> 97
select ascii('ba'); --> 98

(5.2) base64(binary bin) (author's note: to be expanded later); return type: string;
     Converts the argument from binary to a base-64 string (as of 0.12.0);
If your parameter is already binary, just use: base64(bin_field);
Otherwise, if it is in text format and you want to convert it to binary UTF-8 and then to base 64, combine: base64(encode(text_field, 'UTF-8'));

select base64(encode('a','UTF-8'));
YQ==
select base64(binary('1'));
MQ==

unbase64(string str): Converts the argument from a base-64 string to BINARY (as of Hive 0.12.0).
Return type: binary;

select unbase64("MQ==");
1
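The two results above are ordinary Base64, so they can be cross-checked outside Hive. The Python sketch below mirrors base64(encode(text, 'UTF-8')) and unbase64 using the standard library; it only verifies the encoding and says nothing about Hive's internals.

```python
import base64

# Counterpart of base64(encode('a', 'UTF-8'))
encoded = base64.b64encode("a".encode("utf-8")).decode("ascii")
print(encoded)

# Counterpart of unbase64("MQ=="), decoded to text for display
decoded = base64.b64decode("MQ==").decode("utf-8")
print(decoded)
```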

(5.3) chr(bigint|double A): Returns the ASCII character having the binary equivalent to A (as of Hive 1.3.0 and 2.1.0). If A is larger than 256 the result is equivalent to chr(A % 256). Example: select chr(88); returns 'X'. Return type: string;

select chr(97);
a

(5.4) concat(string|binary A, string|binary B...): Returns the string or bytes resulting from concatenating the strings or bytes passed in as parameters, in order. E.g. concat('foo', 'bar') results in 'foobar'. Note that this function can take any number of input strings. Return type: string;

select concat('a','b');
ab

(5.5)concat_ws(string SEP, string A, string B…):Like concat() above, but with custom separator SEP.

select concat_ws('-','a','b');
a-b

(5.6) concat_ws(string SEP, array<string>): Like concat_ws() above, but taking an array of strings (as of Hive 0.9.0); return type: string;

(5.7)/(5.8):

(5.9) decode(binary bin, string charset): Decodes the first argument into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.);
Return type: string;   --> author's note: not yet sure how to apply it;
(6.0) encode(string src, string charset): Encodes the first argument into a BINARY using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null. (As of Hive 0.12.0.);
Return type: binary;    --> author's note: not yet sure how to apply it;
(6.1) elt(N int, str1 string, str2 string, str3 string, ...): Returns the string at index N.
For example elt(2,'hello','world') returns 'world'. Returns NULL if N is less than 1 or greater than the number of arguments.
(See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_elt)
Return type: string;

select elt(1,'face','book'); -->face
select elt(0,'face','book'); -->NULL
select elt(3,'face','book'); -->NULL

(6.2) field(val T, val1 T, val2 T, val3 T, ...): Returns the index of val in the val1,val2,val3,... list, or 0 if not found. For example field('world','say','hello','world') returns 3. All primitive types are supported; arguments are compared using str.equals(x). If val is NULL, the return value is 0.
(See https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_field)

Return type: int;

select field('world','say','hello','world'); -->3
select field('world','say','hello'); --> 0 
-- only the position of the first occurrence is returned;
select field('hi','hello','hi','how','hi'); --> 2
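
The 1-based indexing shared by elt() and field() can be mimicked outside Hive. The following is a minimal Python sketch of the behaviour described above, not Hive's implementation; the Python function names merely mirror the Hive ones, and NULL is modelled as None.

```python
from typing import Optional


def elt(n: int, *values: str) -> Optional[str]:
    """Return the n-th value (1-based), or None when n is out of range."""
    if n < 1 or n > len(values):
        return None
    return values[n - 1]


def field(val, *candidates) -> int:
    """Return the 1-based position of the first candidate equal to val, else 0."""
    if val is None:
        return 0
    for position, candidate in enumerate(candidates, start=1):
        if candidate == val:
            return position
    return 0
```

Both functions use 0 or None as the "not found" signal, which is why positions start at 1 rather than 0.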

(6.3)find_in_set(string str, string strlist):Returns the first occurrence of str in strList where strList is a comma-delimited string. Returns null if either argument is null. Returns 0 if the first argument contains any commas. E.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3;  Return type: int;

select find_in_set('ab', 'abc,b,ab,c,def'); --> 3
select find_in_set('a,b', 'abc,b,ab,c,def'); --> 0
select find_in_set('a,b', 'abc,NULL,ab,c,def'); --> 0
select find_in_set('NULL', 'abc,NULL,ab,c,def'); --> 2 
select find_in_set(NULL, 'abc,NULL,ab,c,def'); --> NULL
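
The rules above (NULL in, NULL out; a comma in the needle gives 0; otherwise a 1-based position) fit in a few lines. This is an illustrative Python sketch of the documented semantics only; None stands in for SQL NULL, and the string 'NULL' is treated as ordinary text, as in the examples.

```python
from typing import Optional


def find_in_set(needle: Optional[str], strlist: Optional[str]) -> Optional[int]:
    """Mimic find_in_set(): 1-based position of needle in a comma-delimited list."""
    if needle is None or strlist is None:
        return None  # either argument NULL -> NULL
    if "," in needle:
        return 0  # a needle containing a comma can never match one item
    for position, item in enumerate(strlist.split(","), start=1):
        if item == needle:
            return position
    return 0  # not found
```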

(6.4)format_number(number x, int d):Formats the number X to a format like '#,###,###.##', rounded to D decimal places, and returns the result as a string. If D is 0, the result has no decimal point or fractional part. (as of Hive 0.10.0);  Return type: string;

select format_number(100.1232,2); --> 100.12
select format_number(100.1262,2); --> 100.13
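
The examples above do not show the thousands separator. As a rough illustration, Python's format mini-language produces the same '#,###,###.##' layout; this is only a sketch of the output shape, and the handling of exact rounding ties is an assumption that may differ from Hive.

```python
def format_number(x: float, d: int) -> str:
    """Format x with thousands separators and d decimal places."""
    # ',' inserts thousands separators, '.{d}f' fixes the number of decimals.
    return f"{x:,.{d}f}"
```

With d = 0 the result has no decimal point, matching the description above.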

(6.5)get_json_object(string json_string, string path):Extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. It returns null if the input JSON string is invalid. NOTE: The json path can only have the characters [0-9a-z_], i.e., no upper-case or special characters. Also, the keys cannot start with numbers. This is due to restrictions on Hive column names.

Return type: string;

select  get_json_object('{"store":
                                  {"fruit":[{"weight":8,"type":"apple"},
                                             {"weight":9,"type":"pear"}],
                                   "bicycle":{"price":19.95,"color":"red"}
                                  },
                         "email":"amy@only_for_json_udf_test.net",
                         "owner":"amy"
                         }
                        ','$.owner'
                        );
amy


select  get_json_object('{"store":
                                  {"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
                                   "bicycle":{"price":19.95,"color":"red"}
                                  },
                         "email":"amy@only_for_json_udf_test.net",
                         "owner":"amy"
                         }
                        ','$.ownEr'
                        )
                         ;
NULL
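
The second query returns NULL because path keys are case-sensitive. The lookup can be sketched in Python as a walk over ".key" and "[index]" steps. This is a simplified illustration, not Hive's parser: it returns Python values rather than JSON strings, supports only those two step types, and the SAMPLE constant is a compacted copy of the document used above.

```python
import json
import re
from typing import Any, Optional

SAMPLE = (
    '{"store": {"fruit": [{"weight": 8, "type": "apple"},'
    ' {"weight": 9, "type": "pear"}],'
    ' "bicycle": {"price": 19.95, "color": "red"}},'
    ' "email": "amy@only_for_json_udf_test.net", "owner": "amy"}'
)


def get_json_object(json_string: str, path: str) -> Optional[Any]:
    """Follow a '$.a.b[0]' style path; return None on any failure."""
    try:
        node = json.loads(json_string)
    except (TypeError, ValueError):
        return None  # invalid JSON -> NULL
    if not path.startswith("$"):
        return None
    # Each step is either ".key" or "[index]".
    for key, index in re.findall(r"\.([^.\[\]]+)|\[(\d+)\]", path[1:]):
        if key:
            if not isinstance(node, dict) or key not in node:
                return None  # missing key (lookups are case-sensitive)
            node = node[key]
        else:
            if not isinstance(node, list) or int(index) >= len(node):
                return None
            node = node[int(index)]
    return node
```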

(6.6)in_file(string str, string filename):Returns true if the string str appears as an entire line in filename. Return type: boolean; -- not used so far;

(6.7)initcap(string A):Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited by whitespace. (As of Hive 1.1.0.);     Return type: string;

select initcap('iniCapinp');
Inicapinp

(6.8)instr(string str, string substr):Returns the position of the first occurrence of substr in str. Returns null if either of the arguments is null and returns 0 if substr could not be found in str. Be aware that this is not zero based: the first character in str has index 1.
Return type: int;

select instr('ambghabxyabef','ab'); --> 6

(6.9)length(string A):Returns the length of the string, counted in characters; Return type: int;

select length('hello'); --> 5
select length('你好'); --> 2

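(7.0)levenshtein(string A, string B):Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For example, levenshtein('kitten', 'sitting') results in 3. The distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn A into B; it is not a simple position-by-position comparison, as the insertion cases below show.
Return type: int;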

select levenshtein('a','ab'); --> 1 
select levenshtein('ab','a'); --> 1 
select levenshtein('a','abc'); --> 2 
select levenshtein('abc','ad'); --> 2 
select levenshtein('abcg','adef'); --> 3
select levenshtein('abcgh','adehf'); --> 4
select levenshtein('abegh','adehf'); --> 3 
select levenshtein('abeh','adehf'); --> 2
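
The results above can be reproduced with the standard dynamic-programming formulation of edit distance. The sketch below is a generic textbook implementation in Python, offered only to make the counting rule concrete; it is not taken from Hive's source.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]
```

For 'abeh' and 'adehf' the cheapest path is one substitution (b to d) plus one insertion (f), which gives 2.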

(7.1)locate(string substr, string str[, int pos]):Returns the position of the first occurrence of substr in str after position pos; Return type: int;

select locate('bar','foobarbar',5); --> 7

(7.2)lower(string A) lcase(string A): Returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar'.  Return type: string;

(7.3)upper(string A) ucase(string A):Returns the string resulting from converting all characters of A to upper case, e.g. upper('fOoBaR') results in 'FOOBAR';  Return type: string;

(7.4)lpad(string str, int len, string pad):Returns str, left-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null. Return type: string;

select lpad('hellomike',10,'a'); --> ahellomike
select lpad('hellomike',15,'a'); --> aaaaaahellomike
select lpad('hellomike',15,'ab'); --> abababhellomike
select lpad('hellomike',16,'ab'); --> abababahellomike

(7.5)rpad(string str, int len, string pad):Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.  Return type: string;

select rpad('hellomike',16,'ab'); --> hellomikeabababa
select rpad('hellomike',8,'ab'); --> hellomik
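
The padding and truncation rules of lpad() and rpad() can be summarised in a short Python sketch. It follows the description above, not Hive's source; it assumes len is non-negative and, taking the description literally, returns None for an empty pad string in every case.

```python
from typing import Optional


def _fill(pad: str, missing: int) -> str:
    """Repeat pad and cut it to exactly `missing` characters."""
    return (pad * missing)[:missing]


def lpad(s: str, length: int, pad: str) -> Optional[str]:
    if pad == "":
        return None  # empty pad string -> NULL (as documented)
    if len(s) >= length:
        return s[:length]  # too long: keep the first `length` characters
    return _fill(pad, length - len(s)) + s


def rpad(s: str, length: int, pad: str) -> Optional[str]:
    if pad == "":
        return None
    if len(s) >= length:
        return s[:length]
    return s + _fill(pad, length - len(s))
```

Note that truncation always keeps the leftmost characters, for lpad() as well as rpad().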

(7.6)ltrim(string A):Returns the string resulting from trimming spaces from the beginning (left hand side) of A, e.g. ltrim(' foobar ') results in 'foobar '; Return type: string;

select ltrim('   hello  x'); --> hello  x

(7.7)rtrim(string A):Returns the string resulting from trimming spaces from the end (right hand side) of A, e.g. rtrim(' foobar ') results in ' foobar';  Return type: string;

select rtrim('   hello  '); --> '   hello' (the leading spaces are kept)

(7.8)parse_url(string urlString, string partToExtract [, string keyToExtract]):
Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.
E.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'.
The value of a particular key in QUERY can also be extracted by providing the key as the third argument.
E.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.
The third argument is optional and only meaningful when partToExtract is QUERY; without it, QUERY returns the whole query string.

Return type: string;

SELECT parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1');
v1
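
To see which piece of the URL each partToExtract value refers to, the same split can be reproduced with Python's standard urllib.parse. This is only an approximation for illustration: the mapping of part names onto urlsplit() fields is an assumption, and edge cases (for example host name case or repeated query keys) may differ from Hive.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

URL = "http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1"


def parse_url(url: str, part: str, key: Optional[str] = None) -> Optional[str]:
    """Return one component of url, or one query value when key is given."""
    pieces = urlsplit(url)
    if part == "QUERY" and key is not None:
        values = parse_qs(pieces.query).get(key)
        return values[0] if values else None
    file_part = pieces.path + ("?" + pieces.query if pieces.query else "")
    parts = {
        "HOST": pieces.hostname,
        "PATH": pieces.path,
        "QUERY": pieces.query,
        "REF": pieces.fragment,
        "PROTOCOL": pieces.scheme,
        "AUTHORITY": pieces.netloc,
        "FILE": file_part,
        "USERINFO": pieces.netloc.rpartition("@")[0],
    }
    # Missing components are reported as None rather than as empty strings.
    return parts.get(part) or None
```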

(7.9)printf(String format, Obj… args):Returns the input formatted according to printf-style format strings (as of Hive 0.9.0); Return type: string;

SELECT printf("Hello World %d %s", 100, "days") FROM src LIMIT 1;
"Hello World 100 days"

(8.0)regexp_extract(string subject, string pattern, int index):Returns the string extracted using the pattern, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. The 'index' parameter is the Java regex Matcher group() method index. See docs/api/java/util/regex/Matcher.html for more information on the 'index' or Java regex group() method.
Parameters:
subject is the string to be parsed;
pattern is the regular expression;
index selects which part of the match is returned (default 1):
0 returns the text matched by the whole regular expression;
1 returns the text matched by the first () group, and so on.
Note:
index must not exceed the number of () groups in the pattern, otherwise an error is raised; Return type: string;

For example:
select regexp_extract('x=a3&x=18abc&x=2&y=3&x=4','x=([0-9]+)([a-z]+)',0) from default.dual;
Result:
x=18abc
select regexp_extract('x=a3&x=18abc&x=2&y=3&x=4','x=([0-9]+)([a-z]+)',1) from default.dual;
Result:
18
select regexp_extract('x=a3&x=18abc&x=2&y=3&x=4','x=([0-9]+)([a-z]+)',2) from default.dual;
Result:
abc
The pattern has only two () groups, so any index >= 3 fails with an error like the following:
FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments '2': org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer)  on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@2cf5e0f0 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {x=a3&x=18abc&x=2&y=3&x=4:java.lang.String, x=([0-9]+)[a-z]:java.lang.String, 2:java.lang.Integer} of size 3
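
The meaning of index is the same as the group number in most regex engines, so the three results above can be checked with Python's re module. This is an analogy rather than Hive's code: Python and Java regex syntax agree for this pattern but not in general, and returning an empty string when nothing matches is an assumption about Hive's behaviour.

```python
import re


def regexp_extract(subject: str, pattern: str, index: int = 1) -> str:
    """Return group `index` of the first match of pattern in subject."""
    match = re.search(pattern, subject)
    if match is None:
        return ""  # assumption: no match yields an empty string
    # group(0) is the whole match, group(1) the first () group, and so on.
    # An index larger than the number of groups raises IndexError here,
    # which plays the role of the Hive error shown above.
    return match.group(index)
```

The first candidate 'x=a3' is skipped because 'a' is not a digit, so the match starts at 'x=18abc'.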

(8.1)regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT):
Returns the string resulting from replacing all substrings in INITIAL_STRING that match the Java regular expression syntax defined in PATTERN with instances of REPLACEMENT, e.g. regexp_replace("foobar", "oo|ar", "") returns 'fb'. If REPLACEMENT is empty, the matching parts are simply removed. Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc. Return type: string;

select regexp_replace("foobar", "oo|ar", "");
fb

(8.2)repeat(string str, int n): Repeats str n times; Return type: string;

select repeat('hello',3);
hellohellohello

(8.3)replace(string A, string OLD, string NEW): Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0).
Example: select replace("ababab", "abab", "Z"); returns "Zab".
Return type: string;

select replace('Ethan-Avner','Avner','Weng');
Ethan-Weng

(8.4)reverse(string A): Returns the reversed string; Return type: string;

select reverse('abcd');
dcba

(8.5)sentences(string str, string lang, string locale):Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' are optional arguments.
E.g. sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") )
Return type: array<array<string>>;

select sentences('Hello there! How are you?');
[["Hello","there"],["How","are","you"]]

(8.6)soundex(string A): Returns the soundex code of the string (as of Hive 1.2.0).
For example, soundex('Miller') results in M460.
Return type: string;

select soundex('Miller');
M460

(8.7)space(int n): Returns a string of n spaces;  Return type: string;

select space(2); --> '  ' (two spaces);

(8.8)split(string str, string pat): Splits str around pat (pat is a regular expression) and returns the pieces as an array of strings;
Return type: array<string>;

SELECT split('oneAtwoBthreeC', '[ABC]') FROM src LIMIT 1;
["one", "two", "three"]

(8.9)str_to_map(text[, delimiter1, delimiter2]): Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2.
Return type: map<string,string>;

select str_to_map('a=1,b=2,c=3,d=4',',','=');
{"a":"1","b":"2","c":"3","d":"4"}
select str_to_map('a1b2c3d4',',','=');
{"a1b2c3d4":null}

Example 1: hive> select str_to_map('aaa:11&bbb:22', '&', ':')
-- {"bbb":"22","aaa":"11"}

Example 2: hive> select str_to_map('aaa:11&bbb:22', '&', ':')['aaa']
-- 11
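
The two-level split performed by str_to_map() can be sketched in Python as follows. This is a simplified model under stated assumptions: the delimiters are treated as literal strings (no regular-expression handling), and a pair without delimiter2 maps its whole text to None, as in the {"a1b2c3d4":null} example above.

```python
from typing import Dict, Optional


def str_to_map(text: str, delimiter1: str = ",", delimiter2: str = "=") -> Dict[str, Optional[str]]:
    """Split text into pairs on delimiter1, then each pair on the first delimiter2."""
    result: Dict[str, Optional[str]] = {}
    for pair in text.split(delimiter1):
        key, separator, value = pair.partition(delimiter2)
        # No delimiter2 in the pair: the key has no value.
        result[key] = value if separator else None
    return result
```

Indexing the result, as in Example 2, is then an ordinary map lookup.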

(9.0)substr(string|binary A, int start) substring(string|binary A, int start):
Returns the substring or slice of the byte array of A starting from start position till the end of string A, e.g. substr('foobar', 4) results in 'bar';
Return type: string;

select substr('abcd',2);
bcd

(9.1)substr(string|binary A, int start, int len) substring(string|binary A, int start, int len):
Returns the substring or slice of the byte array of A starting from start position with length len,
e.g. substr('foobar', 4, 1) results in 'b';
Return type: string;

select substr('abcd',1,3);
abc

(9.2)translate(string|char|varchar input, string|char|varchar from, string|char|varchar to):
Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well (available as of Hive 0.10.0; char/varchar support added as of Hive 0.14.0.). The mapping is character by character, e.g. translate("MOBIN","BIN","M")="MOM": B becomes M, while I and N have no counterpart in the to string and are dropped;
Return type: string;

select translate('abcdef', 'ada', '192') ;
1bc9ef
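
The result 1bc9ef shows two details: a character repeated in the from string keeps its first mapping (a becomes 1, not 2), and unmapped characters pass through unchanged. A small Python sketch of this per-character mapping follows; it models the behaviour seen in the examples and is not Hive's implementation.

```python
def translate(text: str, from_chars: str, to_chars: str) -> str:
    """Map each character of from_chars to the character at the same position in to_chars."""
    mapping = {}
    for position, char in enumerate(from_chars):
        if char in mapping:
            continue  # the first occurrence in from_chars wins
        # Characters beyond the end of to_chars are marked for deletion.
        mapping[char] = to_chars[position] if position < len(to_chars) else None
    output = []
    for char in text:
        if char not in mapping:
            output.append(char)  # untouched character
        elif mapping[char] is not None:
            output.append(mapping[char])  # replaced character
        # else: the character is deleted
    return "".join(output)
```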

(6)Collection Functions;

(6.1)size(Map):Returns the number of elements in the map type; Return type: int;

select size(map('100','tom','101','mary'));
2

size(Array):Returns the number of elements in the array type; Return type: int;

select size(array('100','101','102','103'));
4

(6.2)map_keys(Map): Returns an unordered array containing the keys of the input map;
Return type: array;

select map_keys(map('100','tom','101','mary'));
["100","101"]

map_values(Map): Returns an unordered array containing the values of the input map;
Return type: array;

select map_values(map('100','tom','101','mary'));
["tom","mary"]

(6.3)array_contains(Array, value): Returns TRUE if the array contains value, otherwise FALSE; Return type: boolean;

select array_contains(array(1,2,3),2);
true
select array_contains(array('100','101','102','103'),'101');
true

(6.4)sort_array(Array): Sorts the input array in ascending order according to the natural ordering of the array elements and returns it (as of version 0.9.0);
Return type: array;

select sort_array(array('100','104','102','103'));
["100","102","103","104"]

(7)Built-in Aggregate Functions (UDAF);

(7.1)collect_set(col):Returns a set of objects with duplicate elements eliminated;
     collect_list(col):Returns a list of objects with duplicates. (As of Hive 0.13.0.)
-- see the beginning of this article;
(7.2)ntile(INTEGER x): Divides an ordered partition into x groups called buckets and assigns a bucket number to each row in the partition. This allows easy calculation of tertiles, quartiles, deciles, percentiles and other common summary statistics. (As of Hive 0.11.0.); --> it is used as a windowing function, e.g. ntile(4) OVER (ORDER BY col);
Return type: integer;
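
How rows end up in buckets is easiest to see on a small case. The Python sketch below assumes the usual SQL NTILE convention (bucket sizes differ by at most one, and the earlier buckets receive the extra rows); that convention is an assumption here, not something confirmed against Hive's source.

```python
from typing import List


def ntile(row_count: int, buckets: int) -> List[int]:
    """Return the bucket number (1-based) for each of row_count ordered rows."""
    base, extra = divmod(row_count, buckets)
    assignments: List[int] = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets hold one more row than the rest.
        size = base + (1 if bucket <= extra else 0)
        assignments.extend([bucket] * size)
    return assignments
```

For example, ten ordered rows split into three buckets receive the numbers 1,1,1,1,2,2,2,3,3,3 under this convention.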

(7.3)histogram_numeric(col, b):Computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x,y) coordinates that represent the bin centers and heights; Return type: array<struct {'x','y'}>;

SELECT histogram_numeric(val, 3) FROM src;
[{"x":100,"y":14.0},{"x":200,"y":22.0},{"x":290.5,"y":11.0}]

(8)Built-in Table-Generating Functions (UDTF);

Normal user-defined functions, such as concat(), take in a single input row and output a single output row. In contrast, table-generating functions transform a single input row to multiple output rows. 
Using the syntax "SELECT udtf(col) AS colAlias..." has a few limitations:
1. No other expressions are allowed in SELECT
   SELECT pageid, explode(adid_list) AS myCol... is not supported
2. UDTF's can't be nested
   SELECT explode(explode(adid_list)) AS myCol... is not supported
3. GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY is not supported
   SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported

(8.1)inline(ARRAY<STRUCT[,STRUCT]>): Explodes an array of structs to multiple rows. Returns a row-set with N columns (N = number of top level elements in the struct), one row per struct from the array. (As of Hive 0.10.);
Return type: a row set with N columns;

select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02')));
A	10	2015-01-01
B	20	2016-02-02
select inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) as (col1,col2,col3);
col1  col2  col3
A	  10	2015-01-01
B	  20	2016-02-02
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf;
A	10	2015-01-01
B	20	2016-02-02
select tf.* from (select 0) t lateral view inline(array(struct('A',10,date '2015-01-01'),struct('B',20,date '2016-02-02'))) tf as col1,col2,col3;
col1  col2  col3
A	  10	2015-01-01
B	  20	2016-02-02

(8.2)explode(ARRAY<T> a): explode() takes in an array as an input and outputs the elements of the array as separate rows. UDTF's can be used in the SELECT expression list and as a part of LATERAL VIEW. Return type: a row set with one column of type T; see the beginning of this article;
explode(ARRAY): Returns one row for each element from the array;
explode(MAP): Returns one row for each key-value pair from the input map with two columns in each row: one for the key and another for the value. (As of Hive 0.8.0.).

(8.3)posexplode(ARRAY<T> a):Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array. It behaves like explode() but also returns the position of each element;

select posexplode(array('A','B','C'));
0	A
1	B
2	C
select posexplode(array('A','B','C')) as (pos,val);
pos val
0	A
1	B
2	C
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
0	A
1	B
2	C
select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;
pos val
0	A
1	B
2	C

(8.4)stack(int r,T1 V1,...,Tn/r Vn): Breaks up n values V1,...,Vn into r rows. Each row will have n/r columns. r must be constant;

select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01');
A	10	2015-01-01
B	20	2016-01-01
select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (col0,col1,col2);
col0  col1  col2
A	  10	2015-01-01
B	  20	2016-01-01
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf;
A	10	2015-01-01
B	20	2016-01-01
select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as col0,col1,col2;
col0  col1  col2
A	  10	2015-01-01
B	  20	2016-01-01
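
The reshaping done by stack() amounts to cutting the flat value list into r equal slices. A minimal Python sketch, assuming the number of values is an exact multiple of r (how Hive handles other cases is not covered here):

```python
from typing import Any, List, Tuple


def stack(r: int, *values: Any) -> List[Tuple[Any, ...]]:
    """Break n values into r rows of n/r columns each."""
    if r <= 0 or len(values) % r != 0:
        raise ValueError("this sketch requires len(values) to be a multiple of r")
    width = len(values) // r  # number of columns per row
    return [tuple(values[i * width:(i + 1) * width]) for i in range(r)]
```

With r = 2 and six values, the first three values form row one and the last three form row two, as in the output above.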

(8.5)json_tuple(string jsonStr,string k1,...,string kn): Takes JSON string and a set of n keys, and returns a tuple of n values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.

The json_tuple() UDTF was introduced in Hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, the query is more efficient if it is parsed only once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you need to use the LATERAL VIEW syntax.

select a.timestamp,
       get_json_object(a.appevents, '$.eventid'),
       get_json_object(a.appevents, '$.eventname')
from log a;
can be rewritten as:
select a.timestamp,
       b.*
from log a
lateral view json_tuple(a.appevents, 'eventid', 'eventname') b as f1, f2;
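
The efficiency argument is that the JSON text is parsed once and then queried for several keys. The Python sketch below illustrates that idea only; it handles top-level keys, returns Python values rather than strings, and the sample event in the usage note is made up for illustration.

```python
import json
from typing import Any, Optional, Tuple


def json_tuple(json_str: str, *keys: str) -> Tuple[Optional[Any], ...]:
    """Parse json_str once and return the values of the requested top-level keys."""
    try:
        parsed = json.loads(json_str)  # the single parse shared by all keys
    except (TypeError, ValueError):
        return tuple(None for _ in keys)
    if not isinstance(parsed, dict):
        return tuple(None for _ in keys)
    return tuple(parsed.get(key) for key in keys)
```

Calling json_tuple('{"eventid": "e1", "eventname": "click"}', 'eventid', 'eventname') yields both values from one parse, whereas two get_json_object() calls would parse the string twice.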

(8.6)parse_url_tuple(string urlStr,string p1,...,string pn):Takes URL string and a set of n URL parts, and returns a tuple of n values. This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.

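The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the partToExtract argument, e.g. parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2') returns a tuple with the values v1 and v2. This is more efficient than calling parse_url() multiple times. All the input parameters and output column types are string.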

SELECT b.*
FROM src 
LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;

(9)Functions for Text Analytics

(9.1)context_ngrams(array<array<string>>, array<string>, int K, int pf):Returns the top-k contextual N-grams from a set of tokenized sentences, given a string of "context". See StatisticsAndDataMining
(https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining) for more information.
context_ngrams() takes a context array in which NULL marks the positions to fill, and returns the K most frequent word sequences that fit that pattern in the tokenized sentences.
Return type: array<struct<string,double>>;

SELECT context_ngrams(sentences(lower(tweet)), array("i","love",null), 100, 1000) FROM twitter;

(9.2)ngrams(array<array<string>>, int N, int K, int pf): Returns the top-k N-grams from a set of tokenized sentences, such as those returned by the sentences() UDF. See StatisticsAndDataMining
(https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining) for more information.
It returns the K most frequent N-word sequences found in the tokenized sentences, where N is the length of each sequence;
Return type: array<struct<string,double>>;

SELECT ngrams(sentences(lower(tweet)), 2, 100, 1000) FROM twitter;
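
What "top-k N-grams" means can be shown with an exact count over a few tokenized sentences. This Python sketch is only conceptual: it counts exactly and ignores the pf (precision factor) parameter, and the SENTENCES sample is invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

SENTENCES = [
    ["i", "love", "hive"],
    ["i", "love", "sql"],
    ["we", "love", "hive"],
]


def ngrams(sentences: List[List[str]], n: int, k: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Return the k most frequent runs of n consecutive words with their counts."""
    counts: Counter = Counter()
    for words in sentences:
        # Slide a window of n words over each sentence separately,
        # so that N-grams never span a sentence boundary.
        for start in range(len(words) - n + 1):
            counts[tuple(words[start:start + n])] += 1
    return counts.most_common(k)
```

With n = 2 and k = 2 on the sample, the bigrams ("i", "love") and ("love", "hive") come out on top with two occurrences each.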