Hive Functions: Introduction and Built-in Functions

Date: 2019-12-01


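1. Introduction to Hive functions and how to inspect the built-in functions

There are too many functions to cover in full; see the official Hive documentation: cwiki.apache.org/confluence/…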

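1) List the built-in functions
hive> show functions;
2) Show the usage of a built-in function
hive> desc function upper;
3) Show the detailed usage of a built-in function
hive> desc function extended upper;

2. Commonly used functions

Relational operators

1. Equality comparison: =
Syntax: A = B
Operand types: all primitive types
Description: TRUE if expression A equals expression B; otherwise FALSE.
hive> select 1 from tableName where 1=1;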

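2. Inequality comparison: <>
Syntax: A <> B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is not equal to B; otherwise FALSE.
hive> select 1 from tableName where 1 <> 2;

3. Less than: <
Syntax: A < B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than B; otherwise FALSE.
hive> select 1 from tableName where 1 < 2;

4. Less than or equal: <=
Syntax: A <= B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than or equal to B; otherwise FALSE.
hive> select 1 from tableName where 1 <= 1;

5. Greater than: >
Syntax: A > B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than B; otherwise FALSE.
hive> select 1 from tableName where 2 > 1;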

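6. Greater than or equal: >=
Syntax: A >= B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than or equal to B; otherwise FALSE.
hive> select 1 from tableName where 1 >= 1;
1

Note: take care when comparing strings, which are compared character by character. For the common case of comparing times, apply to_date first and compare afterwards.
hive> select * from tableName;
OK
2011111209 00:00:00    2011111209
hive> select a, b, a<b, a>b, a=b from tableName;
2011111209 00:00:00    2011111209    false    true    false

7. NULL check: IS NULL
Syntax: A IS NULL
Operand types: all types
Description: TRUE if the value of expression A is NULL; otherwise FALSE.
hive> select 1 from tableName where null is null;

8. Non-NULL check: IS NOT NULL
Syntax: A IS NOT NULL
Operand types: all types
Description: FALSE if the value of expression A is NULL; otherwise TRUE.
hive> select 1 from tableName where 1 is not null;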
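The NULL rules described above can be modelled outside Hive. The Python sketch below is only an illustration, not Hive code: it assumes that None stands in for NULL, and the helper names sql_eq and where are made up for this example. It shows why a comparison against NULL never selects a row, while IS NULL and IS NOT NULL do.

```python
from typing import Optional

def sql_eq(a, b) -> Optional[bool]:
    """Model of A = B: NULL (None) if either side is NULL, else a plain comparison."""
    if a is None or b is None:
        return None
    return a == b

def where(rows, predicate):
    """Model of a WHERE clause: a row is kept only when the predicate is exactly TRUE."""
    return [r for r in rows if predicate(r) is True]

rows = [1, None, 2]
matched_eq_null = where(rows, lambda v: sql_eq(v, None))  # v = NULL: never TRUE
matched_is_null = where(rows, lambda v: v is None)        # v IS NULL
matched_not_null = where(rows, lambda v: v is not None)   # v IS NOT NULL
print(matched_eq_null, matched_is_null, matched_not_null)
```

The design point is that a NULL predicate result is treated like FALSE by the filter, so NULL checks need the dedicated IS NULL / IS NOT NULL operators.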

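9. LIKE comparison: LIKE
Syntax: A LIKE B
Operand types: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the simple SQL pattern B; otherwise FALSE. In B, the character "_" stands for any single character and the character "%" stands for any number of characters. (LIKE patterns are not regular expressions; for those, see RLIKE.)
hive> select 1 from tableName where 'football' like 'foot%';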

hive> select 1 from tableName where 'football' like 'foot____';

Note: to negate the comparison, use NOT A LIKE B
hive> select 1 from tableName where NOT 'football' like 'fff%';
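To make the two LIKE wildcards concrete, here is a small Python sketch that translates a LIKE pattern into a regular expression. It is a simplified model rather than Hive's implementation: the function names are invented for this example, NULL handling is left out, and escape characters inside the pattern are not supported.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a LIKE pattern into a regular expression body."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # any number of characters
        elif ch == "_":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)

def like(value: str, pattern: str) -> bool:
    """Model of A LIKE B for non-NULL strings: the whole string must match."""
    return re.fullmatch(like_to_regex(pattern), value, flags=re.DOTALL) is not None

print(like("football", "foot%"), like("football", "foot____"), like("football", "fff%"))
```

Note that the whole string has to match the pattern, which is why 'foot' alone would not match 'football' without a trailing wildcard.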

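10. Java regular-expression match: RLIKE
Syntax: A RLIKE B
Operand types: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the Java regular expression B; otherwise FALSE.
hive> select 1 from tableName where 'footbar' rlike '^f.*r$';
1
Note: to check whether a string consists entirely of digits:
hive> select 1 from tableName where '123456' rlike '^\\d+$';
1
Without the trailing $, a string that merely starts with digits also matches:
hive> select 1 from tableName where '123456aa' rlike '^\\d+';
1

Arithmetic operators

1. Addition: +
Syntax: A + B
Operand types: all numeric types
Description: returns the sum of A and B. The result type is the smallest common parent type of the types of A and B (see the type hierarchy of the data types). For example, int + int normally yields int, while int + double normally yields double.
hive> select 1 + 9 from tableName;
10
hive> create table tableName as select 1 + 1.2 from tableName;
hive> describe tableName;
_c0 double

2. Subtraction: -
Syntax: A - B
Operand types: all numeric types
Description: returns A minus B. The result type follows the same rule as addition. For example, int - int normally yields int, while int - double normally yields double.
hive> select 10 - 5 from tableName;
5
hive> create table tableName as select 5.6 - 4 from tableName;
hive> describe tableName;
_c0 double

3. Multiplication: *
Syntax: A * B
Operand types: all numeric types
Description: returns the product of A and B. The result type follows the same rule as addition. Note that if the product exceeds the range of the default result type, use cast to convert to a numeric type with a larger range.
hive> select 40 * 5 from tableName;
200

4. Division: /
Syntax: A / B
Operand types: all numeric types
Description: returns A divided by B. The result type is double.
hive> select 40 / 5 from tableName;
8.0
Note: a double carries only about 16 significant decimal digits, so take special care with division:
hive> select ceil(28.0/6.999999999999999999999) from tableName limit 1;
Result: 4
hive> select ceil(28.0/6.99999999999999) from tableName limit 1;
Result: 5

5. Modulo: %
Syntax: A % B
Operand types: all numeric types
Description: returns the remainder of A divided by B. The result type follows the same rule as addition.
hive> select 41 % 5 from tableName;
1
hive> select 8.4 % 4 from tableName;
0.40000000000000036
Note: floating-point precision is a recurring problem in Hive; for operations like this it is best to fix the precision with round.
hive> select round(8.4 % 4 , 2) from tableName;
0.4

6. Bitwise AND: &
Syntax: A & B
Operand types: all numeric types
Description: returns the bitwise AND of A and B. The result type follows the same rule as addition.
hive> select 4 & 8 from tableName;
0
hive> select 6 & 4 from tableName;
4

7. Bitwise OR: |
Syntax: A | B
Operand types: all numeric types
Description: returns the bitwise OR of A and B. The result type follows the same rule as addition.
hive> select 4 | 8 from tableName;
12
hive> select 6 | 8 from tableName;
14

8. Bitwise XOR: ^
Syntax: A ^ B
Operand types: all numeric types
Description: returns the bitwise XOR of A and B. The result type follows the same rule as addition.
hive> select 4 ^ 8 from tableName;
12
hive> select 6 ^ 4 from tableName;
2

9. Bitwise NOT: ~
Syntax: ~A
Operand types: all numeric types
Description: returns the bitwise complement of A. The result type is the type of A.
hive> select ~6 from tableName;
-7
hive> select ~4 from tableName;
-5

Logical operators

1. Logical AND: AND
Syntax: A AND B
Operand types: boolean
Description: TRUE if both A and B are TRUE; FALSE if either is FALSE; NULL in the remaining cases where A or B is NULL.
hive> select 1 from tableName where 1=1 and 2=2;
1

2. Logical OR: OR
Syntax: A OR B
Operand types: boolean
Description: TRUE if A is TRUE, or B is TRUE, or both are TRUE; otherwise FALSE.
hive> select 1 from tableName where 1=2 or 2=2;
1

3. Logical NOT: NOT
Syntax: NOT A
Operand types: boolean
Description: TRUE if A is FALSE; NULL if A is NULL; otherwise FALSE.
hive> select 1 from tableName where not 1=2;
1

Numeric functions

1. Rounding function: round *
Syntax: round(double a)
Return type: BIGINT
Description: returns the integer part of the double, rounded half up.
hive> select round(3.1415926) from tableName;
3
hive> select round(3.5) from tableName;
4
hive> create table tableName as select round(9542.158) from tableName;
hive> describe tableName;
_c0 bigint

2. Rounding to a given precision: round *
Syntax: round(double a, int d)
Return type: DOUBLE
Description: returns the double rounded to d decimal places.
hive> select round(3.1415926,4) from tableName;
3.1416

3. Round down: floor *
Syntax: floor(double a)
Return type: BIGINT
Description: returns the largest integer less than or equal to the double.
hive> select floor(3.1415926) from tableName;
3
hive> select floor(25) from tableName;
25

4. Round up: ceil *
Syntax: ceil(double a)
Return type: BIGINT
Description: returns the smallest integer greater than or equal to the double.
hive> select ceil(3.1415926) from tableName;
4
hive> select ceil(46) from tableName;
46

5. Round up: ceiling *
Syntax: ceiling(double a)
Return type: BIGINT
Description: same as ceil.
hive> select ceiling(3.1415926) from tableName;
4
hive> select ceiling(46) from tableName;
46

6. Random number: rand *
Syntax: rand(), rand(int seed)
Return type: double
Description: returns a random number between 0 and 1. If a seed is given, the sequence of random numbers is reproducible.
hive> select rand() from tableName;
0.5577432776034763
hive> select rand() from tableName;
0.6638336467363424
hive> select rand(100) from tableName;
0.7220096548596434
hive> select rand(100) from tableName;
0.7220096548596434

7. Natural exponential: exp
Syntax: exp(double a)
Return type: double
Description: returns e (the base of the natural logarithm) raised to the power a.
hive> select exp(2) from tableName;
7.38905609893065

Natural logarithm: ln
Syntax: ln(double a)
Return type: double
Description: returns the natural logarithm of a.
hive> select ln(7.38905609893065) from tableName;
2.0

8. Base-10 logarithm: log10
Syntax: log10(double a)
Return type: double
Description: returns the base-10 logarithm of a.
hive> select log10(100) from tableName;
2.0

9. Base-2 logarithm: log2
Syntax: log2(double a)
Return type: double
Description: returns the base-2 logarithm of a.
hive> select log2(8) from tableName;
3.0

10. Logarithm: log
Syntax: log(double base, double a)
Return type: double
Description: returns the logarithm of a to the given base.
hive> select log(4,256) from tableName;
4.0

11. Power: pow
Syntax: pow(double a, double p)
Return type: double
Description: returns a raised to the power p.
hive> select pow(2,4) from tableName;
16.0

12. Power: power
Syntax: power(double a, double p)
Return type: double
Description: returns a raised to the power p; same as pow.
hive> select power(2,4) from tableName;
16.0

13. Square root: sqrt
Syntax: sqrt(double a)
Return type: double
Description: returns the square root of a.
hive> select sqrt(16) from tableName;
4.0

14. Binary representation: bin
Syntax: bin(BIGINT a)
Return type: string
Description: returns the binary representation of a.
hive> select bin(7) from tableName;
111

15. Hexadecimal representation: hex
Syntax: hex(BIGINT a)
Return type: string
Description: if the argument is an int, returns its hexadecimal representation; if the argument is a string, returns the hexadecimal representation of that string.
hive> select hex(17) from tableName;
11
hive> select hex('abc') from tableName;
616263

16. Reverse of hex: unhex
Syntax: unhex(string a)
Return type: string
Description: returns the string represented by the hexadecimal string.
hive> select unhex('616263') from tableName;
abc
hive> select unhex('11') from tableName;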

hive> select unhex(616263) from tableName;
abc

17. Base conversion: conv
Syntax: conv(BIGINT num, int from_base, int to_base)
Return type: string
Description: converts the number num from base from_base to base to_base.
hive> select conv(17,10,16) from tableName;
11
hive> select conv(17,10,2) from tableName;
10001

18. Absolute value: abs
Syntax: abs(double a), abs(int a)
Return type: double or int
Description: returns the absolute value of a.
hive> select abs(-3.9) from tableName;
3.9
hive> select abs(10.9) from tableName;
10.9

19. Positive modulo: pmod
Syntax: pmod(int a, int b), pmod(double a, double b)
Return type: int or double
Description: returns the positive remainder of a divided by b.
hive> select pmod(9,4) from tableName;
1
hive> select pmod(-9,4) from tableName;
3

20. Sine: sin
Syntax: sin(double a)
Return type: double
Description: returns the sine of a.
hive> select sin(0.8) from tableName;
0.7173560908995228

21. Arcsine: asin
Syntax: asin(double a)
Return type: double
Description: returns the arcsine of a.
hive> select asin(0.7173560908995228) from tableName;
0.8

22. Cosine: cos
Syntax: cos(double a)
Return type: double
Description: returns the cosine of a.
hive> select cos(0.9) from tableName;
0.6216099682706644

23. Arccosine: acos
Syntax: acos(double a)
Return type: double
Description: returns the arccosine of a.
hive> select acos(0.6216099682706644) from tableName;
0.9

24. positive function: positive
Syntax: positive(int a), positive(double a)
Return type: int or double
Description: returns a unchanged.
hive> select positive(-10) from tableName;
-10
hive> select positive(12) from tableName;
12

25. negative function: negative
Syntax: negative(int a), negative(double a)
Return type: int or double
Description: returns -a.
hive> select negative(-5) from tableName;
5
hive> select negative(8) from tableName;
-8

Date functions

1. UNIX timestamp to date: from_unixtime ***
Syntax: from_unixtime(bigint unixtime[, string format])
Return type: string
Description: converts a UNIX timestamp (the number of seconds since 1970-01-01 00:00:00 UTC) into a formatted time in the current time zone.
hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208

2. Current UNIX timestamp: unix_timestamp ***
Syntax: unix_timestamp()
Return type: bigint
Description: returns the current UNIX timestamp.
hive> select unix_timestamp() from tableName;
1323309615

3. Date to UNIX timestamp: unix_timestamp ***
Syntax: unix_timestamp(string date)
Return type: bigint
Description: converts a date in the format "yyyy-MM-dd HH:mm:ss" into a UNIX timestamp. Returns 0 if the conversion fails.
hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063
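The two timestamp conversions above depend on the time zone of the session. The Python sketch below reproduces the sample values under the assumption that the session runs in UTC+8; that offset is inferred from the outputs shown and is not stated in the text. The function names mirror the Hive ones only for readability, and the format strings use Python's strftime syntax instead of Java's date patterns.

```python
from datetime import datetime, timedelta, timezone

UTC8 = timezone(timedelta(hours=8))  # assumed session time zone

def from_unixtime(ts: int, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Seconds since 1970-01-01 00:00:00 UTC -> formatted local time."""
    return datetime.fromtimestamp(ts, tz=UTC8).strftime(fmt)

def unix_timestamp(text: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Formatted local time -> seconds since 1970-01-01 00:00:00 UTC."""
    local = datetime.strptime(text, fmt).replace(tzinfo=UTC8)
    return int(local.timestamp())

print(from_unixtime(1323308943, "%Y%m%d"))
print(unix_timestamp("2011-12-07 13:01:03"))
```

With a different offset the same timestamp can land on a different calendar day, which is worth keeping in mind when comparing results between clusters.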
4. Date in a given format to UNIX timestamp: unix_timestamp ***
Syntax: unix_timestamp(string date, string pattern)
Return type: bigint
Description: converts a date in the given pattern into a UNIX timestamp. Returns 0 if the conversion fails.
hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063

5. Date-time to date: to_date ***
Syntax: to_date(string timestamp)
Return type: string
Description: returns the date part of a date-time value.
hive> select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08

6. Date to year: year ***
Syntax: year(string date)
Return type: int
Description: returns the year of the date.
hive> select year('2011-12-08 10:03:01') from tableName;
2011
hive> select year('2012-12-08') from tableName;
2012

7. Date to month: month ***
Syntax: month(string date)
Return type: int
Description: returns the month of the date.
hive> select month('2011-12-08 10:03:01') from tableName;
12
hive> select month('2011-08-08') from tableName;
8

8. Date to day: day ****
Syntax: day(string date)
Return type: int
Description: returns the day of the month of the date.
hive> select day('2011-12-08 10:03:01') from tableName;
8
hive> select day('2011-12-24') from tableName;
24

9. Date to hour: hour ***
Syntax: hour(string date)
Return type: int
Description: returns the hour of the date.
hive> select hour('2011-12-08 10:03:01') from tableName;
10

10. Date to minute: minute
Syntax: minute(string date)
Return type: int
Description: returns the minute of the date.
hive> select minute('2011-12-08 10:03:01') from tableName;
3

11. Date to second: second
Syntax: second(string date)
Return type: int
Description: returns the second of the date.
hive> select second('2011-12-08 10:03:01') from tableName;
1

12. Date to week: weekofyear
Syntax: weekofyear(string date)
Return type: int
Description: returns the week of the year in which the date falls.
hive> select weekofyear('2011-12-08 10:03:01') from tableName;
49

13. Date difference: datediff ***
Syntax: datediff(string enddate, string startdate)
Return type: int
Description: returns the number of days from the start date to the end date.
hive> select datediff('2012-12-08','2012-05-09') from tableName;
213

14. Date addition: date_add ***
Syntax: date_add(string startdate, int days)
Return type: string
Description: returns the date that is days days after startdate.
hive> select date_add('2012-12-08',10) from tableName;
2012-12-18

15. Date subtraction: date_sub ***
Syntax: date_sub(string startdate, int days)
Return type: string
Description: returns the date that is days days before startdate.
hive> select date_sub('2012-12-08',10) from tableName;
2012-11-28

Conditional functions

1. If function: if ***
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Return type: T
Description: returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull.
hive> select if(1=2,100,200) from tableName;
200
hive> select if(1=1,100,200) from tableName;
100

2. First non-NULL value: COALESCE
Syntax: COALESCE(T v1, T v2, …)
Return type: T
Description: returns the first non-NULL argument; if all arguments are NULL, returns NULL.
hive> select COALESCE(null,'100','50') from tableName;
100

3. Conditional expression: CASE ***
Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Return type: T
Description: if a equals b, returns c; if a equals d, returns e; otherwise returns f.
hive> select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
hive> select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

4. Conditional expression: CASE ****
Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Return type: T
Description: if a is TRUE, returns b; if c is TRUE, returns d; otherwise returns e.
hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

String functions

1. String length: length
Syntax: length(string A)
Return type: int
Description: returns the length of string A.
hive> select length('abcedfg') from tableName;
7

2. String reversal: reverse
Syntax: reverse(string A)
Return type: string
Description: returns string A reversed.
hive> select reverse('abcedfg') from tableName;
gfdecba

3. String concatenation: concat ***
Syntax: concat(string A, string B…)
Return type: string
Description: returns the input strings joined together; any number of input strings is supported.
hive> select concat('abc','def','gh') from tableName;
abcdefgh

4. String concatenation with separator: concat_ws ***
Syntax: concat_ws(string SEP, string A, string B…)
Return type: string
Description: returns the input strings joined together, with SEP as the separator between them.
hive> select concat_ws(',','abc','def','gh') from tableName;
abc,def,gh

5. Substring: substr, substring ****
Syntax: substr(string A, int start), substring(string A, int start)
Return type: string
Description: returns the part of string A from position start to the end.
hive> select substr('abcde',3) from tableName;
cde
hive> select substring('abcde',3) from tableName;
cde
hive> select substr('abcde',-1) from tableName; (same as ORACLE)
e

6. Substring with length: substr, substring ****
Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
Return type: string
Description: returns the part of string A that starts at position start and has length len.
hive> select substr('abcde',3,2) from tableName;
cd
hive> select substring('abcde',3,2) from tableName;
cd
hive> select substring('abcde',-2,2) from tableName;
de

7. Upper case: upper, ucase ****
Syntax: upper(string A), ucase(string A)
Return type: string
Description: returns string A in upper case.
hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED

8. Lower case: lower, lcase ***
Syntax: lower(string A), lcase(string A)
Return type: string
Description: returns string A in lower case.
hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed

9. Trim spaces: trim ***
Syntax: trim(string A)
Return type: string
Description: removes the spaces on both sides of the string.
hive> select trim(' abc ') from tableName;
abc

10. Trim spaces on the left: ltrim
Syntax: ltrim(string A)
Return type: string
Description: removes the spaces on the left of the string.
hive> select ltrim(' abc ') from tableName;
abc

11. Trim spaces on the right: rtrim
Syntax: rtrim(string A)
Return type: string
Description: removes the spaces on the right of the string.
hive> select rtrim(' abc ') from tableName;
abc

12. Regular-expression replace: regexp_replace
Syntax: regexp_replace(string A, string B, string C)
Return type: string
Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases. This is similar to regexp_replace in Oracle.
hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;
fb

13. Regular-expression extract: regexp_extract
Syntax: regexp_extract(string subject, string pattern, int index)
Return type: string
Description: matches the string subject against the regular expression pattern and returns the capture group selected by index (0 is the whole match).
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;
the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;
bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;
foothebar
Note: escape characters are needed in some cases. In the query below the equals sign is escaped with a double backslash, following the rules of Java regular expressions.
select data_field,
  regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
  regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
  regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc
from pt_nginx_loginlog_st where pt = '2012-03-26' limit 2;

14. URL parsing: parse_url ****
Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
Return type: string
Description: returns the requested part of the URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.
hive> select parse_url('www.tableName.com/path1/p.php…', 'HOST') from tableName;
www.tableName.com
hive> select parse_url('www.tableName.com/path1/p.php…', 'QUERY', 'k1') from tableName;
v1

15. JSON parsing: get_json_object ****
Syntax: get_json_object(string json_string, string path)
Return type: string
Description: parses the JSON string json_string and returns the content selected by path. Returns NULL if the input JSON string is invalid.
hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy
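What a path such as $.owner selects can be checked with a few lines of Python. This is a rough model only: the helper get_path is hypothetical, it takes the path as a list of keys and indexes instead of parsing the `$.` syntax, and it returns None where Hive would return NULL.

```python
import json

doc = ('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],'
       '"bicycle":{"price":19.95,"color":"red"}},'
       '"email":"amy@only_for_json_udf_test.net","owner":"amy"}')

def get_path(json_string, keys):
    """Walk a parsed JSON document along keys/indexes; None if invalid or missing."""
    try:
        node = json.loads(json_string)
        for k in keys:
            node = node[k]
        return node
    except (ValueError, KeyError, IndexError, TypeError):
        return None

print(get_path(doc, ["owner"]))                      # like '$.owner'
print(get_path(doc, ["store", "fruit", 0, "type"]))  # like '$.store.fruit[0].type'
print(get_path("not json", ["owner"]))               # invalid input
```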

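16. String of spaces: space
Syntax: space(int n)
Return type: string
Description: returns a string of n spaces.
hive> select space(10) from tableName;

hive> select length(space(10)) from tableName;
10

17. Repeat string: repeat ***
Syntax: repeat(string str, int n)
Return type: string
Description: returns str repeated n times.
hive> select repeat('abc',5) from tableName;
abcabcabcabcabc

18. ASCII code of the first character: ascii
Syntax: ascii(string str)
Return type: int
Description: returns the ASCII code of the first character of str.
hive> select ascii('abcde') from tableName;
97

19. Left pad: lpad
Syntax: lpad(string str, int len, string pad)
Return type: string
Description: pads str on the left with pad up to a length of len.
hive> select lpad('abc',10,'td') from tableName;
tdtdtdtabc
Note: unlike GP and ORACLE, the pad argument cannot be omitted.

20. Right pad: rpad
Syntax: rpad(string str, int len, string pad)
Return type: string
Description: pads str on the right with pad up to a length of len.
hive> select rpad('abc',10,'td') from tableName;
abctdtdtdt

21. Split string: split ****
Syntax: split(string str, string pat)
Return type: array
Description: splits str around pat and returns the pieces as an array of strings.
hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

22. Find in set: find_in_set
Syntax: find_in_set(string str, string strList)
Return type: int
Description: returns the position of the first occurrence of str in strList, where strList is a comma-separated string. Returns 0 if str is not found.
hive> select find_in_set('ab','ef,ab,de') from tableName;
2
hive> select find_in_set('at','ef,ab,de') from tableName;
0

Aggregate functions

1. Count: count ***
Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Return type: bigint
Description: count(*) counts the retrieved rows, including rows with NULL values; count(expr) returns the number of non-NULL values of the given column; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values of the given columns.
hive> select count(*) from tableName;
20
hive> select count(distinct t) from tableName;
10

2. Sum: sum ***
Syntax: sum(col), sum(DISTINCT col)
Return type: double
Description: sum(col) adds up col over the result set; sum(DISTINCT col) adds up the distinct values of col.
hive> select sum(t) from tableName;
100
hive> select sum(distinct t) from tableName;
70

3. Average: avg ***
Syntax: avg(col), avg(DISTINCT col)
Return type: double
Description: avg(col) returns the average of col over the result set; avg(DISTINCT col) returns the average of the distinct values of col.
hive> select avg(t) from tableName;
50
hive> select avg(distinct t) from tableName;
30

4. Minimum: min ***
Syntax: min(col)
Return type: double
Description: returns the minimum of col over the result set.
hive> select min(t) from tableName;
20

5. Maximum: max ***
Syntax: max(col)
Return type: double
Description: returns the maximum of col over the result set.
hive> select max(t) from tableName;
120

6. Population variance: var_pop
Syntax: var_pop(col)
Return type: double
Description: returns the population variance of the non-NULL values of col (NULLs are ignored).

7. Sample variance: var_samp
Syntax: var_samp(col)
Return type: double
Description: returns the sample variance of the non-NULL values of col (NULLs are ignored).

8. Population standard deviation: stddev_pop
Syntax: stddev_pop(col)
Return type: double
Description: returns the population standard deviation, that is, the square root of the population variance; the result equals the square root of VAR_POP.

9. Sample standard deviation: stddev_samp
Syntax: stddev_samp(col)
Return type: double
Description: returns the sample standard deviation.

10. Percentile: percentile
Syntax: percentile(BIGINT col, p)
Return type: double
Description: returns the exact p-th percentile. p must be between 0 and 1. At present col only supports integers, not floating-point types. (p = 0.5 gives the median.)

11. Percentile: percentile
Syntax: percentile(BIGINT col, array(p1 [, p2]…))
Return type: array
Description: same as above, except that several percentiles can be requested at once; the result is an array holding the corresponding percentiles.
select percentile(score, array(0.2,0.4)) from tableName;
returns the values at the 0.2 and 0.4 percentiles.

12. Approximate percentile: percentile_approx
Syntax: percentile_approx(DOUBLE col, p [, B])
Return type: double
Description: returns the approximate p-th percentile. p must be between 0 and 1; the return type is double, and col may be a floating-point column. The parameter B trades memory for accuracy: the larger B is, the more accurate the result. The default is 10,000. When the number of distinct values in col is smaller than B, the result is the exact percentile.

13. Approximate percentile: percentile_approx
Syntax: percentile_approx(DOUBLE col, array(p1 [, p2]…) [, B])
Return type: array
Description: same as above, except that several percentiles can be requested at once; the result is an array holding the corresponding percentiles.

14. Histogram: histogram_numeric
Syntax: histogram_numeric(col, b)
Return type: array<struct {'x','y'}>
Description: computes a histogram of col using b bins.
hive> select histogram_numeric(100,5) from tableName;
[{"x":100.0,"y":1.0}]

Constructors for complex types

1. Map constructor: map ****
Syntax: map(key1, value1, key2, value2, …)
Description: builds a map from the given key/value pairs.
hive> create table mapTable as select map('100','tom','200','mary') as t from tableName;
hive> describe mapTable;
t map<string,string>
hive> select t from mapTable;
{"100":"tom","200":"mary"}

2. Struct constructor: struct
Syntax: struct(val1, val2, val3, …)
Description: builds a struct from the given arguments.
hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;
hive> describe struct_table;
t struct<col1:string,col2:string,col3:string>
hive> select t from struct_table;
{"col1":"tom","col2":"mary","col3":"tim"}

3. Array constructor: array
Syntax: array(val1, val2, …)
Description: builds an array from the given arguments.
hive> create table arr_table as select array("tom","mary","tim") as t from tableName;
hive> describe arr_table;
t array<string>
hive> select t from arr_table;
["tom","mary","tim"]

Accessing complex types ****

1. Array access: A[n]
Syntax: A[n]
Operand types: A is an array, n is an int
Description: returns the n-th element of array A. Array indexes start at 0. For example, if A is an array with the value ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
hive> create table arr_table2 as select array("tom","mary","tim") as t from tableName;
hive> select t[0],t[1],t[2] from arr_table2;
tom mary tim

2. Map access: M[key]
Syntax: M[key]
Operand types: M is a map, key is a key of the map
Description: returns the value stored in map M under the given key. For example, if M is a map with the value {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
hive> create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'],t['100'] from map_table2;
mary tom

3. Struct access: S.x
Syntax: S.x
Operand types: S is a struct
Description: returns the field x of struct S. For example, for the struct foobar {int foo, int bar}, foobar.foo returns the foo field.
hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;
hive> describe str_table2;
t struct<col1:string,col2:string,col3:string>
hive> select t.col1,t.col3 from str_table2;
tom tim

Size functions for complex types ****

1. Map size: size(Map<K.V>)
Syntax: size(Map<K.V>)
Return type: int
Description: returns the number of entries in the map.
hive> select size(t) from map_table2;
2

2. Array size: size(Array)
Syntax: size(Array)
Return type: int
Description: returns the number of elements in the array.
hive> select size(t) from arr_table2;
3

3. Type conversion function: cast ***
Syntax: cast(expr as <type>)
Return type: <type>
Description: returns expr converted to the given data type.
hive> select cast('1' as bigint) from tableName;
1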

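3. lateral view, explode, reflect, and window functions in Hive

1. Using explode to split Map and Array columns of a Hive table

lateral view is used together with UDTFs such as explode (often applied to the output of split). It turns one row into several rows, and the resulting rows can then be aggregated. lateral view first calls the UDTF for each row of the base table; the UDTF splits that row into one or more rows; lateral view then combines the results into a virtual table that can be given a table alias. explode can also be used on its own to split a complex array or map column into multiple rows.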

Requirement: the data has the following format
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4

Fields are separated by \t. The goal is to split all the child values apart into a single column.

The map keys and values should also be split apart, giving the following result
+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+

Step 1: create the Hive database
hive (default)> create database hive_explode;
hive (default)> use hive_explode;

Step 2: create the Hive table that will be split with explode
hive (hive_explode)> create table t3(name string,children array<string>,address Map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' stored as textFile;

Step 3: load the data
On node03, run the following commands to create the data file
mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray
zhangsan child1,child2,child3,child4 k1:v1,k2:v2
lisi child5,child6,child7,child8 k3:v3,k4:v4

Load the data into the Hive table
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;

Step 4: use explode to split the data
Split the array data:
hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;

Split the map data:

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
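As a mental model for the two explode queries above, the following Python sketch does the same thing to the two sample rows. It is not Hive code; the function names explode_array and explode_map are made up for this illustration, and the rows are typed in by hand instead of being loaded from the maparray file.

```python
# (name, children, address) for the two sample rows
rows = [
    ("zhangsan", ["child1", "child2", "child3", "child4"], {"k1": "v1", "k2": "v2"}),
    ("lisi", ["child5", "child6", "child7", "child8"], {"k3": "v3", "k4": "v4"}),
]

def explode_array(rows):
    """One output row per array element, like explode(children)."""
    return [child for _, children, _ in rows for child in children]

def explode_map(rows):
    """One (key, value) output row per map entry, like explode(address)."""
    return [(k, v) for _, _, address in rows for k, v in address.items()]

for child in explode_array(rows):
    print(child)
for key, value in explode_map(rows):
    print(key, value)
```

Note that the name column is gone from both outputs: explode on its own only returns the exploded values, which is the limitation that lateral view addresses.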

2. Using explode to split a JSON string
Requirement: some data has the following format:
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

The separator between fields is |
The goal is to extract all the values of monthSales as a single column:
4900
2090
6987

Step 1: create the Hive table
hive (hive_explode)> create table explode_lateral_view
                   > (area string,
                   > goods_id string,
                   > sale_info string)
                   > ROW FORMAT DELIMITED
                   > FIELDS TERMINATED BY '|'
                   > STORED AS textfile;

Step 2: prepare and load the data
Prepare the data as follows
cd /export/servers/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the Hive table
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;

Step 3: use explode to split the Array

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;

Step 4: use explode to split the Map
hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;

Note: the next two statements create and load the person_info table. They belong to step 5 ("create the Hive table and load the data") of the rows-to-columns example later in this article, not to the JSON example.
Create the Hive table
hive (hive_explode)> create table person_info( name string, constellation string, blood_type string) row format delimited fields terminated by "\t";
Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;

Step 5: split the JSON field
hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as sale_info from explode_lateral_view;

Next we would like to use get_json_object to fetch the value stored under the key monthSales:
hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from explode_lateral_view;

This fails:
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
A UDTF such as explode cannot be nested inside another function.
Selecting a second column next to the UDTF, as in
select explode(split(area,',')) as area,good_id from explode_lateral_view;
also fails:
FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
When a UDTF is used directly, the select list can contain only that one expression. This is where LATERAL VIEW comes in.

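3. Using explode together with LATERAL VIEW
Query several columns with lateral view
hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2;
Here LATERAL VIEW explode(split(goods_id,',')) goods acts as a virtual table that is combined with the original table explode_lateral_view as a Cartesian product.
Several lateral views can also be chained
hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view LATERAL VIEW explode(split(goods_id,',')) goods as goods_id2 LATERAL VIEW explode(split(area,',')) area as area2;
The result is again the Cartesian product, this time of three tables.

Finally, the statement below turns this single row of JSON data into a full two-dimensional table.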
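The row multiplication performed by LATERAL VIEW can be sketched in Python as a nested loop. This is an illustration under simplifying assumptions, not Hive's implementation: the helper lateral_view is hypothetical, it only handles the explode(split(column, ',')) case, and sale_info is replaced by a placeholder because its content does not matter here.

```python
row = {
    "area": "a:shandong,b:beijing,c:hebei",
    "goods_id": "1,2,3,4,5,6,7,8,9",
    "sale_info": "[...]",  # placeholder for the JSON text
}

def lateral_view(rows, column, alias, sep=","):
    """For every input row, emit one copy per element of row[column].split(sep).

    The original columns stay available next to the new alias column.
    """
    out = []
    for r in rows:
        for piece in r[column].split(sep):
            out.append({**r, alias: piece})
    return out

one = lateral_view([row], "goods_id", "goods_id2")  # single lateral view
two = lateral_view(one, "area", "area2")            # chained lateral views
print(len(one), len(two))
```

One source row with 9 goods ids and 3 areas therefore becomes 9 rows after the first lateral view and 27 rows after the second.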

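hive (hive_explode)> select
  get_json_object(concat('{',sale_info_1,'}'),'$.source') as source,
  get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales,
  get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount,
  get_json_object(concat('{',sale_info_1,'}'),'$.score') as score
from explode_lateral_view
LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) sale_info as sale_info_1;

Summary:
Lateral View usually appears together with a UDTF, to work around the restriction that a UDTF cannot be combined with other columns in the select list.
Multiple Lateral Views give a result similar to a Cartesian product.
The OUTER keyword makes rows for which the UDTF produces no output appear with NULL instead of being dropped, so that no data is lost.

4. Rows to columns

1. Functions used
CONCAT(string A/col, string B/col…): returns the input strings joined together; any number of input strings is supported.
CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be any string, just like the remaining arguments. If the separator is NULL, the result is NULL. The function skips any NULL values after the separator argument. The separator is added between the strings being joined.
COLLECT_SET(col): accepts only primitive data types. It deduplicates and collects the values of a column into an array.

2. Data preparation
Table 6-6 Data preparation
name      constellation  blood_type
孫悟空    白羊座         A
老王      射手座         A
宋宋      白羊座         B
豬八戒    白羊座         A
鳳姐      射手座         A

3. Requirement
Group together the people who share the same constellation and blood type. The result is:
射手座,A 老王|鳳姐
白羊座,A 孫悟空|豬八戒
白羊座,B 宋宋

4. Create the local file constellation.txt and import the data
On the node03 server, run the following commands to create the file. Note that the fields are separated by \t.
cd /export/servers/hivedatas
vim constellation.txt

孫悟空 白羊座 A
老王 射手座 A
宋宋 白羊座 B
豬八戒 白羊座 A
鳳姐 射手座 A

6. Query the data as required
hive (hive_explode)> select t1.base, concat_ws('|', collect_set(t1.name)) name from (select name, concat(constellation, "," , blood_type) base from person_info) t1 group by t1.base;

5. Columns to rows

1. Functions used
EXPLODE(col): splits a complex array or map column of a Hive table into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Explanation: used together with a UDTF such as explode (often on the output of split). It splits one column value into multiple rows, and the resulting rows can then be aggregated.

2. Data preparation
cd /export/servers/hivedatas
vim movie.txt
The fields are separated by \t
《疑犯追蹤》 懸疑,動作,科幻,劇情
《Lie to me》 懸疑,警匪,動作,心理,劇情
《戰狼2》 戰爭,動作,災難

3. Requirement
Expand the array of categories of each movie. The result is:
《疑犯追蹤》 懸疑
《疑犯追蹤》 動作
《疑犯追蹤》 科幻
《疑犯追蹤》 劇情
《Lie to me》 懸疑
《Lie to me》 警匪
《Lie to me》 動作
《Lie to me》 心理
《Lie to me》 劇情
《戰狼2》 戰爭
《戰狼2》 動作
《戰狼2》 災難

4. Create the Hive table and import the data
Create the Hive table
create table movie_info( movie string, category array<string>) row format delimited fields terminated by "\t" collection items terminated by ",";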

Load the data
load data local inpath "/export/servers/hivedatas/movie.txt" into table movie_info;

5. Query the data as required
select movie, category_name from movie_info lateral view explode(category) table_tmp as category_name;

6. The reflect function
reflect lets a SQL statement call methods that ship with Java, which covers many cases that would otherwise need a custom UDF.

Use Max from java.lang.Math to find the larger of two columns
Create the Hive table
create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';
Prepare and load the data
cd /export/servers/hivedatas
vim test_udf
1,2
4,3
6,4
7,5
5,6
Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;
Use Max from java.lang.Math to find the larger of the two columns
hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Run a different Java built-in method for each record
Create the Hive table
hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';
Prepare the data
cd /export/servers/hivedatas
vim test_udf2

java.lang.Math,min,1,2
java.lang.Math,max,2,3

Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;

Run the query
hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;
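The idea behind reflect, looking up a class and a method by name at run time and then calling it, has a close analogue in Python. The sketch below is only that analogue, not Hive or Java code: it assumes Python modules play the role of Java classes, and it uses the builtins module's min and max in place of java.lang.Math.

```python
import importlib

def reflect(module_name, function_name, *args):
    """Resolve module_name.function_name from strings and call it with args."""
    module = importlib.import_module(module_name)
    return getattr(module, function_name)(*args)

# Each record names its own function, like the rows of test_udf2
records = [("builtins", "min", 1, 2), ("builtins", "max", 2, 3)]
results = [reflect(m, f, a, b) for m, f, a, b in records]
print(results)
```

Because the target is resolved from data, each row can call a different function, which is what the test_udf2 query does with the class_name and method_name columns.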

Checking whether a value is a number
This uses a function from Apache Commons. The Commons jars are already on the Hadoop classpath, so they can be used directly, like this:
select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123")

7. Window functions and analytic functions
Hive also provides many window functions and analytic functions, mainly for the following scenarios:
(1) sorting within partitions
(2) dynamic Group By
(3) Top N
(4) cumulative calculations
(5) hierarchical queries

1. Create the Hive table and load the data
Create the table
hive (hive_explode)> create table order_detail( user_id string,device_id string,user_type string,price double,sales int )row format delimited fields terminated by ',';
Prepare the data

cd /export/servers/hivedatas
vim order_detail

zhangsan,1,new,67.1,2
lisi,2,old,43.32,1
wagner,3,new,88.88,3
liliu,4,new,66.0,1
qiuba,5,new,54.32,1
wangshi,6,old,77.77,2
liwei,7,old,88.44,3
wutong,8,new,56.55,6
lilisi,9,new,88.88,5
qishili,10,new,66.66,5
Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/order_detail' into table order_detail;

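2. Window functions
FIRST_VALUE: the first value within the partition after sorting, up to the current row.
LAST_VALUE: the last value within the partition after sorting, up to the current row.
LEAD(col,n,DEFAULT): returns the value n rows below the current row within the window. The first argument is the column name, the second is the number of rows to look ahead (optional, default 1), and the third is the default value (used when the value n rows below is NULL; if omitted, NULL).
LAG(col,n,DEFAULT): the opposite of LEAD; returns the value n rows above the current row within the window. The first argument is the column name, the second is the number of rows to look back (optional, default 1), and the third is the default value (used when the value n rows above is NULL; if omitted, NULL).

3. The OVER clause
1. With the standard aggregate functions COUNT, SUM, MIN, MAX, AVG.
2. With PARTITION BY on one or more columns of primitive data types.
3. With PARTITION BY and ORDER BY, using one or more partition or sort columns of any data type.
4. With a window specification, that is, the ROWS BETWEEN clause explained later in this section.

When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

The OVER clause supports the following functions, but they cannot be combined with a window specification:
Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
The Lead and Lag functions.

Computing sales with window functions
Use the window function sum over to compute sales totals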

hive (hive_explode)> select
  user_id,
  user_type,
  sales,
  --all rows in the partition
  sum(sales) over(partition by user_type) AS sales_1,
  sum(sales) over(order by user_type) AS sales_2,
  --default frame: from the start to the current row; rows with the same sales get the same running total
  sum(sales) over(partition by user_type order by sales asc) AS sales_3,
  --from the start to the current row; differs from sales_3 because the running total depends on the order of tied rows
  sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) AS sales_4,
  --current row + 3 preceding rows
  sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) AS sales_5,
  --current row + 3 preceding rows + 1 following row
  sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) AS sales_6,
  --current row + all following rows
  sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) AS sales_7
from order_detail
order by user_type, sales, user_id;

統計以後求得結果以下： +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+ | user_id | user_type | sales | sales_1 | sales_2 | sales_3 | sales_4 | sales_5 | sales_6 | sales_7 | +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+ | liliu | new | 1 | 23 | 23 | 2 | 2 | 2 | 4 | 22 | | qiuba | new | 1 | 23 | 23 | 2 | 1 | 1 | 2 | 23 | | zhangsan | new | 2 | 23 | 23 | 4 | 4 | 4 | 7 | 21 | | wagner | new | 3 | 23 | 23 | 7 | 7 | 7 | 12 | 19 | | lilisi | new | 5 | 23 | 23 | 17 | 17 | 15 | 21 | 11 | | qishili | new | 5 | 23 | 23 | 17 | 12 | 11 | 16 | 16 | | wutong | new | 6 | 23 | 23 | 23 | 23 | 19 | 19 | 6 | | lisi | old | 1 | 6 | 29 | 1 | 1 | 1 | 3 | 6 | | wangshi | old | 2 | 6 | 29 | 3 | 3 | 3 | 6 | 5 | | liwei | old | 3 | 6 | 29 | 6 | 6 | 6 | 6 | 3 | +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+

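Note:
The result depends on ORDER BY, which sorts in ascending order by default.
If ROWS BETWEEN is not specified, the default frame runs from the start of the partition to the current row. This default is a RANGE frame, so rows that tie on the ORDER BY value share the same running total, as sales_3 shows.
If ORDER BY is not specified, all values in the partition are summed.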
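The difference between sales_3 and sales_4 in the table above can be sketched in plain Python. This is an illustration of the frame semantics only, not Hive's implementation; the function names are hypothetical, and the order of the tied rows is copied from the result table (an engine is free to pick either order).

```python
from itertools import groupby

# Sales of the "new" users in ORDER BY sales order. The relative order of the
# tied rows (qiuba/liliu, qishili/lilisi) is an assumption read off the result table.
rows = [("qiuba", 1), ("liliu", 1), ("zhangsan", 2), ("wagner", 3),
        ("qishili", 5), ("lilisi", 5), ("wutong", 6)]


def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: add one row at a time."""
    out, total = {}, 0
    for user, sales in rows:
        total += sales
        out[user] = total
    return out


def running_sum_range(rows):
    """RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: tied rows (peers)
    are added together, so every peer sees the same total."""
    out, total = {}, 0
    for _, peers in groupby(rows, key=lambda row: row[1]):
        peers = list(peers)
        total += sum(sales for _, sales in peers)
        for user, _ in peers:
            out[user] = total
    return out


sales_3 = running_sum_range(rows)  # default frame when only ORDER BY is given
sales_4 = running_sum_rows(rows)   # explicit ROWS frame
print(sales_3)
print(sales_4)
```

The two results differ only on the tied rows, which is why sales_4 can change from run to run while sales_3 cannot.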

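The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound (the start or the end of the partition)
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition
COUNT, AVG, MIN, and MAX are used the same way as SUM.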
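As a rough sketch of how a ROWS frame selects rows, the following Python function sums a sliding slice of an already ordered list. The function name and its `preceding`/`following` parameters are hypothetical; `None` stands for UNBOUNDED and `0` for CURRENT ROW.

```python
def rows_frame_sum(values, preceding=None, following=None):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    values must already be in ORDER BY order within one partition.
    None means UNBOUNDED, 0 means CURRENT ROW.
    """
    n = len(values)
    result = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        result.append(sum(values[start:end]))
    return result


# Sales of the "new" users in ascending order.
new_sales = [1, 1, 2, 3, 5, 5, 6]

sales_5 = rows_frame_sum(new_sales, preceding=3, following=0)     # 3 preceding .. current row
sales_6 = rows_frame_sum(new_sales, preceding=3, following=1)     # 3 preceding .. 1 following
sales_7 = rows_frame_sum(new_sales, preceding=0, following=None)  # current row .. unbounded following
print(sales_5, sales_6, sales_7)
```

Read in window order, the three lists match the sales_5, sales_6, and sales_7 values of the "new" users in the result table.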

First and last value in a group: first_value and last_value
Use first_value and last_value to get the first and the last value in each group.
select
  user_id,
  user_type,
  ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
  first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
  first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
  last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
  last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from order_detail;
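The query above gives no frame, so last_value works on the default frame, which ends at the current row (including its ties) and not at the end of the partition. The Python sketch below illustrates this on the "old" users, whose sales values are all distinct. It is an illustration of the semantics under that assumption, with hypothetical function names, not Hive's implementation.

```python
# "old" users ordered by sales DESC; sales values are distinct, so there are no ties.
old_desc = [("liwei", 3), ("wangshi", 2), ("lisi", 1)]


def first_value(rows):
    """The frame always starts at the first row, so every row sees the same value."""
    return [rows[0][0] for _ in rows]


def last_value_default_frame(rows):
    """Default frame (RANGE ... CURRENT ROW): ends at the last row that ties
    with the current row on the ORDER BY value."""
    out = []
    for i, (_, sales) in enumerate(rows):
        j = i
        while j + 1 < len(rows) and rows[j + 1][1] == sales:
            j += 1
        out.append(rows[j][0])
    return out


def last_value_full_frame(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: the true last row."""
    return [rows[-1][0] for _ in rows]


print(first_value(old_desc))
print(last_value_default_frame(old_desc))
print(last_value_full_frame(old_desc))
```

With the default frame each row simply returns itself here, which is why the columns are named curr_last_*; an explicit frame up to UNBOUNDED FOLLOWING is needed to get the last value of the whole partition.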

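4. Analytic functions
1. ROW_NUMBER(): generates a sequence number for each row within a partition, starting from 1 and following the sort order. For example, sorting by pv in descending order gives the daily pv ranking within each group. ROW_NUMBER() has many uses, such as fetching the top-ranked record in each group or the first refer in a session.
2. RANK(): the rank of a row within its partition; tied rows share a rank and leave gaps in the ranking.
3. DENSE_RANK(): the rank of a row within its partition; tied rows share a rank and leave no gaps in the ranking.
4. CUME_DIST: (number of rows with a value less than or equal to the current value) / (total number of rows in the partition). For example, the share of people whose salary is less than or equal to the current salary.
5. PERCENT_RANK: (RANK of the current row - 1) / (total number of rows in the partition - 1).
6. NTILE(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row. If the rows do not divide evenly, the extra rows go to the earliest buckets, starting with the first. NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.

Using RANK, ROW_NUMBER, and DENSE_RANK with OVER
These functions can be used to compute a per-group top N.
Requirement: partition by user type and find the N records with the highest sales.
select
  user_id,
  user_type,
  sales,
  RANK() over (partition by user_type order by sales desc) as r,
  ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
  DENSE_RANK() over (partition by user_type order by sales desc) as dr
from order_detail;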

+-----------+------------+--------+----+-----+-----+--+
| user_id   | user_type  | sales  | r  | rn  | dr  |
+-----------+------------+--------+----+-----+-----+--+
| wutong    | new        | 6      | 1  | 1   | 1   |
| qishili   | new        | 5      | 2  | 2   | 2   |
| lilisi    | new        | 5      | 2  | 3   | 2   |
| wagner    | new        | 3      | 4  | 4   | 3   |
| zhangsan  | new        | 2      | 5  | 5   | 4   |
| qiuba     | new        | 1      | 6  | 6   | 5   |
| liliu     | new        | 1      | 6  | 7   | 5   |
| liwei     | old        | 3      | 1  | 1   | 1   |
| wangshi   | old        | 2      | 2  | 2   | 2   |
| lisi      | old        | 1      | 3  | 3   | 3   |
+-----------+------------+--------+----+-----+-----+--+
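How the three rankings diverge on ties can be sketched in a few lines of Python. The function name is hypothetical and the code only mirrors the semantics shown in the table above; taking the top N then amounts to filtering on one of the three columns.

```python
def rank_columns(sales_desc):
    """Return (rank, row_number, dense_rank) for sales sorted in descending order."""
    result = []
    rank = dense = 0
    previous = None
    for row_number, sales in enumerate(sales_desc, start=1):
        if sales != previous:
            rank = row_number  # RANK jumps to the row position, leaving gaps after ties
            dense += 1         # DENSE_RANK only moves to the next integer
            previous = sales
        result.append((rank, row_number, dense))
    return result


# Sales of the "new" users in descending order.
new_sales_desc = [6, 5, 5, 3, 2, 1, 1]
ranks = rank_columns(new_sales_desc)
for sales, (r, rn, dr) in zip(new_sales_desc, ranks):
    print(sales, r, rn, dr)

# Top 3 by ROW_NUMBER: exactly three rows, ties broken arbitrarily.
top_3 = [sales for sales, (_, rn, _) in zip(new_sales_desc, ranks) if rn <= 3]
print(top_3)
```

Filtering on ROW_NUMBER always returns exactly N rows per group, while filtering on RANK or DENSE_RANK keeps all tied rows and may return more.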

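Computing percentages with NTILE
NTILE splits the data into a given number of buckets, which can then be used to compute percentages.
Bucketing the data with NTILE:
select
  user_type,
  sales,
  -- split each partition into 2 buckets
  NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
  -- split each partition into 3 buckets
  NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
  -- split each partition into 4 buckets
  NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
  -- split all rows into 4 buckets
  NTILE(4) OVER(ORDER BY sales) AS all_nt4
from order_detail
order by user_type, sales;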

The result is as follows:
+------------+--------+------+------+------+----------+--+
| user_type  | sales  | nt2  | nt3  | nt4  | all_nt4  |
+------------+--------+------+------+------+----------+--+
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 2      | 1    | 1    | 2    | 2        |
| new        | 3      | 1    | 2    | 2    | 3        |
| new        | 5      | 2    | 2    | 3    | 4        |
| new        | 5      | 2    | 3    | 3    | 3        |
| new        | 6      | 2    | 3    | 4    | 4        |
| old        | 1      | 1    | 1    | 1    | 1        |
| old        | 2      | 1    | 2    | 2    | 2        |
| old        | 3      | 2    | 3    | 3    | 2        |
+------------+--------+------+------+------+----------+--+
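The bucket sizes in this table follow a simple rule: every bucket gets `row_count // n` rows and the first `row_count % n` buckets get one extra. A minimal Python sketch of that rule, with a hypothetical function name, is:

```python
def ntile(row_count, n):
    """Bucket number (1-based) for each of row_count rows, in sort order."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets


print(ntile(7, 2))   # the 7 "new" rows in 2 buckets
print(ntile(7, 3))   # the 7 "new" rows in 3 buckets
print(ntile(7, 4))   # the 7 "new" rows in 4 buckets
print(ntile(3, 4))   # the 3 "old" rows: fewer rows than buckets
print(ntile(10, 4))  # all 10 rows in 4 buckets
```

These lists match the nt2, nt3, and nt4 columns per user type. For all_nt4 the table is sorted by user_type first, so the same bucket numbers appear there in a different row order.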

Using NTILE to get the user ids in the top 20% by sales
select user_id
from (
  select user_id, NTILE(5) OVER(ORDER BY sales desc) AS nt
  from order_detail
) A
where nt = 1;
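The same top 20% selection can be traced in Python on the ten rows of order_detail, reconstructed here from the result tables above. The helper name is hypothetical, and which of the two users tied at sales = 5 lands in the first bucket is arbitrary, both in this sketch and in a real query.

```python
# order_detail as (user_id, user_type, sales), reconstructed from the result tables.
order_detail = [
    ("liliu", "new", 1), ("qiuba", "new", 1), ("zhangsan", "new", 2),
    ("wagner", "new", 3), ("lilisi", "new", 5), ("qishili", "new", 5),
    ("wutong", "new", 6), ("lisi", "old", 1), ("wangshi", "old", 2),
    ("liwei", "old", 3),
]


def ntile_buckets(row_count, n):
    """Bucket number (1-based) per row; the first buckets absorb the remainder."""
    base, extra = divmod(row_count, n)
    buckets = []
    for bucket in range(1, n + 1):
        buckets.extend([bucket] * (base + (1 if bucket <= extra else 0)))
    return buckets


# ORDER BY sales DESC, then NTILE(5), then keep bucket 1.
ranked = sorted(order_detail, key=lambda row: row[2], reverse=True)
buckets = ntile_buckets(len(ranked), 5)
top_20_percent = [user_id for (user_id, _, _), nt in zip(ranked, buckets) if nt == 1]
print(top_20_percent)
```

With ten rows and five buckets, the first bucket holds two users: wutong plus one of the two users with sales = 5.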

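5. Enhanced aggregation: Cube, Grouping Sets, and Rollup
These analytic functions are typically used in OLAP scenarios where a metric cannot simply be summed across groups and has to be computed while drilling up and down across different dimensions, for example UV counts by hour, day, and month.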

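GROUPING SETS
Within a single GROUP BY query, GROUPING SETS aggregates by several combinations of dimensions. It is equivalent to a UNION ALL of the GROUP BY results for each combination. The GROUPING__ID column indicates which grouping set a result row belongs to.
Requirement: group by user_type and by sales separately.
0: jdbc:hive2://node03:10000> select
  user_type,
  sales,
  count(user_id) as pv,
  GROUPING__ID
from order_detail
group by user_type, sales
GROUPING SETS(user_type, sales)
ORDER BY GROUPING__ID;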

The result is as follows:
+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
+------------+--------+-----+---------------+--+
Requirement: group by user_type, by sales, and by user_type + sales, and compute the statistics for each grouping.

0: jdbc:hive2://node03:10000> select
  user_type,
  sales,
  count(user_id) as pv,
  GROUPING__ID
from order_detail
group by user_type, sales
GROUPING SETS(user_type, sales, (user_type, sales))
ORDER BY GROUPING__ID;
The result is as follows:
+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 1      | 3   | 2             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 1      | 2   | 3             |
| new        | 2      | 1   | 3             |
+------------+--------+-----+---------------+--+
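The "UNION ALL of several GROUP BY results" reading of GROUPING SETS can be sketched in Python as one counting pass per grouping set. The function name is hypothetical, the rows are reconstructed from the result tables, and GROUPING__ID is left out to keep the sketch short.

```python
from collections import Counter

# order_detail as (user_id, user_type, sales), reconstructed from the result tables.
order_detail = [
    ("liliu", "new", 1), ("qiuba", "new", 1), ("zhangsan", "new", 2),
    ("wagner", "new", 3), ("lilisi", "new", 5), ("qishili", "new", 5),
    ("wutong", "new", 6), ("lisi", "old", 1), ("wangshi", "old", 2),
    ("liwei", "old", 3),
]


def grouping_sets_count(rows, grouping_sets):
    """Return (user_type, sales, pv) rows; a column outside the current
    grouping set is reported as None, like NULL in the query output."""
    result = []
    for columns in grouping_sets:
        counts = Counter()
        for _user_id, user_type, sales in rows:
            key = (user_type if "user_type" in columns else None,
                   sales if "sales" in columns else None)
            counts[key] += 1  # count(user_id): every user_id here is non-null
        # Appending each set's groups mimics UNION ALL.
        result.extend((ut, s, pv) for (ut, s), pv in counts.items())
    return result


result = grouping_sets_count(
    order_detail, [("user_type",), ("sales",), ("user_type", "sales")])
for row in result:
    print(row)
```

The 15 rows printed correspond to the 2 + 5 + 8 rows of the result table above, in a different order.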

6. Using CUBE and ROLLUP
CUBE aggregates over every combination of the GROUP BY dimensions.
Aggregating with CUBE
Requirement: compute the statistics with no grouping, grouped by user_type, grouped by sales, and grouped by user_type + sales.
0: jdbc:hive2://node03:10000> select
  user_type,
  sales,
  count(user_id) as pv,
  GROUPING__ID
from order_detail
group by user_type, sales
WITH CUBE
ORDER BY GROUPING__ID;

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| NULL       | NULL   | 10  | 0             |
| new        | NULL   | 7   | 1             |
| old        | NULL   | 3   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+--+
Aggregating with ROLLUP
ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension.
select
  user_type,
  sales,
  count(user_id) as pv,
  GROUPING__ID
from order_detail
group by user_type, sales
WITH ROLLUP
ORDER BY GROUPING__ID;

+------------+--------+-----+---------------+--+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+--+
| NULL       | NULL   | 10  | 0             |
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+--+
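Which grouping sets CUBE and ROLLUP expand to can be enumerated with a short Python sketch. The function names are hypothetical; the sketch only lists the column combinations and does not aggregate anything.

```python
from itertools import combinations


def cube_sets(columns):
    """CUBE: every subset of the GROUP BY columns, including the empty set."""
    return [combo
            for size in range(len(columns) + 1)
            for combo in combinations(columns, size)]


def rollup_sets(columns):
    """ROLLUP: only the prefixes of the column list, from the left."""
    return [tuple(columns[:size]) for size in range(len(columns) + 1)]


columns = ["user_type", "sales"]
print(cube_sets(columns))
print(rollup_sets(columns))
```

For the two columns used here, CUBE yields four grouping sets and ROLLUP three; the missing one is the sales-only set, which is why the ROLLUP result has no rows with a NULL user_type and a non-NULL sales value.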