Hive for Multi-Dimensional Statistical Analysis: Applications and Tips

Original article: https://my.oschina.net/leejun2005/blog/121945

Multi-dimensional statistics generally come in two kinds. Let's look at how to handle each of them in Hive:

1. Multi-dimensional combination statistics over the same attribute

(1) Problem:
Given the following data, whose fields are, in order: url, catePath0, catePath1, catePath2, unitparams

https://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 
http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 


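(2) Requirement:
Compute the PV and UV (counted by URL host) for every combination level of the three dimensions catePath0, catePath1, catePath2, for example (columns: c0, c1, c2, PV, UV):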

0 1 23 1 1 
0 1 25 1 1 
0 1 8 1 1 
0 1 ALL 3 3 
0 3 8 1 1 
0 3 98 1 1 
0 3 ALL 2 1 
0 5 118 1 1 
0 5 18 1 1 
0 5 81 1 1 
0 5 ALL 3 2 
0 ALL ALL 8 3 
ALL ALL ALL 8 3 


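(3) Approach:
In Hive, same-attribute multi-dimensional statistics are usually solved by using union all to assemble each dimension combination and then aggregating with group by: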

create EXTERNAL table IF NOT EXISTS t_log (
    url string, c0 string, c1 string, c2 string, unitparams string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/tmp/decli/1';

-- Preview the rows produced by the four dimension levels
select * from (
        select host, c0, c1, c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
        select host, c0, c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
        select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
        select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test;

-- Aggregate PV and UV per dimension combination
select c0, c1, c2, count(host) PV, count(distinct(host)) UV from (
        select host, c0, c1, c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
        select host, c0, c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
        select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
        select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test group by c0, c1, c2;
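
To sanity-check the logic of the union all query outside Hive, the following Python sketch replays it over the sample rows. It is only an illustration, not part of the original solution: the names ROWS and rollup_pv_uv are made up here, each row is reduced to (url, c0, c1, c2, weight) with weight standing in for $.store.fruit[0].weight, and urlparse(...).netloc is assumed to be a close enough stand-in for parse_url_tuple(url, 'HOST') on these URLs.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Sample rows from the article, reduced to (url, c0, c1, c2, weight).
# weight stands in for $.store.fruit[0].weight of the unitparams JSON.
ROWS = [
    ("https://cwiki.apache.org/confluence", "0", "1", "8", 1),
    ("http://my.oschina.net/leejun2005/blog/83058", "0", "1", "23", 1),
    ("http://www.hao123.com/indexnt.html?sto", "0", "1", "25", 1),
    ("https://cwiki.apache.org/confluence", "0", "5", "18", 5),
    ("http://my.oschina.net/leejun2005/blog/83058", "0", "5", "118", 5),
    ("http://www.hao123.com/indexnt.html?sto", "0", "3", "98", 3),
    ("http://www.hao123.com/indexnt.html?sto", "0", "3", "8", 3),
    ("http://my.oschina.net/leejun2005/blog/83058", "0", "5", "81", 5),
    ("http://www.hao123.com/indexnt.html?sto", "0", "9", "8", 9),
]


def rollup_pv_uv(rows):
    """Return {(c0, c1, c2): (pv, uv)} for the four dimension levels."""
    hosts = defaultdict(list)
    for url, c0, c1, c2, weight in rows:
        if weight == 9:  # same filter as the where clause
            continue
        host = urlparse(url).netloc  # assumption: no port in the sample URLs
        # One emitted row per branch of the union all.
        for key in ((c0, c1, c2), (c0, c1, "ALL"),
                    (c0, "ALL", "ALL"), ("ALL", "ALL", "ALL")):
            hosts[key].append(host)
    # pv = count(host), uv = count(distinct host)
    return {key: (len(h), len(set(h))) for key, h in hosts.items()}


result = rollup_pv_uv(ROWS)
for key in sorted(result):
    print(*key, *result[key])
```

Each source row is emitted once per dimension level, which is exactly what the four union all branches do; the group by then only has to count hosts per key.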


2. Multi-dimensional combination statistics over different attributes

For this scenario we usually choose Multi Table/File Inserts. The following excerpt is taken from Programming Hive, p. 124:

Making Multiple Passes over the Same Data
Hive has a special syntax for producing multiple aggregations from a single pass
through a source of data, rather than rescanning it for each aggregation. This change
can save considerable processing time for large input data sets. We discussed the details
previously in Chapter 5.
For example, each of the following two queries creates a table from the same source
table, history:
hive> INSERT OVERWRITE TABLE sales
    > SELECT * FROM history WHERE action='purchased';
hive> INSERT OVERWRITE TABLE credits
    > SELECT * FROM history WHERE action='returned';
This syntax is correct, but inefficient. The following rewrite achieves the same thing,
but using a single pass through the source history table:
hive> FROM history
    > INSERT OVERWRITE sales   SELECT * WHERE action='purchased'
    > INSERT OVERWRITE credits SELECT * WHERE action='returned';

FROM pv_users
    INSERT OVERWRITE TABLE pv_gender_sum
        SELECT pv_users.gender, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.gender
    INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
        SELECT pv_users.age, count(DISTINCT pv_users.userid)
        GROUP BY pv_users.age;

https://cwiki.apache.org/confluence/display/Hive/Tutorial
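
The point of the multi-insert form is that one scan of the source feeds several aggregations. A minimal Python sketch of that idea follows, assuming a tiny in-memory pv_users; the sample rows and the name multi_insert are hypothetical, only the column names come from the query.

```python
from collections import defaultdict

# Hypothetical pv_users rows; only the column names come from the query.
PV_USERS = [
    {"userid": 1, "gender": "F", "age": 25},
    {"userid": 2, "gender": "M", "age": 25},
    {"userid": 1, "gender": "F", "age": 25},
    {"userid": 3, "gender": "M", "age": 31},
]


def multi_insert(rows):
    """Feed two distinct-user aggregations from a single pass over rows."""
    users_by_gender = defaultdict(set)
    users_by_age = defaultdict(set)
    for row in rows:  # the one and only scan of the source
        users_by_gender[row["gender"]].add(row["userid"])
        users_by_age[row["age"]].add(row["userid"])
    gender_sum = {g: len(u) for g, u in users_by_gender.items()}
    age_sum = {a: len(u) for a, u in users_by_age.items()}
    return gender_sum, age_sum


gender_sum, age_sum = multi_insert(PV_USERS)
```

Running the two aggregations as separate queries would loop over the rows twice; here both result sets are built inside the same loop.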


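Notes and a few tips:

1. Using union all in Hive: it is not supported at the top level of a query (before Hive 0.13), so wrap it in a subquery; the column names and types of every select must also match exactly.

2. Result ordering: you can add a character of your own to the values to control the sort order.

3. Like union all, a multi-insert scans the source only once. But because it has to insert into multiple partitions, it does a lot of extra work and takes much longer; it generates multiple jobs, whereas union all itself runs as a single job.

On running the multiple jobs produced by insert overwrite in parallel: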

-- enable parallel execution of jobs
set hive.exec.parallel=true;
-- maximum number of jobs of one SQL statement allowed to run in parallel; the default is 8
set hive.exec.parallel.thread.number=16;
http://superlxw1234.iteye.com/blog/1703713

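4. At the time of writing, Hive does not support a subquery inside not in (support arrived later, in Hive 0.13), so an HQL statement like the following is rejected.
To find the rows whose key is in table a but not in table b:
select a.key from a where key not in (select key from b)   -- not supported in Hive
The same result can be obtained with a left outer join (assuming table b also contains another column, key1, that is never null in b):
select a.key from a left outer join b on a.key=b.key where b.key1 is null
Filtering on b.key is null instead works the same way and does not rely on key1.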
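
Why the left outer join reproduces "in a but not in b" can be seen in a small Python sketch. It is an illustration under simplifying assumptions: tables are plain lists of non-null keys, and the name keys_only_in_a is made up here.

```python
def keys_only_in_a(a_keys, b_keys):
    """Emulate: a left outer join b on a.key = b.key where b.key is null."""
    b_index = set(b_keys)
    # Left outer join: every key of a is kept; the b side is None when
    # there is no matching key in b.
    joined = [(k, k if k in b_index else None) for k in a_keys]
    # The "is null" filter keeps only the unmatched rows.
    return [a for a, b in joined if b is None]


only_in_a = keys_only_in_a([1, 2, 3, 4], [2, 4, 5])
```

Duplicate keys in b would multiply the matched rows in a real join, but those rows are dropped by the filter anyway, so the result is unaffected.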

5. left outer join cannot be chained three or more times in a row; the joins have to be wrapped up in nested subqueries, two at a time.

select p.ssi, p.pv, p.uv, p.nuv, p.visits, '2012-06-19 17:00:00' from (
    -- explicit column lists avoid an ambiguous ssi after each join
    select p3.ssi, p3.pv, p3.visits, p3.uv, p4.nuv from (
        select p1.ssi, p1.pv, p1.visits, p2.uv from (
            select ssi, count(1) pv, sum(visits) visits from FactClickAnalysis
            where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi
        ) p1
        left outer join (
            select ssi, count(1) uv from (
                select ssi, cookieid from FactClickAnalysis
                where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi, cookieid
            ) t1 group by ssi
        ) p2 on p1.ssi = p2.ssi
    ) p3
    left outer join (
        select ssi, count(1) nuv from FactClickAnalysis
        where logTime = insertTime and logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00' group by ssi
    ) p4 on p3.ssi = p4.ssi
) p;

6. Running MapReduce locally in Hive

http://superlxw1234.iteye.com/blog/1703546

7. An error encountered when Hive creates too many dynamic partitions

http://superlxw1234.iteye.com/blog/1677938

8. Making clever use of greedy regular-expression matching in Hive

http://superlxw1234.iteye.com/blog/1751216

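9. Matching fields that consist entirely of Chinese characters in Hive

The regular expression used in Java to match Chinese characters works here as well: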

name rlike '^[\\u4e00-\\u9fa5]+$'

To check whether a field consists entirely of digits:

select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;  
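
Both patterns can be tried out quickly outside Hive. The Python sketch below is a hedged equivalent, not Hive code: the helper names are made up, fullmatch plays the role of the ^...$ anchors, and [0-9] is used instead of \d because Python's \d also matches non-ASCII digits in str patterns.

```python
import re

# Same character range as in the rlike pattern; fullmatch replaces ^ and $.
ALL_CHINESE = re.compile(r"[\u4e00-\u9fa5]+")
# [0-9] rather than \d: in Python, \d on str also matches non-ASCII digits.
ALL_DIGITS = re.compile(r"[0-9]+")


def is_all_chinese(text):
    """True if text is non-empty and every character is in U+4E00..U+9FA5."""
    return ALL_CHINESE.fullmatch(text) is not None


def is_all_digits(text):
    """True if text is non-empty and every character is an ASCII digit."""
    return ALL_DIGITS.fullmatch(text) is not None
```

For instance, is_all_chinese("中文") holds while is_all_chinese("中文abc") does not, and is_all_digits("13800138000") holds while is_all_digits("138-0013") does not.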

10. Using the SQL window functions LAG/LEAD/FIRST/LAST in Hive

http://superlxw1234.iteye.com/blog/1600323

http://www.shaoqun.com/a/18839.aspx

11. Hive optimization: controlling the number of map and reduce tasks in a Hive job

http://superlxw1234.iteye.com/blog/1582880

12. Escaping special characters such as $ in Hive

http://superlxw1234.iteye.com/blog/1568739

13. Date handling:

To get the date N days earlier:

select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;  

To get the number of days/seconds/minutes, etc. between two dates:

select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400  from t_lxw_test limit 1;  
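
The arithmetic behind both queries is "convert to seconds, add or subtract multiples of 86400, convert back". A small Python sketch of the same calculation, for checking expected values by hand; the function names are hypothetical, and it uses naive datetimes, so unlike a Hive session running in a time zone with daylight saving it never sees a 23-hour or 25-hour day.

```python
from datetime import datetime, timedelta


def days_ago(day, n, fmt="%Y%m%d"):
    """Date n days before day, in the same string format as the input."""
    return (datetime.strptime(day, fmt) - timedelta(days=n)).strftime(fmt)


def days_between(start, end, fmt="%Y-%m-%d"):
    """Number of days from start to end (seconds difference / 86400)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 86400


two_days_before = days_ago("20111102", 2)
gap = days_between("2011-11-01", "2011-11-02")
```

Dividing the seconds difference by 60 or 3600 instead of 86400 gives minutes or hours in the same way.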

14. Deleting Hive temporary files (hive.exec.scratchdir)

http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576


REF:

http://superlxw1234.iteye.com/blog/1536440
http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/
http://blog.csdn.net/azhao_dn/article/details/6921429

http://superlxw1234.iteye.com/category/228899
