ODS (Operational Data Store)
Raw data, as collected.
DWD (Data Warehouse Detail): the data-cleansing layer
Detail-level data, with null values, dirty records, and out-of-range values removed.
Detail parsing.
Concrete per-event detail tables.
DWS (Data Warehouse Service): wide tables of user behavior, lightly aggregated -----> how many wide tables? how many fields?
Service layer: retention, conversion, GMV, repurchase rate, daily active users.
Likes, comments, favorites.
Light aggregation over the DWD data (a sketch follows this overview).
ADS (Application Data Store, also called APP/DAL/DF): produces the report results.
Analysis results are synced onward into an RDS database.
Data mart: narrowly, the ADS layer; broadly, the DWD/DWS/ADS data synced from Hadoop into RDS.
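As a minimal, hedged sketch of what "light aggregation" means here (the table dws_uv_detail_day and its extra columns are hypothetical names used only for illustration, not part of this build):

insert overwrite table dws_uv_detail_day partition(dt='2019-02-10')
select
    mid_id,                                        -- one row per device per day
    concat_ws('|', collect_set(user_id)) user_id,  -- collapse the day's user ids into one field
    count(*) start_count                           -- a light, reusable daily aggregate
from dwd_start_log
where dt='2019-02-10'
group by mid_id;

Each DWS row keeps a low grain (device by day) while collapsing the multi-row DWD detail into counts and sets that ADS queries can reuse.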
1) Create the gmall database
hive (default)> create database gmall;
Note: if the database already exists and contains data, force-drop it with: drop database gmall cascade;
2) Use the gmall database
hive (default)> use gmall;
The raw data layer stores the original data: the raw logs and data are loaded directly and kept exactly as they arrive, with no processing.
① Create the start log table ods_start_log
1) Create a partitioned table whose input is LZO, whose output is text, and which supports JSON parsing:
hive (gmall)>
drop table if exists ods_start_log;
CREATE EXTERNAL TABLE ods_start_log (`line` string)
PARTITIONED BY (`dt` string)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/warehouse/gmall/ods/ods_start_log';
Hive LZO compression: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO
2) Load data:
All dates are configured in the yyyy-MM-dd format, which Hive supports natively; a short demonstration follows the query below.
hive (gmall)> load data inpath '/origin_data/gmall/log/topic_start/2019-02-10' into table gmall.ods_start_log partition(dt="2019-02-10");
hive (gmall)> select * from ods_start_log limit 2;
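Because dt and the dates inside the logs use the yyyy-MM-dd format, they can be fed straight into Hive's built-in date functions; for example (FROM-less SELECT, as supported in recent Hive versions):
hive (gmall)> select date_add('2019-02-10', 1);             -- returns 2019-02-11
hive (gmall)> select datediff('2019-02-11', '2019-02-10');  -- returns 1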
② Create the event log table ods_event_log
1) Create a partitioned table whose input is LZO, whose output is text, and which supports JSON parsing:
hive (gmall)>
drop table if exists ods_event_log;
create external table ods_event_log (`line` string)
partitioned by (`dt` string)
stored as
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location '/warehouse/gmall/ods/ods_event_log';
2) Load data:
hive (gmall)> load data inpath '/origin_data/gmall/log/topic_event/2019-02-10' into table gmall.ods_event_log partition(dt="2019-02-10");
Script for loading data into the ODS layer
1) Create the script in the /home/kris/bin directory on hadoop101
[kris@hadoop101 bin]$ vim ods_log.sh

#!/bin/bash
# define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# if a date is passed in, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

echo "=== log date: $do_date ==="
sql="
load data inpath '/origin_data/gmall/log/topic_start/$do_date' into table "$APP".ods_start_log partition(dt='$do_date');
load data inpath '/origin_data/gmall/log/topic_event/$do_date' into table "$APP".ods_event_log partition(dt='$do_date');
"
$hive -e "$sql"
[ -n "value" ] tests whether the value is non-empty:
-- non-empty value: returns true
-- empty value: returns false
For details on the date command: [kris@hadoop101 ~]$ date --help
增長腳本執行權限 [kris@hadoop101 bin]$ chmod 777 ods_log.sh 腳本使用 [kris@hadoop101 module]$ ods_log.sh 2019-02-11 查看導入數據 hive (gmall)> select * from ods_start_log where dt='2019-02-11' limit 2; select * from ods_event_log where dt='2019-02-11' limit 2; 腳本執行時間 企業開發中通常在每日凌晨30分~1點
Clean the ODS-layer data: remove null values, dirty records, and values beyond valid ranges; convert row storage to columnar storage; change the compression format.
The DWD parsing is a transient step that uses two staging tables: dwd_base_event_log and dwd_base_start_log.
From there, 12 external tables are created, partitioned by date; for dwd_base_event_log, the fields inside event_json are parsed out one by one with get_json_object, keyed on event_name.
Create the DWD base detail tables
The detail tables store the detail data converted from the ODS raw tables.
1) Create the start-log base detail table:
drop table if exists dwd_base_start_log;
create external table dwd_base_start_log(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `event_name` string, `event_json` string, `server_time` string)
partitioned by (`dt` string)
stored as parquet
location "/warehouse/gmall/dwd/dwd_base_start_log";
Here event_name and event_json hold the event name and the full event payload. This is where the original one-to-many log lines get split apart: to flatten the raw logs we need a UDF and a UDTF.
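The expansion has the same shape as what Hive's built-in explode() produces with lateral view; a minimal illustration (FROM-less SELECT in the subquery, as supported in recent Hive versions):
hive (gmall)> select t.id, tmp.ev
              from (select 1 as id, array('start','loading') as ets) t
              lateral view explode(ets) tmp as ev;
-- 1   start
-- 1   loading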
2) Create the event-log base detail table:
drop table if exists dwd_base_event_log;
create external table dwd_base_event_log(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `event_name` string, `event_json` string, `server_time` string)
partitioned by (`dt` string)
stored as parquet
location "/warehouse/gmall/dwd/dwd_base_event_log";
Custom UDF (parses the common fields)
UDF: parses the common fields + the event array et (a JSON array) + the timestamp.
Custom UDTF (parses the per-event fields): process() takes one row in and emits many rows out (many-in, many-out is also supported).
UDTF: given the incoming event array et (a JSON array), it returns one event_name | event_json pair per concrete event inside et.
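Per array element, that extraction is equivalent to what the built-in get_json_object does on the et JSON array; for example:
hive (gmall)> select get_json_object('[{"en":"start","kv":{"entry":"5"}}]', '$[0].en');
-- returns: start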
Add the jar to Hive's classpath.
Create temporary functions bound to the compiled Java classes:
hive (gmall)> add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
hive (gmall)> create temporary function base_analizer as "com.atguigu.udf.BaseFieldUDF";
hive (gmall)> create temporary function flat_analizer as "com.atguigu.udtf.EventJsonUDTF";
hive (gmall)> set hive.exec.dynamic.partition.mode=nonstrict;
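To confirm the temporary functions registered, you can run:
hive (gmall)> desc function base_analizer;
hive (gmall)> desc function flat_analizer;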
1) Parse into the start-log base detail table
insert overwrite table dwd_base_start_log partition(dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       event_name, event_json, server_time, dt
from (
    select
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0]  as mid_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1]  as user_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2]  as version_code,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3]  as version_name,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4]  as lang,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5]  as source,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6]  as os,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7]  as area,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8]  as model,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9]  as brand,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time,
        dt
    from ods_start_log
    where dt='2019-02-10'
      and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log
lateral view flat_analizer(ops) tmp_k as event_name, event_json;
lateral view turns ops into event_name and event_json:
+------------+------------------------------------------------------------------------------------------------------------------------+---------------+
| event_name | event_json                                                                                                             | server_time   |
+------------+------------------------------------------------------------------------------------------------------------------------+---------------+
| start      | {"ett":"1549683362200","en":"start","kv":{"entry":"5","loading_time":"4","action":"1","open_ad_type":"1","detail":""}} | 1549728087940 |
+------------+------------------------------------------------------------------------------------------------------------------------+---------------+
2) Parse into the event-log base detail table
insert overwrite table dwd_base_event_log partition(dt='2019-02-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       event_name, event_json, server_time
from (
    select
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0]  as mid_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1]  as user_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2]  as version_code,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3]  as version_name,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4]  as lang,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5]  as source,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6]  as os,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7]  as area,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8]  as model,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9]  as brand,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
    from ods_event_log
    where dt='2019-02-10'
      and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log
lateral view flat_analizer(ops) tmp_k as event_name, event_json;
Test: hive (gmall)> select * from dwd_base_event_log limit 2;
1) Create the script in the /home/kris/bin directory on hadoop101
[kris@hadoop101 bin]$ vim dwd_base.sh
Write the script:

#!/bin/bash
APP=gmall
hive=/opt/module/hive/bin/hive

if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
add jar /opt/module/hive/hivefunction-1.0-SNAPSHOT.jar;
create temporary function base_analizer as 'com.atguigu.udf.BaseFieldUDF';
create temporary function flat_analizer as 'com.atguigu.udtf.EventJsonUDTF';
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_base_start_log partition(dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       event_name, event_json, server_time, dt
from (
    select
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0]  as mid_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1]  as user_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2]  as version_code,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3]  as version_name,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4]  as lang,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5]  as source,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6]  as os,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7]  as area,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8]  as model,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9]  as brand,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time,
        dt
    from "$APP".ods_start_log
    where dt='$do_date'
      and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log
lateral view flat_analizer(ops) tmp_k as event_name, event_json;

insert overwrite table "$APP".dwd_base_event_log partition(dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       event_name, event_json, server_time
from (
    select
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[0]  as mid_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[1]  as user_id,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[2]  as version_code,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[3]  as version_name,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[4]  as lang,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[5]  as source,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[6]  as os,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[7]  as area,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[8]  as model,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[9]  as brand,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[10] as sdk_version,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[11] as gmail,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[12] as height_width,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[13] as app_time,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[14] as network,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[15] as lng,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[16] as lat,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[17] as ops,
        split(base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la'),'\t')[18] as server_time
    from "$APP".ods_event_log
    where dt='$do_date'
      and base_analizer(line,'mid,uid,vc,vn,l,sr,os,ar,md,ba,sv,g,hw,t,nw,ln,la')<>''
) sdk_log
lateral view flat_analizer(ops) tmp_k as event_name, event_json;
"
$hive -e "$sql"
2) Grant execute permission:
[kris@hadoop101 bin]$ chmod +x dwd_base.sh
3) Run the script:
[kris@hadoop101 bin]$ dwd_base.sh 2019-02-11
4) Check the loaded data:
hive (gmall)> select * from dwd_base_start_log where dt='2019-02-11' limit 2;
select * from dwd_base_event_log where dt='2019-02-11' limit 2;
5) Script schedule: in production this typically runs between 00:30 and 01:00 each morning.
1) Product click table
1) Create the table:
hive (gmall)>
drop table if exists dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `action` string, `newsid` string, `place` string, `extend1` string, `category` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_display_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_display_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.newsid') newsid,
       get_json_object(event_json,'$.kv.place') place,
       get_json_object(event_json,'$.kv.extend1') extend1,
       get_json_object(event_json,'$.kv.category') category,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='display';
3) Test:
hive (gmall)> select * from dwd_display_log limit 2;
2) Product detail page table
1) Create the table:
hive (gmall)>
drop table if exists dwd_newsdetail_log;
CREATE EXTERNAL TABLE `dwd_newsdetail_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `entry` string, `action` string, `newsid` string, `showtype` string,
    `news_staytime` string, `loading_time` string, `type1` string, `category` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_newsdetail_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_newsdetail_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.entry') entry,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.newsid') newsid,
       get_json_object(event_json,'$.kv.showtype') showtype,
       get_json_object(event_json,'$.kv.news_staytime') news_staytime,
       get_json_object(event_json,'$.kv.loading_time') loading_time,
       get_json_object(event_json,'$.kv.type1') type1,
       get_json_object(event_json,'$.kv.category') category,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='newsdetail';
3) Test:
hive (gmall)> select * from dwd_newsdetail_log limit 2;
3) Product list page table
1) Create the table:
hive (gmall)>
drop table if exists dwd_loading_log;
CREATE EXTERNAL TABLE `dwd_loading_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `action` string, `loading_time` string, `loading_way` string, `extend1` string,
    `extend2` string, `type` string, `type1` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_loading_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_loading_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.loading_time') loading_time,
       get_json_object(event_json,'$.kv.loading_way') loading_way,
       get_json_object(event_json,'$.kv.extend1') extend1,
       get_json_object(event_json,'$.kv.extend2') extend2,
       get_json_object(event_json,'$.kv.type') type,
       get_json_object(event_json,'$.kv.type1') type1,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='loading';
3) Test:
hive (gmall)> select * from dwd_loading_log limit 2;
4) Ad table
1) Create the table:
hive (gmall)>
drop table if exists dwd_ad_log;
CREATE EXTERNAL TABLE `dwd_ad_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `entry` string, `action` string, `content` string, `detail` string,
    `ad_source` string, `behavior` string, `newstype` string, `show_style` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_ad_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_ad_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.entry') entry,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.content') content,
       get_json_object(event_json,'$.kv.detail') detail,
       get_json_object(event_json,'$.kv.source') ad_source,
       get_json_object(event_json,'$.kv.behavior') behavior,
       get_json_object(event_json,'$.kv.newstype') newstype,
       get_json_object(event_json,'$.kv.show_style') show_style,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='ad';
3) Test:
hive (gmall)> select * from dwd_ad_log limit 2;
5) Message notification table
1) Create the table:
hive (gmall)>
drop table if exists dwd_notification_log;
CREATE EXTERNAL TABLE `dwd_notification_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `action` string, `noti_type` string, `ap_time` string, `content` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_notification_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_notification_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.noti_type') noti_type,
       get_json_object(event_json,'$.kv.ap_time') ap_time,
       get_json_object(event_json,'$.kv.content') content,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='notification';
3) Test:
hive (gmall)> select * from dwd_notification_log limit 2;
6) User foreground-active table
1) Create the table:
hive (gmall)>
drop table if exists dwd_active_foreground_log;
CREATE EXTERNAL TABLE `dwd_active_foreground_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `active_source` string, `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_foreground_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_active_foreground_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.active_source') active_source,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='active_foreground';
3) Test:
hive (gmall)> select * from dwd_active_foreground_log limit 2;
7) User background-active table
1) Create the table:
hive (gmall)>
drop table if exists dwd_active_background_log;
CREATE EXTERNAL TABLE `dwd_active_background_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `active_source` string, `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_background_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_active_background_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.active_source') active_source,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='active_background';
3) Test:
hive (gmall)> select * from dwd_active_background_log limit 2;
8) Comment table
1) Create the table:
hive (gmall)>
drop table if exists dwd_comment_log;
CREATE EXTERNAL TABLE `dwd_comment_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `comment_id` int, `userid` int, `p_comment_id` int, `content` string,
    `addtime` string, `other_id` int, `praise_count` int, `reply_count` int,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_comment_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_comment_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.comment_id') comment_id,
       get_json_object(event_json,'$.kv.userid') userid,
       get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
       get_json_object(event_json,'$.kv.content') content,
       get_json_object(event_json,'$.kv.addtime') addtime,
       get_json_object(event_json,'$.kv.other_id') other_id,
       get_json_object(event_json,'$.kv.praise_count') praise_count,
       get_json_object(event_json,'$.kv.reply_count') reply_count,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='comment';
3) Test:
hive (gmall)> select * from dwd_comment_log limit 2;
9) Favorites table
1) Create the table:
hive (gmall)>
drop table if exists dwd_favorites_log;
CREATE EXTERNAL TABLE `dwd_favorites_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `id` int, `course_id` int, `userid` int, `add_time` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_favorites_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_favorites_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.id') id,
       get_json_object(event_json,'$.kv.course_id') course_id,
       get_json_object(event_json,'$.kv.userid') userid,
       get_json_object(event_json,'$.kv.add_time') add_time,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='favorites';
3) Test:
hive (gmall)> select * from dwd_favorites_log limit 2;
10) Like table
1) Create the table:
hive (gmall)>
drop table if exists dwd_praise_log;
CREATE EXTERNAL TABLE `dwd_praise_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `id` string, `userid` string, `target_id` string, `type` string, `add_time` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_praise_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_praise_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.id') id,
       get_json_object(event_json,'$.kv.userid') userid,
       get_json_object(event_json,'$.kv.target_id') target_id,
       get_json_object(event_json,'$.kv.type') type,
       get_json_object(event_json,'$.kv.add_time') add_time,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='praise';
3) Test:
hive (gmall)> select * from dwd_praise_log limit 2;
11) Start log table
1) Create the table:
hive (gmall)>
drop table if exists dwd_start_log;
CREATE EXTERNAL TABLE `dwd_start_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `entry` string, `open_ad_type` string, `action` string, `loading_time` string,
    `detail` string, `extend1` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_start_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_start_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.entry') entry,
       get_json_object(event_json,'$.kv.open_ad_type') open_ad_type,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.loading_time') loading_time,
       get_json_object(event_json,'$.kv.detail') detail,
       get_json_object(event_json,'$.kv.extend1') extend1,
       server_time, dt
from dwd_base_start_log
where dt='2019-02-10' and event_name='start';
3) Test:
hive (gmall)> select * from dwd_start_log limit 2;
12) Error log table
1) Create the table:
hive (gmall)>
drop table if exists dwd_error_log;
CREATE EXTERNAL TABLE `dwd_error_log`(
    `mid_id` string, `user_id` string, `version_code` string, `version_name` string,
    `lang` string, `source` string, `os` string, `area` string, `model` string,
    `brand` string, `sdk_version` string, `gmail` string, `height_width` string,
    `app_time` string, `network` string, `lng` string, `lat` string,
    `errorBrief` string, `errorDetail` string,
    `server_time` string)
PARTITIONED BY (dt string)
location '/warehouse/gmall/dwd/dwd_error_log/';
2) Load data:
hive (gmall)>
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_error_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.errorBrief') errorBrief,
       get_json_object(event_json,'$.kv.errorDetail') errorDetail,
       server_time, dt
from dwd_base_event_log
where dt='2019-02-10' and event_name='error';
3) Test:
hive (gmall)> select * from dwd_error_log limit 2;
1) Create the script in the /home/kris/bin directory on hadoop101
[kris@hadoop101 bin]$ vim dwd.sh
#!/bin/bash
# define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive

# if a date is passed in, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_display_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.newsid') newsid,
       get_json_object(event_json,'$.kv.place') place,
       get_json_object(event_json,'$.kv.extend1') extend1,
       get_json_object(event_json,'$.kv.category') category,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='display';

insert overwrite table "$APP".dwd_newsdetail_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.entry') entry,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.newsid') newsid,
       get_json_object(event_json,'$.kv.showtype') showtype,
       get_json_object(event_json,'$.kv.news_staytime') news_staytime,
       get_json_object(event_json,'$.kv.loading_time') loading_time,
       get_json_object(event_json,'$.kv.type1') type1,
       get_json_object(event_json,'$.kv.category') category,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='newsdetail';

insert overwrite table "$APP".dwd_loading_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.loading_time') loading_time,
       get_json_object(event_json,'$.kv.loading_way') loading_way,
       get_json_object(event_json,'$.kv.extend1') extend1,
       get_json_object(event_json,'$.kv.extend2') extend2,
       get_json_object(event_json,'$.kv.type') type,
       get_json_object(event_json,'$.kv.type1') type1,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='loading';

insert overwrite table "$APP".dwd_ad_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.entry') entry,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.content') content,
       get_json_object(event_json,'$.kv.detail') detail,
       get_json_object(event_json,'$.kv.source') ad_source,
       get_json_object(event_json,'$.kv.behavior') behavior,
       get_json_object(event_json,'$.kv.newstype') newstype,
       get_json_object(event_json,'$.kv.show_style') show_style,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='ad';

insert overwrite table "$APP".dwd_notification_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.noti_type') noti_type,
       get_json_object(event_json,'$.kv.ap_time') ap_time,
       get_json_object(event_json,'$.kv.content') content,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='notification';

insert overwrite table "$APP".dwd_active_foreground_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.active_source') active_source,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_foreground';

insert overwrite table "$APP".dwd_active_background_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.active_source') active_source,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_background';

insert overwrite table "$APP".dwd_comment_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.comment_id') comment_id,
       get_json_object(event_json,'$.kv.userid') userid,
       get_json_object(event_json,'$.kv.p_comment_id') p_comment_id,
       get_json_object(event_json,'$.kv.content') content,
       get_json_object(event_json,'$.kv.addtime') addtime,
       get_json_object(event_json,'$.kv.other_id') other_id,
       get_json_object(event_json,'$.kv.praise_count') praise_count,
       get_json_object(event_json,'$.kv.reply_count') reply_count,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='comment';

insert overwrite table "$APP".dwd_favorites_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.id') id,
       get_json_object(event_json,'$.kv.course_id') course_id,
       get_json_object(event_json,'$.kv.userid') userid,
       get_json_object(event_json,'$.kv.add_time') add_time,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='favorites';

insert overwrite table "$APP".dwd_praise_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.id') id,
       get_json_object(event_json,'$.kv.userid') userid,
       get_json_object(event_json,'$.kv.target_id') target_id,
       get_json_object(event_json,'$.kv.type') type,
       get_json_object(event_json,'$.kv.add_time') add_time,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='praise';

insert overwrite table "$APP".dwd_start_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.entry') entry,
       get_json_object(event_json,'$.kv.open_ad_type') open_ad_type,
       get_json_object(event_json,'$.kv.action') action,
       get_json_object(event_json,'$.kv.loading_time') loading_time,
       get_json_object(event_json,'$.kv.detail') detail,
       get_json_object(event_json,'$.kv.extend1') extend1,
       server_time, dt
from "$APP".dwd_base_start_log
where dt='$do_date' and event_name='start';

insert overwrite table "$APP".dwd_error_log PARTITION (dt)
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand,
       sdk_version, gmail, height_width, app_time, network, lng, lat,
       get_json_object(event_json,'$.kv.errorBrief') errorBrief,
       get_json_object(event_json,'$.kv.errorDetail') errorDetail,
       server_time, dt
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='error';
"
$hive -e "$sql"

2) Grant execute permission:
[kris@hadoop101 bin]$ chmod 777 dwd.sh
3) Run the script:
[kris@hadoop101 module]$ dwd.sh 2019-02-11
4) Check the loaded data:
hive (gmall)> select * from dwd_start_log where dt='2019-02-11' limit 2;
select * from dwd_comment_log where dt='2019-02-11' limit 2;
5) Script schedule: in production this typically runs between 00:30 and 01:00 each morning.
HDFS retention: clean the data up roughly every six months to a year; before deleting, it can be compressed and archived for download.
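A hedged sketch of such a cleanup, dropping one expired daily partition (the table and date here are illustrative):
hive (gmall)> alter table ods_start_log drop if exists partition (dt='2018-02-10');
-- these are external tables, so dropping the partition leaves the files in place;
-- remove (or archive) /warehouse/gmall/ods/ods_start_log/dt=2018-02-10 on HDFS separately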