Hive In-Depth Usage

Date: 2019-11-12


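I. HiveServer2 and beeline  --> JDBC interface for front ends

  1) Start the server with bin/hiveserver2, then open the client:

    bin/beeline

    !connect jdbc:hive2://localhost:10000 user passwd org.apache.hive.jdbc.HiveDriver

  2) bin/beeline -u jdbc:hive2://localhost:10000/database

  3) JDBC access

    Analysis results are stored in a Hive table (result), and the front end queries that data through DAO code  --> JDBC has some problems under concurrent access that need to be handled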

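II. Common data compression in Hive

  1) Install snappy:  yum -y install snappy snappy-devel

  2) Compile the Hadoop source with snappy support

      mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

      Use /opt/moduels/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native to replace the existing native library directory

  3) Set the parameters and test:  bin/yarn jar share/... wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec inputpath outputpath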

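III. Hive data storage

  1) Storage format  --> how the file is laid out on disk (not the table's schema)

      Row-oriented:  SEQUENCEFILE (serialized), TEXTFILE (default)

      Column-oriented:  RCFILE, ORC, PARQUET

  2) Compression:  stored as orc tblproperties ("orc.compress"="SNAPPY");  --> note the uppercase codec name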

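IV. Hive optimization

  1) FetchTask  --> fetch the data directly without running MapReduce

      set hive.fetch.task.conversion=more;

  2) Split large tables into sub-tables  --> create the sub-tables with create table ... as select

  3) External tables and partitioned tables

    External tables  --> several projects analyze the same data; the storage location usually has to be specified explicitly

    Partitioned tables  --> partition by time; multi-level partitions are possible

  4) Data format: storage format and compression

  5) SQL optimization: join, filter  --> set hive.auto.convert.join=true;

      Common/Shuffle/Reduce Join  --> the join happens in the Reduce Task; large table joined with large table

      Map Join  --> the join happens in the Map Task; large table joined with small table

        The large table's data is read from files

        The small table's data is held in memory  --> implemented through the DistributedCache class

      SMB Join  --> Sort-Merge-Bucket Join  --> hive.auto.convert.sortmerge.join, hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge

        When creating the table, use clustered by (...) into num buckets  --> rows are distributed across num buckets by the hash of the clustered column,

          and the data within each bucket is sorted on that column (declared with sorted by)

        The join is then performed bucket by bucket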
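The map join idea above can be illustrated outside Hive with a minimal Python sketch: the small table is loaded into an in-memory dictionary (roughly the role the distributed cache plays for each map task), and the large table is streamed through once. The function name, the table layout, and the sample rows are all hypothetical; this is not Hive's implementation.

```python
from collections import defaultdict


def map_join(big_rows, small_rows):
    """Join two iterables of (key, value) pairs the way a map-side join does.

    small_rows is loaded fully into memory; big_rows is streamed once.
    Hypothetical helper for illustration only, not Hive code.
    """
    # Build the in-memory hash table from the small table.
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    # Stream the big table and probe the hash table; no shuffle or reduce step.
    for key, big_value in big_rows:
        for small_value in lookup.get(key, []):
            yield key, big_value, small_value


# Made-up sample data: a larger "fact" table and a small "dimension" table.
orders = [(1, "order-a"), (2, "order-b"), (1, "order-c"), (3, "order-d")]
users = [(1, "alice"), (2, "bob")]

joined = list(map_join(orders, users))
```

Rows from the large table with no matching key (key 3 here) are dropped, which corresponds to an inner join.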

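  6) Hive execution plans

      explain [extended|dependency|authorization] <SQL statement>;

  7) Parallel execution  --> jobs with no dependencies on each other can run in parallel

    hive.exec.parallel.thread.number (<20)   hive.exec.parallel

  8) JVM reuse  --> Map Tasks/Reduce Tasks run inside a JVM; with reuse, several tasks run in one JVM instead of starting a new one each time

    mapreduce.job.jvm.numtasks (do not set it too high, <9)

  9) Number of reducers  --> mapreduce.job.reduces

  10) Speculative execution  --> with skewed data some tasks run for a long time; by default the framework assumes such a task has a problem and launches a second attempt, taking the result of whichever finishes first. Turn it off when running SQL statements

    hive.mapred.reduce.tasks.speculative.execution (default true)

    mapreduce.map.speculative   mapreduce.reduce.speculative

  11) Number of mappers  --> hive.merge.size.per.task (set according to the block size)

  12) Dynamic partition tuning  --> partitions of a partitioned table are created automatically

  13) SQL statement checks: nonstrict, strict   set hive.mapred.mode;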

V. Hive case study

  1) Log analysis

    a. Create the source table

      Irregular source data  --> parse it with a regular expression, or preprocess it with MapReduce

        create table tablename()

          row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'  --> serialization/deserialization class

          with serdeproperties("input.regex"="regular expression","output.format.string"="format expression")

          stored as textfile;

      Regular expression used for the log lines:

      (\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")

create table if not exists bf_log_src(
  remote_addr string,
  remote_user string,
  time_local string,
  request string,
  status string,
  body_bytes_sent string,
  request_body string,
  http_referer string,
  http_user_agent string,
  http_x_forwarded_for string,
  host string)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties("input.regex"="(\"[^ ]*\") (\"-|[^ ]*\") (\"[^\]]*\") (\"[^\"]*\") (\"[0-9]*\") (\"[0-9]*\") (-|[^ ]*) (\"[^ ]*\") (\"[^\"]*\") (-|[^ ]*) (\"[^ ]*\")")
stored as textfile;

Note: the contrib class causes no problems when creating the source table and loading data into it, but loading data from it into a sub-table makes the map tasks fail. Do not use this class; the built-in org.apache.hadoop.hive.serde2.RegexSerDe is one alternative to try.
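Before wiring a pattern like this into a SerDe, it can help to check it against a sample line outside Hive. The sketch below applies the same expression with Python's `re` module (the `\"` escapes are only needed inside the Hive string literal, so they are written as plain quotes here). The sample log line is invented for illustration, and Python's regex engine is not identical to the Java one Hive uses, so treat this as a quick sanity check rather than a guarantee.

```python
import re

# Same pattern as in the table definition, without the Hive string escapes.
LOG_PATTERN = re.compile(
    r'("[^ ]*") ("-|[^ ]*") ("[^\]]*") ("[^"]*") ("[0-9]*") ("[0-9]*") '
    r'(-|[^ ]*) ("[^ ]*") ("[^"]*") (-|[^ ]*) ("[^ ]*")'
)

# Column names taken from the bf_log_src definition.
COLUMNS = [
    "remote_addr", "remote_user", "time_local", "request", "status",
    "body_bytes_sent", "request_body", "http_referer", "http_user_agent",
    "http_x_forwarded_for", "host",
]


def parse_line(line):
    """Return a dict of column -> raw captured text, or None if no match."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Invented sample line in the quoted format the pattern expects.
sample = (
    '"192.0.2.10" "-" "31/Aug/2015:00:04:37 +0800" '
    '"GET /course/view.php?id=27 HTTP/1.1" "303" "440" - '
    '"http://www.example.com/user.php?act=mycourse" '
    '"Mozilla/5.0 (Windows NT 6.1)" "-" "learn.example.com"'
)

row = parse_line(sample)
```

The captured values still carry their surrounding quotes, which is why the cleaning step later in this section removes them.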

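    b. Create separate sub-tables for different business needs

      Storage format  --> orcfile/parquet

      Data compression

      Compress the intermediate map output  --> snappy

      Use external tables (and create partitioned tables)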

 > create table if not exists bf_log_comm(
         > remote_addr string,
         > time_local string,
         > request string,
         > http_referer string)
         > row format delimited fields terminated by '\t'
         > stored as orc tblproperties ("orc.compress"="SNAPPY");
 > insert into table bf_log_comm select remote_addr,time_local,request,http_referer from bf_log_src;
 > select * from bf_log_comm limit 5;

    c. Clean the data

      Process the source table data with custom UDFs

        First UDF: remove the quotation marks

        Second UDF: convert the date and time
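The logic of the two cleaning steps can be sketched in Python. Hive UDFs themselves are normally written in Java, so the functions below are only a stand-in for the transformation. The input time format (nginx-style `31/Aug/2015:00:04:37 +0800`) and the output format `yyyyMMddHHmmss` are assumptions; the output format was chosen because `substring(time_local,9,2)` then yields the hour, as the query in the analysis step expects.

```python
from datetime import datetime


def remove_quotes(value):
    """Strip surrounding whitespace and double quotes (first cleaning step)."""
    if value is None:
        return None
    return value.strip().strip('"')


def convert_time(value):
    """Convert e.g. '"31/Aug/2015:00:04:37 +0800"' to '20150831000437'.

    Returns None when the value cannot be parsed (for example '-').
    The input and output formats are assumptions, not taken from the source.
    """
    cleaned = remove_quotes(value)
    try:
        parsed = datetime.strptime(cleaned, "%d/%b/%Y:%H:%M:%S %z")
    except (TypeError, ValueError):
        return None
    return parsed.strftime("%Y%m%d%H%M%S")


cleaned_request = remove_quotes('"GET / HTTP/1.1"')
cleaned_time = convert_time('"31/Aug/2015:00:04:37 +0800"')
```

Note that `%b` relies on English month abbreviations, which holds under Python's default C locale.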

    d. Analyze the data with SQL statements

      desc function extended substring;

    Count page views grouped by time period (hour), sorted in descending order

select t.hour,count(*) cnt from
         > (select substring(time_local,9,2) hour from bf_log_comm) t
         > group by t.hour
         > order by cnt desc;

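    Analyze the geographic distribution of IPs (a UDF should be used to preprocess the data and extract the first two fields of the IP)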

select t.prex_ip,count(*) cnt from
(select substring(remote_addr,1,7) prex_ip from bf_log_comm) t
group by t.prex_ip
order by cnt desc
limit 5;
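The fixed-width `substring(remote_addr,1,7)` only isolates the first two fields when they happen to fill exactly seven characters, which is why a UDF is recommended. A minimal Python sketch of that extraction logic follows; the function name and the validation rules are hypothetical, and a real Hive UDF would typically be written in Java.

```python
def ip_prefix(remote_addr, parts=2):
    """Return the first `parts` dot-separated fields of an IPv4 address.

    Surrounding quotes are removed first; returns None for anything that
    is not a well-formed dotted-quad address. Hypothetical helper.
    """
    cleaned = remote_addr.strip().strip('"')
    octets = cleaned.split(".")
    if len(octets) != 4:
        return None
    if not all(o.isdecimal() and int(o) <= 255 for o in octets):
        return None
    return ".".join(octets[:parts])


prefix_quoted = ip_prefix('"192.0.2.10"')
prefix_short = ip_prefix("10.1.22.3")
fixed_width = "10.1.22.3"[:7]  # what substring(remote_addr,1,7) would return
```

For `10.1.22.3` the fixed-width slice spills into the third field, while the split-based version stops after the second.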

    e. Analyze the data with a Python script

　　　　　　https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-MovieLensUserRatings
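The linked guide runs a Python script over table rows through Hive's TRANSFORM clause, which feeds rows to the script as tab-separated lines on standard input and reads tab-separated lines back. A sketch in the same spirit for the bf_log_comm table is shown below. The script name `hour_mapper.py`, the output columns, and the assumption that time_local has already been cleaned to `yyyyMMddHHmmss` are all hypothetical.

```python
import sys


def transform(lines):
    """Yield 'hour<TAB>remote_addr' for each well-formed tab-separated row.

    Expects four columns per row: remote_addr, time_local, request,
    http_referer. Rows with a different column count are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue
        remote_addr, time_local, _request, _http_referer = fields
        hour = time_local[8:10]  # same slice as substring(time_local, 9, 2)
        yield "\t".join([hour, remote_addr])


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read rows from stdin and write transformed rows to stdout."""
    for out in transform(stdin):
        stdout.write(out + "\n")


# When saved as a script (hypothetically hour_mapper.py), call main() at the
# end of the file so that Hive can stream rows through it, for example:
#   add file hour_mapper.py;
#   select transform(remote_addr, time_local, request, http_referer)
#     using 'python hour_mapper.py' as (hour, remote_addr)
#   from bf_log_comm;
```

Keeping the row logic in `transform` and the I/O in `main` makes the script easy to check locally with a few sample lines before running it on the cluster.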
