3.2-3.3 Common Data Compression in Hive

1. Data Compression

1) Overview

Data compression:
    smaller data volume
    * less local disk I/O
    * less network I/O


Hadoop jobs are typically IO-bound;
compression reduces the size of data transferred across the network;
simply enabling compression can improve overall job performance;
the data to be compressed must be splittable.



2) When to Compress?

1. Use Compressed Map Input
· MapReduce jobs read input from HDFS
· Compress if input data is large; this will reduce disk read cost
· Compress with splittable algorithms like Bzip2
· Or use compression with splittable file structures such as Sequence Files, RC Files, etc. (see the sketch below)
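A minimal sketch of compressed map input, reusing the wc.input file and HDFS paths from the test section later in this article (the input-bz2/output-bz2 directory names are made up for illustration). Bzip2 is splittable, and Hadoop recognizes the .bz2 extension automatically:

# compress the input with a splittable codec, keeping the original
[root@hadoop-senior hadoop-2.5.0]# bzip2 -k /opt/datas/wc.input

# upload the compressed file and run wordcount directly on it
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -mkdir -p /user/root/mapreduce/wordcount/input-bz2
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -put /opt/datas/wc.input.bz2 /user/root/mapreduce/wordcount/input-bz2
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/root/mapreduce/wordcount/input-bz2 /user/root/mapreduce/wordcount/output-bz2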


2. Compress Intermediate Data
· Map output is written to disk (spill) and transferred across the network
· Always use compression to reduce both disk write and network transfer load
· Beneficial from a performance point of view even if input and output are uncompressed
· Use faster codecs such as Snappy, LZO


3. Compress Reducer Output
· MapReduce output is used both for archiving and for chaining MapReduce jobs
· Use compression to reduce disk space for archiving
· Compression is also beneficial for chaining jobs, especially with limited disk throughput
· Use compression methods with a higher compression ratio to save more disk space (see the per-job sketch in section 4 below)


3) Supported Codecs in Hadoop

Zlib   → org.apache.hadoop.io.compress.DefaultCodec
Gzip   → org.apache.hadoop.io.compress.GzipCodec
Bzip2  → org.apache.hadoop.io.compress.BZip2Codec
Lzo    → com.hadoop.compression.lzo.LzoCodec
Lz4    → org.apache.hadoop.io.compress.Lz4Codec
Snappy → org.apache.hadoop.io.compress.SnappyCodec


4) Compression in MapReduce

Compressed Input Usage:
    File format is auto-recognized by its extension.
    Codec must be defined in core-site.xml (see the snippet below).
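A hedged sketch of the core-site.xml entry, listing only codecs whose libraries are actually installed on the cluster (the LZO codec, for instance, additionally requires the hadoop-lzo jar and native library):

<!-- core-site.xml: register the available codec classes listed above -->
<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>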


Compress Intermediate Data (Map Output):
    mapreduce.map.output.compress=true;
    mapreduce.map.output.compress.codec=CodecName;
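To make this the default for every job rather than a per-job setting, the same properties can be placed in mapred-site.xml; a minimal sketch, assuming Snappy is installed:

<!-- mapred-site.xml: compress all map output with Snappy by default -->
<property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
</property>
<property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>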


Compress Job Output (Reducer Output):
    mapreduce.output.fileoutputformat.compress=true;
    mapreduce.output.fileoutputformat.compress.codec=CodecName;
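A per-job sketch in the same style as the wordcount runs later in this article; with GzipCodec the output directory contains part-r-00000.gz files (the output-gz directory name is made up for illustration):

# compress the reducer output with gzip via -D flags
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.output.fileoutputformat.compress=true -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output-gz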


5) Compression in Hive

Compressed Input Usage:
Can be defined in the table definition:
STORED AS INPUTFORMAT
"com.hadoop.mapred.DeprecatedLzoTextInputFormat"


Compress Intermediate Data (Map Output):
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=CodecName;
SET mapred.map.output.compression.type=BLOCK/RECORD;
Use faster codecs such as Snappy, LZO, LZ4.
Useful for chained MapReduce jobs with lots of intermediate data, such as joins (see the sketch below).
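A sketch of these settings in front of a join; the emp table is the one queried later in this article, and the deptno column is an assumption made for illustration:

-- enable Snappy for intermediate (map output) data, then run a self-join
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SELECT a.deptno, count(*)
FROM emp a JOIN emp b ON a.deptno = b.deptno
GROUP BY a.deptno;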


Compress Job Output (Reducer Output):
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=CodecName;
SET mapred.output.compression.type=BLOCK/RECORD;
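A sketch of compressed job output with CREATE TABLE AS SELECT; the emp_snappy table name is made up for illustration, and the files written under the new table's HDFS directory get a .snappy extension:

-- write the query result as Snappy-compressed files
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
CREATE TABLE emp_snappy AS SELECT * FROM emp;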


2. The Snappy Tool

1) Introduction

In a Hadoop cluster, Snappy is a good compression tool: compared with gzip it has a big advantage in compression and decompression speed and uses relatively less CPU, but its compression ratio is lower than gzip's. Each has its own uses.

Snappy is a compression/decompression library written in C++, designed for high compression speed at a reasonable compression ratio. Snappy is faster than zlib, but its output is 20% to 100% larger. On a Core i7 in 64-bit mode, it can compress at roughly 250-500 MB per second.

Snappy's predecessor was Zippy. Although it is only a data compression library, Google uses it in many internal projects, including BigTable, MapReduce, and RPC. Google states that the library and its algorithm are optimized for data processing speed, at the cost of output size and compatibility with other similar tools. Snappy is specifically optimized for 64-bit x86 processors and reaches at least 250 MB/s compression and 500 MB/s decompression on a single Intel Core i7 core.

If some compression ratio can be sacrificed, even higher compression speed is possible. Although the resulting files may be 20% to 100% larger than those of other libraries, Snappy achieves remarkable speed at a given compression ratio: it compresses plain text 1.5-1.7x faster than other libraries and HTML 2-4x faster, but for JPEG, PNG, and other already-compressed data there is no significant improvement.


2) Making the Snappy Library Available to Hadoop

Precompiled library files are used here.

# The precompiled library files are inside the tarball; extract them first
[root@hadoop-senior softwares]# mkdir 2.5.0-native-snappy

[root@hadoop-senior softwares]# tar zxf 2.5.0-native-snappy.tar.gz -C 2.5.0-native-snappy

[root@hadoop-senior softwares]# cd 2.5.0-native-snappy

[root@hadoop-senior 2.5.0-native-snappy]# ls
libhadoop.a       libhadoop.so        libhadooputils.a  libhdfs.so        libsnappy.a   libsnappy.so    libsnappy.so.1.2.0
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a         libhdfs.so.0.0.0  libsnappy.la  libsnappy.so.1



# Replace the native libraries in the Hadoop installation
[root@hadoop-senior lib]# pwd
/opt/modules/hadoop-2.5.0/lib

[root@hadoop-senior lib]# mv native/ 250-native

[root@hadoop-senior lib]# mkdir native

[root@hadoop-senior lib]# ls
250-native  native  native-bak
  
[root@hadoop-senior lib]# cp /opt/softwares/2.5.0-native-snappy/* ./native/

[root@hadoop-senior lib]# ls native
libhadoop.a       libhadoop.so        libhadooputils.a  libhdfs.so        libsnappy.a   libsnappy.so    libsnappy.so.1.2.0
libhadooppipes.a  libhadoop.so.1.0.0  libhdfs.a         libhdfs.so.0.0.0  libsnappy.la  libsnappy.so.1



# Verify
[root@hadoop-senior hadoop-2.5.0]# bin/hadoop checknative
19/04/25 09:59:51 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
19/04/25 09:59:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so
zlib:   true /lib64/libz.so.1
snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1    # snappy is now true
lz4:    true revision:99
bzip2:  true /lib64/libbz2.so.1


3) MapReduce Compression Test

# Create a test file
[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -mkdir -p /user/root/mapreduce/wordcount/input

[root@hadoop-senior hadoop-2.5.0]# touch /opt/datas/wc.input

[root@hadoop-senior hadoop-2.5.0]# vim !$
hadoop hdfs
hadoop hive
hadoop mapreduce
hadoop hue

[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -put /opt/datas/wc.input /user/root/mapreduce/wordcount/input
put: `/user/root/mapreduce/wordcount/input/wc.input': File exists

[root@hadoop-senior hadoop-2.5.0]# bin/hdfs dfs -ls -R /user/root/mapreduce/wordcount/input
-rw-r--r--   1 root supergroup         12 2019-04-08 15:03 /user/root/mapreduce/wordcount/input/wc.input



# First, run MapReduce without compression
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output



# Run MapReduce with compression
[root@hadoop-senior hadoop-2.5.0]# bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/root/mapreduce/wordcount/input /user/root/mapreduce/wordcount/output2

# -Dmapreduce.map.output.compress=true : compress the map output (-D passes a configuration property)
# -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec : use Snappy compression
# Because the data set is so small, there is essentially no visible difference


3. Configuring Compression in Hive

hive (default)> set mapreduce.map.output.compress=true;
hive (default)> set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;


Test:

Running a SELECT in Hive executes a MapReduce job:

hive (default)> select count(*) from emp;

In the web UI, the job's configuration page shows the settings this job used:

[image: job configuration in the web UI]
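Alternatively, the current value can be checked directly from the Hive CLI; entering SET with a property name but no value prints its current setting:

hive (default)> set mapreduce.map.output.compress;
mapreduce.map.output.compress=true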
