1. Using HiveServer2 and the Beeline Front End
The role of HiveServer2: it turns Hive into a server-style service that is exposed externally, so that multiple clients can connect to it.
Start the NameNode, DataNode, ResourceManager, and NodeManager first.
In one terminal, run: hive-0.13.1]$ bin/hiveserver2 to start the HiveServer2 service; this is equivalent to: $ bin/hive --service hiveserver2
In a second terminal, run: ~]$ ps -ef | grep java to check for the HiveServer2 process.
In the second terminal, run: hive-0.13.1]$ bin/beeline to start Beeline.
In the second terminal, enter: beeline> !connect jdbc:hive2://hadoop-senior.ibeifeng.com:10000 beifeng beifeng org.apache.hive.jdbc.HiveDriver to connect Beeline to the HiveServer2 service.
HiveServer2 listens on port 10000 by default. To change the port temporarily: $ bin/hiveserver2 --hiveconf hive.server2.thrift.port=14000
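If you change the port this way, point Beeline at the new port when connecting; a quick sketch reusing this setup's hostname:
$ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:14000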
Using Beeline:
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show databases;
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> use default;
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show tables;
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select * from emp; (no MapReduce job is launched)
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select empno from emp; (launches a MapReduce job)
[beifeng@hadoop-senior hive-0.13.1]$ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default
This connects directly to HiveServer2 and opens Beeline in the default database.
beeline> !connect jdbc:hive2://bigdata-senior01.ibeifeng.com:10000
$ bin/beeline -u jdbc:hive2://bigdata-senior01.ibeifeng.com:10000 -n beifeng -p 123456
-u supplies the JDBC connection URL; -n and -p supply the username and password.
$ bin/beeline --help lists the common options and parameters.
Using HiveServer2 over JDBC:
Analysis results are stored in Hive tables, and the front end queries that data through DAO code. Note, however, that HiveServer2 has concurrency limitations, so concurrent access needs to be handled. Always start HiveServer2 before connecting over JDBC.
The JDBC pattern for Hive (filled in here with the host and credentials used above):
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default", "beifeng", "beifeng");
2. Hive Runtime Configuration
Takes effect temporarily, for the current session only, from the CLI (entering the property name without a value prints its current setting):
set hive.fetch.task.conversion;
hive.fetch.task.conversion=minimal
Takes effect permanently via the hive-site.xml configuration file:
<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have
    any aggregations or distincts (which incurs RS), lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>
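To see the difference between the two modes, run the same single-column projection under each setting. A minimal sketch, assuming the emp table used in section 1:
set hive.fetch.task.conversion=minimal;
select empno from emp;   -- single-column projection: launches a MapReduce job
set hive.fetch.task.conversion=more;
select empno from emp;   -- now converted to a plain FETCH task, no MapReduce job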
3. Virtual Columns
One is INPUT__FILE__NAME, which is the input file's name for a mapper task; the other is BLOCK__OFFSET__INSIDE__FILE, which is the current global file position.
INPUT__FILE__NAME shows which file a given row of data comes from:
select deptno,dname,INPUT__FILE__NAME from dept;
BLOCK__OFFSET__INSIDE__FILE gives the row's byte offset within the file:
select deptno,dname,BLOCK__OFFSET__INSIDE__FILE from dept;
Both virtual column names are written with double underscores between the words.
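One practical use is pinpointing where a given row physically lives, for example when hunting down a malformed record. A sketch against the dept table above, assuming it holds the usual demo data with a SALES department:
select deptno, dname, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
from dept
where dname = 'SALES';   -- returns the HDFS file and byte offset of the matching row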
4. Installing the Snappy Compression Format
(1) Install snappy: download the snappy package, then unpack and install it.
Download address: http://google.github.io/snappy/
(2) Recompile the Hadoop 2.x source with native snappy support:
mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy
The compiled native libraries end up under /opt/modules/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native; place them in the Hadoop installation's lib/native directory, then verify:
[beifeng@hadoop-senior hadoop-2.5.0]$ bin/hadoop checknative
15/08/31 23:10:16 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
15/08/31 23:10:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
Run the wordcount MapReduce job with the newly installed snappy support, first without compression and then with snappy map-output compression:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output2
5. Data Compression in Hive
Compression formats: bzip2, gzip, lzo, snappy, etc.
Compression ratio: bzip2 > gzip > lzo (bzip2 saves the most storage space)
Decompression speed: lzo > gzip > bzip2 (lzo decompresses fastest)
Benefits of data compression:
(1) Saves disk I/O (map output compression) and network transfer I/O (reduce output compression).
(2) The data takes up less space.
(3) Jobs run faster (since there is less data to move around).
(4) One caveat: the compressed files must be splittable, i.e. each input split of a compressed file must be processable as an independent task.
Compression and decompression in the MapReduce data flow: map output is compressed before the shuffle, and job output is compressed when written to HDFS.
Compression codecs supported in Hadoop:
Format   Codec class
Zlib org.apache.hadoop.io.compress.DefaultCodec
Gzip org.apache.hadoop.io.compress.GzipCodec
Bzip2 org.apache.hadoop.io.compress.BZip2Codec
Lzo com.hadoop.compression.lzo.LzoCodec
Lz4 org.apache.hadoop.io.compress.Lz4Codec
Snappy org.apache.hadoop.io.compress.SnappyCodec
MapReduce compression property settings:
Usage            Property (in mapred-site.xml, or per job with -D)
Map output       mapreduce.map.output.compress = true
                 mapreduce.map.output.compress.codec = CodecName
Reducer output   mapreduce.output.fileoutputformat.compress = true
                 mapreduce.output.fileoutputformat.compress.codec = CodecName
Example of running MapReduce with Snappy compression configured:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mr/input /user/beifeng/mr/output2
Hive compression property settings. In hive-site.xml:
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
Usage            Properties to SET in the Hive session
Map output       SET hive.exec.compress.intermediate = true;
                 SET mapreduce.map.output.compress = true;
                 SET mapred.map.output.compression.codec = CodecName;
                 SET mapred.map.output.compression.type = BLOCK/RECORD;
Reducer output   SET hive.exec.compress.output = true;
                 SET mapred.output.compression.codec = CodecName;
                 SET mapred.output.compression.type = BLOCK/RECORD;
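Putting the settings together, a sketch of a single session that compresses both the intermediate map output and the final query output with Snappy; the table name emp_snappy is hypothetical:
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table emp_snappy as select * from emp;   -- emp_snappy is hypothetical; its result files come out snappy-compressed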
6. Data File Storage Formats
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
Storage formats fall into two families: row-oriented and column-oriented.
(1) ORCFile (Optimized Row Columnar File): supported by Hive, Shark, and Spark. Use the ORC format for tables with many columns.
(2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is more complex; its design was inspired mainly by Dremel. The main highlights of its storage layout are support for nested data structures and an efficient, varied set of encodings (to compress values with different distribution characteristics).
(1) Stored as TEXTFILE
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
load data local inpath '/opt/datas/page_views.data' into table page_views ;
dfs -du -h /user/hive/warehouse/page_views/ ;
18.1 M /user/hive/warehouse/page_views/page_views.data
(2) Stored as ORC
create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;
insert into table page_views_orc select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc/ ;
2.6 M /user/hive/warehouse/page_views_orc/000000_0
(3) Stored as Parquet
create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;
insert into table page_views_parquet select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet/ ;
13.1 M /user/hive/warehouse/page_views_parquet/000000_0
(4) Stored as ORC with snappy compression
create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
insert into table page_views_orc_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;
(5) Stored as ORC without compression
create table page_views_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
insert into table page_views_orc_none select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;
(6) Stored as Parquet with snappy compression
set parquet.compression=SNAPPY ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;
insert into table page_views_parquet_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;
In real-world project development, Hive table data is generally stored in the ORC or Parquet format, and generally compressed with snappy.