1. Using HiveServer2 and the Beeline Front End
The role of HiveServer2: it turns Hive into a server-style service that is exposed externally, so that multiple clients can connect to it.
Start the NameNode, DataNode, ResourceManager, and NodeManager first.
In one terminal, run: hive-0.13.1]$ bin/hiveserver2 to start the HiveServer2 service; this is equivalent to: $ bin/hive --service hiveserver2
In a second terminal, run: ~]$ ps -ef | grep java to check for the HiveServer2 process.
In the second terminal, run: hive-0.13.1]$ bin/beeline to start Beeline.
In the second terminal, enter: beeline> !connect jdbc:hive2://hadoop-senior.ibeifeng.com:10000 beifeng beifeng org.apache.hive.jdbc.HiveDriver to connect Beeline to the HiveServer2 service.
HiveServer2 listens on port 10000 by default. To change the port temporarily: $ bin/hiveserver2 --hiveconf hive.server2.thrift.port=14000
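If you change the port this way, point Beeline at the new port when connecting; a quick sketch reusing this setup's hostname:
$ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:14000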
Using Beeline:
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show databases;
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> use default;
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show tables;
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select * from emp; (no MapReduce job is launched)
0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select empno from emp; (launches a MapReduce job)
[beifeng@hadoop-senior hive-0.13.1]$ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default
This connects directly to HiveServer2 and opens Beeline in the default database.
beeline> !connect jdbc:hive2://bigdata-senior01.ibeifeng.com:10000
$ bin/beeline -u jdbc:hive2://bigdata-senior01.ibeifeng.com:10000 -n beifeng -p 123456
-u supplies the JDBC connection URL; -n and -p supply the username and password.
$ bin/beeline --help lists the common options and parameters.
Using HiveServer2 over JDBC:
Analysis results are stored in Hive tables, and the front end queries that data through DAO code. Note, however, that HiveServer2 has concurrency limitations, so concurrent access needs to be handled. Always start HiveServer2 before connecting over JDBC.
The JDBC pattern for Hive (filled in here with the host and credentials used above):
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default", "beifeng", "beifeng");
2. Hive Runtime Configuration
Takes effect temporarily, for the current session only, from the CLI (entering the property name without a value prints its current setting):
set hive.fetch.task.conversion;
hive.fetch.task.conversion=minimal
Takes effect permanently via the hive-site.xml configuration file:
<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to single FETCH task minimizing latency.
    Currently the query should be single sourced not having any subquery and should not have
    any aggregations or distincts (which incurs RS), lateral views and joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>
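To see the difference between the two modes, run the same single-column projection under each setting. A minimal sketch, assuming the emp table used in section 1:
set hive.fetch.task.conversion=minimal;
select empno from emp;   -- single-column projection: launches a MapReduce job
set hive.fetch.task.conversion=more;
select empno from emp;   -- now converted to a plain FETCH task, no MapReduce job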
3. Virtual Columns
One is INPUT__FILE__NAME, which is the input file's name for a mapper task; the other is BLOCK__OFFSET__INSIDE__FILE, which is the current global file position.
INPUT__FILE__NAME shows which file a given row of data comes from:
select deptno,dname,INPUT__FILE__NAME from dept;
BLOCK__OFFSET__INSIDE__FILE gives the row's byte offset within the file:
select deptno,dname,BLOCK__OFFSET__INSIDE__FILE from dept;
Both virtual column names are written with double underscores between the words.
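One practical use is pinpointing where a given row physically lives, for example when hunting down a malformed record. A sketch against the dept table above, assuming it holds the usual demo data with a SALES department:
select deptno, dname, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
from dept
where dname = 'SALES';   -- returns the HDFS file and byte offset of the matching row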
4. Installing the Snappy Compression Format
(1) Install snappy: download the snappy package, then unpack and install it.
Download address: http://google.github.io/snappy/
(2) Recompile the Hadoop 2.x source with native snappy support:
mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy
The compiled native libraries end up under /opt/modules/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native; place them in the Hadoop installation's lib/native directory, then verify:
[beifeng@hadoop-senior hadoop-2.5.0]$ bin/hadoop checknative
15/08/31 23:10:16 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
15/08/31 23:10:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so
zlib: true /lib64/libz.so.1
snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1
lz4: true revision:99
bzip2: true /lib64/libbz2.so.1
Run the wordcount MapReduce job with the newly installed snappy support, first without compression and then with snappy map-output compression:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output2
5. Data Compression in Hive
Compression formats: bzip2, gzip, lzo, snappy, etc.
Compression ratio: bzip2 > gzip > lzo (bzip2 saves the most storage space)
Decompression speed: lzo > gzip > bzip2 (lzo decompresses fastest)
Benefits of data compression:
(1) Saves disk I/O (map output compression) and network transfer I/O (reduce output compression).
(2) The data takes up less space.
(3) Jobs run faster (since there is less data to move around).
(4) One caveat: the compressed files must be splittable, i.e. each input split of a compressed file must be processable as an independent task.
Compression and decompression in the MapReduce data flow: map output is compressed before the shuffle, and job output is compressed when written to HDFS.
Compression codecs supported in Hadoop:
Format   Codec class
Zlib org.apache.hadoop.io.compress.DefaultCodec
Gzip org.apache.hadoop.io.compress.GzipCodec
Bzip2 org.apache.hadoop.io.compress.BZip2Codec
Lzo com.hadoop.compression.lzo.LzoCodec
Lz4 org.apache.hadoop.io.compress.Lz4Codec
Snappy org.apache.hadoop.io.compress.SnappyCodec
MapReduce compression property settings:
Usage            Property (in mapred-site.xml, or per job with -D)
Map output       mapreduce.map.output.compress = true
                 mapreduce.map.output.compress.codec = CodecName
Reducer output   mapreduce.output.fileoutputformat.compress = true
                 mapreduce.output.fileoutputformat.compress.codec = CodecName
Example of running MapReduce with Snappy compression configured:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mr/input /user/beifeng/mr/output2
Hive compression property settings. In hive-site.xml:
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>
Usage            Properties to SET in the Hive session
Map output       SET hive.exec.compress.intermediate = true;
                 SET mapreduce.map.output.compress = true;
                 SET mapred.map.output.compression.codec = CodecName;
                 SET mapred.map.output.compression.type = BLOCK/RECORD;
Reducer output   SET hive.exec.compress.output = true;
                 SET mapred.output.compression.codec = CodecName;
                 SET mapred.output.compression.type = BLOCK/RECORD;
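Putting the settings together, a sketch of a single session that compresses both the intermediate map output and the final query output with Snappy; the table name emp_snappy is hypothetical:
SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
create table emp_snappy as select * from emp;   -- emp_snappy is hypothetical; its result files come out snappy-compressed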
6. Data File Storage Formats
file_format:
: SEQUENCEFILE
| TEXTFILE -- (Default, depending on hive.default.fileformat configuration)
| RCFILE -- (Note: Available in Hive 0.6.0 and later)
| ORC -- (Note: Available in Hive 0.11.0 and later)
| PARQUET -- (Note: Available in Hive 0.13.0 and later)
| AVRO -- (Note: Available in Hive 0.14.0 and later)
| INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
Storage formats fall into two families: row-oriented and column-oriented.
(1) ORCFile (Optimized Row Columnar File): supported by Hive, Shark, and Spark. Use the ORC format for tables with many columns.
(2) Parquet (open-sourced by Twitter and Cloudera; supported by Hive, Spark, Drill, Impala, Pig, and others). Parquet is more complex; its design was inspired mainly by Dremel. The main highlights of its storage layout are support for nested data structures and an efficient, varied set of encodings (to compress values with different distribution characteristics).
(1) Stored as TEXTFILE
create table page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE ;
load data local inpath '/opt/datas/page_views.data' into table page_views ;
dfs -du -h /user/hive/warehouse/page_views/ ;
18.1 M /user/hive/warehouse/page_views/page_views.data
(2) Stored as ORC
create table page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc ;
insert into table page_views_orc select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc/ ;
2.6 M /user/hive/warehouse/page_views_orc/000000_0
(3) Stored as Parquet
create table page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS PARQUET ;
insert into table page_views_parquet select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet/ ;
13.1 M /user/hive/warehouse/page_views_parquet/000000_0
(4) Stored as ORC with snappy compression
create table page_views_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
insert into table page_views_orc_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;
(5) Stored as ORC without compression
create table page_views_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");
insert into table page_views_orc_none select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;
(6) Stored as Parquet with snappy compression
set parquet.compression=SNAPPY ;
create table page_views_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS parquet;
insert into table page_views_parquet_snappy select * from page_views ;
dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;
In real-world project development, Hive table data is generally stored in the ORC or Parquet format, and generally compressed with snappy.