HiveServer2, Beeline, and Data Compression and Storage in Hive

1. Using HiveServer2 and the Beeline Front End

What HiveServer2 does: it exposes Hive as a server so that multiple clients can connect to it.

First start the NameNode, DataNode, ResourceManager, and NodeManager.

In one terminal window, run hive-0.13.1]$ bin/hiveserver2 to start the HiveServer2 service. This is equivalent to: $ bin/hive --service hiveserver2

In a second window, run ~]$ ps -ef | grep java to check the HiveServer2 process.

In the second window, run hive-0.13.1]$ bin/beeline to start Beeline.

At the Beeline prompt, run beeline> !connect jdbc:hive2://hadoop-senior.ibeifeng.com:10000 beifeng beifeng org.apache.hive.jdbc.HiveDriver to connect Beeline to the HiveServer2 service.

HiveServer2 listens on port 10000 by default. To change the port for a single run: $ bin/hiveserver2 --hiveconf hive.server2.thrift.port=14000 (to make the change permanent, set hive.server2.thrift.port in hive-site.xml).

Using Beeline:

0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show databases;

  0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> use default;

  0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> show tables;

0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select * from emp; (no MapReduce job is launched)

0: jdbc:hive2://hadoop-senior.ibeifeng.com:10> select empno from emp; (launches a MapReduce job)

  [beifeng@hadoop-senior hive-0.13.1]$ bin/beeline -u jdbc:hive2://hadoop-senior.ibeifeng.com:10000/default

This connects directly to HiveServer2 and drops into Beeline in the default database.

  $ !connect jdbc:hive2://bigdata-senior01.ibeifeng.com:10000

  $ bin/beeline -u jdbc:hive2://bigdata-senior01.ibeifeng.com:10000 -n beifeng -p 123456

The -u option specifies the JDBC connection URL (and -n / -p supply the user name and password).

$ bin/beeline --help lists the commonly used options and parameters.

Using HiveServer2 over JDBC:

Analysis results are stored in Hive tables, and the front end queries them through DAO code. Note that HiveServer2 has known concurrency limitations, so concurrent access must be handled with care. Always start HiveServer2 before opening a JDBC connection.

The Hive JDBC usage pattern (host, port, database, user, and password are placeholders):

Class.forName("org.apache.hive.jdbc.HiveDriver");

Connection conn = DriverManager.getConnection("jdbc:hive2://<host>:<port>/<db>", "<user>", "<password>");

Statement stmt = conn.createStatement();

ResultSet rs = stmt.executeQuery("select empno from emp");

2. Hive Runtime Configuration

Setting that takes effect temporarily, from the shell command line:

  set hive.fetch.task.conversion;

  hive.fetch.task.conversion=minimal

Setting that takes effect permanently, via hive-site.xml:

<property>
  <name>hive.fetch.task.conversion</name>
  <value>minimal</value>
  <description>
    Some select queries can be converted to a single FETCH task, minimizing latency.
    Currently the query should be single sourced, not having any subquery, and should not have
    any aggregations or distincts (which incur RS), lateral views, or joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (TABLESAMPLE, virtual columns)
  </description>
</property>
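For example, with the value raised to more, a simple projection with a limit comes back as a single FETCH task, while an aggregation still launches a MapReduce job. A minimal sketch in Beeline, reusing the emp table from above:

set hive.fetch.task.conversion=more;
select empno from emp limit 10;  -- runs as a single FETCH task, no MapReduce job
select count(*) from emp;        -- aggregation, still launches a MapReduce job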

3. Virtual Columns

Hive provides two virtual columns. One is INPUT__FILE__NAME, the name of the input file for a mapper task; the other is BLOCK__OFFSET__INSIDE__FILE, the current global file position.

INPUT__FILE__NAME tells you which file each row came from:

  select deptno,dname,INPUT__FILE__NAME from dept;

BLOCK__OFFSET__INSIDE__FILE gives the row's byte offset within its file:

  select deptno,dname,BLOCK__OFFSET__INSIDE__FILE from dept;

Note that both virtual column names are written with double underscores. Used together, they can pinpoint exactly where a row lives on disk, as in the sketch below.
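A small sketch combining the two columns (the NULL predicate is just an illustrative filter for isolating suspect rows):

select deptno, dname, INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
from dept
where dname is null;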

4. Installing the Snappy Compression Format

(1) Install Snappy: download the Snappy package, then extract and install it.

Snappy download page: http://google.github.io/snappy/

(2) Build the Hadoop 2.x source with native Snappy support:

mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy

The compiled native libraries end up under /opt/modules/hadoop-2.5.0-src/target/hadoop-2.5.0/lib/native; copy them into the Hadoop installation's lib/native directory (/opt/modules/hadoop-2.5.0/lib/native), then verify:

  [beifeng@hadoop-senior hadoop-2.5.0]$ bin/hadoop checknative

  15/08/31 23:10:16 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native

  15/08/31 23:10:16 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

  Native library checking:

  hadoop: true /opt/modules/hadoop-2.5.0/lib/native/libhadoop.so

  zlib: true /lib64/libz.so.1

  snappy: true /opt/modules/hadoop-2.5.0/lib/native/libsnappy.so.1

  lz4: true revision:99

  bzip2: true /lib64/libbz2.so.1

Now run the wordcount MapReduce example, first without compression and then with Snappy-compressed map output:

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mapreduce/wordcount/input /user/beifeng/mapreduce/wordcount/output2

5. Data Compression in Hive

Common compression formats: bzip2, gzip, lzo, snappy, etc.

Compression ratio: bzip2 > gzip > lzo (bzip2 saves the most storage space).

Decompression speed: lzo > gzip > bzip2 (lzo decompresses fastest).

Benefits of data compression:

(1) Less disk I/O (from map output compression) and less network transfer I/O (from reduce output compression).

(2) Smaller data size.

(3) Better job performance, since there is less data to move.

(4) Splittability of the compressed file must be considered, i.e., each input split of a compressed file should be processable independently (bzip2 files are splittable, for example, while gzip files are not).

Data is compressed and decompressed at several points in the MapReduce pipeline.

Compression codecs supported in Hadoop:

Compression format  Codec class

  Zlib  org.apache.hadoop.io.compress.DefaultCodec

  Gzip  org.apache.hadoop.io.compress.GzipCodec

  Bzip2  org.apache.hadoop.io.compress.BZip2Codec

  Lzo  com.hadoop.compression.lzo.LzoCodec

  Lz4  org.apache.hadoop.io.compress.Lz4Codec

  Snappy  org.apache.hadoop.io.compress.SnappyCodec

MapReduce compression properties:

Usage  Property (set in mapred-site.xml, or per job with -D)

Map output  mapreduce.map.output.compress=true

mapreduce.map.output.compress.codec=CodecName

Reducer output  mapreduce.output.fileoutputformat.compress=true

mapreduce.output.fileoutputformat.compress.codec=CodecName

Example: running MapReduce with Snappy map output compression:

  bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.0.jar wordcount -Dmapreduce.map.output.compress=true -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec /user/beifeng/mr/input /user/beifeng/mr/output2

Hive compression properties:

<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
</property>

Usage  Session settings (SET commands)

Map output  SET hive.exec.compress.intermediate=true;

SET mapreduce.map.output.compress=true;

SET mapred.map.output.compression.codec=CodecName;

SET mapred.map.output.compression.type=BLOCK/RECORD;

Reducer output  SET hive.exec.compress.output=true;

SET mapred.output.compression.codec=CodecName;

SET mapred.output.compression.type=BLOCK/RECORD;
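For example, to compress both intermediate and final output with Snappy for a session, a sketch (the group-by query is only there to trigger a MapReduce job; the deptno column on emp is assumed from the classic sample schema):

SET hive.exec.compress.intermediate=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
select deptno, count(*) from emp group by deptno;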

6. Data File Storage Formats

  file_format:

  : SEQUENCEFILE

  | TEXTFILE -- (Default, depending on hive.default.fileformat configuration)

  | RCFILE -- (Note: Available in Hive 0.6.0 and later)

  | ORC -- (Note: Available in Hive 0.11.0 and later)

  | PARQUET -- (Note: Available in Hive 0.13.0 and later)

  | AVRO -- (Note: Available in Hive 0.14.0 and later)

  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

Storage formats fall into two categories: row-oriented and column-oriented.

(1) ORCFile (Optimized Row Columnar file): supported by Hive, Shark, and Spark. Use ORC for tables with many columns.

(2) Parquet: open-sourced by Twitter and Cloudera, and supported by Hive, Spark, Drill, Impala, Pig, and others. Parquet is more complex; its design is largely inspired by Dremel. Its main strengths are support for nested data structures and a rich set of efficient encodings and compression schemes suited to different value distributions.

(1) Store as TEXTFILE

  create table page_views(

  track_time string,

  url string,

  session_id string,

  referer string,

  ip string,

  end_user_id string,

  city_id string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

  STORED AS TEXTFILE ;

  load data local inpath '/opt/datas/page_views.data' into table page_views ;

  dfs -du -h /user/hive/warehouse/page_views/ ;

  18.1 M /user/hive/warehouse/page_views/page_views.data

(2) Store as ORC

  create table page_views_orc(

  track_time string,

  url string,

  session_id string,

  referer string,

  ip string,

  end_user_id string,

  city_id string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

  STORED AS orc ;

  insert into table page_views_orc select * from page_views ;

  dfs -du -h /user/hive/warehouse/page_views_orc/ ;

  2.6 M /user/hive/warehouse/page_views_orc/000000_0

(3) Store as Parquet

  create table page_views_parquet(

  track_time string,

  url string,

  session_id string,

  referer string,

  ip string,

  end_user_id string,

  city_id string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

  STORED AS PARQUET ;

  insert into table page_views_parquet select * from page_views ;

  dfs -du -h /user/hive/warehouse/page_views_parquet/ ;

  13.1 M /user/hive/warehouse/page_views_parquet/000000_0

(4) Store as ORC with Snappy compression

  create table page_views_orc_snappy(

  track_time string,

  url string,

  session_id string,

  referer string,

  ip string,

  end_user_id string,

  city_id string

)

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

  STORED AS orc tblproperties ("orc.compress"="SNAPPY");

  insert into table page_views_orc_snappy select * from page_views ;

  dfs -du -h /user/hive/warehouse/page_views_orc_snappy/ ;

(5) Store as ORC without compression

  create table page_views_orc_none(

  track_time string,

  url string,

  session_id string,

  referer string,

  ip string,

  end_user_id string,

  city_id string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

  STORED AS orc tblproperties ("orc.compress"="NONE");

  insert into table page_views_orc_none select * from page_views ;

  dfs -du -h /user/hive/warehouse/page_views_orc_none/ ;

(6) Store as Parquet with Snappy compression

  set parquet.compression=SNAPPY ;

  create table page_views_parquet_snappy(

  track_time string,

  url string,

  session_id string,

  referer string,

  ip string,

  end_user_id string,

  city_id string

  )

  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'

  STORED AS parquet;

  insert into table page_views_parquet_snappy select * from page_views ;

  dfs -du -h /user/hive/warehouse/page_views_parquet_snappy/ ;

In real-world projects, Hive table data is usually stored as ORC or Parquet and compressed with Snappy; a typical pattern is sketched below.
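A sketch of that pattern using CTAS (the table name is hypothetical):

create table page_views_prod
stored as orc tblproperties ("orc.compress"="SNAPPY")
as select * from page_views;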
