Learning Hive

Date: 2019-11-26

Tags: hive, learning; Category: Hadoop

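1. Hive Indexes

Why create an index?
An index speeds up queries that filter on specific columns of a Hive table.
Without an index, a query with a predicate such as 'WHERE tab1.col1 = 10' makes Hive load the entire table or partition and process every row. If an index exists on col1, only part of the files needs to be loaded and processed. As in traditional databases, the faster lookups come at a cost: building the index consumes extra resources, and storing it takes additional disk space.
Indexes were added in Hive 0.7.0, and bitmap indexes followed in Hive 0.8.0. Index support was removed again in Hive 3.0, so the index statements in this section apply only to earlier versions.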
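To make the idea concrete, here is a minimal conceptual sketch in plain Java of how an index lets a query skip most of a table. It is not Hive's implementation: the class name CompactIndexSketch, the block offsets, and the sample values are all assumptions made up for illustration. The index maps each value of col1 to the offsets of the blocks that contain it, so a lookup returns only the blocks worth reading.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Conceptual sketch only: a value -> block-offset index (hypothetical, not Hive code).
final class CompactIndexSketch {
    // Maps each distinct col1 value to the offsets of the blocks that contain it.
    private final Map<Integer, SortedSet<Long>> valueToBlocks = new HashMap<>();

    // Records that a row with the given col1 value lives in the block at blockOffset.
    void add(int col1Value, long blockOffset) {
        valueToBlocks.computeIfAbsent(col1Value, k -> new TreeSet<>()).add(blockOffset);
    }

    // Returns the blocks that a query like WHERE col1 = value would need to read.
    SortedSet<Long> blocksFor(int col1Value) {
        SortedSet<Long> blocks = valueToBlocks.get(col1Value);
        return blocks == null ? Collections.<Long>emptySortedSet() : blocks;
    }

    public static void main(String[] args) {
        CompactIndexSketch index = new CompactIndexSketch();
        // Building the index requires one full pass over the data (the rebuild cost).
        index.add(10, 0L);
        index.add(20, 0L);
        index.add(30, 128L);
        index.add(10, 256L);
        // Only the blocks at offsets 0 and 256 hold col1 = 10; block 128 is skipped.
        System.out.println(index.blocksFor(10));
        // A value that was never indexed yields no blocks at all.
        System.out.println(index.blocksFor(99));
    }
}
```

The map itself is the extra storage the index costs, and filling it is the extra work of building it.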

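Usage:

Create an index:

-- user_index is the index name; user is the table and id is the indexed column
CREATE INDEX user_index ON TABLE user(id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;

Rebuild the index:

-- WITH DEFERRED REBUILD creates an empty index, so it must be rebuilt before use
ALTER INDEX user_index ON user REBUILD;

Drop the index: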

DROP INDEX user_index on user;

Show indexes:

SHOW INDEX on user;

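2. Buckets

Hive can also organize a table or partition into buckets. Bucketing serves two main purposes: it makes sampling more efficient, and it speeds up certain queries, such as map-side joins on the bucketed column.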

CREATE TABLE bucketed_user(id INT,name String) 
CLUSTERED BY (id) INTO 4 BUCKETS;
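The CLUSTERED BY clause above places each row in a bucket derived from a hash of the id column. The sketch below shows the classic scheme under two assumptions: the hash of an INT key is the value itself, and the bucket is the non-negative hash modulo the bucket count. Newer Hive versions may use a different hash function, and BucketSketch and bucketFor are hypothetical names used only for illustration.

```java
// Sketch of hash-modulo bucket assignment (assumed scheme, not Hive source code).
final class BucketSketch {
    // Returns the bucket index for an INT key, assuming the key hashes to itself.
    static int bucketFor(int id, int numBuckets) {
        int hash = Integer.hashCode(id);                 // for an int this is the value itself
        return (hash & Integer.MAX_VALUE) % numBuckets;  // clear the sign bit, then take the modulo
    }

    public static void main(String[] args) {
        // With 4 buckets, ids 1..8 cycle through buckets 1, 2, 3, 0, 1, 2, 3, 0.
        for (int id = 1; id <= 8; id++) {
            System.out.println("id=" + id + " -> bucket " + bucketFor(id, 4));
        }
    }
}
```

Because rows with the same id always land in the same bucket, two tables bucketed the same way on the join key can be joined bucket by bucket.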

Data within a partition can be split further into buckets: first PARTITIONED BY (stat_date string), then CLUSTERED BY (id) SORTED BY (age) INTO 2 BUCKETS.

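-- The original statement had no FROM clause; the source table name videos is an assumption
INSERT OVERWRITE TABLE videos_b
PARTITION(year=1999)
SELECT producer,title,string FROM videos WHERE year=2009 CLUSTER BY title;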

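3. Views

A view lets users focus on the data they care about rather than on everything in the underlying tables. When the data comes from several base tables, or partly from other views, and the search conditions are complex, the queries become tedious to write; defining a view keeps them simple. A view also hides the joins and search conditions from the user, who only needs to query the view. This improves data security, but it does not improve query performance.

Base tables: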

create table student(id int, name string, age int, class_id int);
create table classes(id int, class_name string);

Create the view:

create view stu_cla as select a.id, a.name, a.age, a.class_id, b.class_name from student a join classes b on a.class_id=b.id;

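View mechanisms:

A database can process a view in two ways: substitution or materialization. The description here uses MySQL as the example engine.

Substitution: when a view is queried, the view name is replaced by the view definition. The query becomes, for example, select * from (select c.name as c_name, s.name as stu_name from student s, class c where c.id = s.class_id), which is then submitted to MySQL for execution.

Materialization: MySQL first evaluates the view and keeps the result in memory as a temporary intermediate result. The outer select statement then reads from that intermediate result (a temporary table).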
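The substitution approach can be illustrated with a deliberately naive sketch that inlines a view definition into a query as a subquery. Real engines rewrite the parsed query tree rather than the text, so the string replacement here is only an assumption-laden stand-in; ViewSubstitutionSketch and expandView are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Naive text-level illustration of view substitution (not how a real engine does it).
final class ViewSubstitutionSketch {
    // Replaces each standalone occurrence of viewName with "(definition) viewName".
    static String expandView(String query, String viewName, String viewDefinition) {
        String inlined = "(" + viewDefinition + ") " + viewName;
        return query.replaceAll("\\b" + Pattern.quote(viewName) + "\\b",
                Matcher.quoteReplacement(inlined));
    }

    public static void main(String[] args) {
        String definition = "select a.id, a.name, a.age, a.class_id, b.class_name "
                + "from student a join classes b on a.class_id = b.id";
        String query = "select name, class_name from stu_cla where age > 20";
        // The view name is replaced by its definition, aliased with the view name.
        System.out.println(expandView(query, "stu_cla", definition));
    }
}
```

The outer query is unchanged apart from the inlined subquery, which is why a view processed this way simplifies the query text without making execution any faster.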

4. Data Types

(1) struct

Fields inside a struct are accessed with dot (.) notation.

Usage:

create table stu_test(id int, info struct<name:string, age:int>) row format delimited fields terminated by ',' collection items terminated by ":";
# Test data to load (contents of /opt/hive-test/stu.txt)
1,zhou:30
2,yan:30
3,chen:20
4,li:80
5,wei:18
load data local inpath '/opt/hive-test/stu.txt' overwrite into table stu_test;
select info.name from stu_test;

'FIELDS TERMINATED BY': the separator between fields (columns)
'COLLECTION ITEMS TERMINATED BY': the separator between the items inside a single field
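The way these two delimiters carve up a line of the data file can be sketched in plain Java. This only mimics the splitting for the stu_test layout above and is not Hive's SerDe; StructRowSketch and parseRow are hypothetical names, and the row is assumed to be well formed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for one line of stu.txt (assumes a well-formed row).
final class StructRowSketch {
    static Map<String, String> parseRow(String line) {
        String[] fields = line.split(",", -1);     // FIELDS TERMINATED BY ','
        String[] info = fields[1].split(":", -1);  // COLLECTION ITEMS TERMINATED BY ':'
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", fields[0]);
        row.put("info.name", info[0]);             // first struct member
        row.put("info.age", info[1]);              // second struct member
        return row;
    }

    public static void main(String[] args) {
        // Prints {id=1, info.name=zhou, info.age=30}
        System.out.println(parseRow("1,zhou:30"));
    }
}
```

The field delimiter separates the id column from the info column, and the collection delimiter then separates the struct members in declaration order.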

(2) Array

Usage:

create table class_test(name string, student_list array<int>) row format delimited fields terminated by ',' collection items terminated by ':';
# Data (contents of /opt/hive-test/class.txt)
Class 1,1:2:3:4
Class 2,11:12:13
Class 3,21:22:23
load data local inpath '/opt/hive-test/class.txt' overwrite into table class_test;
select  student_list[0] from class_test;

(3) Map

Usage:

create table student_map(id string, info map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':';
# Data (contents of /opt/hive-test/stu_map.txt; the two columns are separated by a tab)
1       chinese:100,english:100,math:110
2       chinese:90,english:91,math:111
load data local inpath '/opt/hive-test/stu_map.txt' overwrite into table student_map;
select info['math'] from student_map;
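For the map column, three delimiters are in play: one between fields, one between map entries, and one between each key and its value. The sketch below mimics that splitting for the student_map layout in plain Java. It is an illustration under the assumption of well-formed input, not Hive's SerDe, and MapRowSketch and parseInfo are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for the info column of one line of stu_map.txt.
final class MapRowSketch {
    static Map<String, String> parseInfo(String line) {
        String[] fields = line.split("\t", -1);       // FIELDS TERMINATED BY '\t'
        Map<String, String> info = new LinkedHashMap<>();
        for (String entry : fields[1].split(",")) {   // COLLECTION ITEMS TERMINATED BY ','
            String[] kv = entry.split(":", 2);        // MAP KEYS TERMINATED BY ':'
            info.put(kv[0], kv[1]);
        }
        return info;
    }

    public static void main(String[] args) {
        Map<String, String> info = parseInfo("1\tchinese:100,english:100,math:110");
        // Equivalent in spirit to: select info['math'] from student_map;
        System.out.println(info.get("math"));         // prints 110
    }
}
```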

5. Running Hive

There are roughly three ways to run Hive scripts:
1. interactively in the Hive console;
2. with hive -e "SQL";
3. with hive -f followed by a SQL file.

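6. Extension Interfaces

(1) CLI

hive.cli.print.header: when set to true, column names are printed along with the query results. The default is false.

hive.cli.print.current.db: when set to true, the name of the current database is printed. The default is false.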

7. User-Defined Functions

(1) UDF

Maven dependencies:

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>2.1.1</version>
</dependency>
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-common</artifactId>
    <version>2.1.1</version>
</dependency>

UDF implementation:

package com.qf58.bdp.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.serde2.ByteStream;
import org.apache.hadoop.hive.serde2.io.DoubleWritable;
import org.apache.hadoop.hive.serde2.lazy.LazyInteger;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

/**
 * Description: a Hive UDF that adds two values of the same type.
 *
 * @Author: weishenpeng
 * Date: 2018/1/25
 * Time: 11:44 AM
 */
public class OperationAddUDF extends UDF {
	private final ByteStream.Output out = new ByteStream.Output();

	/**
	 * Adds two INT values.
	 *
	 * @param num1 first operand
	 * @param num2 second operand
	 * @return the sum, or null if either operand is null
	 */
	public IntWritable evaluate(IntWritable num1, IntWritable num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		return new IntWritable(num1.get() + num2.get());
	}

	/**
	 * Adds two DOUBLE values.
	 *
	 * @param num1 first operand
	 * @param num2 second operand
	 * @return the sum, or null if either operand is null
	 */
	public DoubleWritable evaluate(DoubleWritable num1, DoubleWritable num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		return new DoubleWritable(num1.get() + num2.get());
	}

	/**
	 * Adds two FLOAT values.
	 *
	 * @param num1 first operand
	 * @param num2 second operand
	 * @return the sum, or null if either operand is null
	 */
	public FloatWritable evaluate(FloatWritable num1, FloatWritable num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		return new FloatWritable(num1.get() + num2.get());
	}

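	/**
	 * Adds two integers passed as text.
	 *
	 * @param num1 first operand, a string holding an integer
	 * @param num2 second operand, a string holding an integer
	 * @return the sum as text, or null if an operand is null or not a valid integer
	 */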
	public Text evaluate(Text num1, Text num2) {
		if (num1 == null || num2 == null) {
			return null;
		}
		try {
			Integer n1 = Integer.valueOf(num1.toString());
			Integer n2 = Integer.valueOf(num2.toString());
			Integer result = n1 + n2;
			out.reset();
			LazyInteger.writeUTF8NoException(out, result);
			Text text = new Text();
			text.set(out.getData(), 0, out.getLength());
			return text;
		} catch (Exception e) {
			return null;
		}
	}
}

Add the JAR to the classpath:

add jar /opt/hive-test/hive-udf-addUDF.jar;

Create the function addUDF:

create temporary function addUDF as 'com.qf58.bdp.hive.udf.OperationAddUDF';

Drop the function addUDF:

drop temporary function if exists addUDF;

Using the function:

select addUDF(id, age) as ddd from student;

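8. Interacting with the Host Environment

(1) Running Linux commands

Format:
Prefix the Linux command with ! (an exclamation mark) and end it with ; (a semicolon).
Example: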

!ls;
!pwd;

(2) Running HDFS commands

Format:

Start the HDFS command with dfs and end it with a semicolon.

Example:

dfs -ls /;
dfs -mkdir /hive123;
