HQL練習

時間 2019-12-14

標籤 hql 練習简体版

原文原文鏈接

Hive學習筆記總結數據庫

05. Hql練習

1. hql基礎練習

題目和數據來源：http://www.w2b-c.com/article/150326(去掉-)函數

create和loadoop

create table students(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile;
create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile;
create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile;

load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student;
load data local inpath '/home/hadoop/hivedata/sc.txt' overwrite into table sc;
load data local inpath '/home/hadoop/hivedata/course.txt' overwrite into table course;

1.查詢全體學生的學號與姓名學習

hive> select Sno,Sname from students;

2.查詢選修了課程的學生姓名code

select distinct Sname from students, sc where students.Sno = sc.Sno;

或：orm

select distinct Sname from students inner join sc on students.Sno = sc.Sno;

3.查詢學生的總人數hadoop

select count(*) from students;

4.計算1號課程的學平生均成績get

select avg(Grade) from sc where Cno = 1;

5.查詢各科成績平均分string

select Cname,avg(Grade) from sc, course where sc.Cno = course.Cno group by sc.Cno;

//Grade要麼出如今group關鍵詞以後，要麼使用聚合函數。it

6.查詢選修1號課程的學生最高分數

select max(Grade) from sc where Cno = 1;

7.求各個課程號及相應的選課人數

select Cno,count(*) from sc group by Cno;

8.查詢選修了3門以上的課程的學生學號

select Sno from sc group by Sno having count(Cno) >3 ;

9.查詢學生信息，結果按學號全局有序

select * from students order by Sno;

10.查詢學生信息，結果區分性別按年齡有序

set mapred.reduce.tasks=2;
select * from students distribute by sex sort by sage;

11.查詢每一個學生及其選修課程的狀況

select students.*,sc.* from students join sc on (students.Sno =sc.Sno);

12.查詢學生的得分狀況
13.查詢選修2號課程且成績在90分以上的全部學生。

select students.Sname from sc,students where sc.Cno = 2 and sc.Grade > 90 and sc.Sno = students.Sno;

或者：

select students.Sname,sc.Grade from students join sc on students.Sno=sc.Sno where  sc.Cno=2 and sc.Grade>90;

14.查詢全部學生的信息，若是在成績表中有成績，則輸出成績表中的課程號

select students.Sname,sc.Cno from students join sc on students.Sno=sc.Sno;

15.重寫如下子查詢爲LEFT SEMI JOIN

SELECT a.key, a.value FROM a WHERE a.key exist in (SELECT b.key FROM B);

查詢目的：查找A中，key值在B中存在的數據。
能夠被重寫爲：

select a.key,a.value from a left semi join b on a.key = b.key;

16.查詢與「劉晨」在同一個系學習的學生

select s1.Sname from students s1 where sdept in (select sdept from students where sname = '劉晨');

或者：

select s1.Sname from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

注意比較：

select * from students s1 left join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 right join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 inner join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
select * from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';

2. 執行順序

標準順序：
select--from--where--group by--having--order by

join操做中，on條件與where條件的區別

數據庫在經過鏈接兩張或多張表來返回記錄時，都會生成一張中間的臨時表，而後再將這張臨時表返回給用戶。
join發生在where字句以前，在使用left jion時，on和where條件的區別以下：

一、on條件是在生成臨時表時使用的條件，它無論on中的條件是否爲真，都會返回左邊表中的記錄。（右邊置爲Null了）

二、where條件是在臨時表生成好後，再對臨時表進行過濾的條件。這時已經沒有left join的含義（必須返回左邊表的記錄）了，條件不爲真的就所有過濾掉。

假設有兩張表：

表1：tab1

id size 
1  10 
2  20 
3  30

表2：tab2

size name 
10   AAA 
20   BBB 
20   CCC

兩條SQL:

一、select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name='AAA'
二、select * from tab1 left join tab2 on tab1.size = tab2.size and tab2.name='AAA'

第一條SQL的過程：
一、中間表
on條件:

tab1.size = tab2.size 
tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 
2 20 20 BBB 
2 20 20 CCC 
3 30 (null) (null)

二、再對中間表過濾
where 條件：

tab2.name='AAA'
tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA

第二條SQL的過程：
一、中間表
on條件:

tab1.size = tab2.size and tab2.name='AAA'
(條件不爲真也會返回左表中的記錄) tab1.id tab1.size tab2.size tab2.name 
1 10 10 AAA 
2 20 (null) (null) 
3 30 (null) (null)

其實以上結果的關鍵緣由就是left join,right join,full join的特殊性，
無論on上的條件是否爲真都會返回left或right表中的記錄，full則具備left和right的特性的並集。

** 而inner join沒這個特殊性，則條件放在on中和where中，返回的結果集是相同的。**

3. Hive實戰--級聯求和（累計報表）

需求：
有以下訪客訪問次數統計表 t_access_times

訪客	月份	訪問次數
A	2015-01	5
A	2015-01	15
B	2015-01	5
A	2015-01	8
B	2015-01	25
A	2015-01	5
A	2015-02	4
A	2015-02	6
B	2015-02	10
B	2015-02	5

須要輸出報表：t_access_times_accumulate
月訪問：當月的總次數；累計訪問總計：截止到當月的月訪問次數之和。

訪客	月份	月訪問總計	累計訪問總計
A	2015-01	33	33
A	2015-02	10	43
B	2015-01	30	30
B	2015-02	15	45

準備數據：
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5

create table t_access_time(username string,month string,salary int)
row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_time;

一、第一步，先求每一個用戶的月總金額

select username,month,sum(salary) from t_access_time group by username,month;

+-----------+----------+---------+--+
| username | month | salary |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+

二、第二步，將月總金額表本身鏈接(自鏈接)

select * from 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
inner join 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
on TabA.username = TabB.username;

+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+

三、第三步，從上一步的結果中
進行分組查詢，分組的字段是a.username a.month
求月累計值：將b.month <= a.month的全部b.salary求和便可

select TabA.username,TabA.month,max(TabA.salary) as month_salary,sum(TabB.salary) as sum_salary 
from 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabA 
inner join 
(select username,month,sum(salary) as salary from t_access_time group by username,month) TabB 
on TabA.username = TabB.username 
where TabB.month<= TabA.month 
group by TabA.username,TabA.month;

max(TabA.salary)不能直接寫成TabA.salary，由於這個字段沒有出如今group by中，也沒有聚合函數，因此使用max表示。

結果：
A 2015-01 33 33
A 2015-02 10 43
B 2015-01 30 30
B 2015-02 15 45

參考http://www.w2b-c.com/article/150326（去掉-）

初接觸，記下學習筆記，還有不少問題，望指導，謝謝。