Hive學習筆記總結數據庫
題目和數據來源:http://www.w2b-c.com/article/150326(去掉-)函數
create和loadoop
create table students(Sno int,Sname string,Sex string,Sage int,Sdept string)row format delimited fields terminated by ','stored as textfile; create table course(Cno int,Cname string) row format delimited fields terminated by ',' stored as textfile; create table sc(Sno int,Cno int,Grade int)row format delimited fields terminated by ',' stored as textfile; load data local inpath '/home/hadoop/hivedata/students.txt' overwrite into table student; load data local inpath '/home/hadoop/hivedata/sc.txt' overwrite into table sc; load data local inpath '/home/hadoop/hivedata/course.txt' overwrite into table course;
1.查詢全體學生的學號與姓名學習
hive> select Sno,Sname from students;
2.查詢選修了課程的學生姓名code
select distinct Sname from students, sc where students.Sno = sc.Sno;
或:orm
select distinct Sname from students inner join sc on students.Sno = sc.Sno;
3.查詢學生的總人數hadoop
select count(*) from students;
4.計算1號課程的學平生均成績get
select avg(Grade) from sc where Cno = 1;
5.查詢各科成績平均分string
select Cname,avg(Grade) from sc, course where sc.Cno = course.Cno group by sc.Cno;
//Grade要麼出如今group關鍵詞以後,要麼使用聚合函數。it
6.查詢選修1號課程的學生最高分數
select max(Grade) from sc where Cno = 1;
7.求各個課程號及相應的選課人數
select Cno,count(*) from sc group by Cno;
8.查詢選修了3門以上的課程的學生學號
select Sno from sc group by Sno having count(Cno) >3 ;
9.查詢學生信息,結果按學號全局有序
select * from students order by Sno;
10.查詢學生信息,結果區分性別按年齡有序
set mapred.reduce.tasks=2; select * from students distribute by sex sort by sage;
11.查詢每一個學生及其選修課程的狀況
select students.*,sc.* from students join sc on (students.Sno =sc.Sno);
12.查詢學生的得分狀況
13.查詢選修2號課程且成績在90分以上的全部學生。
select students.Sname from sc,students where sc.Cno = 2 and sc.Grade > 90 and sc.Sno = students.Sno;
或者:
select students.Sname,sc.Grade from students join sc on students.Sno=sc.Sno where sc.Cno=2 and sc.Grade>90;
14.查詢全部學生的信息,若是在成績表中有成績,則輸出成績表中的課程號
select students.Sname,sc.Cno from students join sc on students.Sno=sc.Sno;
15.重寫如下子查詢爲LEFT SEMI JOIN
SELECT a.key, a.value FROM a WHERE a.key exist in (SELECT b.key FROM B);
查詢目的:查找A中,key值在B中存在的數據。
能夠被重寫爲:
select a.key,a.value from a left semi join b on a.key = b.key;
16.查詢與「劉晨」在同一個系學習的學生
select s1.Sname from students s1 where sdept in (select sdept from students where sname = '劉晨');
或者:
select s1.Sname from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
注意比較:
select * from students s1 left join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨'; select * from students s1 right join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨'; select * from students s1 inner join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨'; select * from students s1 left semi join students s2 on s1.Sdept=s2.Sdept and s2.Sname='劉晨';
標準順序:
select--from--where--group by--having--order by
join操做中,on條件與where條件的區別
數據庫在經過鏈接兩張或多張表來返回記錄時,都會生成一張中間的臨時表,而後再將這張臨時表返回給用戶。
join發生在where字句以前,在使用left jion時,on和where條件的區別以下:
一、on條件是在生成臨時表時使用的條件,它無論on中的條件是否爲真,都會返回左邊表中的記錄。(右邊置爲Null了)
二、where條件是在臨時表生成好後,再對臨時表進行過濾的條件。這時已經沒有left join的含義(必須返回左邊表的記錄)了,條件不爲真的就所有過濾掉。
假設有兩張表:
表1:tab1
id size 1 10 2 20 3 30
表2:tab2
size name 10 AAA 20 BBB 20 CCC
兩條SQL:
一、select * from tab1 left join tab2 on tab1.size = tab2.size where tab2.name='AAA' 二、select * from tab1 left join tab2 on tab1.size = tab2.size and tab2.name='AAA'
第一條SQL的過程:
一、中間表
on條件:
tab1.size = tab2.size tab1.id tab1.size tab2.size tab2.name 1 10 10 AAA 2 20 20 BBB 2 20 20 CCC 3 30 (null) (null)
二、再對中間表過濾
where 條件:
tab2.name='AAA' tab1.id tab1.size tab2.size tab2.name 1 10 10 AAA
第二條SQL的過程:
一、中間表
on條件:
tab1.size = tab2.size and tab2.name='AAA' (條件不爲真也會返回左表中的記錄) tab1.id tab1.size tab2.size tab2.name 1 10 10 AAA 2 20 (null) (null) 3 30 (null) (null)
其實以上結果的關鍵緣由就是left join,right join,full join的特殊性,
無論on上的條件是否爲真都會返回left或right表中的記錄,full則具備left和right的特性的並集。
** 而inner join沒這個特殊性,則條件放在on中和where中,返回的結果集是相同的。**
需求:
有以下訪客訪問次數統計表 t_access_times
訪客 | 月份 | 訪問次數 |
---|---|---|
A | 2015-01 | 5 |
A | 2015-01 | 15 |
B | 2015-01 | 5 |
A | 2015-01 | 8 |
B | 2015-01 | 25 |
A | 2015-01 | 5 |
A | 2015-02 | 4 |
A | 2015-02 | 6 |
B | 2015-02 | 10 |
B | 2015-02 | 5 |
須要輸出報表:t_access_times_accumulate
月訪問:當月的總次數;累計訪問總計:截止到當月的月訪問次數之和。
訪客 | 月份 | 月訪問總計 | 累計訪問總計 |
---|---|---|---|
A | 2015-01 | 33 | 33 |
A | 2015-02 | 10 | 43 |
B | 2015-01 | 30 | 30 |
B | 2015-02 | 15 | 45 |
準備數據:
A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
create table t_access_time(username string,month string,salary int) row format delimited fields terminated by ','; load data local inpath '/home/hadoop/t_access_times.dat' into table t_access_time;
一、第一步,先求每一個用戶的月總金額
select username,month,sum(salary) from t_access_time group by username,month;
+-----------+----------+---------+--+
| username | month | salary |
+-----------+----------+---------+--+
| A | 2015-01 | 33 |
| A | 2015-02 | 10 |
| B | 2015-01 | 30 |
| B | 2015-02 | 15 |
+-----------+----------+---------+--+
二、第二步,將月總金額表 本身鏈接(自鏈接)
select * from (select username,month,sum(salary) as salary from t_access_time group by username,month) TabA inner join (select username,month,sum(salary) as salary from t_access_time group by username,month) TabB on TabA.username = TabB.username;
+-------------+----------+-----------+-------------+----------+-----------+--+
| a.username | a.month | a.salary | b.username | b.month | b.salary |
+-------------+----------+-----------+-------------+----------+-----------+--+
| A | 2015-01 | 33 | A | 2015-01 | 33 |
| A | 2015-01 | 33 | A | 2015-02 | 10 |
| A | 2015-02 | 10 | A | 2015-01 | 33 |
| A | 2015-02 | 10 | A | 2015-02 | 10 |
| B | 2015-01 | 30 | B | 2015-01 | 30 |
| B | 2015-01 | 30 | B | 2015-02 | 15 |
| B | 2015-02 | 15 | B | 2015-01 | 30 |
| B | 2015-02 | 15 | B | 2015-02 | 15 |
+-------------+----------+-----------+-------------+----------+-----------+--+
三、第三步,從上一步的結果中
進行分組查詢,分組的字段是a.username a.month
求月累計值: 將b.month <= a.month的全部b.salary求和便可
select TabA.username,TabA.month,max(TabA.salary) as month_salary,sum(TabB.salary) as sum_salary from (select username,month,sum(salary) as salary from t_access_time group by username,month) TabA inner join (select username,month,sum(salary) as salary from t_access_time group by username,month) TabB on TabA.username = TabB.username where TabB.month<= TabA.month group by TabA.username,TabA.month;
max(TabA.salary)不能直接寫成TabA.salary,由於這個字段沒有出如今group by中,也沒有聚合函數,因此使用max表示。
結果:
A 2015-01 33 33
A 2015-02 10 43
B 2015-01 30 30
B 2015-02 15 45
參考http://www.w2b-c.com/article/150326(去掉-)
初接觸,記下學習筆記,還有不少問題,望指導,謝謝。