1. Use Hive to run a word-frequency count on the text file produced by the crawler final assignment (or on the English novel downloaded for the English word-frequency exercise).
Start Hadoop:
start-all.sh
jps
View the folders on HDFS:
cd /usr/local/hadoop
hdfs dfs -ls
Upload the English story LittlePrince.txt from the local hadoop folder to the hive folder on HDFS:
hdfs dfs -put ~/hadoop/LittlePrince.txt hive
View the contents of the LittlePrince.txt file on HDFS:
hdfs dfs -cat hive/LittlePrince.txt
Start Hive:
hive
Create the document table word_frequency:
show databases;
use hive;
create table word_frequency(line string);
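The table deliberately has just one string column: each raw line of the novel becomes one row, and splitting lines into words is deferred to the HQL query later. To double-check the schema after creation:
describe word_frequency;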
Load the file contents into table word_frequency:
load data inpath '/user/hadoop/hive/LittlePrince.txt' overwrite into table word_frequency;
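Note that load data inpath moves the file within HDFS into Hive's warehouse directory rather than copying it, so hive/LittlePrince.txt will no longer be at its original location afterwards. If the file were still on the local filesystem, the local variant could be used instead (the local path here is illustrative):
load data local inpath '/home/hadoop/LittlePrince.txt' overwrite into table word_frequency;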
View the contents of table word_frequency (27 chapters in total):
select * from word_frequency;
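Instead of scrolling through the whole dump, a quick count also confirms how many lines were loaded (an extra sanity check, not required by the assignment):
select count(*) from word_frequency;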
Run the word-frequency count in HQL, storing the results in table words:
create table words as
select word, count(1)
from (select explode(split(line, ' ')) as word from word_frequency) word
group by word;
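Here split(line, ' ') turns each line into an array of words, explode flattens that array into one row per word, and group by word then aggregates the rows. The explode step can be tried in isolation on a made-up sentence to see what it produces:
select explode(split('the little prince', ' '));
-- yields three rows: the, little, prince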
View the results (3751 rows in total):
select * from words;
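Because count(1) was not aliased in the create-table-as-select, Hive names that column _c1 by default, so a top-20 listing sorts on it with backticks (a sketch against the words table built above):
select * from words order by `_c1` desc limit 20;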
2. Use Hive to analyze the csv file produced by the crawler final assignment, and write a blog post describing the analysis process and results.
First upload the crawled file to an email account, then download it inside the virtual machine and place it in the local hadoop folder.
Start Hadoop:
start-all.sh
jps
Upload jieba.csv from the local hadoop folder to the hive folder on HDFS:
cd /usr/local/hadoop
hdfs dfs -put ~/hadoop/jieba.csv hive
View the first 20 records of the jieba.csv file on HDFS:
hdfs dfs -cat hive/jieba.csv | head -20
Start Hive:
hive
Create the document table jieba in the hive database:
show databases;
use hive;
create table jieba(line string);
Load the file contents into table jieba:
load data inpath '/user/hadoop/hive/jieba.csv' overwrite into table jieba;
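To confirm the load from inside Hive without printing the whole table, limit keeps the output short, mirroring the head -20 check above:
select * from jieba limit 20;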
View the total number of rows in the table:
select count(*) from jieba;
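Since each csv row is stored as one raw line, the split pattern from part 1 supports further analysis. Below is a sketch that counts occurrences of the first comma-separated field; the column layout of jieba.csv and the table name jieba_top are assumptions, so adjust the index to whichever field holds the value of interest:
-- hypothetical: treats the first comma-separated field as the word of interest
create table jieba_top as
select split(line, ',')[0] as word, count(1) as freq
from jieba
group by split(line, ',')[0];
select * from jieba_top order by freq desc limit 20;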