Hadoop Comprehensive Assignment

Date: 2019-11-10

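Hadoop comprehensive assignment requirements:

1. Use Hive to run a word-frequency count on the text file produced by the crawler assignment (or on the long English novel downloaded for the English word-frequency exercise).

The text processed here is the English novel Harry Potter and the Order of the Phoenix.

Figure 2-1 Screenshot of the text being analyzed

Steps:

1. Start Hadoop

Figure 2-2 Screenshot of starting Hadoop

2. Check that Hadoop is running

Figure 2-3 Screenshot of Hadoop's running status

3. Create a directory on HDFS

Figure 2-4 Screenshot of creating the directory

4. Upload the file to HDFS

Figure 2-5 Screenshot of uploading the file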
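The screenshots for steps 1 to 4 are not reproduced here, so the exact commands are unknown. As a rough sketch, the usual Hadoop command-line calls for these steps can be assembled as below. The file name `harry_potter.txt` and the directory `/user/hadoop/wordcount` are hypothetical, and `start-all.sh` is only one of several ways to start the daemons.

```python
import subprocess

# Hypothetical names: the screenshots with the real paths are not available.
LOCAL_FILE = "harry_potter.txt"      # assumed local copy of the novel
HDFS_DIR = "/user/hadoop/wordcount"  # assumed HDFS target directory


def hdfs_setup_commands(local_file, hdfs_dir):
    """Return the shell commands for steps 1-4 as argument lists."""
    return [
        ["start-all.sh"],                               # 1. start the HDFS and YARN daemons
        ["jps"],                                        # 2. list the running Java daemons
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],      # 3. create the HDFS directory
        ["hdfs", "dfs", "-put", local_file, hdfs_dir],  # 4. upload the file
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Only print the commands here; call run_all(...) on a machine with Hadoop installed.
for cmd in hdfs_setup_commands(LOCAL_FILE, HDFS_DIR):
    print(" ".join(cmd))
```

Building the commands as lists keeps paths with spaces intact when they are passed to `subprocess.run`.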

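5. Start Hive

Figure 2-6 Screenshot of starting Hive

6. Create the raw document table

Figure 2-7 Screenshot of creating the document table

7. Load the file contents into the table docs and view them

Figure 2-8 Screenshot of loading the file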
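Steps 6 and 7 most likely come down to three HiveQL statements. The sketch below builds them as strings so they can be pasted into the Hive prompt. The table name `docs` comes from the step above. The column name `line` and the HDFS path are assumptions, since the screenshots are missing.

```python
# Hypothetical HDFS path; the original screenshot with the real path is missing.
HDFS_FILE = "/user/hadoop/wordcount/harry_potter.txt"


def docs_statements(hdfs_file, table="docs"):
    """Build HiveQL for steps 6-7: one STRING column, one row per line of text."""
    return [
        f"CREATE TABLE IF NOT EXISTS {table} (line STRING)",
        f"LOAD DATA INPATH '{hdfs_file}' OVERWRITE INTO TABLE {table}",
        f"SELECT * FROM {table} LIMIT 10",
    ]


# Print the statements so they can be pasted into the hive prompt.
for stmt in docs_statements(HDFS_FILE):
    print(stmt + ";")
```

Note that `LOAD DATA INPATH` moves the file from its HDFS location into the table's warehouse directory, so the file no longer appears at the original path afterwards.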

8. Run the word-frequency count and store the result in the table word_count

Figure 2-9 Screenshot of the word-frequency count
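The query used in this step is only visible in the missing screenshot. A common way to write a Hive word count is shown below as a string, next to a small Python function that mirrors the same logic on an in-memory sample. The column name `line` is an assumption, and the query is a sketch of the usual pattern, not necessarily the one the post ran.

```python
from collections import Counter

# Classic Hive word count, shown as an assumption about what step 8 ran.
WORD_COUNT_HQL = """
CREATE TABLE word_count AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word
"""


def word_count(lines):
    """Mirror the query: split each line on single spaces, then count per word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(" "))  # like explode(split(line, ' '))
    return sorted(counts.items())       # like GROUP BY word ORDER BY word


sample = ["black robes and green robes", "robes"]
for word, count in word_count(sample):
    print(word, count)
```

Splitting on a single space is case-sensitive and leaves punctuation attached to words, so `Robes`, `robes,` and `robes` are counted separately.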

9. Results

Figure 2-10 Screenshot of part of the results

Analysis

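The partial results above show that robes appears 65 times. A robe is the long outer garment worn by witches and wizards, not a currency (wizarding money is counted in Galleons, Sickles, and Knuts), so the high count reflects how often the novel describes what its characters wear and does not support reading it as a critique of money or capitalists.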

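2. Use Hive to analyze the csv file produced by the crawler assignment, and write a blog post describing the analysis process and results.

Crawler assignment source code: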

import re
import time
from datetime import datetime

import pandas
import requests
from bs4 import BeautifulSoup


# Fetch the details of a single news article
def getNewsDetail(newsUrl):
    time.sleep(0.1)  # brief pause between requests
    res = requests.get(newsUrl)
    res.encoding = 'gb2312'
    soupd = BeautifulSoup(res.text, 'html.parser')
    info = soupd.select('.post_time_source')[0].text  # line holding the time and the source
    # Raw string so that \d and \s reach the regex engine unchanged
    timestamp = re.search(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})', info).group(1)
    detail = {
        'title': soupd.select('#epContentLeft')[0].h1.text,
        'newsUrl': newsUrl,
        'time': datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S'),
        'source': re.search('來源:(.*)', info).group(1),  # label text must match the page's own characters
        'content': soupd.select('#endText')[0].text,
    }
    return detail


# Fetch every news article linked from one list page
def getListPage(listUrl):
    res = requests.get(listUrl)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, 'html.parser')
    pagedetail = []  # details of all news items on this page
    for news in soup.select('#news-flow-content')[0].select('li'):
        newsdetail = getNewsDetail(news.select('a')[0]['href'])  # call getNewsDetail() for each item
        pagedetail.append(newsdetail)
    return pagedetail


pagedetail = getListPage('http://tech.163.com/it/')  # news from the first page
# The NetEase news channel keeps only 20 list pages, so the bound is hard-coded to 20
# (range(2, 20) stops at page 19).
for i in range(2, 20):
    listUrl = 'http://tech.163.com/special/it_2016_%02d/' % i  # page number is zero-padded to two digits
    pagedetail.extend(getListPage(listUrl))
df = pandas.DataFrame(pagedetail)
df.to_csv('news.csv')
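Before the crawler output is loaded into Hive, it may help to flatten it. `df.to_csv('news.csv')` writes a header row and an index column, and the `content` field usually contains line breaks, which a plain-text Hive table reads as row boundaries. The sketch below is one possible clean-up step, not part of the original assignment. It uses only the standard library, the helper names `to_hive_row` and `write_hive_file` are hypothetical, and it assumes a Hive table declared with `ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'`.

```python
import re

FIELDS = ["title", "time", "source", "newsUrl", "content"]  # keys returned by getNewsDetail


def to_hive_row(detail, fields=FIELDS):
    """Flatten one record into a tab-separated line with no embedded breaks."""
    cells = []
    for name in fields:
        # Collapse tabs, newlines and runs of spaces so one record stays on one line.
        cells.append(re.sub(r"\s+", " ", str(detail.get(name, ""))).strip())
    return "\t".join(cells)


def write_hive_file(details, path):
    """Write records without header or index, one line per news item."""
    with open(path, "w", encoding="utf-8") as f:
        for detail in details:
            f.write(to_hive_row(detail) + "\n")


example = {"title": "A\ttitle", "time": "2019-11-10 08:00:00",
           "source": " NetEase ", "newsUrl": "http://tech.163.com/it/",
           "content": "line one\nline two"}
print(to_hive_row(example))
```

With the crawler's `pagedetail` list, `write_hive_file(pagedetail, 'news.tsv')` would produce a file that can be uploaded to HDFS and loaded the same way as the novel.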