Hadoop Comprehensive Assignment

Requirements:
1. Use Hive to run a word-frequency count on the text file produced by the web-crawler assignment (or on the English novel downloaded for the English word-frequency exercise).
2. Use Hive to analyze the data in the CSV file produced by the web-crawler assignment.

The detailed steps are as follows.

First, the Python crawler code:
import jieba
import requests
from bs4 import BeautifulSoup

lyrics = ''
headers = {
    'User-Agent': 'User-Agent:*/*'
}

resp = requests.get('http://www.juzimi.com/writer/%E6%96%B9%E6%96%87%E5%B1%B1', headers=headers)
resp.encoding = 'UTF-8'
print(resp.status_code)
soup = BeautifulSoup(resp.text, 'html.parser')

page_url = 'http://www.juzimi.com/writer/%E6%96%B9%E6%96%87%E5%B1%B1?page={}'
page_last = soup.select('.pager-last')
if len(page_last) > 0:
    page_last = page_last[0].text

for i in range(0, int(page_last)):
    print(i)
    resp = requests.get(page_url.format(i), headers=headers)
    resp.encoding = 'UTF-8'
    soup = BeautifulSoup(resp.text, 'html.parser')
    for a in soup.select('.xlistju'):
        lyrics += a.text + ' '

# save the crawled sentences
with open('lyrics.txt', 'a+', encoding='UTF-8') as lyricFile:
    lyricFile.write(lyrics)

# load the punctuation list and strip punctuation from the lyrics
with open('punctuation.txt', 'r', encoding='UTF-8') as punctuationFile:
    for punctuation in punctuationFile.readlines():
        lyrics = lyrics.replace(punctuation[0], ' ')

# load the stop-word (meaningless word) list
with open('meaningless.txt', 'r', encoding='UTF-8') as meaninglessFile:
    mLessSet = set(meaninglessFile.read().split('\n'))
    mLessSet.add(' ')

# load reserved words so jieba keeps them as whole tokens
with open('reservedWord.txt', 'r', encoding='UTF-8') as reservedWordFile:
    reservedWordSet = set(reservedWordFile.read().split('\n'))
    for reservedWord in reservedWordSet:
        jieba.add_word(reservedWord)

keywordList = list(jieba.cut(lyrics))
keywordSet = set(keywordList) - mLessSet  # remove stop words from the word set
keywordDict = {}

# build the word-frequency dictionary
for word in keywordSet:
    keywordDict[word] = keywordList.count(word)

# sort by frequency
keywordListSorted = list(keywordDict.items())
keywordListSorted.sort(key=lambda e: e[1], reverse=True)

# write every word out to word.txt, one line per occurrence
for topWordTup in keywordListSorted:
    print(topWordTup)
    with open('word.txt', 'a+', encoding='UTF-8') as wordFile:
        for i in range(0, topWordTup[1]):
            wordFile.write(topWordTup[0] + '\n')
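Because word.txt repeats each word once per occurrence, Hive can count frequencies with a plain GROUP BY. Before loading the file into HDFS, a quick local sanity check can preview what the Hive query should return (a minimal sketch, assuming word.txt is in the current directory):

from collections import Counter

# count one occurrence per non-empty line, mirroring the Hive GROUP BY
with open('word.txt', 'r', encoding='UTF-8') as wordFile:
    counter = Counter(line.strip() for line in wordFile if line.strip())

for word, num in counter.most_common(50):  # same top 50 the Hive query below returns
    print(word, num)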
Now put word.txt into HDFS and run the word count with Hive. The commands are as follows:
hdfs dfs -mkdir temp
hdfs dfs -put word.txt temp
hive

hive> create database db_temp;
hive> use db_temp;
hive> create table tb_word(word string);
hive> load data inpath '/user/hadoop/temp/word.txt' into table tb_word;
hive> select word, count(1) as num from tb_word group by word order by num desc limit 50;
The result of running the above commands is shown in the screenshot below.
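As an optional extra beyond the steps above, the same top-50 query could also be issued programmatically, for example with the PyHive client. This is only a sketch and assumes HiveServer2 is enabled and reachable on localhost:10000 and that the pyhive package is installed:

from pyhive import hive  # assumption: pyhive[hive] is installed and HiveServer2 is running

conn = hive.Connection(host='localhost', port=10000, database='db_temp')
cursor = conn.cursor()
cursor.execute('select word, count(1) as num from tb_word group by word order by num desc limit 50')
for word, num in cursor.fetchall():
    print(word, num)
conn.close()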
For the second task I chose to crawl the campus news site and generate a CSV file for analysis. First, write the crawler; the main code is as follows:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import pandas

news_list = []


def crawlOnePageSchoolNews(page_url):
    res0 = requests.get(page_url)
    res0.encoding = 'UTF-8'
    soup0 = BeautifulSoup(res0.text, 'html.parser')
    news = soup0.select('.news-list > li')
    for n in news:
        # print(n)
        print('**' * 5 + 'List page info' + '**' * 10)
        print('News link: ' + n.a.attrs['href'])
        print('News title: ' + n.select('.news-list-title')[0].text)
        print('News description: ' + n.a.select('.news-list-description')[0].text)
        print('News time: ' + n.a.select('.news-list-info > span')[0].text)
        print('News source: ' + n.a.select('.news-list-info > span')[1].text)
        news_list.append(getNewDetail(n.a.attrs['href']))
    return news_list


def getNewDetail(href):
    print('**' * 5 + 'Detail page info' + '**' * 10)
    print(href)
    res1 = requests.get(href)
    res1.encoding = 'UTF-8'
    soup1 = BeautifulSoup(res1.text, 'html.parser')
    news = {}
    if soup1.select('#content'):
        news_content = soup1.select('#content')[0].text
        # replace newlines and commas so the body does not break the CSV format
        news['content'] = news_content.replace('\n', ' ').replace('\r', ' ').replace(',', '·')
        print(news_content)  # article body
    else:
        news['content'] = ''
    if soup1.select('.show-info'):  # some older pages have no show-info block
        news_info = soup1.select('.show-info')[0].text
    else:
        return news
    info_list = ['来源', '发布时间', '点击', '作者', '审核', '摄影']  # fields to parse; must match the page text
    # &nbsp; in the page is parsed as \xa0, so \xa0 can be used as the separator
    news_info_set = set(news_info.split('\xa0')) - {' ', ''}
    # loop over and print the article info
    for n_i in news_info_set:
        for info_flag in info_list:
            if n_i.find(info_flag) != -1:  # the time field uses an ASCII colon, so check which field this is
                if info_flag == '发布时间':
                    # convert the publish-time string to datetime for later storage in a database
                    release_time = datetime.strptime(n_i[n_i.index(':') + 1:], '%Y-%m-%d %H:%M:%S ')
                    news[info_flag] = release_time
                    print(info_flag + ':', release_time)
                elif info_flag == '点击':
                    # the click count is filled in by JS after calling a PHP endpoint with the article id,
                    # so it is handled separately
                    news[info_flag] = getClickCount(href)
                else:
                    news[info_flag] = n_i[n_i.index(':') + 1:].replace(',', '·')
                    print(info_flag + ':' + n_i[n_i.index(':') + 1:])
    print('————' * 40)
    return news


def getClickCount(news_url):
    click_num_url = 'http://oa.gzcc.cn/api.php?op=count&id={}&modelid=80'
    click_num_url = click_num_url.format(re.search(r'_(.*)/(.*).html', news_url).group(2))
    res2 = requests.get(click_num_url)
    res2.encoding = 'UTF-8'
    click_num = re.search(r"\$\('#hits'\).html\('(\d*)'\)", res2.text).group(1)
    print('Clicks: ' + click_num)
    return click_num


print(crawlOnePageSchoolNews('http://news.gzcc.cn/html/xiaoyuanxinwen/'))

pageURL = 'http://news.gzcc.cn/html/xiaoyuanxinwen/{}.html'
res = requests.get('http://news.gzcc.cn/html/xiaoyuanxinwen/')
res.encoding = 'UTF-8'
soup = BeautifulSoup(res.text, 'html.parser')
newsSum = int(re.search(r'(\d*)条', soup.select('a.a1')[0].text).group(1))  # total number of articles
if newsSum % 10:
    pageSum = int(newsSum / 10) + 1
else:
    pageSum = int(newsSum / 10)

for i in range(2, pageSum + 1):
    crawlOnePageSchoolNews(pageURL.format(i))

# with open('news.txt', 'w') as file:
#     file.write()

dit = pandas.DataFrame(news_list)
dit.to_csv('news.csv')
print(dit)
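A small aside on the last lines: dit.to_csv('news.csv') writes both a header row and the DataFrame index. A hedged alternative sketch (assuming the news_list built above) would drop the header at write time, which makes the Hive-side header cleanup described below unnecessary while keeping the index as the first column to fill the id field of the table:

import pandas

dit = pandas.DataFrame(news_list)     # news_list as built by the crawler above
dit.to_csv('news.csv', header=False)  # keep the index (id column), drop the header row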
Because CSV fields are comma-separated, commas and line breaks inside the article body would corrupt the format, so the crawler replaces them with other characters while scraping. After crawling, the file is put into HDFS and the header row is removed; an INSERT OVERWRITE statement that rewrites the imported table is enough for that. A Hive query is then used to analyze at which hours the authors publish the most articles.
hdfs dfs -put news.csv temp/
hive

hive> create table tb_news(id string, content string, author string, publish timestamp, verify string, photo string, source string, click int)
      row format delimited fields terminated by ',';
hive> load data inpath '/user/hadoop/temp/news.csv' overwrite into table tb_news;
hive> insert overwrite table tb_news select * from tb_news where content != 'content';
hive> select time_publish, count(1) as num
      from (select hour(publish) as time_publish from tb_news) tb_time
      group by time_publish
      order by num desc;
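As a quick local cross-check of the Hive result, the same hour-of-publish aggregation can be done in pandas (a minimal sketch, assuming the local news.csv still has its header row and that the publish time was stored under the '发布时间' key used by the crawler):

import pandas

df = pandas.read_csv('news.csv')
df['发布时间'] = pandas.to_datetime(df['发布时间'])  # column name comes from the crawler's dict key
print(df['发布时间'].dt.hour.value_counts())         # number of articles published in each hour of the day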
From the screenshot of the Hive query result above, it can be seen that most of the site editor's posts are published around hour 0, i.e. midnight. All I can say is that staying up late is not good.