Web Crawler Term Project

Date: 2019-11-10

Tags: crawler project    Category: web crawlers


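1. Choose a topic that interests you.

2. Write a crawler in Python to collect data on that topic from the web.

3. Run text analysis on the crawled data and generate a word cloud.

4. Explain and interpret the results of the text analysis.

5. Write a complete blog post describing the implementation process, the problems encountered and how they were solved, and the reasoning and conclusions of the data analysis.

6. Finally, submit all of the crawled data together with the source code for the crawler and the data analysis.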

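For this assignment I decided to crawl the IT section of the NetEase News technology channel, starting from the section's front page at http://tech.163.com/it/.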

First, open the browser's developer tools (F12 or Ctrl+Shift+I) and locate the structure of the news list we want to crawl.

We can see that the news list is stored in the element with class .newsList, and each news link sits in an <a> tag inside the element with class .titleBar.
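As a rough illustration of the structure just described, the sketch below collects the links found inside .titleBar elements from a small hand-written HTML fragment. The fragment, the example.com URLs, and the names `TitleLinkParser` and `extract_links` are all assumptions made up for this demonstration. It relies only on the standard library's `html.parser` so that it runs offline; it is not the approach used in the crawler itself, which queries the live page with BeautifulSoup.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
             'input', 'link', 'meta', 'source', 'track', 'wbr'}


class TitleLinkParser(HTMLParser):
    """Collect the href of every <a> nested inside an element with class 'titleBar'."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open = []  # one flag per open element: True if it has class 'titleBar'

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # An <a> counts only if some enclosing element carries the titleBar class
        if tag == 'a' and any(self._open) and attrs.get('href'):
            self.links.append(attrs['href'])
        if tag not in VOID_TAGS:
            self._open.append('titleBar' in (attrs.get('class') or '').split())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._open:
            self._open.pop()


def extract_links(html):
    """Return the headline links found in the given HTML string."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical fragment imitating the list structure; not copied from the real page
SAMPLE_HTML = '''
<ul class="newsList">
  <li><div class="titleBar"><h3><a href="http://example.com/news/1.html">First story</a></h3></div></li>
  <li><div class="titleBar"><h3><a href="http://example.com/news/2.html">Second story</a></h3></div></li>
  <li><a href="http://example.com/ad.html">Not a headline</a></li>
</ul>
'''

print(extract_links(SAMPLE_HTML))
```

With BeautifulSoup, the same idea would usually be expressed as a single CSS selector passed to `select()`, which is shorter but requires the third-party package.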

Next, we open a news article page and analyze its structure, again using the developer tools.

[Screenshot: news detail page]

[Screenshot: news page structure]

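The news content is stored in the element with class .post_content_main. Within it, the article information (publication time and source) is in the element with class .post_time_source, and the title is in the <h1> tag.

The full code is as follows: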
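To make the handling of the .post_time_source text more concrete, here is a minimal sketch that splits such a line into a timestamp and a source name with regular expressions. The sample string and the function name `parse_time_source` are assumptions for illustration; the real page text may be laid out differently, so the patterns would need checking against an actual article.

```python
import re
from datetime import datetime


def parse_time_source(info_text):
    """Split the text of the time/source line into a datetime and a source name.

    Returns None for whichever part cannot be found.
    """
    time_match = re.search(r'\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2}', info_text)
    # Accept the Simplified or Traditional label, and an ASCII or full-width colon
    source_match = re.search(r'[来來]源[:：]\s*(\S+)', info_text)
    published = (datetime.strptime(time_match.group(0), '%Y-%m-%d %H:%M:%S')
                 if time_match else None)
    source = source_match.group(1) if source_match else None
    return published, source


# Hypothetical example of what the line might contain
sample = '2019-11-10 08:30:00　来源: 网易科技'
print(parse_time_source(sample))
```

Returning `None` instead of raising keeps a single malformed article from stopping a long crawl, at the cost of having to check for missing values later.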

import re
from datetime import datetime

import jieba
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud


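# Fetch the details of a single news article
def getNewsDetail(newsUrl):
    res = requests.get(newsUrl)
    res.encoding = 'gb2312'
    soupd = BeautifulSoup(res.text, 'html.parser')
    info = soupd.select('.post_time_source')[0].text  # line holding the publication time and source
    # Publication time in the form YYYY-MM-DD HH:MM:SS (raw string avoids invalid-escape warnings)
    timeStr = re.search(r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})', info).group(1)
    # The source label may appear in Simplified (来源) or Traditional (來源) form
    source = re.search(r'[来來]源:(.*)', info).group(1).strip()
    detail = {'title': soupd.select('#epContentLeft')[0].h1.text,
              'newsUrl': newsUrl,
              'time': datetime.strptime(timeStr, '%Y-%m-%d %H:%M:%S'),
              'source': source,
              'content': soupd.select('#endText')[0].text}
    return detail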


# Segment the news text with jieba and extract the keywords for the word cloud
def getKeyWords():
    with open('news.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    # Keep only Chinese characters (dropping punctuation), segment the joined text,
    # and deduplicate the resulting words
    wordSet = set(jieba.lcut(''.join(re.findall('[\u4e00-\u9fa5]', content))))
    # Count each word, skipping single-character words, which carry little meaning
    wordDict = {word: content.count(word) for word in wordSet if len(word) >= 2}
    # Sort by frequency, highest first
    dictList = sorted(wordDict.items(), key=lambda item: item[1], reverse=True)
    keyWords = [word for word, count in dictList]
    writekeyword(keyWords)


# Append the content of each news article to a file
def writeNews(pagedetail):
    with open('news.txt', 'a', encoding='utf-8') as f:
        for detail in pagedetail:
            f.write(detail['content'])


# Write the keywords to a file
def writekeyword(keywords):
    # 'w' rather than 'a', so re-running the script does not duplicate keywords
    with open('keywords.txt', 'w', encoding='utf-8') as f:
        for word in keywords:
            f.write('  ' + word)


# Fetch all the news articles on one list page
def getListPage(listUrl):
    res = requests.get(listUrl)
    res.encoding = "utf-8"
    soup = BeautifulSoup(res.text, 'html.parser')
    pagedetail = []  # details of every article on this page
    for news in soup.select('#news-flow-content')[0].select('li'):
        newsdetail = getNewsDetail(news.select('a')[0]['href'])  # call getNewsDetail() for the article details
        pagedetail.append(newsdetail)
    return pagedetail


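# Render the word cloud from the keyword file
def getWordCloud():
    with open('keywords.txt', 'r', encoding='utf-8') as f:  # open the keyword file
        keywords = f.read()
    # Use a font that can display Chinese characters
    wc = WordCloud(font_path=r'C:\Windows\Fonts\simfang.ttf', background_color='white', max_words=100)
    wc.generate(keywords)  # build the word cloud
    wc.to_file('kwords.png')  # save it as an image
    plt.imshow(wc)
    plt.axis('off')
    plt.show()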


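pagedetail = getListPage('http://tech.163.com/it/')  # fetch the news on the front page
writeNews(pagedetail)
for i in range(2, 21):  # the NetEase channel keeps only 20 pages of news, so loop over pages 2 to 20
    listUrl = 'http://tech.163.com/special/it_2016_%02d/' % i  # fill in the page number, zero-padded to two digits
    pagedetail = getListPage(listUrl)
    writeNews(pagedetail)
getKeyWords()  # extract the keywords and write them to a file
getWordCloud()  # read the keywords back from the file and render the word cloud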
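The keyword step in getKeyWords() comes down to counting how often each word occurs and discarding single-character words. The sketch below shows one possible way to do that with `collections.Counter`. The function name `top_keywords` and the token list are hypothetical: the list stands in for the output of a jieba segmentation so that the example runs without third-party packages.

```python
from collections import Counter


def top_keywords(tokens, limit=10):
    """Return the most frequent tokens of length >= 2 as (word, count) pairs."""
    counts = Counter(token for token in tokens if len(token) >= 2)
    return counts.most_common(limit)


# Hypothetical segmentation result; real input would come from jieba
tokens = ['手机', '的', '芯片', '手机', '发布', '了', '芯片', '手机']
print(top_keywords(tokens, limit=2))
```

Counting segmented tokens differs slightly from calling `content.count(word)` on the raw text, which also counts a word when it appears as part of a longer one. A word-to-count mapping like this can also be passed to WordCloud's `generate_from_frequencies` method, so that word sizes in the image reflect the counts.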