Web Crawler Final Project

Date: 2020-07-10

Tags: crawler, final project. Category: web crawlers


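1. Topic

Crawl the post titles from 160 pages (25 posts per page) of the cnblogs Q&A section (博問), segment them with jieba, and generate a word cloud for analysis.

2. Crawling the data with Python

Q&A home page: https://q.cnblogs.com/list/unsolved?page=1

Second page: https://q.cnblogs.com/list/unsolved?page=2 and so on.

This gives the bkyUrl address of each of the 160 pages: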

for i in range(1, 161):
    bkyUrl = "https://q.cnblogs.com/list/unsolved?page={}".format(i)
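The loop above rebinds bkyUrl on every pass. If the full list of addresses is wanted up front, a list comprehension is one option. This is only a minimal sketch: the names BASE_URL and build_page_urls are illustrative and do not come from the original post.

```python
# URL pattern taken from the post; the helper name is hypothetical.
BASE_URL = "https://q.cnblogs.com/list/unsolved?page={}"

def build_page_urls(first=1, last=160):
    """Return the list-page URLs for pages first..last (inclusive)."""
    return [BASE_URL.format(page) for page in range(first, last + 1)]

urls = build_page_urls()
print(len(urls), urls[0], urls[-1])
```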

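Inspect the elements of the Q&A home page in the browser:

Inside the main div with class left_sidebar there are 25 h2 tags, and the text of the a tag inside each h2 is the title of one Q&A post.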
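To check this extraction rule without hitting the site, the same "h2 inside .left_sidebar, first a inside each h2" logic can be run on a hand-written fragment. The sketch below deliberately swaps BeautifulSoup for the standard-library html.parser so it has no dependencies. The class TitleParser and the sample HTML are illustrative assumptions, not the real page markup.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <a> in every <h2> inside <div class="left_sidebar">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # div nesting depth inside left_sidebar; 0 means outside
        self.in_h2 = False
        self.h2_done = False  # True once the first <a> of the current <h2> is captured
        self.in_a = False
        self.titles = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth > 0:
                self.div_depth += 1
            elif "left_sidebar" in classes:
                self.div_depth = 1
        elif tag == "h2" and self.div_depth > 0:
            self.in_h2, self.h2_done = True, False
        elif tag == "a" and self.in_h2 and not self.h2_done:
            self.in_a, self._buf = True, []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "h2":
            self.in_h2 = False
        elif tag == "a" and self.in_a:
            self.in_a, self.h2_done = False, True
            self.titles.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.in_a:
            self._buf.append(data)

# Made-up fragment that only mimics the structure described in the text.
SAMPLE = """
<div class="left_sidebar">
  <div><h2><a href="/q/1">How to connect to MySQL?</a></h2></div>
  <div><h2><a href="/q/2">Python crawler question</a></h2></div>
</div>
<div class="right_sidebar"><h2><a href="/x">Not a post</a></h2></div>
"""

parser = TitleParser()
parser.feed(SAMPLE)
print(parser.titles)
```

Only the two titles inside left_sidebar are collected; the h2 in the other div is ignored.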

This leads to the getpagetitle function, which fetches the 25 post titles on each page:

def getpagetitle(bkyUrl):
    time.sleep(1)  # pause one second between requests
    print(bkyUrl)
    res1 = requests.get(bkyUrl)  # returns a Response object
    res1.encoding = 'utf-8'
    soup1 = BeautifulSoup(res1.text, 'html.parser')
    item_list = soup1.select(".left_sidebar")[0]
    for i in item_list.select("h2"):
        title = i.select("a")[0].text

Putting the steps above together to fetch the 160 * 25 post titles:

import requests
import time
from bs4 import BeautifulSoup


def addtitle(title):
    # Append one title per line to the output file
    with open("F:/study/大三/大數據/title.txt", "a", encoding='utf-8') as f:
        f.write(title + "\n")


def getpagetitle(bkyUrl):
    time.sleep(1)  # pause one second between requests
    print(bkyUrl)
    res1 = requests.get(bkyUrl)  # returns a Response object
    res1.encoding = 'utf-8'
    soup1 = BeautifulSoup(res1.text, 'html.parser')
    item_list = soup1.select(".left_sidebar")[0]
    for i in item_list.select("h2"):
        title = i.select("a")[0].text
        addtitle(title)


# range(1, 161) covers all 160 pages; the original range(160, 161) fetched only page 160
for i in range(1, 161):
    bkyUrl = "https://q.cnblogs.com/list/unsolved?page={}".format(i)
    getpagetitle(bkyUrl)
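With 160 requests in a row, a single network error or a page without .left_sidebar would stop the whole run. One possible way to keep going is to record the failing URL and move on. The sketch below assumes a variant of getpagetitle that returns a list of titles instead of writing them to a file. The names crawl_pages and fake_fetch are hypothetical, and fake_fetch is only a stand-in so the example runs without network access.

```python
import time

def crawl_pages(fetch_titles, urls, delay=1.0):
    """Call fetch_titles(url) for each URL; collect titles and record failures instead of stopping."""
    titles, failed = [], []
    for url in urls:
        try:
            titles.extend(fetch_titles(url))
        except Exception as exc:  # e.g. a network error, or IndexError if .left_sidebar is missing
            failed.append((url, repr(exc)))
        time.sleep(delay)  # same politeness pause as time.sleep(1) in the post
    return titles, failed

# Stand-in fetcher for demonstration; a real one would wrap requests + BeautifulSoup.
def fake_fetch(url):
    if url.endswith("page=2"):
        raise IndexError("no .left_sidebar on page")
    return ["title from " + url]

titles, failed = crawl_pages(fake_fetch, ["u?page=1", "u?page=2", "u?page=3"], delay=0)
print(titles)
print(failed)
```

The failed list can be retried afterwards, which is cheaper than restarting from page 1.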

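The titles are saved to the text file title.txt:

3. Generating the word cloud

Read the titles from the text file as a single string, segment it with jieba, remove some punctuation marks and useless tokens (the filtering here is not very thorough), and build the dictionary countdict: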

import jieba


def gettitle():
    # Read the whole title file as one string
    with open("F:/study/大三/大數據/title.txt", "r", encoding='utf-8') as f:
        return f.read()


str1 = gettitle()
stringList = list(jieba.cut(str1))
# Tokens to drop: punctuation, spaces, and the newlines that separate titles
delset = {"，", "。", "：", "「", "」", "？", " ", "；", "！", "、", "\n"}
stringset = set(stringList) - delset
countdict = {}
for i in stringset:
    countdict[i] = stringList.count(i)
print(countdict)
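Calling stringList.count(i) for every distinct word rescans the whole list each time, which gets slow with roughly 4,000 titles. collections.Counter does the same tally in one pass. This is a sketch under assumptions: count_words is a hypothetical helper, and the token list is a hand-written stand-in for list(jieba.cut(str1)), whose real output depends on jieba's dictionary.

```python
from collections import Counter

def count_words(tokens, stop_tokens):
    """Tally tokens in a single pass, skipping stop tokens and whitespace-only tokens."""
    return Counter(t for t in tokens if t not in stop_tokens and t.strip())

# Hand-written stand-in for jieba output, for illustration only.
tokens = ["python", "數據庫", "，", "python", "\n", "Java", "？", "數據庫", "python"]
delset = {"，", "。", "：", "？", "；", "！", "、", " "}

counts = count_words(tokens, delset)
print(counts.most_common(2))  # [('python', 3), ('數據庫', 2)]
```

Because Counter is a dict subclass, it should be usable wherever countdict is used, and most_common(n) gives a quick numeric check of which words dominate.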

Analyse the text and generate the word cloud:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

font = r'C:\Windows\Fonts\simhei.ttf'  # a font with Chinese glyphs
# Forward slashes avoid accidental escape sequences in the Windows path
backgroud_Image = plt.imread("F:/study/大三/大數據/background.jpg")
wc = WordCloud(background_color='White', max_words=500, font_path=font,
               mask=backgroud_Image)
wc.generate_from_frequencies(countdict)  # word -> count mapping built above
plt.imshow(wc)
plt.axis("off")
plt.show()

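background.jpg is used here as the background (mask) image:

The generated word cloud is shown below:

The word cloud makes it easy to see that most of the questions asked on the Q&A section concern databases, Python, C# and Java.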

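4. Problems encountered while crawling the data

Crawling the titles went fairly smoothly. The main problem came up while installing wordcloud:

There are two ways to install wordcloud:

Method 1: in PyCharm, go to File > Settings > Project > Project Interpreter and install the wordcloud package from there.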

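Method 2: download the wheel that matches your Python version and Windows architecture (32-bit or 64-bit) from

https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud

My Python version is 3.6 on 64-bit Windows 10, so I downloaded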

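wordcloud-1.4.1-cp36-cp36m-win_amd64.whl

I saved the downloaded file on drive F.

In the cmd command line, change to the directory that holds the wheel file; mine is on drive F, so I switch to F:

Run pip install wordcloud-1.4.1-cp36-cp36m-win_amd64.whl and the package installs successfully.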

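However, Method 1 always failed with this error message:

The fix is probably to install Microsoft Visual C++ 14.0, but the download is large and I did not try it, so I used Method 2.

Running Method 2:

The pip output shows the directory that wordcloud was installed into.

If the wordcloud package still does not appear under File > Settings > Project > Project Interpreter in PyCharm afterwards, find wordcloud manually in that install directory and copy it to C:\Users\ - \PycharmProjects\**\venv\lib (where ** is the name of the project you created).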
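Before copying folders by hand, it can help to confirm which interpreter PyCharm is actually running and whether that interpreter can see the package. A minimal sketch using only the standard library; locate_package is a hypothetical helper name.

```python
import importlib.util
import sys

def locate_package(name):
    """Return the file the current interpreter would import `name` from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

print(sys.executable)               # the interpreter running this script
print(locate_package("wordcloud"))  # None means this interpreter cannot see wordcloud
```

If the script prints None when run from PyCharm but a path when run from cmd, the two are most likely using different interpreters or environments, and pointing the project at the interpreter where pip installed the wheel may be simpler than copying files.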

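5. Summary

Crawling data with Python and turning it into a word cloud was quite fun. I originally wanted to crawl the account age (園齡) of each cnblogs blogger, but a blogger's home page can only be opened after logging in to cnblogs, and what I have learned so far does not cover crawling as a logged-in user. I will keep studying this.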