Crawling all article details from cnblogs (博客園) and generating a word cloud

Date: 2019-12-13


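1. Site analysis

  (1) Open the browser

    Enter https://edu.cnblogs.com/posts in the address bar and the browser returns a page listing blog posts. [Figure: the post list page]

    Press F12 to open the browser's developer tools and click the Network tab. It shows the request we sent to https://edu.cnblogs.com/posts and the response that came back.

  A brief explanation of the fields: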

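    Request URL: the URL being requested.
    Request Method: the HTTP method, GET in this case. Other HTTP methods include OPTIONS, HEAD, POST, DELETE, and PUT; GET and POST are by far the most commonly used.

    POST: submits data to the specified resource for the server to process (for example, submitting a form or uploading a file). The data is carried in the request body. The request may create a new resource, modify an existing one, or both.

    GET: asks the server to return ("display") the specified resource.

    Status Code: the HTTP status code of the response. Here it is 200, meaning the server received, understood, and processed the request.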

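The first digit of the status code identifies the class of the response. HTTP defines the following classes:

    1xx Informational: the request was received and processing continues

    2xx Success: the request was received, understood, and accepted

    3xx Redirection: further action is needed to complete the request

    4xx Client error: the request contains bad syntax or cannot be fulfilled

    5xx Server error: the server failed while handling an apparently valid request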
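As a small illustration of the classes listed above, here is a minimal sketch that maps a status code to its class by its first digit. The names `STATUS_CLASSES` and `status_class` are made up for this example and are not part of the crawler.

```python
# Map the first digit of an HTTP status code to the response class it denotes.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the response class for a status code such as 200 or 404."""
    return STATUS_CLASSES.get(code // 100, "unknown")


for code in (200, 301, 404, 503):
    print(code, status_class(code))
```

A crawler can use a check like this to decide whether a page is worth parsing (2xx) or whether the request should be retried or skipped.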

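    HTTP request headers

      Accept: the MIME types the browser can handle;
      Cookie: data stored on the user's machine to identify the user and track the session;
      User-Agent: identifies the browser;
      Accept-Language: the languages the browser prefers;
      Accept-Charset: tells the web server which character encodings the browser accepts;
      Accept-Encoding: the content encodings the browser is able to decode;
      Connection: how the connection between client and server is handled, for example keep-alive.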
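To see how these headers are attached to a request in Python, the sketch below builds a `urllib.request.Request` without sending it. The header values are illustrative assumptions, not the exact ones a real browser sends.

```python
import urllib.request

# Illustrative header values (assumptions); a real browser sends longer ones.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64)",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Connection": "keep-alive",
}

# Building the Request does not touch the network;
# urllib.request.urlopen(request) is what would actually send it.
request = urllib.request.Request("https://edu.cnblogs.com/posts", headers=headers)

print(request.get_method())  # the method is GET because no body was given
print(request.full_url)
for name, value in request.header_items():
    print(name, ":", value)
```

Note that `urllib` normalizes header names when storing them, so `User-Agent` is reported as `User-agent`.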

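  (2) Analyze the page structure

    [Figure: structure of the post list page]

    cnblogs returns a list of posts, but what we want is the full text of each article. The developer tools show that each list entry sits in an li tag and that the link to the article is the a tag inside an h3 tag. So the first step is to collect the link targets of all article titles, and then fetch each article page.

    Following one of those links and inspecting the article page gives the structure shown here. [Figure: structure of an article page]

    The article body sits inside the div whose id is post_detail, so the next step is to extract the text inside that element.

    At this point it is clear which data we need, so we can start coding.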
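The link-collection step described above can be sketched with the standard library alone. This is only an illustration: the crawler in this post uses BeautifulSoup, and the `SAMPLE` markup and its URLs are invented stand-ins for the real list page.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, title) pairs from <a> tags nested inside <h3> tags."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.current_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        # The first non-blank text after an <h3><a ...> is the post title.
        if self.current_href is not None and data.strip():
            self.links.append((self.current_href, data.strip()))
            self.current_href = None

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
            self.current_href = None


# Hypothetical markup that mimics the li > h3 > a layout of the list page.
SAMPLE = """
<ul>
  <li><h3><a href="https://www.cnblogs.com/a/p/1.html">First post</a></h3>
      <div class="am-fr"><a href="/g">Class A</a><a href="/u">Alice</a></div></li>
  <li><h3><a href="https://www.cnblogs.com/b/p/2.html">Second post</a></h3></li>
</ul>
"""

parser = TitleLinkParser()
parser.feed(SAMPLE)
links = parser.links
for href, title in links:
    print(title, "->", href)
```

Links outside an h3, such as the class and author links in the `am-fr` div, are ignored, which matches the intent of collecting only the article URLs.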

2. Development environment

  Language: Python 3.6

  IDE: PyCharm

  Database: MongoDB

  Libraries: Requests, BeautifulSoup, pymongo, re, and others

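3. Installing third-party libraries

   Python has two well-known package management tools, easy_install and pip. With Python 2.7, easy_install was typically what came with the setup and pip had to be installed by hand (pip is bundled only from 2.7.9 onward). Python 3.4 and later ship with pip, and since we are using Python 3.6 there is no need to install it manually.

  Installing a Python dependency on Windows is very simple. The syntax is:

     pip install PackageName

  where PackageName is the name of the package you want to install.

  BeautifulSoup can be installed like this (its package name on PyPI is beautifulsoup4); other libraries are installed the same way:

     pip install beautifulsoup4

  Packages can also be installed from within PyCharm. [Figure: installing a package in PyCharm]

   MongoDB is used as the database. Installing it is left to the reader and is not covered here. For Python to connect to MongoDB, pymongo must be installed, in the same way as above.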

4. Writing the code: basic version

# coding: utf-8
import time
import urllib.request

from bs4 import BeautifulSoup

from util.delhtml import filter_tags  # local helper module, listed in section 5


class CnBlog(object):
    def __init__(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
        self.headers = {'Cache-Control': 'max-age=0',
                        'Connection': 'keep-alive',
                        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                        'User-Agent': user_agent,
                        }

    # Fetch a page and return it parsed as a BeautifulSoup tree
    def getPage(self, url=None):
        request = urllib.request.Request(url, headers=self.headers)
        response = urllib.request.urlopen(request)
        soup = BeautifulSoup(response.read(), "html.parser")
        # print(soup.prettify())
        return soup

    # Parse one page of the post list and fetch every article on it
    def parsePage(self, url=None, pageNo=None):
        soup = self.getPage(url + "?page=" + pageNo)
        itemBlog = soup.find_all("li", {"class": "am-g am-list-item-desced am-list-item-thumbed am-list-item-thumb-bottom-right"})
        print(len(itemBlog))
        for i, blogInfo in enumerate(itemBlog):
            title_link = blogInfo.find("h3").find("a")
            meta_links = blogInfo.find("div", "am-fr").find_all("a")
            dic = {
                'num': i,
                'title': title_link.string,
                'grade': meta_links[0].text,
                'author': meta_links[1].text,
                'url': title_link.get("href"),
            }
            # Fetch the article page and strip its HTML down to plain text
            item = self.getPage(dic['url'])
            dic['article'] = filter_tags(str(item))
            # The basic version only prints; storing dic is added in section 5
            print("Post #", dic['num'], "Title:", dic['title'], "Author:", dic['author'], "URL:", dic['url'])


if __name__ == "__main__":
    # Address of the list page to crawl
    url = "https://edu.cnblogs.com/posts"
    cnblog = CnBlog()
    # Pages 1 to 500, pausing one second between pages
    for i in range(1, 501):
        cnblog.parsePage(url, str(i))
        time.sleep(1)

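5. Problems and improvements

  (1) cnblogs offers two editors, TinyMCE and Markdown, and either way the article body comes back as HTML. To get plain text, all HTML tags have to be removed.

  The code for this is as follows: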

# -*- coding: utf-8 -*-
import re


# Strip tags and other markup from an HTML string.
# @param htmlstr HTML string.
def filter_tags(htmlstr):
    re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA sections
    # Non-greedy match with re.S so that scripts and styles containing '<' or newlines are removed whole
    re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I | re.S)  # script blocks
    re_style = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.I | re.S)  # style blocks
    re_br = re.compile(r'<br\s*?/?>')  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # any other HTML tag
    re_comment = re.compile(r'<!--.*?-->', re.S)  # HTML comments
    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove script blocks
    s = re_style.sub('', s)  # remove style blocks
    s = re_br.sub('\n', s)  # turn <br> into a newline
    s = re_h.sub('', s)  # remove HTML tags
    s = re_comment.sub('', s)  # remove HTML comments
    # Collapse runs of blank lines
    blank_line = re.compile(r'\n+')
    s = blank_line.sub('\n', s)
    s = replaceCharEntity(s)  # replace character entities
    return s


# Replace common HTML character entities with the characters they stand for.
# Add entries to CHAR_ENTITIES to handle more entities.
# @param htmlstr HTML string.
def replaceCharEntity(htmlstr):
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"', }

    re_charEntity = re.compile(r'&#?(?P<name>\w+);')
    # Each entity is looked up once (so "&amp;lt;" becomes "&lt;", not "<");
    # entities missing from CHAR_ENTITIES are replaced with an empty string
    return re_charEntity.sub(lambda m: CHAR_ENTITIES.get(m.group('name'), ''), htmlstr)


def replace(s, re_exp, repl_string):
    return re_exp.sub(repl_string, s)
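As an alternative to maintaining an entity table by hand, the standard library's `html.unescape` decodes all named and numeric character entities. The sketch below pairs it with the same tag pattern as `re_h`; the function name `strip_tags_and_entities` is made up for this example and is not part of the original helper module.

```python
import html
import re

TAG_RE = re.compile(r"</?\w+[^>]*>")  # same tag pattern as re_h above


def strip_tags_and_entities(fragment):
    """Drop HTML tags, then decode character entities with the standard library."""
    text = TAG_RE.sub("", fragment)
    return html.unescape(text)


sample = "<p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>"
print(strip_tags_and_entities(sample))
```

Tags are removed before entities are decoded so that an escaped `&lt;b&gt;` in the article text is kept as literal text rather than being stripped as a tag.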

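  (2) While crawling, I found that after two or three pages cnblogs temporarily blocked my IP, because it had detected the stream of consecutive requests. The simplest fix is to add a delay, so that each request waits two or three seconds after the previous one. My solution was to use rotating proxy IPs, so that each request comes from a different address. I found a free proxy IP pool on GitHub at https://github.com/jhao104/proxy_pool; its README explains how to use it, so I won't repeat that here. With these changes I was able to crawl the details of 500 pages of posts.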
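The delay-based approach mentioned above can be sketched as a small retry helper. This is an illustration under assumptions, not the code used in this post: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for any function that downloads a URL and raises on failure.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait `delay` seconds and try again."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            # Do not sleep after the final failed attempt.
            if attempt < retries - 1:
                sleep(delay)
    raise last_error


# Demo with a fake fetch function that fails twice before succeeding.
attempts = []


def flaky_fetch(url):
    attempts.append(url)
    if len(attempts) < 3:
        raise IOError("temporarily blocked")
    return "<html>ok</html>"


print(fetch_with_retry(flaky_fetch, "https://edu.cnblogs.com/posts", delay=0))
```

Passing `sleep` as a parameter keeps the helper easy to test; in a real crawl the default `time.sleep` with a delay of two or three seconds would be used.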

  (3) Because a fairly large amount of data is crawled, it is stored in a MongoDB database to make later analysis easier.

  The improved code is as follows:

import time

import pymongo
import requests
from bs4 import BeautifulSoup

from util.delhtml import filter_tags


class CnBlog(object):
    def __init__(self):
        user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
        self.headers = {'Cache-Control': 'max-age=0',
                        'Connection': 'keep-alive',
                        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
                        'User-Agent': user_agent,
                        }

    # Connect to the local MongoDB server and return the cnblog database
    def getConnect(self):
        con = pymongo.MongoClient('localhost', 27017)
        return con.cnblog

    # Fetch a page through a proxy, retrying up to 5 times; returns None if every attempt fails
    def getPage(self, url=None):
        retry_count = 5
        while retry_count > 0:
            try:
                address = "http://{}".format(self.get_proxy())
                # Register the proxy for both schemes, since the target site is served over HTTPS
                proxy = {"http": address, "https": address}
                response = requests.get(url, proxies=proxy, headers=self.headers, timeout=10)
                soup = BeautifulSoup(response.text, "html.parser")
                return soup
            except Exception:
                retry_count -= 1
        return None

    # Parse one page of the post list and store every article on it
    def parsePage(self, url=None, pageNo=None, db=None):
        soup = self.getPage(url + "?page=" + pageNo)
        if soup is None:
            return
        itemBlog = soup.find_all("li", {"class": "am-g am-list-item-desced am-list-item-thumbed am-list-item-thumb-bottom-right"})
        print(len(itemBlog))
        for i, blogInfo in enumerate(itemBlog):
            title_link = blogInfo.find("h3").find("a")
            meta_links = blogInfo.find("div", "am-fr").find_all("a")
            dic = {
                'num': i,
                'title': title_link.string,
                'grade': meta_links[0].text,
                'author': meta_links[1].text,
                'url': title_link.get("href"),
            }
            item = self.getPage(dic['url'])
            if item is None:
                continue
            dic['article'] = filter_tags(str(item))
            print("Post #", dic['num'], "Title:", dic['title'], "Author:", dic['author'], "URL:", dic['url'])
            # insert_one replaces the deprecated Collection.insert
            db.items.insert_one(dic)

    # Ask the local proxy_pool service for a proxy address
    def get_proxy(self):
        return requests.get("http://127.0.0.1:6068/get/", headers=self.headers).content.decode()

    # Tell the proxy_pool service to drop a proxy that no longer works
    def delete_proxy(self, proxy):
        requests.get("http://127.0.0.1:6068/delete/?proxy={}".format(proxy))


if __name__ == "__main__":
    # Address of the list page to crawl
    url = "https://edu.cnblogs.com/posts"
    cnblog = CnBlog()
    db = cnblog.getConnect()
    # Pages 1 to 500, pausing one second between pages
    for i in range(1, 501):
        cnblog.parsePage(url, str(i), db)
        time.sleep(1)

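6. Generating the word cloud

  (1) Install the wordcloud library

    There are two ways to install wordcloud on Windows.

    The first is to install the Microsoft Visual C++ Build Tools, but that is a fairly large download.

    The second is to download a prebuilt whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud and install it.

  (2) This post covers the second method. First download the whl file from https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud. [Figure: list of wordcloud whl files]

    Pick the file that matches your setup: cp is the Python version installed on the system, and 32 means that the installed Python is 32-bit, not that the operating system is.

Open cmd and change to the directory that contains the downloaded file.

   Then run the following commands: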

pip install wordcloud-1.3.3-cp36-cp36m-win32.whl
pip install wordcloud

    wordcloud is now installed and ready to use.
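To check which whl file matches the installed interpreter, the cp tag and the bitness can be read from Python itself. This is a minimal sketch that assumes CPython; `wheel_tags` is a name made up for this example.

```python
import struct
import sys


def wheel_tags():
    """Return the CPython version tag and the pointer width of this interpreter."""
    cp_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # Pointer size in bits: 32 or 64, for the interpreter rather than the OS.
    bits = struct.calcsize("P") * 8
    return cp_tag, bits


cp_tag, bits = wheel_tags()
print(cp_tag, "{}-bit".format(bits))
```

On the setup described in this post (32-bit Python 3.6), this would report cp36 and 32-bit, which corresponds to the cp36-cp36m-win32 file used above.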

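  (3) Install jieba

    I use jieba to segment the text into words before generating the word cloud. Installing it is simple: just run pip install jieba.

  The raw word cloud produced after jieba segmentation is shown here. [Figure: raw word cloud]

  (4) The generated word cloud is shown here. [Figure: generated word cloud]

  (5) The cloud contains a lot of English. Looking into it, many cnblogs posts include a lot of code, which is where the English comes from. The fix is to strip out the code and keep only the Chinese text, using the following regular expression: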

[A-Za-z0-9\[\`\~\!\@\#\$\^\&\*\(\)\=\|\{\}\'\:\;\'\,\[\]\.\<\>\/\?\~\！\@\#\\\&\*\%]

   The result after this change is shown here. [Figure: word cloud with English removed]
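A different way to get the same effect is to keep only Chinese characters instead of listing every character to remove. The sketch below swaps in that whitelist approach; it is not the regular expression used in this post, and treating the range U+4E00 to U+9FFF as "Chinese" is an assumption that covers the common CJK ideographs only.

```python
import re

# Assumption: the basic CJK Unified Ideographs block is what counts as Chinese here.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def keep_chinese(text):
    """Replace every run of non-Chinese characters with a single space."""
    return NON_CHINESE.sub(" ", text).strip()


print(keep_chinese("使用requests.get(url)爬取博客園500頁"))
```

Replacing each removed run with a space, rather than an empty string, keeps the words on either side of a code fragment from being glued together before segmentation.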

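   (6) Now for something fun: a word cloud in a custom shape

     I used a photo of "Lao Qiao" (qiao.jpg) as the mask. [Figure: mask photo]

    The resulting word cloud is shown here. [Figure: word cloud in the shape of the mask]

    That completes this small demo of crawling 500 pages of cnblogs posts and generating a word cloud from them.

  (7) The source code for the word cloud is as follows: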

import codecs
import re

import jieba
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator

# Text of the crawled articles
with codecs.open('ciyun.txt', 'r', 'utf-8') as file:
    word = file.read()
# Mask image that defines the shape of the cloud
image = np.array(Image.open('E:/pthonProject/pacong/image/qiao.jpg'))
# A font that contains Chinese glyphs
font = r'C:\Windows\Fonts\simkai.ttf'
# Remove English letters, digits, and punctuation, keeping the Chinese text
resultword = re.sub(r"[A-Za-z0-9\[\`\~\!\@\#\$\^\&\*\(\)\=\|\{\}\'\:\;\'\,\[\]\.\<\>\/\?\~\！\@\#\\\&\*\%]", "", word)
# Segment the text into words with jieba (full mode)
wordlist_after_jieba = jieba.cut(resultword, cut_all=True)

# WordCloud expects words separated by spaces
wl_space_split = " ".join(wordlist_after_jieba)
print(wl_space_split)
my_wordcloud = WordCloud(font_path=font, mask=image, background_color='white', max_words=100, max_font_size=100, random_state=50).generate(wl_space_split)
# Optionally take the word colours from the mask image
image_colors = ImageColorGenerator(image)
# my_wordcloud.recolor(color_func=image_colors)
# Show the generated word cloud
plt.imshow(my_wordcloud)
plt.axis("off")
plt.show()
# Save the image; this runs only after the plot window is closed, so interrupting the program skips it
my_wordcloud.to_file('result.jpg')