First, let's look at the website we need to scrape: http://xiaohua.zol.com.cn/html
pip install requests
pip install beautifulsoup4
pip install lxml
from bs4 import BeautifulSoup
import os
import requests
Import the required libraries; the os library will be used later to save the scraped content.
Next we open "Latest Jokes" and find an "All Jokes" column, which lets us scrape the entire joke history most efficiently!
Let's use the requests library to look at this page's source code:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
all_html = requests.get(all_url, headers=headers)
print(all_html.text)
headers is the request header; on most sites, scraping fails without it.
Part of the output is shown below:
Analyzing the source, we find we still can't get all the jokes' information directly from this page, so we look for an indirect route.
Opening one joke to view the full text, we see the URL becomes http://xiaohua.zol.com.cn/detail58/57681.html. Opening other jokes, we find their URLs all follow the form http://xiaohua.zol.com.cn/detail?/?.html, so we use this pattern as our entry point for scraping all the content.
Our goal is to find every URL of the form http://xiaohua.zol.com.cn/detail?/?.html, then scrape its content.
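As a quick sanity check, this URL pattern can be recognized with a regular expression (a sketch; the sample anchor string below is made up for illustration, not taken from the real page):

```python
import re

# Matches relative links like /detail58/57681.html
detail_pattern = re.compile(r'/detail\d+/\d+\.html')

sample = '<a href="/detail58/57681.html">read more</a>'
match = detail_pattern.search(sample)
print(match.group())  # -> /detail58/57681.html
```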
On the "All Jokes" page we flip to an arbitrary page, say http://xiaohua.zol.com.cn/new/5.html, press F12 to inspect the source, and from the layout we find:
Each joke corresponds to one <li class="article-summary"> tag, and analysis shows that the URL of each joke's full text is hidden in an href attribute; we only need to extract the href to get the joke's URL.
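To illustrate the href extraction in isolation, here is a minimal run against a hand-written snippet that mimics the list-page markup (the snippet itself is an assumption; the real page's markup may differ in details):

```python
from bs4 import BeautifulSoup

# Hand-written sample mimicking one joke entry on the list page
sample_html = '''
<li class="article-summary">
  <span class="article-title"><a href="/detail58/57681.html">Joke title</a></span>
  <a target="_blank" class="all-read" href="/detail58/57681.html">Read full text</a>
</li>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='all-read')]
print(links)  # -> ['/detail58/57681.html']
```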
from bs4 import BeautifulSoup
import os
import requests

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)
# print(RootCode.text)
Soup = BeautifulSoup(RootCode.text, 'lxml')
SoupList = Soup.find_all('li', class_='article-summary')
for i in SoupList:
    # print(i)
    SubSoup = BeautifulSoup(i.prettify(), 'lxml')
    list2 = SubSoup.find_all('a', target='_blank', class_='all-read')
    for b in list2:
        href = b['href']
        print(href)
With the code above, we successfully obtain the URL suffix of every joke on the first page:
That is, we only need to loop over all the page numbers to get every joke.
The code above, after some cleanup:
from bs4 import BeautifulSoup
import requests
import os

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)

def GetJokeUrl():
    JokeUrlList = []
    Soup = BeautifulSoup(RootCode.text, 'lxml')
    SoupList = Soup.find_all('span', class_='article-title')
    for i in SoupList:
        SubSoup = BeautifulSoup(i.prettify(), 'lxml')
        JokeUrlList.append("http://xiaohua.zol.com.cn/" + str(SubSoup.a['href']))
    return JokeUrlList
After a quick look at the HTML of a joke page, we next fetch the content of every joke on one page:
from bs4 import BeautifulSoup
import requests
import os

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)

def GetJokeUrl():
    JokeUrlList = []
    Soup = BeautifulSoup(RootCode.text, 'lxml')
    SoupList = Soup.find_all('span', class_='article-title')
    for i in SoupList:
        SubSoup = BeautifulSoup(i.prettify(), 'lxml')
        JokeUrlList.append("http://xiaohua.zol.com.cn/" + str(SubSoup.a['href']))
    return JokeUrlList

def GetJokeText(url):
    HtmlCode = requests.get(url, headers=headers)  # don't forget the headers
    Soup = BeautifulSoup(HtmlCode.text, 'lxml')
    Content = Soup.find_all('p')
    for p in Content:
        print(p.text)

def main():
    JokeUrlList = GetJokeUrl()
    for url in JokeUrlList:
        GetJokeText(url)

if __name__ == "__main__":
    main()
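The final script only covers the first list page. One way to cover every page is to generate the list-page URLs (http://xiaohua.zol.com.cn/new/N.html, as seen earlier) and reuse the functions above; below is a sketch, where the maximum page number is assumed known, together with an os-based helper for saving jokes to disk as promised at the start (both helper names are made up for illustration):

```python
import os

BASE = 'http://xiaohua.zol.com.cn/new/'

def build_page_urls(max_page):
    # Page 1 is the bare /new/ index; later pages are /new/2.html, /new/3.html, ...
    urls = [BASE]
    for n in range(2, max_page + 1):
        urls.append('{}{}.html'.format(BASE, n))
    return urls

def save_joke(folder, name, text):
    # Create the output folder if needed, then write one .txt file per joke
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, name + '.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

print(build_page_urls(3))
# -> ['http://xiaohua.zol.com.cn/new/', 'http://xiaohua.zol.com.cn/new/2.html', 'http://xiaohua.zol.com.cn/new/3.html']
```

Each page URL can then be fetched the same way as RootUrl, feeding its HTML through GetJokeUrl and GetJokeText.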
The result looks like this: