First, let's look at the website we need to scrape: http://xiaohua.zol.com.cn/html
pip install requests
pip install beautifulsoup4
pip install lxml
from bs4 import BeautifulSoup
import os
import requests
Import the required libraries; the os library will be used later to save the scraped content.
Next we open "Latest Jokes" and find an "All Jokes" column, which lets us scrape the entire joke history most efficiently!
Let's use the requests library to look at this page's source code:
from bs4 import BeautifulSoup
import os
import requests

all_url = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
all_html = requests.get(all_url, headers=headers)
print(all_html.text)
headers is the request header; on most sites, scraping fails without it.
Part of the output is shown below:
Analyzing the source, we find we still can't get all the jokes' information directly from this page, so we look for an indirect route.
Opening one joke to view the full text, we see the URL becomes http://xiaohua.zol.com.cn/detail58/57681.html. Opening other jokes, we find their URLs all follow the form http://xiaohua.zol.com.cn/detail?/?.html, so we use this pattern as our entry point for scraping all the content.
Our goal is to find every URL of the form http://xiaohua.zol.com.cn/detail?/?.html, then scrape its content.
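As a quick sanity check, this URL pattern can be recognized with a regular expression (a sketch; the sample anchor string below is made up for illustration, not taken from the real page):

```python
import re

# Matches relative links like /detail58/57681.html
detail_pattern = re.compile(r'/detail\d+/\d+\.html')

sample = '<a href="/detail58/57681.html">read more</a>'
match = detail_pattern.search(sample)
print(match.group())  # -> /detail58/57681.html
```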
On the "All Jokes" page we flip to an arbitrary page, say http://xiaohua.zol.com.cn/new/5.html, press F12 to inspect the source, and from the layout we find:
Each joke corresponds to one <li class="article-summary"> tag, and analysis shows that the URL of each joke's full text is hidden in an href attribute; we only need to extract the href to get the joke's URL.
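To illustrate the href extraction in isolation, here is a minimal run against a hand-written snippet that mimics the list-page markup (the snippet itself is an assumption; the real page's markup may differ in details):

```python
from bs4 import BeautifulSoup

# Hand-written sample mimicking one joke entry on the list page
sample_html = '''
<li class="article-summary">
  <span class="article-title"><a href="/detail58/57681.html">Joke title</a></span>
  <a target="_blank" class="all-read" href="/detail58/57681.html">Read full text</a>
</li>
'''

soup = BeautifulSoup(sample_html, 'html.parser')
links = [a['href'] for a in soup.find_all('a', class_='all-read')]
print(links)  # -> ['/detail58/57681.html']
```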
from bs4 import BeautifulSoup
import os
import requests

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)
# print(RootCode.text)
Soup = BeautifulSoup(RootCode.text, 'lxml')
SoupList = Soup.find_all('li', class_='article-summary')
for i in SoupList:
    # print(i)
    SubSoup = BeautifulSoup(i.prettify(), 'lxml')
    list2 = SubSoup.find_all('a', target='_blank', class_='all-read')
    for b in list2:
        href = b['href']
        print(href)
With the code above, we successfully obtain the URL suffix of every joke on the first page:
That is, we only need to loop over all the page numbers to get every joke.
The code above, after some cleanup:
from bs4 import BeautifulSoup
import requests
import os

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)

def GetJokeUrl():
    JokeUrlList = []
    Soup = BeautifulSoup(RootCode.text, 'lxml')
    SoupList = Soup.find_all('span', class_='article-title')
    for i in SoupList:
        SubSoup = BeautifulSoup(i.prettify(), 'lxml')
        JokeUrlList.append("http://xiaohua.zol.com.cn/" + str(SubSoup.a['href']))
    return JokeUrlList
After a quick look at the HTML of a joke page, we next fetch the content of every joke on one page:
from bs4 import BeautifulSoup
import requests
import os

RootUrl = 'http://xiaohua.zol.com.cn/new/'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
RootCode = requests.get(RootUrl, headers=headers)

def GetJokeUrl():
    JokeUrlList = []
    Soup = BeautifulSoup(RootCode.text, 'lxml')
    SoupList = Soup.find_all('span', class_='article-title')
    for i in SoupList:
        SubSoup = BeautifulSoup(i.prettify(), 'lxml')
        JokeUrlList.append("http://xiaohua.zol.com.cn/" + str(SubSoup.a['href']))
    return JokeUrlList

def GetJokeText(url):
    HtmlCode = requests.get(url, headers=headers)  # don't forget the headers
    Soup = BeautifulSoup(HtmlCode.text, 'lxml')
    Content = Soup.find_all('p')
    for p in Content:
        print(p.text)

def main():
    JokeUrlList = GetJokeUrl()
    for url in JokeUrlList:
        GetJokeText(url)

if __name__ == "__main__":
    main()
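The final script only covers the first list page. One way to cover every page is to generate the list-page URLs (http://xiaohua.zol.com.cn/new/N.html, as seen earlier) and reuse the functions above; below is a sketch, where the maximum page number is assumed known, together with an os-based helper for saving jokes to disk as promised at the start (both helper names are made up for illustration):

```python
import os

BASE = 'http://xiaohua.zol.com.cn/new/'

def build_page_urls(max_page):
    # Page 1 is the bare /new/ index; later pages are /new/2.html, /new/3.html, ...
    urls = [BASE]
    for n in range(2, max_page + 1):
        urls.append('{}{}.html'.format(BASE, n))
    return urls

def save_joke(folder, name, text):
    # Create the output folder if needed, then write one .txt file per joke
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, name + '.txt'), 'w', encoding='utf-8') as f:
        f.write(text)

print(build_page_urls(3))
# -> ['http://xiaohua.zol.com.cn/new/', 'http://xiaohua.zol.com.cn/new/2.html', 'http://xiaohua.zol.com.cn/new/3.html']
```

Each page URL can then be fetched the same way as RootUrl, feeding its HTML through GetJokeUrl and GetJokeText.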
The result looks like this: