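BeautifulSoup parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(text, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(text, "lxml") | Fast; tolerant of malformed documents | Requires the lxml C library to be installed |
lxml XML parser | BeautifulSoup(text, "xml") | Fast; the only parser that supports XML | Requires the lxml C library to be installed |
html5lib | BeautifulSoup(text, "html5lib") | Produces valid HTML5 documents | Slow; must be installed as an external Python package |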
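The differences in the table are easiest to see by feeding the same broken markup to each parser. Here is a minimal sketch, assuming `bs4` is installed; `parse_with` is a hypothetical helper name, and parsers that are not installed (lxml, html5lib) are simply reported as `None` rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with(markup, parser):
    """Return the markup as repaired by `parser`, or None if that parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser library is missing
        return None


if __name__ == "__main__":
    broken = "<a></p>"  # an opening tag followed by a stray closing tag
    for parser in ("html.parser", "lxml", "html5lib"):
        print(parser, "->", parse_with(broken, parser))
```

With `html.parser` the stray `</p>` is dropped and only `<a></a>` remains. According to the Beautiful Soup documentation, lxml and html5lib additionally wrap the result in `<html>` and `<body>` elements, so the same page can produce different trees depending on the parser you pass.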
Exercise 1: Scrape the articles and save them locally (one HTML file per article)
wordpress-edu-3autumn.localprod.forc.work
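
    import requests
    from bs4 import BeautifulSoup

    # Fetch the blog index page and parse it
    index = requests.get('https://wordpress-edu-3autumn.localprod.forc.work/')
    soup = BeautifulSoup(index.text, 'html.parser')
    # Each article title is an <h2 class="entry-title"> wrapping a link
    for heading in soup.find_all('h2', class_='entry-title'):
        link = heading.find('a')
        print(link.text)
        # Fetch the article page before opening the output file
        article = BeautifulSoup(requests.get(link['href']).text, 'lxml')
        # One file per article, named after its title
        with open('{}.html'.format(link.text), 'w', encoding='utf8') as file:
            file.write(str(article.find('div', class_='entry-content')))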
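Exercise 1 uses the article title directly as the file name, which breaks as soon as a title contains a character such as `/` or `?`. One possible way around this is sketched below; `safe_filename` is a hypothetical helper, not part of requests or BeautifulSoup, and the set of characters it replaces is an assumption that covers the usual Windows and POSIX restrictions.

```python
import re


def safe_filename(title, replacement="_"):
    """Turn an article title into a string that is safe to use as a file name."""
    # Replace characters that Windows or POSIX file systems reject
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', replacement, title)
    # Windows does not allow names that end in a dot or a space
    cleaned = cleaned.strip().rstrip(". ")
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"


print(safe_filename("AC/DC: What's new?") + ".html")
```

In the exercise you would then open `safe_filename(title) + '.html'` instead of formatting the raw title into the path.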
Exercise 2: Scrape the book titles and prices under each category and save them to books.txt
books.toscrape.com
Final result...

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = 'http://books.toscrape.com/'
    soup = BeautifulSoup(requests.get(BASE_URL).text, 'html.parser')
    with open('books.txt', 'w', encoding='utf8') as file:
        # The nested <ul> in the sidebar holds one <li> per category
        for category in soup.find('ul', class_='nav nav-list').find('ul').find_all('li'):
            file.write(category.text.strip() + '\n')
            # Category links on the home page are relative to the site root
            res = requests.get(BASE_URL + category.find('a')['href'])
            res.encoding = 'utf8'  # force UTF-8 so the £ sign decodes correctly
            page = BeautifulSoup(res.text, 'html.parser')
            # Each book on the category's first page is one grid <li>
            for book in page.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
                title = book.find('h3').find('a')['title']
                print(title)
                file.write('\t"{}" {}\n'.format(title, book.find('p', class_='price_color').text))
Travel
"It's Only the Himalayas" £45.17
"Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond" £49.43
"See America: A Celebration of Our National Parks & Treasured Sites" £48.87
"Vagabonding: An Uncommon Guide to the Art of Long-Term World Travel" £36.94
"Under the Tuscan Sun" £37.33
"A Summer In Europe" £44.34
"The Great Railway Bazaar" £30.54
"A Year in Provence (Provence #1)" £56.88
"The Road to Little Dribbling: Adventures of an American in Britain (Notes From a Small Island #2)" £23.21
"Neither Here nor There: Travels in Europe" £38.95
"1,000 Places to See Before You Die" £26.08
Mystery
"Sharp Objects" £47.82
"In a Dark, Dark Wood" £19.63
"The Past Never Ends" £56.50
"A Murder in Time" £16.64
"The Murder of Roger Ackroyd (Hercule Poirot #4)" £44.10
"The Last Mile (Amos Decker #2)" £54.21
"That Darkness (Gardiner and Renner #1)" £13.92
"Tastes Like Fear (DI Marnie Rome #3)" £10.69
"A Time of Torment (Charlie Parker #14)" £48.35
"A Study in Scarlet (Sherlock Holmes #1)" £16.73
"Poisonous (Max Revere Novels #3)" £26.80
"Murder at the 42nd Street Library (Raymond Ambler #1)" £54.36
"Most Wanted" £35.28
"Hide Away (Eve Duncan #20)" £11.84
"Boar Island (Anna Pigeon #19)" £59.48
"The Widow" £27.26
"Playing with Fire" £13.71
"What Happened on Beale Street (Secrets of the South Mysteries #2)" £25.37
"The Bachelor Girl's Guide to Murder (Herringford and Watts Mysteries #1)" £52.30
"Delivering the Truth (Quaker Midwife Mystery #1)" £20.89
Historical Fiction
"Tipping the Velvet" £53.74
"Forever and Forever: The Courtship of Henry Longfellow and Fanny Appleton" £29.69
"A Flight of Arrows (The Pathfinders #2)" £55.53
"The House by the Lake" £36.95
"Mrs. Houdini" £30.25
"The Marriage of Opposites" £28.08
"Glory over Everything: Beyond The Kitchen House" £45.84
"Love, Lies and Spies" £20.55
"A Paris Apartment" £39.01
"Lilac Girls" £17.28
"The Constant Princess (The Tudor Court #1)" £16.62
"The Invention of Wings" £37.34
"World Without End (The Pillars of the Earth #2)" £32.97
"The Passion of Dolssa" £28.32
"Girl With a Pearl Earring" £26.77
"Voyager (Outlander #3)" £21.07
"The Red Tent" £35.66
"The Last Painting of Sara de Vos" £55.55
"The Guernsey Literary and Potato Peel Pie Society" £49.53
"Girl in the Blue Coat" £46.83
......
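Exercise 2 builds each category URL by gluing the site root and the `href` together, which only works while every link is relative to the root. A slightly more robust option is `urllib.parse.urljoin` from the standard library, which resolves a link against the page it was found on. The sketch below makes no network requests, and the `href` values are illustrative assumptions about what the site's links look like.

```python
from urllib.parse import urljoin

BASE_URL = "http://books.toscrape.com/"

# A category link as it might appear on the home page (relative to the site root)
category_url = urljoin(BASE_URL, "catalogue/category/books/travel_2/index.html")
print(category_url)

# Links found on the category page are relative to that page, not to the root
next_page_url = urljoin(category_url, "page-2.html")
book_url = urljoin(category_url, "../../../its-only-the-himalayas_981/index.html")
print(next_page_url)
print(book_url)
```

Because `urljoin` takes the current page URL as its first argument, the same call works for the home page, for category pages, and for any "next page" link you decide to follow later.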