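1. Crawling web pages: experimenting with individual pages
Small scale, small data volume
Not sensitive to crawl speed
Requests library
Share of use: >90%
2. Crawling websites: crawling a site or a series of sites
Medium scale, larger data volume
Sensitive to crawl speed
Scrapy library
3. Crawling the whole web
Large scale, search engines
Crawl speed is critical
Custom development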
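1. Web crawlers as a performance nuisance
Web servers are built by default to serve human visitors
Depending on the author's skill and intent, a web crawler can impose a heavy resource load on a web server
2. Legal risks of web crawlers
Data on a server has an owner
Profiting from data obtained by a crawler carries legal risk
3. Privacy leaks caused by web crawlers
A web crawler may be able to get past simple access controls and obtain protected data,
thereby leaking personal privacy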
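One common way to reduce the server load described in point 1 is to enforce a minimum delay between requests. The sketch below is only an illustration: the class name Throttle, the min_interval parameter, and the injectable clock and sleep arguments are assumptions made for this example, not part of any library.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep just long enough that min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would call wait() immediately before each request; the right interval depends on the site and is left to the reader.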
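1. Source screening: restrict access by User-Agent
Check the User-Agent field of each incoming HTTP request header and respond only to browsers or friendly crawlers
2. Public notice: the Robots protocol
Tell all crawlers the site's crawling policy and ask them to comply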
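The User-Agent screening in point 1 can be sketched on the server side roughly as follows. The allow-list ALLOWED_TOKENS and the function name is_allowed are hypothetical, chosen only for illustration; a real site would pick its own list and would usually combine this check with other signals, since the header is easy to forge.

```python
# Hypothetical allow-list: substrings that identify browsers or friendly crawlers.
ALLOWED_TOKENS = ("mozilla", "googlebot", "baiduspider")


def is_allowed(headers):
    """Return True if the request's User-Agent contains an allowed token."""
    # HTTP header names are case-insensitive, so normalise the keys first.
    normalised = {name.lower(): value for name, value in headers.items()}
    user_agent = normalised.get("user-agent", "").lower()
    return any(token in user_agent for token in ALLOWED_TOKENS)
```

A request with no User-Agent header is rejected under this sketch, because the empty string contains none of the tokens.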
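1. The Robots protocol
Robots Exclusion Standard (the standard for excluding web crawlers)
Purpose: a site tells web crawlers which pages may be crawled and which may not
Form: a robots.txt file in the site's root directory
2. Basic syntax of the Robots protocol
# starts a comment, * stands for all, / stands for the root directory
User-agent: *
Disallow: /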
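To see how these two directives are interpreted, the rules can be fed to Python's standard urllib.robotparser module. This is a minimal sketch: the helper name allowed, the agent name MyCrawler, and the example.com URL are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_text, agent, url):
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, url)


# "Disallow: /" closes the whole site to every crawler.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
# An empty Disallow value places no restriction at all.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

blocked = allowed(BLOCK_ALL, "MyCrawler", "https://example.com/index.html")
open_site = allowed(ALLOW_ALL, "MyCrawler", "https://example.com/index.html")
```

Here blocked is False and open_site is True, which matches the meaning of the two forms of the Disallow line.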
3. Example: JD.com's Robots protocol
File location: https://www.jd.com/robots.txt
File contents:
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*

User-agent: EtaoSpider
Disallow: /

User-agent: HuihuiSpider
Disallow: /

User-agent: GwdangSpider
Disallow: /

User-agent: WochachaSpider
Disallow: /
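Rules such as /pop/*.html rely on the * wildcard. Under RFC 9309, * matches any sequence of characters, a trailing $ anchors the end of the path, and a rule otherwise matches as a prefix. Wildcard handling varies between parsers, so the sketch below shows one way to test a path against such a rule by hand. The function names rule_to_regex and rule_matches are hypothetical.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern, path):
    """Return True if `path` (including any query string) matches the rule."""
    # re.match anchors at the start, which gives the prefix semantics of robots rules.
    return rule_to_regex(pattern).match(path) is not None
```

For example, rule_matches("/pop/*.html", "/pop/123.html") is True, while rule_matches("/?*", "/index.html") is False because the path does not begin with "/?".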
4. Real-world Robots protocols
1). https://www.sina.com/robots.txt
User-agent: *
Disallow:
2). http://www.baidu.com/robots.txt
User-agent: Baiduspider
Disallow: /baidu
Disallow: /s?
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Googlebot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: MSNBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Baiduspider-image
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: YoudaoBot
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou web spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou inst spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou spider2
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou blog
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou News Spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sogou Orion spider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: ChinasoSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: Sosospider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: yisouspider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: EasouSpider
Disallow: /baidu
Disallow: /s?
Disallow: /shifen/
Disallow: /homepage/
Disallow: /cpro
Disallow: /ulink?
Disallow: /link?
Disallow: /home/news/data/

User-agent: *
Disallow: /
3).http://news.sina.com.cn/robots.txt
User-agent: *
Disallow: /wap/
Disallow: /iframe/
Disallow: /temp/
4).https://www.qq.com/robots.txt
User-agent: *
Disallow:
Sitemap: http://www.qq.com/sitemap_index.xml
5).https://news.qq.com/robots.txt
User-agent: *
Disallow:
Sitemap: http://www.qq.com/sitemap_index.xml
Sitemap: http://news.qq.com/topic_sitemap.xml
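1. Using the Robots protocol
Web crawlers: identify robots.txt automatically or manually, then crawl the content
Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk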
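The "identify robots.txt, then crawl" workflow can be sketched as follows. Because robots.txt sits in the site's root directory, its address can be derived from any page URL. The function names robots_url and fetch_if_allowed, and the fetch callable passed in by the caller, are assumptions for illustration; the robots.txt text is supplied as a string so the sketch stays independent of any particular HTTP library.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url):
    """Build the robots.txt URL for the site that hosts `page_url`."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def fetch_if_allowed(page_url, agent, robots_text, fetch):
    """Call `fetch(page_url)` only if the robots.txt rules permit it for `agent`.

    Returns None when the rules disallow the page.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    if not parser.can_fetch(agent, page_url):
        return None
    return fetch(page_url)
```

In practice the caller would first download the text at robots_url(page_url), then pass it in together with whatever download function the crawler uses.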
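2. Understanding the Robots protocol
1). Crawling web pages: experimenting with individual pages
Low traffic: compliance is optional
Higher traffic: compliance is recommended
2). Crawling websites: crawling a site or a series of sites
Non-commercial and occasional: compliance is recommended
Commercial interest: compliance is required
3). Crawling the whole web
Compliance is required
4). Principle: access that behaves like a human visitor need not consult the Robots protocol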