1. The Scrapy framework
What happens at each step:
step 1: The Engine gets the initial requests to crawl from the Spider.
step 2: The Engine schedules the requests in the Scheduler: it hands the initial requests to the Scheduler and asks for the next request to crawl.
step 3: The Scheduler returns the next request to the Engine.
step 4: The Engine sends the request to the Downloader, passing through the downloader middleware.
step 5: Once the page finishes downloading, the Downloader generates a response (with that page) and sends it back to the Engine, again through the downloader middleware.
step 6: The Engine receives the response from the Downloader and sends it to the Spider for processing, passing through the spider middleware.
step 7: The Spider processes the response and returns the scraped data and new requests (to follow) to the Engine through the spider middleware.
step 8: The Engine sends the processed data to the item pipeline, sends the processed requests to the Scheduler, and asks for the next requests to crawl, if any.
step 9: The process repeats (from step 1) until there are no more requests from the Scheduler.
每個元件的做用:
引擎 Engine:負責控制系統全部組件之間的數據流,並在某些操做發生時觸發事件
調度器:調度器接收來自引擎的請求,並對它們進行排隊,以便稍後在引擎請求它們時將請求對象提供給引擎。
下載器:下載器負責獲取web頁面並將響應內容提供給引擎,而引擎又將響應內容提供給爬蟲器。
爬蟲器:spider是由Scrapy用戶編寫的自定義類,用於解析響應並從響應或後續請求中提取數據(也稱爲抓取數據)。
管道:項目管道負責處理爬蟲提取(或處理)後的數據。典型的任務包括清理數據、驗證和持久性(好比在數據庫中存儲)。
下載器中間件:Downloader中間件是位於引擎和Downloader之間的特定鉤子,當請求從引擎傳遞到Downloader以及響應從Downloader傳遞到引擎時,處理引擎和下載器中間的請求和響應,並進行傳遞。
爬蟲中間件:Spider中間件是位於引擎和Spider之間的特定鉤子,可以處理Spider輸入(響應)和輸出(數據和請求)
Translated from: https://doc.scrapy.org/en/master/topics/architecture.html
About the framework: Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it uses non-blocking (i.e. asynchronous) code for concurrency.
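To make the flow concrete, here is a minimal spider sketch; the quotes.toscrape.com URL and the CSS selectors are illustrative assumptions, not something from this post, and the comments map each piece onto the steps above.

import scrapy


class QuotesSpider(scrapy.Spider):
    """Hypothetical minimal spider, used only to illustrate the data flow."""
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/page/1/"]  # step 1: initial requests

    def parse(self, response):
        # steps 6-7: the Engine hands the downloaded response to the Spider,
        # which extracts data and new requests
        for quote in response.css("div.quote"):
            # scraped item -> step 8: sent to the item pipeline
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            # a new request goes back to the Scheduler (steps 2-3) and then to
            # the Downloader through the downloader middleware (steps 4-5)
            yield response.follow(next_page, callback=self.parse)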
2. Can Scrapy use an HTTP proxy?
Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP proxy downloader middleware. See: HttpProxyMiddleware.
Method 1: add the following code to middlewares.py.

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        if request.url.startswith("http://"):
            request.meta['proxy'] = "http://" + '127.0.0.0:8000'   # HTTP proxy
        elif request.url.startswith("https://"):
            request.meta['proxy'] = "https://" + '127.0.0.0:8000'  # HTTPS proxy

Then enable it in settings.py:

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'biquge.middlewares.ProxyMiddleware': 100,
}
Method 2: do it in the spider itself by overriding start_requests() and passing the proxy through the meta dict.

import scrapy


class ProxySpider(scrapy.Spider):
    name = 'proxy'
    allowed_domains = ["httpbin.org"]

    def start_requests(self):
        url = 'http://httpbin.org/get'
        proxy = '127.0.0.0:8000'
        proxies = ""
        if url.startswith("http://"):
            proxies = "http://" + str(proxy)
        elif url.startswith("https://"):
            proxies = "https://" + str(proxy)
        # Note: the meta key must be 'proxy' (nothing else works) and the value
        # must be a string; no other type is accepted.
        yield scrapy.Request(url, callback=self.parse, meta={'proxy': proxies})

    def parse(self, response):
        print(response.text)
The CrawlSpider class
rules:
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one is used, according to the order in which they are defined in this attribute. Extracted URLs are deduplicated automatically.
Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
Baidu Baike example: deep crawling
# -*- coding: utf-8 -*-
import lxml.etree
import scrapy
from scrapy.spiders import CrawlSpider, Rule      # rules for extracting links
from scrapy.linkextractors import LinkExtractor   # extracts links from pages
from bidubaike.items import BidubaikeItem


class BaidubaikeSpider(CrawlSpider):  # CrawlSpider instead of scrapy.Spider
    name = 'baidubaike'
    # URL pattern the rule should match, e.g.
    # https://baike.baidu.com/item/%E5%A5%BD%E5%A5%BD%E8%AF%B4%E8%AF%9D/20361348?secondId=165077&mediaId=mda-hcssahwn1h3mk6zz
    # allowed_domains = ['https://baike.baidu.com/']  # either disable this,
    # or give it a plain domain instead:
    # allowed_domains = ['baike.baidu.com']
    start_urls = ['https://baike.baidu.com/item/Python/407313']

    # Extract links from every page. TODO: since the rule matches the URL of
    # every item page, it also handles "next page" style navigation.
    pagelinks = LinkExtractor(allow=(r"/item/.*"))  # extraction rule
    # follow=True means keep following links found on the followed pages
    rules = [Rule(pagelinks, callback="parse_item", follow=True)]

    # Do not use the built-in parse() method as the callback in a CrawlSpider;
    # use a different name, otherwise the rules stop working.
    def parse_item(self, response):
        html = response.body.decode("utf-8")
        item = BidubaikeItem()
        e = lxml.etree.HTML(html)
        if html is not None:
            title1 = e.xpath("//dd[@class='lemmaWgt-lemmaTitle-title']//h1/text()")
            part_a = "" if len(title1) == 0 else title1[0]
            title2 = e.xpath("//dd[@class='lemmaWgt-lemmaTitle-title']//h2/text()")
            part_b = "" if len(title2) == 0 else title2[0]
            title = part_a + part_b + "\r\n"
            item["title"] = title
            item["url"] = response.url
            yield item
        else:
            item["title"] = None
            item["url"] = None
            yield item
How to configure a random proxy:
Method 1:
In settings.py:

PROXIES = [
    "http://120.84.102.21:9999",
    "http://114.239.150.233:9999",
    "http://112.87.71.174:9999"
]
Create a class in middlewares.py:

import random


class ProxyMiddleWare(object):
    def __init__(self, ip):
        self.ip = ip  # the list of proxies

    @classmethod
    def from_crawler(cls, crawler):
        # read the PROXIES list from settings.py
        return cls(ip=crawler.settings.get("PROXIES"))

    def process_request(self, request, spider):
        ip = random.choice(self.ip)     # pick a random proxy from the list
        request.meta["proxy"] = ip      # attach the proxy to the request
Enable DOWNLOADER_MIDDLEWARES in settings.py and add the class created in middlewares.py:

DOWNLOADER_MIDDLEWARES = {
    'Proxy.middlewares.ProxyDownloaderMiddleware': 543,
    'Proxy.middlewares.ProxyMiddleWare': 543,
}
Simulated login with FormRequest.from_response — Renren example:

# -*- coding: utf-8 -*-
import scrapy


class RenrenloginSpider(scrapy.Spider):
    name = 'renrenlogin'
    allowed_domains = ['www.renren.com']
    start_urls = ['http://www.renren.com/SysHome.do']  # Renren login page

    def parse(self, response):
        """Form login: submit the username and password."""
        # FormRequest.from_response pre-fills the remaining form fields
        # (hidden inputs, tokens) from the login page itself.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"email": "xxx", "password": "xxx"},
            callback=self.parse_person_homepage)

    def parse_person_homepage(self, response):
        homepage_url = "http://www.renren.com/xxx/profile"
        yield scrapy.Request(url=homepage_url, callback=self.parse_user_info)

    def parse_user_info(self, response):
        html = response.body
        with open("renren.html", "w") as f:
            f.write(html.decode("gbk", "ignore"))
Simulated login by carrying cookies — CSDN example:

# -*- coding: utf-8 -*-
import scrapy


class CsdnloginSpider(scrapy.Spider):
    name = 'csdnlogin'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://passport.csdn.net/guide']

    # Log in manually first, capture the cookies with Fiddler, then send them
    # along with the request.
    cookies = {
        "uuid_tt_dd": "xxx", "dc_session_id": "xxx", "smidV2": "xxx", "UN": "xxx",
        "Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac": "xxx",
        "Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac": "xxx",
        "aliyun_webUmidToken": "xxx", "MSG-SESSION": "xxx", "SESSION": "xxx",
        "dc_tos": "xxx", "UserName": "xxx", "UserInfo": "xxx", "UserToken": "xxx",
        "UserNick": "xxx", "AU": "xxx", "BT": "xxx", "p_uid": "xxx",
        "Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac": "xxx"
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, cookies=self.cookies,
                                     callback=self.parse_newpage)

    def parse_newpage(self, response):
        with open("csdn.html", "wb") as f:
            f.write(response.body)
Error seen while crawling the Wenzheng site: Connection to the other side was lost in a non-clean fashion.
Crawling by building a url_list — 陽光問政 (Sunshine Wenzheng)
import re
import lxml.etree
import scrapy
from sun.items import SunItem


class SuninfoSpider(scrapy.Spider):
    name = 'suninfo'
    allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']

    def parse(self, response):
        html_str = response.body.decode("gbk", errors="ignore")
        e = lxml.etree.HTML(html_str)
        # total number of posts
        count_str = e.xpath("//div[@class='pagination']/text()")[-1]
        count = re.findall(r"(\d+)", count_str)[0]
        page_count = int(count) // 30
        url_list = list()
        url = "http://wz.sun0769.com/index.php/question/report?page={}"
        for i in range(0, page_count + 1):
            url_list.append(url.format(i * 30))
        # only crawl the first 10 pages as a test
        for url in url_list[:10]:
            yield scrapy.Request(url=url, callback=self.parse_page_info)

    def parse_page_info(self, response):
        html_str = response.body.decode("gbk", errors="ignore")
        e = lxml.etree.HTML(html_str)
        item = SunItem()
        tr_list = e.xpath("//div[@class='newsHead clearfix']//table[2]//tr")
        for tr in tr_list:
            item["id"] = tr.xpath(".//td[1]/text()")[0]
            item["title"] = tr.xpath(".//td[3]//a[1]/text()")[0]
            person = tr.xpath(".//td[5]/text()")
            item["name"] = "MISS" if len(person) == 0 else person[0]
            item["time"] = tr.xpath(".//td[6]/text()")[0]
            yield item
Letting CrawlSpider follow the page links automatically — 陽光問政
# -*- coding: utf-8 -*-
import re
import lxml.etree
import scrapy
from sun.items import SunItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SuninfoSpider(CrawlSpider):
    name = 'suninfo'
    # allowed_domains = ['wz.sun0769.com']
    start_urls = ['http://wz.sun0769.com/index.php/question/report?page=0']
    # pages look like http://wz.sun0769.com/index.php/question/report?page=30
    # the extraction regex is configured on the LinkExtractor
    pagelink = LinkExtractor(allow=(r"page=.*"))
    rules = [Rule(pagelink, callback="parse_item", follow=True)]

    def parse_item(self, response):
        html_str = response.body.decode("gbk", errors="ignore")
        e = lxml.etree.HTML(html_str)
        item = SunItem()
        tr_list = e.xpath("//div[@class='newsHead clearfix']//table[2]//tr")
        for tr in tr_list:
            item["id"] = tr.xpath(".//td[1]/text()")[0]
            item["title"] = tr.xpath(".//td[3]//a[1]/text()")[0]
            person = tr.xpath(".//td[5]/text()")
            item["name"] = "MISS" if len(person) == 0 else person[0]
            item["time"] = tr.xpath(".//td[6]/text()")[0]
            yield item
How Scrapy deduplicates requests:
Step 1: scrapy/utils/request.py in the framework.
The request_fingerprint function signs the request object with sha1.
How does it decide two requests are the same? When building the fingerprint it hashes the request's method, canonicalized URL and body (headers are only included if explicitly requested).
Request body: what does request.body look like? # TODO
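A quick illustration of request.body (a hedged example, not from the original post): it is a bytes object, empty for a plain GET and holding the urlencoded form data for a FormRequest.

import scrapy

r1 = scrapy.Request("http://httpbin.org/get")
print(r1.method, r1.body)   # GET b''  -> body is empty bytes for a plain GET

r2 = scrapy.FormRequest("http://httpbin.org/post", formdata={"a": "1", "b": "2"})
print(r2.method, r2.body)   # POST b'a=1&b=2'  -> the urlencoded form data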
sha1 hashing:

from hashlib import sha1

fp = sha1()
str1 = "a"
str2 = "b"
str3 = "c"
fp.update(str1.encode())
fp.update(str2.encode())
fp.update(str3.encode())
# Think of it as hashing str1 + str2 + str3 = "abc" in one go: sha1 turns binary
# data of any length into a 160-bit digest, printed as a 40-character hex string.
ret = fp.hexdigest()
print(ret)
# a9993e364706816aba3e25717850c26c9cd0d89d  -> a 40-character hex string
Step 2: the dedup filter, scrapy/dupefilters.py.
A set stores the fingerprints (40-character strings): set(fingerprint1, fingerprint2, fingerprint3, ...).
request_seen — the "have we seen this" check — works on the fingerprint:
if the fingerprint is already in the set, it returns True right away and the request goes no further;
if it is not in the set, it is added to the set and the request is allowed through.
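A minimal sketch of that idea (simplified from RFPDupeFilter, not the exact Scrapy source):

from scrapy.utils.request import request_fingerprint


class SimpleDupeFilter:
    def __init__(self):
        self.fingerprints = set()   # the 40-character sha1 fingerprints seen so far

    def request_seen(self, request):
        fp = request_fingerprint(request)   # sha1 over method + url + body
        if fp in self.fingerprints:
            return True                     # duplicate: the caller should drop it
        self.fingerprints.add(fp)
        return False                        # first time seen: let it through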
Step 3: the scheduler, enqueue_request in scrapy/core/scheduler.py.
In enqueue_request: if the request's dont_filter is False and its fingerprint has already been seen (it is in the set), the request is dropped; otherwise it is pushed onto the request queue.
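Roughly, the decision looks like this (a simplified sketch of a Scheduler method, not the exact Scrapy source; self.df stands for the dupefilter and self.queue for the request queue):

def enqueue_request(self, request):
    if not request.dont_filter and self.df.request_seen(request):
        return False              # duplicate and filtering is allowed: drop it
    self.queue.push(request)      # otherwise put it on the request queue
    return True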
Detailed code walkthrough: https://blog.csdn.net/Mr__lqy/article/details/85859361
Writing the log to a file
In settings.py:

LOG_FILE = "0769.log"
LOG_LEVEL = "DEBUG"

Nothing is printed to the console; all output goes into the log file named 0769.log.
Scraping JSON data with Scrapy
Image download method 1: urllib.request
import urllib.request


class DouyumeinvPipeline(object):
    def process_item(self, item, spider):
        url = item["imgurl"]
        name = item["id"]
        # urlretrieve fetches the resource and saves it straight to the given path
        urllib.request.urlretrieve(url, "pic/{}.jpg".format(name))
        return item
Image download method 2: ImagesPipeline
# In settings.py:
ITEM_PIPELINES = {
    'douyuimage.pipelines.ImagePipeLine': 1,
}
# All images are saved into a "full" folder by default, and that folder lives
# under the IMAGES_STORE path.
IMAGES_STORE = r'C:\Users\Administrator\Desktop\img'
IMAGES_EXPIRES = 90

# In pipelines.py:
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


class ImagePipeLine(ImagesPipeline):
    def get_media_requests(self, item, info):
        # every item passes through here; the pipeline issues a request for the
        # image and hands the download result to item_completed
        url = item["img_url"]
        yield scrapy.Request(url)

    def item_completed(self, results, item, info):
        # results = [(True, {"path": "full/xxxxxx.jpg", ...})]
        image_path = [x['path'] for ok, x in results if ok]  # e.g. full/xxxx.jpg
        if not image_path:
            raise DropItem("Item contains no images")
        item['image_path'] = image_path
        return item

# In items.py:
class DouyuimageItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    img_url = scrapy.Field()
    image_path = scrapy.Field()

# In the spider:
import json
from douyuimage.items import DouyuimageItem


class ImgdownSpider(scrapy.Spider):
    name = 'imgdown'
    allowed_domains = ['www.douyu.com']
    current_page = 1
    start_urls = ['https://www.douyu.com/gapi/rknc/directory/yzRec/{}'.format(current_page)]

    def parse(self, response):
        item = DouyuimageItem()
        data = json.loads(response.body.decode())["data"]
        info_dict_list = data["rl"]
        for info_dict in info_dict_list:
            item["id"] = info_dict["rid"]
            item["name"] = info_dict["nn"]
            item["img_url"] = info_dict["rs16"]
            yield item

# Example of what gets scraped:
{'id': 6837192,
 'image_path': ['full/e63847f2205b92e0a6c46c17d68e37f708948337.jpg'],
 'img_url': 'https://rpic.douyucdn.cn/asrpic/190916/6837192_6066111_26880_2_1546.jpg',
 'name': '十三金丶'}
Paginating through data with Scrapy
Pagination method 1: increment the page number. The snippet below inspects one list page with requests + BeautifulSoup; a Scrapy sketch of the same page-increment idea follows it.
import requests
from bs4 import BeautifulSoup

url = "https://blog.csdn.net/itcastcpp/article/list/1?"
html_content = requests.get(url).content
# parse the page bytes as html5
soup = BeautifulSoup(html_content, "html5lib")
# select the post blocks by their class attribute
div_list = soup.find_all("div", class_="article-item-box csdn-tracking-statistics")
for div in div_list:
    # text of the <a> tag under each <h4>
    title = (div.find_all("h4")[0]).find_all("a")[0]
    print(title.get_text())

# Output:
# 原 兄弟連區塊鏈 Go 學習大綱-取得大綱試看視頻聯繫微信yinchengak48
# 原 尹成學院golang學習快速筆記(2)表達式
# 原 尹成學院golang學習快速筆記(1)類型
# 原 區塊鏈交易所基礎開發(1)經過接口查詢區塊鏈各個幣種的提幣狀況-ada
# 原 Golang精編100題-搞定golang面試
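The same page-increment idea inside a Scrapy spider, as a minimal sketch (the spider name, XPath and stop condition are assumptions for illustration, not code from this post):

import scrapy


class BlogListSpider(scrapy.Spider):
    # Hypothetical spider that walks the article list by bumping the page number.
    name = 'bloglist'
    allowed_domains = ['blog.csdn.net']
    page = 1
    start_urls = ['https://blog.csdn.net/itcastcpp/article/list/1']

    def parse(self, response):
        titles = response.xpath(
            "//div[@class='article-item-box csdn-tracking-statistics']//h4/a//text()"
        ).extract()
        titles = [t.strip() for t in titles if t.strip()]
        for title in titles:
            yield {"title": title}
        # increment the page number and stop when a page comes back empty
        if titles:
            self.page += 1
            next_url = 'https://blog.csdn.net/itcastcpp/article/list/{}'.format(self.page)
            yield scrapy.Request(next_url, callback=self.parse)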
Three ways to carry cookies in Scrapy
Method 1: configure it in settings.py
Step 1: uncomment DEFAULT_REQUEST_HEADERS in settings.py and add a "Cookie": "xxx" entry copied from the browser:

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "Cookie": "xxx",
}

Step 2: uncomment # COOKIES_ENABLED = False, i.e. disable Scrapy's own cookie handling so the Cookie header above is used.
Step 3: start the spider.
Method 2: override start_requests in the spider class
# Step 1: override start_requests in spider.py and attach the cookies
def start_requests(self):
    url = self.start_urls[0]
    cookies = {
        "uuid_tt_dd": "xx", "dc_session_id": "xx", "smidV2": "xx", "UN": "xx",
        "Hm_ct_6bcd52f51e9b3dce32bec4a3997715ac": "xx", "acw_tc": "xx",
        "Hm_ct_e5ef47b9f471504959267fd614d579cd": "xx", "__yadk_uid": "xx",
        "firstDie": "xx", "UserName": "xx", "UserInfo": "xx", "UserToken": "xx",
        "UserNick": "xx", "BT": "xx", "p_uid": "xx",
        "Hm_lvt_e5ef47b9f471504959267fd614d579cd": "xx",
        "Hm_lvt_6bcd52f51e9b3dce32bec4a3997715ac": "xx",
        "acw_sc__v3": "xx", "acw_sc__v2": "xx", "dc_tos": "xx",
        "Hm_lpvt_6bcd52f51e9b3dce32bec4a3997715ac": "xx"
    }
    yield scrapy.Request(url, cookies=cookies, callback=self.parse)

Step 2: leave # COOKIES_ENABLED = False commented out (cookies stay enabled by default); do not uncomment it.
Summary: COOKIES_ENABLED = False is commented out by default, which means the framework can use cookies out of the box; in that case the cookies have to be passed via scrapy.Request(url, cookies={}).
If COOKIES_ENABLED = False is switched on, cookies have to be configured in DEFAULT_REQUEST_HEADERS = {} in settings.py instead.
Method 3: set it in the downloader middleware
Step 1: uncomment DOWNLOADER_MIDDLEWARES in settings.py, because the change is made in that middleware class:

DOWNLOADER_MIDDLEWARES = {
    'blog.middlewares.BlogDownloaderMiddleware': 543,
}

Step 2: in middlewares.py, find the matching BlogDownloaderMiddleware and change

    def process_request(self, request, spider):
        return None

to:

    def process_request(self, request, spider):
        cookie_from_web = "the Cookie string pasted from the browser: xx"
        cookies = {}
        for cookie in cookie_from_web.split(";"):
            # strip the space after each ';' so the keys come out clean
            cookies[cookie.split('=')[0].strip()] = cookie.split('=')[1]
        print(cookies)
        request.cookies = cookies
        return None

Step 3: start the spider.
1. To attach cookies to the framework's very first requests, override the start_requests function and pass cookies={} as a parameter of the request object.
2. In the downloader middleware's process_request function, set the attribute on the request object; it must be a dict: request.cookies = {}.
3. Configure the default request headers in settings.py and add the cookie to the default request headers.