Web Scraping in Practice: Simulated Login with the Scrapy Framework (Including Automatic Captcha Recognition via the Alibaba Cloud Service)

(1) Introduction

How it works: we write code that simulates sending a login request to the site, that is, submitting the form that carries the login information (username, password, and so on).

Implementation: when we need to send a POST request while fetching data, we use FormRequest, a subclass of Request. If we want the POST request to be issued at the very start of the crawl, we also have to override the start_requests() method and abandon the default start_urls behaviour (which issues GET requests).
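
For example, here is a minimal sketch of this pattern (the login URL and form field names are placeholders for illustration, not a real site):

# A minimal sketch: override start_requests() so the very first request is a POST.
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # FormRequest submits the data as an application/x-www-form-urlencoded POST body
        yield scrapy.FormRequest(
            url='http://example.com/login',  # placeholder login endpoint
            formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        # continue crawling as a logged-in user from here
        self.logger.info('login response: %s', response.url)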

(2) Simulated Login to Renren (Example 1)

1. Create the project

Run scrapy startproject renren, then cd renren, and create the spider with scrapy genspider spider renren.com.

2. Edit settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for renren project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'renren'

SPIDER_MODULES = ['renren.spiders']
NEWSPIDER_MODULE = 'renren.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'renren (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'renren.middlewares.RenrenSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'renren.middlewares.RenrenDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'renren.pipelines.RenrenPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Edit spider.py

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'
        data = {
            'email': '827832075@qq.com',
            'password': '56571218lu',
        }  # build the form data
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        url2 = 'http://www.renren.com/880792860/profile'
        request = scrapy.Request(url2, callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open('baobeier.html', 'w', encoding='utf-8') as f:  # write the page to a file
            f.write(response.text)

4. Run the spider

#author: "xian"
#date: 2018/6/13
from scrapy import cmdline
cmdline.execute('scrapy crawl spider'.split())

5. Result (we successfully logged in and scraped Bao Bei'er's Renren profile page)

(3) Automatic Captcha Recognition with the Alibaba Cloud Captcha Service (service page: https://market.aliyun.com/products/57126001/cmapi014396.html#sku=yuncode=839600006)

Testing the service: we again use a captcha from the Douban login page for the test:

#author: "xian"
#date: 2018/6/13
from urllib import request
from base64 import b64encode
import requests

captcha_url = 'https://www.douban.com/misc/captcha?id=oL8chJoRiCTIikzwtEECZNGH:en&size=s'

request.urlretrieve(captcha_url, 'captcha.png')  # download the captcha image

recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

formdata = {}
with open('captcha.png', 'rb') as f:
    data = f.read()
    pic = b64encode(data)  # the service expects the image base64-encoded
    formdata['pic'] = pic

appcode = '614a1376aa4340b7a159d551d4eb0179'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Authorization': 'APPCODE ' + appcode,
}

response = requests.post(recognize_url, data=formdata, headers=headers)
print(response.json())  # the result is returned as JSON
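
If the request succeeds, the recognized text can be pulled out of the returned JSON. A minimal sketch, assuming the response nests the recognized text under result and then code (the same path the spider in the next section reads):

result = response.json()
code = result['result']['code']  # recognized captcha text, per the assumed response layout
print(code)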

Output: (the captcha was recognized automatically via the Alibaba Cloud platform)

(4) Using the Alibaba Cloud Service for Captcha Recognition and Simulated Login to Douban

1. Create the project with scrapy startproject douban, then cd douban, and create the spider with scrapy genspider spider douban.com (the project's directory tree is shown below:)

Edit settings.py (the relevant changes are sketched after this list):

Do not obey the robots.txt rules

Set the default request headers

Set the download delay between requests
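
A minimal sketch of the corresponding settings.py changes (the same pattern as the Renren project above; the exact delay value is a choice, not a requirement):

# settings.py (douban project): only the lines changed from the generated defaults
ROBOTSTXT_OBEY = False  # do not obey robots.txt

DOWNLOAD_DELAY = 1  # delay between requests to the same site

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}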

Edit spider.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image  # image library (only needed by the manual-recognition fallback below)
from base64 import b64encode  # base64 encoding
import requests


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['douban.com']
    start_urls = ['https://accounts.douban.com/login']  # start url
    login_url = 'https://accounts.douban.com/login'  # login page url
    profile_url = 'https://www.douban.com/people/179834288/'  # personal profile url
    editsignature_url = 'https://www.douban.com/j/people/179834288/edit_signature'  # endpoint url for editing the signature

    def parse(self, response):
        formdata = {
            'source': 'None',
            'redir': 'https://www.douban.com',
            'form_email': '827832075@qq.com',
            'form_password': '56571218lu',
            'remember': 'on',
            'login': '登陸',
        }  # the fixed part of the form data
        captcha_url = response.css('img#captcha_image::attr(src)').get()  # get the captcha image url
        if captcha_url:  # check whether a captcha is present
            captcha = self.regonize_captcha(captcha_url)  # recognize the captcha
            formdata['captcha-solution'] = captcha  # fill in the captcha-solution form field
            captcha_id = response.xpath('//input[@name = "captcha-id"]/@value').get()  # get the captcha-id form field
            formdata['captcha-id'] = captcha_id
        yield scrapy.FormRequest(url=self.login_url, formdata=formdata, callback=self.parse_after_login)  # submit the form data

    def parse_after_login(self, response):  # parse the response after login
        if response.url == 'https://www.douban.com':  # check whether the login succeeded
            yield scrapy.Request(self.profile_url, callback=self.parse_profile)  # on success, request the profile page and hand it to the parsing callback
            print('Login succeeded!')
        else:
            print('Login failed!')

    def parse_profile(self, response):  # parse the personal profile page
        print(response.url)
        if response.url == self.profile_url:  # check that we actually reached the profile page
            ck = response.xpath('//input[@name = "ck"]/@value').get()  # get the value of the ck field
            formdata = {
                'ck': ck,
                'signature': '積土成山,風雨興焉!',
            }  # build the form data
            # Submit the form data. Specify the callback explicitly; if it is omitted, the
            # default callback is parse, which would end up printing the login-failure message.
            yield scrapy.FormRequest(self.editsignature_url, formdata=formdata, callback=self.parse_None)
        else:
            print('Failed to reach the profile page!')

    def parse_None(self, response):
        pass

    # Alternative: manual captcha recognition
    # def regonize_captcha(self, image_url):
    #     request.urlretrieve(image_url, 'captcha.png')
    #     image = Image.open('captcha.png')
    #     image.show()
    #     captcha = input('Please enter the captcha: ')
    #     return captcha

    def regonize_captcha(self, image_url):  # recognize the captcha with the Alibaba Cloud service shown above; see the usage guide on the Alibaba Cloud page
        captcha_url = image_url

        request.urlretrieve(captcha_url, 'captcha.png')

        recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

        formdata = {}
        with open('captcha.png', 'rb') as f:
            data = f.read()
            pic = b64encode(data)
            formdata['pic'] = pic

        appcode = '614a1376aa4340b7a159d551d4eb0179'
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Authorization': 'APPCODE ' + appcode,
        }

        response = requests.post(recognize_url, data=formdata, headers=headers)
        result = response.json()
        code = result['result']['code']
        return code

Finally, run the spider project:

Create a main.py to make debugging easier:

#author: "xian"
#date: 2018/6/13
from scrapy import cmdline
cmdline.execute('scrapy crawl spider'.split())

Result (partial):

We can see that the personal signature on my profile page has been changed to the one we set, so the program ran successfully!

(5) Summary

1. To send a POST request in Scrapy, the recommended approach is scrapy.FormRequest with the form data passed in.

2. To send a POST request at the very start of the crawl, override the start_requests() method.
