Web Scraping in Practice: Simulated Login with the Scrapy Framework (Including Automatic Captcha Recognition via the Alibaba Cloud Service)

(1) Introduction

How it works: we write code that simulates sending a login request to the site, that is, submitting the form that carries the login information (username, password, and so on).

Implementation: when we need to send a POST request while fetching data, we use FormRequest, a subclass of Request. If we want the POST request to be issued at the very start of the crawl, we also have to override the start_requests() method and abandon the default start_urls behaviour (which issues GET requests).
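
For example, here is a minimal sketch of this pattern (the login URL and form field names are placeholders for illustration, not a real site):

# A minimal sketch: override start_requests() so the very first request is a POST.
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login_demo'

    def start_requests(self):
        # FormRequest submits the data as an application/x-www-form-urlencoded POST body
        yield scrapy.FormRequest(
            url='http://example.com/login',  # placeholder login endpoint
            formdata={'username': 'user', 'password': 'pass'},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        # continue crawling as a logged-in user from here
        self.logger.info('login response: %s', response.url)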

(2) Simulated Login to Renren (Example 1)

1. Create the project

Run scrapy startproject renren, then cd renren, and create the spider with scrapy genspider spider renren.com.

2. Edit settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for renren project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'renren'

SPIDER_MODULES = ['renren.spiders']
NEWSPIDER_MODULE = 'renren.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'renren (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'renren.middlewares.RenrenSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'renren.middlewares.RenrenDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'renren.pipelines.RenrenPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

3. Edit spider.py

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/']

    def start_requests(self):
        url = 'http://www.renren.com/PLogin.do'
        data = {
            'email': '827832075@qq.com',
            'password': '56571218lu',
        }  # build the form data
        request = scrapy.FormRequest(url, formdata=data, callback=self.parse_page)
        yield request

    def parse_page(self, response):
        url2 = 'http://www.renren.com/880792860/profile'
        request = scrapy.Request(url2, callback=self.parse_profile)
        yield request

    def parse_profile(self, response):
        with open('baobeier.html', 'w', encoding='utf-8') as f:  # write the page to a file
            f.write(response.text)

4. Run the spider

#author: "xian"
#date: 2018/6/13
from scrapy import cmdline
cmdline.execute('scrapy crawl spider'.split())

5. Result (we successfully logged in and scraped Bao Bei'er's Renren profile page)

(3) Automatic Captcha Recognition with the Alibaba Cloud Captcha Service (service page: https://market.aliyun.com/products/57126001/cmapi014396.html#sku=yuncode=839600006)

Testing the service: we again use a captcha from the Douban login page for the test:

#author: "xian"
#date: 2018/6/13
from urllib import request
from base64 import b64encode
import requests

captcha_url = 'https://www.douban.com/misc/captcha?id=oL8chJoRiCTIikzwtEECZNGH:en&size=s'

request.urlretrieve(captcha_url, 'captcha.png')  # download the captcha image

recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

formdata = {}
with open('captcha.png', 'rb') as f:
    data = f.read()
    pic = b64encode(data)  # the service expects the image base64-encoded
    formdata['pic'] = pic

appcode = '614a1376aa4340b7a159d551d4eb0179'
headers = {
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Authorization': 'APPCODE ' + appcode,
}

response = requests.post(recognize_url, data=formdata, headers=headers)
print(response.json())  # the result is returned as JSON
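
If the request succeeds, the recognized text can be pulled out of the returned JSON. A minimal sketch, assuming the response nests the recognized text under result and then code (the same path the spider in the next section reads):

result = response.json()
code = result['result']['code']  # recognized captcha text, per the assumed response layout
print(code)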

Output: (the captcha was recognized automatically via the Alibaba Cloud platform)

(4) Using the Alibaba Cloud Service for Captcha Recognition and Simulated Login to Douban

1. Create the project with scrapy startproject douban, then cd douban, and create the spider with scrapy genspider spider douban.com (the project's directory tree is shown below:)

Edit settings.py (the relevant changes are sketched after this list):

Do not obey the robots.txt rules

Set the default request headers

Set the download delay between requests
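
A minimal sketch of the corresponding settings.py changes (the same pattern as the Renren project above; the exact delay value is a choice, not a requirement):

# settings.py (douban project): only the lines changed from the generated defaults
ROBOTSTXT_OBEY = False  # do not obey robots.txt

DOWNLOAD_DELAY = 1  # delay between requests to the same site

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36',
}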

Edit spider.py

# -*- coding: utf-8 -*-
import scrapy
from urllib import request
from PIL import Image  # image library (only needed by the manual-recognition fallback below)
from base64 import b64encode  # base64 encoding
import requests


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['douban.com']
    start_urls = ['https://accounts.douban.com/login']  # start url
    login_url = 'https://accounts.douban.com/login'  # login page url
    profile_url = 'https://www.douban.com/people/179834288/'  # personal profile url
    editsignature_url = 'https://www.douban.com/j/people/179834288/edit_signature'  # endpoint url for editing the signature

    def parse(self, response):
        formdata = {
            'source': 'None',
            'redir': 'https://www.douban.com',
            'form_email': '827832075@qq.com',
            'form_password': '56571218lu',
            'remember': 'on',
            'login': '登陸',
        }  # the fixed part of the form data
        captcha_url = response.css('img#captcha_image::attr(src)').get()  # get the captcha image url
        if captcha_url:  # check whether a captcha is present
            captcha = self.regonize_captcha(captcha_url)  # recognize the captcha
            formdata['captcha-solution'] = captcha  # fill in the captcha-solution form field
            captcha_id = response.xpath('//input[@name = "captcha-id"]/@value').get()  # get the captcha-id form field
            formdata['captcha-id'] = captcha_id
        yield scrapy.FormRequest(url=self.login_url, formdata=formdata, callback=self.parse_after_login)  # submit the form data

    def parse_after_login(self, response):  # parse the response after login
        if response.url == 'https://www.douban.com':  # check whether the login succeeded
            yield scrapy.Request(self.profile_url, callback=self.parse_profile)  # on success, request the profile page and hand it to the parsing callback
            print('Login succeeded!')
        else:
            print('Login failed!')

    def parse_profile(self, response):  # parse the personal profile page
        print(response.url)
        if response.url == self.profile_url:  # check that we actually reached the profile page
            ck = response.xpath('//input[@name = "ck"]/@value').get()  # get the value of the ck field
            formdata = {
                'ck': ck,
                'signature': '積土成山,風雨興焉!',
            }  # build the form data
            # Submit the form data. Specify the callback explicitly; if it is omitted, the
            # default callback is parse, which would end up printing the login-failure message.
            yield scrapy.FormRequest(self.editsignature_url, formdata=formdata, callback=self.parse_None)
        else:
            print('Failed to reach the profile page!')

    def parse_None(self, response):
        pass

    # Alternative: manual captcha recognition
    # def regonize_captcha(self, image_url):
    #     request.urlretrieve(image_url, 'captcha.png')
    #     image = Image.open('captcha.png')
    #     image.show()
    #     captcha = input('Please enter the captcha: ')
    #     return captcha

    def regonize_captcha(self, image_url):  # recognize the captcha with the Alibaba Cloud service shown above; see the usage guide on the Alibaba Cloud page
        captcha_url = image_url

        request.urlretrieve(captcha_url, 'captcha.png')

        recognize_url = 'http://jisuyzmsb.market.alicloudapi.com/captcha/recognize?type=e'

        formdata = {}
        with open('captcha.png', 'rb') as f:
            data = f.read()
            pic = b64encode(data)
            formdata['pic'] = pic

        appcode = '614a1376aa4340b7a159d551d4eb0179'
        headers = {
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'Authorization': 'APPCODE ' + appcode,
        }

        response = requests.post(recognize_url, data=formdata, headers=headers)
        result = response.json()
        code = result['result']['code']
        return code

Finally, run the spider project:

Create a main.py to make debugging easier:

#author: "xian"
#date: 2018/6/13
from scrapy import cmdline
cmdline.execute('scrapy crawl spider'.split())

Result (partial):

We can see that the personal signature on my profile page has been changed to the one we set, so the program ran successfully!

(5) Summary

1. To send a POST request in Scrapy, the recommended approach is scrapy.FormRequest with the form data passed in.

2. To send a POST request at the very start of the crawl, override the start_requests() method.
