Parameter analysis:
setcookie: the value sent for the auto-login ("remember me") option; it defaults to 0 when the box is unchecked.
__hash__ value: just view the response page source, then extract it with a regular expression.
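As a minimal sketch of that extraction step (the HTML snippet below is a made-up stand-in for the real login page source, which embeds `__hash__` in a JavaScript object; the pattern is the same one the spider uses):

```python
import re

# Hypothetical fragment of the login page source; the real page embeds
# __hash__ inside a JS object near the end of the HTML.
sample = '{"username":"","__hash__":"ab12cd34ef"}'

# Same pattern as the spider below: capture everything between
# '"__hash__":"' and the closing '"}'.
match = re.search(r'"__hash__":"(.+)"}', sample)
hash_code = match.group(1)
print(hash_code)  # → ab12cd34ef
```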
1. workon into your own virtualenv, cd into your projects directory in cmd, run scrapy startproject ganjiwangdenglu, and then you can open that directory in PyCharm.
2. In the PyCharm terminal, run scrapy genspider ganji ganji.com to create the spider; the project layout is shown below.
3. Code details
```python
import re

import scrapy


class GanjiSpider(scrapy.Spider):
    name = 'ganji'
    allowed_domains = ['ganji.com']
    start_urls = ['https://passport.ganji.com/login.php']

    def parse(self, response):
        # Extract the __hash__ value from the page source with a regex
        hash_code = re.search(r'"__hash__":"(.+)"}', response.text).group(1)
        # Captcha URL
        img_url = 'https://passport.ganji.com/ajax.php?dir=captcha&module=login_captcha'
        # Request the captcha image, passing the hash along via meta
        yield scrapy.Request(img_url, callback=self.do_formdata, meta={'hash_code': hash_code})

    def do_formdata(self, response):
        # Save the captcha image locally
        with open('yzm.jpg', 'wb') as f:
            f.write(response.body)
        # Three ways to handle the captcha: 1) save it and type it in by hand,
        # 2) a captcha-solving service, 3) the tesseract module. Here we type it in.
        code = input('Enter the captcha: ')
        # Build the login form
        formdata = {
            'username': 'your_username',
            'password': 'your_password',
            'setcookie': '14',
            'checkCode': code,
            'next': '',
            'source': 'passport',
            '__hash__': response.request.meta['hash_code']  # meta lives on response.request
        }
        login_url = 'https://passport.ganji.com/login.php'
        # Submit the login request
        yield scrapy.FormRequest(url=login_url, formdata=formdata, callback=self.after_login)

    def after_login(self, response):
        print(response.text)
```
4. Run scrapy crawl ganji in the terminal and you're done.
The returned JSON string is parsed as follows:
Note: the settings configuration is not covered again here.
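The source does not show the actual response body, so as an illustration only (the field names `status` and `msg` are assumptions, not Ganji's real API), a login check on a JSON response could look like:

```python
import json

# Hypothetical login response body; the real fields returned by
# passport.ganji.com are not shown in the source.
body = '{"status": 1, "msg": "ok"}'

result = json.loads(body)
if result.get("status") == 1:
    print("login succeeded")
else:
    print("login failed:", result.get("msg"))
```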
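Since the settings are omitted in the source, here is a minimal sketch of what a settings.py for a login spider like this typically needs (all values are assumptions, not taken from the original project):

```python
# settings.py — minimal sketch with assumed values
ROBOTSTXT_OBEY = False       # login endpoints are usually disallowed by robots.txt
COOKIES_ENABLED = True       # keep the session cookie after logging in
DEFAULT_REQUEST_HEADERS = {
    # A browser-like User-Agent; any current browser string works
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
```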