scrapy--Cookies

Date: 2019-11-30


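    Hello everyone. I looked into using cookies a while ago, but after some time away from the topic it took me a while to get back up to speed, so here is a summary. This post builds on "http://www.bubuko.com/infodetail-2233980.html" and adds some problems I ran into myself. I hope it helps. Let's begin!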

1: Some background knowledge first:

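HTTP request methods: GET and POST (how data is submitted)
    GET - requests data from the specified resource.
    POST - submits data to the specified resource for processing.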
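
To make the difference concrete, here is a minimal sketch using Python's standard urllib. The URL and the form values are placeholders (assumptions, not taken from the site used later in this post), and no request is actually sent; the objects are only built and inspected.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, for illustration only.
URL = "http://example.com/login"
FORM = {"number": "alice", "passwd": "secret"}

# GET: the parameters travel in the URL's query string; there is no body.
get_req = Request(URL + "?" + urlencode(FORM))

# POST: the parameters travel in the request body, encoded as bytes.
post_req = Request(URL, data=urlencode(FORM).encode("ascii"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

A login form is normally submitted with POST so that the credentials stay out of the URL.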
    
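Cookie and session (how state is stored)
    1. Cookie data is stored in the client's browser; session data is stored on the server.
    2. Cookies are not very secure: someone can inspect the cookies stored locally and forge them (cookie spoofing).
       Where security matters, use sessions.
    3. Sessions are kept on the server for a certain period of time. As traffic grows, they consume more server resources.
       To reduce the load on the server, use cookies.
    4. A single cookie cannot hold more than 4 KB of data, and many browsers limit a site to at most 20 cookies.
    
    cookie - (name, value, domain, path, secure, expire (expiry time))
            identifies the user by username and password
    session - session_id + cookie
            identifies the user by session_id once the username and password have been submitted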
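
The cookie fields listed above can be seen by parsing a Set-Cookie header with the standard library. This is a small sketch; the header value is made up for illustration, only the attribute names are standard.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header value; only the attribute names are standard.
header = (
    "sessionid=abc123; Domain=example.com; Path=/; Secure; "
    "Expires=Wed, 01 Jan 2030 00:00:00 GMT"
)

cookie = SimpleCookie()
cookie.load(header)

morsel = cookie["sessionid"]
print(morsel.key, morsel.value)             # name and value
print(morsel["domain"], morsel["path"])     # where the cookie is sent
print(morsel["secure"], morsel["expires"])  # HTTPS-only flag and expiry time
```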
    
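    Originally, a session is an abstract concept: to support interrupting and resuming an interaction, developers abstracted the one-to-one exchange between a user agent and a server into a "conversation", from which "conversation state", i.e. the session, was derived.
    A cookie, by contrast, is a concrete thing: a field defined in the HTTP protocol's headers. It can be regarded as a way of implementing sessions that keeps the backend stateless.
    Because a session relies on a session id, it is usually implemented with the help of a cookie.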
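
The relationship between the two can be sketched in a few lines: the server keeps the state and the cookie only carries the key to it. This is an illustrative toy, not how any particular framework stores sessions; the function names and the in-memory dict are assumptions.

```python
import secrets

# Server-side state, keyed by session id. A real service would use a database
# or cache; a dict is enough for the sketch.
SESSIONS = {}

def login(username):
    """Create a session and return the cookie value the server would set."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {"user": username}
    return session_id

def current_user(cookie_value):
    """Look up who is making the request from the cookie it carries."""
    session = SESSIONS.get(cookie_value)
    return session["user"] if session else None

sid = login("alice")              # server replies with Set-Cookie: PHPSESSID=<sid>
print(current_user(sid))          # later requests send the id back
print(current_user("forged-id"))  # unknown ids map to no user
```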

Simulating a browser login with Scrapy

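Scrapy's basic request flow:
    1. start_requests                    iterates over the start_urls list
    2. make_requests_from_url            builds a Request for each address in start_urls (a login needs a POST instead); this helper is deprecated in newer Scrapy releases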
    
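A quick way to log in:
    1. Override start_requests to GET the login page's HTML directly (it shows the account and password fields, how the form is submitted, and where it is submitted to)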
    <form action="#" enctype="application/x-www-form-urlencoded" method="post">
        <tr id="auth_user_email__row">
        <tr id="auth_user_password__row">
        <tr id="auth_user_remember_me__row">
        <tr id="submit_record__row">
            <td class="w2p_fw">
                <input type="submit" value="Log In" class="btn">
                <button class="btn w2p-form-button" onclick="window.location='/places/default/user/register';return false">Register</button>
            </td>
        </tr>
        </tr>
        </tr>
        </tr>
    </form>    

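    2. After start_requests has fetched the page, its callback argument names the method to run once the response arrives; the login username and password are then filled in inside the login method.

    3. parse_login is the callback that runs after the form has been submitted; it verifies whether the login succeeded. Here we simply search the response for the string "Welcome Liu" to confirm a successful login.
    That part is easy to follow. The key line is yield from super().start_requests(): once the login has succeeded, it requests the addresses in start_urls while carrying the post-login cookies.
    The responses to those logged-in requests can then be handled directly in parse.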

Cookie login:

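    start_requests() returns the spider's initial requests. These play the role of start_urls: whatever start_requests() returns replaces the requests that would otherwise be generated from start_urls.

    Request() issues a GET request by default; you can set the URL, cookies, and callback.

    FormRequest.from_response() submits a form via POST. Its first, required argument is the response object of the previous request (the one whose cookies you want to carry over); other arguments include cookies, URL, form data, and so on.

    yield Request() hands a new request back to the crawler for execution.

    Cookie handling when sending requests:
    meta={'cookiejar':1} starts recording cookies in cookie jar 1; put it in the first Request().
    meta={'cookiejar':response.meta['cookiejar']} reuses the cookie jar of the previous response; put it in FormRequest.from_response() when POSTing the credentials.
    meta={'cookiejar':True} uses the authorized cookies to visit pages that require login. It selects the same jar as 1 because True == 1 in Python; passing response.meta['cookiejar'] is the clearer way to write it.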
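
Scrapy's cookie middleware keeps a separate cookie jar for each distinct value of meta['cookiejar'], and the key has to be passed along on every follow-up request. The idea can be imitated with the standard library alone. This is a simplified model for illustration, not Scrapy's actual implementation; the function names are invented.

```python
from collections import defaultdict
from http.cookies import SimpleCookie

# One independent cookie store per key, created on first use.
jars = defaultdict(SimpleCookie)

def remember(jar_key, set_cookie_header):
    """Store the cookies from a response in the jar chosen by jar_key."""
    jars[jar_key].load(set_cookie_header)

def cookie_header(jar_key):
    """Build the Cookie header that the next request using jar_key would send."""
    return "; ".join(f"{m.key}={m.value}" for m in jars[jar_key].values())

remember(1, "PHPSESSID=aaa; path=/")   # session of account 1
remember(2, "PHPSESSID=bbb; path=/")   # session of account 2

print(cookie_header(1))
print(cookie_header(2))
# True == 1 and hash(True) == hash(1), so True selects the same jar as 1.
print(cookie_header(True))
```

Using different keys is what allows one spider to keep several logged-in sessions apart.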

    Getting cookies in Scrapy

    Request cookies
    Cookie = response.request.headers.getlist('Cookie')
    print(Cookie)

    Response cookies
    Cookie2 = response.headers.getlist('Set-Cookie')
    print(Cookie2)
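
getlist() returns the raw header values as a list of bytes. If a name-to-value mapping is more convenient, a small helper like the one below can convert them. The helper name is hypothetical, and the sample value is shaped like the Set-Cookie header shown later in this post.

```python
from http.cookies import SimpleCookie

def cookies_to_dict(header_values):
    """Turn a list of raw Cookie/Set-Cookie header values into {name: value}."""
    result = {}
    for raw in header_values:
        # Scrapy returns header values as bytes; decode before parsing.
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        parsed = SimpleCookie()
        parsed.load(text)
        for name, morsel in parsed.items():
            result[name] = morsel.value
    return result

# Sample value shaped like the output of headers.getlist('Set-Cookie').
print(cookies_to_dict([b"PHPSESSID=0bjvnkfom3q9c9u878f0aauvi4; path=/"]))
```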

2: Now for the code:

spider/pach.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request,FormRequest
from Pach.settings import USER_AGENT
import pdb

class PachSpider(scrapy.Spider):                            # spider class; must inherit from scrapy.Spider
    name = 'pach'                                           # spider name
    allowed_domains = ['edu.iqianyue.com']                  # domain to crawl
    # start_urls = ['http://edu.iqianyue.com/index_user_login.html']     # start URLs; only suitable for requests that need no login, because cookies and similar options cannot be set here

    header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}  # browser User-Agent
    '''
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'keep-alive',
        'Content-Length': '11',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Host': 'edu.iqianyue.com',
        'Origin': 'edu.iqianyue.com',
        #'Referer': 'http://www.dy2018.com/html/tv/oumeitv/index.html',
        'User-Agent': USER_AGENT,
        'X-Requested-With': 'XMLHttpRequest',

    }
    '''

    def start_requests(self):       # use start_requests() instead of start_urls
        """First request: fetch the login page with cookie tracking enabled so that a cookie is obtained, and set the callback."""
        yield Request('http://edu.iqianyue.com/index_user_login.html',meta={'cookiejar':1},callback=self.parse)

    def parse(self, response):      # callback for the login page

        data =  {                    # login form fields, matching those seen in the captured traffic
            'number':'xxxxx',
            'passwd':'xxxx',
            'submit':''
                }
        # Response cookies
        Cookie1 = response.headers.getlist('Set-Cookie')   # inspect the response cookies, i.e. the cookie the server sets on the first visit to the login page
        print('cookie1',Cookie1)
        # Second request: POST the form with the cookie, User-Agent, and login fields so that the cookie becomes authorized
        yield FormRequest.from_response(response,
                                          url='http://edu.iqianyue.com/index_user_login.html',   # actual POST URL
                                          meta={'cookiejar':response.meta['cookiejar']},
                                          headers=self.header,
                                          formdata=data,
                                          callback=self.next,
                                        )
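    def next(self,response):
        """After logging in, request a page that requires login (e.g. the user center), carrying the authorized cookie."""
        # True == 1 in Python, so this selects the same cookie jar that start_requests() opened with 'cookiejar': 1
        yield Request('http://edu.iqianyue.com/index_user_index.html',meta={'cookiejar':True},callback=self.next2)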

    def next2(self,response):
        # Request cookies
        #Cookie2 = response.request.headers.getlist('Cookie')
        #print(Cookie2)
        body = response.body  # page content as bytes
        unicode_body = response.text  # page content as str (replaces the deprecated body_as_unicode())
        a = response.css('div.col-md-4.column ::text').extract()[2]
        #a = response.css('div#panel-54098 h2::text').extract()  # headings on the user-center page
        print(a)

Setting a proxy:

middlewares.py

class PachSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        print('http://10.240:911')
        request.meta['proxy'] = 'http://10.240:911'

settings.py

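USER_AGENT ={       # browser User-Agent strings (note: this literal is a set, while Scrapy's own USER_AGENT setting expects a single string)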
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
}

ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 0.5
#COOKIES_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    #'Pach.middlewares.PachDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,   # scrapy.contrib.* is the old import path from before Scrapy 1.0
    'Pach.middlewares.PachSpiderMiddleware': 125,
}

ITEM_PIPELINES = {
    #'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware':1,   # a downloader middleware, not an item pipeline, so it does not belong here
    #'Pach.pipelines.PachPipeline': 300,
}
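
settings.py above defines a whole collection of User-Agent strings, which suggests rotating them per request. One way to do that is a small downloader middleware. The sketch below is an assumption about how this could look, not code from the original project: the class name is invented, the list is shortened, and a fake request object stands in for a real crawl.

```python
import random
from types import SimpleNamespace

# Stand-in for the USER_AGENT collection in settings.py (shortened here).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: one random User-Agent per request."""

    def __init__(self, user_agents):
        # random.choice needs a sequence, so a set is converted to a list.
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None  # returning None lets the request continue as usual


# Exercise the method with a fake request object instead of a real crawl.
fake_request = SimpleNamespace(headers={})
RandomUserAgentMiddleware(USER_AGENTS).process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

To use such a class inside Scrapy it would additionally need a from_crawler classmethod (or a constructor without arguments) and an entry in DOWNLOADER_MIDDLEWARES.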

3: Points to note

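#COOKIES_ENABLED = False
COOKIES_ENABLED = True
# settings.py with cookies enabled: the post-login page can be fetched, and the "Cookie" header is present in response.request.headers, but "Set-Cookie" does not appear in response.headers (presumably because the request already carries a session cookie, so the server has no reason to issue a new one).

# settings.py with cookies disabled: the post-login page cannot be fetched, response.request.headers has no "Cookie" header, and response.headers does contain "Set-Cookie" (each request arrives without a cookie, so the server starts a new session).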

(Pdb) response.request.headers
{'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'], 'Referer': ['http://edu.iqianyue.com/index_user_login.html'], 'Cookie': ['PHPSESSID=2jhiv5fdb3rips83928mf5gaq2'], 'Content-Type': ['application/x-www-form-urlencoded']}

(Pdb) response.headers
{'Via': ['1.1 shzdmzpr03'], 'X-Powered-By': ['PHP/5.6.36'], 'Expires': ['Thu, 19 Nov 1981 08:52:00 GMT'], 'Server': ['Apache/2.4.6 (CentOS) PHP/5.6.36'], 'Pragma': ['no-cache'], 'Cache-Control': ['no-store, no-cache, must-revalidate, post-check=0, pre-check=0'], 'Date': ['Tue, 09 Oct 2018 05:17:14 GMT'], 'Content-Type': ['text/html; charset=utf-8']}

-> a = response.css('div#panel-54098 h2::text').extract()
(Pdb) c
[u'\r\n\t\t\t\t\t\t\t\t\t\t\t\t\u6211\u7684\u57fa\u672c\u4fe1\u606f\r\n\t\t\t\t\t\t\t\t\t\t\t', u'\r\n\t\t\t\t\t\t\t\t\t\t\t\t\u6211\u5b66\u4e60\u7684\u8bfe\u7a0b\u6570\u91cf\r\n\t\t\t\t\t\t\t\t\t\t\t', u'\r\n\t\t\t\t\t\t\t\t\t\t\t\t\u6211\u8d2d\u4e70\u7684\u8bfe\u7a0b\u6570\u91cf\r\n\t\t\t\t\t\t\t\t\t\t\t']

COOKIES_ENABLED = False
(Pdb) response.request.headers
{'Accept-Language': ['en'], 'Accept-Encoding': ['gzip,deflate'], 'Accept': ['text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], 'User-Agent': ['Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'], 'Referer': ['http://edu.iqianyue.com/index_user_login.html'], 'Content-Type': ['application/x-www-form-urlencoded']}
(Pdb) response.headers
{'Via': ['1.1 shzdmzpr03'], 'X-Powered-By': ['PHP/5.6.36'], 'Set-Cookie': ['PHPSESSID=0bjvnkfom3q9c9u878f0aauvi4; path=/'], 'Expires': ['Thu, 19 Nov 1981 08:52:00 GMT'], 'Server': ['Apache/2.4.6 (CentOS) PHP/5.6.36'], 'Pragma': ['no-cache'], 'Cache-Control': ['no-store, no-cache, must-revalidate, post-check=0, pre-check=0'], 'Date': ['Tue, 09 Oct 2018 05:20:23 GMT'], 'Content-Type': ['text/html; charset=utf-8']}
-> a = response.css('div#panel-54098 h2::text').extract() 
(Pdb) c
[ ]
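
The two transcripts above can be reproduced in miniature with a toy model. It rests on an assumption about the server, namely that it issues a session cookie only when a request arrives without one (the usual behaviour of PHP sessions); the function names are invented, and nothing here talks to a real site.

```python
import itertools

_ids = itertools.count(1)

def server(request_headers):
    """Toy server: issue a session cookie only when the client sends none."""
    if "Cookie" in request_headers:
        return {}  # session already known, nothing to set
    return {"Set-Cookie": f"PHPSESSID=session{next(_ids)}; path=/"}

def crawl(cookies_enabled, n_requests=3):
    """Send n requests; return a list of (request_headers, response_headers)."""
    jar = None
    exchanges = []
    for _ in range(n_requests):
        request_headers = {"Cookie": jar} if (cookies_enabled and jar) else {}
        response_headers = server(request_headers)
        if cookies_enabled and "Set-Cookie" in response_headers:
            # Keep only the name=value part, as a cookie jar would.
            jar = response_headers["Set-Cookie"].split(";")[0]
        exchanges.append((request_headers, response_headers))
    return exchanges

for enabled in (True, False):
    print("COOKIES_ENABLED =", enabled)
    for req, resp in crawl(enabled):
        print("  request:", req, "response:", resp)
```

With cookies enabled, only the first response carries Set-Cookie and every later request carries Cookie. With cookies disabled, every request looks like a new visitor, so the login is never remembered and the protected page comes back empty.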
