Guangxi Internet Finance Platform Series: Logging In and Scraping Dongmengdai (東盟貸) with a Scrapy Spider (Login via POST Required)

Date: 2019-11-12


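1. Background

While scraping, I found that some platforms require a login before the loan details or the loan list can be accessed. Dongmengdai (東盟貸), an online lending platform in Guangxi, is one of them. There are generally two ways to deal with such sites:

1. Combine Scrapy with selenium. This avoids the cookie and login problems altogether, as long as the Chrome window stays open.

2. Simulate the login with Scrapy.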

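2. Goal

This time the data is scraped from Dongmengdai by simulating the login with Scrapy. Only the logic is written here; the detailed extraction steps are left out.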

3. Conditions

1. A login is required

2. No CAPTCHA is needed

3. After logging in, the site redirects automatically to the user center
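Condition 3 is what makes a URL-based login check possible. The spider in section 5 tests `"user" in response.url`, but the login URL itself already contains `user`, so a slightly stricter check may be safer. The following is only a sketch: the helper name `looks_logged_in` is made up for illustration, and the real user-center URL is not given in this post, so the example URL in the usage note is an assumption.

```python
from urllib.parse import urlsplit

# Login endpoint used by the spider in section 5.
LOGIN_URL = "https://www.dongmengdai.com/index.php?user&q=action/login"


def looks_logged_in(final_url: str, login_url: str = LOGIN_URL) -> bool:
    """Hypothetical helper: True if we were redirected to a page that mentions "user"."""
    # Still on the login URL: no redirect happened, so treat it as a failed login.
    if final_url == login_url:
        return False
    parts = urlsplit(final_url)
    # Look for the marker in the path or query string only, not in the host name.
    return "user" in parts.path or "user" in parts.query
```

Inside the spider this could be called as `looks_logged_in(response.url)`. A URL such as `https://www.dongmengdai.com/index.php?user` (assumed, not confirmed) would pass, while the unchanged login URL would not.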

4. Tools

import scrapy

from scrapy.http import FormRequest,Request

from urllib import parse

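5. Example

Event flow:

1. Simulate a browser and send the login request to the site

2. Use the URL reached after login to decide whether the login succeeded

3. If it did, extract the URLs from the list page

4. Pass each extracted loan URL to parse_detail to scrape the detailed data

5. Serialize the data

6. Store it in the database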
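Steps 5 and 6 are not implemented in this post, since parse_detail is left as a stub. A minimal sketch of what they might look like, using only the standard library, is shown here. The `LoanItem` fields, the `loans` table and the choice of SQLite are all assumptions made for illustration; the real fields depend on what parse_detail extracts, and a Scrapy project would more likely do this in an item pipeline.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class LoanItem:
    # Hypothetical fields; not taken from the original post.
    url: str
    title: str
    amount: float
    annual_rate: float


def serialize(item: LoanItem) -> str:
    """Step 5: turn the item into a JSON string (non-ASCII text is kept readable)."""
    return json.dumps(asdict(item), ensure_ascii=False)


def save(conn: sqlite3.Connection, item: LoanItem) -> None:
    """Step 6: store the item, keyed by URL so a re-crawl overwrites the old row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS loans (url TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO loans (url, payload) VALUES (?, ?)",
        (item.url, serialize(item)),
    )
    conn.commit()
```

Keying the table on the URL is one simple way to keep repeated crawls from producing duplicate rows.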

import scrapy
from scrapy.http import FormRequest, Request
from urllib import parse


class DongmengSpider(scrapy.Spider):
    name = 'dongmeng'
    allowed_domains = ['www.dongmengdai.com']
    # Basic request headers: Host, Referer and User-Agent
    header = {
        "Host": "www.dongmengdai.com",
        "Referer": "https://www.dongmengdai.com/view/regphone.php",
        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    }

    def start_requests(self):
        """
        POST the login form with FormRequest.
            The request carries the headers, the user name and the password.
            is_login is called on the response to check whether the login succeeded.
        """
        return [
            FormRequest(
                "https://www.dongmengdai.com/index.php?user&q=action/login",
                headers=self.header,
                # Sample credentials; replace them with a real account
                formdata={"keywords": "13509090909", "password": "123456789"},
                callback=self.is_login
            )]

    def parse(self, response):
        """
        Receive the response for the loan list page.
            Extract each loan's detail URL and pass it to parse_detail.
        """
        rows = response.css('.table-responsive.list-bid .table.margin-no tbody:last-child tr')
        for row in rows:
            href = row.css('td:first-child a::attr(href)').extract_first("")
            if not href:
                # Skip rows without a link; otherwise the site root would be requested
                continue
            yield Request(url=parse.urljoin("https://www.dongmengdai.com", href), callback=self.parse_detail)

    def parse_detail(self, response):
        """ Write the extraction of the loan details here """

        pass

    def is_login(self, response):
        """
        Decide from the response URL whether the login succeeded.
            On success, request the list page and hand it to parse.
        """
        # Note: the login URL itself contains "user", so this check relies on
        # the automatic redirect to the user center described in section 3.
        if "user" in response.url:
            self.logger.info("Login succeeded")
            yield Request(url="https://www.dongmengdai.com/view/Investment_list_che.php", callback=self.parse)
        else:
            self.logger.warning("Login failed, final URL: %s", response.url)
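
The parse method depends on how parse.urljoin resolves the extracted href against the site root. The sketch below isolates that step so it can be checked without running the spider. The helper name `build_detail_url` and the example hrefs are assumptions for illustration only; the real hrefs on the list page are not shown in this post.

```python
from urllib import parse

BASE_URL = "https://www.dongmengdai.com"


def build_detail_url(href: str):
    """Hypothetical helper: resolve an href against the site root.

    Returns None for an empty href, meaning the row should be skipped.
    """
    href = href.strip()
    if not href:
        # urljoin(base, "") would return the base URL itself, which is not a detail page.
        return None
    # Relative and root-relative hrefs are joined to BASE_URL; absolute URLs pass through.
    return parse.urljoin(BASE_URL, href)
```

Returning None for an empty href matters because `extract_first("")` falls back to an empty string when a row has no link, and joining that to the base would make the spider request the home page as if it were a loan.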
