Notes on Scrapy

Date: 2019-11-11


1. Basic Scrapy operations

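scrapy startproject scrapy_redis_spiders  # create the project

cd scrapy_redis_spiders  # enter the project directory

scrapy genspider chouti chouti.com   # create a spider named chouti for the domain chouti.com

scrapy crawl chouti --nolog   # run the spider; --nolog suppresses log output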

Introduction to Scrapy

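Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy is widely used for data mining, monitoring, and automated testing.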

Scrapy uses the Twisted asynchronous networking library to handle network communication. Its overall architecture is built from the following main components:

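Engine (Scrapy)

Handles the data flow across the whole system and triggers events. This is the core of the framework.

Scheduler

Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and also removes duplicate URLs.

Downloader

Downloads page content and returns it to the spiders. The Scrapy downloader is built on Twisted's efficient asynchronous model.

Spiders

Spiders do the main work: they extract the information you need from specific pages, the so-called items (Item). They can also extract links so that Scrapy goes on to crawl the next page.

Item Pipeline

Processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and discarding unwanted information. After a page is parsed by a spider, the resulting items are sent to the item pipeline and processed in a fixed sequence of steps.

Downloader Middlewares

A layer between the Scrapy engine and the downloader. It mainly processes the requests and responses passed between the two.

Spider Middlewares

A layer between the Scrapy engine and the spiders. Its main job is to process the responses going into the spiders and the requests coming out of them.

Scheduler Middlewares

Middleware between the Scrapy engine and the scheduler. It handles the requests and responses passed from the engine to the scheduler.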

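How a Scrapy crawl runs:

The engine takes a link (URL) from the scheduler for the next fetch
The engine wraps the URL in a request (Request) and passes it to the downloader
The downloader fetches the resource and wraps it in a response (Response)
The spider parses the Response
If an item (Item) is parsed out, it is handed to the item pipeline for further processing
If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling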
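
The loop described above can be sketched in plain Python. This is only a toy model under stated assumptions, not Scrapy's actual engine: `FAKE_SITE`, `download`, `parse`, and `crawl` are hypothetical names, the "site" is an in-memory dictionary, and everything runs synchronously rather than on Twisted.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, links on the page)
FAKE_SITE = {
    "https://example.com/": (["item-a"], ["https://example.com/p1", "https://example.com/p2"]),
    "https://example.com/p1": (["item-b"], ["https://example.com/"]),  # links back to the start page
    "https://example.com/p2": (["item-c"], []),
}


def download(url):
    """Stand-in for the downloader: build a fake response for the URL."""
    items, links = FAKE_SITE.get(url, ([], []))
    return {"url": url, "items": items, "links": links}


def parse(response):
    """Stand-in for a spider callback: yield items and follow-up URLs."""
    for item in response["items"]:
        yield ("item", item)
    for link in response["links"]:
        yield ("url", link)


def crawl(start_url):
    queue = deque([start_url])  # the scheduler's queue
    seen = {start_url}          # the scheduler's duplicate filter
    collected = []              # what the item pipeline receives
    while queue:
        url = queue.popleft()                # engine takes the next URL
        response = download(url)             # downloader fetches it
        for kind, value in parse(response):  # spider parses the response
            if kind == "item":
                collected.append(value)      # items go to the pipeline
            elif value not in seen:
                seen.add(value)              # new URLs go back to the scheduler
                queue.append(value)
    return collected


if __name__ == "__main__":
    print(crawl("https://example.com/"))
```

The link from `p1` back to the start page is dropped by the `seen` set, which is the role the scheduler's duplicate removal plays in the description above.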

2. Basic scrapy-redis configuration

In settings.py:

ITEM_PIPELINES = {
   # 'scrapy_redis_spiders.pipelines.ScrapyRedisSpidersPipeline': 300,
   'scrapy_redis.pipelines.RedisPipeline': 300,
   'scrapy_redis_spiders.pipelines.BigfilePipeline': 400,  
}


# ############ Redis connection settings #################
REDIS_HOST = '127.0.0.1'                            # host name
REDIS_PORT = 6379                                   # port
# REDIS_URL = 'redis://user:pass@hostname:9001'       # connection URL (takes precedence over the settings above)
# Redis connection parameters. Default:
# REDIS_PARAMS = {'socket_timeout': 30, 'socket_connect_timeout': 30, 'retry_on_timeout': True, 'encoding': REDIS_ENCODING}
REDIS_PARAMS  = {}
# REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient' # Python class used to connect to Redis; default: redis.StrictRedis
REDIS_ENCODING = "utf-8"

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Run by the engine: custom scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# The default queue is PriorityQueue (a sorted set). The alternatives are
# FifoQueue (a list, first in first out) and LifoQueue (a list, last in first out).
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # Redis key under which the scheduler stores requests
SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for data saved to Redis; pickle by default
SCHEDULER_PERSIST = True  # keep the scheduler queue and dedup records on close: True = keep, False = clear
SCHEDULER_FLUSH_ON_START = False  # clear the scheduler queue and dedup records on start: True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10  # maximum time to wait when the scheduler queue is empty before giving up
SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # Redis key for the dedup records, e.g. chouti:dupefilter
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
DUPEFILTER_DEBUG = False


# Depth and priority
DEPTH_PRIORITY = 1

REDIS_START_URLS_BATCH_SIZE = 1
# REDIS_START_URLS_AS_SET = True # store the start URLs in a Redis set
REDIS_START_URLS_AS_SET = False # store the start URLs in a Redis list
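
Two of the settings above are easier to see with a small example: how the `%(spider)s` key templates expand, and how the three queue types differ in pop order. The sketch below uses plain Python containers as stand-ins; it is not scrapy-redis code, and `pop_order` and the sample requests are hypothetical. In scrapy-redis itself these queues are backed by Redis lists and a sorted set.

```python
import heapq
from collections import deque

# Key templates from the settings, expanded for the spider named "chouti"
requests_key = '%(spider)s:requests' % {'spider': 'chouti'}
dupefilter_key = '%(spider)s:dupefilter' % {'spider': 'chouti'}


def pop_order(kind, requests):
    """Return the URLs of (priority, url) pairs in the order they would be popped."""
    if kind == 'fifo':
        queue = deque(requests)
        return [queue.popleft()[1] for _ in range(len(queue))]
    if kind == 'lifo':
        stack = list(requests)
        return [stack.pop()[1] for _ in range(len(stack))]
    if kind == 'priority':
        # Highest priority first; heapq is a min-heap, so the priority is negated.
        # The index keeps insertion order among equal priorities.
        heap = [(-priority, index, url) for index, (priority, url) in enumerate(requests)]
        heapq.heapify(heap)
        return [heapq.heappop(heap)[2] for _ in range(len(heap))]
    raise ValueError('unknown queue kind: %r' % kind)


if __name__ == "__main__":
    sample = [(0, 'a'), (5, 'b'), (1, 'c')]  # pushed in this order
    print(requests_key, dupefilter_key)
    for kind in ('fifo', 'lifo', 'priority'):
        print(kind, pop_order(kind, sample))
```

With `LifoQueue` selected, as in the settings above, the most recently scheduled request is fetched first, so the crawl tends to go deep before it goes wide.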

The contents of chouti.py in the spiders directory are as follows:

from scrapy.http import Request
from scrapy_redis.spiders import RedisSpider

from ..items import ScrapyRedisSpidersItem


class ChoutiSpider(RedisSpider):

    name = 'chouti'
    allowed_domains = ['chouti.com']
    # Start URLs are read from the Redis key "chouti:start_urls",
    # so start_urls and start_requests are not needed here.
    # start_urls = ['https://dig.chouti.com/']
    # cookies = None

    # def start_requests(self):
    #     # os.environ['HTTP_PROXY'] = "192.168.10.1"
    #
    #     for url in self.start_urls:
    #         yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})
    #         # yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # Entry point for each start URL; response.url is the URL just fetched
        url = response.url

        # The 'cookiejar' meta key names the cookie session to use, so the
        # cookies middleware keeps cookies across the requests that share it
        yield Request(url=url, callback=self.parse_index, meta={'cookiejar': True})

    def parse_index(self, response):
        # Log in to chouti
        # print(response.text)
        req = Request(
            url='https://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'},
            body='phone=86xxxxxxx&password=xxxxxx&oneMonth=1',  # fill in the phone number and password
            callback=self.check_login,
            meta={'cookiejar': True}
        )
        # print(req)
        yield req

    def check_login(self, response):
        # Fetch the home page after logging in
        print('check_login:', response.text)
        res = Request(
            url='https://dig.chouti.com/',
            method='GET',
            # callback=self.parse_check_login,   # use this one for automatic upvoting
            callback=self.parse_images,   # use this one for downloading images
            meta={'cookiejar': True},
            dont_filter=True,  # defaults to False (duplicates are filtered); True exempts this URL from deduplication
        )
        # print("res:", res)
        yield res

    def parse_check_login(self, response):
        # Parse the home page, then upvote every news item automatically
        # print('parse_check_login:', response.text)
        items = response.xpath("//div[@id='content-list']/div[@class='item']")
        for item in items:
            # Each item carries the ID needed for the vote in its share-linkid attribute
            link_id = item.xpath(".//div[@class='part2']/@share-linkid").extract_first()
            if not link_id:
                continue
            yield Request(
                url='https://dig.chouti.com/link/vote?linksId=%s' % link_id,
                method='POST',
                callback=self.parse_show_result,
                meta={'cookiejar': True}  # send the logged-in cookies
            )

    def parse_show_result(self, response):
        print(response.text)

    def parse_images(self, response):
        # Renamed from a second parse() that silently replaced the first one.
        # On the downloaded page, find the news items
        items = response.xpath("//div[@id='content-list']/div[@class='item']")
        for news in items:
            href = news.xpath(".//div[@class='part1']//a[1]/@href").extract_first()  # link of the news item (not used below)
            # img = news.xpath("//div[@class='news-pic']/img/@original").extract_first()
            img = news.xpath(".//div[@class='part2']/@share-pic").extract_first()
            if not img:
                # No picture on this news item, nothing to download
                continue
            # file_name = img.rsplit('//')[1].rsplit('?')[0]
            img_name = img.rsplit('_')[-1]
            file_path = 'images/{0}'.format(img_name)
            # Hand the picture to the large-file download pipeline
            print(img)
            yield ScrapyRedisSpidersItem(url=img, type='file', file_name=file_path)

        # pages = response.xpath("//div[@id='page-area']//a[@class='ct_pagepa']/@href").extract()
        # print(pages)
        # for page_url in pages:
        #     page_url = "http://dig.chouti.com" + page_url
        #     print(page_url)
        #     yield Request(url=page_url, callback=self.parse_images)
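
The login request above passes a hand-written string as the form body. If the phone number or password contains characters such as `&`, `@`, or spaces, that string has to be URL-encoded. A minimal sketch using only the standard library follows; `build_login_body` and the sample credentials are hypothetical, and the field names are the ones used in the spider.

```python
from urllib.parse import parse_qs, urlencode


def build_login_body(phone, password, one_month=1):
    """Build an application/x-www-form-urlencoded body for the login request."""
    return urlencode({'phone': phone, 'password': password, 'oneMonth': one_month})


if __name__ == "__main__":
    # Placeholder credentials for illustration only
    body = build_login_body('8613000000000', 'p@ss word&1')
    print(body)
    print(parse_qs(body))  # decodes back to the original values
```

The resulting string can be passed as `body=` in the `Request`. Scrapy also provides `FormRequest`, whose `formdata` argument takes a dictionary and performs this encoding itself.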

Create an images directory under scrapy_redis_spiders to store the downloaded pictures.
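
The directory can also be created from Python so that the download does not fail when it is missing. A small sketch, assuming the crawl is started from the directory that should contain `images`; `ensure_images_dir` is a hypothetical helper, not part of the project.

```python
import os


def ensure_images_dir(base_dir='.'):
    """Create <base_dir>/images if it does not exist yet and return its path."""
    path = os.path.join(base_dir, 'images')
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    return path


if __name__ == "__main__":
    print(ensure_images_dir())
```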

Configure items.py in the scrapy_redis_spiders directory:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScrapyRedisSpidersItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    url = scrapy.Field()
    img = scrapy.Field()
    type = scrapy.Field()
    file_name = scrapy.Field()

Configure pipelines.py in scrapy_redis_spiders:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

from twisted.web.client import Agent, ResponseDone, PotentialDataLoss

from twisted.internet import defer, reactor, protocol


class ScrapyRedisSpidersPipeline(object):
    def process_item(self, item, spider):
        return item


class _ResponseReader(protocol.Protocol):
    # Works together with BigfilePipeline to download large files

    def __init__(self, finished, txresponse, file_name):
        self._finished = finished
        self._txresponse = txresponse
        self._bytes_received = 0
        self.f = open(file_name, mode='wb')

    def dataReceived(self, bodyBytes):
        self._bytes_received += len(bodyBytes)

        # Write the file piece by piece as the data arrives
        self.f.write(bodyBytes)

        self.f.flush()

    def connectionLost(self, reason):
        # Close the file first so it is released whatever the outcome
        self.f.close()

        if self._finished.called:
            return
        if reason.check(ResponseDone):
            # Download complete
            self._finished.callback((self._txresponse, 'success'))
        elif reason.check(PotentialDataLoss):
            # Only part of the file was downloaded
            self._finished.callback((self._txresponse, 'partial'))
        else:
            # Download failed
            self._finished.errback(reason)


class BigfilePipeline(object):
    """
    Downloads large files.
    """
    def process_item(self, item, spider):
        # Create a file download task

        if item['type'] == 'file':
            agent = Agent(reactor)
            d = agent.request(
                method=b'GET',
                uri=bytes(item['url'], encoding='ascii')
            )
            # Once the response starts arriving, self._cb_bodyready is called automatically
            d.addCallback(self._cb_bodyready, file_name=item['file_name'])
            # Pass the item on to any later pipeline once the download has finished
            d.addCallback(lambda _: item)

            return d
        else:
            return item

    def _cb_bodyready(self, txresponse, file_name):
        # Create a Deferred that keeps the connection open until the download has finished
        d = defer.Deferred()
        d.addBoth(self.download_result)  # callback run after the download completes, fails, or errors
        txresponse.deliverBody(_ResponseReader(d, txresponse, file_name))
        return d

    def download_result(self, response):
        pass
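
The idea behind `_ResponseReader` is to write each chunk to disk as it arrives instead of holding the whole body in memory. The sketch below shows the same idea without Twisted, using only the standard library. `save_stream` and its `expected_length` parameter are hypothetical: here a short body is detected by comparing byte counts, whereas in the pipeline above Twisted reports it through `PotentialDataLoss`.

```python
import io


def save_stream(stream, out, expected_length=None, chunk_size=4096):
    """Copy stream to out chunk by chunk and return (bytes_written, status)."""
    received = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        out.write(chunk)  # write this piece immediately, like dataReceived does
        received += len(chunk)
    if expected_length is not None and received < expected_length:
        return received, 'partial'
    return received, 'success'


if __name__ == "__main__":
    source = io.BytesIO(b'x' * 10000)
    target = io.BytesIO()
    print(save_stream(source, target, expected_length=10000))
```

Memory use stays bounded by `chunk_size` regardless of the file size, which is the reason for streaming a large download.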

Create a start_url.py file in the scrapy_redis_spiders directory:

#!/usr/bin/env python
#coding=utf-8

import redis

conn = redis.Redis(host='127.0.0.1',port=6379)

# Key that holds the start URLs: chouti:start_urls
conn.lpush("chouti:start_urls",'https://dig.chouti.com')
v = conn.keys()
# value = conn.lrange("chouti:start_urls", 0, -1)
print(v)
