Learning Python, Part 3: The Scrapy Framework

Date: 2019-12-01

What is Scrapy?

Scrapy is an application framework written for crawling websites and extracting structured data. Put simply, it is a powerful crawler framework.

Why use this framework?

Because it is powerful:

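- Uses Twisted to download pages, which gives concurrency
- HTML parsing objects, with lxml built in
- Proxies can be configured
- Download delays can be configured
- De-duplication can be customized
- Depth-first or breadth-first crawling can be selected
- Can work with Redis to build a distributed crawler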
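
Several of the features above are switched on through the project's settings file. The fragment below is a minimal sketch of what that could look like; the values are illustrative assumptions and do not come from the original project. The breadth-first combination follows the setting names described in the Scrapy FAQ.

```python
# settings.py fragment (illustrative values, assumed for this sketch)

# Delay, in seconds, between requests to the same site
DOWNLOAD_DELAY = 2

# Breadth-first crawling: a positive depth priority plus FIFO scheduler queues
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

# Request de-duplication filter. This is Scrapy's default class;
# the path of a custom filter class could be placed here instead.
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
```

Leaving out the three breadth-first lines keeps Scrapy's default ordering, which the FAQ describes as depth-first.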

Installation:

Linux:
pip3 install scrapy

Windows:
Download page: http://www.lfd.uci.edu/~gohlke/pythonlibs/

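Download the file Twisted-17.5.0-cp36-cp36m-win_amd64.whl. Here cp refers to the Python interpreter version and the trailing 64 means 64-bit Windows, so pick the version that matches your system.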

Then install it: pip install Twisted-17.5.0-cp36-cp36m-win_amd64.whl

After that, run two more pip commands:
pip install scrapy
pip install pypiwin32
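
To pick the right wheel, it can help to compare the tags in the file name with the interpreter that will run pip. The sketch below splits a wheel file name into its parts and prints the matching tag for the current interpreter. The helper name `parse_wheel_name` is made up for this illustration, and it assumes the common five-part name without an optional build tag.

```python
import struct
import sys


def parse_wheel_name(filename):
    """Split 'name-version-pythontag-abitag-platformtag.whl' into its parts.

    Assumes a five-part wheel name (no optional build tag).
    """
    stem = filename[:-len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,      # e.g. cp36 = CPython 3.6
        "abi": abi_tag,
        "platform": platform_tag,  # e.g. win_amd64 = 64-bit Windows
    }


info = parse_wheel_name("Twisted-17.5.0-cp36-cp36m-win_amd64.whl")
print(info["python"], info["platform"])

# Tag and pointer size of the interpreter running this script
print("cp%d%d" % sys.version_info[:2], "%d-bit" % (struct.calcsize("P") * 8))
```

If the last line prints a different cp tag or bit width than the wheel's name, pip will refuse to install that wheel.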

Its architecture diagram is as follows:

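How do you create a spider?

- Create the spider project
scrapy startproject sp2 (sp2 is the project name)

- Enter the project and create a spider
cd sp2
scrapy genspider chouti chouti.com (chouti is the spider name; chouti.com is the domain the spider is restricted to)

- Run the spider
scrapy crawl chouti (chouti is the spider name)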

When we do not need to see the log, we usually run:

scrapy crawl chouti --nolog

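Project structure diagram:

The file '設置起始URL.py' in it is not required.

scrapy.cfg is a simple configuration file.

settings is the detailed configuration file.

items and pipelines are used for formatting and serialization.

middlewares is where middleware is written.

The spiders folder holds the spider files, which parse the data, define callback functions, and so on. They pass data to the scheduler and to the pipelines through two yields.

That is all there is to the simplest spider under the Scrapy framework.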
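
The role of the two yields can be pictured with a toy model. The sketch below is not Scrapy's implementation; `ToyRequest`, `ToyItem`, `PAGES`, and `run` are all hypothetical stand-ins written with the standard library. It only illustrates the idea that yielded requests go back to a queue (the scheduler) while yielded items go on to storage (the pipeline).

```python
from collections import deque


class ToyRequest:
    """Hypothetical stand-in for scrapy.http.Request (not the real class)."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


class ToyItem(dict):
    """Hypothetical stand-in for a scrapy.Item."""


# A fake "web": page URL -> (picture names on that page, links to other pages)
PAGES = {
    "page1": (["a.jpg", "b.jpg"], ["page2"]),
    "page2": (["c.jpg"], ["page1"]),  # links back, so de-duplication matters
}


def parse(url):
    """Callback: yields items for the pipeline and requests for the scheduler."""
    pictures, links = PAGES[url]
    for name in pictures:
        yield ToyItem(url=name)        # first yield: data to persist
    for link in links:
        yield ToyRequest(link, parse)  # second yield: more pages to crawl


def run(start_url):
    """Very small engine loop that routes each yielded object by its type."""
    queue = deque([ToyRequest(start_url, parse)])  # plays the scheduler
    seen = {start_url}                             # simple de-duplication
    stored = []                                    # what the pipeline received
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url):
            if isinstance(result, ToyRequest):
                if result.url not in seen:
                    seen.add(result.url)
                    queue.append(result)
            else:
                stored.append(result["url"])
    return stored


print(run("page1"))  # ['a.jpg', 'b.jpg', 'c.jpg']
```

The `seen` set is why the loop ends even though page2 links back to page1; Scrapy applies a comparable duplicate filter to real requests by default.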

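Example:

The following example crawls pictures from the Xiaohuar site (xiaohuar.com), to give a feel for how a simple Scrapy spider runs:

1. Code in the spider file xiaohuar, in the spiders folder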

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

from ..items import Sp1Item


class XiaohuarSpider(scrapy.Spider):
    name = 'xiaohuar'
    allowed_domains = ['xiaohuar.com']
    start_urls = ['http://www.xiaohuar.com/hua/']

    def parse(self, response):
        # Unlike BeautifulSoup, the response is passed in directly; no .text is needed
        hxs = Selector(response=response)
        # XPath copied from the browser (Copy > Copy XPath)
        girl_list = hxs.xpath('//*[@id="list_img"]/div/div[1]/div')
        # XPath notes:
        #   //  at the very start searches the whole HTML document;
        #       elsewhere it searches the descendants
        #   /   in the middle selects direct children; followed by @attr or text()
        #       it selects the attribute value or the text
        #   .// at the start searches the descendants of the current node
        #   ./ or no prefix at the start searches the children of the current node
        for count, girl in enumerate(girl_list, start=1):
            # Author's note: this printed 1 to 25, so girl_list really holds 25 objects,
            # yet only the images of the first 10 URLs were downloaded. Why?
            print(count)
            text = girl.xpath('div[1]/div[2]/span/a/text()').extract_first()  # caption
            img = girl.xpath('div[1]/div[1]/a/img/@src').extract_first()
            if not img:
                continue  # no src found, so there is nothing to download
            # urljoin handles both relative and absolute src values
            url = response.urljoin(img)
            # First yield: the item goes to items.py, then to pipelines.py for persistence.
            # Pass only the caption; the pipeline builds the file path from it.
            yield Sp1Item(url=url, text=text)

        # extract() turns the result into a list of strings; without it, Request
        # would be given Selector objects instead of URL strings
        result = hxs.xpath('//*[@id="page"]/div/a/@href').extract()
        for href in result:
            # Second yield: the Request goes to the scheduler, and its response is
            # passed back to parse, which is how the crawl recurses over the pages
            yield Request(url=response.urljoin(href), callback=self.parse)


The spider mainly consists of parsing plus two yields.

Parsing uses this module:

from scrapy.selector import Selector
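
The difference between searching children and searching descendants, noted in the spider's comments, can be tried without a network connection. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in, which supports only a subset of XPath and is not Scrapy's Selector. The markup is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup loosely shaped like a picture list page
HTML = """
<html><body>
  <div id="list_img">
    <div class="item"><a href="/p-1.html"><img src="/d/file/1.jpg"/></a><span>first</span></div>
    <div class="item"><a href="/p-2.html"><img src="/d/file/2.jpg"/></a><span>second</span></div>
  </div>
</body></html>
""".strip()

root = ET.fromstring(HTML)

# .// searches all descendants of the current node
container = root.find(".//div[@id='list_img']")
all_divs = root.findall(".//div")

# No prefix (or ./) searches only the direct children of the current node
top_level_divs = root.findall("div")  # html has no div children
items = container.findall("div")      # the two item divs

# Attribute values and text are read from the matched elements
srcs = [item.find("a/img").get("src") for item in items]
texts = [item.find("span").text for item in items]

print(len(all_divs), len(top_level_divs), len(items))  # 3 0 2
print(srcs)   # ['/d/file/1.jpg', '/d/file/2.jpg']
print(texts)  # ['first', 'second']
```

With Scrapy's Selector the same distinctions apply, and `@src` or `text()` can be written directly in the XPath expression, as the spider does.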

The two yields are used, respectively, for persistence and for looping over the remaining pages to crawl their pictures.
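
Both yields need full URLs, while the src and href values on a page are often relative. A minimal sketch of resolving them with the standard library follows; the example links are made up, and `absolute` is a hypothetical helper name. Inside a spider callback, Scrapy's `response.urljoin` serves the same purpose using the response's own URL as the base.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.xiaohuar.com/hua/"  # the page the links were found on


def absolute(link, base=PAGE_URL):
    """Resolve a link found on the page against the page's own URL."""
    return urljoin(base, link)


# Root-relative path: replaces the whole path of the base URL
print(absolute("/d/file/example.jpg"))
# Relative path: resolved inside the base URL's directory
print(absolute("list-1-1.html"))
# Already absolute: returned unchanged
print(absolute("http://www.xiaohuar.com/hua/list-1-2.html"))
```

This avoids the broken URL that plain string concatenation produces when the src is already absolute.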

Next comes the code in items and pipelines

import scrapy


class Sp1Item(scrapy.Item):
    # Fields carried from the spider to the pipeline
    url = scrapy.Field()   # image URL
    text = scrapy.Field()  # caption, used for the file name


import requests


class Sp1Pipeline(object):

    def process_item(self, item, spider):
        # scrapy.http.Request only describes a request and fetches nothing,
        # so the image is downloaded here with requests instead
        res = requests.get(item['url'])
        with open(r'F:\爬蟲\%s.jpg' % item['text'], 'wb') as f:
            f.write(res.content)
        print(item)
        return item

    def open_spider(self, spider):
        """
        Called when the spider starts running
        :param spider:
        :return:
        """
        print('spider started')

    def close_spider(self, spider):
        """
        Called when the spider is closed
        :param spider:
        :return:
        """
        print('spider finished')
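
Because the pipeline uses scraped text as a file name, characters that Windows does not allow in file names, or a missing caption, can make the open call fail. The helper below is a hypothetical sketch of one way to guard against that; `safe_filename` and its defaults are not part of the original code.

```python
import re


def safe_filename(text, default="unnamed", max_length=100):
    """Replace characters that are invalid in Windows file names.

    Falls back to `default` when the text is missing or empty.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", text or "").strip(" .")
    return cleaned[:max_length] or default


print(safe_filename("a/b:c"))  # a_b_c
print(safe_filename(None))     # unnamed
```

In the pipeline it would wrap the caption before the path is built, for example `r'F:\爬蟲\%s.jpg' % safe_filename(item['text'])`.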


Of course, a few things also need to be set in the settings configuration file

# Maximum recursion depth of the crawl
DEPTH_LIMIT = 1
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sp1 (+http://www.yourdomain.com)'

# Whether to obey the robots.txt crawling rules
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Download delay in seconds
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'sp1.middlewares.Sp1SpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'sp1.middlewares.Sp1DownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
# Register the pipeline class path and its priority, usually from 0 to 1000; lower numbers run first
ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,
}
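
The effect of the priority numbers is easiest to see with more than one pipeline. The sketch below sorts a settings dictionary the way the rule "lower numbers run first" implies. `LoggingPipeline` is a hypothetical second pipeline added only for the illustration, and `pipeline_order` is not a Scrapy function.

```python
# Example registration with an extra, hypothetical pipeline
ITEM_PIPELINES = {
    'sp1.pipelines.Sp1Pipeline': 300,
    'sp1.pipelines.LoggingPipeline': 100,  # hypothetical, for illustration only
}


def pipeline_order(pipelines):
    """Return pipeline class paths sorted by ascending priority number."""
    return [path for path, priority in sorted(pipelines.items(), key=lambda kv: kv[1])]


print(pipeline_order(ITEM_PIPELINES))
# ['sp1.pipelines.LoggingPipeline', 'sp1.pipelines.Sp1Pipeline']
```

Each item is passed through the pipelines in that order, and every process_item has to return the item so the next pipeline receives it.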