Learning the Python Scrapy Crawler Framework

Date: 2019-12-07

Tags: python, scrapy, crawler framework


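Scrapy is an application framework, implemented in Python, for crawling websites and extracting structured data.

1. Introduction to the Scrapy framework

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.

It was originally designed for page scraping (more precisely, web scraping), but it can also be used to fetch data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler.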

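2. Architecture diagram

The architecture diagram shows Scrapy's components and an overview of the data flow that takes place inside the system (the green arrows). Each component is introduced briefly below, and the data flow is described afterwards.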

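(1) Components

Scrapy Engine

The engine controls the flow of data between all components of the system and triggers events when particular actions occur. See the Data flow section below for details.

Scheduler

The scheduler receives requests from the engine and enqueues them, so that it can hand them back when the engine later asks for them.

Downloader

The downloader fetches page data and hands it to the engine, which then passes it on to the spider.

Spiders

A spider is a class written by the Scrapy user to parse responses and extract items (i.e. scraped items) or additional URLs to follow. Each spider handles one specific website (or a few). For more, see the Spiders topic in the Scrapy documentation.

Item Pipeline

The Item Pipeline processes the items extracted by spiders. Typical tasks are cleaning, validation, and persistence (for example, storing the item in a database). For more, see the Item Pipeline topic in the Scrapy documentation.

Downloader middlewares

Downloader middlewares are specific hooks that sit between the engine and the downloader. They process requests on their way to the downloader and responses passed from the downloader back to the engine, and they provide a convenient mechanism for extending Scrapy by plugging in custom code. For more, see the Downloader Middleware topic in the Scrapy documentation.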
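To make the two directions concrete, here is a minimal sketch of what a downloader middleware can look like. The method names process_request and process_response are the hooks Scrapy calls; everything else (the class name, the default user agent string, the seen_statuses list, and the SimpleNamespace stand-ins for Scrapy's request, response and spider objects) is made up for illustration so that the snippet runs without Scrapy installed.

```python
from types import SimpleNamespace


class DefaultUserAgentMiddleware:
    """Toy downloader middleware: fills in a User-Agent on outgoing requests."""

    def __init__(self, user_agent="example-bot/0.1"):  # hypothetical default
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Request direction: runs before the request reaches the downloader.
        request.headers.setdefault("User-Agent", self.user_agent)
        return None  # None means "carry on processing this request"

    def process_response(self, request, response, spider):
        # Response direction: runs before the response reaches the engine.
        spider.seen_statuses.append(response.status)  # hypothetical attribute
        return response


# Stand-ins for Scrapy's Request, Response and Spider objects (assumptions).
request = SimpleNamespace(url="http://example.com/", headers={})
response = SimpleNamespace(status=200)
spider = SimpleNamespace(seen_statuses=[])

mw = DefaultUserAgentMiddleware()
mw.process_request(request, spider)
mw.process_response(request, response, spider)
print(request.headers, spider.seen_statuses)
```

In a real project the class would live in middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting.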

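Spider middlewares

Spider middlewares are specific hooks that sit between the engine and the spiders. They process spider input (responses) and output (items and requests), and they provide a convenient mechanism for extending Scrapy by plugging in custom code. For more, see the Spider Middleware topic in the Scrapy documentation.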

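(2) Data flow

The data flow in Scrapy is controlled by the execution engine and proceeds as follows:

1. The engine opens a website (opens a domain), locates the spider that handles it, and asks that spider for the first URL(s) to crawl.
2. The engine gets the first URLs to crawl from the spider and schedules them in the scheduler as Requests.
3. The engine asks the scheduler for the next URL to crawl.
4. The scheduler returns the next URL to the engine, and the engine forwards it to the downloader through the downloader middlewares (request direction).
5. Once the page has finished downloading, the downloader generates a Response for it and sends it to the engine through the downloader middlewares (response direction).
6. The engine receives the Response from the downloader and sends it to the spider through the spider middlewares (input direction).
7. The spider processes the Response and returns the scraped Items and the new Requests (to follow) to the engine.
8. The engine sends the scraped Items (returned by the spider) to the Item Pipeline and the Requests (returned by the spider) to the scheduler.
9. The process repeats (from step 2) until there are no more requests in the scheduler, at which point the engine closes the website.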
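The loop described in these steps can be imitated in a few lines of plain Python. This is only a toy model under stated assumptions, not how Scrapy is implemented: FAKE_SITE, download, parse and crawl are invented names, the "downloader" reads from a dictionary instead of the network, and there are no middlewares.

```python
from collections import deque

# A fake "website": URL -> (page title, links found on that page).
FAKE_SITE = {
    "/page1": ("First", ["/page2"]),
    "/page2": ("Second", ["/page3", "/page1"]),
    "/page3": ("Third", []),
}


def download(url):
    """Stand-in for the downloader: returns a 'response' for a URL."""
    title, links = FAKE_SITE[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Stand-in for a spider callback: yields one item, then URLs to follow."""
    yield {"title": response["title"]}  # an item (dict)
    for link in response["links"]:
        yield link                      # a new request (str)


def crawl(start_url):
    scheduler = deque([start_url])  # requests waiting to be downloaded
    seen = {start_url}              # never schedule the same URL twice
    items = []                      # what the item pipeline would receive
    while scheduler:                # stop when the scheduler is empty
        response = download(scheduler.popleft())
        for result in parse(response):
            if isinstance(result, dict):
                items.append(result)      # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)  # requests go back to the scheduler
    return items


print(crawl("/page1"))
# [{'title': 'First'}, {'title': 'Second'}, {'title': 'Third'}]
```

The crawl ends on its own once the queue is empty, which corresponds to the engine closing the website in the last step.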

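(3) Event-driven networking

Scrapy is written on top of Twisted, an event-driven networking framework. For the sake of concurrency, Scrapy is therefore implemented with non-blocking (i.e. asynchronous) code.

The Twisted documentation covers asynchronous programming and Twisted itself in more depth.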
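To get a feel for what non-blocking means, the sketch below starts three simulated downloads on a single thread. It uses the standard library's asyncio purely as a stand-in, since Scrapy itself is built on Twisted; fake_download, the delays and the events list are invented for the demonstration.

```python
import asyncio

events = []  # records the order in which things happen


async def fake_download(name, delay):
    events.append(("start", name))
    await asyncio.sleep(delay)  # non-blocking wait, like waiting on a socket
    events.append(("done", name))
    return name


async def main():
    # Launch three "downloads" concurrently on one thread.
    return await asyncio.gather(
        fake_download("a", 0.03),
        fake_download("b", 0.02),
        fake_download("c", 0.01),
    )


results = asyncio.run(main())
print(results)
print(events)
```

All three downloads are started before any of them finishes, so the total time is close to the longest delay rather than the sum of the three.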

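3. Four steps to build a crawler

Create a project (scrapy startproject xxx): create a new crawler project
Define the target (write items.py): specify what you want to scrape
Build the spider (spiders/xxspider.py): write the spider and start crawling pages
Store the content (pipelines.py): design a pipeline to store what was scraped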

4. Installing the framework

Here we install it with conda:

conda install scrapy

Or install it with pip:

pip install scrapy

Check the installation:

➜  spider scrapy -h
Scrapy 1.4.0 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

1. Create the project

➜  spider scrapy startproject SF
New Scrapy project 'SF', using template directory '/Users/kaiyiwang/anaconda2/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/kaiyiwang/Code/python/spider/SF

You can start your first spider with:
    cd SF
    scrapy genspider example example.com
➜  spider

The tree command shows the project structure:

➜  SF tree
.
├── SF
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

2. Create a spider from a template in the spiders directory

➜  spiders scrapy genspider sf "https://segmentfault.com"
Created spider 'sf' using template 'basic' in module:
  SF.spiders.sf
➜  spiders

This generates a spider file, sf.py, which we then edit as follows:

# -*- coding: utf-8 -*-
import scrapy
from SF.items import SfItem  # SfItem must declare a `title` field in items.py


class SfSpider(scrapy.Spider):
    name = 'sf'
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ['segmentfault.com']
    start_urls = ['https://segmentfault.com/']

    def parse(self, response):
        node_list = response.xpath("//h2[@class='title']")

        for node in node_list:
            # Create an item object to hold the scraped data
            item = SfItem()
            # .extract() converts the XPath selection into a list of Unicode strings
            title = node.xpath("./a/text()").extract()
            if not title:
                continue  # skip headings that have no link text

            item['title'] = title[0]

            # Hand the item to the pipeline; execution resumes here afterwards
            yield item

Commands:

# Run the spider's contract checks; sf is the spider's name
➜  scrapy check sf

# Run the spider
➜  scrapy crawl sf

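3. Item pipeline

After an item has been collected by a spider, it is passed to the Item Pipeline, whose components process the item in the order in which they are defined.

Each Item Pipeline component is a Python class that implements a simple method and decides, for instance, whether an item is dropped or stored. Typical uses of an item pipeline include:

Validating scraped data (checking that the item contains certain fields, such as a name field)
Checking for duplicates (and dropping them)
Saving the scraped results to a file or a database (data persistence)

Writing an item pipeline
Writing an item pipeline is straightforward: a pipeline component is a standalone Python class that must implement the process_item() method.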

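from scrapy.exceptions import DropItem


class PricePipeline(object):

    # Multiplier applied to prices that do not yet include VAT
    vat_factor = 1.15

    def process_item(self, item, spider):
        # .get() avoids a KeyError when a field was never set on the item
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item  # pass the item on to the next pipeline component
        else:
            # Raising DropItem stops any further processing of this item
            raise DropItem("Missing price in %s" % item)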
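The duplicate check from the list of typical uses follows the same pattern. The sketch below assumes each item carries an 'id' field, which is an invented convention, and defines a local DropItem class as a stand-in for scrapy.exceptions.DropItem so that it runs without Scrapy installed.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem (assumption)."""


class DuplicatesPipeline:
    """Drops any item whose 'id' was already seen during this crawl."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item["id"] in self.seen_ids:
            raise DropItem("Duplicate item found: %r" % item)
        self.seen_ids.add(item["id"])
        return item


pipeline = DuplicatesPipeline()
kept = []
for item in [{"id": 1}, {"id": 2}, {"id": 1}]:
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        pass  # Scrapy would log the drop and skip the later pipelines
print(kept)
# [{'id': 1}, {'id': 2}]
```

Because the set lives on the pipeline instance, the check only covers items seen within a single run of the spider.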

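4. Selectors

When scraping web pages, the most common task is extracting data from the HTML source.
A Selector has four basic methods, of which xpath() is the most commonly used:

xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
extract(): serializes the matched nodes to Unicode strings and returns them as a list.
css(): takes a CSS expression and returns a selector list of all nodes matching it; the syntax is the same as in BeautifulSoup4.
re(): extracts data using the given regular expression and returns a list of Unicode strings.

Scrapy has its own mechanism for extracting data. These are called selectors, because they "select" certain parts of an HTML document using XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents, and it can also be used with HTML. CSS is a language for applying styles to HTML documents; it defines selectors that associate those styles with specific HTML elements.

Scrapy selectors are built on the lxml library, which means they are very similar to it in speed and parsing accuracy.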

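XPath expression examples:

/html/head/title: selects the <title> element inside the <head> of an HTML document
/html/head/title/text(): selects the text of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"

For a fuller summary of XPath syntax, consult an XPath reference.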
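These expressions can be tried out without Scrapy. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in; it supports only a subset of XPath (relative paths, no text() function), so the expressions are adapted slightly, and the sample page is invented. In a spider, the original expressions would be passed to response.xpath() unchanged.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div class="mine">first</div>
    <div class="other">second</div>
    <table><tr><td>1</td><td>2</td></tr></table>
  </body>
</html>
""".strip()

root = ET.fromstring(PAGE)  # root is the <html> element

# /html/head/title and its text, written relative to the root element
title = root.find("./head/title")
print(title.text)

# //td: every <td> element anywhere in the document
cells = [td.text for td in root.findall(".//td")]
print(cells)

# //div[@class="mine"]: every <div> whose class attribute is "mine"
mine = [div.text for div in root.findall(".//div[@class='mine']")]
print(mine)
```

ElementTree needs well-formed XML, whereas Scrapy's selectors also cope with the messy HTML found on real pages.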

5. Scraping job listings

1. Scraping Tencent job listings

Target URL: http://hr.tencent.com/positio...

1.1 Create the project

> scrapy startproject Tencent

You can start your first spider with:
    cd Tencent
    scrapy genspider example example.com

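We need to scrape the following elements from the page:

Position name: positionName
Position link: positionLink
Position type: positionType
Number of openings: positionNumber
Work location: workLocation
Publish date: publishTime

Define the fields to scrape in items.py: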

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# Field definitions
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Position name
    positionName = scrapy.Field()

    # Position link
    positionLink = scrapy.Field()

    # Position type
    positionType = scrapy.Field()

    # Number of openings
    positionNumber = scrapy.Field()

    # Work location
    workLocation = scrapy.Field()

    # Publish date
    publishTime = scrapy.Field()

1.2 Write the spider

Create it with the genspider command:

➜  Tencent scrapy genspider tencent "tencent.com"
Created spider 'tencent' using template 'basic' in module:
  Tencent.spiders.tencent

The generated spider is spiders/tencent.py under the current directory:

➜  Tencent tree
.
├── __init__.py
├── __init__.pyc
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
├── settings.pyc
└── spiders
    ├── __init__.py
    ├── __init__.pyc
    └── tencent.py

Let's look at the generated file, tencent.py:

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        pass

Modify the initial tencent.py:

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['tencent.com']
    baseURL = "http://hr.tencent.com/position.php?&start="
    offset = 0  # offset appended to baseURL
    start_urls = [baseURL + str(offset)]

    def parse(self, response):

        # Select the table rows.
        # Wrong: `or` is a boolean operator, not a union of node sets
        # node_list = response.xpath("//tr[@class='even'] or //tr[@class='odd']")
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

        for node in node_list:
            item = TencentItem()

            # Take the first match [0] and encode the Unicode text as UTF-8.
            # This is Python 2 style; on Python 3 drop .encode("utf-8") to keep str.
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")  # href attribute
            item['positionType'] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            item['positionNumber'] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item['publishTime'] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")

            # Hand the item to the pipeline
            yield item

        # Hard-coded limit: keep requesting pages until the offset reaches 2000
        if self.offset < 2000:
            self.offset += 10
            # offset is an int, so convert it before concatenating
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

Write the pipeline file, pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json

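class TencentPipeline(object):
    def __init__(self):
        # Python 2 style; on Python 3 use open("tencent.json", "w", encoding="utf-8")
        self.f = open("tencent.json", "w")

    # Every item passes through this same pipeline
    def process_item(self, item, spider):
        # Write one JSON object per line, each followed by a comma
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()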
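Note that the file produced above is a series of JSON objects separated by commas, which is not a valid JSON document as a whole. One common alternative is the JSON Lines layout, with one object per line and no trailing comma, so that each line can be parsed on its own. The sketch below is a Python 3 variant under stated assumptions: the class name, the path constructor argument (added only to make the demo testable) and the sample items are all invented.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes each item as one JSON object per line (JSON Lines)."""

    def __init__(self, path="tencent.jl"):  # hypothetical default file name
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Called when the spider starts; open the file as UTF-8 text.
        self.f = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.f.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.f.close()


# Demo with plain dicts standing in for scraped items.
path = os.path.join(tempfile.mkdtemp(), "tencent.jl")
pipeline = JsonLinesPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"positionType": "技術類", "positionNumber": "1"}, spider=None)
pipeline.process_item({"positionType": "Editor", "positionNumber": "2"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
print(rows)
```

Scrapy's built-in feed exports can also produce this format directly, for example with scrapy crawl tencent -o tencent.jl.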

Once the pipeline is written, enable it in settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
}

Run it:

> scrapy crawl tencent

File "/Users/kaiyiwang/Code/python/spider/Tencent/Tencent/spiders/tencent.py", line 21, in parse
    item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
IndexError: list index out of range

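The XPath that selects the rows was the problem. In XPath, "or" is a boolean operator, so the commented-out expression returns a true/false value instead of table rows; the union of two node sets is written with "|":

        # Select the table rows
        # node_list = response.xpath("//tr[@class='even'] or //tr[@class='odd']")
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

Then run the command again: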

> scrapy crawl tencent

The resulting file, tencent.json:

{"positionName": "23673-財經運營中心熱點運營組編輯", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32718&keywords=&tid=0&lid=0", "positionType": "內容編輯類", "workLocation": "北京", "positionNumber": "1"},
{"positionName": "MIG03-騰訊地圖高級算法評測工程師（北京）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=30276&keywords=&tid=0&lid=0", "positionType": "技術類", "workLocation": "北京", "positionNumber": "1"},
{"positionName": "MIG10-微回收渠道產品運營經理（深圳）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32720&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "workLocation": "深圳", "positionNumber": "1"},
{"positionName": "MIG03-iOS測試開發工程師（北京）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=32715&keywords=&tid=0&lid=0", "positionType": "技術類", "workLocation": "北京", "positionNumber": "1"},
{"positionName": "19332-高級PHP開發工程師（上海）", "publishTime": "2017-12-02", "positionLink": "position_detail.php?id=31967&keywords=&tid=0&lid=0", "positionType": "技術類", "workLocation": "上海", "positionNumber": "2"}

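1.3 Crawling via the next-page link

Above, we fetched each page by counting up to a fixed total, but that ignores the fact that the listings change every day, so the total number of pages should not be hard-coded. How, then, do we know when everything has been crawled? It is simple: follow the "next page" link, and when there is no next page left, the crawl is complete.

Let's look at what the "next page" link looks like on the last page:

The "next page" button is greyed out and its link carries the attribute class='noactive', which we can use to detect that we have reached the last page.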

        # Hard-coded limit: stop once the offset reaches 100 (disabled)
        """
        if self.offset < 100:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
        """

        # Follow the "next page" link unless it is marked as inactive
        if len(response.xpath("//a[@class='noactive' and @id='next']")) == 0:
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)
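The decision itself can be isolated in a small helper, which also shows a sturdier way to build the absolute URL than string concatenation. This is a sketch, not the article's code: next_page_url is an invented function and the href values in the demo are made-up examples of what the "next" link might contain.

```python
from urllib.parse import urljoin


def next_page_url(page_url, next_href, next_classes):
    """Return the absolute URL of the next page, or None on the last page.

    page_url     -- URL of the page that was just parsed
    next_href    -- href attribute of the "next" link (may be relative)
    next_classes -- class attribute of the "next" link, e.g. "noactive"
    """
    if next_href is None or "noactive" in next_classes.split():
        return None
    return urljoin(page_url, next_href)


print(next_page_url("http://hr.tencent.com/position.php?&start=0",
                    "position.php?&start=10", ""))
# http://hr.tencent.com/position.php?&start=10
print(next_page_url("http://hr.tencent.com/position.php?&start=1990",
                    "javascript:;", "noactive"))
# None
```

Inside a spider, response.urljoin(url) resolves a relative link against the current page in the same way.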

The modified tencent.py:

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    # Spider name
    name = 'tencent'
    # Domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    # Base URL to which the offset is appended
    baseURL = "http://hr.tencent.com/position.php?&start="
    # Offset appended to the base URL
    offset = 0

    # URLs requested when the spider starts
    start_urls = [baseURL + str(offset)]

    # Handles each response
    def parse(self, response):

        # Extract the data rows from the response
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

        for node in node_list:

            # Build an item object to hold the data
            item = TencentItem()

            # Debug output: the raw position name
            print(node.xpath("./td[1]/a/text()").extract())

            # Take the first match [0] and encode the Unicode text as UTF-8.
            # This is Python 2 style; on Python 3 drop .encode("utf-8") to keep str.
            item['positionName'] = node.xpath("./td[1]/a/text()").extract()[0].encode("utf-8")
            item['positionLink'] = node.xpath("./td[1]/a/@href").extract()[0].encode("utf-8")  # href attribute

            # The position type cell may be empty, so check before indexing
            if len(node.xpath("./td[2]/text()")):
                item['positionType'] = node.xpath("./td[2]/text()").extract()[0].encode("utf-8")
            else:
                item['positionType'] = ""

            item['positionNumber'] = node.xpath("./td[3]/text()").extract()[0].encode("utf-8")
            item['workLocation'] = node.xpath("./td[4]/text()").extract()[0].encode("utf-8")
            item['publishTime'] = node.xpath("./td[5]/text()").extract()[0].encode("utf-8")

            # yield hands the item to the pipeline and then resumes here;
            # a return would exit the whole function
            yield item

        # Option 1: build the URL by concatenation. Useful when the page has no
        # link to follow and the URL has to be constructed (disabled here)
        """
        if self.offset < 100:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
        """

        # Option 2: take the next link straight from the response and request it,
        # until no links remain (crawl via the "next page" link)
        if len(response.xpath("//a[@class='noactive' and @id='next']")) == 0:
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)

OK, by following the next-page link we successfully crawled all of the job listings.
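The point about yield made in the spider's comments is easy to check in isolation. In this small sketch, parse_with_yield and parse_with_return are invented names and the string "NEXT_PAGE_REQUEST" merely stands in for a scrapy.Request object.

```python
def parse_with_yield(rows):
    for row in rows:
        yield {"title": row}   # hand over one item, then resume the loop
    yield "NEXT_PAGE_REQUEST"  # still reached after all items were yielded


def parse_with_return(rows):
    for row in rows:
        return {"title": row}  # exits on the first row; the rest are lost


print(list(parse_with_yield(["a", "b"])))
# [{'title': 'a'}, {'title': 'b'}, 'NEXT_PAGE_REQUEST']
print(parse_with_return(["a", "b"]))
# {'title': 'a'}
```

This is why a single parse callback can produce both the items of the current page and the request for the next one.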

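1.4 Summary

Crawler steps:

1. Create the project: scrapy startproject XXX
2. Generate the spider: scrapy genspider xxx "www.xxx.com"
3. Write items.py to define the data to extract
4. Write spiders/xxx.py, the spider file that handles requests and responses and extracts the data (yield item)
5. Write pipelines.py, the pipeline file that processes the items returned by the spider, for example persisting them locally to a file or a database table
6. Edit settings.py to enable the pipeline component in ITEM_PIPELINES, along with any other settings
7. Run the spider: scrapy crawl xxx

Sites being crawled sometimes impose many restrictions, so we can add request headers to our requests. Scrapy provides a convenient place to configure them: in settings.py, we can enable: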

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/62.0.3202.94 Safari/537.36")


# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
}
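Headers are only one way to cope with restrictive sites. The fragment below lists a few other settings that slow the crawler down and keep it polite. The setting names are Scrapy's own, but the values are illustrative assumptions that should be tuned for the site being crawled.

```python
# settings.py (fragment); the values are examples, not recommendations
ROBOTSTXT_OBEY = True                 # respect the site's robots.txt rules
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests to a site
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # cap on parallel requests per domain
AUTOTHROTTLE_ENABLED = True           # adapt the delay to server response times
```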

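Scrapy is best suited to crawling static pages, where its performance is excellent. If all you need is to fetch dynamic JSON data, it is more than the job requires.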
