Web Crawlers (9): A Review of the Scrapy Framework

Date: 2019-12-11


Reference: the Scrapy documentation

1. Installing Scrapy

a. pip3 install wheel

b. Download Twisted: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

c. Change to the download directory and run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl

d. pip3 install scrapy

e. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/

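2. Basic Operations

(1) Create a project: before creating the project, change to the directory where you want to keep the code, then run:

scrapy startproject xxx  # create the project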

Microsoft Windows [Version 10.0.16299.309]
(c) 2017 Microsoft Corporation. All rights reserved.

C:\Users\felix>cd C:\Users\felix\PycharmProjects\scrapy_quotes

C:\Users\felix\PycharmProjects\scrapy_quotes>scrapy startproject quotes
New Scrapy project 'quotes', using template directory 'c:\\users\\felix\\appdata\\local\\programs\\python\\python36\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\felix\PycharmProjects\scrapy_quotes\quotes

You can start your first spider with:
    cd quotes
    scrapy genspider example example.com

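Running this command creates a new directory containing the following files:

scrapy.cfg: the project's configuration file (deployment settings)

quotes/: the project's Python module; your code is imported from here

quotes/items.py: the project's items file

quotes/pipelines.py: the project's pipelines file

quotes/settings.py: the project's settings file

quotes/spiders/: the directory where spiders are placed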

(2) Create a spider

cd quotes  # enter the project directory first

scrapy genspider name name.com  # create a spider

scrapy crawl name  # run the spider

(3) Walkthrough of the generated spider class

import scrapy
from quotes.items import QuotesItem


class QuotespiderSpider(scrapy.Spider):
    name = 'quotespider'  # spider name, used with "scrapy crawl"
    allowed_domains = ['quotes.toscrape.com']  # domains the spider may visit; several are allowed
    start_urls = ['http://quotes.toscrape.com/']  # URLs the crawl starts from

    def parse(self, response):  # callback that parses each downloaded response
        quotes = response.css('.quote')  # select the quote blocks with a CSS selector
        for quote in quotes:
            item = QuotesItem()  # container for the scraped data
            text = quote.css('.text::text').extract_first()  # ::text selects the text content
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()  # extract_first() returns the first match; extract() returns all matches as a list
            item['text'] = text  # fields must first be declared in the item class
            item['tags'] = tags
            item['author'] = author
            yield item  # hand the item to the pipelines for storage
        next_page = response.css('.pager .next a::attr(href)').extract_first()  # relative URL of the next page
        if next_page:  # the last page has no "next" link
            url = response.urljoin(next_page)  # convert it to an absolute URL
            yield scrapy.Request(url=url, callback=self.parse)  # request the next page and parse it with this same callback

# scrapy shell quotes.toscrape.com  # open an interactive shell for debugging
# scrapy crawl quotespider -o quotes.json  # save the output; .csv and .xml are also supported
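The pagination step in the spider depends on response.urljoin, which resolves the relative href of the "next" link against the URL of the current page. As a rough illustration, for a page without a <base> tag the same resolution can be reproduced with the standard library's urllib.parse.urljoin. The page URL and href values below are assumed examples, not output captured from the site.

```python
from urllib.parse import urljoin

# Assumed example values: the URL of the page being parsed and the
# relative href read from its "next" link.
page_url = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'

# Resolve the relative href against the page URL.
next_url = urljoin(page_url, next_href)
print(next_url)

# An href that is already absolute is returned unchanged.
absolute = urljoin(page_url, 'http://quotes.toscrape.com/page/3/')
print(absolute)
```

This is why the spider can pass the href straight from the page to scrapy.Request without assembling the URL by hand.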

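(4) Walkthrough of the items class

Items are containers that hold the scraped data. They work like Python dictionaries but offer extra protection: assigning to a field that was never declared raises an error, which guards against typos.

An item is declared by subclassing scrapy.item.Item and defining its attributes as scrapy.item.Field objects, much like an object-relational mapper (ORM).
Modelling the item lets us control which data we collect from the site. If we wanted a site's name, URL and description, for example, we would define a field for each of those three attributes. For this project we need the quote text, the author and the tags, so we edit items.py in the quotes directory and our Item class looks like this: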

import scrapy


class QuotesItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()  # quote text field
    author = scrapy.Field()  # author field
    tags = scrapy.Field()  # tags field
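To make the typo protection concrete without requiring Scrapy to be installed, here is a minimal stand-in written with a plain dict subclass. StrictItem and QuoteSketch are hypothetical names invented for this sketch. It only imitates the behaviour described above and is not how scrapy.Item is actually implemented.

```python
class StrictItem(dict):
    """Simplified stand-in for scrapy.Item: only declared fields may be set.

    Hypothetical class for illustration, not Scrapy's implementation.
    """
    fields = ()  # subclasses list their allowed field names here

    def __setitem__(self, key, value):
        # Reject any key that was not declared on the class.
        if key not in self.fields:
            raise KeyError(f'{type(self).__name__} does not support field: {key}')
        super().__setitem__(key, value)


class QuoteSketch(StrictItem):
    fields = ('text', 'author', 'tags')


item = QuoteSketch()
item['text'] = 'An example quote'  # declared field: accepted

try:
    item['autor'] = 'Someone'      # misspelled field: rejected
except KeyError as exc:
    print('rejected:', exc)

print(dict(item))
```

With a plain dictionary the misspelled key would be stored silently and the real author field would stay empty, which is the kind of mistake the item class is meant to catch.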

(5) Walkthrough of the pipeline classes

import pymongo
from scrapy.exceptions import DropItem



# To use a pipeline, it must be enabled in the settings (ITEM_PIPELINES)
class QuotesPipeline(object):
    def process_item(self, item, spider):
        return item


# A pipeline must either return the item or raise DropItem
class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        # item is an instance of the class defined in items.py
        # truncate text longer than 50 characters
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


# Pipeline that stores items in MongoDB
class MongoPipeline(object):
    def __init__(self, mongo_url, mongo_db):
        # store the connection settings
        self.mongo_url = mongo_url
        self.mongo_db = mongo_db

    # this class method lets the pipeline read values from the settings
    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            # read the database settings from settings.py
            mongo_url=crawler.settings.get('MONGO_URL'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):  # called when the spider starts
        # connect to the database
        self.client = pymongo.MongoClient(self.mongo_url)
        self.db = self.client[self.mongo_db]

    # processes each item; must return the item or raise DropItem
    def process_item(self, item, spider):
        name = item.__class__.__name__  # use the item class name as the collection name
        self.db[name].insert_one(dict(item))  # store the item as one document
        return item

    def close_spider(self, spider):
        self.client.close()  # close the database connection when the spider finishes
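The pipelines only take effect once they are enabled in the project settings, and MongoPipeline also expects MONGO_URL and MONGO_DB to be defined there. A possible settings.py fragment is sketched below. The priority numbers and both MongoDB values are assumptions chosen for illustration, so adjust them to your own setup.

```python
# settings.py (fragment)

# Enable the pipelines. Items pass through them in ascending order of the
# integer value, so TextPipeline runs before MongoPipeline here.
# The numbers 300 and 400 are assumed values.
ITEM_PIPELINES = {
    'quotes.pipelines.TextPipeline': 300,
    'quotes.pipelines.MongoPipeline': 400,
}

# Read by MongoPipeline.from_crawler. Both values are assumed placeholders.
MONGO_URL = 'localhost'
MONGO_DB = 'quotes'
```

Giving TextPipeline the lower number means the text is truncated, or the item dropped, before anything is written to the database.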