Scrapy (3): Grinding the Spider Into the Ground

Date: 2020-12-27


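When you see the word "spider", you might think of a disgusting real spider, like this one. Scary enough, right? It counts as one of the ten most venomous spiders in the world.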

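You're wrong. That is only the nasty spider of your imagination. You would never guess that spiders can be rather cute, like this one with its big Carslan-style eyes; I couldn't bear to grind it into the ground.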

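Oh, wait, a sudden flash of inspiration: Spider-Man. Now that is a stirring story. Back before Spider-Man was called Spider-Man, he was bitten by a spider, and that is how he became Spider-Man.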

Oh, I seem to have wandered off. Back to the topic: today's subject is the spider in Scrapy, which means a web crawler.

Today we'll work through a complete example, crawling the Huxiu news list. Let me open the site and take a look:

https://www.huxiu.com/

It feels like I've stumbled on a treasure trove. Maybe I can even pick up some of the article-writing techniques in there?

Create the project

scrapy startproject coolscrapy

One command, and Scrapy obediently generates the project. Let's first look at the directory layout:

coolscrapy/
    scrapy.cfg            # deployment configuration file
    coolscrapy/           # Python module; all of your code goes in here
        __init__.py
        items.py          # Item definitions
        pipelines.py      # pipeline definitions
        settings.py       # settings file
        spiders/          # all spiders go in this folder
            __init__.py
            ...

Define our own Items

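Since we need to scrape the title, summary, link, and publish time of each entry in the Huxiu news list, we define a subclass of scrapy.Item in coolscrapy/items.py to hold them: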

import scrapy

# Passing scrapy.Item here means the class inherits from the scrapy.Item base class
class HuXiuItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
    posttime = scrapy.Field()

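You may feel that defining this is a bit of a hassle and unnecessary. But look closely: isn't it just like a base class in Java that declares the attributes, possibly mirroring the data fields of the model layer? To be honest I don't know Java all that well; my company just uses a Java backend, so I've dabbled in it a little.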

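Next come our spiders

These spiders are really just crawling tools. At the code level each one is a set of methods, or, more abstractly, a class. Scrapy uses them to scrape information from a domain (which is really just the URL we have been talking about). A spider class defines the initial URLs, how to follow links, and how to parse page content.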

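To define a Spider, just subclass scrapy.Spider and define a few attributes:

name: the Spider's name; it must be unique

start_urls: the URLs to download first

parse(): parses the downloaded Response object, which is also the method's only argument. It is responsible for parsing the returned page data and extracting the corresponding Items (returned as Item objects), along with any other valid link URLs to follow (returned as Request objects)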

We create huxiu_spider.py in the coolscrapy/spiders folder with the following content:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-
"""
Topic: sample
Desc :
"""
from coolscrapy.items import HuXiuItem
import scrapy


class HuXiuSpider(scrapy.Spider):
    name = 'huxiu'
    allowed_domains = ['huxiu.com']
    start_urls = [
        'http://www.huxiu.com/index.php'
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath(
                'div[@class="mob-sub"]/text()')[0].extract()
            print(item['title'], item['link'], item['desc'])
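The way the nested XPath expressions in parse() carve up a page can be tried offline. The sketch below is not Scrapy code: it swaps in the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and needs well-formed markup, and it runs against a made-up snippet that merely reuses the class names from the spider. The real Huxiu markup may differ, so treat it purely as an illustration of "select each entry first, then query relative to it".

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the list page; only the class names
# are taken from the spider above.
SNIPPET = """
<html><body>
<div class="mod-info-flow">
  <div>
    <div class="mob-ctt">
      <h3><a href="/article/1.html">First headline</a></h3>
      <div class="mob-sub">First summary</div>
    </div>
  </div>
  <div>
    <div class="mob-ctt">
      <h3><a href="/article/2.html">Second headline</a></h3>
      <div class="mob-sub">Second summary</div>
    </div>
  </div>
</div>
</body></html>
"""

root = ET.fromstring(SNIPPET)

rows = []
# Outer query: one node per news entry.
for entry in root.findall(".//div[@class='mod-info-flow']/div/div[@class='mob-ctt']"):
    # Inner queries are relative to the current entry, like sel.xpath(...) in the spider.
    anchor = entry.find("h3/a")
    summary = entry.find("div[@class='mob-sub']")
    rows.append((anchor.text, anchor.get("href"), summary.text))

for row in rows:
    print(row)
```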

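Run the spider

Amitabha, may heaven keep my spider safe and free of bugs. I'm so nervous.

Run the following command in the project root, where huxiu is the spider name you defined: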

scrapy crawl huxiu

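Heaven did not come through: it still threw an error. We'll leave this bug alone for now, get familiar with the overall flow first, and come back to fix it later.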

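Following links

If you want to follow each news link and look at its full content, you can return a Request object from parse() and register a callback function to parse the article detail page: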

from coolscrapy.items import HuXiuItem
import scrapy

class HuxiuSpider(scrapy.Spider):
    name = "huxiu"
    allowed_domains = ["huxiu.com"]
    start_urls = [
        "http://www.huxiu.com/index.php"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[@class="mod-info-flow"]/div/div[@class="mob-ctt"]'):
            item = HuXiuItem()
            item['title'] = sel.xpath('h3/a/text()')[0].extract()
            item['link'] = sel.xpath('h3/a/@href')[0].extract()
            url = response.urljoin(item['link'])
            item['desc'] = sel.xpath('div[@class="mob-sub"]/text()')[0].extract()
            # print(item['title'],item['link'],item['desc'])
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        detail = response.xpath('//div[@class="article-wrap"]')
        item = HuXiuItem()
        item['title'] = detail.xpath('h1/text()')[0].extract()
        item['link'] = response.url
        item['posttime'] = detail.xpath(
            'div[@class="article-author"]/span[@class="article-time"]/text()')[0].extract()
        print(item['title'],item['link'],item['posttime'])
        yield item

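Now parse() only extracts the links we care about and hands the parsing of each linked page to another method. You can build more complex crawlers on top of this.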
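The response.urljoin(...) call in parse() is what turns the href values from the list page into absolute URLs that can be requested. Scrapy documents it as a wrapper around urllib.parse.urljoin with the response's own URL as the base, so the resolution rules can be explored with the standard library alone. The article paths below are hypothetical.

```python
from urllib.parse import urljoin

# The page the links were found on plays the role of response.url.
base = "http://www.huxiu.com/index.php"

# A root-relative href replaces the whole path of the base URL.
root_relative = urljoin(base, "/article/1.html")

# A plain relative href is resolved against the directory of the base URL.
relative = urljoin(base, "article/2.html")

# An href that is already absolute is returned unchanged.
absolute = urljoin(base, "https://www.huxiu.com/article/3.html")

print(root_relative)
print(relative)
print(absolute)
```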

Exporting data

The simplest way to save the scraped data is to write it to a local JSON file, by running:

scrapy crawl huxiu -o items.json

For a small demo system this is enough. But if you want to build a complex crawling system, it is better to write your own Item Pipeline.
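The exported file is a JSON array with one object per item, so it can be read back with the standard json module. The sketch below writes a small hand-made sample in that shape to a temporary directory (the records are invented, not real crawl output) and then loads it the way you would load items.json.

```python
import json
import tempfile
from pathlib import Path

# Invented records shaped like the items yielded by parse_article().
sample = [
    {"title": "First headline", "link": "http://www.huxiu.com/article/1.html",
     "posttime": "2020-12-27 10:00"},
    {"title": "Second headline", "link": "http://www.huxiu.com/article/2.html",
     "posttime": "2020-12-27 11:00"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "items.json"
    # Stand-in for the file produced by: scrapy crawl huxiu -o items.json
    path.write_text(json.dumps(sample, ensure_ascii=False), encoding="utf-8")

    # Reading the export back: the top-level value is a list of dicts.
    with path.open(encoding="utf-8") as fh:
        items = json.load(fh)

titles = [item["title"] for item in items]
print(len(items), titles)
```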

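Saving data to a database

Above we showed that the scraped Items can be exported to a JSON file, but the most common approach is still to write a Pipeline that stores them in a database. We define it in coolscrapy/pipelines.py: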

# -*- coding: utf-8 -*-
from contextlib import contextmanager

from sqlalchemy.orm import sessionmaker
# coolscrapy/models.py is not shown in this article; it is expected to provide
# the Article model and the db_connect / create_news_table helpers.
from coolscrapy.models import db_connect, create_news_table, Article


@contextmanager
def session_scope(Session):
    """Provide a transactional scope around a series of operations."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()


class ArticleDataBasePipeline(object):
    """Save articles to the database."""

    def __init__(self):
        engine = db_connect()
        create_news_table(engine)
        self.Session = sessionmaker(bind=engine)

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        pass

    def process_item(self, item, spider):
        # NOTE: url, publish_time, body and source_site are not fields of the
        # HuXiuItem defined earlier; the item passed in must declare them.
        a = Article(url=item["url"],
                    title=item["title"].encode("utf-8"),
                    publish_time=item["publish_time"].encode("utf-8"),
                    body=item["body"].encode("utf-8"),
                    source_site=item["source_site"].encode("utf-8"))
        with session_scope(self.Session) as session:
            session.add(a)
        # Return the item so that any later pipelines still receive it.
        return item

    def close_spider(self, spider):
        pass

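Above I used Python's SQLAlchemy to save to the database. It is an excellent ORM library; I wrote an introductory tutorial on it that you can refer to.

Then configure this Pipeline in settings.py, along with the database connection details: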
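The pipeline wraps session.add() in a session_scope block. The idea behind such a scope is to commit when the block finishes normally and roll back when it raises. The sketch below shows that pattern with the standard library's sqlite3 instead of SQLAlchemy, so it runs without a database server; the `transaction_scope` helper, the table, and the rows are all made up for illustration and are not part of the project.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction_scope(conn):
    """Commit if the block succeeds, roll back if it raises."""
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (url TEXT, title TEXT)")
conn.commit()

# Normal case: the insert is committed when the block exits.
with transaction_scope(conn) as c:
    c.execute("INSERT INTO article VALUES (?, ?)",
              ("http://www.huxiu.com/article/1.html", "kept"))

# Failure case: the exception triggers a rollback, so this row is discarded.
try:
    with transaction_scope(conn) as c:
        c.execute("INSERT INTO article VALUES (?, ?)",
                  ("http://www.huxiu.com/article/2.html", "discarded"))
        raise ValueError("simulated failure while processing an item")
except ValueError:
    pass

titles = [row[0] for row in conn.execute("SELECT title FROM article")]
print(titles)
conn.close()
```

Keeping one short transaction per item means that a single bad item cannot leave a half-written row behind or poison the writes that follow it.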

ITEM_PIPELINES = {
    'coolscrapy.pipelines.ArticleDataBasePipeline': 5,
}

# On Linux: pip install MySQL-python
DATABASE = {'drivername': 'mysql',
            'host': '192.168.203.95',
            'port': '3306',
            'username': 'root',
            'password': 'mysql',
            'database': 'spider',
            'query': {'charset': 'utf8'}
}