Getting Started with Scrapy (1): Scrapy basics, a Jobbole (伯樂在線) walkthrough, and saving data as JSON

Date: 2019-11-09

Tags: scrapy, basic tutorial, Jobbole walkthrough, JSON, saving data. Category: Python


Scrapy, Part 1


What is the Scrapy framework?

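Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It is widely used, mostly for scraping large numbers of static pages.
That is the strength of a framework: you only need to customize a few modules to get a working crawler that fetches page content and images.
Scrapy uses the Twisted (pronounced ['twistid]) asynchronous networking framework, whose main rival is Tornado, to handle network communication. This speeds up downloads without your having to implement asynchronous I/O yourself, and the framework also exposes a range of middleware interfaces so that it can be adapted flexibly to different needs.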
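To illustrate why asynchronous networking speeds up downloads, here is a minimal sketch. It uses the standard library's asyncio rather than Twisted, and the URLs and the fake_download function are made up for the illustration; no real network request is made and this is not how Scrapy is implemented internally.

```python
import asyncio
import time


async def fake_download(url, delay=0.1):
    """Stand-in for a network request: waits without blocking other downloads."""
    await asyncio.sleep(delay)
    return f"response for {url}"


async def download_all(urls):
    # gather() runs the coroutines concurrently and keeps the input order
    return await asyncio.gather(*(fake_download(u) for u in urls))


urls = ["page/1", "page/2", "page/3"]  # hypothetical URLs
start = time.perf_counter()
responses = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - start

print(responses)
print(f"{elapsed:.2f}s elapsed")
```

Because the three waits overlap, the total time should be close to one delay rather than the sum of all three, which is the effect an asynchronous downloader relies on.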

Installing Scrapy

Installing on Windows

pip install Scrapy

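On Windows, Scrapy needs a number of dependencies, and which ones are missing depends on your machine. When you install from cmd, pip reports an error naming whatever is missing; search for that package on this site, download it, and install it as a wheel. If you have not installed from a wheel before, see my earlier post; the procedure is the same.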

Installing on an Ubuntu virtual machine

Install the dependencies with the command below, then install Scrapy itself with pip install Scrapy:

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

How Scrapy runs

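Scrapy Engine: coordinates communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
Scheduler: receives the Requests sent by the engine, orders and queues them, and hands them back when the engine asks for them.
Downloader: downloads all the Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.
Spider: processes all Responses, parses them to extract the data needed for the Item fields, and submits any URLs that need following to the engine, which sends them back to the Scheduler.
Item Pipeline: post-processes the Items produced by the Spider (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: components that let you customize and extend the download behaviour.
Spider Middlewares: components that let you customize and extend the communication between the engine and the Spider (for example, Responses going into the Spider and Requests coming out of it).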
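The flow between these components can be sketched in plain Python. This is a toy model only: the FAKE_SITE dictionary, the function names, and the dictionaries standing in for requests, responses, and items are all invented for the illustration, and Scrapy's real engine is asynchronous and far more elaborate.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (page title, next page URL or None)
FAKE_SITE = {
    "page/1": ("First post", "page/2"),
    "page/2": ("Second post", None),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    title, next_url = FAKE_SITE[url]
    return {"url": url, "title": title, "next": next_url}


def parse(response):
    """Spider: yield one item, plus a follow-up request if there is a next page."""
    yield {"type": "item", "title": response["title"], "url": response["url"]}
    if response["next"]:
        yield {"type": "request", "url": response["next"]}


def run_engine(start_url):
    """Engine: move data between scheduler, downloader, spider, and pipeline."""
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    stored = []                     # Item Pipeline: here it only collects items
    while scheduler:
        url = scheduler.popleft()
        response = download(url)
        for result in parse(response):
            if result["type"] == "request":
                scheduler.append(result["url"])
            else:
                stored.append({"title": result["title"], "url": result["url"]})
    return stored


items = run_engine("page/1")
print(items)
```

The loop ends when the scheduler queue is empty, which mirrors how a crawl finishes once no requests are left to process.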

Using Scrapy

In the virtual machine, enter the following on the command line:

scrapy startproject project_name

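Following the prompt, cd into the newly created project and generate a spider (spider_name is the name of the spider and of its source file; spider_url is the domain of the site to crawl):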

scrapy genspider spider_name spider_url
(for example: scrapy genspider spider tanzhouedu.com)

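The following files are generated:

scrapy.cfg: the project's configuration file
mySpider/: the project's Python module; your code is imported from here
mySpider/items.py: the item definitions (the data fields to scrape)
mySpider/pipelines.py: the project's pipelines
mySpider/settings.py: the project's settings
mySpider/spiders/: the directory that holds the spider code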

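Key Scrapy concepts

Write the spider (spiders/xxspider.py)

Store the content (pipelines.py): design a pipeline that stores the data

Register the pipeline in mySpider/settings.py

Run the program: in the project folder in the virtual machine, run scrapy list to see the spiders that can be run, then run scrapy crawl spider_name to start the crawl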

Hands-on: the Jobbole (伯樂在線) example, saving to a JSON file

Create the project

bolezaixain\bolezaixain\items.py: define the fields to scrape

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BolezaixainItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    time = scrapy.Field()

bolezaixain\bolezaixain\settings.py: activate the pipeline
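The original settings are not reproduced here, so the fragment below is a sketch of what activating the pipeline typically looks like. The dotted path assumes the project and class names used in this walkthrough (bolezaixain and BolezaixainPipeline), and 300 is simply the value found in the generated settings template.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    # dotted path to the pipeline class: priority (0-1000, lower values run earlier)
    'bolezaixain.pipelines.BolezaixainPipeline': 300,
}
```

Without this entry Scrapy never calls the pipeline, so no items would be written to the output file.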

bolezaixain\bolezaixain\spiders\blog_jobbole.py: write the spider code

# -*- coding: utf-8 -*-
import scrapy
from ..items import BolezaixainItem  # import the item class from items.py in the parent package


class BlogJobboleSpider(scrapy.Spider):
    name = 'blog.jobbole'
    allowed_domains = ['blog.jobbole.com']  # domains only; a path here makes follow-up requests count as offsite
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        titles = response.xpath('//div[@class="post-meta"]/p/a[1]/@title').extract()
        urls = response.xpath('//div[@class="post-meta"]/p/a[1]/@href').extract()
        raw_times = response.xpath('//div[@class="post floated-thumb"]/div[@class="post-meta"]/p[1]/text()').extract()
        # Keep only the text nodes that contain a date, then remove line breaks and the '·' separator
        times = [t.replace('\r\n', '').replace('·', '').strip() for t in raw_times if '/' in t]

        for title, url, time in zip(titles, urls, times):
            item = BolezaixainItem()  # create a new item
            item['title'] = title
            item['url'] = url
            item['time'] = time
            yield item

        # Pagination: follow the "next page" link, e.g. http://blog.jobbole.com/all-posts/page/2/
        next_page = response.xpath('//a[@class="next page-numbers"]/@href').extract_first()
        if next_page:  # extract_first() returns None on the last page
            yield scrapy.Request(url=next_page, callback=self.parse)
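The step in parse that cleans the publication times can be tried on its own, without running a crawl. The sketch below uses a hypothetical helper name (clean_times) and made-up sample strings that imitate what the XPath text() query might return; it strips whitespace last so that no stray space is left after the '·' separator is removed.

```python
def clean_times(raw_texts):
    """Keep only text nodes that contain a date and strip the surrounding noise."""
    cleaned = []
    for text in raw_texts:
        if '/' not in text:  # text nodes without a date are skipped
            continue
        text = text.replace('\r\n', '').replace('·', '')
        cleaned.append(text.strip())  # strip last so no stray space remains
    return cleaned


# Hypothetical sample of raw text nodes (not taken from the real site)
sample = ['\r\n\r\n        2018/10/26 ·  ', '\r\n        ', ' 2018/10/25 · ']
print(clean_times(sample))
```

Testing the cleaning logic like this is quicker than re-running the spider each time the page markup changes.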

bolezaixain\bolezaixain\pipelines.py: save the data

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class BolezaixainPipeline(object):
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.f = open('blzx.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write one JSON object per line; ensure_ascii=False keeps Chinese text readable
        s = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.f.write(s)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.f.close()
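Because the pipeline writes one JSON object per line, blzx.json is not a single JSON document and has to be read back line by line. The sketch below shows the round trip with the standard library only; the sample items, their URLs, and the helper names are invented for the illustration, and a temporary directory is used so that no real output file is touched.

```python
import json
import os
import tempfile


def write_items(path, items):
    """Write each item as one JSON object per line, as the pipeline does."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


def read_items(path):
    """Read the file back by parsing every non-empty line separately."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical items in the same shape as BolezaixainItem
sample_items = [
    {'title': '示例文章一', 'url': 'http://blog.jobbole.com/1/', 'time': '2018/10/26'},
    {'title': '示例文章二', 'url': 'http://blog.jobbole.com/2/', 'time': '2018/10/25'},
]

path = os.path.join(tempfile.mkdtemp(), 'blzx.json')
write_items(path, sample_items)
loaded = read_items(path)
print(loaded)
```

Calling json.load on the whole file would fail as soon as it contains more than one item, which is why each line is parsed on its own.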