Python Crawler 30 | Scrapy follow-up: scraping the jokes from Qiushibaike and storing them in a database

Date: 2019-11-16


Last time we covered

Python Crawler 29 | An example of scraping Qiushibaike with scrapy, to show you how powerful it is!

WOW!!

scrapy

awesome!!

How can a framework be this good

wow!!

awesome!!

Scraping data with scrapy

Isn't! That! Just! Great!

wow!!

Next up is my own private moment

Oh wait, no

Next up is

the right way to learn Python

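We have already created a project for scraping Qiushibaike

and scraped the authors and jokes from its first two pages into a JSON file

This time

we are going to scrape all of the data

and use scrapy to store it in MongoDB

Before that, let's go over the directory that scrapy generated for us

and what each file in it is for

so that nobody ends up lost on the spot

Let's go through them one by one, from top to bottom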

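The spiders directory

is where we put the spider files we write

items.py

is where we define the fields of the data we want to store

middlewares.py

holds the middlewares, where you can hook into the crawling process, for example to do something when a response comes back

pipelines.py

is where we define how the data gets stored, for example a connection to MySQL or MongoDB

settings.py

is where we define our configuration, such as the request headers

Those are the main files in the directory scrapy generates, and what each one does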
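
To make the middlewares.py description a bit more concrete, here is a minimal sketch of a downloader middleware that does something each time a response comes back. The class name StatusLoggingMiddleware and the status_counts attribute are made up for illustration; only the process_response(request, response, spider) hook is part of scrapy's downloader middleware interface.

```python
import logging
from collections import Counter

logger = logging.getLogger(__name__)


class StatusLoggingMiddleware:
    """Hypothetical downloader middleware that tallies response statuses."""

    def __init__(self):
        # Maps HTTP status code -> number of responses seen with that status
        self.status_counts = Counter()

    def process_response(self, request, response, spider):
        # scrapy calls this hook for each response on its way back to the spider
        self.status_counts[response.status] += 1
        if response.status != 200:
            logger.warning("Got status %s for %s", response.status, response.url)
        # Returning the response lets it continue on to the spider callback
        return response
```

A middleware like this would only take effect once it is listed under DOWNLOADER_MIDDLEWARES in settings.py.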

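Now let's get into the code

Last time we created QiushiSpider to write our spider

Back then we only fetched the first two pages of data

So how do we fetch the data from every page?

Open the Qiushibaike link and you can see

that there are 13 pages of data

The way we did things before, we could simply write a for loop

But this time we can also use scrapy's response.follow method

Here is how it is used

First we get the link to the next page

Since the next-page button is always in the last li tag

the xpath for it looks like this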

# .attrib is an empty dict when nothing matches, so .get() returns None on the last page
next_page = response.xpath('//*[@id="content-left"]/ul/li[last()]/a').attrib.get('href')
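
If you want to try the li[last()] idea without running a crawl, here is a small sketch using the standard library's xml.etree.ElementTree in place of scrapy's selectors (it supports a limited XPath subset that happens to cover this expression). The sample markup, its hrefs, and the helper name find_next_href are all made up for illustration and are not the real Qiushibaike page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: the pagination list sits inside #content-left
SAMPLE_PAGE = """
<html><body>
  <div id="content-left">
    <div id="qiushi_tag_1">joke one</div>
    <ul class="pagination">
      <li><a href="/text/page/1/">1</a></li>
      <li><a href="/text/page/2/">2</a></li>
      <li><a href="/text/page/2/">next</a></li>
    </ul>
  </div>
</body></html>
"""

# Same layout, but with no pagination list at all
PAGE_WITHOUT_PAGINATION = """
<html><body><div id="content-left"><div id="qiushi_tag_9">joke</div></div></body></html>
"""


def find_next_href(page_source):
    """Return the href of the link in the last <li>, or None if there is none."""
    root = ET.fromstring(page_source.strip())
    link = root.find(".//*[@id='content-left']/ul/li[last()]/a")
    return None if link is None else link.get("href")
```

The point of returning None when nothing matches is that the spider can then simply stop following links.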

Then we can have it request the content of the next page

if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

You can also do it with urljoin

# if next_page is not None:
#     next_page = response.urljoin(next_page)
#     yield scrapy.Request(next_page, callback=self.parse)

This way we can get the data from every page
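
The difference between the two variants is that response.follow accepts a relative link directly, while scrapy.Request needs an absolute URL, which is why the second variant calls response.urljoin first. As a rough sketch of what that resolution step does, here is the same idea with the standard library's urllib.parse.urljoin. The page URL and hrefs are made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed
current_page = "https://example.com/text/page/2/"


def resolve(href):
    """Turn a link found on current_page into an absolute URL."""
    return urljoin(current_page, href)


print(resolve("/text/page/3/"))
print(resolve("../3/"))
print(resolve("https://example.com/hot/"))
```

Root-relative links, path-relative links, and already absolute links all come out as absolute URLs that can be requested directly.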

Next we want to save all of the data to the database

First, in items.py, we define the fields we want to store

import scrapy

class QiushibaikeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
    _id = scrapy.Field()

Then, in the parse method, we assign the scraped data to an item

Concretely, it looks like this

def parse(self, response):
    content_left_div = response.xpath('//*[@id="content-left"]')
    content_list_div = content_left_div.xpath('./div')

    for content_div in content_list_div:
        item = QiushibaikeItem()
        item['author'] = content_div.xpath('./div/a[2]/h2/text()').get()
        item['content'] = content_div.xpath('./a/div/span/text()').getall()
        item['_id'] = content_div.attrib['id']
        yield item

    # .attrib is an empty dict when nothing matches, so .get() returns None on the last page
    next_page = response.xpath('//*[@id="content-left"]/ul/li[last()]/a').attrib.get('href')
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse)

The line item = QiushibaikeItem() creates an instance of the item class we just defined

and the three lines after it assign the corresponding values
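
One thing worth noticing is that .getall() returns a list of strings, one per matched text node, so item['content'] is stored as a list rather than a single string. If you would rather store one string per joke, a small helper along these lines could be applied before the assignment. The name join_content is made up for illustration, and the sketch assumes the fragments may carry stray whitespace.

```python
def join_content(fragments):
    """Join the text fragments of one joke into a single string."""
    # Strip each fragment, then drop the ones that were only whitespace
    cleaned = (fragment.strip() for fragment in fragments)
    return "\n".join(part for part in cleaned if part)


print(join_content(["\n\nfirst line", "second line\n", "   "]))
```

Keeping the list as-is is also fine; it mostly depends on how you plan to query the data later.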

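So now we have defined the fields to store and written the scraping code

There is one more step

which is to set up the database we want to store to

Go to pipelines.py

import pymongo

class QiushibaikePipeline(object):
    def __init__(self):
        self.connection = pymongo.MongoClient('localhost', 27017)
        self.db = self.connection.scrapy
        self.collection = self.db.qiushibaike

    def process_item(self, item, spider):
        if self.connection is None or not item:
            return item
        # Collection.save() was removed in PyMongo 4, so upsert on _id instead
        self.collection.replace_one({'_id': item['_id']}, dict(item), upsert=True)
        # Return the item so any later pipeline still receives it
        return item

    def __del__(self):
        if self.connection is not None:
            self.connection.close()

Here we connect to the local MongoDB

and set up the scrapy database and the qiushibaike collection under it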
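
As an alternative shape for the same pipeline, here is a sketch that uses scrapy's open_spider and close_spider hooks instead of __init__ and __del__, and writes each item with an upsert keyed on _id so that re-running the crawl does not fail on duplicate ids. The class name MongoUpsertPipeline, the injectable collection argument, and the InMemoryCollection stand-in are all made up for illustration; replace_one with upsert=True is the PyMongo call assumed for the real connection.

```python
class MongoUpsertPipeline:
    """Sketch of an item pipeline that upserts each item by its _id."""

    def __init__(self, collection=None):
        # `collection` can be injected so the logic can run without a server
        self.client = None
        self.collection = collection

    def open_spider(self, spider):
        # Called by scrapy when the spider starts; connect only if needed
        if self.collection is None:
            import pymongo  # imported here so the sketch loads without PyMongo
            self.client = pymongo.MongoClient('localhost', 27017)
            self.collection = self.client['scrapy']['qiushibaike']

    def process_item(self, item, spider):
        document = dict(item)
        # Insert the document, or replace the existing one with the same _id
        self.collection.replace_one({'_id': document['_id']}, document, upsert=True)
        return item

    def close_spider(self, spider):
        # Called by scrapy when the spider finishes
        if self.client is not None:
            self.client.close()


class InMemoryCollection:
    """Hypothetical stand-in that mimics replace_one(..., upsert=True)."""

    def __init__(self):
        self.documents = {}

    def replace_one(self, query, replacement, upsert=False):
        if query['_id'] in self.documents or upsert:
            self.documents[query['_id']] = replacement
```

With the stand-in collection you can check the upsert behaviour locally: processing two items that share an _id leaves a single stored document holding the later values.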

Next we also need to configure this in settings.py

# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,
}

Only then will the pipeline actually be used
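
The number 300 is the pipeline's priority: items pass through the enabled pipelines from the lowest value to the highest, and values are conventionally chosen in the 0-1000 range. The sketch below shows that ordering with a second, purely hypothetical pipeline (CleanContentPipeline does not exist in the article's project), and run_order is just an illustrative helper, not a scrapy function.

```python
# Only QiushibaikePipeline comes from the article; the other entry is hypothetical
ITEM_PIPELINES = {
    'qiushibaike.pipelines.QiushibaikePipeline': 300,
    'qiushibaike.pipelines.CleanContentPipeline': 100,
}


def run_order(pipelines):
    """Return pipeline paths in the order items would pass through them."""
    ordered = sorted(pipelines.items(), key=lambda entry: entry[1])
    return [path for path, priority in ordered]


for path in run_order(ITEM_PIPELINES):
    print(path)
```

So a cleaning step would get a lower number than the storage step, to make sure the data is tidied up before it is written.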

Of course, we can set more things in settings.py

such as the request header

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/73.0.3683.86 Chrome/73.0.3683.86 Safari/537.36'
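
A few other settings that often sit next to USER_AGENT are sketched below. The setting names are real scrapy settings, but the values are only illustrative assumptions and would need tuning for the site being crawled.

```python
# Illustrative values only; they are not taken from the article's project
DOWNLOAD_DELAY = 1  # seconds to wait between requests to the same site
ROBOTSTXT_OBEY = True  # respect the site's robots.txt rules

# Headers sent with every request unless a request overrides them
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
```

A small delay like this makes the crawl slower but gentler on the site.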

Once that is all done

let's run the crawl from the command line