Crawling Zhihu User Profiles and Social Graph Relationships with Scrapy

Date: 2019-11-19



1. Creating the project

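Scrapy ships with a tool that generates a project skeleton. The generated project contains a set of preset files, and you add your own code to them. Open a command line and run: scrapy startproject tutorial. The generated project has a structure similar to the following: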

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py

2. Source code analysis

2.1 Overview of the project files

####2.1.1 items.py
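Item classes must inherit from scrapy.Item. The main goal of crawling is to extract structured data from unstructured sources such as web pages, and Scrapy provides the Item class for exactly this purpose. This project defines two groups of Items built from the crawled elements: Zhihu user attributes and Zhihu user relationships. They turn the crawled data into structured data and prepare it for persistence later on.
####2.1.2 pipelines.py
After an Item has been collected in a Spider, it is passed to the Item Pipeline, where several components process it in a defined order.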

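Each item pipeline component (sometimes simply called an "Item Pipeline") is a Python class that implements a simple method. It receives an Item and performs some action on it, and it also decides whether the Item continues through the pipeline or is dropped and processed no further.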

Typical uses of item pipelines include:

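Cleaning HTML data
Validating crawled data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing the crawl results in a database

Writing your own item pipeline

Writing your own item pipeline is simple. Each item pipeline component is a standalone Python class that must implement the following method: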

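process_item(self, item, spider) This method is called for every item pipeline component. It must either return an Item (or any subclass) object or raise a DropItem exception. Dropped items are not processed by any later pipeline component.
In this project, this is the class that persists the data to a MongoDB database.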

Parameters:
item (Item object): the item that was scraped
spider (Spider object): the spider that scraped the item

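import os

from pymongo import MongoClient
from zhihu.settings import MONGO_URI, PROJECT_DIR
from zhihu.items import ZhihuPeopleItem, ZhihuRelationItem
# Note: `async` has been a reserved word since Python 3.7, so this import only
# parses on older interpreters; on newer ones the module would need another name.
from zhihu.tools.async import download_pic


class ZhihuPipeline(object):
    """
    Store the data.
    """
    def __init__(self, mongo_uri, mongo_db, image_dir):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.image_dir = image_dir
        self.client = None
        self.db = None

    def process_item(self, item, spider):
        """
        Process an item, dispatching on its type.
        """
        # _process_people and _process_relation are defined elsewhere in the
        # class (not shown in this excerpt).
        if isinstance(item, ZhihuPeopleItem):
            self._process_people(item)
        elif isinstance(item, ZhihuRelationItem):
            self._process_relation(item)
        return item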
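The process_item contract described above (return the item, or raise DropItem to discard it) can be illustrated with a minimal sketch. The two pipeline classes, the field names zhihu_id and nickname, and the run_pipelines helper are all hypothetical and are not part of the original project. The helper only imitates how Scrapy chains pipeline components, and the fallback DropItem class exists solely so the sketch runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline:
    """Drop items that lack any required field (field names are assumptions)."""

    required = ("zhihu_id", "nickname")

    def process_item(self, item, spider):
        missing = [name for name in self.required if not item.get(name)]
        if missing:
            raise DropItem("missing fields: %s" % ", ".join(missing))
        return item


class DuplicatesPipeline:
    """Drop items whose zhihu_id has already been seen during this run."""

    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if item["zhihu_id"] in self.seen_ids:
            raise DropItem("duplicate user: %s" % item["zhihu_id"])
        self.seen_ids.add(item["zhihu_id"])
        return item


def run_pipelines(item, pipelines, spider=None):
    """Illustrative stand-in for Scrapy's pipeline chaining, not its real code."""
    try:
        for pipeline in pipelines:
            item = pipeline.process_item(item, spider)
    except DropItem:
        return None  # a dropped item never reaches the later components
    return item
```

In a real project Scrapy does the chaining itself, and the order of the components is set with the ITEM_PIPELINES setting rather than with a helper like run_pipelines.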

####2.1.3 settings.py
The configuration file. It makes the crawler's system parameters configurable.
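As a rough sketch of what such a settings.py might contain: MONGO_URI and PROJECT_DIR are names that the pipeline module imports from zhihu.settings, but every value below is an assumed placeholder rather than the project's real configuration, and the zhihu.spiders package path is inferred from the default project layout. The remaining names are standard Scrapy settings.

```python
# settings.py (sketch): all values are placeholders, not the project's real ones.
import os

BOT_NAME = "zhihu"
SPIDER_MODULES = ["zhihu.spiders"]    # assumed default package layout
NEWSPIDER_MODULE = "zhihu.spiders"

# Custom settings imported by the pipeline module (names from the project,
# values assumed for illustration).
PROJECT_DIR = os.getcwd()
MONGO_URI = "mongodb://localhost:27017"

# Built-in Scrapy settings.
DOWNLOAD_DELAY = 1.0       # seconds to wait between requests to the same site
COOKIES_ENABLED = True     # the login flow relies on cookies
ITEM_PIPELINES = {
    # Components with lower numbers run first (customarily 0 to 1000).
    "zhihu.pipelines.ZhihuPipeline": 300,
}
```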
####2.1.4 spiders
This is a package in which you write your own spider code. For example, the ZhihuSipder class is the entry class of the Zhihu crawler.

2.2 Key techniques

####2.2.1 CrawlSpider
This project uses CrawlSpider, a class that inherits from Spider and is well suited to crawling an entire site.

from scrapy.spiders import CrawlSpider


class ZhihuSipder(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_url = "https://www.zhihu.com/people/hua-zi-xu-95"

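A brief look at the three attributes above:
1. name: every custom spider must have one, and it uniquely identifies the spider. It is also what you pass on the command line to start the spider: in scrapy crawl zhihu, zhihu is the name attribute.
2. allowed_domains: the domains a link must belong to in order to be crawled.
3. start_url: the crawler's entry address, here my own Zhihu profile page. Note that Scrapy's built-in attribute is start_urls (a list), so a custom start_url like this one is presumably consumed by the spider's own start_requests method, which is not shown here.
####2.2.2 Using FormRequest for the login module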

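from scrapy.http import Request, FormRequest
from scrapy.selector import Selector

    # Method of the ZhihuSipder class (excerpt).
    def post_login(self, response):
        """
        Parse the login page and submit the login form.
        """
        # The hidden _xsrf input holds the anti-CSRF token to send back.
        self.xsrf = Selector(response).xpath(
            '//input[@name="_xsrf"]/@value'
        ).extract()[0]
        return [FormRequest(
            'https://www.zhihu.com/login/email',
            method='POST',
            # Reuse the cookie jar of the incoming response to keep the session.
            meta={'cookiejar': response.meta['cookiejar']},
            formdata={
                '_xsrf': self.xsrf,
                'email': 'xxxxxxxxx',       # placeholder credentials
                'password': 'xxxxxxxxx',
                'remember_me': 'true'},
            callback=self.after_login
        )]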

####2.2.3 Requests and Responses
Scrapy uses Request and Response objects to crawl web sites.

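Typically, Request objects are generated in the spiders and eventually passed to the Downloader, which executes them and returns a Response object. That Response then travels back to the spider that issued the request.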

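Both Request and Response have subclasses that add functionality not required in the base classes. These are described in detail in the "Request subclasses" and "Response subclasses" sections of the Scrapy documentation. Below is the Request constructor: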

class Request(object_ref):
    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):
        self._encoding = encoding  # this one has to be set first
        self.method = str(method).upper()
        self._set_url(url)
        self._set_body(body)
        assert isinstance(priority, int), "Request priority not an integer: %r" % priority
        self.priority = priority
        assert callback or not errback, "Cannot use errback without a callback"
        self.callback = callback
        self.errback = errback
        self.cookies = cookies or {}
        self.headers = Headers(headers or {}, encoding=encoding)
        self.dont_filter = dont_filter
        self._meta = dict(meta) if meta else None

Pay special attention to the dont_filter parameter:
dont_filter (boolean): indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Defaults to False.
Explanation: Scrapy deduplicates requests by URL. The dont_filter parameter of Request defaults to False, which means deduplication is applied by default.
The scheduler code that performs this check:

def enqueue_request(self, request):
    # Core check (note the `not`): a request is rejected only when filtering
    # applies to it AND the dupe filter has already seen it.
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    dqok = self._dqpush(request)
    if dqok:
        self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
    else:
        self._mqpush(request)
        self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
    self.stats.inc_value('scheduler/enqueued', spider=self.spider)
    return True

P.S. The class Scrapy uses to detect duplicates is RFPDupeFilter.
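To make the idea concrete, here is a simplified, standard-library-only sketch of fingerprint-based duplicate detection combined with the dont_filter check. It is not Scrapy's actual implementation: the names fingerprint, SeenFilter and should_enqueue are hypothetical, and the URL normalisation is deliberately cruder than what Scrapy does. It only shows the general technique of hashing a request into a fingerprint and remembering the fingerprints already seen.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url, body=b""):
    """Hash the method, a normalised URL and the body into a hex digest."""
    parts = urlsplit(url)
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 hash identically, and
    # drop the fragment, which is never sent to the server.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, query, "")
    )
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(canonical.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


class SeenFilter:
    """Remember the fingerprints of every request seen so far."""

    def __init__(self):
        self.fingerprints = set()

    def request_seen(self, method, url, body=b""):
        fp = fingerprint(method, url, body)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


def should_enqueue(dupefilter, url, method="GET", dont_filter=False):
    """Same shape as the scheduler check: dont_filter bypasses the filter."""
    if not dont_filter and dupefilter.request_seen(method, url):
        return False
    return True
```

With this sketch, a second request for the same URL is rejected even if its query parameters are reordered, while passing dont_filter=True lets it through. That is also why careless use of dont_filter can lead to crawling loops.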