Scraping Zhihu user profiles with Scrapy and storing them in MongoDB

The general idea and workflow are as follows:

Following this crawl flow, we can recursively scrape a large amount of data.

We pick 輪子哥 (vczh) as the starting user and use Google Chrome's developer tools to inspect the page's Network panel.

We can see that when the browser fetches the follower and followee data, it is actually calling a Zhihu API. Observing each user's data, we find that the url_token parameter is very important: it identifies each user, so we can use url_token to locate a given user.
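For example, the followees data comes from a request of the following form (using the starting user's url_token, excited-vczh, and the include/offset/limit parameters that appear in the spider code below):

https://www.zhihu.com/api/v4/members/excited-vczh/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20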

To get more detailed information about a single user, hover the mouse over the user's avatar.

A corresponding Ajax request then shows up in the Network panel:
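It follows the user_url template used in the spider code below; for the starting user it looks like this:

https://www.zhihu.com/api/v4/members/excited-vczh?include=allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics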

Open cmd and create the project with the command scrapy startproject zhihuuser, then start writing the code properly.

  1. First set ROBOTSTXT_OBEY to False in settings.py so the crawler is not constrained by the robots.txt protocol; otherwise some pages cannot be accessed (see the settings sketch after step 4).
  2. Enter the zhihuuser folder and create the Zhihu spider with scrapy genspider zhihu www.zhihu.com.

  3. First we test the spider by running it directly from cmd. The run stalls, and the server responds with 400 errors. This is because Zhihu's server checks the user-agent, so we modify the DEFAULT_REQUEST_HEADERS setting in settings.py and add a browser's default user-agent (also shown in the settings sketch after step 4).
  4. Rewrite the initial requests:

    Step one defines the user_url, follows_url, and followers_url templates and adds their include parameters (i.e. user_query, follows_query, and followers_query).

    Step two overrides start_requests: the initial user is 輪子哥 (start_user), format() fills in the complete URLs, and parse_user, parse_follows, and parse_followers are set as the respective callbacks.
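As referenced in steps 1 and 3, the corresponding settings.py changes look roughly like this (a minimal sketch; the User-Agent string is only an example, substitute the one your own browser sends):

# settings.py
ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}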

  5. Create the items container that stores the data (see items.py at the end).

Since there are a lot of fields, I have not written them all out here.
6. Store each user's information.

In the spider we write the parse_user function. In it we create a UserItem() object; through the item.fields attribute we can get the fields defined in UserItem and use them to fill the container.

7. The Ajax requests for followers and followees

An Ajax response contains two data blocks: data and paging.

① For data, the information we mainly need is the url_token: it is the unique identifier of a user, and from it we can construct the URL of that user's home page.

② For paging:

paging tells us whether the followers/followees Ajax request has reached the end of the list, i.e. whether the current page of followers is the last one. is_end indicates whether this is the last page, and next is the URL of the next page; these two values are the ones we need.
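An abbreviated sketch of what such a response looks like (values are illustrative; only the parts we use are shown):

{
    "data": [
        {"url_token": "excited-vczh", "answer_count": 123, ...},
        ...
    ],
    "paging": {
        "is_end": false,
        "next": "https://www.zhihu.com/api/v4/members/excited-vczh/followees?offset=20&limit=20"
    }
}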

First check whether paging exists; when is_end is false, take the next-page URL of the followees/followers list and yield a Request for it (see parse_follows and parse_followers below).

8. Store the user information in MongoDB, i.e. rewrite the pipeline.

The official Scrapy documentation has MongoDB pipeline code; copy it directly.

    To deduplicate, one line needs to be changed:
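The changed line is the upsert call in process_item (shown in full in the pipeline code at the end); the filter on url_token is the deduplication condition, and upsert=True means update if found, otherwise insert:

self.db[self.collection_name].update_one({'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)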

Of course, the configuration parameters in settings.py also need to be updated.
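A minimal sketch of those settings, assuming the pipeline class lives in zhihuuser/pipelines.py (the URI and database name are example values):

# settings.py
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}
MONGO_URI = 'localhost'
MONGO_DATABASE = 'zhihu'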

 

Finally, run the code.
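From the project directory, start the spider by the name defined in the spider class:

scrapy crawl zhihu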

 

Code:

Spider code (zhihu.py):

# -*- coding: utf-8 -*-
import json

from scrapy import Request, Spider

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_user = 'excited-vczh'
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'

    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query), self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()  # create a UserItem instance to fill
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item
        # crawl this user's followees list
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query, limit=20, offset=0),
            callback=self.parse_follows)
        # crawl this user's followers list
        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query, limit=20, offset=0),
            callback=self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)
        if 'data' in results:
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        # follow pagination until is_end is true; 'next' is the next page's URL
        if 'paging' in results and results['paging'].get('is_end') is False:
            next_page = results['paging'].get('next')
            yield Request(next_page, callback=self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)
        if 'data' in results:
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        # follow pagination until is_end is true; 'next' is the next page's URL
        if 'paging' in results and results['paging'].get('is_end') is False:
            next_page = results['paging'].get('next')
            yield Request(next_page, callback=self.parse_followers)

Pipeline code (pipelines.py):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import pymongo


class MongoPipeline(object):

    collection_name = 'user'  # collection that stores the user documents

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # upsert keyed on url_token: the filter argument is the deduplication
        # condition, '$set' supplies the item data, and upsert=True updates the
        # existing document if one is found, otherwise inserts a new one
        if item.get('url_token'):
            self.db[self.collection_name].update_one(
                {'url_token': item['url_token']},
                {'$set': dict(item)},
                upsert=True)
        return item

Item code (items.py):

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class UserItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = Field()
    name = Field()
    employments = Field()
    url_token = Field()
    follower_count = Field()
    url = Field()
    answer_count = Field()
    headline = Field()