Crawlers (16): Scraping Zhihu User Information with Scrapy

Date: 2019-11-29


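Part 1: Crawling strategy

First we need a starting account that both follows many people and has many followers, that is, someone at the top of the pyramid. We scrape that account's profile, then scrape the accounts it follows and the accounts that follow it. For each of those accounts we again fetch the profile together with its followee list and follower list, and so on. Repeating this recursively should, in principle, reach nearly every account on Zhihu. The original post illustrates the process with two figures.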
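
The recursive idea can be sketched without any network access. The snippet below is only an illustration under assumptions: the FOLLOWEES and FOLLOWERS dictionaries are made-up toy data, and crawl is a hypothetical helper, not part of the project. It walks outward from one seed account and never schedules the same account twice.

```python
from collections import deque

# Hypothetical toy data: who each account follows, and who follows it.
FOLLOWEES = {
    "seed": ["alice", "bob"],
    "alice": ["bob"],
    "bob": ["seed"],
    "carol": ["seed"],
}
FOLLOWERS = {
    "seed": ["bob", "carol"],
    "alice": ["seed"],
    "bob": ["seed", "alice"],
    "carol": [],
}


def crawl(start_user):
    """Visit every account reachable through followee/follower links."""
    seen = {start_user}
    queue = deque([start_user])
    order = []
    while queue:
        user = queue.popleft()
        order.append(user)  # the real spider would fetch and store the profile here
        neighbours = FOLLOWEES.get(user, []) + FOLLOWERS.get(user, [])
        for other in neighbours:
            if other not in seen:  # skip accounts that are already scheduled
                seen.add(other)
                queue.append(other)
    return order


print(crawl("seed"))
```

In the actual spider there is no explicit seen set: Scrapy's scheduler filters duplicate requests by default, which plays the same role.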

Part 2: Analyzing the requests

The account we start from is: https://www.zhihu.com/people/excited-vczh/answers

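The figure below shows this influencer's main profile information:

Next we fetch the people he follows and the people who follow him:

Here we need to capture the network traffic to work out how to fetch these lists and each user's profile details.
When we view the list of people he follows, the browser requests the address shown in the figure below. The response is JSON, and it holds one page of followed users.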

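That list gives some information about each user, but it is not complete. The full details of a single user come from a different address: when we hover the mouse over a user's name, the page sends another request:

Looking at the response, we can see that this request returns the user's detailed information:

From this analysis we get the following two addresses: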

1. Followee list: https://www.zhihu.com/api/v4/members/excited-vczh/followees?include=data%5B*%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics&offset=0&limit=20

2. User details: https://www.zhihu.com/api/v4/members/nan-xia-95-92?include=allow_message%2Cis_followed%2Cis_following%2Cis_org%2Cis_blocking%2Cemployments%2Canswer_count%2Cfollower_count%2Carticles_count%2Cgender%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics

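Comparing these two addresses reveals something useful: the url_token in the user data is the key for fetching a single user's details and an essential part of the request. When we open a user's followee list, the requested address is likewise identified only by this url_token.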
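
To make the role of url_token concrete, here is a small sketch that rebuilds both addresses from a url_token. The helper names (encode_include, followees_url, user_url) are hypothetical, and the endpoints are simply the ones captured above, so they may no longer match the live site. Percent-encoding the include value with `*`, `(` and `)` left untouched reproduces the captured URLs.

```python
from urllib.parse import quote

API_ROOT = "https://www.zhihu.com/api/v4/members"

# Unencoded form of the include parameters seen in the captured requests.
FOLLOWEES_INCLUDE = (
    "data[*].answer_count,articles_count,gender,follower_count,"
    "is_followed,is_following,badge[?(type=best_answerer)].topics"
)
USER_INCLUDE = (
    "allow_message,is_followed,is_following,is_org,is_blocking,employments,"
    "answer_count,follower_count,articles_count,gender,"
    "badge[?(type=best_answerer)].topics"
)


def encode_include(include):
    """Percent-encode the include value, leaving '*', '(' and ')' as they are."""
    return quote(include, safe="*()")


def followees_url(url_token, offset=0, limit=20):
    """Address of one page of the accounts that url_token follows."""
    return (
        f"{API_ROOT}/{url_token}/followees"
        f"?include={encode_include(FOLLOWEES_INCLUDE)}"
        f"&offset={offset}&limit={limit}"
    )


def user_url(url_token):
    """Address of the detailed profile for url_token."""
    return f"{API_ROOT}/{url_token}?include={encode_include(USER_INCLUDE)}"


print(followees_url("excited-vczh"))
print(user_url("nan-xia-95-92"))
```

Only the url_token and the paging values change from request to request; everything else is fixed.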

Part 3: Creating the project

Create the project from the command line:
scrapy startproject zhihu_user
cd zhihu_user
scrapy genspider zhihu zhihu.com

Once it is created, open it in PyCharm:

Edit the settings file:

# Whether to obey robots.txt; we change this to False
ROBOTSTXT_OBEY = False
# Add default request headers, because Zhihu checks the request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20',
}

Part 4: Implementation

(1) items.py: definitions of the fields we want to scrape

import scrapy


class UserItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    allow_message = scrapy.Field()
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    avatar_url = scrapy.Field()
    avatar_url_template = scrapy.Field()
    badge = scrapy.Field()
    employments = scrapy.Field()
    follower_count = scrapy.Field()
    gender = scrapy.Field()
    headline = scrapy.Field()
    is_advertiser = scrapy.Field()
    is_blocking = scrapy.Field()
    is_followed = scrapy.Field()
    is_following = scrapy.Field()
    is_org = scrapy.Field()
    type = scrapy.Field()
    url = scrapy.Field()
    url_token = scrapy.Field()
    user_type = scrapy.Field()
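
The spider later copies only these declared fields out of each API response and ignores everything else. As a rough, Scrapy-free illustration of that filtering step, the sketch below uses a hypothetical pick_fields helper, a shortened field set, and a made-up sample response.

```python
# Hypothetical subset of the item's field names, for illustration only.
ITEM_FIELDS = {"id", "name", "url_token", "follower_count", "gender"}


def pick_fields(result, fields=ITEM_FIELDS):
    """Keep only the keys of result that are declared item fields."""
    return {key: value for key, value in result.items() if key in fields}


# Made-up response data; real responses contain many more keys.
sample = {
    "id": "abc123",
    "name": "Example User",
    "url_token": "example-user",
    "follower_count": 42,
    "unexpected_key": True,
}
picked = pick_fields(sample)
print(picked)
```

Fields missing from the response (gender here) are simply left unset, and keys that are not declared on the item are dropped.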

(2) The main spider code

# -*- coding: utf-8 -*-
import json

import scrapy

# The package name must match the project created above (zhihu_user)
from zhihu_user.items import UserItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['zhihu.com']
    start_urls = ['http://zhihu.com/']

    # The influencer account to start from
    start_user = 'excited-vczh'

    # user_url is the address for a user's details; its query parameters are kept separately in user_query
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'

    # follows_url is the address of the followee list (the people this user follows) and follows_query holds
    # its query parameters. offset and limit control paging; 0 and 20 mean the first page
    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    # followers_url is the address of the follower list (the people who follow this user) and
    # followers_query holds its query parameters
    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    # Override the method that issues the first requests
    def start_requests(self):
        """
        Overrides start_requests to request the start user's details,
        followee list, and follower list.
        :return:
        """
        yield scrapy.Request(self.user_url.format(user=self.start_user, include=self.user_query),
                             callback=self.parse_user)
        yield scrapy.Request(
            self.follows_url.format(user=self.start_user, include=self.follows_query, offset=0, limit=20),
            callback=self.parse_follows)
        yield scrapy.Request(
            self.followers_url.format(user=self.start_user, include=self.followers_query, offset=0, limit=20),
            callback=self.parse_followers)

    def parse_user(self, response):
        """
        The response is JSON, so we parse it directly with json.loads.
        :param response:
        :return:
        """
        result = json.loads(response.text)
        item = UserItem()
        # Copy every field that is both declared on the item and present in the response
        for field in item.fields:
            if field in result:
                item[field] = result.get(field)
        # Yield the item, then yield requests for this user's followee and follower lists
        # so that the crawl continues recursively
        yield item
        yield scrapy.Request(
            self.follows_url.format(user=result.get("url_token"), include=self.follows_query, offset=0, limit=20),
            callback=self.parse_follows)
        yield scrapy.Request(
            self.followers_url.format(user=result.get("url_token"), include=self.followers_query, offset=0, limit=20),
            callback=self.parse_followers)

    def parse_follows(self, response):
        # Parse a followee list. The response is JSON with two keys, data and paging; paging holds the page information
        results = json.loads(response.text)
        if 'data' in results:
            for result in results.get('data'):
                yield scrapy.Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                                     self.parse_user)
        # If paging exists and its is_end flag is False, this is not the last page
        if 'paging' in results and results.get('paging').get('is_end') is False:
            next_page = results.get('paging').get('next')
            # Request the next page with this same method as the callback
            yield scrapy.Request(next_page, self.parse_follows)

    def parse_followers(self, response):
        """
        Parse a follower list; the handling is the same as for the followee list.
        The response is JSON with two keys, data and paging; paging holds the page information.
        :param response:
        :return:
        """

        results = json.loads(response.text)
        if 'data' in results:
            for result in results.get('data'):
                yield scrapy.Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                                     self.parse_user)
        if 'paging' in results and results.get('paging').get('is_end') is False:
            next_page = results.get('paging').get('next')
            yield scrapy.Request(next_page, self.parse_followers)

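A brief description of the spider above:

1. The overridden start_requests yields three requests whose callbacks are parse_user, parse_follows and parse_followers. This first round fetches the chosen influencer's details, followee list and follower list.
2. parse_user in turn yields requests handled by parse_follows and parse_followers, recursively fetching each user's followee list and follower list.
3. parse_follows requests the details of every user in a followee list with parse_user as the callback, and handles paging by using itself as the callback for the next page.
4. parse_followers requests the details of every user in a follower list with parse_user as the callback, and handles paging by using itself as the callback for the next page.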
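
The per-page logic shared by parse_follows and parse_followers can be tried in isolation. The sketch below assumes the response shape the spider relies on (a data list plus a paging object with is_end and next); parse_page is a hypothetical helper, and the sample payload and its next address are made up for illustration.

```python
import json


def parse_page(body):
    """Return (url_tokens on this page, next page URL or None)."""
    results = json.loads(body)
    tokens = [
        user["url_token"]
        for user in results.get("data", [])
        if user.get("url_token")
    ]
    paging = results.get("paging", {})
    # Follow "next" only when the API explicitly says this is not the last page.
    next_url = paging.get("next") if paging.get("is_end") is False else None
    return tokens, next_url


# Made-up payloads in the assumed shape.
FIRST_PAGE = json.dumps({
    "data": [{"url_token": "user-a"}, {"url_token": "user-b"}],
    "paging": {
        "is_end": False,
        "next": "https://www.zhihu.com/api/v4/members/example/followees?offset=20&limit=20",
    },
})
LAST_PAGE = json.dumps({"data": [{"url_token": "user-c"}], "paging": {"is_end": True}})

print(parse_page(FIRST_PAGE))
print(parse_page(LAST_PAGE))
```

Each returned url_token becomes a parse_user request in the spider, and a non-empty next address becomes another request to the same list callback.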

(3) Storing the data in MongoDB

Update the pipeline code:

import pymongo


class MongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            # The database connection values are defined in settings
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed on url_token, so a user scraped twice is updated instead of duplicated
        self.db['user'].update_one({'url_token': item['url_token']}, {'$set': dict(item)}, upsert=True)
        # self.db[self.collection_name].insert_one(dict(item))
        return item
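
The pipeline reads MONGO_URI and MONGO_DATABASE from the settings, and Scrapy only runs it once it is listed in ITEM_PIPELINES. A minimal settings sketch follows; every value in it is an assumption (a local MongoDB instance, a database called zhihu, and the project package zhihu_user created earlier), so adjust it to your own setup.

```python
# settings.py (sketch; all values below are assumptions)

ITEM_PIPELINES = {
    # Assumes the project package is zhihu_user and the class lives in pipelines.py
    "zhihu_user.pipelines.MongoPipeline": 300,
}

MONGO_URI = "mongodb://localhost:27017"  # assumed local MongoDB instance
MONGO_DATABASE = "zhihu"  # hypothetical database name
```

With these in place, running scrapy crawl zhihu from the project directory starts the spider, and each scraped user is upserted into the user collection.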