The general flow is as follows: start from a chosen seed user, fetch that user's details plus their followers and followees, and repeat the same steps for every user discovered.
Through this crawling flow we can recursively collect a large amount of user data.
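A minimal, framework-free sketch of that recursion (fetch_profile and fetch_follow_tokens are hypothetical placeholders for the API calls; the real project uses Scrapy, shown later):

def fetch_profile(token):
    return {'url_token': token}      # placeholder for the real user-detail API call

def fetch_follow_tokens(token):
    return []                        # placeholder: url_tokens of followers + followees

def crawl_user(token, seen):
    if token in seen:                # skip users we have already crawled
        return
    seen.add(token)
    profile = fetch_profile(token)   # store / process this user's details here
    for other in fetch_follow_tokens(token):
        crawl_user(other, seen)      # recurse into every newly discovered user

crawl_user('excited-vczh', set())    # 輪子哥 as the seed user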
We choose 輪子哥 (excited-vczh) as the seed user and use Google Chrome's developer tools to watch the page's Network panel.
We can see that when the browser fetches the follower and followee data, it is actually calling a Zhihu API. Looking at each user's data, we can tell that the url_token parameter is very important: it uniquely identifies each user, and we can use url_token to locate that user.
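For reference, the list request seen in the Network panel follows this pattern: url_token picks the user, include lists the fields to return, and offset/limit page through the list (the include string is shortened here; the full one appears in the spider code at the end):

# Followee-list endpoint observed in Chrome's Network panel (include shortened).
follows_url = ('https://www.zhihu.com/api/v4/members/{user}/followees'
               '?include={include}&offset={offset}&limit={limit}')
print(follows_url.format(user='excited-vczh',          # 輪子哥's url_token
                         include='data[*].answer_count,follower_count',
                         offset=0, limit=20))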
To get more detailed information about a single user, we hover the mouse over the user's avatar.
At that point a corresponding Ajax request appears in the Network panel:
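That hover hits the per-user detail endpoint, which takes only the url_token plus an include parameter (again shortened here; the full user_query is in the spider code):

# Per-user detail endpoint triggered when hovering over an avatar (include shortened).
user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
print(user_url.format(user='excited-vczh',
                      include='answer_count,follower_count,gender'))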
Open a terminal and create the project with scrapy startproject zhihuuser, then start writing the actual code.
Step 1: define the user_url and follows_url URL templates and attach their include parameters (i.e. user_query and follows_query).
Step 2: override start_requests. The initial user is 輪子哥 (start_user); the URLs are completed with format(), and parse_user and parse_follows are set as the respective callbacks.
Because the query strings contain a lot of fields, I have not written them all out here.
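A condensed sketch of those two steps, with the include strings shortened and the parse callbacks stubbed out (the complete version is in the zhihu.py listing at the end):

from scrapy import Request, Spider

class ZhihuSpider(Spider):
    name = 'zhihu'
    start_user = 'excited-vczh'                               # 輪子哥's url_token
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'answer_count,follower_count,gender'         # shortened here
    follows_url = ('https://www.zhihu.com/api/v4/members/{user}/followees'
                   '?include={include}&offset={offset}&limit={limit}')
    follows_query = 'data[*].answer_count,follower_count'     # shortened here

    def start_requests(self):
        # Seed the crawl with the start user's profile and followee list.
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query),
                      callback=self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query,
                                              offset=0, limit=20),
                      callback=self.parse_follows)

    def parse_user(self, response):
        pass    # filled in at step 6

    def parse_follows(self, response):
        pass    # filled in at step 7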
6. Store each user's information.
In the spider we write a parse_user function. In it we create a UserItem() object; through the item.fields attribute we can get the fields declared on UserItem and use them to fill the item from the response.
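A standalone sketch of that field-filling idea (build_user_item is a hypothetical helper name; in the real spider this loop sits inline in parse_user, as the full listing shows):

import json
from zhihuuser.items import UserItem

def build_user_item(response_text):
    result = json.loads(response_text)     # the user-detail JSON returned by the API
    item = UserItem()
    for field in item.fields:              # every field declared on UserItem
        if field in result:                # copy only keys the API actually returned
            item[field] = result[field]
    return item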
7. The Ajax requests for followers and followees
The Ajax response contains two blocks of data: data and paging.
① For data, the main piece of information we need is url_token, the unique identifier of a user, from which we can build that user's profile URL.
② For paging
paging tells us whether the followers/followees Ajax request, i.e. the follower list, has reached the last page. is_end indicates whether this is the last page, and next is the URL of the next page; these two fields are what we need.
First check whether paging exists; when is_end is false, take the URL of the next page of the followers/followees list and then yield a Request for it.
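A sketch of that paging check (follow_paging is a hypothetical helper name; the spider does this at the end of parse_follows and parse_followers in the listing below):

import json
from scrapy import Request

def follow_paging(response, callback):
    results = json.loads(response.text)
    paging = results.get('paging')
    # Continue only when paging exists and this is not the last page.
    if paging and paging.get('is_end') is False:
        # 'next' already holds the full URL of the next page of the list.
        yield Request(paging.get('next'), callback=callback)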
8. Store the user information in MongoDB, i.e. rewrite the pipeline.
The official Scrapy documentation provides MongoDB pipeline code, which we can copy directly.
To deduplicate users, one line needs to be changed (the write becomes an upsert keyed on url_token):
Of course, we also need to set a few parameters in settings.py.
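A minimal settings.py sketch: the ITEM_PIPELINES entry and the two MONGO_* names are required by the pipeline above; the concrete values and the remaining settings are assumptions, not taken from the original post:

# settings.py (relevant parts only)

# Enable the MongoDB pipeline from pipelines.py
ITEM_PIPELINES = {
    'zhihuuser.pipelines.MongoPipeline': 300,
}

# Read by MongoPipeline.from_crawler()
MONGO_URI = 'localhost'        # assumed: local MongoDB instance
MONGO_DATABASE = 'zhihu'       # assumed: database name

# Assumed: robots.txt checking is turned off and a browser-like
# User-Agent is set so the Zhihu API requests go through.
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'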
Finally, run the spider with scrapy crawl zhihu (matching the spider's name attribute).
Code:
Spider code (zhihu.py):
# -*- coding: utf-8 -*-
import json

from scrapy import Request, Spider

from zhihuuser.items import UserItem


class ZhihuSpider(Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_user = 'excited-vczh'

    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    user_query = 'allow_message,is_followed,is_following,is_org,is_blocking,employments,answer_count,follower_count,articles_count,gender,badge[?(type=best_answerer)].topics'

    follows_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&offset={offset}&limit={limit}'
    follows_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    followers_url = 'https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&offset={offset}&limit={limit}'
    followers_query = 'data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        # Seed the crawl with the start user's profile, followee list and follower list.
        yield Request(self.user_url.format(user=self.start_user, include=self.user_query),
                      self.parse_user)
        yield Request(self.follows_url.format(user=self.start_user, include=self.follows_query,
                                              offset=0, limit=20),
                      self.parse_follows)
        yield Request(self.followers_url.format(user=self.start_user, include=self.followers_query,
                                                offset=0, limit=20),
                      self.parse_followers)

    def parse_user(self, response):
        result = json.loads(response.text)
        item = UserItem()  # create a UserItem instance
        # Fill every field declared on UserItem that appears in the API response.
        for field in item.fields:
            if field in result.keys():
                item[field] = result.get(field)
        yield item

        # Request this user's followee list
        yield Request(
            self.follows_url.format(user=result.get('url_token'), include=self.follows_query,
                                    limit=20, offset=0),
            callback=self.parse_follows)

        # Request this user's follower list
        yield Request(
            self.followers_url.format(user=result.get('url_token'), include=self.followers_query,
                                      limit=20, offset=0),
            callback=self.parse_followers)

    def parse_follows(self, response):
        results = json.loads(response.text)
        # Each entry in data is a followee; fetch their detailed profile.
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        # Follow pagination until is_end is true.
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parse_follows)

    def parse_followers(self, response):
        results = json.loads(response.text)
        # Each entry in data is a follower; fetch their detailed profile.
        if 'data' in results.keys():
            for result in results.get('data'):
                yield Request(self.user_url.format(user=result.get('url_token'), include=self.user_query),
                              callback=self.parse_user)
        # Follow pagination until is_end is true.
        if 'paging' in results.keys() and results.get('paging').get('is_end') == False:
            next_page = results.get('paging').get('next')
            yield Request(next_page, callback=self.parse_followers)
Pipeline code (pipelines.py):
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo


class MongoPipeline(object):
    collection_name = 'scrapy_items'   # left over from the docs example; the code below writes to 'user'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        try:
            # First argument: the dedup condition (match on url_token).
            # Second argument: the item data to write.
            # Third argument (True): upsert, i.e. update the document if it
            # exists, otherwise insert a new one.
            self.db['user'].update({'url_token': item['url_token']}, {'$set': dict(item)}, True)
        except:
            pass
        return item
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field


class UserItem(scrapy.Item):
    # Fields collected for each Zhihu user
    id = Field()
    name = Field()
    employments = Field()
    url_token = Field()
    follower_count = Field()
    url = Field()
    answer_count = Field()
    headline = Field()