Last time we crawled Zhihu questions and answers; this time we will crawl Zhihu user information.
Zhihu user information is served as a JSON document. Once we find the URL that returns this JSON, we can request it and get our data.
url="https://www.zhihu.com/api/v4/members/liu-qian-chi-24?include=locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics",github
This is the URL for my own Zhihu account: the part right after /members/ (liu-qian-chi-24) is the username, and the part after include= is the content being requested. Of course I did not write that part by hand; it is simply the parameter the browser sends with the request.
As you can see, this is what the include parameter carries when requesting user information. From this, the user-info URL is built as:
user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
The include parameter is as follows:
user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
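If you want to poke at this endpoint outside Scrapy first, a minimal sketch with requests looks like this. Whether it works without logging in depends on Zhihu's anti-crawling rules at the time (the spider below logs in before calling it), and the include string is shortened here for readability:

import requests

user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0"}

resp = requests.get(user_url.format(user="liu-qian-chi-24",
                                    include="answer_count,follower_count,gender"),
                    headers=headers)
data = resp.json()  # the response body is plain JSON
print(data.get("url_token"), data.get("answer_count"), data.get("follower_count"))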
Next, build the URLs for the follower and followee lists. In the browser you can see that the initial request looks like this (the include parameter is URL-encoded):
https://www.zhihu.com/api/v4/members/liu-qian-chi-24/followees?include=data%5B%2A%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=0
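To see what that encoded include parameter actually contains, decode it with the standard library:

from urllib.parse import unquote

encoded = ("data%5B%2A%5D.answer_count%2Carticles_count%2Cgender%2Cfollower_count"
           "%2Cis_followed%2Cis_following%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics")
print(unquote(encoded))
# data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics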
We need to fill in four parts of the template: the username, include, limit, and offset. The initial request in the browser shows the corresponding parameters: liu-qian-chi-24 is the username, and once we add it we have the URL for the information of the people who follow him:
followed_url = "https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&limit={limit}&offset={offset}"
The include parameter is as follows:
followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"
The URL for the people I follow can be built the same way; the parameters shown in the browser are the same. From this, the URL for the people I follow is:
following_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&limit={limit}&offset={offset}"
The include parameter is as follows:
following_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"
Because Zhihu cannot be accessed without logging in, we must simulate a login, and a captcha may be required during login. I did not use a cloud captcha-solving service this time; instead the captcha image is downloaded locally and entered by hand.
The code is as follows:
def start_requests(self):
    """Request the login page."""
    return [scrapy.Request(url="https://www.zhihu.com/signup", callback=self.get_captcha)]

def get_captcha(self, response):
    """Fetch the captcha image."""
    post_data = {
        "email": "lq574343028@126.com",
        "password": "lq534293223",
        "captcha": "",  # leave the captcha empty for now so that Zhihu will ask for one
    }
    t = str(int(time.time() * 1000))
    # This is the key step, and it took a long while to find: Zhihu serves a fresh captcha image from this URL each time
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    return [scrapy.FormRequest(url=captcha_url, meta={"post_data": post_data},
                               callback=self.after_get_captcha)]

def after_get_captcha(self, response):
    """Save the captcha image locally and type it in by hand."""
    with open("E:/outback/zhihu/zhihu/utils/captcha.png", "wb") as f:
        f.write(response.body)
    try:
        # Open the image automatically so it can be read
        img = Image.open("E:/outback/zhihu/zhihu/utils/captcha.png")
        img.show()
    except:
        pass
    captcha = input("input captcha")
    post_data = response.meta.get("post_data", {})
    post_data["captcha"] = captcha
    post_url = "https://www.zhihu.com/login/email"
    return [scrapy.FormRequest(url=post_url, formdata=post_data, callback=self.check_login)]

def check_login(self, response):
    """Check whether the login succeeded."""
    text_json = json.loads(response.text)
    if "msg" in text_json and text_json["msg"] == "登陸成功":
        yield scrapy.Request("https://www.zhihu.com/", dont_filter=True, callback=self.start_get_info)
    else:
        # If the login failed, start over and log in again
        yield scrapy.Request(url="https://www.zhihu.com/signup", callback=self.get_captcha)

def start_get_info(self, response):
    """After a successful login we can request the user info."""
    yield scrapy.Request(url=self.user_url.format(user="liu-qian-chi-24", include=self.user_query),
                         callback=self.parse_user)
As you can see, the user info is just a JSON document, and we only need to parse it.
One thing to note: is_end indicates whether there is another page. Do not test whether the next URL can be opened, because it opens no matter what. url_token is the username that appears in the page URL; combined with the other parameters it builds the URLs for his followers and followees. The code is as follows:
def parse_user(self, response):
    user_data = json.loads(response.text)
    zhihu_item = ZhihuUserItem()
    # Copy every key that is declared on the item into it.
    # dict.get() is used so a missing key never raises an error.
    for field in zhihu_item.fields:
        if field in user_data.keys():
            zhihu_item[field] = user_data.get(field)
    yield zhihu_item
    # Use url_token to yield the followers (followed_url) request
    yield scrapy.Request(
        url=self.followed_url.format(user=user_data.get('url_token'), include=self.followed_query,
                                     limit=20, offset=0),
        callback=self.parse_followed)
    # Use url_token to yield the followees (following_url) request
    yield scrapy.Request(
        url=self.following_url.format(user=user_data.get('url_token'), include=self.following_query,
                                      limit=20, offset=0),
        callback=self.parse_following)
Next we parse the following_url and followed_url responses. The parsing is the same as for user_url, so I will not go into detail; the code is as follows:
def parse_following(self, response):
    user_data = json.loads(response.text)
    # Request the next page; is_end tells us whether this is the last one
    if "paging" in user_data.keys() and user_data.get("paging").get("is_end") is False:
        next_url = user_data.get("paging").get("next")
        yield scrapy.Request(url=next_url, callback=self.parse_following)
    if "data" in user_data.keys():
        for result in user_data.get("data"):
            url_token = result.get("url_token")
            yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                 callback=self.parse_user)

def parse_followed(self, response):
    user_data = json.loads(response.text)
    # Request the next page; is_end tells us whether this is the last one
    if "paging" in user_data.keys() and user_data.get("paging").get("is_end") is False:
        next_url = user_data.get("paging").get("next")
        yield scrapy.Request(url=next_url, callback=self.parse_followed)
    if "data" in user_data.keys():
        for result in user_data.get("data"):
            url_token = result.get("url_token")
            yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                 callback=self.parse_user)
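For reference, the follower/followee list responses have roughly this shape. This is a trimmed illustration showing only the keys the code above reads, with placeholder values, not a full real response:

example_page = {
    "paging": {
        "is_end": False,                   # True on the last page; this is the flag to rely on
        "next": "url of the next page",    # always present, which is why is_end is checked instead
    },
    "data": [
        {"url_token": "liu-qian-chi-24"},  # one entry per user; url_token feeds user_url
    ],
}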
That wraps up the main logic of the spider. Next, let's store the data in MongoDB.
import pymongo


class ZhihuUserMongoPipeline(object):
    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # self.db[self.collection_name].insert_one(dict(item))
        # To avoid duplicate documents, upsert keyed on the Zhihu user id instead of inserting blindly
        self.db[self.collection_name].replace_one({'id': item['id']}, dict(item), upsert=True)
        return item
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "HOST": "www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:58.0) Gecko/20100101 Firefox/58.0",
}
ITEM_PIPELINES = {
    # 'zhihu.pipelines.ZhihuPipeline': 300,
    'zhihu.pipelines.ZhihuUserMongoPipeline': 300,
}
MONGO_URI = "127.0.0.1:27017"
MONGO_DATABASE = "outback"
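Once the spider has been running for a while, a quick way to check what has landed in MongoDB (a small sketch against the local instance configured above):

import pymongo

client = pymongo.MongoClient("127.0.0.1:27017")
collection = client["outback"]["scrapy_items"]

print(collection.count_documents({}))  # how many users have been stored so far
print(collection.find_one({}, {"url_token": 1, "answer_count": 1, "follower_count": 1}))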
Writing the item is very simple. Since the data goes into MongoDB this time, I grab as many fields as possible. Each of the three URLs we request has an include parameter, and together they list all of Zhihu's fields; we just need to deduplicate them (a small sketch of that follows the query strings below).
following_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics" followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics" user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'
The item code is as follows:
class ZhihuUserItem(scrapy.Item):
    # Fields that appear in the followers/followees include parameter
    answer_count = scrapy.Field()
    articles_count = scrapy.Field()
    gender = scrapy.Field()
    follower_count = scrapy.Field()
    is_followed = scrapy.Field()
    is_following = scrapy.Field()
    badge = scrapy.Field()
    id = scrapy.Field()
    # The rest of the information we want
    locations = scrapy.Field()
    employments = scrapy.Field()
    educations = scrapy.Field()
    business = scrapy.Field()
    voteup_count = scrapy.Field()
    thanked_Count = scrapy.Field()
    following_count = scrapy.Field()
    cover_url = scrapy.Field()
    following_topic_count = scrapy.Field()
    following_question_count = scrapy.Field()
    following_favlists_count = scrapy.Field()
    following_columns_count = scrapy.Field()
    pins_count = scrapy.Field()
    question_count = scrapy.Field()
    commercial_question_count = scrapy.Field()
    favorite_count = scrapy.Field()
    favorited_count = scrapy.Field()
    logs_count = scrapy.Field()
    marked_answers_count = scrapy.Field()
    marked_answers_text = scrapy.Field()
    message_thread_token = scrapy.Field()
    account_status = scrapy.Field()
    is_active = scrapy.Field()
    is_force_renamed = scrapy.Field()
    is_bind_sina = scrapy.Field()
    sina_weibo_url = scrapy.Field()
    sina_weibo_name = scrapy.Field()
    show_sina_weibo = scrapy.Field()
    is_blocking = scrapy.Field()
    is_blocked = scrapy.Field()
    mutual_followees_count = scrapy.Field()
    vote_to_count = scrapy.Field()
    vote_from_count = scrapy.Field()
    thank_to_count = scrapy.Field()
    thank_from_count = scrapy.Field()
    thanked_count = scrapy.Field()
    description = scrapy.Field()
    hosted_live_count = scrapy.Field()
    participated_live_count = scrapy.Field()
    allow_message = scrapy.Field()
    industry_category = scrapy.Field()
    org_name = scrapy.Field()
    org_homepage = scrapy.Field()
Of course, to debug while writing we also need a main.py entry point, which makes it easy to set breakpoints:
from scrapy.cmdline import execute
import sys
import os

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))
execute(["scrapy", "crawl", "zhihu_user"])
That completes the project. As usual, here is the complete spider code:
# -*- coding: utf-8 -*-
import scrapy
import time
from PIL import Image
import json
from zhihu.items import ZhihuUserItem


class ZhihuUserSpider(scrapy.Spider):
    name = 'zhihu_user'
    allowed_domains = ['zhihu.com']
    start_urls = ["liu-qian-chi-24"]  # not actually used; start_requests() is overridden below
    custom_settings = {
        "COOKIES_ENABLED": True
    }
    # URL of the people he follows
    following_url = "https://www.zhihu.com/api/v4/members/{user}/followees?include={include}&limit={limit}&offset={offset}"
    # URL of the people who follow him
    followed_url = "https://www.zhihu.com/api/v4/members/{user}/followers?include={include}&limit={limit}&offset={offset}"
    # User info URL
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include={include}'
    # include parameter for the followees URL
    following_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"
    # include parameter for the followers URL
    followed_query = "data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics"
    # include parameter for the user info URL
    user_query = 'locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics'

    def start_requests(self):
        """Request the login page."""
        return [scrapy.Request(url="https://www.zhihu.com/signup", callback=self.get_captcha)]

    def get_captcha(self, response):
        """Fetch the captcha image."""
        post_data = {
            "email": "lq573320328@126.com",
            "password": "lq132435",
            "captcha": "",  # leave the captcha empty for now so that Zhihu will ask for one
        }
        t = str(int(time.time() * 1000))
        # The key step: Zhihu serves a fresh captcha image from this URL each time
        captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
        return [scrapy.FormRequest(url=captcha_url, meta={"post_data": post_data},
                                   callback=self.after_get_captcha)]

    def after_get_captcha(self, response):
        """Save the captcha image locally and type it in by hand."""
        with open("E:/outback/zhihu/zhihu/utils/captcha.png", "wb") as f:
            f.write(response.body)
        try:
            # Open the image automatically so it can be read
            img = Image.open("E:/outback/zhihu/zhihu/utils/captcha.png")
            img.show()
        except:
            pass
        captcha = input("input captcha")
        post_data = response.meta.get("post_data", {})
        post_data["captcha"] = captcha
        post_url = "https://www.zhihu.com/login/email"
        return [scrapy.FormRequest(url=post_url, formdata=post_data, callback=self.check_login)]

    def check_login(self, response):
        """Check whether the login succeeded."""
        text_json = json.loads(response.text)
        if "msg" in text_json and text_json["msg"] == "登陸成功":
            yield scrapy.Request("https://www.zhihu.com/", dont_filter=True, callback=self.start_get_info)
        else:
            # If the login failed, start over and log in again
            yield scrapy.Request(url="https://www.zhihu.com/signup", callback=self.get_captcha)

    def start_get_info(self, response):
        """After a successful login, request the first user's info."""
        yield scrapy.Request(url=self.user_url.format(user="liu-qian-chi-24", include=self.user_query),
                             callback=self.parse_user)

    def parse_user(self, response):
        user_data = json.loads(response.text)
        zhihu_item = ZhihuUserItem()
        # Copy every key that is declared on the item; dict.get() never raises on a missing key
        for field in zhihu_item.fields:
            if field in user_data.keys():
                zhihu_item[field] = user_data.get(field)
        yield zhihu_item
        # Use url_token to yield the followers (followed_url) request
        yield scrapy.Request(
            url=self.followed_url.format(user=user_data.get('url_token'), include=self.followed_query,
                                         limit=20, offset=0),
            callback=self.parse_followed)
        # Use url_token to yield the followees (following_url) request
        yield scrapy.Request(
            url=self.following_url.format(user=user_data.get('url_token'), include=self.following_query,
                                          limit=20, offset=0),
            callback=self.parse_following)

    def parse_following(self, response):
        user_data = json.loads(response.text)
        # Request the next page; is_end tells us whether this is the last one
        if "paging" in user_data.keys() and user_data.get("paging").get("is_end") is False:
            next_url = user_data.get("paging").get("next")
            yield scrapy.Request(url=next_url, callback=self.parse_following)
        if "data" in user_data.keys():
            for result in user_data.get("data"):
                url_token = result.get("url_token")
                yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                     callback=self.parse_user)

    def parse_followed(self, response):
        user_data = json.loads(response.text)
        # Request the next page; is_end tells us whether this is the last one
        if "paging" in user_data.keys() and user_data.get("paging").get("is_end") is False:
            next_url = user_data.get("paging").get("next")
            yield scrapy.Request(url=next_url, callback=self.parse_followed)
        if "data" in user_data.keys():
            for result in user_data.get("data"):
                url_token = result.get("url_token")
                yield scrapy.Request(url=self.user_url.format(user=url_token, include=self.user_query),
                                     callback=self.parse_user)
The project structure is as follows:
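A likely layout, inferred from the modules and paths used above (names may differ slightly in the repository):

zhihu/                         # project root (E:/outback/zhihu)
├── main.py                    # debug entry point shown above
├── scrapy.cfg
└── zhihu/
    ├── items.py               # ZhihuUserItem
    ├── pipelines.py           # ZhihuUserMongoPipeline
    ├── settings.py
    ├── utils/
    │   └── captcha.png        # captcha image saved during login
    └── spiders/
        └── zhihu_user.py      # ZhihuUserSpider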
The project still has a few shortcomings:
1. The code that saves to MongoDB should live on the Item class, with the pipeline simply calling that interface. That way, if the project has many spiders, they can all share the same pipeline (see the sketch after this list).
2. Not every user has a value for every field; fields without values should be filtered out before the data is saved.
3. There is no exception-handling mechanism. I hit no exceptions while running this spider, so I never added one.
4. It is not distributed.
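A rough sketch of points 1 and 2, under hypothetical names (MongoSavableItem and GenericMongoPipeline are not part of the current project): each item exposes its own collection name and a cleaned document, and one generic pipeline serves every spider.

import pymongo
import scrapy


class MongoSavableItem(scrapy.Item):
    """Hypothetical base class: the item knows how to present itself for MongoDB."""

    def get_collection_name(self):
        return self.__class__.__name__.lower()

    def to_mongo_doc(self):
        # Point 2: drop fields that have no value before saving
        return {k: v for k, v in dict(self).items() if v not in (None, "", [], {})}


class GenericMongoPipeline(object):
    """Point 1: one pipeline shared by every spider; it only talks to the item's interface."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.get('MONGO_URI'), crawler.settings.get('MONGO_DATABASE', 'items'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        doc = item.to_mongo_doc()
        self.db[item.get_collection_name()].replace_one({'id': doc.get('id')}, doc, upsert=True)
        return item

ZhihuUserItem would then inherit from MongoSavableItem instead of scrapy.Item, and ITEM_PIPELINES would point at GenericMongoPipeline.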
GitHub: https://github.com/573320328/zhihu.git (remember to give it a star).