Singles' Day special: how a Python programmer crawls Zhihu users to find a girlfriend

Date: 2019-11-16

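Preface: this article covers how the Scrapy framework works and how to use it. I recommend reaching for a framework only after you understand how a Python crawler works on its own (don't ask me why, or I'll cry).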

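Double Eleven is almost here. While the whole country chants "buy, buy, buy", the howls of single dogs grow ever more desperate. As a Python programmer, how do you find a girl, dodge the critical hit, and win by wits? Hence the following conversation: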

So~ today's goal is to crawl the community for girls~ and we get to use a new trick (not really)~ the Scrapy crawling framework~

1. How Scrapy works

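After writing a few crawlers, we know the rough steps for collecting data with one: request the page, receive the page, match the information, download the data, clean the data, and store it in a database.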

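Scrapy is a well-known crawling framework that makes scraping web pages very convenient. So how does Scrapy actually work? I read quite a few beginner Scrapy tutorials online, and most of them include this diagram.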

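_(:зゝ∠)_ I don't know whether the diagram is simply that classic or programmers are just too lazy to draw their own, but the first time I saw it, this was Mijiang's mood: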

After some in-depth study I roughly understood what the diagram means. Let me give an example (yes, another one of my odd examples):

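When we want something to eat, we go out, walk down the street, find a restaurant we like, and order; the waiter passes the order to the kitchen, and finally the food arrives at the table or is packed to go. That is what a hand-written crawler does: it has to spell out every operation needed to get the data.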

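Scrapy, by contrast, is like a food-delivery app. In the order list (spiders) you pick the dishes (items) you want from the target restaurant, and at checkout (pipeline) you fill in the delivery address (storage method). The ordering system (Scrapy engine) asks the kitchen (downloader) of the restaurant (Internet) to cook the food. Since many pick-up orders (requests) are generated, the system's dispatcher (scheduler) assigns couriers to pick up from the kitchen (request) and deliver (response). All this talk is making me hungry...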

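What does that mean? When using Scrapy we only need to set up the spiders (what to crawl) and the pipeline (data cleaning and storage), plus the middlewares, which hold settings for how the components hand off to each other. We don't have to worry about the rest; the Scrapy module takes care of it.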
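To make the division of labour concrete, here is a toy sketch of that loop in plain Python. It is not Scrapy's actual implementation: `fake_download`, `spider_parse`, `run_engine`, and the canned "pages" are all made up for illustration, and the "pipeline" is just a list.

```python
from collections import deque

def fake_download(url):
    # Stand-in for the downloader: returns canned "pages" instead of hitting the network.
    pages = {
        "menu": {"dishes": ["noodles", "dumplings"], "links": ["specials"]},
        "specials": {"dishes": ["hotpot"], "links": []},
    }
    return pages[url]

def spider_parse(response):
    # Spider callback: yields items (dicts) and follow-up requests (URL strings).
    for dish in response["dishes"]:
        yield {"dish": dish}
    for link in response["links"]:
        yield link

def run_engine(start_url):
    scheduler = deque([start_url])  # pending requests
    stored = []                     # what the "pipeline" has saved so far
    while scheduler:
        response = fake_download(scheduler.popleft())
        for result in spider_parse(response):
            if isinstance(result, dict):
                stored.append(result)     # item goes to the pipeline
            else:
                scheduler.append(result)  # request goes back to the scheduler
    return stored

orders = run_engine("menu")
```

The only parts that describe *what* to crawl are `spider_parse` and the storage step; everything in `run_engine` is the plumbing a framework would normally own.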

2. Creating a Scrapy project

After installing Scrapy, create a new project:

$ scrapy startproject zhihuxjj

I use the PyCharm IDE. Create zhihuxjj.py under the spiders directory.

This zhihuxjj.py file is where we write our crawling rules.

3. Defining the crawl rules (spider)

With the project created, let's look at the restaurant and dishes we want... er, the site and data we want to crawl.

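I chose Zhihu as the platform to crawl. Zhihu has no sequential user ids running from 1 to n; instead each user can set a personal homepage id, which is unique. So I pick one seed user and crawl the accounts they follow. You could crawl both followees and followers, but since followers include a lot of empty accounts, I crawl only the followee list, then use each followee's homepage to crawl that followee's followees, and so on recursively.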
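The recursive idea can be sketched without any network access. The snippet below walks a made-up followee graph breadth-first from a seed user, remembering who has already been visited so that mutual follows do not loop forever. `FOLLOWEES`, `crawl`, and `max_depth` are illustrative names, not part of the article's spider; in Scrapy itself, duplicate requests are filtered by default, so the spider does not need its own `seen` set.

```python
from collections import deque

# Hypothetical followee graph: user id -> ids that user follows.
FOLLOWEES = {
    "seed": ["alice", "bob"],
    "alice": ["bob", "carol"],
    "bob": ["seed"],
    "carol": [],
}

def crawl(seed, max_depth):
    seen = {seed}               # users already queued, to avoid revisiting
    queue = deque([(seed, 0)])  # (user, depth from the seed)
    order = []
    while queue:
        user, depth = queue.popleft()
        order.append(user)
        if depth == max_depth:
            continue            # do not expand users at the depth limit
        for followee in FOLLOWEES.get(user, []):
            if followee not in seen:
                seen.add(followee)
                queue.append((followee, depth + 1))
    return order

visited = crawl("seed", max_depth=2)
shallow = crawl("seed", max_depth=1)
```

A depth limit is worth having in practice: followee graphs grow very quickly a few hops away from a well-connected seed.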

The program is designed like this.

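start_urls is one of Scrapy's signature values: it sets where the crawler starts. By that logic we should start from the seed user's homepage, but since homepage links get reused, I set the start URL to the Zhihu homepage instead. (Note that the spider below overrides start_requests, so its first requests actually go to the seed user's API URLs.)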

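Next is the seed user's homepage. Zhihu has plenty of big-name accounts with many followers, but people who follow a lot of accounts are harder to find. I chose Huang Jixin (黃繼新), a Zhihu co-founder, who surely follows plenty of quality users (≖‿≖)✧.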

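Looking at a homepage, its URL is 'www.zhihu.com/people/' + user id. The information we want is fetched through callback functions (knock on the blackboard!! key point!!). Two callbacks are designed here: one for a user's followee list and one for a followee's profile information.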

Inspecting the page shown above in Chrome reveals the URL for fetching the followee list, as well as the followees' user ids.

Hover the mouse over a user name.

This gives the URL for an individual user's information. Analyzing the URLs shows:

Followee list URL: 'https://www.zhihu.com/api/v4/members/' + 'user id' + '/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset=0&limit=20'
User profile URL: 'https://www.zhihu.com/api/v4/members/' + 'user id' + '?include=allow_message%2Cis_followed%2Cis_following%2Cis_org%2Cis_blocking%2Cemployments%2Canswer_count%2Cfollower_count%2Carticles_count%2Cgender%2Cbadge%5B%3F(type%3Dbest_answerer)%5D.topics'
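As a small sketch of how these two URL patterns can be assembled, the helpers below take the user id, paging offset, and `include` list as parameters. The function names and the shortened `include` defaults are my own assumptions for readability; the real requests use the much longer field lists shown above, and `example-user` is a placeholder id.

```python
API_ROOT = "https://www.zhihu.com/api/v4/members/"

def followees_url(user_id, offset=0, limit=20, include="data[*].gender,follower_count"):
    # One page of the accounts that user_id follows; offset moves through the pages.
    return f"{API_ROOT}{user_id}/followees?include={include}&offset={offset}&limit={limit}"

def profile_url(user_id, include="gender,locations"):
    # Profile details for a single user.
    return f"{API_ROOT}{user_id}?include={include}"

first_page = followees_url("example-user")
second_page = followees_url("example-user", offset=20)
profile = profile_url("example-user")
```

With a page size of 20, the second page starts at `offset=20`, the third at `offset=40`, and so on.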

So, write the following code into the zhihuxjj.py file created in the previous section.

import json
from zhihuxjj.items import ZhihuxjjItem
from scrapy import Spider,Request

class ZhihuxjjSpider(Spider):
    name='zhihuxjj' # name Scrapy uses to tell this spider apart from others; must be unique
    allowed_domains = ["www.zhihu.com"] # crawl scope
    start_urls = ["https://www.zhihu.com/"]
    start_user = "jixin"
    followees_url = 'https://www.zhihu.com/api/v4/members/{user}/followees?include=data[*].answer_count,articles_count,gender,follower_count,is_followed,is_following,badge[?(type=best_answerer)].topics&offset={offset}&limit=20' # followee list URL
    user_url = 'https://www.zhihu.com/api/v4/members/{user}?include=locations,employments,gender,educations,business,voteup_count,thanked_Count,follower_count,following_count,cover_url,following_topic_count,following_question_count,following_favlists_count,following_columns_count,avatar_hue,answer_count,articles_count,pins_count,question_count,commercial_question_count,favorite_count,favorited_count,logs_count,marked_answers_count,marked_answers_text,message_thread_token,account_status,is_active,is_force_renamed,is_bind_sina,sina_weibo_url,sina_weibo_name,show_sina_weibo,is_blocking,is_blocked,is_following,is_followed,mutual_followees_count,vote_to_count,vote_from_count,thank_to_count,thank_from_count,thanked_count,description,hosted_live_count,participated_live_count,allow_message,industry_category,org_name,org_homepage,badge[?(type=best_answerer)].topics' # user profile URL
    def start_requests(self):
        yield Request(self.followees_url.format(user=self.start_user,offset=0),callback=self.parse_fo) # request the seed user's followee list
        yield Request(self.user_url.format(user=self.start_user),callback=self.parse_user) # request the seed user's profile (user_url has no {include} field, so only user is passed)

    def parse_user(self, response):
        result = json.loads(response.text)
        print(result)
        item = ZhihuxjjItem()
        item['user_name'] = result['name']
        item['sex'] = result['gender']  # gender: 1 = male, 0 = female, -1 = not set
        item['user_sign'] = result['headline']
        item['user_avatar'] = result['avatar_url_template'].format(size='xl')
        item['user_url'] = 'https://www.zhihu.com/people/' + result['url_token']
        if len(result['locations']):
            item['user_add'] = result['locations'][0]['name']
        else:
            item['user_add'] = ''
        yield item

    def parse_fo(self, response):
        results = json.loads(response.text)
        for result in results['data']:
            yield Request(self.user_url.format(user=result['url_token']),callback=self.parse_user)
            yield Request(self.followees_url.format(user=result['url_token'], offset=0),callback=self.parse_fo)  # walk each followee's own followees; crawl depth += 1
        if not results['paging']['is_end']:  # more pages left in this followee list?
            # upgrade only a leading http:// so an https link is not turned into httpss
            next_url = results['paging']['next'].replace('http://','https://',1)
            yield Request(next_url,callback=self.parse_fo)

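The key points here are the use of yield and assignments such as item['user_name']: assigning the crawled result to the item tells the system that this is the dish we ordered... ahem... the target data we want to crawl.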
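Since yield does most of the work in parse_fo, here is a stripped-down sketch of the same pattern without Scrapy. A generator hands back one result at a time: first every followee on the page, then the link to the next page if there is one. The function name `parse_followees`, the `("user", ...)` / `("next", ...)` tuples, and the sample JSON are assumptions made for the example; the JSON only mimics the `data` and `paging` keys the spider reads.

```python
import json

def parse_followees(body):
    """Yield ('user', token) per followee, then ('next', url) if more pages exist."""
    page = json.loads(body)
    for entry in page["data"]:
        yield ("user", entry["url_token"])
    if not page["paging"]["is_end"]:
        # Rewrite only a leading http:// so an already-https link is left alone.
        yield ("next", page["paging"]["next"].replace("http://", "https://", 1))

sample = json.dumps({
    "data": [{"url_token": "alice"}, {"url_token": "bob"}],
    "paging": {
        "is_end": False,
        "next": "http://www.zhihu.com/api/v4/members/seed/followees?offset=20&limit=20",
    },
})

events = list(parse_followees(sample))
```

Nothing inside `parse_followees` runs until something iterates over it, which is what lets the engine pull results one by one and schedule each new request as it appears.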

4. Other settings

In items.py, add the fields matching the item data set in the spider.

import scrapy
class ZhihuxjjItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    user_name = scrapy.Field()
    sex  = scrapy.Field()
    user_sign = scrapy.Field()
    user_url = scrapy.Field()
    user_avatar = scrapy.Field()
    user_add = scrapy.Field()
    pass

In pipelines.py, add the code that writes to the database (how to use the database was covered in the previous article~).

import pymysql

def dbHandle():
    conn = pymysql.connect(
        host='localhost',
        user='root',
        passwd='your database password',
        charset='utf8',
        use_unicode=False
    )
    return conn

class ZhihuxjjPipeline(object):
    def process_item(self, item, spider):
        dbObject = dbHandle()  # open a connection for this item
        cursor = dbObject.cursor()
        sql = "insert into xiaojiejie.zhihu(user_name,sex,user_sign,user_avatar,user_url,user_add) values(%s,%s,%s,%s,%s,%s)"
        param = (item['user_name'],item['sex'],item['user_sign'],item['user_avatar'],item['user_url'],item['user_add'])
        try:
            cursor.execute(sql, param)
            dbObject.commit()
        except Exception as e:
            print(e)
            dbObject.rollback()
        finally:
            cursor.close()
            dbObject.close()  # release the connection so they do not pile up
        return item
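Opening a connection for every item works but gets expensive on a long crawl. A common alternative is to open the connection once when the spider starts and close it when the spider finishes, using the `open_spider` and `close_spider` hooks that Scrapy calls on a pipeline. The sketch below shows that shape. To keep it runnable without a MySQL server it swaps in the standard-library `sqlite3` with an in-memory database, which is my substitution and not what the article uses; the class name and the three-column table are also made up.

```python
import sqlite3

class SqlitePipelineSketch:
    """Same three hooks Scrapy calls on a pipeline, backed by sqlite3 for illustration."""

    def open_spider(self, spider):
        # Called once when the spider starts: open one shared connection.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE zhihu (user_name TEXT, sex INTEGER, user_url TEXT)")

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item: reuse the shared connection.
        self.conn.execute(
            "INSERT INTO zhihu (user_name, sex, user_url) VALUES (?, ?, ?)",
            (item["user_name"], item["sex"], item["user_url"]),
        )
        self.conn.commit()
        return item

# Drive the hooks by hand, the way the engine would during a crawl.
pipeline = SqlitePipelineSketch()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"user_name": "demo", "sex": 0, "user_url": "https://www.zhihu.com/people/demo"},
    spider=None,
)
row_count = pipeline.conn.execute("SELECT COUNT(*) FROM zhihu").fetchone()[0]
pipeline.close_spider(spider=None)
```

The same structure should carry over to pymysql by replacing the connect call and using `%s` placeholders instead of `?`.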

Because we use pipelines.py, we also need to uncomment ITEM_PIPELINES in settings.py; this is what connects the two files.
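For reference, the uncommented setting would look roughly like this. The dotted path assumes the default layout that `scrapy startproject zhihuxjj` generates and the `ZhihuxjjPipeline` class name used above; the number is a priority between 0 and 1000, and pipelines with lower numbers run first.

```python
# settings.py (fragment)
ITEM_PIPELINES = {
    "zhihuxjj.pipelines.ZhihuxjjPipeline": 300,  # lower numbers run first
}
```

With only one pipeline the exact number does not matter; it starts to matter once a cleaning step has to run before the storage step.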

At this point basically everything is set and the program can pretty much run. However, Scrapy follows the robots.txt rules, so let's take a look at Zhihu's rules: https://www.zhihu.com/robots.txt

Emmmmmmm, finished reading the rules? Good. Now, in settings.py, change ROBOTSTXT_OBEY to False. (runs away

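It seems... we forgot something. Right, the headers. The general way to set headers is also in settings.py: uncomment DEFAULT_REQUEST_HEADERS and set it to mimic a browser. Zhihu requires a simulated login; if you access it as a guest, you need to add authorization. As for how to get this authorization value, I. Am. Not. Telling. You. (runs away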

DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
    'authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
}

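To reduce load on the server & avoid getting banned, uncomment DOWNLOAD_DELAY. This sets the download delay; I set it to 3 (the robots rules ask for 10, but 10 is just too slow _(:зゝ∠)_ the Zhihu programmers can't see this sentence, can't see this sentence...)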
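If you would rather read the requested delay programmatically than by eye, the standard library's `urllib.robotparser` can do it. The sketch below parses a small made-up robots.txt; `SAMPLE_ROBOTS` is not Zhihu's real file, and only the delay of 10 is taken from the text above. The `/api/` rule and `example-user` are purely illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 10
Disallow: /api/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

delay = parser.crawl_delay("*")  # seconds the site asks crawlers to wait between requests
api_allowed = parser.can_fetch("*", "https://www.zhihu.com/api/v4/members/example-user")
profile_allowed = parser.can_fetch("*", "https://www.zhihu.com/people/example-user")
```

Against a live site you would call `parser.set_url(...)` followed by `parser.read()` instead of `parse`, and could then feed the returned delay into DOWNLOAD_DELAY.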