A whole-site crawler is sometimes actually the easier kind to build: the crawl rules are simple to set up, and the main thing you have to get right is the anti-bot handling. Today we crawl Zhihu, and we keep using Scrapy.
Of course, for a small requirement like this, Scrapy is overkill, but at this stage of the blog series we need to keep working with Scrapy as a bridge to later posts, so it only took me a little while to write.
The first step is to pick a seed page to act as the crawler's entry point:
https://www.zhihu.com/people/zhang-jia-wei/following
The information we need is what is boxed in the profile-page screenshot: the user's nickname, gender, the answer/question/article/column/pin counts, and the follower numbers.
If you fetch the page with the code below, you will find that the response is HTML with JSON spliced into it, which adds quite a bit of parsing cost.
    import scrapy

    class ZhihuSpider(scrapy.Spider):
        name = 'Zhihu'
        allowed_domains = ['www.zhihu.com']
        start_urls = ['https://www.zhihu.com/people/zhang-jia-wei/following']

        def parse(self, response):
            # Dump the raw response to see the HTML + embedded JSON mix
            all_data = response.body_as_unicode()
            print(all_data)
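To see where that parsing cost comes from, here is a minimal standalone sketch (outside of Scrapy, using requests purely for illustration) that pulls the embedded js-initialData JSON out of the raw HTML. The regex and the initialState → entities → users path are the same ones the full parse function further down relies on, and they assume Zhihu still ships its page state in that script tag.

    import re
    import json

    import requests  # only for this standalone illustration; the spider itself goes through Scrapy

    url = "https://www.zhihu.com/people/zhang-jia-wei/following"
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

    # Assumption: the profile page embeds its state in <script id="js-initialData" type="text/json">
    match = re.search(r'<script id="js-initialData" type="text/json">(.*?)</script>', html)
    if match:
        data = json.loads(match.group(1))
        # The tokens of the users being followed sit under initialState -> entities -> users
        for token in data["initialState"]["entities"]["users"]:
            print(token)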
First, set up the basic configuration: the delay between requests, the User-Agent used for crawling, whether cookies are stored, and the random User-Agent middleware enabled through DOWNLOADER_MIDDLEWARES.

middlewares.py:
    from zhihu.settings import USER_AGENT_LIST  # the UA pool defined in settings.py
    import random

    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request
            rand_use = random.choice(USER_AGENT_LIST)
            if rand_use:
                request.headers.setdefault('User-Agent', rand_use)
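Because the middleware uses setdefault, it only fills in the User-Agent when the request does not already carry one, so a header set explicitly on an individual Request is left alone. Registering it at priority 400 in DOWNLOADER_MIDDLEWARES (see the settings below) also makes it run before Scrapy's built-in UserAgentMiddleware, so the random value is the one that sticks.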
settings.py:
    BOT_NAME = 'zhihu'

    SPIDER_MODULES = ['zhihu.spiders']
    NEWSPIDER_MODULE = 'zhihu.spiders'

    USER_AGENT_LIST = [
        # You can list several UAs; only one is given here for testing
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    ]

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 2

    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }

    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'zhihu.middlewares.RandomUserAgentMiddleware': 400,
    }

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'zhihu.pipelines.ZhihuPipeline': 300,
    }
The main crawl function, with a couple of notes: extract_first() returns the first element of the list matched by an XPath expression, and dont_filter=False leaves Scrapy's built-in URL deduplication switched on.

    import re
    import json

    import scrapy
    from scrapy import Selector
    from zhihu.items import ZhihuItem  # standard Scrapy project layout assumed

    # The two methods below belong to the ZhihuSpider class shown earlier.
    # For url.format(...) to make sense, start_urls should be a template such as
    # 'https://www.zhihu.com/people/{}/following'.

    # Starting point
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url.format("zhang-jia-wei"), callback=self.parse)

    def parse(self, response):
        print("Fetching info from {}".format(response.url))
        all_data = response.body_as_unicode()
        select = Selector(response)

        # Fields that every Zhihu user has
        username = select.xpath("//span[@class='ProfileHeader-name']/text()").extract_first()  # nickname
        # The svg class of the gender icon is used to guess the gender
        sex = select.xpath("//div[@class='ProfileHeader-iconWrapper']/svg/@class").extract()
        if len(sex) > 0:
            sex = 1 if str(sex[0]).find("male") else 0
        else:
            sex = -1
        answers = select.xpath("//li[@aria-controls='Profile-answers']/a/span/text()").extract_first()
        asks = select.xpath("//li[@aria-controls='Profile-asks']/a/span/text()").extract_first()
        posts = select.xpath("//li[@aria-controls='Profile-posts']/a/span/text()").extract_first()
        columns = select.xpath("//li[@aria-controls='Profile-columns']/a/span/text()").extract_first()
        pins = select.xpath("//li[@aria-controls='Profile-pins']/a/span/text()").extract_first()
        # The user may have privacy settings; these values only show up after logging in
        # (or when a cookie is carried along)!
        follwers = select.xpath("//strong[@class='NumberBoard-itemValue']/@title").extract()

        item = ZhihuItem()
        item["username"] = username
        item["sex"] = sex
        item["answers"] = answers
        item["asks"] = asks
        item["posts"] = posts
        item["columns"] = columns
        item["pins"] = pins
        item["follwering"] = follwers[0] if len(follwers) > 0 else 0
        item["follwers"] = follwers[1] if len(follwers) > 0 else 0
        yield item

        # Grab the first page of the followee list from the embedded JSON
        pattern = re.compile(r'<script id="js-initialData" type="text/json">(.*?)</script>')
        matched = pattern.search(all_data)
        if matched:
            users = json.loads(matched.group(1))["initialState"]["entities"]["users"]
            for user in users:
                yield scrapy.Request(self.start_urls[0].format(user),
                                     callback=self.parse, dont_filter=False)
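The parse function fills a ZhihuItem that this post never shows; a minimal items.py sketch, with the field set simply inferred from the assignments above, would look like this:

    import scrapy

    class ZhihuItem(scrapy.Item):
        # Field names inferred from the assignments in parse()
        username = scrapy.Field()
        sex = scrapy.Field()
        answers = scrapy.Field()
        asks = scrapy.Field()
        posts = scrapy.Field()
        columns = scrapy.Field()
        pins = scrapy.Field()
        follwering = scrapy.Field()  # spelling kept to match the spider code
        follwers = scrapy.Field()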
While extracting the data I skipped over part of it; those fields can be picked out with regular expressions as well.
For storage, the data once again goes into MongoDB.
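The ITEM_PIPELINES setting above points at zhihu.pipelines.ZhihuPipeline, whose body is not shown in this post. A minimal sketch of such a pipeline, assuming pymongo and a local MongoDB instance (the database and collection names here are made up for illustration), could look like this:

    import pymongo

    class ZhihuPipeline(object):
        def open_spider(self, spider):
            # Assumes a local MongoDB on the default port; adjust the URI as needed
            self.client = pymongo.MongoClient("mongodb://localhost:27017")
            self.collection = self.client["zhihu"]["users"]  # hypothetical db / collection names

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Each yielded ZhihuItem becomes one MongoDB document
            self.collection.insert_one(dict(item))
            return item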