A whole-site crawler is sometimes actually the easier kind to build: the crawl rules are simple to set up, and the main thing you have to get right is the anti-bot handling. Today we crawl Zhihu, and we keep using Scrapy.
Of course, for a small requirement like this, Scrapy is overkill, but at this stage of the blog series we need to keep working with Scrapy as a bridge to later posts, so it only took me a little while to write.
The first step is to pick a seed page to act as the crawler's entry point:
https://www.zhihu.com/people/zhang-jia-wei/following
The information we need is what is boxed in the profile-page screenshot: the user's nickname, gender, the answer/question/article/column/pin counts, and the follower numbers.
If you fetch the page with the code below, you will find that the response is HTML with JSON spliced into it, which adds quite a bit of parsing cost.
    import scrapy

    class ZhihuSpider(scrapy.Spider):
        name = 'Zhihu'
        allowed_domains = ['www.zhihu.com']
        start_urls = ['https://www.zhihu.com/people/zhang-jia-wei/following']

        def parse(self, response):
            # Dump the raw response to see the HTML + embedded JSON mix
            all_data = response.body_as_unicode()
            print(all_data)
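To see where that parsing cost comes from, here is a minimal standalone sketch (outside of Scrapy, using requests purely for illustration) that pulls the embedded js-initialData JSON out of the raw HTML. The regex and the initialState → entities → users path are the same ones the full parse function further down relies on, and they assume Zhihu still ships its page state in that script tag.

    import re
    import json

    import requests  # only for this standalone illustration; the spider itself goes through Scrapy

    url = "https://www.zhihu.com/people/zhang-jia-wei/following"
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

    # Assumption: the profile page embeds its state in <script id="js-initialData" type="text/json">
    match = re.search(r'<script id="js-initialData" type="text/json">(.*?)</script>', html)
    if match:
        data = json.loads(match.group(1))
        # The tokens of the users being followed sit under initialState -> entities -> users
        for token in data["initialState"]["entities"]["users"]:
            print(token)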
First, set up the basic configuration: the delay between requests, the User-Agent used for crawling, whether cookies are stored, and the random User-Agent middleware enabled through DOWNLOADER_MIDDLEWARES.

middlewares.py:
    from zhihu.settings import USER_AGENT_LIST  # the UA pool defined in settings.py
    import random

    class RandomUserAgentMiddleware(object):
        def process_request(self, request, spider):
            # Pick a random User-Agent for every outgoing request
            rand_use = random.choice(USER_AGENT_LIST)
            if rand_use:
                request.headers.setdefault('User-Agent', rand_use)
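Because the middleware uses setdefault, it only fills in the User-Agent when the request does not already carry one, so a header set explicitly on an individual Request is left alone. Registering it at priority 400 in DOWNLOADER_MIDDLEWARES (see the settings below) also makes it run before Scrapy's built-in UserAgentMiddleware, so the random value is the one that sticks.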
settings.py:
    BOT_NAME = 'zhihu'

    SPIDER_MODULES = ['zhihu.spiders']
    NEWSPIDER_MODULE = 'zhihu.spiders'

    USER_AGENT_LIST = [
        # You can list several UAs; only one is given here for testing
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
    ]

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 2

    # Disable cookies (enabled by default)
    COOKIES_ENABLED = False

    # Override the default request headers:
    DEFAULT_REQUEST_HEADERS = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en',
    }

    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    DOWNLOADER_MIDDLEWARES = {
        'zhihu.middlewares.RandomUserAgentMiddleware': 400,
    }

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'zhihu.pipelines.ZhihuPipeline': 300,
    }
The main crawl function, with a couple of notes: extract_first() returns the first element of the list matched by an XPath expression, and dont_filter=False leaves Scrapy's built-in URL deduplication switched on.

    import re
    import json

    import scrapy
    from scrapy import Selector
    from zhihu.items import ZhihuItem  # standard Scrapy project layout assumed

    # The two methods below belong to the ZhihuSpider class shown earlier.
    # For url.format(...) to make sense, start_urls should be a template such as
    # 'https://www.zhihu.com/people/{}/following'.

    # Starting point
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url.format("zhang-jia-wei"), callback=self.parse)

    def parse(self, response):
        print("Fetching info from {}".format(response.url))
        all_data = response.body_as_unicode()
        select = Selector(response)

        # Fields that every Zhihu user has
        username = select.xpath("//span[@class='ProfileHeader-name']/text()").extract_first()  # nickname
        # The svg class of the gender icon is used to guess the gender
        sex = select.xpath("//div[@class='ProfileHeader-iconWrapper']/svg/@class").extract()
        if len(sex) > 0:
            sex = 1 if str(sex[0]).find("male") else 0
        else:
            sex = -1
        answers = select.xpath("//li[@aria-controls='Profile-answers']/a/span/text()").extract_first()
        asks = select.xpath("//li[@aria-controls='Profile-asks']/a/span/text()").extract_first()
        posts = select.xpath("//li[@aria-controls='Profile-posts']/a/span/text()").extract_first()
        columns = select.xpath("//li[@aria-controls='Profile-columns']/a/span/text()").extract_first()
        pins = select.xpath("//li[@aria-controls='Profile-pins']/a/span/text()").extract_first()
        # The user may have privacy settings; these values only show up after logging in
        # (or when a cookie is carried along)!
        follwers = select.xpath("//strong[@class='NumberBoard-itemValue']/@title").extract()

        item = ZhihuItem()
        item["username"] = username
        item["sex"] = sex
        item["answers"] = answers
        item["asks"] = asks
        item["posts"] = posts
        item["columns"] = columns
        item["pins"] = pins
        item["follwering"] = follwers[0] if len(follwers) > 0 else 0
        item["follwers"] = follwers[1] if len(follwers) > 0 else 0
        yield item

        # Grab the first page of the followee list from the embedded JSON
        pattern = re.compile(r'<script id="js-initialData" type="text/json">(.*?)</script>')
        matched = pattern.search(all_data)
        if matched:
            users = json.loads(matched.group(1))["initialState"]["entities"]["users"]
            for user in users:
                yield scrapy.Request(self.start_urls[0].format(user),
                                     callback=self.parse, dont_filter=False)
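The parse function fills a ZhihuItem that this post never shows; a minimal items.py sketch, with the field set simply inferred from the assignments above, would look like this:

    import scrapy

    class ZhihuItem(scrapy.Item):
        # Field names inferred from the assignments in parse()
        username = scrapy.Field()
        sex = scrapy.Field()
        answers = scrapy.Field()
        asks = scrapy.Field()
        posts = scrapy.Field()
        columns = scrapy.Field()
        pins = scrapy.Field()
        follwering = scrapy.Field()  # spelling kept to match the spider code
        follwers = scrapy.Field()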
While extracting the data I skipped over part of it; those fields can be picked out with regular expressions as well.
For storage, the data once again goes into MongoDB.
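The ITEM_PIPELINES setting above points at zhihu.pipelines.ZhihuPipeline, whose body is not shown in this post. A minimal sketch of such a pipeline, assuming pymongo and a local MongoDB instance (the database and collection names here are made up for illustration), could look like this:

    import pymongo

    class ZhihuPipeline(object):
        def open_spider(self, spider):
            # Assumes a local MongoDB on the default port; adjust the URI as needed
            self.client = pymongo.MongoClient("mongodb://localhost:27017")
            self.collection = self.client["zhihu"]["users"]  # hypothetical db / collection names

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Each yielded ZhihuItem becomes one MongoDB document
            self.collection.insert_one(dict(item))
            return item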