The site useragentstring.com collects almost every User-Agent out there. Having just learned Scrapy, I decided to practice on it by scraping all the user agents listed there.
This post only scrapes the common browsers: Firefox, Chrome, Opera, Safari, and Internet Explorer.
$ scrapy startproject useragent
$ cd useragent
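At this point startproject has generated the standard Scrapy skeleton; roughly the following (the exact file list varies a little between Scrapy versions):

useragent/
    scrapy.cfg
    useragent/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py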
The next step is not strictly necessary, but having it done makes things easier:
$ scrapy genspider ua useragentstring.com
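genspider just drops a spider template into useragent\spiders\ua.py, roughly like this (the template text also differs between Scrapy versions):

# useragent\spiders\ua.py (generated skeleton, approximate)
import scrapy


class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = ['http://useragentstring.com/']

    def parse(self, response):
        pass

With the skeleton in place, first define the item fields in items.py: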
# useragent\items.py
import scrapy


class UseragentItem(scrapy.Item):
    # define the fields for your item here like:
    ua_name = scrapy.Field()
    ua_string = scrapy.Field()
# useragent\spiders\ua.py
import scrapy
from useragent.items import UseragentItem


class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )

    def parse(self, response):
        ua_name = response.url.split('=')[-1]
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
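One caveat with response.url.split('=')[-1]: for the Internet Explorer page it yields the raw query value "Internet+Explorer". If you want the decoded name instead, a small sketch using Python 3's standard library (not part of the original spider):

from urllib.parse import urlparse, parse_qs

url = 'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer'
ua_name = parse_qs(urlparse(url).query)['name'][0]
print(ua_name)  # "Internet Explorer" -- the '+' is decoded to a space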
With the -o option, the spider's output goes to a JSON file:
$ scrapy crawl ua -o item.json
The result looked like this:

[screenshot: the crawl log, showing a robots.txt message instead of the expected items]
That was not the result I wanted. Note that robots.txt message: my guess was that the site blocks crawlers.
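If that guess is right and the log shows requests being filtered by Scrapy's own robots.txt middleware (newer Scrapy projects obey robots.txt by default), the blunt fix is one line in settings.py; a sketch, with the usual caveat that this ignores the site's crawling rules:

# useragent\settings.py
ROBOTSTXT_OBEY = False  # stop filtering requests against the site's robots.txt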
Right or wrong, the first thing to try is impersonating a browser by adding headers to every request:
# useragent\spiders\ua.py
import scrapy
from useragent.items import UseragentItem


class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )

    # runs before any request is sent, so every request carries the headers
    def start_requests(self):
        for url in self.start_urls:
            headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"}
            yield scrapy.Request(url, callback=self.parse, headers=headers)

    def parse(self, response):
        ua_name = response.url.split('=')[-1]
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
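As an aside, when one fixed User-Agent is enough, the same effect can be had without overriding start_requests; a sketch of the equivalent project-wide setting:

# useragent\settings.py
USER_AGENT = ("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
              "AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)")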
Run it again, and it works!
The results now come out as expected:

[screenshot: the scraped user-agent items]
Great, from now on there is no shortage of User-Agents to use.
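As a quick sanity check that the file is usable, here is a minimal sketch that loads the item.json produced above and picks a random User-Agent from it:

import json
import random

with open('item.json') as f:
    user_agents = json.load(f)  # a list of {"ua_name": ..., "ua_string": ...} objects

print(random.choice(user_agents)['ua_string'])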