Scraping User-Agent strings with Scrapy

Date: 2019-11-29


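useragentstring.com lists nearly every User-Agent string there is. I have just learned Scrapy, so I decided to practice by scraping the User-Agent strings from that site.

This article only scrapes the common ones: Firefox, Chrome, Opera, Safari, and Internet Explorer.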

1. Create the Scrapy project

1.1 Create a project named useragent

$ scrapy startproject useragent

1.2 Change into the project directory

$ cd useragent

1.3 Generate the spider file ua

This step is optional, but the generated skeleton makes things a little more convenient.

$ scrapy genspider ua useragentstring.com

2. Edit the items file

# useragent\items.py
import scrapy

class UseragentItem(scrapy.Item):
    # define the fields for your item here like:
    ua_name = scrapy.Field()
    ua_string = scrapy.Field()

3. Edit the spider file

# useragent\spiders\ua.py 

import scrapy

from useragent.items import UseragentItem

class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )

    def parse(self, response):
        ua_name = response.url.split('=')[-1]
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item
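
The parse method derives the browser name from whatever follows the last '=' in the URL. The sketch below, which runs outside Scrapy, shows what that yields for the Internet Explorer URL and how the standard library's urllib.parse could decode the '+' instead. The two helper function names are made up for illustration and are not part of the spider.

```python
from urllib.parse import urlsplit, parse_qs


def browser_name_simple(url):
    # Same approach as the spider: take whatever follows the last '='.
    return url.split('=')[-1]


def browser_name_decoded(url):
    # Parse the query string properly; parse_qs turns '+' into a space.
    query = parse_qs(urlsplit(url).query)
    return query['name'][0]


url = 'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer'
print(browser_name_simple(url))   # Internet+Explorer
print(browser_name_decoded(url))  # Internet Explorer
```

With the simple split, the ua_name stored for Internet Explorer keeps the '+' from the URL, which is harmless as long as later code looks it up under the same key.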

4. Run the spider

Use the -o option to have the spider write its output to a JSON file.

$ scrapy crawl ua -o item.json

The result is shown in the figure:
[Figure: output of the first run]

It looks like we did not get the result we wanted. Note the robots.txt entry in the output. My guess is that the site probably blocks crawlers.
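
One way to test a guess like this is to check a URL against robots.txt rules with Python's standard urllib.robotparser. The rules in this sketch are hypothetical, written only to illustrate the check; they are not the real contents of the site's robots.txt. The helper name is_allowed is also made up.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    # Feed the robots.txt lines to the parser, then ask whether
    # this user agent may fetch the given URL.
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (an assumption, not the site's actual file).
sample_rules = [
    "User-agent: *",
    "Disallow: /pages/",
]

page_url = "http://www.useragentstring.com/pages/useragentstring.php?name=Firefox"
home_url = "http://www.useragentstring.com/index.php"

print(is_allowed(sample_rules, "Scrapy", page_url))  # False under these sample rules
print(is_allowed(sample_rules, "Scrapy", home_url))  # True under these sample rules
```

In Scrapy itself, the ROBOTSTXT_OBEY setting in settings.py controls whether the crawler honors robots.txt.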

Whether or not that guess is right, let's imitate a browser first by adding headers to every request:

# useragent\spiders\ua.py 

import scrapy

from useragent.items import UseragentItem

class UaSpider(scrapy.Spider):
    name = "ua"
    allowed_domains = ["useragentstring.com"]
    start_urls = (
        'http://www.useragentstring.com/pages/useragentstring.php?name=Firefox',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Internet+Explorer',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Opera',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Safari',
        'http://www.useragentstring.com/pages/useragentstring.php?name=Chrome',
    )
    
    # Runs before any request is sent: builds the initial requests with a browser-like User-Agent
    def start_requests(self):
        for url in self.start_urls:
            headers = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"}
            yield scrapy.Request(url, callback=self.parse, headers=headers)

    def parse(self, response):
        ua_name = response.url.split('=')[-1]
        for ua_string in response.xpath('//li/a/text()').extract():
            item = UseragentItem()
            item['ua_name'] = ua_name
            item['ua_string'] = ua_string.strip()
            yield item

Run it again, and it works!
The result is shown below:
[Figure: output of the second run]
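
As an alternative to setting the header on each request, Scrapy also has a project-wide USER_AGENT setting. This is a minimal sketch, assuming the settings.py that scrapy startproject generated and reusing the same string as the spider:

```python
# useragent\settings.py (fragment)

# Sent as the User-Agent header on requests that do not set their own.
USER_AGENT = (
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; "
    "AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
)
```

With this setting in place, the spider could rely on start_urls alone rather than overriding start_requests.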

Good. From now on there will be no shortage of User-Agent strings to use.
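
To put the scraped strings to use, the JSON file written by -o can be loaded and grouped by browser. This is a rough sketch. The helper names are made up, and the sample items are illustrative stand-ins with the same two fields as UseragentItem, not actual scraped output.

```python
import json
import random
from collections import defaultdict


def load_items(path):
    # The file written by "scrapy crawl ua -o item.json" holds a JSON array of objects.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_by_browser(items):
    # Map each ua_name to the list of its ua_string values.
    grouped = defaultdict(list)
    for item in items:
        grouped[item["ua_name"]].append(item["ua_string"])
    return dict(grouped)


# Illustrative sample data (an assumption); in practice use load_items("item.json").
sample_items = [
    {"ua_name": "Firefox",
     "ua_string": "Mozilla/5.0 (X11; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0"},
    {"ua_name": "Internet+Explorer",
     "ua_string": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; "
                  ".NET CLR 1.1.4322; .NET CLR 2.0.50727)"},
]

grouped = group_by_browser(sample_items)
# Pick a random User-Agent string for a given browser.
print(random.choice(grouped["Firefox"]))
```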
