Everyday anti-anti-crawling

Here are a few common anti-crawler mechanisms I have run into at work, along with strategies for dealing with them.

The crawler's gentlemen's agreement

Some websites want to be picked up by search engines; sites with sensitive information do not want to be discovered by them.

The owner of a site's content is its administrator, and search engines should respect the owner's wishes. To make that possible, there has to be a way for websites and crawlers to communicate, giving administrators a chance to state their intent. Where there is demand there is supply, and so the robots protocol was born.

scrapy obeys the robots protocol by default; to ignore it, change this line in settings.py:

ROBOTSTXT_OBEY = False
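
For reference, you can also check a site's robots.txt yourself with Python's standard library before crawling; a minimal sketch (example.com and the path are placeholders):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

# True if a crawler with this User-Agent is allowed to fetch the given URL
print(rp.can_fetch("*", "https://www.example.com/some/page"))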

Blocking by request headers

When you visit too frequently, the site's backend examines your request headers to decide whether the visit comes from a browser or from a program.

All we have to do is forge the request headers to create the illusion of a browser visit.

Below are the header changes for three different crawler setups.

import requests

# target page (a Tmall search result)
url = 'https://list.tmall.com/search_product.htm?q=%B0%D7%BE%C6&type=p&vmarket=&spm=875.7931836%2FB.a2227oh.d100&from=mallfp..pc_1_searchbutton'

# pretend to be a desktop browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063'}

content = requests.get(url, headers=headers)
requests version
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# override phantomjs's default User-Agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 6 Build/LYZ28E) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36"
)
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("https://httpbin.org/get?show_env=1")
driver.get_screenshot_as_file('01.png')
driver.quit()
selenium+phantomjs version

For scrapy, randomize the request headers: first add a list of them to settings.py, then tell scrapy where to pick them up.

USER_AGENT_LIST=[
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
DOWNLOADER_MIDDLEWARES = {
    'onenine.middleware.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

In settings.py, the middleware path follows scrapy's convention:

yourproject.middlewares(the file name).MiddlewareClass

Then define the header-handling middleware in middlewares.py.

import random

from yourproject.settings import USER_AGENT_LIST  # adjust the import path to your project

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # pick a random User-Agent from the list for every outgoing request
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            request.headers.setdefault('User-Agent', ua)

If you really don't want to hunt for User-Agent strings online, you can use fake-useragent to generate them.

import requests
from fake_useragent import UserAgent
ua = UserAgent()
headers = {'User-Agent': ua.random}
url = 'URL of the page to crawl'
resp = requests.get(url, headers=headers)
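The same idea also fits the scrapy middleware shown earlier; a minimal sketch, assuming fake-useragent is installed (the class name and its placement are illustrative):

from fake_useragent import UserAgent

class FakeUserAgentMiddleware(object):
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        # ask fake-useragent for a fresh random User-Agent on every request
        request.headers.setdefault('User-Agent', self.ua.random)

Register it in DOWNLOADER_MIDDLEWARES just like RandomUserAgentMiddleware above.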

JS-rendered pages

The page we see in the browser is the result after JS rendering, so when we go to the raw source for data, the data we want is nowhere to be found.

Here are three ways to deal with it.

  • selenium+webdriver

  Using a browser driver with selenium can completely imitate human operation, and it works wonders against captchas. Still, selenium is generally not recommended, mainly for efficiency reasons: even though headless browsers such as phantomjs cut memory use, it is still much slower than the other two approaches and is poorly suited to distributed crawling. (Recent versions of selenium have dropped phantomjs; see the headless-Chrome sketch after this list.)

  • scrapy_splash

  Somewhat similar to selenium, but scrapy_splash is a lightweight browser engine built on Twisted and QT that exposes a plain HTTP API. Being fast and lightweight makes it easy to use in distributed crawls, and scrapy integrates with splash particularly well, though it has to run inside docker. Look into it if you are interested. (In my experience it occasionally fails to render the JS completely.)

  • Packet capture

  Get the data you want by analyzing the page's API endpoints. This is the fastest approach and also the most troublesome: some sites guard their APIs very aggressively, and one of Taobao's endpoints even demands a captcha for every four records fetched.
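
Since selenium no longer ships with phantomjs support, here is a minimal sketch of the selenium route using headless Chrome (the URL is just a test target, and a matching chromedriver is assumed to be on PATH):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")          # run Chrome without opening a window
driver = webdriver.Chrome(options=options)  # requires chromedriver on PATH

driver.get("https://httpbin.org/get?show_env=1")
print(driver.page_source)  # the source after JS has finished rendering
driver.quit()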

IP blocking

Against IP blocking, the only real option is proxy IPs.

If you crawl purely for learning purposes, you can build your own IP pool by scraping free proxy sites and saving the IPs to a database for your own use. But free IPs are generally unstable: they die quickly and are rarely highly anonymous.

There are plenty of commercial proxy services of very mixed quality, each with its own IP-rotation model. The proxy we currently use can switch IPs within a second, which effectively prevents bans caused by the IP, though the price is considerable too.
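
If you do build your own pool from free proxies, it pays to validate them before use; a minimal sketch (the candidate list and the httpbin.org test target are just examples):

import requests

# hypothetical candidates scraped from a free proxy site
candidate_proxies = ["http://1.2.3.4:8080", "http://5.6.7.8:3128"]

def is_alive(proxy):
    # a proxy counts as alive if it can fetch a test page within 5 seconds
    proxies = {"http": proxy, "https": proxy}
    try:
        resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# keep only the proxies that still respond
alive = [p for p in candidate_proxies if is_alive(p)]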

Below is the proxy-endpoint usage for the same three setups.

import requests

# target page to fetch
targetUrl = "http://test.abuyun.com/proxy.php"

# proxy server
proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"

# proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host" : proxyHost,
    "port" : proxyPort,
    "user" : proxyUser,
    "pass" : proxyPass,
}

proxies = {
    "http"  : proxyMeta,
    "https" : proxyMeta,
}

resp = requests.get(targetUrl, proxies=proxies)
requests version
from selenium import webdriver

# proxy server
proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"

# proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

service_args = [
    "--proxy-type=http",
    "--proxy=%(host)s:%(port)s" % {
        "host" : proxyHost,
        "port" : proxyPort,
    },
    "--proxy-auth=%(user)s:%(pass)s" % {
        "user" : proxyUser,
        "pass" : proxyPass,
    },
]

# target page to fetch
targetUrl = "http://test.abuyun.com/proxy.php"

phantomjs_path = r"./phantomjs"

driver = webdriver.PhantomJS(executable_path=phantomjs_path, service_args=service_args)
driver.get(targetUrl)

print(driver.title)
print(driver.page_source.encode("utf-8"))

driver.quit()
selenium+phantomjs version

For scrapy, define the proxy in middlewares.py:

import base64

# proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# proxy tunnel credentials
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

# Basic auth header for the proxy tunnel (Python 3)
proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth

Then update the parameters in settings.py:

DOWNLOADER_MIDDLEWARES = {
#    'myproxies.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 543,
    'yourproject.middlewares.ProxyMiddleware': 125
}