Nginx Anti-Crawler: Blocking Certain User Agents from Crawling a Site

Date: 2019-11-11

Problem

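A website that customers used to reach without trouble has been very slow for the past few days and at times refused connections outright. The Nginx access log showed a large number of requests hitting the same page, from client IP addresses that kept changing with no clear pattern, so blocking by IP was impractical. However, every request carried the Bytespider marker in its User-Agent, which identifies a rogue crawler. The access log is shown in the figure below: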

Solution

Approach: because the User-Agent carries the Bytespider crawler marker, an Nginx rule can match on it and deny the rogue crawler by returning a 403 error directly.

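1. In the /etc/nginx/conf.d directory (the path to site configuration files may differ depending on how Nginx was installed), create a new configuration file named deny_agent.config: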

# Block common scraping tools (~* is a case-insensitive match, so "Curl" also matches "curl/x.y")
if ($http_user_agent ~* (Scrapy|Curl|HttpClient))
{
    return 403;
}

# Block known crawler and scanner User-Agents (~ is case-sensitive); the trailing ^$ also blocks an empty User-Agent
if ($http_user_agent ~ "Bytespider|FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" )
{
    return 403;
}

# Reject any request method other than GET, HEAD, or POST
if ($request_method !~ ^(GET|HEAD|POST)$)
{
    return 403;
}

2. In the configuration file of the site concerned, include deny_agent.config (note that the include must go inside the server block).
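
A minimal sketch of this step follows. The listen port, server_name, and root path are placeholders chosen for illustration and are not taken from the original setup; only the include line is the point.

```nginx
server {
    listen 80;
    server_name example.com;          # hypothetical site name

    # Pull in the User-Agent and request-method rules created in step 1.
    # The path assumes the file was placed in /etc/nginx/conf.d.
    include /etc/nginx/conf.d/deny_agent.config;

    location / {
        root /usr/share/nginx/html;   # hypothetical document root
    }
}
```

The rules use the if directive, which Nginx accepts only in the server and location contexts, so the file cannot be included at the http level. That is also a likely reason for the .config extension: packaged nginx.conf files commonly include conf.d/*.conf inside the http block, and a file ending in .conf would be loaded there automatically and rejected.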

3. Reload Nginx. A graceful reload with nginx -s reload is recommended. Before reloading, run nginx -t to check that the configuration is valid.

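4. Simulate a request with curl to see whether the rules have taken effect; a 403 Forbidden response means the configuration works. For example, curl -I -A "Bytespider" followed by the site's URL sends a HEAD request with that User-Agent.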

Appendix: Collected User-Agents

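FeedDemon             content scraping
BOT/0.1 (BOT for JCE) SQL injection
CrawlDaddy            SQL injection
Java                  content scraping
Jullo                 content scraping
Feedly                content scraping
UniversalFeedParser   content scraping
ApacheBench           CC attack tool
Swiftbot              useless crawler
YandexBot             useless crawler
AhrefsBot             useless crawler
YisouSpider           useless crawler (now owned by UC Shenma Search; this spider can be allowed)
JikeSpider            useless crawler
MJ12bot               useless crawler
ZmEu                  phpMyAdmin vulnerability scanning
WinHttp               scraping, CC attacks
EasouSpider           useless crawler
HttpClient            TCP attacks
Microsoft URL Control scanning
YYSpider              useless crawler
jaunty                WordPress brute-force scanner
oBot                  useless crawler
Python-urllib         content scraping
Indy Library          scanning
FlightDeckReports Bot useless crawler
Linguee Bot           useless crawler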