Server Anti-Crawler Guide: Using nginx to Block Certain User Agents from Crawling a Site

Date: 2019-11-13


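There are many crawlers on the web. Some help a site get indexed, such as Baidu's spider (Baiduspider). Others ignore robots.txt rules, put load on the server, and bring the site no traffic in return, such as the Yisou spider (YisouSpider).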

The steps below show how to block these unwanted user agents from accessing the site.

Go to the conf directory under the nginx installation directory and save the configuration below as agent_deny.conf:

cd /usr/local/nginx/conf

vim agent_deny.conf

# Block scraping tools such as Scrapy (case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient))
{
    return 403;
}
# Block the listed UAs and requests with an empty UA (case-sensitive match)
if ($http_user_agent ~ "FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$" )
{
    return 403;
}
# Block request methods other than GET, HEAD and POST
if ($request_method !~ ^(GET|HEAD|POST)$)
{
    return 403;
}
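
To preview which clients these three rules would reject before touching a live server, the same patterns can be mirrored in a short script. The sketch below is Python rather than nginx configuration; the function name `is_blocked` is hypothetical, and it assumes Python's `re` module behaves like nginx's PCRE for these simple alternations.

```python
import re

# Rule 1: case-insensitive match (nginx "~*") against scraping tools.
TOOLS = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)

# Rule 2: case-sensitive match (nginx "~"); the trailing ^$ catches an empty UA.
BAD_UA = re.compile(
    r"FeedDemon|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot"
    r"|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench"
    r"|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib"
    r"|lightDeckReports Bot|YYSpider|DigExt|YisouSpider|HttpClient|MJ12bot"
    r"|heritrix|EasouSpider|Ezooms|^$"
)

# Rule 3: only these request methods are allowed.
METHODS = re.compile(r"^(GET|HEAD|POST)$")


def is_blocked(user_agent: str, method: str = "GET") -> bool:
    """Return True if any of the three rules would answer 403."""
    return bool(
        TOOLS.search(user_agent)
        or BAD_UA.search(user_agent)
        or not METHODS.search(method)
    )


if __name__ == "__main__":
    for ua in ["YisouSpider", "", "Baiduspider", "curl/7.68.0"]:
        print(repr(ua), "->", 403 if is_blocked(ua) else 200)
```

Because the first rule matches `Curl` case-insensitively, a plain curl request with its default User-Agent is rejected as well, which is worth keeping in mind when testing.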

Then, in the site's configuration, insert the following line right after location / {

include agent_deny.conf;

After saving, run the following command to reload nginx gracefully:

/usr/local/nginx/sbin/nginx -s reload

Simulate a request from the Yisou spider:

curl -I -A 'YisouSpider' http://your-site-url

The response is 403.

Simulate a request with an empty UA:

curl -I -A '' http://your-site-url

The response is 403.

Simulate a request from the Baidu spider:

curl -I -A 'Baiduspider' http://your-site-url

The response is 200.
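
The three curl checks can also be scripted. The following is a minimal sketch using Python's standard `http.client`, which does not add a User-Agent header on its own; the function name `probe` and the host `www.example.com` are placeholders, not part of the original setup.

```python
import http.client


def probe(host: str, user_agent: str, port: int = 80, path: str = "/") -> int:
    """Send a HEAD request with the given User-Agent and return the status code.

    An empty user_agent sends no User-Agent header at all, like `curl -A ''`.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    conn = http.client.HTTPConnection(host, port, timeout=10)
    try:
        conn.request("HEAD", path, headers=headers)
        return conn.getresponse().status
    finally:
        conn.close()


if __name__ == "__main__":
    # Expected results, taken from the curl checks: 403, 403, 200.
    for ua in ["YisouSpider", "", "Baiduspider"]:
        print(repr(ua), "->", probe("www.example.com", ua))
```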

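Below is a list of junk UAs commonly seen on the web:

FeedDemon              content scraping
BOT/0.1 (BOT for JCE)  SQL injection
CrawlDaddy             SQL injection
Java                   content scraping
Jullo                  content scraping
Feedly                 content scraping
UniversalFeedParser    content scraping
ApacheBench            CC attack tool
Swiftbot               useless crawler
YandexBot              useless crawler
AhrefsBot              useless crawler
YisouSpider            useless crawler
JikeSpider             useless crawler
MJ12bot                useless crawler
ZmEu                   phpMyAdmin vulnerability scanning
WinHttp                scraping, CC attacks
EasouSpider            useless crawler
HttpClient             TCP attack
Microsoft URL Control  scanning
YYSpider               useless crawler
jaunty                 WordPress brute-force scanner
oBot                   useless crawler
Python-urllib          content scraping
Indy Library           scanning
FlightDeckReports Bot  useless crawler
Linguee Bot            useless crawler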
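
Several entries in this list (for example YandexBot, Jullo, WinHttp and Linguee Bot) do not appear in the agent_deny.conf pattern, and one of them, BOT/0.1 (BOT for JCE), contains regex metacharacters. A rule covering extra names could be generated roughly as follows. This is a hedged Python sketch: `escape_ua` and `build_rule` are hypothetical helpers, names containing quotes or backslashes are not handled, and the output should be checked with `nginx -t` before reloading.

```python
import re

# Characters with special meaning in a PCRE pattern.
_META = re.compile(r"([.^$*+?()\[\]{}|])")


def escape_ua(name: str) -> str:
    """Backslash-escape regex metacharacters so the name matches literally."""
    return _META.sub(r"\\\1", name)


def build_rule(names, block_empty=True):
    """Return an nginx `if` block that answers 403 for the given UA names."""
    parts = [escape_ua(n) for n in names]
    if block_empty:
        parts.append("^$")  # also reject requests with an empty User-Agent
    pattern = "|".join(parts)
    return 'if ($http_user_agent ~ "%s") {\n    return 403;\n}' % pattern


if __name__ == "__main__":
    print(build_rule(["YandexBot", "BOT/0.1 (BOT for JCE)", "Linguee Bot"]))
```

Generating the pattern from a plain list keeps the names readable and avoids hand-editing one very long line.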