PHP Crawler -- 015: Dealing with Anti-Crawling Measures

Date: 2019-12-01

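What is anti-crawling?

A site's servers exist to serve its users, and a crawler consumes server resources.
The server can see the IP address of every client that sends it a request.
If a single IP makes a large number of requests in a short period, the server may ban that IP.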
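
To make the banning rule concrete, here is a minimal sketch of how a server might detect short-term high-frequency access from one IP with a sliding window. The class name `IpRateLimiter` and the limit values are hypothetical; real sites use their own rules, which are usually not published.

```php
<?php
// Hypothetical sliding-window limiter. It illustrates the idea only,
// not how any particular site implements its ban rule.
final class IpRateLimiter
{
    /** @var array<string, float[]> request timestamps, keyed by IP */
    private $hits = [];
    /** @var int allowed requests per window (assumed value) */
    private $maxRequests;
    /** @var float window length in seconds (assumed value) */
    private $windowSeconds;

    public function __construct(int $maxRequests, float $windowSeconds)
    {
        $this->maxRequests = $maxRequests;
        $this->windowSeconds = $windowSeconds;
    }

    /** Records a request and returns false once the IP exceeds the limit. */
    public function allow(string $ip, float $now): bool
    {
        $cutoff = $now - $this->windowSeconds;
        // Keep only the timestamps that still fall inside the window.
        $recent = array_values(array_filter(
            $this->hits[$ip] ?? [],
            static function (float $t) use ($cutoff): bool {
                return $t > $cutoff;
            }
        ));
        $recent[] = $now;
        $this->hits[$ip] = $recent;

        return count($recent) <= $this->maxRequests;
    }
}
```

A server using something like this would call `allow($ip, microtime(true))` on every request and reject or ban the client when it returns false.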

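Crawler ethics

Servers generally do not mind small crawlers.
They do, however, reject large high-frequency crawlers and malicious crawlers, because these put heavy load on the server or cause damage.
Servers usually welcome search engines (crawling is one of the core technologies behind Google and Baidu).
That welcome comes with conditions, and those conditions are written in the Robots protocol.
The Robots protocol is a widely accepted code of conduct for web crawlers.
Its full name is the Robots Exclusion Protocol (also called the Robots Exclusion Standard).
It tells crawlers which pages may be fetched and which may not.
To view a site's robots rules, append /robots.txt to the site's domain name.
For example, Taobao's robots rules are at www.taobao.com/robots.txt.
The excerpt below shows Taobao's rules for the Baidu and Google crawlers, as well as its rules for all other crawlers.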

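User-agent: Baiduspider # Baidu crawler
Allow: /article # allow paths beginning with /article
Allow: /oshtml # allow paths beginning with /oshtml
Allow: /ershou # allow paths beginning with /ershou
Allow: /$ # allow the root path only, i.e. the Taobao home page
Disallow: /product/ # disallow /product/
Disallow: / # disallow everything not covered by an Allow rule

User-Agent: Googlebot # Google crawler
Allow: /article
Allow: /oshtml
Allow: /product # allow paths beginning with /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Allow: /$
Disallow: / # disallow everything not covered by an Allow rule

…… # the file is long, so the rules for other crawlers are omitted; follow the link above for the full text

User-Agent: * # all other crawlers
Disallow: / # disallow every page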
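
A crawler can check these rules before it requests a page. The following is a simplified sketch, not a complete robots.txt implementation: it handles User-agent groups, Allow/Disallow, `#` comments, and the `*` and `$` pattern characters, and it lets the longest matching pattern win, as RFC 9309 describes. The function names `parse_robots` and `is_allowed` are hypothetical.

```php
<?php
/** Parses robots.txt text into [lowercased agent => [[isAllow, pattern], ...]]. */
function parse_robots(string $text): array
{
    $groups = [];
    $agents = [];     // agents of the group currently being read
    $inRules = false; // true once the current group has at least one rule

    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);

        if ($field === 'user-agent') {
            if ($inRules) { // a User-agent line after rules starts a new group
                $agents = [];
                $inRules = false;
            }
            $key = strtolower($value);
            $agents[] = $key;
            $groups[$key] = $groups[$key] ?? [];
        } elseif ($field === 'allow' || $field === 'disallow') {
            $inRules = true;
            if ($value === '') {
                continue; // an empty pattern matches nothing
            }
            foreach ($agents as $agent) {
                $groups[$agent][] = [$field === 'allow', $value];
            }
        }
    }
    return $groups;
}

/** Returns true when $path may be fetched by $userAgent under the parsed rules. */
function is_allowed(array $groups, string $userAgent, string $path): bool
{
    $ua = strtolower($userAgent);
    $rules = $groups['*'] ?? []; // fall back to the wildcard group
    foreach ($groups as $agent => $agentRules) {
        if ((string) $agent !== '*' && strpos($ua, (string) $agent) !== false) {
            $rules = $agentRules;
            break;
        }
    }

    $best = null; // [pattern length, isAllow] of the best match so far
    foreach ($rules as [$allow, $pattern]) {
        $anchored = substr($pattern, -1) === '$';
        $body = preg_quote(rtrim($pattern, '$'), '#');
        $regex = '#^' . str_replace('\*', '.*', $body) . ($anchored ? '$' : '') . '#D';
        if (preg_match($regex, $path) !== 1) {
            continue;
        }
        $len = strlen($pattern);
        // Longest pattern wins; on a tie, Allow wins.
        if ($best === null || $len > $best[0] || ($len === $best[0] && $allow)) {
            $best = [$len, $allow];
        }
    }
    return $best === null ? true : $best[1];
}
```

With the Taobao excerpt as input, `is_allowed($groups, 'Baiduspider', '/')` returns true, while `is_allowed($groups, 'Baiduspider', '/product/1')` returns false.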

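When crawlers hit a site heavily, its servers come under considerable load, so major sites put anti-crawling measures in place.
Where there are anti-crawling measures, there are also ways around them (anti-anti-crawling).
Even so, limit your crawler's speed, be grateful to the servers that provide the data, avoid putting too much load on them, and help keep the internet orderly.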

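How do you counter anti-crawling?

For every anti-crawling measure there is a countermeasure.
Since it is high-frequency access that gets restricted, there are two solutions:
First, add a sleep between requests to lower the request rate from a single IP.
Second, use an IP proxy pool so that each request comes from a different IP.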
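
The first option can be sketched as follows. This is only an illustration under assumptions: the function names `throttle_delay` and `polite_fetch_all` are hypothetical, the 2-second default interval is an arbitrary choice, and `$fetch` stands for whatever function actually performs the request.

```php
<?php
/**
 * Returns how many seconds to wait before the next request so that
 * consecutive requests start at least $minInterval seconds apart.
 */
function throttle_delay(float $lastRequestAt, float $now, float $minInterval): float
{
    return max(0.0, $minInterval - ($now - $lastRequestAt));
}

/**
 * Fetches each URL in turn, sleeping between requests to keep the
 * request rate from this IP low. $fetch is any callable that takes a URL.
 */
function polite_fetch_all(array $urls, callable $fetch, float $minInterval = 2.0): array
{
    $results = [];
    $last = null; // start time of the previous request
    foreach ($urls as $url) {
        if ($last !== null) {
            $wait = throttle_delay($last, microtime(true), $minInterval);
            if ($wait > 0) {
                usleep((int) round($wait * 1000000));
            }
        }
        $last = microtime(true);
        $results[$url] = $fetch($url);
    }
    return $results;
}
```

Measuring the interval from the start of the previous request means a slow response already counts toward the wait, so the crawler never sleeps longer than it needs to.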

Choosing a reliable IP proxy provider

cuiqingcai.com/5094.html

Setting up an IP proxy pool in PHP with Abuyun

PHP documentation: www.abuyun.com/http-proxy/…

The curl extension must be enabled.
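
A quick way to confirm this from PHP itself is sketched below. The helper name `curl_is_available` is hypothetical, and the php.ini line mentioned in the comment can differ by PHP version and platform.

```php
<?php
// Returns true when the curl extension is loaded and curl_init() is usable.
function curl_is_available(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (!curl_is_available()) {
    // Enabling it usually means uncommenting "extension=curl" in php.ini
    // (the exact line varies by PHP version and platform), then restarting PHP.
    echo "The curl extension is not enabled." . PHP_EOL;
}
```

Running `php -m` on the command line also lists the loaded extensions.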

To test that the proxy works, the sample code below prints the current IP:

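<?php
    // Target page to request
    $targetUrl = "http://test.abuyun.com";

    // Proxy server
    $proxyServer = "http://http-dyn.abuyun.com:9020";

    // Tunnel credentials (replace with the ones from your own account)
    $proxyUser   = "H19D75L76VK89Q8D";
    $proxyPass   = "8C17B0A80F475BD8";

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $targetUrl);

    // Send the request through the proxy as plain HTTP, without a CONNECT tunnel
    curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, false);
    // Disables TLS certificate verification; acceptable only for a quick test
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

    // Configure the proxy server
    curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
    curl_setopt($ch, CURLOPT_PROXY, $proxyServer);

    // Configure the tunnel authentication
    curl_setopt($ch, CURLOPT_PROXYAUTH, CURLAUTH_BASIC);
    curl_setopt($ch, CURLOPT_PROXYUSERPWD, "{$proxyUser}:{$proxyPass}");

    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727;)");

    // Give up if connecting takes over 3 seconds or the whole request over 5
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3);
    curl_setopt($ch, CURLOPT_TIMEOUT, 5);

    // Include the response headers in the result
    curl_setopt($ch, CURLOPT_HEADER, true);
    // Return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    $result = curl_exec($ch);
    if ($result === false) {
        // Report why the request failed (timeout, proxy authentication, ...)
        echo 'cURL error: ' . curl_error($ch) . PHP_EOL;
    }
    //$info = curl_getinfo($ch);

    curl_close($ch);

    var_dump($result);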
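
Because the script sets CURLOPT_HEADER to true, `$result` contains the response headers followed by the body. If only the body (the IP) is wanted, the two parts can be separated using the header size that cURL reports. The helper below is a sketch; `split_http_response` is a hypothetical name, and the header size must be read with `curl_getinfo($ch, CURLINFO_HEADER_SIZE)` before `curl_close($ch)` is called.

```php
<?php
/**
 * Splits a raw cURL response (headers followed by body) into its two parts.
 * $headerSize is the value of curl_getinfo($ch, CURLINFO_HEADER_SIZE).
 */
function split_http_response(string $raw, int $headerSize): array
{
    $headerText = (string) substr($raw, 0, $headerSize);
    $body = (string) substr($raw, $headerSize);

    $headers = [];
    foreach (preg_split('/\r\n|\n/', trim($headerText)) as $line) {
        if (strpos($line, ':') === false) {
            continue; // status line such as "HTTP/1.1 200 OK", or a blank separator
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so store them lowercased.
        $headers[strtolower(trim($name))] = trim($value);
    }
    return ['headers' => $headers, 'body' => $body];
}
```

In the script, this would be called as `split_http_response($result, curl_getinfo($ch, CURLINFO_HEADER_SIZE))` once `curl_exec()` has succeeded.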

Combining with QueryList: look up the IP used for each request

<?php
require 'vendor/autoload.php';
use QL\QueryList;
// Create a QueryList object
$ql = new QueryList();

$html = get_html_source('http://ip.tool.chinaz.com/');
if (!$html) {
    exit('Could not fetch the page through the proxy' . PHP_EOL);
}

$data = $ql->html($html)->rules([
    'ip'=>['#rightinfo > dl > dd.fz24','text'],
    'address'=>['#rightinfo > dl > dd:nth-child(4)','text']
])->queryData();

var_dump($data);

// Fetches $url through the proxy and returns the HTML, or false on failure.
// $maxRetries is an added safeguard (the value 10 is arbitrary) so that the
// loop cannot run forever when every attempt fails.
function get_html_source($url, $maxRetries = 10) {
    $result = false;
    $attempt = 0;
    // Retry in a loop, because some proxy IPs do not work and return no HTML.
    // While $result is empty, run the loop body again so the request goes out through a new IP.
    while (!$result && $attempt < $maxRetries) {
        $attempt++;
        // Target page to request
        $targetUrl = $url;
        // Proxy server
        $proxyServer = "http://http-dyn.abuyun.com:9020";
        // Tunnel credentials (replace with the ones from your own account)
        $proxyUser = "H19D75L76VK89Q8D";
        $proxyPass = "8C17B0A80F475BD8";
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $targetUrl);
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, false);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        // Configure the proxy server
        curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
        curl_setopt($ch, CURLOPT_PROXY, $proxyServer);
        // Configure the tunnel authentication
        curl_setopt($ch, CURLOPT_PROXYAUTH, CURLAUTH_BASIC);
        curl_setopt($ch, CURLOPT_PROXYUSERPWD, "{$proxyUser}:{$proxyPass}");
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727;)");
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        // curl_setopt($ch, CURLOPT_HEADER, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $result = curl_exec($ch);
        // If the request failed, do not switch IPs immediately; wait a moment,
        // because the number of requests per second is limited
        if (!$result) {
            sleep(2);
        }
        //$info = curl_getinfo($ch);
        curl_close($ch);
    }
    return $result ?: false;
}
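
The loop above waits a fixed 2 seconds after each failed request. A common variation, sketched here as an alternative rather than as the author's method, is exponential backoff: the wait doubles after every failure, up to a cap. The names `backoff_delay` and `fetch_with_backoff` and the 30-second cap are assumptions; `$fetch` stands for any callable that performs one request and returns the HTML, or false on failure.

```php
<?php
/** Delay in seconds before retry number $attempt (1-based): base, 2*base, 4*base, ... up to $cap. */
function backoff_delay(int $attempt, float $base = 2.0, float $cap = 30.0): float
{
    return min($cap, $base * (2 ** max(0, $attempt - 1)));
}

/**
 * Calls $fetch($url) until it returns a non-empty result or $maxAttempts is reached.
 * Returns the result, or null when every attempt failed.
 * $sleep can be replaced (for example in tests) to avoid real waiting.
 */
function fetch_with_backoff(string $url, callable $fetch, int $maxAttempts = 5, ?callable $sleep = null)
{
    $sleep = $sleep ?? static function (float $seconds): void {
        usleep((int) round($seconds * 1000000));
    };

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fetch($url);
        if ($result) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            $sleep(backoff_delay($attempt)); // no need to wait after the last attempt
        }
    }
    return null;
}
```

Backing off more after repeated failures keeps the crawler under the proxy's per-second request limit without slowing down the common case where the first retry succeeds.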