Node crawlers: controlling request concurrency with async.mapLimit

Date: 2019-12-08

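When you write a crawler, many sites treat a high number of concurrent requests as malicious traffic and ban your IP. To avoid this, we usually cap the number of concurrent requests in code. In Node this is typically done with the async module.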

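1. The async.mapLimit method

mapLimit(arr, limit, iterator, callback)

arr is usually the list of URLs to request, and limit is the maximum number of iterator calls allowed to run at the same time. mapLimit hands each item of arr to iterator in turn, and the collected results are passed to the final callback as an array in the same order as arr.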
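
To make these semantics concrete, here is a minimal sketch of a limit-bounded map. It is a hand-rolled stand-in written for illustration, not the async library's own code; mapLimitSketch, fakeFetch and the urls list are hypothetical names, and a timer takes the place of a real HTTP request.

```javascript
// Hand-rolled stand-in for illustration; NOT the async library's implementation.
function mapLimitSketch(arr, limit, iterator, callback) {
    var results = new Array(arr.length);
    var next = 0;          // index of the next item to start
    var running = 0;       // iterator calls currently in flight
    var finished = false;  // true once the final callback has been called

    if (arr.length === 0) {
        return callback(null, results);
    }

    function launch() {
        // Start items until the limit is reached or nothing is left.
        while (!finished && running < limit && next < arr.length) {
            start(next++);
        }
    }

    function start(index) {
        running++;
        iterator(arr[index], function (err, value) {
            running--;
            if (finished) { return; }
            if (err) {
                finished = true;
                return callback(err);      // first error ends the run
            }
            results[index] = value;        // keep input order
            if (next >= arr.length && running === 0) {
                finished = true;
                return callback(null, results);
            }
            launch();                      // a slot is free: start the next item
        });
    }

    launch();
}

var inFlight = 0;
var peak = 0;

// Hypothetical stand-in for an HTTP request: completes after a short delay.
function fakeFetch(url, callback) {
    inFlight++;
    peak = Math.max(peak, inFlight);
    setTimeout(function () {
        inFlight--;
        callback(null, 'fetched ' + url);
    }, 10);
}

var urls = ['/p1', '/p2', '/p3', '/p4', '/p5', '/p6', '/p7'];
mapLimitSketch(urls, 3, fakeFetch, function (err, results) {
    console.log('peak concurrency:', peak);  // peak concurrency: 3
    console.log(results);
});
```

The point to notice is that a new item starts only when an earlier iterator call invokes its callback, which is what keeps the number of in-flight calls at or below limit.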

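2. Using async.mapLimit

Below is a simple crawler I wrote earlier. It saves the scraped news titles and links to an Excel spreadsheet, with concurrency limited to 3. The code is as follows:

webSpider.js: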

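// Dependencies. The original listing omits them; node-xlsx is assumed
// because xlsx.build(dataArr) matches its API.
var fs = require('fs');
var async = require('async');
var request = require('request');
var cheerio = require('cheerio');
var xlsx = require('node-xlsx');

// Shared state, also not shown in the original listing.
var n = 0;                                   // requests currently in flight
var timeline = {};                           // start time per URL
var dataArr = [{ name: 'news', data: [] }];  // one sheet; the name is a placeholder
var options = [];                            // URLs to crawl (fill in)

// Fetch one URL and collect its items (the mapLimit iterator)
function main(option, callback) {
    n++;
    timeline[option] = new Date().getTime();
    console.log('Current concurrency:', n, ', fetching', option);
    request(option, function(err, res, body) {
        n--;
        if (err || res.statusCode !== 200) {
            // A non-200 response arrives with err === null, so build an Error;
            // otherwise mapLimit would count this URL as a success.
            err = err || new Error('Unexpected status ' + res.statusCode + ' for ' + option);
            console.log(err);
            return callback(err);
        }
        var $ = cheerio.load(body);
        $('#post_list .post_item').each(function(index, element) {
            var link = $(element).find('.post_item_body h3 a');
            dataArr[0].data.push([link.text(), link.attr('href')]);
        });
        console.log('Finished', option, 'in', new Date().getTime() - timeline[option], 'ms');
        callback(null, 'done!');
    });
}

// Limit concurrent requests to 3
async.mapLimit(options, 3, main, function(err, result) {
    if (err) {
        console.log(err);
    } else {
        fs.writeFile('data/cnbNews.xlsx', xlsx.build(dataArr), 'utf-8', function(err) {
            if (err) {
                console.log('write file error!', err);
            } else {
                console.log('write file success!');
            }
        });
    }
});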

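The iterator's second parameter, callback (invoked once the request for a single URL has completed), is the key here. If nothing fails, mapLimit's final callback runs after every URL in options has been requested, and the follow-up work (here, generating the file) happens there. If the request for any single URL fails, the final callback receives err and logs it, and the file is not generated.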
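
One subtlety is that the final callback only sees a failure if the iterator passes a truthy error to callback. A non-200 response arrives with err set to null, so it has to be turned into an Error explicitly. The sketch below shows that contract in isolation. makeIterator and stubRequest are hypothetical names, and the stub replaces the request module so the example runs without a network.

```javascript
// requestFn is injected so the sketch runs offline; in the crawler it
// would be the request module.
function makeIterator(requestFn) {
    return function (url, callback) {
        requestFn(url, function (err, res, body) {
            if (err) {
                return callback(err);   // network-level failure
            }
            if (res.statusCode !== 200) {
                // err is null here, so create an Error to signal failure
                return callback(new Error('HTTP ' + res.statusCode + ' for ' + url));
            }
            callback(null, body.length); // success: callback is called exactly once
        });
    };
}

// Hypothetical stub: '/missing' answers 404, every other URL answers 200.
function stubRequest(url, done) {
    var status = url === '/missing' ? 404 : 200;
    done(null, { statusCode: status }, 'body of ' + url);
}

var iterator = makeIterator(stubRequest);
iterator('/ok', function (err, result) {
    console.log('error:', err, 'result:', result);
});
iterator('/missing', function (err) {
    console.log('error:', err.message);
});
```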

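// Anti-pattern, shown for comparison only
async.mapLimit(options, 3, function(option, callback) {
    request(option, main);   // starts the request and returns at once; main is used here as the response handler
    callback(null);          // reports "done" before any response has arrived
}, function(err, result) {
    if(err) {
        console.log(err);
    } else {
        console.log('done!');
    }
});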

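The snippet above follows a pattern found in some online material: the iterator calls callback right after invoking request, without waiting for the response. Because request delivers its data asynchronously, callback fires while the request is still in flight. mapLimit then treats that item as finished and starts the next one, so the limit parameter has no effect and the number of concurrent requests is not actually capped. Watch out for this.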
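
The ordering problem can be seen without any library. The following is a minimal sketch in which fakeRequest, wrongIterator and rightIterator are hypothetical names, and a timer stands in for the network round trip. It records when each iterator signals completion relative to when its response arrives.

```javascript
var events = [];

// Hypothetical stand-in for request(): replies asynchronously after 10 ms.
function fakeRequest(url, done) {
    setTimeout(function () {
        events.push('response ' + url);
        done(null, { statusCode: 200 }, '<html></html>');
    }, 10);
}

// Anti-pattern: callback runs as soon as the request has been started.
function wrongIterator(url, callback) {
    fakeRequest(url, function () {});
    events.push('callback ' + url);
    callback(null);
}

// Correct: callback runs inside the response handler.
function rightIterator(url, callback) {
    fakeRequest(url, function (err, res, body) {
        events.push('callback ' + url);
        callback(err, body);
    });
}

wrongIterator('/a', function () {});
rightIterator('/b', function () {
    console.log(events);
    // [ 'callback /a', 'response /a', 'response /b', 'callback /b' ]
});
```

For /a the completion signal comes before the response, so a limiter would free the slot and start another request while /a is still in flight. For /b the slot stays occupied until the response has been handled.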

Run webSpider.js,