Technical Notes from Writing Crawlers in Node.js

Date: 2019-12-13

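Background

I recently decided to review the Node.js material I had read before and to write a few crawlers to pass the time. I ran into some problems along the way and am recording them here for future reference.

Dependencies

The crawled content is processed with cheerio, which is used practically everywhere, requests are made with superagent, and logs are written with log4js.

Logging configuration

Without further ado, here is the code: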

const log4js = require('log4js');

log4js.configure({
  appenders: {
    cheese: {
      type: 'dateFile',
      filename: 'cheese.log',
      pattern: '-yyyy-MM-dd.log',
      // Always include the date pattern in the log file name
      alwaysIncludePattern: true,
      maxLogSize: 1024,
      backups: 3
    }
  },
  categories: { default: { appenders: ['cheese'], level: 'info' } }
});

const logger = log4js.getLogger('cheese');
logger.level = 'INFO';

module.exports = logger;

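The module above exports a logger object. Business code can call logger.info() and similar methods to write log entries, and one log file is generated per day. Plenty of further material on log4js is available online.

Fetching and processing content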

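const fs = require('fs');
const path = require('path');
const cheerio = require('cheerio');
const superagent = require('superagent');
const EventProxy = require('eventproxy');
const logger = require('./logger'); // the logger module exported above (path assumed)

const ep = new EventProxy();

// cityItemUrl is the URL of the city list page, defined elsewhere
superagent.get(cityItemUrl).end((err, res) => {
  if (err) {
    return console.error(err);
  }

  const $ = cheerio.load(res.text);
  // Parse the current page and collect the city link elements
  const cityInfoEle = $('.newslist1 li a');

  // Register the handler once, before the loop: it runs after
  // 'getDirInfoComplete' has been emitted cityInfoEle.length times
  ep.after('getDirInfoComplete', cityInfoEle.length, (dirInfos) => {
    const content = JSON.parse(fs.readFileSync(path.join(__dirname, './imgs.json')));

    dirInfos.forEach((element) => {
      logger.info(`Record: ${JSON.stringify(element)}`);
      Object.assign(content, element);
    });

    fs.writeFileSync(path.join(__dirname, './imgs.json'), JSON.stringify(content));
  });

  cityInfoEle.each((idx, element) => {
    const $element = $(element);
    const sceneURL = $element.attr('href'); // page URL
    const sceneName = $element.attr('title'); // city name
    if (!sceneName) {
      return;
    }
    logger.info(`Parsed destination: ${sceneName}, URL: ${sceneURL}`);

    // Fetch the city details (presumably emits 'getDirInfoComplete' when done)
    getDesInfos(sceneURL, sceneName);
  });
});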

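superagent requests the page. Once the request succeeds, cheerio loads the page content, and jQuery-like selectors are used to locate the target resources.

To know when several resources have finished loading, eventproxy is used to proxy events: each processed resource triggers one event, and the data is handled once all of the events have fired.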
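
The "run once after N events" pattern can be sketched with Node's built-in EventEmitter. This is only an illustration of the idea, not eventproxy's implementation; the helper `afterN` and the event name `pageDone` are made up for this example.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: call `done` with all payloads once `eventName`
// has been emitted `times` times.
function afterN(emitter, eventName, times, done) {
  const results = [];
  const handler = (data) => {
    results.push(data);
    if (results.length === times) {
      emitter.removeListener(eventName, handler);
      done(results);
    }
  };
  emitter.on(eventName, handler);
}

const emitter = new EventEmitter();
let collected = null;
afterN(emitter, 'pageDone', 3, (results) => {
  collected = results;
});

// Each finished resource emits one event; the callback runs after the third.
['a', 'b', 'c'].forEach((name) => emitter.emit('pageDone', { name }));
```

Note that the callback never runs if fewer than `times` events are emitted, so every code path that handles a resource, including the failure path, has to emit.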

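That is the most basic crawler. The rest of this post covers the places where problems may occur or that need special attention.

Reading and writing local files

Creating directories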

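const fs = require('fs');
const path = require('path');

// Recursively create `dirname` together with any missing parent directories
function mkdirSync(dirname) {
  if (fs.existsSync(dirname)) {
    return true;
  }
  // Make sure the parent exists first, then create this level
  if (mkdirSync(path.dirname(dirname))) {
    fs.mkdirSync(dirname);
    return true;
  }

  return false;
}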
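
On newer runtimes the recursion is not needed. The sketch below assumes Node.js 10.12 or later, where `fs.mkdirSync` accepts `{ recursive: true }`; the helper name `ensureDir` and the directory names are made up for this example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumes Node.js 10.12 or later, where `recursive: true` is supported.
function ensureDir(dirname) {
  fs.mkdirSync(dirname, { recursive: true }); // no error if it already exists
  return fs.existsSync(dirname);
}

// Example: create a nested directory under a fresh temporary folder.
const base = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
const nested = path.join(base, 'imgs', 'city', 'beijing');
const created = ensureDir(nested);
```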

Reading and writing files

const content = JSON.parse(fs.readFileSync(path.join(__dirname, './dir.json')));

dirInfos.forEach((element) => {
  logger.info(`Record: ${JSON.stringify(element)}`);
  Object.assign(content, element);
});

fs.writeFileSync(path.join(__dirname, './dir.json'), JSON.stringify(content));
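
The fragment above throws if the JSON file does not exist yet. One possible way around that is to wrap the read, merge, and write steps in a helper that starts from an empty object. The helper name `mergeIntoJsonFile` and the sample entries are made up for this example.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: merge an array of objects into a JSON file,
// starting from an empty object when the file does not exist yet.
function mergeIntoJsonFile(file, entries) {
  const content = fs.existsSync(file)
    ? JSON.parse(fs.readFileSync(file, 'utf8'))
    : {};
  entries.forEach((entry) => Object.assign(content, entry));
  fs.writeFileSync(file, JSON.stringify(content, null, 2));
  return content;
}

// Example: two successive merges into a file in a temporary folder.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
const file = path.join(tmpDir, 'dir.json');
mergeIntoJsonFile(file, [{ beijing: '/city/beijing' }]);
const merged = mergeIntoJsonFile(file, [{ shanghai: '/city/shanghai' }]);
```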

Batch-downloading resources

The resources to download may include images, audio, and so on.

Use Bagpipe to limit asynchronous concurrency

const Bagpipe = require('bagpipe');

// Allow at most 10 downloads to run at the same time
const bagpipe = new Bagpipe(10);

bagpipe.push(downloadImage, url, dstpath, (err, data) => {
  if (err) {
    console.log(err);
    return;
  }
  console.log(`[${dstpath}]: ${data}`);
});
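
Conceptually, a limiter of this kind keeps at most N tasks in flight and queues the rest. The following is a minimal sketch of that idea, not Bagpipe's implementation; `createLimiter` and the fake download tasks are made up for this example.

```javascript
// Hypothetical limiter: runs at most `limit` callback-style tasks at a time.
function createLimiter(limit) {
  const queue = [];
  let active = 0;

  function next() {
    while (active < limit && queue.length > 0) {
      const { task, callback } = queue.shift();
      active += 1;
      task((err, data) => {
        active -= 1;
        callback(err, data);
        next(); // a slot is free, start the next queued task
      });
    }
  }

  return {
    push(task, callback) {
      queue.push({ task, callback });
      next();
    },
    get active() { return active; },
    get queued() { return queue.length; },
  };
}

// Usage: five fake downloads, at most two in flight.
const limiter = createLimiter(2);
const pending = []; // completion triggers of the tasks that have started
const finished = [];
for (let i = 0; i < 5; i += 1) {
  limiter.push(
    (done) => pending.push(() => done(null, i)),
    (err, data) => finished.push(data)
  );
}
```

A task must always call its `done` callback, on failure as well as on success; otherwise its slot is never released and the queued tasks behind it never start.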

Download the resource and use a stream to write it to a file.

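const fs = require('fs');
const request = require('request');

function downloadImage(src, dest, callback) {
  // Reject anything that is not an http(s) URL so the callback always fires
  if (!/^https?:\/\//.test(src || '')) {
    return callback(new Error(`invalid url: ${src}`));
  }
  request.head(src, (err) => {
    if (err) {
      return callback(err);
    }
    request(src)
      .on('error', callback)
      .pipe(fs.createWriteStream(dest))
      .on('close', () => callback(null, dest));
  });
}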
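
The protocol check can also be done with the built-in `URL` class instead of string matching. This is a small sketch of that alternative; `isHttpUrl` and the sample URLs are made up for this example.

```javascript
// Hypothetical helper: true only for absolute http:// or https:// URLs.
function isHttpUrl(src) {
  try {
    const { protocol } = new URL(src);
    return protocol === 'http:' || protocol === 'https:';
  } catch (err) {
    return false; // not a parsable absolute URL (relative path, undefined, ...)
  }
}

const checks = [
  isHttpUrl('https://example.com/a.png'),
  isHttpUrl('ftp://example.com/a.png'),
  isHttpUrl('/images/a.png'),
  isHttpUrl(undefined),
];
```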

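Encoding

Sometimes page content processed directly with cheerio.load turns out to be entity-encoded text once it is written to a file. Loading the page with

const $ = cheerio.load(buf, { decodeEntities: false });

disables that encoding.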

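P.S. I could not get the encoding library or iconv-lite to convert the UTF-8 encoded characters into Chinese, probably because I am not yet familiar with their APIs. This is worth revisiting later.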
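
One possible explanation, which is an assumption about what the output looked like: if the text consisted of HTML numeric character references such as `&#x4E2D;`, it is not a charset problem, and charset converters would leave it unchanged. Such references could be decoded with a sketch like the following; `decodeNumericEntities` is a made-up helper name.

```javascript
// Hypothetical helper: decode numeric character references such as
// &#x4E2D; (hex) or &#20013; (decimal). Named entities are left untouched.
function decodeNumericEntities(text) {
  return text
    .replace(/&#x([0-9a-f]+);/gi, (_, hex) => String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#([0-9]+);/g, (_, dec) => String.fromCodePoint(parseInt(dec, 10)));
}

const decoded = decodeNumericEntities('&#x4E2D;&#x6587; and &#20013;&#25991; &amp; more');
```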

Finally, here is a regular expression that matches all DOM tags:

const reg = /<.*?>/g;
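
As a usage sketch, the expression can strip tags from a string. The sample strings and the `[\s\S]` variant are additions for illustration. Since `.` does not match line breaks, a tag that spans several lines slips through the original pattern.

```javascript
const reg = /<.*?>/g;

const html = '<p class="intro">Hello <b>world</b></p>';
const text = html.replace(reg, '');

// `.` does not match line breaks, so a tag split across lines is missed;
// [\s\S] is a common way to cover that case.
const multilineReg = /<[\s\S]*?>/g;
const multiline = '<a\n href="/x">link</a>';
const stripped = multiline.replace(multilineReg, '');
const missed = multiline.replace(reg, '');
```

A regular expression like this is fine for quick cleanup, but for anything structural it is safer to rely on the parser that is already loaded (cheerio).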
