Node.js Crawler in Practice - Scraping Images and Downloading Them Locally

Date: 2020-07-10


Preface

Crawlers should comply with the robots protocol (a site's robots.txt).
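As a rough illustration of what following the robots protocol can look like in code, the sketch below checks a path against the Disallow rules of the "User-agent: *" group. The function name isPathAllowed and the sample rules are hypothetical, and the matching is deliberately simplified (prefix match only, no Allow lines, wildcards, or per-bot groups), so treat it as a starting point rather than a complete parser.

```javascript
// Simplified robots.txt check: only the "User-agent: *" group and
// plain prefix matching of Disallow rules are handled.
function isPathAllowed(robotsTxt, pathname) {
  let inWildcardGroup = false;
  const disallowed = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    if (!line) continue;
    const idx = line.indexOf(':');
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === 'user-agent') {
      inWildcardGroup = value === '*';
    } else if (field === 'disallow' && inWildcardGroup && value) {
      disallowed.push(value);
    }
  }
  return !disallowed.some((rule) => pathname.startsWith(rule));
}

// Hypothetical robots.txt content, for illustration only.
const sampleRobots = [
  'User-agent: *',
  'Disallow: /admin/',
  'Disallow: /private',
].join('\n');

console.log(isPathAllowed(sampleRobots, '/news/1.html')); // true
console.log(isPathAllowed(sampleRobots, '/admin/login')); // false
```

In a real crawler the text would come from the site's /robots.txt, fetched once before any page is requested.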

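What is a crawler

Quoting Baidu Baike:

A web crawler (also called a web spider or web robot, and more often a web chaser in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less common names include ant, automatic indexer, emulator, and worm.

Put simply, a crawler lets a machine fetch the information you want automatically. Say you visit a website and find lots of nice pictures, so you right-click and save them locally. After saving a few, you start to wonder: why not write a script that downloads these images automatically? And so the crawler was born...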

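Common crawler types

Server-side rendered pages (SSR): the server returns HTML that is already rendered
Client-side rendered pages (CSR): the typical single-page application, rendered in the browser

The second type has to be crawled by analyzing the API requests the page makes. This article covers the first type, using Node.js to fetch remote images and download them locally.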
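For the client-rendered case, which this article does not cover, the work usually shifts from parsing HTML to reading the JSON that the page's API returns. The sketch below is only an illustration under assumed names: the payload shape (data.list, title, cover) and the function extractImages are hypothetical, and the output reuses the { title, down_loda_url } shape that the code later in this article builds.

```javascript
// Hypothetical JSON payload, shaped like what a client-rendered page
// might fetch from its backend; the field names are assumptions.
const sampleResponse = {
  data: {
    list: [
      { title: 'Mountain trail', cover: 'https://example.com/img/a.jpg' },
      { title: 'No picture here', cover: null },
      { title: 'Lake at dawn', cover: 'https://example.com/img/b.jpg' },
    ],
  },
};

// Map API items to { title, down_loda_url }, skipping entries without an image.
function extractImages(payload) {
  const items = (payload && payload.data && payload.data.list) || [];
  return items
    .filter((item) => typeof item.cover === 'string' && item.cover.length > 0)
    .map((item) => ({ title: item.title, down_loda_url: item.cover }));
}

console.log(extractImages(sampleResponse));
```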


Preparation

1 Directory layout

┌── cache
│   └── image  downloaded images
├── app.js
└──  package.json
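The image directory can also be created from code, so that a download never fails because the folder is missing. This is a minimal sketch, assuming Node 10.12 or later for the recursive option; the helper name ensureImageDir is hypothetical, and cache/image is the path the download code in this article writes to.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Create <baseDir>/cache/image (and any missing parents) if it does not exist yet.
function ensureImageDir(baseDir) {
  const imageDir = path.join(baseDir, 'cache', 'image');
  fs.mkdirSync(imageDir, { recursive: true }); // no error if it already exists
  return imageDir;
}

// Demo against a throwaway temp directory; in app.js you would pass __dirname.
const demoBase = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-demo-'));
console.log(ensureImageDir(demoBase));
```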

2 Install dependencies

axios, an HTTP request library

npm i axios --save

cheerio, a server-side 'jQuery'

npm i cheerio --save

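fs, the file system module

fs is built into Node.js, as is the path module used below, so there is nothing to install. Load them with require('fs') and require('path').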

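Start crawling

We will crawl an outdoor-sports website, grabbing the recommended images on its home page and downloading them locally.

1 Workflow analysis

Analyze the page structure and decide what to crawl
Fetch the page content with an HTTP request from Node
Use cheerio to build the array of images
Iterate over the image array and download each one locally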

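2 Writing the code

axios fetches the HTML. Inspecting it shows that the images sit in the 'newsimg' block. cheerio is used almost exactly like jQuery, so we use it to pull out each image's title and download link.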

const res = await axios.get(target_url);
const html = res.data;
const $ = cheerio.load(html);
const result_list = [];
// cheerio's each() passes (index, element), in that order
$('.newscon').each((index, element) => {
  result_list.push({
    title: $(element).find('.newsintroduction').text(),
    down_loda_url: $(element).find('img').attr('src').split('!')[0],
  });
});
this.result_list.push(...result_list);

We now have an array of download links. The next step is to iterate over it, request each image, and save it locally with fs.

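const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
const response = await axios.get(href, { responseType: 'stream' });
// pipe() returns the destination stream, not a Promise, so wait for 'finish' explicitly
const writer = response.data.pipe(fs.createWriteStream(target_path));
await new Promise((resolve, reject) => writer.on('finish', resolve).on('error', reject));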
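Two details of the save step are easy to get wrong: knowing when the file has actually been written, and deriving a file name when the URL carries a query string. The sketch below is one possible approach, not the article's own method: it swaps in Node's stream.pipeline (promisified) for the wait, and the helpers saveStream and fileNameFromUrl are hypothetical names. An in-memory stream stands in for response.data so that the example runs without a network.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { Readable, pipeline } = require('stream');
const { promisify } = require('util');

const pipelineAsync = promisify(pipeline);

// Derive a local file name from a URL, ignoring any ?query or #hash part.
// Falls back to a caller-supplied name when the URL path has no file name.
function fileNameFromUrl(href, fallback = 'image') {
  const name = path.posix.basename(new URL(href).pathname);
  return name || fallback;
}

// Write a readable stream to disk. Resolves only after the file is fully
// written and rejects if either side of the pipe fails.
function saveStream(readable, targetPath) {
  return pipelineAsync(readable, fs.createWriteStream(targetPath));
}

// Demo: an in-memory stream stands in for `response.data`.
async function demo() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-demo-'));
  const target = path.join(dir, fileNameFromUrl('https://example.com/pics/a.jpg?size=large'));
  await saveStream(Readable.from([Buffer.from('fake image bytes')]), target);
  return target;
}

demo().then((target) => console.log('saved to', target));
```

In the crawler itself the call would look like await saveStream(response.data, target_path).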

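3 Request optimization

Requesting too frequently can get your IP banned. A few simple countermeasures:

Avoid bursts of requests in a short period; wait a while between requests
Set the User-Agent in an axios interceptor, using a different one for each request
Use an IP pool, so that each request goes out from a different IP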
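The first two measures can be sketched as follows. The User-Agent strings are only sample values, and rotateUserAgent and randomDelay are hypothetical helper names. The interceptor is written as a plain function so it can be read and tried without axios installed; the comments at the end show where it would be registered.

```javascript
// Sample pool of User-Agent strings; replace with the values you want to send.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
];

let nextAgent = 0;

// Request-interceptor body: sets a different User-Agent on each call, round-robin.
function rotateUserAgent(config) {
  config.headers = config.headers || {};
  config.headers['User-Agent'] = USER_AGENTS[nextAgent % USER_AGENTS.length];
  nextAgent += 1;
  return config;
}

// Random pause between minMs and maxMs, so requests are not evenly spaced.
function randomDelay(minMs, maxMs) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// With axios installed, the interceptor would be registered like this:
//   axios.interceptors.request.use(rotateUserAgent);
// and the delay awaited between downloads:
//   await randomDelay(1000, 3000);
```

Setting a lower bound on the delay avoids the near-zero pauses that a plain 3000 * Math.random() can produce.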

Complete code

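const path = require('path');
const fs = require('fs');
const axios = require('axios');
const cheerio = require('cheerio');

class stealData {

  constructor(base_url) {
    this.base_url = base_url; // site to crawl
    this.current_page = 1;
    this.result_list = [];
  }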

  async init() {
    try {
      await this.getPageData();
      await this.downLoadPictures();
    } catch (e) {
      console.log(e);
    }
  }

  sleep(time) {
    return new Promise((resolve) => {
      console.log(`Sleeping, next request in ${time / 1000}s......`)
      setTimeout(() => {
        resolve();
      }, time);
    });
  }

  async getPageData() {
    const target_url = this.base_url;
    try {
      const res = await axios.get(target_url);
      const html = res.data;
      const $ = cheerio.load(html);
      const result_list = [];
      $('.newscon').each((index, element) => {
        result_list.push({
          title: $(element).find('.newsintroduction').text(),
          down_loda_url: $(element).find('img').attr('src').split('!')[0],
        });
      });
      this.result_list.push(...result_list);
      return Promise.resolve(result_list);
    } catch (e) {
      console.log('Failed to fetch data');
      return Promise.reject(e);
    }
  }

  async downLoadPictures() {
    const result_list = this.result_list;
    try {
      for (let i = 0, len = result_list.length; i < len; i++) {
        console.log(`Downloading image ${i + 1}...`);
        await this.downLoadPicture(result_list[i].down_loda_url);
        console.log(`Image ${i + 1} downloaded.`);
        await this.sleep(3000 * Math.random());
      }
      return Promise.resolve();
    } catch (e) {
      console.log('Failed to write data');
      return Promise.reject(e)
    }
  }

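  async downLoadPicture(href) {
    try {
      const target_path = path.resolve(__dirname, `./cache/image/${href.split('/').pop()}`);
      const response = await axios.get(href, { responseType: 'stream' });
      // pipe() returns the destination stream, not a Promise, so wait for 'finish' explicitly
      const writer = response.data.pipe(fs.createWriteStream(target_path));
      await new Promise((resolve, reject) => writer.on('finish', resolve).on('error', reject));
      console.log('Write succeeded');
      return Promise.resolve();
    } catch (e) {
      console.log('Failed to write data');
      return Promise.reject(e)
    }
  }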

}

const thief = new stealData('xxx_url');
thief.init();
