Crawler can't get the data? Try puppeteer (Node.js)

Date: 2019-11-07


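Scenario

Not long ago at school, while I was cloning the Weibo Xianzhi (微博鮮知) WeChat mini program, I was wondering where the data would come from and came across the 微博新鮮事 (Weibo Novelty) page, which shows the same data (you need to be logged out to see it). So I tried scraping it with cheerio. It crashed and burned: on inspection, the body returned by the request was empty, and when I looked at the page source of Weibo Novelty, it turned out... the HTML is rendered by JS, and there seems to be one more redirect on top of that. Wow, ruthless!

In the spirit of "as long as your thinking doesn't slip, there are always more solutions than problems", I turned to puppeteer.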

puppeteer

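In my experience, puppeteer behaves like a complete browser: it actually parses and renders the page, so the problem mentioned above, where the redirect kept me from getting the page structure, goes away.

Enough talk, let's install it. puppeteer is fairly large, so install it with cnpm (the Taobao mirror is much faster).

Install cnpm (skip this if you already have it):

npm install -g cnpm --registry=https://registry.npm.taobao.org
Install puppeteer:

cnpm i puppeteer -S
Hello World

As an example, let's scrape the title of the first item on the Weibo Novelty page.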
```
const puppeteer = require('puppeteer');

const url = 'https://weibo.com/?category=novelty';
// The page performs a deliberate second redirect, so we sleep to wait it out.
const sleep = (time) => new Promise((resolve) => {
  setTimeout(() => {
    resolve(true);
  }, time);
});

async function getindex(url) {
  const browser = await puppeteer.launch({ // a Browser object
    headless: false // puppeteer is powerful, but headless mode is not needed here, so it is off
  });
  const page = await browser.newPage(); // open a new page
  await page.goto(url, { // navigate to the target url, with a navigation timeout
    timeout: 60000
  });
  await sleep(60000); // wait for the second redirect to finish
  const data = await page.$$eval('.UG_list_b', (lists) => { // equivalent to document.querySelectorAll('.UG_list_b')
    const newarr = [Array.from(lists)[0]]; // only the first item is wanted; for every result, use Array.from(lists)
    return newarr.map(node => { // walk the array and pick out each title
      const title = node.querySelector('.list_des .list_title_b a').innerText;
      return title;
    });
  });
  await browser.close(); // close the browser
  return data;
}

getindex(url)
  .then(res => {
    console.log(res);
  })
  .catch(err => {
    console.error(err);
  });
```
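The result is as follows.

A brief explanation of the APIs used here:
- puppeteer.launch([object]):
  
  puppeteer.launch([object]) creates a Browser object. It takes an optional options object for configuration.
  
  The fields you can set include defaultViewport (default viewport size), ignoreHTTPSErrors (whether to ignore HTTPS errors during navigation), timeout (how long to wait for the browser to start), and so on.
- browser.newPage()
  
  browser.newPage() creates a new Page object, which opens a new tab in the browser.
- page.goto(url[, options])
  
  Navigates the page to the given url. It also takes an optional options object for configuration.
  
  The fields you can set include timeout (how long to wait for the navigation) and waitUntil (the condition under which navigation counts as finished; the default is load).
  
  By this demo's logic, however, the page only counts as loaded once the second redirect to passport.weibo.com/visitor/vis… has happened and rendering is complete. That second redirect is done deliberately by the site, so a direct request to Weibo Novelty returns status code 200 rather than a 3xx code before the redirect has finished, which makes it hard to tell whether the redirect is done.
- page.$$eval(selector, pageFunction[, ...args])
  
  selector is a selector such as '.class', '#id' or 'a[href]'.
  
  pageFunction is the function to run in the browser (page) context.
  
  ...args are the arguments to pass to pageFunction.
  
  Its effect is equivalent to running Array.from(document.querySelectorAll(selector)) in the page, then calling pageFunction with the array of matched elements as its first argument. The value it returns is whatever pageFunction returns.
  
  page.evaluate(pageFunction) does roughly the same job and is more flexible.
- browser.close()
  
  Nothing much to say here: it closes the browser. Chrome does eat a fair amount of memory, after all; if you have memory to burn, forget I said anything.
  
  See the documentation for more details.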
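
The fixed sleep in the Hello World example works, but what really shows that the second redirect has finished is the list appearing in the DOM. One alternative is therefore to wait for that element instead of a fixed time. This is only a minimal sketch: `firstTitle` is a hypothetical helper name, `page` is assumed to be a Puppeteer Page object, and the selector assumes the markup used in the example above.

```javascript
// Hypothetical helper: wait until the list is rendered, then read the first title.
// `page` is expected to be a Puppeteer Page (for example from browser.newPage()).
async function firstTitle(page, timeout = 60000) {
  const selector = '.UG_list_b .list_des .list_title_b a';
  // Resolves as soon as a matching element exists in the DOM;
  // rejects if `timeout` milliseconds pass first.
  await page.waitForSelector(selector, { timeout });
  // page.$eval runs document.querySelector(selector) inside the page and
  // passes the matched element to the callback.
  return page.$eval(selector, (a) => a.innerText);
}
```

In getindex, the sleep call and the $$eval call could then be replaced by something like `const title = await firstTitle(page);`, which returns as soon as the list shows up rather than after a full minute.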

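Real-world use

The requirement: scrape the title, cover image, author, time and other information from the 微博新鮮事 (Weibo Novelty) page.

Also scrape the page you reach by clicking into each topic, including the categories on the category timeline on its left and all the Weibo posts under each category, with post text, author, time, repost count, comment count and like count.

In addition, scrape the page of every post in each topic, including the post's author, its repost, comment and like counts, and all the comments under the post, with each commenter's avatar, nickname, comment text and like count.

And save all of it as JSON files.

Analysis

The scraped URLs need to be kept temporarily and then iterated over, saving the information scraped from each one.

Weibo's obstacles don't stop at the second redirect either. It will randomly send you to a "you are visiting too frequently, try again in 24 hours" page, randomly redirect you to weibo.com/login.php when you are not logged in, and sometimes return a 504. All of these can be handled by reading the current address with page.url() and comparing it with the expected one.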
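
The URL comparison described above can be sketched as a small pure function. The name `classifyLanding` and the returned labels are made up for illustration; only the login URL is taken from the text, and the address of the "too frequent" page is deliberately not guessed.

```javascript
// Hypothetical helper: decide what happened based on where the browser ended up.
// currentUrl would come from page.url(); targetUrl is the address we asked for.
function classifyLanding(currentUrl, targetUrl) {
  if (currentUrl.includes(targetUrl)) return 'ok';                 // reached the requested page
  if (currentUrl.includes('weibo.com/login.php')) return 'login';  // bounced to the login page
  return 'retry';                                                  // anything else: try again
}

const target = 'https://weibo.com/?category=novelty';
console.log(classifyLanding('https://weibo.com/?category=novelty', target)); // ok
console.log(classifyLanding('https://weibo.com/login.php', target));         // login
```

Anything that is not 'ok' can then be retried, which is what the crawler function further down does with a single indexOf check.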

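The only tedious part is that the page types are complicated and fiddly: topic pages have four-image, one-image, text-only and video entries, each with a different DOM structure, and post pages vary slightly too. None of that is technically hard, though.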
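
Because an element may simply be missing in some of these layouts (a text-only entry has no image, for instance), one way to keep the selector functions from throwing is to route every lookup through a null-safe helper. The sketch below is an assumption-laden illustration: `spiderCard`, `text` and `attr` are hypothetical names, and the two selectors are borrowed from the code that follows. The helpers are defined inside the page function because Puppeteer serializes the function passed to page.evaluate and runs it in the browser, where Node-side variables are not visible.

```javascript
// Sketch of a page function with null-safe lookups.
// When run through page.evaluate, `root` defaults to the page's document.
const spiderCard = (root = document) => {
  // Return the element's text, or '' when the selector matches nothing.
  const text = (sel) => {
    const el = root.querySelector(sel);
    return el ? el.innerText : '';
  };
  // Return the attribute value, or '' when the element is missing.
  const attr = (sel, name) => {
    const el = root.querySelector(sel);
    return el ? el.getAttribute(name) : '';
  };
  return {
    content: text('.list_des h3'),
    img: attr('.pic img', 'src'), // '' for text-only entries
  };
};
```

It would be called as `page.evaluate(spiderCard)`; passing an explicit root is only useful for trying the function outside a browser.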

The actual code

It's a small project, so I won't split it into directories. Here is the code.

const puppeteer = require('puppeteer');
const fs = require('fs');
const baseurl = 'https://weibo.com';
const Dir = './data/';

const sleep = (time) => new Promise((resolve, reject) => {
  setTimeout(() => {
    resolve(true);
  }, time);
})

async function doSpider(url, pageFunction) { // crawler function
  const browser = await puppeteer.launch({
    headless: false
  });
  const page = await browser.newPage();
  await page.goto(url, {
    timeout: 100000
  });
  await sleep(60321); // wait for the second redirect to finish
  if (page.url().indexOf(url) === -1) { // we ended up somewhere else (login page, rate-limit page, ...)
    console.log('+------------------------------------');
    console.log('|Failed, current page: ' + page.url());
    console.log('|Navigating again: ' + url);
    console.log('+------------------------------------');
    await browser.close();
    return doSpider(url, pageFunction); // retry with a fresh browser
  }
  const data = await page.evaluate(pageFunction, url); // run the selector function inside the page
  await browser.close();
  return data;
}

async function runPromiseByQueue(myPromises) {  // use Array.prototype.reduce to put the crawler results in order
  return await myPromises.reduce(
    async (previousPromise, nextPromise) => previousPromise.then(await nextPromise),
    Promise.resolve([])
  );
}

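function saveLocalData(name, data) { // write the data to a JSON file
  fs.mkdirSync(Dir, { recursive: true }); // make sure ./data/ exists before writing
  return fs.promises.writeFile(Dir + `${name}.json`, JSON.stringify(data), 'utf-8')
    .then(() => console.log(`${name}.json saved!`))
    .catch(err => console.error(`Failed to save ${name}.json:`, err));
}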

const spiderIndex = () => { // selectors for the Novelty page
  const lists = document.querySelectorAll('.UG_list_b');
  var newarr = [Array.from(lists)[0], Array.from(lists)[1]]
  return newarr.map(node => {
    const title = node.querySelector('.list_des .list_title_b a').innerText;
    const picUrl = node.querySelector('.pic.W_piccut_v img').getAttribute('src');
    const newUrl = node.querySelector('.list_des .list_title_b a').getAttribute('href');
    const userImg = node.querySelector('.subinfo_box a .subinfo_face img').getAttribute('src');
    const userName = node.querySelector('.subinfo_box a .subinfo').innerText;
    const time = node.querySelector('.subinfo_box>.subinfo.S_txt2').innerText;
    return {
      title,
      picUrl,
      newUrl,
      userImg,
      userName,
      time
    }
  })
}

const spiderTopic = (url) => {  // selectors for a topic page
  const picUrl = document.querySelector('.UG_list_e .list_nod .pic img').getAttribute('src');
  const topicTitle = document.querySelector('.UG_list_e .list_title').innerText;
  const des = document.querySelector('.UG_list_e .list_nod .list_des').innerText;
  const userImg = document.querySelector('.UG_list_e .list_nod .subinfo_box a .subinfo_face img').getAttribute('src');
  const userName = document.querySelector('.UG_list_e .list_nod .subinfo_box a .subinfo').innerText;
  const time = document.querySelector('.UG_list_e .list_nod .subinfo_box>.subinfo.S_txt2').innerText;
  const lists = document.querySelectorAll('.UG_content_row');
  const types = Array.from(lists).map(node => {
    const title = node.querySelector('.UG_row_title').innerText;

    const v2list = node.querySelectorAll('.UG_list_v2 .list_des');

    var v2Item = Array.from(v2list).map(node => {
      const content = node.querySelector('h3').innerText;
      const userImg = node.querySelector('.subinfo_box a .subinfo_face img').getAttribute('src');
      const userName = node.querySelector('.subinfo_box a .subinfo').innerText;
      const time = node.querySelector('.subinfo_box>.subinfo.S_txt2').innerText;
      const like = node.querySelector('.subinfo_box.subinfo_box_btm .subinfo_rgt:nth-of-type(1) em:nth-of-type(2)').innerText;
      const comment = node.querySelector('.subinfo_box.subinfo_box_btm .subinfo_rgt:nth-of-type(3) em:nth-of-type(2)').innerText;
      const relay = node.querySelector('.subinfo_box.subinfo_box_btm .subinfo_rgt:nth-of-type(5) em:nth-of-type(2)').innerText;
      const newUrl = node.getAttribute('href');
      return {
        content,
        userImg,
        userName,
        time,
        like,
        comment,
        relay,
        newUrl,
        img: []
      }
    })

    const alist = node.querySelectorAll('.UG_list_a');

    var aItem = Array.from(alist).map(node => {
      const content = node.querySelector('h3').innerText;
      const newUrl = node.getAttribute('href');
      const img1 = node.querySelector('.list_nod .pic:nth-of-type(1) img').getAttribute('src');
      const img2 = node.querySelector('.list_nod .pic:nth-of-type(2) img').getAttribute('src');
      const img3 = node.querySelector('.list_nod .pic:nth-of-type(3) img').getAttribute('src');
      const img4 = node.querySelector('.list_nod .pic:nth-of-type(4) img').getAttribute('src');
      const userImg = node.querySelector('.subinfo_box a .subinfo_face img').getAttribute('src');
      const userName = node.querySelector('.subinfo_box a .subinfo').innerText;
      const time = node.querySelector('.subinfo_box>.subinfo.S_txt2').innerText;
      const like = node.querySelector('.subinfo_box .subinfo_rgt:nth-of-type(2) em:nth-of-type(2)').innerText;
      const comment = node.querySelector('.subinfo_box .subinfo_rgt:nth-of-type(4) em:nth-of-type(2)').innerText;
      const relay = node.querySelector('.subinfo_box .subinfo_rgt:nth-of-type(6) em:nth-of-type(2)').innerText;
      return {
        img: [img1, img2, img3, img4],
        content,
        userImg,
        userName,
        time,
        like,
        comment,
        relay,
        newUrl
      }
    })

    const blist = node.querySelectorAll('.UG_list_b');

    var bItem = Array.from(blist).map(node => {
      const content = node.querySelector('.list_des h3').innerText;
      const newUrl = node.getAttribute('href');
      var img = ''
      if (node.querySelector('.pic img') != null) {
        img = node.querySelector('.pic img').getAttribute('src');
      }
      const userImg = node.querySelector('.list_des .subinfo_box a .subinfo_face img').getAttribute('src');
      const userName = node.querySelector('.list_des .subinfo_box a .subinfo').innerText;
      const time = node.querySelector('.list_des .subinfo_box>.subinfo.S_txt2').innerText;
      const like = node.querySelector('.list_des .subinfo_box .subinfo_rgt:nth-of-type(2) em:nth-of-type(2)').innerText;
      const comment = node.querySelector('.list_des .subinfo_box .subinfo_rgt:nth-of-type(4) em:nth-of-type(2)').innerText;
      const relay = node.querySelector('.list_des .subinfo_box .subinfo_rgt:nth-of-type(6) em:nth-of-type(2)').innerText;
      return {
        img: [img],
        content,
        userImg,
        userName,
        time,
        like,
        comment,
        relay,
        newUrl
      }
    })
    return {
      title,
      list: [...v2Item, ...aItem, ...bItem]
    }
  })
  return {
    topicTitle,
    picUrl,
    des,
    userImg,
    userName,
    time,
    newUrl: url,
    types
  }
}

const spiderPage = (url) => {   // selectors for a post page
  var data = {}
  const content = document.querySelector('.WB_text.W_f14').innerText;
  const piclist = document.querySelectorAll('.S_bg1.S_line2.bigcursor.WB_pic');
  const picUrls = Array.from(piclist).map(node => {
    const picUrl = node.querySelector('img').getAttribute('src');
    return picUrl;
  })
  const time = document.querySelector('.WB_detail>.WB_from.S_txt2>a').innerText;
  const userImg = document.querySelector('.WB_face>.face>a>img').getAttribute('src');
  const userName = document.querySelector('.WB_info>a').innerText;
  const like = document.querySelector('.WB_row_line>li:nth-of-type(4) em:nth-of-type(2)').innerText;
  const comment = document.querySelector('.WB_row_line>li:nth-of-type(3) em:nth-of-type(2)').innerText;
  const relay = document.querySelector('.WB_row_line>li:nth-of-type(2) em:nth-of-type(2)').innerText;

  const lists = document.querySelectorAll('.list_box>.list_ul>.list_li[comment_id]');
  const commentItems = Array.from(lists).map(node => {
    const userImg = node.querySelector('.WB_face>a>img').getAttribute('src');
    const userName = node.querySelector('.list_con>.WB_text>a[usercard]').innerText;

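    // The comment text has the form "name：content". Drop the name and keep the rest,
    // re-joining with "：" in case the content itself contains that character.
    const [, ...contentkey] = node.querySelector('.list_con>.WB_text').innerText.split('：');
    const content = contentkey.join('：');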
    const time = node.querySelector('.WB_func>.WB_from').innerText
    const like = node.querySelector('.list_con>.WB_func [node-type=like_status]>em:nth-of-type(2)').innerText

    return {
      userImg,
      userName,
      content,
      time,
      like
    }
  })
  return data = {
    newUrl: url,
    content,
    picUrls,
    time,
    userImg,
    userName,
    like,
    comment,
    relay,
    commentItems
  }
}

function start() {  // main flow; leaving it unwrapped looked too ugly
  doSpider(baseurl + '/?category=novelty', spiderIndex) // crawl the Novelty page
    .then(async data => {
      await saveLocalData('Index', data); // save the Novelty page data as the Index file
      return data.map(item => item.newUrl); // collect the urls to crawl next
    })
    .then(async (urls) => {
      let doSpiders = urls.map(async url => { // start crawling every url; each promise resolves to a function used for ordering later
        if (url[1] === '/') { // some urls only lack the protocol rather than the baseurl, so handle them separately
          let newdata = await doSpider('https:' + url, spiderTopic);
          return async (data) => [...data, newdata]
        } else {
          let newdata = await doSpider(baseurl + url, spiderTopic);
          return async (data) => [...data, newdata]
        }
      })
      let datas = await runPromiseByQueue(doSpiders); // fill the reserved slots in order
      return datas;
    })
    .then(async data => {
      saveLocalData('Topic', data); // save the data to a file
      let list = [];
      data.forEach(item => // collect every url to crawl next
        item.types.forEach(type =>
          type.list.forEach(item =>
            list.push(item.newUrl)
          )
        )
      )
      return list;
    })
    .then(async (urls) => {
      let doSpiders = urls.map(async url => { // same preparation as above
        if (url[1] === '/') {
          let newdata = await doSpider('https:' + url, spiderPage);
          return (data) => [...data, newdata]
        } else {
          let newdata = await doSpider(baseurl + url, spiderPage);
          return (data) => [...data, newdata]
        }
      })
      let datas = await runPromiseByQueue(doSpiders); // same crawl as above, with results ordered by the reserved slots
      return datas;
    })
    .then(async data => {
      saveLocalData('Page', data); // save the data to a file
    })
    .catch(err => {
      console.error(err); // surface any failure in the chain
    })
}

start();

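A brief note on how it runs:

The run can be split into three stages: crawl the Novelty page to get its information, including the URLs of several topic pages, and save it; crawl the topic pages concurrently to get their information, including the URLs of several post pages, and save it; crawl the post pages concurrently to get their information, including all comments, and save it. The saved newUrl field can then be used to match parent and child records and link the data together.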
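
As an illustration of matching records through newUrl, here is a minimal sketch that attaches each topic to the index entry it was crawled from. `toAbsolute` and `linkTopics` are hypothetical names and the sample records are made up; only the field names and the URL rule (protocol-relative links get "https:", the rest get the base URL) come from the code above.

```javascript
const baseurl = 'https://weibo.com';

// Mirrors the rule used in start(): protocol-relative urls get "https:",
// everything else is appended to baseurl.
const toAbsolute = (url) => (url.startsWith('//') ? 'https:' + url : baseurl + url);

// Attach each topic to the index entry whose normalized newUrl it was crawled from.
function linkTopics(indexItems, topics) {
  const byUrl = new Map(topics.map((t) => [t.newUrl, t]));
  return indexItems.map((item) => ({
    ...item,
    topic: byUrl.get(toAbsolute(item.newUrl)) || null,
  }));
}

// Made-up sample data in the shape the crawler produces.
const index = [{ title: 'A', newUrl: '//weibo.com/a' }, { title: 'B', newUrl: '/b' }];
const topics = [{ topicTitle: 'B!', newUrl: 'https://weibo.com/b', types: [] }];
console.log(linkTopics(index, topics).map((x) => x.topic && x.topic.topicTitle));
// → [ null, 'B!' ]
```

The normalization step matters because the index file stores the link as it appears in the page, while the topic file stores the full address that was actually visited.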

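I originally wrote a version that chained every asynchronous crawl one after another, but it was far too slow. So I switched to starting the crawls concurrently and using the array reduce method to reserve slots and put the results back in order, which sped things up considerably (honestly, reduce is really handy).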
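
To see why this is concurrent and still ordered, note that map starts every task right away and reduce only assembles the results in input order. The sketch below reproduces the pattern with timers standing in for the browser (`fakeSpider` is a made-up stand-in for doSpider) and compares it with Promise.all, which also returns results in input order regardless of which promise settles first.

```javascript
// Stand-in for doSpider: resolves with `value` after `ms` milliseconds.
const fakeSpider = (value, ms) =>
  new Promise((resolve) => setTimeout(() => resolve(value), ms));

// The pattern from the article: every task is started by map(), and each
// promise resolves to a function that appends its result to the accumulator.
const tasks = [['a', 30], ['b', 10], ['c', 20]].map(async ([value, ms]) => {
  const result = await fakeSpider(value, ms);
  return (acc) => [...acc, result];
});

// reduce chains the append functions in input order.
const viaReduce = tasks.reduce(
  async (prev, next) => prev.then(await next),
  Promise.resolve([])
);

// Promise.all keeps input order too.
const viaAll = Promise.all([fakeSpider('a', 30), fakeSpider('b', 10), fakeSpider('c', 20)]);

Promise.all([viaReduce, viaAll]).then(([r1, r2]) => console.log(r1, r2));
// both arrays are [ 'a', 'b', 'c' ]
```

So for collecting results, Promise.all over the crawl promises would be a shorter way to get the same ordering.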

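Below are the console output and the three scraped data files (they have too many lines to read comfortably, so I formatted them and took screenshots):

This code only crawls the first two topics in Novelty, 13 pages in total, not every item in the list. To crawl every item, just change the value of newarr in spiderIndex, for example var newarr = Array.from(lists). The rest is up to time... I suggest taking a nap and enjoying life; it may well be done when you get up. There is another possibility, though: the Big-Eyed Monster (Sina) sees too many requests from you and temporarily blocks your IP.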
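
When crawling every item, launching one browser per URL all at once gets heavy, and a burst of requests may make a block more likely. One possible mitigation, which the code above does not do, is to cap how many crawls run at the same time. The following is a sketch only; `runWithLimit` is a hypothetical helper and the right limit is a guess you would have to tune.

```javascript
// Hypothetical helper: run async task functions with at most `limit` in flight,
// returning the results in the same order as `tasks`.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // claim the next index (safe: JavaScript runs this on one thread)
      results[i] = await tasks[i]();
    }
  }
  // Start up to `limit` workers; each keeps pulling tasks until none are left.
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}
```

With the crawler above it could be used along the lines of `runWithLimit(urls.map(url => () => doSpider(url, spiderPage)), 3)`, where each entry is a function that starts a crawl only when a worker picks it up.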

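Finally, here is the GitHub repo; the code and data files are all there.

Final words

I'm about to start my senior year. So far I've written H5, mini programs, Vue, React, Node and Java, and I pick up whatever I come across: design patterns, functional programming, lazy loading and other odds and ends. I'm actually quite happy I went into programming, and building things step by step feels good. I'll probably look for an internship during autumn recruitment, so I'd like to ask the experts here whether there is anything to watch out for before starting one; preparing for the first time, I don't quite know where to begin.

If you spot any mistakes, please do point them out so we can learn from each other; I'm still a beginner, after all. And if it's not too much trouble, could you leave a like? Thanks.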