使用 Headless Chrome 進行頁面渲染

使用 Headless Chrome 進行頁面渲染 從屬於筆者的 Web 開發基礎與工程實踐系列文章,主要介紹了使用 Node.js 利用 Chrome Remote Protocol 遠程控制 Headless Chrome 渲染界面的基礎用法。本文涉及的參考與引用資料統一列舉在這裏node

近日筆者在爲 declarative-crawler 編寫動態頁面的蜘蛛,即在使用 declarative-crawler 爬取知乎美圖 一文中介紹的 HeadlessChromeSpider 時,須要選擇某個無界面瀏覽器以執行 JavaScript 代碼來動態生成頁面。以前筆者每每是使用 PhantomJS 或者 Selenium 執行動態頁面渲染,而在 Chrome 59 以後 Chrome 提供了 Headless 模式,其容許在命令行中使用 Chromium 以及 Blink 渲染引擎提供的完整的現代 Web 平臺特性。須要注意的是,Headless Chrome 仍然存在必定的侷限,相較於 Nightmare 或 Phantom 這樣的工具, Chrome 的遠程接口仍然沒法提供較好的開發者體驗。咱們在下文介紹的代碼示例中也會發現,目前咱們仍須要大量的模板代碼進行控制。linux

安裝與啓動

在 Chrome 安裝完畢後咱們能夠利用其包體內自帶的命令行工具啓動:git

$ chrome --headless --remote-debugging-port=9222 https://chromium.org

筆者爲了部署方便,使用 Docker 鏡像來進行快速部署,若是你本地存在 Docker 環境,能夠使用以下命令快速啓動:github

docker run -d -p 9222:9222 justinribeiro/chrome-headless

若是是在 Mac 下本地使用的話咱們還能夠建立命令別名:chrome

alias chrome="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome"
alias chrome-canary="/Applications/Google\ Chrome\ Canary.app/Contents/MacOS/Google\ Chrome\ Canary"
alias chromium="/Applications/Chromium.app/Contents/MacOS/Chromium"

若是是在 Ubuntu 環境下咱們能夠使用 deb 進行安裝:docker

# Install Google Chrome
# https://askubuntu.com/questions/79280/how-to-install-chrome-browser-properly-via-command-line
sudo apt-get install libxss1 libappindicator1 libindicator7
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome*.deb  # Might show "errors", fixed by next line
sudo apt-get install -f

chrome 命令行也支持豐富的命令行參數,--dump-dom 參數能夠將 document.body.innerHTML 打印到標準輸出中:npm

chrome --headless --disable-gpu --dump-dom https://www.chromestatus.com/

--print-to-pdf 標識則會將網頁輸出位 PDF:ubuntu

chrome --headless --disable-gpu --print-to-pdf https://www.chromestatus.com/

初次以外,咱們也能夠使用 --screenshot 參數來獲取頁面截圖:瀏覽器

chrome --headless --disable-gpu --screenshot https://www.chromestatus.com/

# Size of a standard letterhead.
chrome --headless --disable-gpu --screenshot --window-size=1280,1696 https://www.chromestatus.com/

# Nexus 5x
chrome --headless --disable-gpu --screenshot --window-size=412,732 https://www.chromestatus.com/

若是咱們須要更復雜的截圖策略,譬如進行完整頁面截圖則須要利用代碼進行遠程控制。網絡

代碼控制

啓動

在上文中咱們介紹瞭如何利用命令行來手動啓動 Chrome,這裏咱們嘗試使用 Node.js 來啓動 Chrome,最簡單的方式就是使用 child_process 來啓動:

const exec = require('child_process').exec;

function launchHeadlessChrome(url, callback) {
  // Assuming MacOSx.
  const CHROME = '/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome';
  exec(`${CHROME} --headless --disable-gpu --remote-debugging-port=9222 ${url}`, callback);
}

launchHeadlessChrome('https://www.chromestatus.com', (err, stdout, stderr) => {
  ...
});

遠程控制

這裏咱們使用 chrome-remote-interface 來遠程控制 Chrome ,實際上 chrome-remote-interface 是對於 Chrome DevTools Protocol 的遠程封裝,咱們能夠參考協議文檔瞭解詳細的功能與參數。使用 npm 安裝完畢以後,咱們能夠用以下代碼片進行簡單控制:

const CDP = require('chrome-remote-interface');

CDP((client) => {
    // extract domains
    const {Network, Page} = client;
    // setup handlers
    Network.requestWillBeSent((params) => {
        console.log(params.request.url);
    });
    Page.loadEventFired(() => {
        client.close();
    });
    // enable events then start!
    Promise.all([
        Network.enable(),
        Page.enable()
    ]).then(() => {
        return Page.navigate({url: 'https://github.com'});
    }).catch((err) => {
        console.error(err);
        client.close();
    });
}).on('error', (err) => {
    // cannot connect to the remote endpoint
    console.error(err);
});

咱們也能夠使用 chrome-remote-interface 提供的命令行功能,譬如咱們能夠在命令行中訪問某個界面而且記錄全部的網絡請求:

$ chrome-remote-interface inspect
>>> Network.enable()
{ result: {} }
>>> Network.requestWillBeSent(params => params.request.url)
{ 'Network.requestWillBeSent': 'params => params.request.url' }
>>> Page.navigate({url: 'https://www.wikipedia.org'})
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/' }
{ result: { frameId: '5530.1' } }
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia_wordmark.png' }
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/Wikipedia-logo-v2.png' }
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/js/index-3b68787aa6.js' }
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/js/gt-ie9-c84bf66d33.js' }
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-bookshelf_icons.png?16ed124e8ca7c5ce9d463e8f99b2064427366360' }
{ 'Network.requestWillBeSent': 'https://www.wikipedia.org/portal/wikipedia.org/assets/img/sprite-project-logos.png?9afc01c5efe0a8fb6512c776955e2ad3eb48fbca' }

咱們也能夠直接查看內置的接口文檔:

>>> Page.navigate
{ [Function]
  category: 'command',
  parameters: { url: { type: 'string', description: 'URL to navigate the page to.' } },
  returns:
   [ { name: 'frameId',
       '$ref': 'FrameId',
       hidden: true,
       description: 'Frame id that will be navigated.' } ],
  description: 'Navigates current page to the given URL.',
  handlers: [ 'browser', 'renderer' ] }>>> Page.navigate
{ [Function]
  category: 'command',
  parameters: { url: { type: 'string', description: 'URL to navigate the page to.' } },
  returns:
   [ { name: 'frameId',
       '$ref': 'FrameId',
       hidden: true,
       description: 'Frame id that will be navigated.' } ],
  description: 'Navigates current page to the given URL.',
  handlers: [ 'browser', 'renderer' ] }

咱們在上文中還提到須要以代碼控制瀏覽器進行完整頁面截圖,這裏須要利用 Emulation 模塊控制頁面視口縮放:

const CDP = require('chrome-remote-interface');
const argv = require('minimist')(process.argv.slice(2));
const file = require('fs');

// CLI Args
const url = argv.url || 'https://www.google.com';
const format = argv.format === 'jpeg' ? 'jpeg' : 'png';
const viewportWidth = argv.viewportWidth || 1440;
const viewportHeight = argv.viewportHeight || 900;
const delay = argv.delay || 0;
const userAgent = argv.userAgent;
const fullPage = argv.full;

// Start the Chrome Debugging Protocol
CDP(async function(client) {
  // Extract used DevTools domains.
  const {DOM, Emulation, Network, Page, Runtime} = client;

  // Enable events on domains we are interested in.
  await Page.enable();
  await DOM.enable();
  await Network.enable();

  // If user agent override was specified, pass to Network domain
  if (userAgent) {
    await Network.setUserAgentOverride({userAgent});
  }

  // Set up viewport resolution, etc.
  const deviceMetrics = {
    width: viewportWidth,
    height: viewportHeight,
    deviceScaleFactor: 0,
    mobile: false,
    fitWindow: false,
  };
  await Emulation.setDeviceMetricsOverride(deviceMetrics);
  await Emulation.setVisibleSize({width: viewportWidth, height: viewportHeight});

  // Navigate to target page
  await Page.navigate({url});

  // Wait for page load event to take screenshot
  Page.loadEventFired(async () => {
    // If the `full` CLI option was passed, we need to measure the height of
    // the rendered page and use Emulation.setVisibleSize
    if (fullPage) {
      const {root: {nodeId: documentNodeId}} = await DOM.getDocument();
      const {nodeId: bodyNodeId} = await DOM.querySelector({
        selector: 'body',
        nodeId: documentNodeId,
      });
      const {model: {height}} = await DOM.getBoxModel({nodeId: bodyNodeId});

      await Emulation.setVisibleSize({width: viewportWidth, height: height});
      // This forceViewport call ensures that content outside the viewport is
      // rendered, otherwise it shows up as grey. Possibly a bug?
      await Emulation.forceViewport({x: 0, y: 0, scale: 1});
    }

    setTimeout(async function() {
      const screenshot = await Page.captureScreenshot({format});
      const buffer = new Buffer(screenshot.data, 'base64');
      file.writeFile('output.png', buffer, 'base64', function(err) {
        if (err) {
          console.error(err);
        } else {
          console.log('Screenshot saved');
        }
        client.close();
      });
    }, delay);
  });
}).on('error', err => {
  console.error('Cannot connect to browser:', err);
});
相關文章
相關標籤/搜索