A Simple Puppeteer Example

Date: 2020-03-04

Tools and resources

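QQ group - Advanced JavaScript Crawlers (832946826) - created by the author; everyone is welcome to join!
awesome-java-crawler - crawler-related tools and resources collected by the author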

Preface

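This script scrapes information on completed books in every category of the male and female channels of the iReader (掌閱) bookstore, sorted by rating, taking only the first three pages of each category.
The page has no anti-scraping measures, which makes it a good simple example.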

Rough development workflow:

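Manually analyze the page and work out the URL and its key parameters, such as pagination and category
Manually analyze the page content and verify the data-extraction approach in the browser console
Write the code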
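The first step of the workflow can be sketched in a few lines of Node. This is only an illustration: the sample URL below is assembled from the template and constants used in the script, and the reading of order=score as "sort by rating" and status=3 as "completed" is inferred from the script's stated purpose, not confirmed by the site.

```javascript
// Sample listing URL, built from the script's URL template with pid=10, cid=11, page=2.
const sample = "http://www.ireader.com/index.php?ca=booksort.index&pca=booksort.index&pid=10&order=score&status=3&cid=11&page=2";

const parsed = new URL(sample);

// Collect every query parameter into a plain object for inspection.
const params = Object.fromEntries(parsed.searchParams.entries());
console.log(params);

// These three are the parameters the crawler varies; the others stay fixed.
const { pid, cid, page } = params;
console.log({ pid, cid, page });
```

Comparing the output for a few different links on the site is one way to see which parameters change between channels, categories and pages.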

Notes on the code:

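The two constant arrays near the top, pids and cids, were collected beforehand by inspecting the hyperlinks on the page.
The function f used for data extraction is actually converted to a string and executed in the browser via page.evaluate, not in the Node environment. Because both sides use the same language, it can be written directly alongside the Node source and benefits from IDE support, whereas in other languages the JS code can only exist as a string.
puppeteer's page.evaluate passes the object returned by the browser-side script straight back to the Node side (as long as it is serializable), which is very convenient.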
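The practical consequence of f running in the browser is that it cannot see variables from the surrounding Node code. The sketch below is an analogy only: it uses Node's built-in vm module as a stand-in for the browser, and evaluateLike is a hypothetical helper, not Puppeteer's implementation. It stringifies a function, runs it in a separate context, and returns the result as JSON.

```javascript
const vm = require("vm");

const secret = "only visible in Node";

// Written like the article's f: it uses only names that exist on the "browser" side.
const extract = () => items.map(s => ({ title: s.trim() }));

// Relies on a Node-side closure variable, which will not exist in the other context.
const leaky = () => secret;

// Hypothetical stand-in for page.evaluate: stringify the function, run it in a
// separate context, and pass the result back as JSON.
function evaluateLike(fn, globals) {
    const source = "JSON.stringify((" + fn.toString() + ")())";
    const json = vm.runInNewContext(source, globals);
    return JSON.parse(json);
}

const rows = evaluateLike(extract, { items: [" A ", "B "] });
console.log(rows);

let leakError = null;
try {
    evaluateLike(leaky, {});
} catch (e) {
    leakError = e.name; // the closure variable is not defined in the new context
}
console.log(leakError);
```

In real Puppeteer code, values needed on the browser side are passed as extra arguments to page.evaluate instead of being captured from the enclosing scope.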

Source code

const fs = require("fs")
const puppeteer = require('puppeteer');

const url = "http://www.ireader.com/index.php?ca=booksort.index&pca=booksort.index&pid=$pid&order=score&status=3&cid=$cid&page=$page"
const pids = [10, 68]; // male channel, female channel
const cids = [[11, 27, 19, 22, 16, 39, 42, 50, 54, 57, 60], [69, 74, 82, 86, 89, 90, 91, 723]]; // category IDs within each channel

(async () => {
    const browser = await puppeteer.launch({ // launch the Chrome browser
        // headless: false, // headless toggle: debug in headed mode first, then switch to headless for efficiency
        ignoreDefaultArgs: ["--enable-automation"], // drop --enable-automation from Chrome's launch arguments
    });
    const page = await browser.newPage();
    const f = () => {
        return Array.from($('.bookMation')).map(e => {
            const id = $('h3 a', e).attr('href').match(/bid=(\d+)/)[1] // extract the bid from the link with a regex
            const title = $('h3 a', e).text()
            const author = $('p.tryread', e).text().replace(/試讀|试读/, '').trim() // strip the "try reading" label (Traditional or Simplified form)
            const desc = $('p.introduce', e).text()
            return {id, title, author, desc}
        })
    }
    let result = [];
    for (const i in pids) {
        const pid = pids[i]
        for (const cid of cids[i]) { // declare cid to avoid creating an implicit global
            for (let pg = 1; pg < 4; pg++) { // only the first three pages
                const u = url.replace("$cid", cid).replace("$pid", pid).replace("$page", pg)
                await page.goto(u);
                const res = await page.evaluate(f)
                res.forEach(e => { e.cid = cid; e.pid = pid })
                result = result.concat(res)
                console.log("page " + pg + " done")
            }
            console.log("cid " + cid + " done")
        }
        console.log("pid " + pid + " done")
    }
    fs.writeFileSync("d:/tmp/ireader_hot.json", JSON.stringify(result), {encoding: "utf-8"})
    console.log("all done")
    await browser.close(); // close the browser
})();
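The script builds each URL with chained string replacement inside the crawl loop. As an alternative sketch (not the author's approach), the list of pages to visit can be generated up front, which makes it easy to check how many requests a run will make. buildUrl and tasks are hypothetical names; the parameters and ID arrays are the ones from the script.

```javascript
const pids = [10, 68]; // male channel, female channel
const cids = [[11, 27, 19, 22, 16, 39, 42, 50, 54, 57, 60], [69, 74, 82, 86, 89, 90, 91, 723]];

// Hypothetical helper: build one listing URL with URLSearchParams instead of string replacement.
function buildUrl(pid, cid, page) {
    const u = new URL("http://www.ireader.com/index.php");
    u.search = new URLSearchParams({
        ca: "booksort.index",
        pca: "booksort.index",
        pid: String(pid),
        order: "score",
        status: "3",
        cid: String(cid),
        page: String(page),
    }).toString();
    return u.toString();
}

// Enumerate every (channel, category, page) combination before any browser work starts.
const tasks = [];
pids.forEach((pid, i) => {
    for (const cid of cids[i]) {
        for (let page = 1; page <= 3; page++) {
            tasks.push({ pid, cid, page, url: buildUrl(pid, cid, page) });
        }
    }
});

console.log(tasks.length);
console.log(tasks[0].url);
```

The crawl loop then reduces to iterating over tasks and calling page.goto(task.url) followed by page.evaluate(f) for each entry.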
