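I first read the article "Using puppeteer on the front end to crawl 《React.js 小書》, generate PDFs, and merge them", and found that the final PDF had no bookmarks, which was frustrating. This post builds on that approach and mainly adds bookmark support.
The example site being crawled is React.js 小書; this is for learning and exchange only.
Use puppeteer to crawl web pages and generate PDFs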
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com', {waitUntil: 'networkidle2'});
  await page.pdf({path: 'hn.pdf', format: 'A4'});
  await browser.close();
})();
pdf-merge: merges PDFs
Depends on pdftk
pdftk: a tool for manipulating PDFs
Use update_info_utf8 to add bookmarks to a PDF:
pdftk 'd:\OpenSource\My\genpfdforrsb\React 小書(無書籤).pdf' update_info_utf8 'd:\OpenSource\My\genpfdforrsb\bookmarks.txt' output 'd:\OpenSource\My\genpfdforrsb\React 小書.pdf'
The info file passed to update_info_utf8 here is bookmarks.txt.
Bookmark format:
BookmarkBegin
BookmarkTitle: PDF Reference (Version 1.5)
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Contents
BookmarkLevel: 2
BookmarkPageNumber: 3
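The format above is regular enough to generate from data. The following is a minimal sketch, not the code used in this post: `formatBookmarks` and the `{ title, level, page }` entry shape are hypothetical names introduced for illustration. It serializes a list of entries into the four-line records shown above, using the same sample entries.

```javascript
// Hypothetical helper (not part of pdftk or pdf-merge): turn bookmark entries
// into the text format that pdftk reads for update_info_utf8.
function formatBookmarks(entries) {
  return entries
    .map(({ title, level, page }) =>
      [
        'BookmarkBegin',
        `BookmarkTitle: ${title}`,
        `BookmarkLevel: ${level}`,      // 1 = top level, 2 = nested under the previous level-1 entry
        `BookmarkPageNumber: ${page}`,  // 1-based page in the final PDF
      ].join('\n') + '\n'
    )
    .join('');
}

// The two sample entries from the format shown above.
const sampleBookmarkText = formatBookmarks([
  { title: 'PDF Reference (Version 1.5)', level: 1, page: 1 },
  { title: 'Contents', level: 2, page: 3 },
]);
console.log(sampleBookmarkText);
```

Keeping the entries as plain objects makes it easy to add nested levels later without touching the string-building code.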
pdfjs-dist: gets the page count of each individual PDF, which is used to set the page numbers in bookmarks.txt
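// result is assumed to hold the documents loaded with pdfjs-dist, one per PDF;
// titleArr holds the matching titles in the same order (both defined elsewhere).
const pageArr = result.map(c => c.numPages);
let pageIndex = 1; // the first bookmark points at page 1 of the merged PDF
let txt = '';
for (let index = 0; index < pageArr.length; index++) {
  const temp = `BookmarkBegin\r\nBookmarkTitle: ${titleArr[index]}\r\nBookmarkLevel: 1\r\nBookmarkPageNumber: ${pageIndex}\r\n`;
  txt += temp;
  // the next bookmark starts right after the pages of the current PDF
  pageIndex += pageArr[index];
}
fs.writeFileSync('bookmarks.txt', txt);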
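The page arithmetic in that loop is a running sum: each PDF starts on the page after the previous one ends. As a small worked sketch (the helper name `startPages` and the sample page counts are made up for illustration), three PDFs of 3, 5, and 2 pages get bookmarks on pages 1, 4, and 9:

```javascript
// Hypothetical helper: given the page count of each PDF, in merge order,
// return the 1-based page on which each one starts in the merged file.
function startPages(pageCounts) {
  const starts = [];
  let next = 1; // bookmark page numbers are 1-based
  for (const count of pageCounts) {
    starts.push(next);
    next += count;
  }
  return starts;
}

const chapterStarts = startPages([3, 5, 2]);
console.log(chapterStarts); // [ 1, 4, 9 ]
```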
Following the pdf-merge source code, add a runshell.js that runs pdftk commands from node.
runshell.js is as follows:
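'use strict';

const child = require('child_process');
const Promise = require('bluebird');

// Promise-returning version of child_process.exec
const exec = Promise.promisify(child.exec);

// Run a shell command; resolves when it succeeds, rejects when it fails
module.exports = (scripts) => new Promise((resolve, reject) => {
  exec(scripts)
    .then(resolve)
    .catch(reject);
});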
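As an alternative sketch, not what pdf-merge or this post uses, the same step could be done with Node's built-in `util.promisify` and `child_process.execFile` instead of bluebird and `exec`. Because `execFile` takes its arguments as an array and does not go through a shell, file names with spaces need no manual quoting. The names `buildPdftkArgs` and `addBookmarks` are hypothetical, and the sketch assumes `pdftk` is on the PATH.

```javascript
'use strict';
const { execFile } = require('child_process');
const { promisify } = require('util');

const execFileAsync = promisify(execFile);

// Hypothetical helper: the argument list for
//   pdftk <input> update_info_utf8 <infoFile> output <output>
function buildPdftkArgs(input, infoFile, output) {
  return [input, 'update_info_utf8', infoFile, 'output', output];
}

// Hypothetical helper: run pdftk directly, without a shell.
// Assumes the pdftk executable can be found on the PATH.
function addBookmarks(input, infoFile, output) {
  return execFileAsync('pdftk', buildPdftkArgs(input, infoFile, output));
}

// Only the argument list is built here; calling addBookmarks(...) would run pdftk.
const examplePdftkArgs = buildPdftkArgs(
  'React 小書(無書籤).pdf',
  'bookmarks.txt',
  'React 小書.pdf'
);
console.log(examplePdftkArgs);
```

The trade-off is that `execFile` runs exactly one program, so it cannot execute arbitrary shell command strings the way runshell.js can.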
Run pdftk update_info_utf8
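const nobkname = 'React 小書(無書籤).pdf';
const hasbkname = 'React 小書.pdf';

mergepdf(nobkname)
  .then(buffer => {
    console.log('starting add bookmarks!');
    return runshell(`pdftk "${__dirname}/${nobkname}" update_info_utf8 "${__dirname}/bookmarks.txt" output "${__dirname}/${hasbkname}"`);
  })
  .then(() => {
    console.log('completed add bookmarks!');
    // remove the intermediate files once the bookmarked PDF has been written
    fs.unlinkSync(`${__dirname}/${nobkname}`);
    fs.unlinkSync(`${__dirname}/bookmarks.txt`);
    console.log('all completed!');
  })
  .catch(err => {
    // surface merge or pdftk failures instead of leaving the rejection unhandled
    console.error('failed to add bookmarks:', err);
  });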
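Source code: genpfdforrsb
The page numbers printed in the merged PDF are not continuous; each part still carries the page numbers of its original individual PDF.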