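I recently needed to build a proxy server for scraping H5 games, which led me to this rather good article (http://blog.miguelgrinberg.com/post/easy-web-scraping-with-nodejs). Since I have also been learning Node.js lately, I decided to translate it to reinforce what I learned.
Please credit the source when reposting: (www.cnblogs.com/xdxer)
In this article I will show you how to write a web scraping script using JavaScript and Node.js.
In most cases, a web scraping script only needs a way to download a whole page and then search it for data. Every modern language provides a way to download web pages, or at least someone has implemented a library or some other extension package for it, so this is not the hard part. Precisely locating and extracting the data inside the HTML, however, is harder. An HTML page mixes content, layout and styling, so parsing it and identifying the parts we care about is not a trivial task.
As an example, consider the following HTML page: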
<html>
    <head>...</head>
    <body>
        <div id="content">
            <div id="sidebar">
            ...
            </div>
            <div id="main">
                <div class="breadcrumbs">
                ...
                </div>
                <table id="data">
                    <tr><th>Name</th><th>Address</th></tr>
                    <tr><td class="name">John</td><td class="address">Address of John</td></tr>
                    <tr><td class="name">Susan</td><td class="address">Address of Susan</td></tr>
                </table>
            </div>
        </div>
    </body>
</html>
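If we need to get the names that appear in the table with id="data", how should we do it?
Typically, the web page is downloaded as a string, and then it is simple to search it for the strings that appear after <td class="name"> and end with </td>.
But this approach can easily pick up incorrect data. The page may contain other tables, or worse, the original <td class="name"> may become <td align="left" class="name">, in which case our approach finds nothing at all. Although changes to a web page can easily break a scraping script, if we know clearly how the elements are organized in the HTML, we do not have to rewrite the script every time the page changes.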
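The fragility described above can be seen in a minimal sketch. The helper name extractNames and the two HTML snippets are made up for illustration. This is the naive string-search approach, shown only to make its weakness concrete.

```javascript
// Naive extraction: find each '<td class="name">' marker and read up to the next '</td>'.
// extractNames is a hypothetical helper written only for this illustration.
function extractNames(html) {
  var open = '<td class="name">';
  var close = '</td>';
  var names = [];
  var start = html.indexOf(open);
  while (start !== -1) {
    var textStart = start + open.length;
    var end = html.indexOf(close, textStart);
    if (end === -1) break; // unterminated cell, stop searching
    names.push(html.slice(textStart, end));
    start = html.indexOf(open, end);
  }
  return names;
}

var original =
  '<tr><td class="name">John</td><td class="address">Address of John</td></tr>' +
  '<tr><td class="name">Susan</td><td class="address">Address of Susan</td></tr>';

// Same data, but an extra attribute now sits in front of class.
var changed = original.replace(/<td class="name">/g, '<td align="left" class="name">');

console.log(extractNames(original)); // [ 'John', 'Susan' ]
console.log(extractNames(changed));  // []
```

The data is identical in both snippets, yet the exact-match search returns nothing for the second one. A selector such as .name matches on the class alone, so an added attribute does not affect it.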
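If you have written front-end JavaScript and used jQuery, you know that selecting elements in the DOM with CSS selectors is very simple. For example, in the browser we could easily scrape those names like this: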
$('#data .name').each(function() {
    alert($(this).text());
});
http://nodejs.org (get it here!)
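JavaScript is a language embedded in web browsers. Thanks to the Node.js project, we can now also use it to write programs that run standalone, and even programs that act as a web server.
There are many ready-made libraries, jQuery among them, so using JavaScript with Node.js for this task is very convenient: we can reuse the existing techniques for manipulating DOM elements, which are already mature in web browsers.
Node.js is modular and has many libraries. This example needs two of them, request and cheerio. request is used to download web pages; cheerio builds a DOM tree locally and provides a subset of jQuery to operate on it. Node.js modules are installed with npm, which is similar to Ruby's gem or Python's easy_install.
For some of the cheerio API, see this article from the CNode community (https://cnodejs.org/topic/5203a71844e76d216a727d2e)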
$ mkdir scraping
$ cd scraping
$ npm install request cheerio
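As the commands above show, we first create a directory named "scraping" and then install the request and cheerio modules inside it. Node.js modules can also be installed globally, but I prefer local installs. The result of the installation is shown in the figure below.
Next, let's see how to use cheerio to scrape the names from the example above. We create a file named example.js with the following code: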
var cheerio = require('cheerio');
// A quoted JavaScript string cannot span lines, so the HTML is joined from pieces.
var $ = cheerio.load([
    '<html><head></head><body><div id="content">',
    '<div id="sidebar"></div><div id="main">',
    '<div id="breadcrumbs"></div><table id="data"><tr>',
    '<th>Name</th><th>Address</th></tr><tr><td class="name">',
    'John</td><td class="address">Address of John</td></tr>',
    '<tr><td class="name">Susan</td><td class="address">',
    'Address of Susan</td></tr></table></div></div></body></html>'
].join(''));
// Print the text of every .name cell inside the #data table.
$('#data .name').each(function() { console.log($(this).text()); });
The output is as follows:
$ node example.js
John
Susan
A real example: scrape the schedule from this website, http://www.thprd.org/schedules/schedule.cfm?cs_id=15
The code is as follows:
var request = require('request');
var cheerio = require('cheerio');

var days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'];
// Pool name mapped to the cs_id value used in the schedule URL.
var pools = {
    'Aloha': 3,
    'Beaverton': 15,
    'Conestoga': 12,
    'Harman': 11,
    'Raleigh': 6,
    'Somerset': 22,
    'Sunset': 5,
    'Tualatin Hills': 2
};

for (var pool in pools) {
    var url = 'http://www.thprd.org/schedules/schedule.cfm?cs_id=' + pools[pool];
    // The wrapper function gives each callback its own copy of pool.
    request(url, (function(pool) {
        return function(err, resp, body) {
            if (err) throw err; // without this, a failed download passes undefined to cheerio
            var $ = cheerio.load(body);
            // day is the index of the cell, used to look up the weekday name.
            $('#calendar .days td').each(function(day) {
                $(this).find('div').each(function() {
                    var event = $(this).text().trim().replace(/\s\s+/g, ',').split(',');
                    if (event.length >= 2 && (event[1].match(/open swim/i) || event[1].match(/family swim/i)))
                        console.log(pool + ',' + days[day] + ',' + event[0] + ',' + event[1]);
                });
            });
        };
    })(pool));
}
The output is as follows:
$ node thprd.js
Conestoga,Monday,4:15p-5:15p,Open Swim - M/L
Conestoga,Monday,7:45p-9:00p,Open Swim - M/L
Conestoga,Tuesday,7:30p-9:00p,Open Swim - M/L
Conestoga,Wednesday,4:15p-5:15p,Open Swim - M/L
Conestoga,Wednesday,7:45p-9:00p,Open Swim - M/L
Conestoga,Thursday,7:30p-9:00p,Open Swim - M/L
Conestoga,Friday,6:30p-8:30p,Open Swim - M/L
Conestoga,Saturday,1:00p-4:15p,Open Swim - M/L
Conestoga,Sunday,2:00p-4:15p,Open Swim - M/L
Aloha,Monday,1:05p-2:20p,Open Swim
Aloha,Monday,7:50p-8:25p,Open Swim
Aloha,Tuesday,1:05p-2:20p,Open Swim
Aloha,Tuesday,8:45p-9:30p,Open Swim
Aloha,Wednesday,1:05p-2:20p,Open Swim
Aloha,Wednesday,7:50p-8:25p,Open Swim
Aloha,Thursday,1:05p-2:20p,Open Swim
Aloha,Thursday,8:45p-9:30p,Open Swim
Aloha,Friday,1:05p-2:20p,Open Swim
Aloha,Friday,7:50p-8:25p,Open Swim
Aloha,Saturday,2:00p-3:30p,Open Swim
Aloha,Saturday,4:30p-6:00p,Open Swim
Aloha,Sunday,2:00p-3:30p,Open Swim
Aloha,Sunday,4:30p-6:00p,Open Swim
Harman,Monday,4:25p-5:30p,Open Swim*
Harman,Monday,7:30p-8:55p,Open Swim
Harman,Tuesday,4:25p-5:10p,Open Swim*
Harman,Wednesday,4:25p-5:30p,Open Swim*
Harman,Wednesday,7:30p-8:55p,Open Swim
Harman,Thursday,4:25p-5:10p,Open Swim*
Harman,Friday,2:00p-4:55p,Open Swim*
Harman,Saturday,1:30p-2:25p,Open Swim
Harman,Sunday,2:00p-2:55p,Open Swim
Beaverton,Tuesday,10:45a-12:55p,Open Swim (No Diving Well)
Beaverton,Tuesday,8:35p-9:30p,Open Swim No Diving Well
Beaverton,Thursday,10:45a-12:55p,Open Swim (No Diving Well)
Beaverton,Thursday,8:35p-9:30p,Open Swim No Diving Well
Beaverton,Saturday,2:30p-4:00p,Open Swim
Beaverton,Sunday,4:15p-6:00p,Open Swim
Sunset,Tuesday,1:00p-2:30p,Open Swim/One Lap Lane
Sunset,Thursday,1:00p-2:30p,Open Swim/One Lap Lane
Sunset,Sunday,1:30p-3:00p,Open Swim/One Lap Lane
Tualatin Hills,Monday,7:35p-9:00p,Open Swim-Diving area opens at 8pm
Tualatin Hills,Wednesday,7:35p-9:00p,Open Swim-Diving area opens at 8pm
Tualatin Hills,Sunday,1:30p-3:30p,Open Swim
Tualatin Hills,Sunday,4:00p-6:00p,Open Swim
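The string handling in the inner loop of the script can be tried in isolation. The sketch below uses the same trim/replace/split chain and the same two patterns. The function names parseEvent and isOpenOrFamilySwim and the sample cell text are assumptions made for illustration; the real page markup is not reproduced here.

```javascript
// Same chain as in the script: trim, turn each run of 2+ whitespace
// characters into a comma, then split on commas.
// parseEvent and isOpenOrFamilySwim are hypothetical helper names.
function parseEvent(text) {
  return text.trim().replace(/\s\s+/g, ',').split(',');
}

// Keep only entries whose second field mentions open swim or family swim.
function isOpenOrFamilySwim(event) {
  return event.length >= 2 && /open swim|family swim/i.test(event[1]);
}

// Made-up cell text: a time range and an event name separated by a line break and indentation.
var sample = '\n   4:15p-5:15p\n      Open Swim - M/L\n  ';
var parsed = parseEvent(sample);

console.log(parsed);                     // [ '4:15p-5:15p', 'Open Swim - M/L' ]
console.log(isOpenOrFamilySwim(parsed)); // true
console.log(isOpenOrFamilySwim(['6:00a-7:00a', 'Lap Swim'])); // false
```

Single spaces inside the event name survive because the pattern only matches two or more whitespace characters in a row. One limitation of this approach is that an event name containing a comma would be split into extra fields.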
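A few things to note: the variable scoping issue in asynchronous JavaScript, and the analysis of the site's structure. I will cover both in other blog posts.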
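The scoping issue can be sketched without any network access. In the sketch below, the pending array stands in for callbacks that run later, the way a response handler would. The two-entry pools object and the variable names are assumptions chosen for illustration.

```javascript
var pools = { 'Aloha': 3, 'Beaverton': 15 };
var pending = []; // stands in for callbacks that fire after the loop has finished
var pool;

// Without a wrapper: every callback reads the one shared variable when it finally runs.
var shared = [];
for (pool in pools) {
  pending.push(function() { shared.push(pool); });
}

// With a wrapper function: each callback receives its own copy of the current value.
var captured = [];
for (pool in pools) {
  pending.push((function(pool) {
    return function() { captured.push(pool); };
  })(pool));
}

// Run the callbacks only after both loops are done.
pending.forEach(function(callback) { callback(); });

console.log(shared);   // [ 'Beaverton', 'Beaverton' ]
console.log(captured); // [ 'Aloha', 'Beaverton' ]
```

By the time the callbacks run, the loop variable already holds its final value, so the unwrapped callbacks all report the last pool. Wrapping the callback in an immediately invoked function fixes the value at the moment each callback is created.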
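I have actually translated only a small part of the article. If you are interested, read the original, which explains every step in detail.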