Easy Web Scraping with Node.js

I recently needed to build a proxy server that scrapes H5 games, and came across this nice article (http://blog.miguelgrinberg.com/post/easy-web-scraping-with-nodejs). Since I have also been learning Node.js lately, I decided to translate it to reinforce what I'm learning.

Please credit the source when reposting: (www.cnblogs.com/xdxer)

 

In this article I'll show you how to write a web scraping script in JavaScript with Node.js.

Web Scraping Tools

In most cases, a scraping script just needs a way to download the whole page and then search it for the data. Every modern language provides a way to download a web page, or at the very least someone has written a library or extension that does, so that part is not the hard one. However, precisely locating and extracting the data inside the HTML is harder. An HTML page mixes content, layout, and styling, so parsing it and picking out just the parts we care about is far from trivial.

For example, consider the following HTML page:

<html>
    <head>...</head>
    <body>
        <div id="content">
            <div id="sidebar">
            ...
            </div>
            <div id="main">
                <div class="breadcrumbs">
                ...
                </div>
                <table id="data">
                    <tr><th>Name</th><th>Address</th></tr>
                    <tr><td class="name">John</td><td class="address">Address of John</td></tr>
                    <tr><td class="name">Susan</td><td class="address">Address of Susan</td></tr>
                </table>
            </div>
        </div>
    </body>
</html>

If we need to extract the names that appear in the table with id="data", how should we go about it?

Typically, the page is downloaded as a string, and then we could simply search that string for whatever appears after <td class="name"> and ends with </td>.

But this approach can easily pick up the wrong data. The page might contain other tables, or, even worse, the original <td class="name"> might change to <td align="left" class="name">, and our scheme would suddenly find nothing at all. Although changes to a web page can easily break a scraping script, if we have a clear picture of how the elements are organized in the HTML, we won't have to rewrite the scraper every time the page changes.
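To make the fragility concrete, here is a minimal sketch of the naive string-matching approach; the regular expression and the sample string are my own illustration, not part of the original article:

var html = '<tr><td class="name">John</td><td class="address">Address of John</td></tr>' +
           '<tr><td class="name">Susan</td><td class="address">Address of Susan</td></tr>';

// Treat the page as one big string and pull out whatever sits
// between <td class="name"> and </td>.
var re = /<td class="name">(.*?)<\/td>/g;
var match;
while ((match = re.exec(html)) !== null) {
    console.log(match[1]);   // John, Susan
}
// The moment the markup changes to <td align="left" class="name">,
// this pattern no longer matches anything.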

If you have written front-end JavaScript and used jQuery, you know how easy it is to select DOM elements with CSS selectors. For example, in the browser we could scrape those names as simply as this:

$('#data .name').each(function() {
    alert($(this).text());
});

 

Introducing Node.js

http://nodejs.org (get it here!)

JavaScript is a language that lives inside the web browser, but thanks to the Node.js project we can now write JavaScript programs that run on their own, and can even serve as a web server.

There are plenty of ready-made libraries, including jQuery-like ones, so JavaScript plus Node.js is a very convenient choice for this task: we can reuse DOM-manipulation techniques that are already mature in the browser.

Node.js is modular and has plenty of libraries. This example needs two of them, request and cheerio: request downloads the web pages, while cheerio builds a DOM tree locally and offers a jQuery subset to manipulate it. Node.js modules are installed with npm, which is similar to Ruby's gem or Python's easy_install.

For an overview of the cheerio API, see this article on the CNode community (https://cnodejs.org/topic/5203a71844e76d216a727d2e).

$ mkdir scraping
$ cd scraping
$ npm install request cheerio

As shown above, we first create a directory called "scraping" and install the request and cheerio modules inside it. Node.js modules can also be installed globally, but I prefer installing them locally; the modules then end up in a node_modules directory inside the project.
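With the modules installed, here is a minimal sketch of using request on its own to download a page; the URL is just a placeholder of mine, not part of the original article:

var request = require('request');

// Download a page and receive the raw HTML as a string in the callback.
request('http://www.example.com/', function(err, resp, body) {
    if (err) {
        return console.error('download failed:', err);
    }
    console.log('status code:', resp.statusCode);
    console.log('page length:', body.length);
});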


Next, let's see how to use cheerio to scrape the names from the example above. Create a file called example.js with the following code:

var cheerio = require('cheerio');

// Load the sample document; the HTML is one long string, concatenated here for readability.
var $ = cheerio.load('<html><head></head><body><div id="content">' +
    '<div id="sidebar"></div><div id="main">' +
    '<div id="breadcrumbs"></div><table id="data"><tr>' +
    '<th>Name</th><th>Address</th></tr><tr><td class="name">' +
    'John</td><td class="address">Address of John</td></tr>' +
    '<tr><td class="name">Susan</td><td class="address">' +
    'Address of Susan</td></tr></table></div></div></body></html>');

$('#data .name').each(function() {
    console.log($(this).text());
});

The output looks like this:

$ node example.js
John
Susan
 

A Real-World Example

We will scrape the swimming-pool schedules from this site: http://www.thprd.org/schedules/schedule.cfm?cs_id=15

Here is the code:

var request = require('request');
var cheerio = require('cheerio');

var days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'];
var pools = {
    'Aloha': 3,
    'Beaverton': 15,
    'Conestoga': 12,
    'Harman': 11,
    'Raleigh': 6,
    'Somerset': 22,
    'Sunset': 5,
    'Tualatin Hills': 2
};

for (var pool in pools) {
    var url = 'http://www.thprd.org/schedules/schedule.cfm?cs_id=' + pools[pool];

    // Wrap the callback in an immediately-invoked function so each request
    // remembers its own pool name (see the note on scoping below).
    request(url, (function(pool) { return function(err, resp, body) {
        var $ = cheerio.load(body);
        // Each <td> under #calendar .days is one day of the week; its index maps into days[].
        $('#calendar .days td').each(function(day) {
            $(this).find('div').each(function() {
                // Collapse runs of whitespace into commas, then split into [time, description, ...].
                var event = $(this).text().trim().replace(/\s\s+/g, ',').split(',');
                if (event.length >= 2 && (event[1].match(/open swim/i) || event[1].match(/family swim/i)))
                    console.log(pool + ',' + days[day] + ',' + event[0] + ',' + event[1]);
            });
        });
    }})(pool));
}
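The line that builds `event` does most of the work: it collapses the runs of whitespace inside each calendar cell into commas and then splits on them. A rough illustration with a made-up cell (the sample string is my own guess at what a cell looks like, not actual site output):

// Collapse runs of whitespace into commas, then split into [time, description].
var raw = '  4:15p-5:15p   Open Swim - M/L  ';
var event = raw.trim().replace(/\s\s+/g, ',').split(',');
console.log(event);  // [ '4:15p-5:15p', 'Open Swim - M/L' ]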

The output looks like this:

$ node thprd.js
Conestoga,Monday,4:15p-5:15p,Open Swim - M/L
Conestoga,Monday,7:45p-9:00p,Open Swim - M/L
Conestoga,Tuesday,7:30p-9:00p,Open Swim - M/L
Conestoga,Wednesday,4:15p-5:15p,Open Swim - M/L
Conestoga,Wednesday,7:45p-9:00p,Open Swim - M/L
Conestoga,Thursday,7:30p-9:00p,Open Swim - M/L
Conestoga,Friday,6:30p-8:30p,Open Swim - M/L
Conestoga,Saturday,1:00p-4:15p,Open Swim - M/L
Conestoga,Sunday,2:00p-4:15p,Open Swim - M/L
Aloha,Monday,1:05p-2:20p,Open Swim
Aloha,Monday,7:50p-8:25p,Open Swim
Aloha,Tuesday,1:05p-2:20p,Open Swim
Aloha,Tuesday,8:45p-9:30p,Open Swim
Aloha,Wednesday,1:05p-2:20p,Open Swim
Aloha,Wednesday,7:50p-8:25p,Open Swim
Aloha,Thursday,1:05p-2:20p,Open Swim
Aloha,Thursday,8:45p-9:30p,Open Swim
Aloha,Friday,1:05p-2:20p,Open Swim
Aloha,Friday,7:50p-8:25p,Open Swim
Aloha,Saturday,2:00p-3:30p,Open Swim
Aloha,Saturday,4:30p-6:00p,Open Swim
Aloha,Sunday,2:00p-3:30p,Open Swim
Aloha,Sunday,4:30p-6:00p,Open Swim
Harman,Monday,4:25p-5:30p,Open Swim*
Harman,Monday,7:30p-8:55p,Open Swim
Harman,Tuesday,4:25p-5:10p,Open Swim*
Harman,Wednesday,4:25p-5:30p,Open Swim*
Harman,Wednesday,7:30p-8:55p,Open Swim
Harman,Thursday,4:25p-5:10p,Open Swim*
Harman,Friday,2:00p-4:55p,Open Swim*
Harman,Saturday,1:30p-2:25p,Open Swim
Harman,Sunday,2:00p-2:55p,Open Swim
Beaverton,Tuesday,10:45a-12:55p,Open Swim (No Diving Well)
Beaverton,Tuesday,8:35p-9:30p,Open Swim No Diving Well
Beaverton,Thursday,10:45a-12:55p,Open Swim (No Diving Well)
Beaverton,Thursday,8:35p-9:30p,Open Swim No Diving Well
Beaverton,Saturday,2:30p-4:00p,Open Swim
Beaverton,Sunday,4:15p-6:00p,Open Swim
Sunset,Tuesday,1:00p-2:30p,Open Swim/One Lap Lane
Sunset,Thursday,1:00p-2:30p,Open Swim/One Lap Lane
Sunset,Sunday,1:30p-3:00p,Open Swim/One Lap Lane
Tualatin Hills,Monday,7:35p-9:00p,Open Swim-Diving area opens at 8pm
Tualatin Hills,Wednesday,7:35p-9:00p,Open Swim-Diving area opens at 8pm
Tualatin Hills,Sunday,1:30p-3:30p,Open Swim
Tualatin Hills,Sunday,4:00p-6:00p,Open Swim
 
A few things to watch out for: the scoping of asynchronous JavaScript callbacks, and how to analyze the structure of the target site. I'll cover these in other posts.
I only translated a small part of the original article; if you're interested, go read it in full, since it explains every step in detail.
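On the scoping point: the callbacks passed to request only run after the for...in loop has finished, so without the immediately-invoked wrapper every callback would see the same, final value of `pool`. A stripped-down sketch of my own, with setTimeout standing in for request's asynchronous callback:

// Problem: every callback closes over the same loop variable `pool`,
// so by the time the callbacks run it holds the last key of the loop.
var pools = { 'Aloha': 3, 'Beaverton': 15 };

for (var pool in pools) {
    setTimeout(function() {
        console.log('broken:', pool);       // prints "Beaverton" twice
    }, 0);
}

// Fix used in thprd.js: an immediately-invoked function copies the
// current value into its own scope.
for (var pool in pools) {
    setTimeout((function(pool) {
        return function() {
            console.log('fixed:', pool);    // prints "Aloha", then "Beaverton"
        };
    })(pool), 0);
}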