Larbin Crawler Walkthrough (1): Configuring and Using Larbin

Date: 2019-11-13

Tags: crawler, larbin, analysis, configuration, usage. Category: web crawlers

Introduction

Function: web crawler

Language: C++

Developer: Sébastien Ailleret (France)

Features: fetches web pages only; efficient (a simple larbin crawler can fetch 5 million pages a day)

Installation

Platform: Ubuntu 12.10

Download: http://sourceforge.net/projects/larbin/files/larbin/2.6.3/larbin-2.6.3.tar.gz/download

Install:

tar -zxvf  larbin-2.6.3.tar.gz
cd larbin-2.6.3
./configure
make

Errors come up during the build. The fixes:

1. In the adns folder, internal.h, lines 569-571:

adns_status adns__parse_domain(adns_state ads, int serv, adns_query qu,
         vbuf *vb, parsedomain_flags flags,
         const byte *dgram, int dglen, int *cbyte_io, int max);

Change it to

adns_status adns__parse_domain(adns_state ads, int serv, adns_query qu,
         vbuf *vb, adns_queryflags flags,
         const byte *dgram, int dglen, int *cbyte_io, int max);

2. Running sudo ./configure fails with:

make[2]: Entering directory `/home/byd/test/larbin-2.6.3/src/utils'
makedepend -f- -I.. -Y *.cc 2> /dev/null > .depend
make[2]: *** [dep-in] Error 127
make[2]: Leaving directory `/home/byd/test/larbin-2.6.3/src/utils'
make[2]: Entering directory `/home/byd/test/larbin-2.6.3/src/interf'
makedepend -f- -I.. -Y *.cc 2> /dev/null > .depend
make[2]: *** [dep-in] Error 127
make[2]: Leaving directory `/home/byd/test/larbin-2.6.3/src/interf'
make[2]: Entering directory `/home/byd/test/larbin-2.6.3/src/fetch'
makedepend -f- -I.. -Y *.cc 2> /dev/null > .depend
make[2]: *** [dep-in] Error 127
make[2]: Leaving directory `/home/byd/test/larbin-2.6.3/src/fetch'
make[1]: *** [dep] Error 2
make[1]: Leaving directory `/home/byd/test/larbin-2.6.3/src'
make: *** [dep] Error 2

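The output points at makedepend. Typing makedepend at the prompt reports that

makedepend is not installed, but that it can be installed with

sudo apt-get install xutils-dev

After that the error is gone.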
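
Error 127 from make means the shell could not find the command it was asked to run, so it can help to check for the build tools before running ./configure. A minimal sketch in POSIX shell; the helper name missing_tools is made up for this example, and the tool list in the usage comment is an assumption about what the larbin build needs:

```shell
#!/bin/sh
# Hypothetical helper: print each named command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example (assumed tool list):
#   missing_tools make g++ makedepend
```

If the function prints nothing, every listed tool was found.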

3. Copy the iostream file from under /usr/include/c++/ into larbin's src directory, rename the copy to iostream.h, and add this line to the file:

using namespace std;
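
Step 3 can be scripted. This is a sketch under assumptions: the helper name make_iostream_h is hypothetical, and because the header location depends on the installed compiler version, the path is passed in as an argument instead of being guessed:

```shell
#!/bin/sh
# Hypothetical helper: copy the system iostream header to DEST/iostream.h
# and append the using-directive described in step 3.
make_iostream_h() {
    header=$1   # the iostream file found under /usr/include/c++/
    dest=$2     # larbin's src directory
    cp "$header" "$dest/iostream.h" || return 1
    printf '%s\n' 'using namespace std;' >> "$dest/iostream.h"
}

# Example, run from the larbin-2.6.3 directory (header path is a placeholder):
#   make_iostream_h /path/to/c++/iostream src
```

The directive is appended at the end of the copied file, which is one way to "add a line" without editing the header by hand.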

Then continue with

make

Run

 ./larbin

Enter "localhost:8081" in a browser to see the crawler's current status.

Stop

 Ctrl+C

Restart from scratch

./larbin -scratch

Starting larbin again (with just ./larbin) fails with:

larbin_2.6.3 is starting its search
Unable to get the socket for the webserver (8081) : Address already in use

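Cause

If a client still holds a connection to the server when the server goes down, starting the server again reports: Address already in use

Fix

netstat -anp | more

Find the process holding port 8081 in the output, then kill it: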

kill -9 18066

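Here:

netstat reports network information held by the kernel: TCP connections, listening TCP and UDP sockets, routing tables, and interface statistics.
kill -9 forcibly kills the process (use it sparingly: the signal cannot be caught or ignored, so the process gets no chance to clean up, and system processes are killed just the same).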
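
Picking the PID out of the netstat listing can be scripted. A sketch, assuming the Linux netstat -anp column layout (local address in column 4, state in column 6, PID/program in column 7); the function name pid_on_port is made up here:

```shell
#!/bin/sh
# Hypothetical helper: read `netstat -anp` output on stdin and print the
# PID of any process listening on the given TCP port.
pid_on_port() {
    awk -v port="$1" '
        $6 == "LISTEN" && $4 ~ (":" port "$") {
            split($7, owner, "/")   # column 7 looks like "18066/larbin"
            print owner[1]
        }'
}

# Example:
#   netstat -anp 2>/dev/null | pid_on_port 8081
```

In line with the caution above, trying a plain kill on the printed PID first and falling back to kill -9 only if the process does not exit is the gentler order.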

Configuration

1. The larbin.conf file

###############################################
//Client identifier: tells the sites being crawled what is fetching their pages

UserAgent larbin_2.6.3

############################################
# What are the inputs and ouputs of larbin
# port on which is launched the http statistic webserver
# if unset or set to 0, no webserver is launched

//Port for the HTTP statistics web server (with httpPort 8081, open http://localhost:8081/ while larbin runs). Setting the port to 0 disables the web server. Use it to watch the crawl results

httpPort 8081

# port on which you can submit urls to fetch
# no input is possible if you comment this line or use port 0

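//Port on which urls to crawl can be submitted. If the line is commented out or the port is 0, no input is possible. To submit urls by hand or from a program, connect to TCP port 1976 on the machine, i.e. set inputPort 1976; urls to crawl can then be added.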

#inputPort 1976

############################################
# parameters to adapt depending on your network
# Number of connexions in parallel (to adapt depending of your network speed)
//Number of pages fetched in parallel. Tune it to your network speed; if too many requests time out, lower it

pagesConnexions 100

# Number of dns calls in parallel
//Number of DNS lookups in parallel.

dnsConnexions 5

# How deep do you want to go in a site
//Crawl depth within one site

depthInSite 5

# do you want to follow external links
//Forbid external links. If set, only links on the same host are followed
#noExternalLinks

# time between 2 calls on the same server (in sec) : NEVER less than 30
//Interval between two requests to the same server. Never below 30 s; 60 s is recommended
waitDuration 60

# Make requests through a proxy (use with care)
//Whether to go through a proxy. If so, it must be set here. Avoid it if you can; use this option with care
#proxy www 8080

##############################################
# now, let's customize the search

# first page to fetch (you can specify several urls)
//URL to start crawling from
startUrl http://slashdot.org/

# Do you want to limit your search to a specific domain ?
# if yes, uncomment the following line
//If set, the crawl is limited to the listed domains
#limitToDomain .fr .dk .uk end

# What are the extensions you surely don't want
# never forbid .html, .htm and so on : larbin needs them
//Extensions you do not want. Never forbid .html or .htm: those are exactly what larbin fetches, and forbidding them has no effect anyway
forbiddenExtensions
.tar .gz .tgz .zip .Z .rpm .deb
.ps .dvi .pdf
.png .jpg .jpeg .bmp .smi .tiff .gif
.mov .avi .mpeg .mpg .mp3 .qt .wav .ram .rm
.jar .java .class .diff
.doc .xls .ppt .mdb .rtf .exe .pps .so .psd
end
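
The comment on waitDuration says never to go below 30 seconds, so a small check can catch an overly aggressive setting before a crawl starts. A sketch; check_wait_duration is a hypothetical helper, and it assumes the "waitDuration N" line format shown in the listing above:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if FILE sets waitDuration to 30 or more.
# A commented-out line ("#waitDuration 60") does not count as a setting.
check_wait_duration() {
    awk '
        $1 == "waitDuration" { found = 1; if ($2 + 0 < 30) bad = 1 }
        END { exit ((found && !bad) ? 0 : 1) }
    ' "$1"
}

# Example:
#   check_wait_duration larbin.conf || echo "waitDuration missing or below 30"
```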

2. options.h

2.1 Output mode

// Select the output module you want to use

//Default mode. Outputs nothing, so do not pick this one
#define DEFAULT_OUTPUT // do nothing...


//Simple save. Pages go into files named save/dxxxxxx/fyyyyyy, 2000 files per directory
//#define SIMPLE_SAVE // save in files named save/dxxxxxx/fyyyyyy

//Mirror save. Pages are stored following the site hierarchy, so the result can serve as a local mirror of the site.
//#define MIRROR_SAVE // save in files (respect sites hierarchy)

//Stats output. Statistics are shown on a web page; see http://localhost:8081/output.html for the results
//#define STATS_OUTPUT // do some stats on pages

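These modes are defined in src/types.h, and you can write your own output mode in src/interf/useroutput.cc. options.h has many more related settings. After any change, larbin must be recompiled.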
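
Switching output modules by hand means commenting one #define and uncommenting another. A sketch of doing that from the shell; select_output is a hypothetical helper, and it assumes each module sits on its own line as "#define NAME" or "//#define NAME", as in the listing above:

```shell
#!/bin/sh
# Hypothetical helper: enable exactly one output module in options.h and
# comment out the other three. Lines for other options are left alone.
select_output() {
    file=$1
    mode=$2
    tmp=$(mktemp) || return 1
    awk -v mode="$mode" '
        /^(\/\/)?#define (DEFAULT_OUTPUT|SIMPLE_SAVE|MIRROR_SAVE|STATS_OUTPUT)([ \t]|$)/ {
            line = $0
            sub(/^\/\//, "", line)        # strip an existing comment marker
            split(line, word, /[ \t]+/)   # word[2] is the module name
            print (word[2] == mode ? line : "//" line)
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}

# Example (then rebuild with make):
#   select_output options.h SIMPLE_SAVE
```

The result is written back with cat so that the original file keeps its permissions.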

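2.2 Specific search

// Set up a specific search
//Enable a specific search
//#define SPECIFICSEARCH
//Content types to look for
//#define contentTypes ((char *[]) { "audio/mpeg", NULL })
//File extensions. Used only to speed up the search; they do not decide the type, which is determined by contentTypes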
//#define privilegedExts ((char *[]) { ".mp3", NULL })

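2.3 After that, choose how specific files are managed

#define DEFAULT_SPECIFIC //Default handling. Specific files are treated like html pages (size limit included), except that they are not parsed.

//Save specific files. Files are written to disk and can be large; the limit can be set in src/types.h.
#define SAVE_SPECIFIC 

//Dynamic mode. Buffers are allocated dynamically for larger files.
#define DYNAMIC_SPECIFIC

You can define your own handling of specific files in "src/fetch/specbuf.cc" and "src/fetch/specbuf.h".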

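2.4 What you want the crawler to do

//Follow links. If this is not set, html pages are not parsed and the links in them are not crawled. Leaving it unset is useful when urls are fed in through the input system
#define FOLLOW_LINKS

//List of the links found in each page. Access it in "useroutput.cc" with page->getLinks().
#define LINKS_INFO 

//URL tags. With this set, every url carries an int (0 by default). When using the input system you should supply the int. It can be used to get back to the url, and it works across redirects.
#define URL_TAGS

//No duplicates. If set, a page identical to one already seen is ignored.
#define NO_DUP

//Exit at end. Whether to exit once there are no urls left to crawl. If set, larbin exits.
#define EXIT_AT_END 

//Fetch the images in pages. If set, update the forbidden extensions in larbin.conf.
#define IMAGES

//Fetch pages of any type, whatever the type is. If set, update larbin.conf.
#define ANYTYPE

//Let larbin manage cookies. The implementation is simple but useful.
#define COOKIES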
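
Enabling IMAGES means the image extensions have to come out of the forbiddenExtensions list in larbin.conf. A sketch of that edit; allow_extensions is a hypothetical helper, and it assumes the block layout shown earlier (forbiddenExtensions on its own line, extensions on the following lines, closed by end):

```shell
#!/bin/sh
# Hypothetical helper: print FILE with the given extensions removed from
# the forbiddenExtensions ... end block; all other lines pass through.
allow_extensions() {
    file=$1
    shift
    awk -v drop="$*" '
        BEGIN { n = split(drop, d, " "); for (i = 1; i <= n; i++) gone[d[i]] = 1 }
        $1 == "forbiddenExtensions" { inblock = 1; print; next }
        inblock && $1 == "end"      { inblock = 0; print; next }
        inblock {
            out = ""
            for (i = 1; i <= NF; i++)
                if (!($i in gone)) out = out (out == "" ? "" : " ") $i
            if (out != "") print out   # drop lines left empty
            next
        }
        { print }
    ' "$file"
}

# Example: write an edited copy, review it, then replace larbin.conf.
#   allow_extensions larbin.conf .png .jpg .jpeg .bmp .gif > larbin.conf.new
```

The function writes to standard output and does not edit in place, so the original file stays untouched until the result has been checked.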

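2.5 Other options

#define CGILEVEL 1           //Takes a level as its argument. Restricts which urls are crawled.

#define MAXBANDWIDTH 200000  //Bandwidth larbin may use. If unset, bandwidth is unlimited.

#define DEPTHBYSITE          //When a url links to another site, the depth of the new url is reinitialized.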

2.6 Efficiency and features

//Use a dedicated thread for output. Required when you put your own code in useroutput.cc.
#define THREAD_OUTPUT

//Restart state. With this set, larbin can resume crawling from where it last stopped; the -scratch option starts over from scratch instead.
#define RELOAD

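2.7 How larbin works

#define NOWEBSERVER       //Do not start the web server. Useful if you want larbin to launch no threads

#define GRAPH             //Show histograms on the stats page.

#define NDEBUG            //Disable debugging output.

#define NOSTATS           //Disable stats output.

#define STATS             //Enable stats output. Crawl status is printed at regular intervals while running.

#define BIGSTATS          //Print the name of every fetched page on standard output. Slows larbin down

#define CRASH             //For reporting serious bugs. Use it when compiling with gmake debug.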