HTTP/FTP Client Libraries: libwww, libcurl, libfetch, and More

Fetching web pages and accessing FTP are very common application needs today: search engine crawlers, analysis programs, resource downloaders, web services, and so on all need them. Writing your own fetching library would of course be ideal, but development takes time, so using an existing open source library is usually the better choice: first, the code is already well written and battle-tested; second, you can get going very quickly; and third, you can learn from the strengths of other people's code.

While idly browsing the web I came across these gems and am copying them here to share. The focus is on three libraries: libwww, libcurl, and libfetch; of course there are many other excellent libraries as well, and they are briefly introduced at the end of the article.

【libwww】
Official site: http://www.w3.org/Library/
More info: http://www.w3.org/Library/User/
Platforms: Unix/Linux, Windows

Sources: http://9.douban.com/site/entry/15448100/ and http://zh.wikipedia.org/wiki/Libwww

Overview:
Libwww is a highly modular client-side web access API written in C that runs on Unix and Windows. It can be used for both large and small applications including: browsers/editors, robots and batch tools. There are pluggable modules provided with Libwww which include complete HTTP/1.1 with caching, pipelining, POST, Digest Authentication, deflate, etc. The purpose of libwww is to serve as a testbed for protocol experiments. Tim Berners-Lee created Libwww in November 1992 to demonstrate the potential of the Web. Applications built with Libwww include the widely used command-line text browser Lynx and the Mosaic web browser. Libwww is now an open source program and has moved under W3C management. Thanks to its open source nature anyone can contribute to Libwww, which helps it keep improving and become ever more useful software.
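
To give a feel for the API before the hands-on notes below, here is a minimal sketch modelled on the "chunk" example that ships with libwww: it loads a URL into memory and prints it. Treat it as an outline rather than a verified program; header names and the exact setup calls (the real example also installs print/trace callbacks) can differ between libwww versions.

#include <stdio.h>
#include "WWWLib.h"
#include "WWWInit.h"

int main(int argc, char **argv)
{
    const char *uri = (argc == 2) ? argv[1] : "http://www.w3.org/";
    HTRequest *request;
    HTChunk *chunk;

    /* Initialize the libwww core as a non-caching client */
    HTProfile_newNoCacheClient("TestApp", "1.0");

    request = HTRequest_new();
    chunk = HTLoadToChunk(uri, request);    /* blocking fetch into memory */
    if (chunk) {
        char *string = HTChunk_toCString(chunk);
        printf("%s\n", string ? string : "(no text)");
        HT_FREE(string);
    }

    HTRequest_delete(request);
    HTProfile_delete();
    return 0;
}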

A hands-on account:
Recently I needed to write some page-analysis code, which to some extent resembles a search engine's "crawl -> parse -> store" pipeline.

The library I have usually used for fetching pages is libcurl, which is the foundation of the common Unix command curl. curl has been called a "command-line browser": it is powerful and supports a comprehensive set of protocols. Unfortunately, libcurl is only a multi-protocol fetching library; it cannot do parsing.

Searching around, I found the W3C's Libwww library. This thing is frighteningly powerful: it not only parses, it even has robot functionality (that is, a crawler, or "internet walker"). Many programs have been built on top of Libwww, the most famous probably being the character-mode browser lynx. I almost felt this was exactly what I needed and immediately dove in.

A whole day later I could finally fetch pages with it and extract some information from the HTML, but going any further turned out to be extremely difficult, because the library is simply too complex. Its documentation is thin and it is rarely mentioned anywhere. The most recent release, Libwww 5.3.2, dates from December 20, 2000. For something with that many years of history to have so few developers discussing it is very unusual.

Searching further, I finally found the comparison with Libwww in libcurl's FAQ. The selected reader letters there told me I was not the only one left dizzy by Libwww's complexity: I had only spent a day, while the fellow who wrote in had burned a whole person-month and was still going in circles until he switched to curl. Even though this is libcurl's way of promoting itself, these predecessors' failures restored my confidence in my own intelligence. So it is normal after all that hardly anyone discusses this thing...

Fine, I surrender too. libcurl has no HTML parsing, but that's all right, I'll find another way... However good it may be, I really cannot put up with a library this complex any longer, and what I need is honestly nowhere near as complicated as Libwww.

When writing programs it is easy to lose your way: you see something that looks perfect and seems able to do everything, you fall in love with it at once, yet in the end you often cannot enjoy its blessings. More often it is those less mature libraries, each with its own small flaws, that combine into the real solution.

【libcurl】

Official site: http://curl.haxx.se/libcurl
More features: http://curl.haxx.se/docs/features.html
Platforms: Unix/Linux, Windows

Source: http://blog.csdn.net/hwz119/archive/2007/04/29/1591920.aspx

libcurl is a free, open source, client-side URL transfer library. It supports FTP, FTPS, TFTP, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP; it is cross-platform (Windows, Unix, Linux, etc.), thread-safe, supports IPv6, and is easy to use.

Download a stable version from http://curl.haxx.se/libcurl/, taking care to choose the right OS.

Building libcurl

What you download is a source package, which needs to be compiled.

Unzip the file and go into the curl-7.14.0\lib directory (I downloaded 7.14.0).

Building the Debug version: create a batch file, e.g. buildDebug.bat, with the following content:

call "C:\Program Files\Microsoft Visual Studio\VC98\Bin\vcvars32.bat"

set CFG=debug-dll-ssl-dll-zlib-dll

set OPENSSL_PATH=E:\SSL\openssl-0.9.7e

set ZLIB_PATH=E:\zip\zlib123

nmake -f Makefile.vc6

Its output: libcurld_imp.lib, libcurld.dll

Building the Release version: create a batch file BuildRelease.bat with the following content:

call "C:\Program Files\Microsoft Visual Studio\VC98\Bin\vcvars32.bat"

set CFG=release-dll-ssl-dll-zlib-dll

set OPENSSL_PATH=E:\SSL\openssl-0.9.7e

set ZLIB_PATH=E:\zip\zlib123

nmake -f Makefile.vc6

Its output: libcurl_imp.lib, libcurl.dll

The above builds the libcurl DLL against the DLL versions of OpenSSL and zlib. If you do not have them, they can be downloaded from www.openssl.org and http://www.zlib.net/.

If you need to build other variants, look at Makefile.vc6 and set the CFG parameter accordingly.

Commercial software may use libcurl as long as its copyright notice is included.

Sample:

#include <stdio.h>
#include "../curl-7.14.0/include/curl/curl.h"
#pragma comment(lib, "../curl-7.14.0/lib/libcurl_imp.lib")

int main(void)
{
    CURL *curl = curl_easy_init();
    if (curl) {
        CURLcode res;
        res = curl_easy_setopt(curl, CURLOPT_PROXY, "Test-pxy08:8080");
        res = curl_easy_setopt(curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
        res = curl_easy_setopt(curl, CURLOPT_URL, "http://www.vckbase.com");
        res = curl_easy_perform(curl);

        if (CURLE_OK == res) {
            char *ct;
            /* ask for the content-type */
            /* http://curl.haxx.se/libcurl/c/curl_easy_getinfo.html */
            res = curl_easy_getinfo(curl, CURLINFO_CONTENT_TYPE, &ct);

            if ((CURLE_OK == res) && ct)
                printf("We received Content-Type: %s\n", ct);
        }

        /* always cleanup */
        curl_easy_cleanup(curl);
    }
    return 0;
}
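
The sample above only asks for the Content-Type header. Since the whole point of the article is fetching pages so they can be parsed afterwards, here is a small sketch of how the page body itself can be captured with libcurl's CURLOPT_WRITEFUNCTION/CURLOPT_WRITEDATA options. The buffer structure and the URL are my own illustration, not part of the original sample, and the standard <curl/curl.h> include path is used (adjust to the relative path above if building inside the same directory layout).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Growable buffer that collects the response body */
struct page { char *data; size_t len; };

/* Called by libcurl for every chunk of body data received */
static size_t on_data(void *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct page *p = (struct page *)userdata;
    size_t n = size * nmemb;
    char *grown = realloc(p->data, p->len + n + 1);
    if (!grown)
        return 0;                 /* returning less than n aborts the transfer */
    p->data = grown;
    memcpy(p->data + p->len, ptr, n);
    p->len += n;
    p->data[p->len] = '\0';
    return n;
}

int main(void)
{
    struct page p = { NULL, 0 };
    CURL *curl = curl_easy_init();
    if (curl) {
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_data);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &p);
        if (curl_easy_perform(curl) == CURLE_OK)
            printf("fetched %lu bytes\n", (unsigned long)p.len);
        curl_easy_cleanup(curl);
    }
    free(p.data);   /* p.data now holds the whole page, ready for parsing */
    return 0;
}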

【libfetch】
Official site: http://libfetch.darwinports.com/
More info: http://www.freebsd.org/cgi/man.cgi?query=fetch&sektion=3
Platform: BSD

Source: http://bbs.chinaunix.net/viewthread.php?tid=105809

A few days ago boss Wushuang introduced CU's mighty pharaoh-level flooding master on the FB board, so I said I would write a program to post automatically. Last night I made a breakthrough and found a very good library, which I am introducing to all the big fish and little shrimp here first. But please don't actually use it to flood, or after the moderators here finish me off my ghost will come back to settle the score!
This is a library I found in FreeBSD: libfetch. Its source code lives in /usr/src/lib/libfetch. It wraps the HTTP and FTP protocols and provides some very easy-to-use functions. I only found it yesterday and have not studied it carefully yet, but I tried one of the functions that fetches a web page over HTTP; the example is as follows:
#include <stdio.h>
#include <string.h>

#include "fetch.h"

const char *myurl = "http://qjlemon:aaa@192.169.0.1:8080/test.html";

int main(void)
{
    FILE *fp;
    char buf[1024];

    fp = fetchGetURL(myurl, "");
    if (!fp) {
        printf("error: %s\n", fetchLastErrString);
        return 1;
    }
    while (!feof(fp)) {
        memset(buf, 0, sizeof(buf));
        fgets(buf, sizeof(buf), fp);
        if (ferror(fp))
            break;
        if (buf[0])
            printf("%s", buf);
        else
            break;
    }
    fclose(fp);
    return 0;
}
The most important thing here is the fetchGetURL function, which fetches a file according to the given URL: if the URL starts with http, the function knows to fetch it over HTTP; if it starts with ftp://, it fetches over FTP. A user name and password can also be specified. If the file is fetched successfully it returns a FILE pointer, so the page contents can be read just as with an ordinary file. The library also provides further functions that give finer-grained control over the network operations. Of course the most useful ones are still the few PUT functions — that is what you need if you want to flood! Hahaha!
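
For completeness, here is a minimal sketch of what an upload through the PUT side of the fetch(3) API might look like, using fetchPutURL, which returns a write-only stream. The URL and payload below are made up for illustration; check the man page for the flag strings your FreeBSD version accepts.

#include <stdio.h>
#include <string.h>

#include "fetch.h"

int main(void)
{
    /* hypothetical target; fetchPutURL() opens a write-only stream to it */
    const char *url = "ftp://user:pass@192.168.0.1/upload/test.txt";
    const char *payload = "hello from libfetch\n";
    FILE *fp;

    fp = fetchPutURL(url, "");
    if (!fp) {
        printf("error: %s\n", fetchLastErrString);
        return 1;
    }
    fwrite(payload, 1, strlen(payload), fp);
    fclose(fp);
    return 0;
}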

【Other Related HTTP/FTP Client Libraries】
Source: http://curl.haxx.se/libcurl/competitors.html

Free Software and Open Source projects have a long tradition of forks and duplicate efforts. We enjoy "doing it ourselves", no matter if someone else has done something very similar already.

Free/open libraries that cover parts of libcurl's features:

libcurl (MIT)

a highly portable and easy-to-use client-side URL transfer library, supporting FTP, FTPS, HTTP, HTTPS, SCP, SFTP, TELNET, DICT, FILE, TFTP and LDAP. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading, kerberos, HTTP form based upload, proxies, cookies, user+password authentication, file transfer resume, http proxy tunnelling and more!

libghttp (LGPL)

Having a glance at libghttp (a GNOME HTTP library), it looks as if it works rather similarly to libcurl (for HTTP). There's no web page for it, and the person whose email is mentioned in the README of the latest release I found claims he has passed the leadership of the project to "eazel". Popular choice among GNOME projects.

libwww (W3C license) comparison with libcurl

More complex, and harder to use, than libcurl. Includes everything from multi-threading to HTML parsing. The most notable transfer-related feature that libcurl does not offer but libwww does is caching.

libferit (GPL)

C++ library "for transferring files via http, ftp, gopher, proxy server". Based on 'snarf' 2.0.9-code (formerly known as libsnarf). Quote from freshmeat:  "As the author of snarf, I have to say this frightens me. Snarf's networking system is far from robust and complete. It's probably full of bugs, and although it works for maybe 85% of all current situations, I wouldn't base a library on it."

neon (LGPL)

An HTTP and WebDAV client library, with a C interface. I've mainly heard and seen people use this with WebDAV as their main interest.

libsoup (LGPL) comparison with libcurl

Part of glib (GNOME). Supports: HTTP 1.1, Persistent connections, Asynchronous DNS and transfers, Connection cache, Redirects, Basic, Digest, NTLM authentication, SSL with OpenSSL or Mozilla NSS, Proxy support including SSL, SOCKS support, POST data. Probably not very portable. Lacks: cookie support, NTLM for proxies, GSS, gzip encoding, trailers in chunked responses and more.

mozilla netlib (MPL)

Handles URLs, protocols, transports for the Mozilla browser.

mozilla libxpnet (MPL)

Minimal download library targeted to be much smaller than the above mentioned netlib. HTTP and FTP support.

wget (GPL)

While not a library at all, I've been told that people sometimes extract the network code from it and base their own hacks from there.

libfetch (BSD)

Does HTTP and FTP transfers (both ways), supports file: URLs, and an API for URL parsing. The utility  fetch  that is built on libfetch is an integral part of the  FreeBSD  operating system.

HTTP Fetcher (LGPL)

" a small, robust, flexible library for downloading files via HTTP using the GET method. "

http-tiny (Artistic License)

" a very small C library to make http queries (GET, HEAD, PUT, DELETE, etc.) easily portable and embeddable "

XMLHTTP Object also known as IXMLHTTPRequest (part of MSXML 3.0)

(Windows) Provides client-side protocol support for communication with HTTP servers. A client computer can use the XMLHTTP object to send an arbitrary HTTP request, receive the response, and have the Microsoft® XML Document Object Model (DOM) parse that response.

QHttp (GPL)

QHttp is a class in the Qt library from Troll Tech. Seems to be restricted to plain HTTP. Supports GET, POST and proxy. Asynchronous.

ftplib (GPL)

" a set of routines that implement the FTP protocol. They allow applications to create and access remote files through function calls instead of needing to fork and exec an interactive ftp client program."

ftplibpp (GPL)

A C++ library for "easy FTP client functionality. It features resuming of up- and downloads, FXP support, SSL/TLS encryption, and logging functionality."

GNU Common C++ library

Has a URLStream class. This C++ class allows you to download a file using HTTP. See demo/urlfetch.cpp in commoncpp2-1.3.19.tar.gz.

HTTPClient (LGPL)

Java HTTP client library.

Jakarta Commons HttpClient (Apache License)

A Java HTTP client library written by the Jakarta project.