DHT crawler source code: https://github.com/h31h31/H31DHTDEMO
Data processing program source code: https://github.com/h31h31/H31DHTMgr
2. [Movie Search Tool] Implementing the DHT network crawler in code
--------------------------------------------------------------------------------------------------------------------
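After so many articles in this series, this last one covers a fairly important part of the BT network: the protocol for downloading torrents, so that you know how to fetch a torrent directly from the DHT network.
First, how do we currently download movies and other files? If we have a BT torrent, we can download the corresponding files. But if all we have is a file name, how do we find the torrent?
Usually we search for a magnet link and use that string to download the torrent file and then the movie. But if no website offers the torrent for download, how do we find it?
To download a torrent from the DHT network, first read these two protocol documents: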
http://www.bittorrent.org/beps/bep_0009.html
http://www.bittorrent.org/beps/bep_0010.html
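They describe the protocol, but it is still worth walking through the workflow so that it is easier to understand.
Our code still has to build on the DHT crawler (source: https://github.com/h31h31/H31DHTDEMO), because the data comes from the DHT network and the later steps run on top of it.
The earlier DHT code already contains SEARCH logic that finds which IPs are offering a given HASH for download.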
/* This is how you trigger a search for a torrent hash.  If port
   (the second argument) is non-zero, it also performs an announce.
   Since peers expire announced data after 30 minutes, it's a good
   idea to reannounce every 28 minutes or so. */
if(searching)
{
    //m_dht.dht_random_bytes((void*)hashList[2],20);
    if(m_soListen >= 0)
        m_dht.dht_search(hashList[2], 0, AF_INET, DHT_callback, this);
    if(s6 >= 0)
        m_dht.dht_search(hashList[2], 0, AF_INET6, DHT_callback, this);
    searching = 0;
}
Once the search has returned the remote IP addresses and port numbers, take a look at dht_periodic() in dht.c. The ANNOUNCE_PEER request handled there carries the identifier that the remote side uses to identify itself for this BT torrent, which this article calls the peerid.
The ANNOUNCE_PEER case inside dht_periodic(const void *buf, size_t buflen, const struct sockaddr *fromAddr, int fromlen, time_t *tosleep, dht_callback *callback, void *closure):
case ANNOUNCE_PEER:
    // sin_port is in network byte order, so convert it before printing
    _dout("Announce peer!From IP:%s:%d\n",inet_ntoa(tempip->sin_addr),ntohs(tempip->sin_port));
    new_node(id, fromAddr, fromlen, 1);
    if(id_cmp(info_hash, zeroes) == 0)
    {
        _dout("Announce_peer with no info_hash.\n");
        send_error(fromAddr, fromlen, tid, tid_len,203, "Announce_peer with no info_hash");
        break;
    }
    if(!token_match(token, token_len, fromAddr))
    {
        _dout("Incorrect token for announce_peer.\n");
        send_error(fromAddr, fromlen, tid, tid_len,203, "Announce_peer with wrong token");
        break;
    }
    if(port == 0)
    {
        _dout("Announce_peer with forbidden port %d.\n", port);
        send_error(fromAddr, fromlen, tid, tid_len,203, "Announce_peer with forbidden port number");
        break;
    }
    if(callback)
    {
        (*callback)(closure, DHT_EVENT_ANNOUNCE_PEER_VALUES, info_hash,(void *)fromAddr, port,id); // this id is the peerid
    }
With the IP, the port number and the torrent ID in hand, we can send a request to the remote side.
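For reference, BEP 5 returns peers in a "compact" form: 6 bytes per IPv4 peer, a 4-byte address followed by a 2-byte port, both in network byte order. The sketch below decodes such a buffer, assuming the search callback hands the values over in that form. The Peer struct and the decode_compact_peers name are illustrative only and are not part of dht.c.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type, not part of dht.c.
struct Peer {
    std::string ip;
    uint16_t port;
};

// Decode BEP 5 compact IPv4 peer info: 6 bytes per peer.
// Trailing bytes that do not form a whole entry are ignored.
std::vector<Peer> decode_compact_peers(const unsigned char* data, size_t len) {
    assert(data != nullptr || len == 0);
    std::vector<Peer> peers;
    for (size_t i = 0; i + 6 <= len; i += 6) {
        Peer p;
        p.ip = std::to_string(data[i]) + "." + std::to_string(data[i + 1]) + "." +
               std::to_string(data[i + 2]) + "." + std::to_string(data[i + 3]);
        // The port is big-endian (network byte order).
        p.port = static_cast<uint16_t>((data[i + 4] << 8) | data[i + 5]);
        peers.push_back(p);
    }
    return peers;
}
```

Each decoded entry is a candidate TCP endpoint for the handshake described next.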
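The HASH is obtained over UDP, but the torrent itself is downloaded over TCP. The remote side acts as a TCP server: we connect to it and directly download the torrent for that PEERID.
First read the protocol description at http://www.bittorrent.org/beps/bep_0010.html; we must handshake first.
This packet is simple to build: assemble it in the specified format and send it. The remote side then replies with its own handshake, which tells us which client software is offering the torrent.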
void CH31BTMgr::Encode_handshake()
{
    //a byte with value 19 (the length of the string that follows);
    //the UTF-8 string "BitTorrent protocol" (which is the same as in ASCII);
    //eight reserved bytes used to mark extensions;
    //the 20 bytes of the torrent info hash;
    //the 20 bytes of the peer ID.
    char btname[256];
    memset(btname,0,sizeof(btname));
    sprintf(btname,"BitTorrent protocol");
    char msg[1280];
    memset(msg,0,sizeof(msg));
    msg[0]=19;
    memcpy(&msg[1],btname,19);
    char ext[8];
    memset(ext,0,sizeof(ext));
    ext[5]=0x10;    // bit that announces support for the BEP 10 extension protocol
    memcpy(&msg[20],ext,8);
    memcpy(&msg[28],m_hash,20);
    memcpy(&msg[48],m_peer_id,20);
    int res1=Write(msg, 68);    // send the message over TCP
}
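The remote side answers with a handshake in the same 68-byte layout, and it is worth checking that reply before sending any extension message. Below is a minimal standalone sketch of building the handshake and validating the reply. The names build_handshake and handshake_supports_extensions are made up for this example and do not exist in the H31DHTDEMO sources.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr size_t kHandshakeLen = 68;  // 1 + 19 + 8 + 20 + 20

// Build the 68-byte handshake with the BEP 10 extension bit set.
std::array<unsigned char, kHandshakeLen> build_handshake(
        const unsigned char info_hash[20], const unsigned char peer_id[20]) {
    assert(info_hash != nullptr && peer_id != nullptr);
    std::array<unsigned char, kHandshakeLen> msg{};  // zero-filled
    msg[0] = 19;
    std::memcpy(&msg[1], "BitTorrent protocol", 19);
    msg[25] = 0x10;  // reserved byte 5 (offset 20 + 5): extension protocol
    std::memcpy(&msg[28], info_hash, 20);
    std::memcpy(&msg[48], peer_id, 20);
    return msg;
}

// Check a peer's reply: same protocol string, same info hash,
// and the extension-protocol bit set in the reserved bytes.
bool handshake_supports_extensions(const unsigned char* reply, size_t len,
                                   const unsigned char info_hash[20]) {
    if (len < kHandshakeLen || reply[0] != 19) return false;
    if (std::memcmp(reply + 1, "BitTorrent protocol", 19) != 0) return false;
    if (std::memcmp(reply + 28, info_hash, 20) != 0) return false;
    return (reply[25] & 0x10) != 0;
}
```

If the check fails, the peer cannot serve metadata through the extension protocol and the connection can be closed right away.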
After sending the handshake we can go on to send the request for the torrent data. This requires reading http://www.bittorrent.org/beps/bep_0009.html:
extension header
The metadata extension uses the extension protocol (specified in BEP 0010 ) to advertize its existence. It adds the "ut_metadata" entry to the "m" dictionary in the extension header hand-shake message. This identifies the message code used for this message. It also adds "metadata_size" to the handshake message (not the "m" dictionary) specifying an integer value of the number of bytes of the metadata.
Example extension handshake message:
{'m': {'ut_metadata', 3}, 'metadata_size': 31235}
extension message
The extension messages are bencoded. There are 3 different kinds of messages:
0 request
1 data
2 reject
The bencoded messages have a key "msg_type" which value is an integer corresponding to the type of message. They also have a key "piece", which indicates which part of the metadata this message refers to.
In order to support future extensability, an unrecognized message ID MUST be ignored.
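This requires bencode code. You can find some online and compile it; if you cannot get it working, leave your email address and I will send you the code, which was itself collected from the web.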
void CH31BTMgr::Encode_Ext_handshake()
{
    entry m;
    m["ut_metadata"] = 1;    // must be non-zero: BEP 10 treats 0 as "extension disabled"
    entry e;
    e["m"]=m;

    char msg[200];
    char* header = msg;
    char* p = &msg[6];
    int len = bencode(p, e);
    int total_size = 2 + len;    // message id + extended message id + payload
    namespace io = detail;
    io::write_uint32(total_size, header);
    io::write_uint8(20, header);    // 20 = extended message
    io::write_uint8(0, header);     // 0 = extended handshake
    int res1=Write(msg, len + 6);
}
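Because this handshake dictionary is tiny, it can also be written out by hand without a bencode library. The sketch below builds the same frame from plain strings. frame_extended and build_ext_handshake are names invented for this example, and the local id passed in is an arbitrary non-zero choice.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Prefix an extended-message payload with the BEP 10 framing:
// 4-byte big-endian length, message id 20, then the extended message id.
std::string frame_extended(uint8_t ext_id, const std::string& payload) {
    const uint32_t len = static_cast<uint32_t>(payload.size() + 2);
    std::string out;
    out.push_back(static_cast<char>((len >> 24) & 0xFF));
    out.push_back(static_cast<char>((len >> 16) & 0xFF));
    out.push_back(static_cast<char>((len >> 8) & 0xFF));
    out.push_back(static_cast<char>(len & 0xFF));
    out.push_back(static_cast<char>(20));
    out.push_back(static_cast<char>(ext_id));
    out += payload;
    return out;
}

// Extended handshake (extended id 0) announcing ut_metadata under local_id.
// Hand-written bencoding of {'m': {'ut_metadata': local_id}}.
std::string build_ext_handshake(int local_id) {
    assert(local_id > 0 && local_id < 256);
    const std::string dict =
        "d1:md11:ut_metadatai" + std::to_string(local_id) + "eee";
    return frame_extended(0, dict);
}
```

For example, build_ext_handshake(1) yields a 30-byte frame whose payload is d1:md11:ut_metadatai1eee.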
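If the remote side replies with msg_type 2, just give up: it has rejected your request.
If the reply is 1, it carries the data. Each piece is 16 KiB, except the last one, which may be smaller.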
data
The data message adds another entry to the dictionary, "total_size". This key has the same semantics as the "metadata_size" in the extension header. This is an integer.
The metadata piece is appended to the bencoded dictionary, it is not a part of the dictionary, but it is a part of the message (the length prefix MUST include it).
If the piece is the last piece of the metadata, it may be less than 16kiB. If it is not the last piece of the metadata, it MUST be 16kiB.
Example:
{'msg_type': 1, 'piece': 0, 'total_size': 3425}
d8:msg_typei1e5:piecei0e10:total_sizei34256eexxxxxxxx...
The x represents binary data (the metadata).
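The 16 KiB rule quoted above fixes how many pieces to expect and how long each one is. Here is a small sketch of that arithmetic; the function names are illustrative. Rounding up with an integer division avoids counting an extra empty piece when the metadata size is an exact multiple of 16 KiB, which the simpler size/16384 + 1 would do.

```cpp
#include <cassert>
#include <cstddef>

constexpr size_t kMetadataPieceSize = 16 * 1024;  // 16 KiB, per BEP 9

// Number of pieces needed for metadata_size bytes (rounds up).
size_t metadata_piece_count(size_t metadata_size) {
    return (metadata_size + kMetadataPieceSize - 1) / kMetadataPieceSize;
}

// Expected payload length of the given piece; 0 if the index is out of range.
size_t metadata_piece_length(size_t metadata_size, size_t piece) {
    const size_t offset = piece * kMetadataPieceSize;
    if (offset >= metadata_size) return 0;
    const size_t remaining = metadata_size - offset;
    return remaining < kMetadataPieceSize ? remaining : kMetadataPieceSize;
}
```

With the metadata_size of 31235 from the BEP example, this gives 2 pieces of 16384 and 14851 bytes.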
The code below shows how to request a particular piece:
void CH31BTMgr::write_metadata_packet(int type, int piece)
{
    ASSERT(type >= 0 && type <= 2);
    ASSERT(piece >= 0);

    entry e;
    e["msg_type"] = type;
    e["piece"] = piece;

    char const* metadata = 0;
    int metadata_piece_size = 0;

    if (type == 1)
    {
        // only requests (type 0) are issued in this article; 14132 is a hard-coded placeholder
        e["total_size"] = 14132;
        int offset = piece * 16 * 1024;
        //metadata = m_tp.metadata().begin + offset;
        metadata_piece_size = (std::min)(int(14132 - offset), 16 * 1024);
    }

    char msg[200];
    char* header = msg;
    char* p = &msg[6];
    int len = bencode(p, e);
    int total_size = 2 + len + metadata_piece_size;
    namespace io = detail;
    io::write_uint32(total_size, header);
    io::write_uint8(20, header);                 // 20 = extended message
    io::write_uint8(m_message_index, header);    // the ut_metadata id the remote side advertised
    int res1=Write(msg, len + 6);
}
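For a plain request the bencoded dictionary has only two integer keys, so the whole frame can be assembled from strings. The sketch below does that; build_metadata_request is a hypothetical name. The point it illustrates is that the extended message id must be the number the remote peer advertised for "ut_metadata", not the one we advertised ourselves.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build a complete ut_metadata request for one piece.
// remote_id is the value the peer listed for "ut_metadata" in its own
// extended handshake.
std::string build_metadata_request(uint8_t remote_id, int piece) {
    assert(piece >= 0);
    // Hand-written bencoding of {'msg_type': 0, 'piece': piece}.
    const std::string dict =
        "d8:msg_typei0e5:piecei" + std::to_string(piece) + "ee";
    const uint32_t len = static_cast<uint32_t>(dict.size() + 2);
    std::string out;
    for (int shift = 24; shift >= 0; shift -= 8)  // big-endian length prefix
        out.push_back(static_cast<char>((len >> shift) & 0xFF));
    out.push_back(static_cast<char>(20));         // extended message
    out.push_back(static_cast<char>(remote_id));  // peer's ut_metadata id
    return out + dict;
}
```

With the id 3 from the BEP 9 handshake example, build_metadata_request(3, 0) carries the payload d8:msg_typei0e5:piecei0ee.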
We can only request the next piece after the reply to the previous request has arrived. The code below shows how to parse such a reply:
// Process one complete packet
bool CH31BTMgr::DeCodeFrameData(char * buffer,int buflen)
{
    char * p = (char *)mhFindstr(buffer, buflen, "ut_metadatai", 12);
    if(p)
    {
        m_message_index=atoi(&p[12]);
        if(m_message_index==2)
        {
            return false;
        }
        write_metadata_packet(0,0);
        char filename[256];
        memset(filename,0,sizeof(filename));
        sprintf(filename,"%s\\torrent.txt",m_workPath);
        DelFile(filename);
    }
    p = (char *)mhFindstr(buffer, buflen, "metadata_sizei", 14);
    if(p)
    {
        m_metadata_size=atoi(&p[14]);
        m_fileCnt=(int)(m_metadata_size/16384)+1;
    }
    p = (char *)mhFindstr(buffer, buflen, "msg_typei", 9);
    if(p)
    {
        int type1=atoi(&p[9]);
        if(type1==1)
        {
            p = (char *)mhFindstr(buffer, buflen, "piecei", 6);
            if(p)
            {
                int piece=atoi(&p[6]);
                p = (char *)mhFindstr(buffer, buflen, "total_sizei", 11);
                if(p)
                {
                    int total_size=atoi(&p[11]);
                    p = (char *)mhFindstr(buffer, buflen, "ee", 2);
                    if(p)
                    {
                        // save the data
                        FILE* pfile=NULL;
                        char filename[256];
                        memset(filename,0,sizeof(filename));
                        sprintf(filename,"%s\\torrent.txt",m_workPath);
                        // binary mode, so that Windows does not translate line endings in the metadata
                        char openmethod[5]="ab";
                        if(piece==0)
                            sprintf(openmethod,"wb");
                        if((pfile=fopen(filename,openmethod))!=NULL)
                        {
                            if((piece+1)*16*1024<total_size)
                            {
                                fseek(pfile,(piece)*16*1024,SEEK_SET);
                                fwrite(&p[2],1,16*1024,pfile);
                                write_metadata_packet(0,piece+1);
                                fclose(pfile);
                            }
                            else
                            {
                                fwrite(&p[2],1,total_size-(piece)*16*1024,pfile);
                                fclose(pfile);
                                ManageTorrentFileToRealFile(filename);
                            }
                        }
                    }
                }
            }
        }
        else if(type1==2)
        {
            return false;
        }
    }
    return true;
}
void * mhFindstr(const void *haystack, size_t haystacklen,const void *needle, size_t needlelen)
{
    const char *h =(const char *) haystack;
    const char *n =(const char *) needle;
    size_t i;
    /* size_t is unsigned */
    if(needlelen > haystacklen)
        return NULL;
    for(i = 0; i <= haystacklen - needlelen; i++) {
        if(memcmp(h + i, n, needlelen) == 0)
            return (void*)(h + i);
    }
    return NULL;
}
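Searching for "ee" only finds the end of the dictionary when the dictionary ends in an integer. A more robust alternative is to walk the bencoded value and take whatever follows it as the piece. This is a minimal sketch of that approach, not the author's code: bencode_skip and split_metadata_message are names invented here, and the skipper only locates value boundaries without fully validating integers.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

const size_t kBad = std::string::npos;

// Return the index one past the bencoded value starting at pos,
// or kBad if the data is malformed or truncated.
size_t bencode_skip(const std::string& s, size_t pos, int depth = 0) {
    if (pos >= s.size() || depth > 32) return kBad;
    const char c = s[pos];
    if (c == 'i') {                                   // integer: i<digits>e
        const size_t e = s.find('e', pos + 1);
        return e == kBad ? kBad : e + 1;
    }
    if (c == 'l' || c == 'd') {                       // list/dict: items until 'e'
        ++pos;
        while (pos < s.size() && s[pos] != 'e') {
            pos = bencode_skip(s, pos, depth + 1);
            if (pos == kBad) return kBad;
        }
        return pos < s.size() ? pos + 1 : kBad;
    }
    if (std::isdigit(static_cast<unsigned char>(c))) { // string: <len>:<bytes>
        size_t n = 0;
        while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
            n = n * 10 + static_cast<size_t>(s[pos] - '0');
            if (n > s.size()) return kBad;            // longer than the buffer
            ++pos;
        }
        if (pos >= s.size() || s[pos] != ':') return kBad;
        ++pos;
        return n <= s.size() - pos ? pos + n : kBad;
    }
    return kBad;
}

// Split a ut_metadata message (without the 6-byte frame header) into its
// bencoded dictionary and the raw metadata piece that follows it.
bool split_metadata_message(const std::string& msg, std::string& header,
                            std::string& piece) {
    if (msg.empty() || msg[0] != 'd') return false;
    const size_t end = bencode_skip(msg, 0);
    if (end == kBad) return false;
    header = msg.substr(0, end);
    piece = msg.substr(end);
    return true;
}
```

The piece returned this way can be appended to the output file no matter what bytes it starts with.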
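When I first debugged this, I naively waited for data to arrive from the DHT network, which takes a long time. During debugging the remote side either never replied or rejected the request. After a while, and after asking friends without getting a useful answer, it turned out that the protocol messages had not been built correctly. Here is a summary of the points to watch: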
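1. You must have received the other side's PEERID before you can talk to it; otherwise it will certainly ignore you.
2. Do not debug your protocol messages against the public network. It is best to download the mono-monotorrent source code, debug and analyze it, and run a server locally.
3. By debugging locally against mono-monotorrent you can work out where things go wrong and whether some part of the protocol is packed incorrectly.
4. A torrent downloaded through the DHT network is certainly the newest; it may not even be available for WEB download yet.
5. Torrents downloaded through this protocol do not seem to contain announce-list. I do not know why that content is missing; perhaps some key part is not being downloaded. Judging from the mono-monotorrent code, it simply is not served. (BEP 9 defines the metadata as the info dictionary of the .torrent file only, and announce-list lives outside that dictionary.) Pointers from experts are welcome.
6. The TCPClient receive buffer should be larger than 16K to make processing easier; of course, reassembling packets across reads is even better.
7. If you need the C++ bencode code, leave a comment here or email h31h31#163.com.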
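The packet reassembly mentioned in point 6 can be sketched as follows, using only the fact that every peer message after the 68-byte handshake starts with a 4-byte big-endian length. FrameAssembler is a hypothetical helper, not a class from the H31DHTDEMO sources, and it assumes the handshake bytes have already been consumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Collects raw TCP bytes and yields complete length-prefixed peer
// messages, with the 4-byte prefix stripped.
class FrameAssembler {
public:
    // Append whatever the socket delivered, in any chunk size.
    void feed(const char* data, size_t len) { buf_.append(data, len); }

    // Pop one complete message into out; false if more bytes are needed.
    bool next(std::string& out) {
        if (buf_.size() < 4) return false;
        uint32_t len = 0;
        for (int i = 0; i < 4; ++i)  // big-endian length prefix
            len = (len << 8) | static_cast<unsigned char>(buf_[i]);
        if (buf_.size() - 4 < len) return false;
        out = buf_.substr(4, len);
        buf_.erase(0, 4 + static_cast<size_t>(len));
        return true;
    }

private:
    std::string buf_;
};
```

A receive loop would call feed() after every read and then call next() until it returns false, passing each message on to the parser. A production version would also reject implausibly large length values before buffering.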
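If this article is hard to follow, read the earlier articles first and debug through the code; coming back to this one afterwards should make more sense.
I hope readers who know this area will discuss and improve it together. Leave a comment here to learn and discuss.
Please recommend this article... your recommendations are the motivation for the next one...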