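PHP's curl support really is powerful. It includes curl_multi_init, which handles a batch of transfers together. With it you can fetch many pages concurrently from a single PHP process and speed up an ordinary web scraper.
A simple fetch function: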
function http_get_multi($urls){
    $count = count($urls);
    $data = [];
    $chs = [];

    // Create the cURL multi handle
    $mh = curl_multi_init();

    // Create one cURL handle per URL, set its options, and add it to the multi handle
    for($i = 0; $i < $count; $i ++){
        $chs[ $i ] = curl_init();
        curl_setopt($chs[ $i ], CURLOPT_RETURNTRANSFER, 1); // return the response instead of printing it
        curl_setopt($chs[ $i ], CURLOPT_URL, $urls[$i]);
        curl_setopt($chs[ $i ], CURLOPT_HEADER, 0);
        curl_multi_add_handle($mh, $chs[ $i ]);
    }

    // Start the transfers
    $active = 0;
    do {
        $mrc = curl_multi_exec($mh, $active);
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);

    // Wait for socket activity and keep driving the transfers until all have finished
    while ($active and $mrc == CURLM_OK) {
        if (curl_multi_select($mh) == -1) {
            usleep(1000); // select failed; pause briefly instead of spinning
        }
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }

    // Collect the response body of each handle
    for($i = 0; $i < $count; $i ++){
        $content = curl_multi_getcontent($chs[ $i ]);
        $data[ $i ] = ( curl_errno($chs[ $i ]) == 0 ) ? $content : false;
    }

    // Remove every handle, then close the multi handle
    for($i = 0; $i < $count; $i ++){
        curl_multi_remove_handle($mh, $chs[ $i ]);
    }
    curl_multi_close($mh);

    return $data;
}
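When transfers run through the multi interface, curl_multi_info_read() is the call that reports the result code of each finished transfer. The following is a minimal sketch of a variant that uses it to record why a request failed. The function name http_get_multi_checked() and its $errors parameter are hypothetical, introduced only for this illustration.

```php
<?php
// Hypothetical variant of http_get_multi(): returns the body (or false) per URL
// and fills $errors with a message for each transfer that failed.
function http_get_multi_checked(array $urls, array &$errors = []) {
    $errors = [];
    $data = [];
    $chs = [];
    $mh = curl_multi_init();

    foreach ($urls as $i => $url) {
        $chs[$i] = curl_init();
        curl_setopt($chs[$i], CURLOPT_URL, $url);
        curl_setopt($chs[$i], CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($chs[$i], CURLOPT_HEADER, 0);
        curl_multi_add_handle($mh, $chs[$i]);
        $data[$i] = false; // stays false unless the transfer succeeds
    }

    // Drive all transfers until none is still active
    $active = 0;
    do {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        if ($active && $mrc == CURLM_OK && curl_multi_select($mh, 1.0) == -1) {
            usleep(1000); // select failed; pause briefly instead of spinning
        }
    } while ($active && $mrc == CURLM_OK);

    // Read one message per finished transfer and check its result code
    while (($info = curl_multi_info_read($mh)) !== false) {
        $i = array_search($info['handle'], $chs, true);
        if ($i === false) {
            continue;
        }
        if ($info['result'] === CURLE_OK) {
            $data[$i] = curl_multi_getcontent($chs[$i]);
        } else {
            $errors[$i] = curl_strerror($info['result']);
        }
    }

    foreach ($chs as $ch) {
        curl_multi_remove_handle($mh, $ch);
    }
    curl_multi_close($mh);

    return $data;
}
```

A call such as `$data = http_get_multi_checked($urls, $errors);` leaves `$errors` keyed by the same indexes as `$urls`, so a failed page can be logged or retried individually.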
The test script below calls it (the get() function is defined here: http://www.cnblogs.com/whatmiss/p/7114954.html):
// Build a fairly long list of page URLs
$url = [
    'http://www.baidu.com',
    'http://www.163.com',
    'http://www.sina.com.cn',
    'http://www.qq.com',
    'http://www.sohu.com',
    'http://www.douban.com',
    'http://www.cnblogs.com',
    'http://www.taobao.com',
    'http://www.php.net',
];
$urls = [];
for($i = 0; $i < 10; $i ++){
    foreach($url as $r) $urls[] = $r . '/?v=' . rand();
}

// Concurrent requests
$t1 = microtime(true);
$datas = http_get_multi($urls);
foreach($datas as $key => $data){
    file_put_contents('log/multi_' . $key . '.txt', $data); // save each response; remember to create the log directory first
}
$t2 = microtime(true);
echo $t2 - $t1;
echo '<br />';

// Sequential (synchronous) requests; get() is defined here: http://www.cnblogs.com/whatmiss/p/7114954.html
$t1 = microtime(true);
foreach($urls as $key => $url){
    file_put_contents('log/get_' . $key . '.txt', get($url)); // save each response; remember to create the log directory first
}
$t2 = microtime(true);
echo $t2 - $t1;
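Test results, in seconds. The difference is obvious, and it widens as the number of URLs grows; in these runs the concurrent version was roughly 3 to 9 times faster: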
Run   Concurrent (s)      Sequential (s)
1     2.4481401443481     21.68923997879
2     8.925509929657      24.73141503334
3     3.243185043335      23.384337902069
4     3.2841880321503     24.754415035248
5     3.2091829776764     29.068662881851
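To put the numbers in perspective, the ratio of sequential time to concurrent time can be computed per run. The sketch below assumes the ten values form five runs, each printed as concurrent time followed by sequential time, which is the order in which the test script echoes them. The helper name speedups() is made up for this example.

```php
<?php
// Each entry: [concurrent seconds, sequential seconds], as printed by the test script
$runs = [
    [2.4481401443481, 21.68923997879],
    [8.925509929657, 24.73141503334],
    [3.243185043335, 23.384337902069],
    [3.2841880321503, 24.754415035248],
    [3.2091829776764, 29.068662881851],
];

// Hypothetical helper: how many times faster the concurrent version was in each run
function speedups(array $runs): array {
    return array_map(function ($run) {
        return $run[1] / $run[0];
    }, $runs);
}

foreach (speedups($runs) as $n => $ratio) {
    printf("run %d: %.1fx faster\n", $n + 1, $ratio);
}
```

The ratio varies noticeably between runs, which is expected when the timings depend on live network conditions.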
References, with thanks to the original authors:
http://php.net/manual/zh/function.curl-multi-init.php
http://www.tuicool.com/articles/auiEBb
http://blog.csdn.net/liylboy/article/details/39669963 (this article covers a possible timeout problem)
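On the subject of timeouts: one common way to keep a slow host from stalling the whole batch is to set a connect timeout and a total timeout on each handle before adding it to the multi handle. This is a minimal sketch, not necessarily the approach taken in the article above. The helper name make_handle() and the values of 5 and 15 seconds are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: build a cURL handle with timeouts already applied.
// The default timeout values are arbitrary and should be tuned per use case.
function make_handle(string $url, int $connectTimeout = 5, int $totalTimeout = 15) {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the response instead of printing it
        CURLOPT_HEADER         => false,
        CURLOPT_CONNECTTIMEOUT => $connectTimeout, // seconds allowed for establishing the connection
        CURLOPT_TIMEOUT        => $totalTimeout,   // seconds allowed for the whole transfer
    ]);
    return $ch;
}
```

Inside http_get_multi(), the three curl_setopt() calls in the loop could then be replaced with `$chs[$i] = make_handle($urls[$i]);` before the handle is passed to curl_multi_add_handle().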
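Separately, there is an article claiming that the multi-threaded approach is no faster, and is even slightly slower. I find that odd and do not see how such a result could come about: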