A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web wanderer) is a program or script that automatically crawls World Wide Web content according to a given set of rules. Other, less common names include ant, automatic indexer, emulator, and worm.
Swoole is an asynchronous, parallel, high-performance network communication engine for PHP, written in pure C. It provides PHP with an asynchronous multi-threaded server, asynchronous TCP/UDP network clients, asynchronous MySQL and Redis clients, database connection pooling, AsyncTask, message queues, millisecond timers, asynchronous file I/O, and asynchronous DNS lookup. Swoole also ships with a built-in HTTP/WebSocket server and client, plus an HTTP/2 server.
Originally the company's plan was simply to write a few PHP scripts and schedule them with Linux crontab. But on reflection: this wasn't a period of frequent business changes, and moving to services was inevitable sooner or later, so I set off on my pit-filled journey. PHP does have a mature framework for asynchronous, memory-resident services, workerman. It's not that I rejected it; but since we had already decided on a new approach and were under no deadline pressure, why not take on a new technology?
Since the company had no in-house expert with a background in optimizing raw TCP connections, we went with the plain choice: the HTTP protocol. Swoole natively supports an HTTP server (see the official docs), and starting one is simple:
```php
$serv = new Swoole\Http\Server("127.0.0.1", 9502);

$serv->on('Request', function ($request, $response) {
    var_dump($request->get);
    var_dump($request->post);
    var_dump($request->cookie);
    var_dump($request->files);
    var_dump($request->header);
    var_dump($request->server);

    $response->cookie("User", "Swoole");
    $response->header("X-Server", "Swoole");
    $response->end("<h1>Hello Swoole!</h1>");
});

$serv->start();
```
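With that script running, a quick smoke test from a second terminal (sketched here in PHP rather than curl; the query string is arbitrary) should print the greeting:

```php
// Fetch the hello page from the server started above.
var_dump(file_get_contents('http://127.0.0.1:9502/?hello=swoole'));
// string(22) "<h1>Hello Swoole!</h1>"
```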
According to the official Swoole documentation, the HTTP server has several commonly used events:
request, packet, pipeMessage, task, finish, receive, close, workerStart, workerStop, shutdown
Looking closely: in typical usage, Swoole already forwards the receive event to the request event automatically, so for the simplest possible example we can just fire up a crawler straight from the request handler.
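As a minimal sketch of that idea, before the fuller DI-based version below (the seed URL is an illustrative assumption, and the Crawler class is developed later in this article):

```php
// Smallest form: start a crawl directly from the request event.
$serv = new Swoole\Http\Server('127.0.0.1', 9502);
$serv->on('request', function ($request, $response) {
    $crawler = new \Library\Crawler\Crawler('https://www.swoole.com/'); // seed URL is an assumption
    $crawler->visitOneDegree();
    $response->end('crawl started');
});
$serv->start();
```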
```php
class SwooleHttpServer implements Server
{
    const EVENT = [
        'request' //, 'packet', 'pipeMessage', 'task', 'finish', 'close'
    ];

    protected $server;
    protected $event = [];

    // Note: this relies on the PHP DI implementation from my previous article.
    public function __construct(Config $config)
    {
        $server = $config->get('server');
        if (empty($server)) {
            throw new \Exception('config not found');
        }
        $this->server = new \swoole_http_server($server['host'], $server['port'], $server['mode'], $server['type']);

        $extend = $config->get('event')['namespace'] ?? '';
        foreach (self::EVENT as $event) {
            // Prefer a user-supplied event class; fall back to the built-in one.
            $class = $extend.'\\'.ucfirst($event);
            if (!class_exists($class)) {
                $class = '\\Kernel\\Swoole\\Event\\Http\\'.ucfirst($event);
            }
            /* @var \Kernel\Swoole\Event\Event $callback */
            $callback = new $class($this->server);
            $this->event[$event] = $callback;
            $this->server->on($event, [$callback, 'doEvent']);
        }
        $this->server->set($config->get('swoole'));
    }

    public function start(\Closure $callback = null): Server
    {
        if (!is_null($callback)) {
            $callback();
        }
        $this->server->start();
        return $this;
    }

    public function shutdown(\Closure $callback = null): Server
    {
        // TODO: Implement shutdown() method.
    }

    public function close($fd, $fromId = 0): Server
    {
        $this->server->close($fd, $fromId);
        return $this;
    }
}
```
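For completeness, here is a minimal sketch of how this class might be bootstrapped. The Config constructor and the exact configuration keys are assumptions inferred from the constructor above (and the DI article it mentions), not code from the original:

```php
// Hypothetical bootstrap; the config keys mirror what the constructor reads.
$config = new Config([
    'server' => [
        'host' => '0.0.0.0',
        'port' => 9502,
        'mode' => SWOOLE_PROCESS,   // standard Swoole process mode
        'type' => SWOOLE_SOCK_TCP,
    ],
    'event'  => ['namespace' => '\\App\\Event\\Http'], // optional user overrides
    'swoole' => ['worker_num' => 5, 'daemonize' => 0],
]);

(new SwooleHttpServer($config))->start(function () {
    echo "http server listening on 0.0.0.0:9502\n";
});
```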
To keep the business logic easy to write, we decouple the request event into a class of its own:
```php
namespace Kernel\Swoole\Event\Http;

use Kernel\Swoole\Event\Event;
use Kernel\Swoole\Event\EventTrait;
use Library\Crawler\Crawler;

class Request implements Event
{
    use EventTrait;

    /* @var \swoole_http_server $server */
    protected $server;
    protected $data = [];

    public function __construct(\swoole_http_server $server)
    {
        $this->server = $server;
    }

    public function doEvent(\swoole_http_request $request, \swoole_http_response $response)
    {
        // Answer the browser's automatic favicon request without crawling.
        if (isset($request->server['request_uri']) and $request->server['request_uri'] == '/favicon.ico') {
            $response->end(json_encode(empty($this->data) ? ['code' => 0] : $this->data));
            return;
        }
        // The seed URL here is illustrative; in practice it could come from the request.
        $crawler = new Crawler($request->get['url'] ?? 'https://www.swoole.com/');
        $crawler->visitOneDegree();
        $response->end(json_encode(['code' => 0]));
    }
}
```
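The Event interface and EventTrait referenced above are not shown in the original article. A plausible reconstruction (an assumption, not the author's actual code) is a marker interface, since each concrete event handler needs a different doEvent() signature:

```php
namespace Kernel\Swoole\Event;

// Assumed shape: a marker interface, because PHP would reject a fixed
// doEvent() declaration here (each event's callback takes different arguments).
interface Event
{
}
```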
Now, hitting the server's IP and port will kick off our crawler class:
```php
namespace Library\Crawler;

class Crawler
{
    private $url;
    private $toVisit = [];

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function visitOneDegree()
    {
        // Fetch the seed page, collect its links, then visit each of them once.
        $this->visit($this->url, function ($content) {
            $this->loadPage($content);
            $this->visitAll();
        });
    }

    private function loadPage($content)
    {
        // Collect candidate URLs, skipping common image extensions.
        $pattern = '#(http|ftp|https)://?([a-z0-9_-]+\.)+(com|net|cn|org){1}(\/[a-z0-9_-]+)*\.?(?!:jpg|jpeg|gif|png|bmp)(?:")#i';
        preg_match_all($pattern, $content, $matched);
        foreach ($matched[0] as $url) {
            if (in_array($url, $this->toVisit)) {
                continue;
            }
            $this->toVisit[] = rtrim($url, '"');
            file_put_contents('urls', $url."\r\n", FILE_APPEND);
        }
    }

    private function visitAll()
    {
        foreach ($this->toVisit as $url) {
            $this->visit($url);
        }
    }

    private function visit($url, $callback = null)
    {
        $urlInfo = parse_url($url);
        \Swoole\Async::dnsLookup($urlInfo['host'], function ($domainName, $ip) use ($urlInfo, $url, $callback) {
            if ($domainName == '' or $ip == '') {
                return;
            }
            if (!isset($urlInfo['port'])) {
                $urlInfo['port'] = ($urlInfo['scheme'] == 'https') ? 443 : 80;
            }
            if ($urlInfo['scheme'] == 'https') {
                $cli = new \swoole_http_client($ip, $urlInfo['port'], true);
            } else {
                $cli = new \swoole_http_client($ip, $urlInfo['port']);
            }
            $cli->setHeaders([
                'Host' => $domainName,
                'User-Agent' => 'Chrome/49.0.2587.3',
                'Accept' => 'text/html,application/xhtml+xml,application/xml',
                'Accept-Encoding' => 'gzip',
            ]);
            $cli->get($urlInfo['path'] ?? '/', function ($cli) use ($url, $callback) {
                $data = $this->getMeta($cli->body);
                // todo: write $data to the database
                if (!is_null($callback)) {
                    $callback($cli->body);
                }
                $cli->close();
            });
        });
    }

    private function getMeta(string $data)
    {
        $meta = [];
        // ......
        return $meta;
    }
}
```
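For a quick test outside the HTTP server, the crawler can also be driven from a CLI script. This is a sketch assuming Swoole 1.x/2.x async clients, which keep the event loop alive after the script body finishes; the autoload path and seed URL are illustrative assumptions:

```php
// crawl.php — run with: php crawl.php
require __DIR__.'/vendor/autoload.php'; // assumes PSR-4 autoloading of Library\Crawler

$crawler = new \Library\Crawler\Crawler('https://www.swoole.com/');
$crawler->visitOneDegree(); // async calls run until the event loop drains
```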
With that, a simple crawler is ready to use. But a problem surfaced:
Swoole's worker count is bounded (it is sized relative to the CPU), so once every worker is occupied, no further requests get served. And my crawler here is, from the worker's point of view, clearly synchronous code. For example: if worker_num starts at 5, then at most 5 crawler tasks can run at the same time; trying to start more right away will simply fail [Swoole cannot assign any more worker processes to the request]. Each occupied worker is effectively stuck in:
```php
while (true) {
    if ($condition) {
        break;
    }
}
```
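A minimal sketch of the failure mode, with sleep() standing in for the long-running crawl (the 30-second figure is arbitrary):

```php
// With worker_num = 5, five concurrent slow requests occupy every worker;
// a sixth request has to queue until one of them finishes.
$serv = new Swoole\Http\Server('127.0.0.1', 9502);
$serv->set(['worker_num' => 5]);
$serv->on('request', function ($request, $response) {
    sleep(30); // simulates a blocking crawl that holds the worker
    $response->end('done');
});
$serv->start();
```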
The official process diagram (not reproduced here) shows the model: a master process with its reactor threads, a manager process, and a fixed pool of worker processes.
The number of workers cannot be scaled dynamically. Drawing on basic operating-systems ideas, we can put it this way: suppose I have 3 baskets (the workers, i.e. the producers). Each arrival takes a basket away to fill with goods (a request, i.e. the consumer), but the baskets stay in use (while(true)) and are never returned. When a 4th person comes for a basket (a request), none can be handed out, and they can only wait; if the earlier holders never release theirs (the while(true) never exits), we end up in what amounts to a deadlock.
So a further round of optimization is needed. In the next article I will cover how to optimize Swoole's processes, which is also the deepest pit of the whole journey [swoole task].