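Advantage: powerful, jQuery-like DOM searching.
pq() is a powerful DOM-search method that works just like jQuery's $(). Almost every jQuery selector can be used in phpQuery; just replace "." with "->" when chaining calls. A demo follows (Demo5 in my GitHub repository):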
```php
<?php
require('phpQuery/phpQuery.php');

phpQuery::newDocumentFile('http://www.baidu.com/');

$menu_a = pq("a");

foreach ($menu_a as $a) {
    echo pq($a)->html() . "<br>";
}

foreach ($menu_a as $a) {
    echo pq($a)->attr("href") . "<br>";
}
?>
```
Advantage: relatively strong URL filtering.
The official demo is as follows (demo4 in my GitHub repository):
```php
<?php
include("PHPCrawl/libs/PHPCrawler.class.php");

class MyCrawler extends PHPCrawler
{
    function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
    {
        // As example we just print out the URL of the document
        echo $PageInfo->url . "<br>";
    }
}

$crawler = new MyCrawler();
$crawler->setURL("www.baidu.com");
$crawler->addURLFilterRule("#\.(jpg|gif)$# i"); // filter out URLs that end in these image extensions
$crawler->go();
?>
```
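The filter rule in the demo above is an ordinary PCRE pattern, so it can be tried out on its own before starting a crawl. The following is a minimal sketch, assuming the rule is tested against the full URL string; `matchesImageFilter()` and the sample URLs are hypothetical and are not part of PHPCrawl.

```php
<?php
// Hypothetical helper, not part of PHPCrawl: reports whether a URL
// matches the image-extension pattern used in the filter rule.
function matchesImageFilter(string $url): bool
{
    // Same pattern as the demo, with the "i" modifier written without the space.
    return preg_match('#\.(jpg|gif)$#i', $url) === 1;
}

// Made-up sample URLs, for illustration only.
$samples = array(
    'http://www.baidu.com/img/logo.gif',
    'http://www.baidu.com/img/PHOTO.JPG',
    'http://www.baidu.com/index.html',
    'http://www.baidu.com/pic.jpg?size=2',
);

foreach ($samples as $url) {
    echo $url . ' => ' . (matchesImageFilter($url) ? 'filtered' : 'kept') . "\n";
}
```

Because the pattern is anchored with `$`, a URL that carries a query string after the extension (the last sample) is not matched; if such URLs should be skipped too, the pattern would need to be loosened.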
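Advantage: form submission, proxy configuration, and more.
Snoopy is a PHP class that simulates a web browser. It can fetch page content and submit forms.
The demo is as follows (demo3 in my GitHub repository):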
```php
<?php
include 'Snoopy/Snoopy.class.php';

$snoopy = new Snoopy();
$url = "http://www.baidu.com";

// $snoopy->fetch($url);
// $snoopy->fetchtext($url); // strip HTML tags and other irrelevant data
$snoopy->fetchform($url);    // fetch only the forms

// Return only the links in the page. By default, relative links are
// automatically expanded into full URLs.
// $snoopy->fetchlinks($url);

var_dump($snoopy->results);
```
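Advantage: ships with an installer that configures a database.
It provides an installation and configuration step and can connect directly to a MySQL database. It is also fairly widely used, but we will not cover it separately here.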
```php
<?php
$opts = array(
    'http' => array(
        'method' => "GET",
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);

$context = stream_context_create($opts);

/* Sends an http request to www.example.com
   with additional headers shown above */
$fp = fopen('http://www.example.com', 'r', false, $context);
fpassthru($fp);
fclose($fp);
?>
```
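```php
<?php
$url = "http://www.baidu.com/"; // assumed target, chosen to match the Host header below

$ch = curl_init();                   // initialize a cURL session
curl_setopt($ch, CURLOPT_URL, $url); // set the URL to fetch
// Make curl_exec() return the response body instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

// Set browser-specific request headers
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    "Host: www.baidu.com",
    "Connection: keep-alive",
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Upgrade-Insecure-Requests: 1",
    "DNT:1",
    "Accept-Language: zh-CN,zh;q=0.8,en-GB;q=0.6,en;q=0.4,en-US;q=0.2",
    "Cookie:_za=4540d427-eee1-435a-a533-66ecd8676d7d;"
));

$result = curl_exec($ch); // execute the cURL session
```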
An example of our own:
form-demo.html
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>form-demo</title>
</head>
<body>
    <form action="./form-demo.php" method="post">
        Username: <input type="text" name="userName"><br>
        Password: <input type="password" name="password"><br>
        <input type="submit">
    </form>
</body>
</html>
```
form-demo.php
```php
<?php
$userName = $_POST['userName'];
$password = $_POST['password'];

if ($userName === "admin" && $password === "admin") {
    echo "hello admin";
} else {
    echo "login error";
}
?>
```

Submitting the form:

```php
<?php
include 'Snoopy/Snoopy.class.php';

$snoopy = new Snoopy();

// The keys must match the name attributes of the form fields,
// which are also the keys the server reads from $_POST.
$formvars["userName"] = "admin";
$formvars["password"] = "admin";
$action = "http://localhost:8000/spider/demo3/form-demo.php"; // form submission URL

$snoopy->submit($action, $formvars);
echo $snoopy->results;
?>
```
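For comparison, the same POST can be sketched with PHP's cURL extension instead of Snoopy. This is only an illustration under stated assumptions, not the approach used in the demo repository: `buildFormBody()` and `submitForm()` are hypothetical helper names, and the script only prints the encoded body, so it makes no network request by itself.

```php
<?php
// Hypothetical helper: encodes form fields as an
// application/x-www-form-urlencoded request body.
function buildFormBody(array $formvars): string
{
    // Pass "&" explicitly so the result does not depend on php.ini settings.
    return http_build_query($formvars, '', '&');
}

// Hypothetical helper: POSTs the fields to $action with cURL and
// returns the response body, or false on failure.
function submitForm(string $action, array $formvars)
{
    $ch = curl_init($action);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildFormBody($formvars));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    return curl_exec($ch);
}

// Field names must match the name attributes in form-demo.html.
$formvars = array("userName" => "admin", "password" => "admin");
echo buildFormBody($formvars), "\n";
```

With the demo server running, calling `submitForm("http://localhost:8000/spider/demo3/form-demo.php", $formvars)` should return the text that form-demo.php echoes for a successful login.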
Problem 1: "openssl extension required for HTTPS". Enable HTTPS support:
php.ini ==> ;extension=php_openssl.dll (uncomment this line by removing the leading semicolon)
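After editing the configuration and restarting the web server, a quick check can confirm that the extension named in the error message is actually loaded. This is a small sketch; `opensslLoaded()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: true when the openssl extension is loaded
// in the PHP process that runs this script.
function opensslLoaded(): bool
{
    return extension_loaded('openssl');
}

echo opensslLoaded()
    ? "openssl is loaded\n"
    : "openssl is not loaded; check php.ini and restart the web server\n";
```

Note that the command-line PHP and the web server's PHP may read different php.ini files, so it is safest to run the check in the same environment as the crawler.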
Problem 2: 405 Not Allowed. Add the following:
```php
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)"; // spoof the browser User-Agent
$snoopy->referer = "http://www.icultivator.com";          // spoof the referring page (HTTP Referer)
$snoopy->rawheaders["Pragma"] = "no-cache";               // cache-related HTTP header
$snoopy->rawheaders["X_FORWARDED_FOR"] = "122.0.74.166";  // spoof the client IP
```
Problem 3: using a proxy with Snoopy
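```php
// HTTPS connections over proxy are currently not supported
$snoopy->proxy_host = "www.icultivator.com"; // proxy host name only; the original value carried an http:// prefix
$snoopy->proxy_port = "8080";  // proxy port
$snoopy->maxredirs = 2;        // maximum number of redirects to follow
$snoopy->expandlinks = true;   // expand relative links into full URLs; often needed when scraping
$snoopy->maxframes = 5;        // maximum number of frames allowed
```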
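Remaining problems:
When I actually tried submitting forms to real websites, it did not work. Handling as simple as the above is not enough to submit those forms, and using a proxy has problems too. The Snoopy framework does have quite a few issues; I will come back to this once I have a workable solution.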