PHP Crawler Study Notes 1: PHP Simple HTML DOM Parser

Date: 2019-11-10

Commonly used crawlers.

0.Snoopy

What is Snoopy? (download Snoopy)

Snoopy is a PHP class that imitates the behavior of a web browser. It can fetch web page content and submit forms.

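Some of Snoopy's features:

* Fetches the content of a web page easily

* Fetches the text of a web page easily (with HTML tags stripped)

* Fetches the links in a web page easily

* Supports proxy hosts

* Supports basic username/password authentication

* Supports setting user_agent, referer, cookies, and header content

* Supports browser redirects, with control over the redirect depth

* Expands links in a page into fully qualified URLs (the default)

* Submits data and retrieves the response easily

* Supports following HTML frames (added in v0.92)

* Supports passing cookies on redirects (added in v0.92)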
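
Two of the features listed above, plain-text extraction and link extraction, can be illustrated without Snoopy. The following is only a rough sketch that uses PHP built-ins (strip_tags and the bundled DOM extension) on an inline HTML string. It is not Snoopy's API, and the helper names page_text and page_links are made up for this example.

```php
<?php
// Sketch only: PHP built-ins, not Snoopy. Helper names are hypothetical.

function page_text(string $html): string
{
    // strip_tags() drops the markup; the regex collapses runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', strip_tags($html)));
}

function page_links(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings that real-world, imperfect HTML tends to trigger.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

$html = '<html><body><h1>News</h1> <p>Read <a href="/a.html">A</a> and '
      . '<a href="http://example.com/b">B</a>.</p></body></html>';

echo page_text($html), "\n";   // News Read A and B.
print_r(page_links($html));    // the two href values, in document order
```

Note that the relative link /a.html is returned as written. Expanding it into a full URL, as Snoopy is described as doing, would need the base URL of the page.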

1.PHP Simple HTML DOM Parser

2.OpenWebSpider

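OpenWebSpider is an open-source, multi-threaded web spider (also called a robot or crawler) and a search engine with many interesting features.

License: unknown
Language: PHP
Operating system: cross-platform

Features: open-source multi-threaded web crawler with many interesting features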

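3.PhpDig

PhpDig is a web crawler and search engine written in PHP. It builds a vocabulary by indexing both dynamic and static pages. When a search query is run, it displays a results page containing the keywords, ordered by a defined set of ranking rules. PhpDig includes a template system and can index PDF, Word, Excel, and PowerPoint documents. It is suited to more specialized, deeper, customized search engines, and is a strong choice for building a vertical search engine for a particular field.

Demo: http://www.phpdig.net/navigation.php?action=demo

License: GPL
Language: PHP
Operating system: cross-platform

Features: collects web page content and submits forms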

4.ThinkUp

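ThinkUp is a social media insights engine that collects data from social networks such as Twitter and Facebook. It is an interactive analysis tool that archives and processes the data gathered from a person's social network accounts, and charts that data so it is easier to read.

License: GPL
Language: PHP
Operating system: cross-platform

GitHub source: https://github.com/ThinkUpLLC/ThinkUp

Features: a social media insights engine that collects data from Twitter, Facebook, and other social networks, supports interactive analysis, and presents the results visually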

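5.微購 (Weigou)

微購 is an open-source social shopping and sharing system built on the ThinkPHP framework. It is also an open-source Taobao affiliate (淘寶客) site program aimed at webmasters. It integrates product data collection interfaces for more than 300 sources, including Taobao, Tmall, and Taobao affiliate programs, and gives affiliate webmasters a turnkey site-building service. Anyone who knows HTML can write templates for it, and it is free to download. The project describes itself as the first choice for affiliate webmasters.

Demo: http://tlx.wego360.com

License: GPL

Language: PHP

Operating system: cross-platform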

6.phpQuery - jQuery port to PHP
https://github.com/TobiaszCudnik/phpquery
http://querylist.cc/

7.Ganon - Fast (HTML DOM) parser written in PHP
https://github.com/Shemahmforash/Ganon

///////////////////////////////////////////////////////////////////////////////////////////////////////////////

<?php

//PHP Simple HTML DOM Parser Manual
require 'E:\wamp\www\php-simple-html-dom-parser-1.5.0\Src\Sunra\PhpSimple\simplehtmldom_1_5\simple_html_dom.php';

// Get elements ********************************//
/*

$html = file_get_html('http://www.baidu.com/');

// Find all images and print their src URLs
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

echo "22222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222";
// Find all links and print their href URLs
foreach($html->find('a') as $element)
    echo $element->href . '<br>';

// Modify element attributes and values
/*
// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar'; // Set the class of the second div (index 1). Pattern: find(selector, index)->property = value

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

/*
// Dump contents (without tags) from HTML: prints the text only
echo file_get_html('http://www.ycu.edu.cn/B20110603182545.html')->plaintext;

// plaintext returns the plain text of an element

/************************************ Extract information from a specific page by its tags ****/
/*
// Create DOM from URL
$html = file_get_html('http://tech.sina.com.cn/d/i/2015-11-10/doc-ifxkniur3014232.shtml');

//$aaa = $html->find('table',13);var_dump($aaa);die;
// Find all article blocks
// Use the tags in the page source to collect specific parts of the page
$articles = array();
foreach($html->find('div.blkContainerSblk') as $article) {
    $item = array();
    $item['title'] = $article->find('h1#artibodyTitle', 0)->plaintext;
    $item['pubinfo'] = $article->find('div.artInfo', 0)->plaintext;
    $item['date'] = $article->find('span#pub_date', 0)->plaintext;
    $item['details'] = $article->find('div[id=artibody]', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

*/
/*

/************************************How to create HTML DOM object?*****/

// How to create a DOM object
/*

//1 Create a DOM object from a string
$html1 = str_get_html('<html><body>Hello!</body></html>');

//2 Create a DOM object from a URL
$html2 = file_get_html('http://www.baidu.com/');

//3 Create a DOM object from a HTML file
$html3 = file_get_html('../aj.html');

*/
// Object-oriented way
/*
// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!word！</body></html>');

// Load HTML from a URL
$html->load_file('http://www.google.com/');

// Load HTML from a HTML file
$html->load_file('test.htm');

/********************************************How to find HTML elements?******************************/
/*
///////////basic////////
// Find all anchors, returns a array of element objects
$ret = $html->find('a');

// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find the last anchor, returns element object or null if not found (a negative index counts from the end)
$ret = $html->find('a', -1);

// Find all <div> elements that have an id attribute
$ret = $html->find('div[id]');

// Find all <div> elements whose id attribute is foo
$ret = $html->find('div[id=foo]');

///////////////////advanced///////////////////////
// Find all elements with id=foo
$ret = $html->find('#foo');

// Find all elements with class=foo
$ret = $html->find('.foo');

// Find all elements that have an id attribute
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images that have a "title" attribute
$ret = $html->find('a[title], img[title]');

///////////////descendant selectors/////////////////////////

// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> with class="hello"
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');

////////////////////nested selectors//////////////////////
// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
    foreach($ul->find('li') as $li)
    {
        // do something...
    }
}

// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);

///////////////////////attribute selectors//////////////////////////////////
/*
Supports these operators in attribute selectors:

Filter                Description
[attribute]           Matches elements that have the specified attribute.
[!attribute]          Matches elements that don't have the specified attribute.
[attribute=value]     Matches elements that have the specified attribute with a certain value.
[attribute!=value]    Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]    Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]    Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]    Matches elements that have the specified attribute and it contains a certain value.

//////////////////////find all text blocks and comment blocks/////////////////////////////////////
// Find all text blocks
$es = $html->find('text');

// Find all comment (<!--...-->) blocks
$es = $html->find('comment');

/*********************How to access the HTML element's attributes?********/

/*
// Get an attribute (if the attribute has no value, e.g. checked or selected, this returns true or false)
$value = $e->href;

// Set an attribute (if the attribute has no value, e.g. checked or selected, set it to true or false)
$e->href = 'my link';

// Remove an attribute by setting its value to null
$e->href = null;

// Check whether an attribute exists
if(isset($e->href))
    echo 'href exist!';

// Magic attributes

// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag; // Returns: "div" (the tag name)
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>" (the element's full HTML, including its own tag; a browser renders it as "foo bar" with the bold formatting applied)
echo $e->innertext; // Returns: "foo <b>bar</b>" (only the HTML inside the element; inner tags are kept)
echo "<br>";
echo $e->plaintext; // Returns: "foo bar" (plain text only, with all tags and formatting removed)

// Attribute Name Usage
// $e->tag Read or write the tag name of element.
// $e->outertext Read or write the outer HTML text of element.
// $e->innertext Read or write the inner HTML text of element.
// $e->plaintext Read or write the plain text of element.

////////////////tips///////////////////////////////////

// Extract contents from HTML
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element by setting its outertext to an empty string
$e->outertext = '';

// Append an element (after this element)
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element (before this element)
$e->outertext = '<div>foo</div>' . $e->outertext;

/*************************How to traverse the DOM tree?*************************************/
// Example
//echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;
// or
//echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');
/*
Method Description
mixed $e->children ( [int $index] ) Returns the Nth child object if index is set, otherwise return an array of children.
element $e->parent () Returns the parent of element.
element $e->first_child () Returns the first child of element, or null if not found.
element $e->last_child () Returns the last child of element, or null if not found.
element $e->next_sibling () Returns the next sibling of element, or null if not found.
element $e->prev_sibling () Returns the previous sibling of element, or null if not found.
*/

/*
// How to dump contents of DOM object?
$str = $html;

// Print it!
echo $html;

// Object-oriented way
// Dumps the internal DOM tree back into a string
$str = $html->save();

// Dumps the internal DOM tree back into a file
$html->save('result.htm');

// How to customize the parsing behavior?

// Write a function with parameter "$element"
function my_callback($element) {
    // Hide all <b> tags
    if ($element->tag=='b')
        $element->outertext = '';
}

// Register the callback function by its function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;
*/

API Reference

Helper functions

Name	Description
object str_get_html ( string $content )	Creates a DOM object from a string.
object file_get_html ( string $filename )	Creates a DOM object from a file or a URL.

DOM methods & properties

Name	Description
void __construct ( [string $filename] )	Constructor. If the filename parameter is set, the contents are loaded automatically, from either text or a file/URL.
string plaintext	Returns the contents extracted from HTML.
void clear ()	Clean up memory.
void load ( string $content )	Load contents from a string.
string save ( [string $filename] )	Dumps the internal DOM tree back into a string. If $filename is set, the resulting string is also saved to that file.
void load_file ( string $filename )	Load contents from a file or a URL.
void set_callback ( string $function_name )	Set a callback function.
mixed find ( string $selector [, int $index] )	Find elements by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
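
The return rule of find (an array when no index is given, a single element or null when one is given, with negative indexes counting from the end) can be sketched with PHP's bundled DOM extension. This is an approximation for illustration, not Simple HTML DOM's implementation. simple_find is a hypothetical helper that understands only three selector shapes: "tag", "#id" and ".class".

```php
<?php
// Sketch only: approximates find($selector, $index) using DOMXPath.
// simple_find() is hypothetical and handles just "tag", "#id" and ".class".

function simple_find(DOMDocument $doc, string $selector, ?int $index = null)
{
    if ($selector[0] === '#') {
        $query = sprintf('//*[@id="%s"]', substr($selector, 1));
    } elseif ($selector[0] === '.') {
        // Match one class name inside a space-separated class attribute.
        $query = sprintf(
            '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
            substr($selector, 1)
        );
    } else {
        $query = '//' . $selector;
    }

    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array($xpath->query($query), false);

    if ($index === null) {
        return $nodes;              // no index: every match, as an array
    }
    if ($index < 0) {
        $index += count($nodes);    // negative index counts from the end
    }
    return $nodes[$index] ?? null;  // out of range: null
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li id="first">one</li><li class="hot new">two</li><li>three</li></ul>');

echo count(simple_find($doc, 'li')), "\n";              // 3
echo simple_find($doc, 'li', -1)->textContent, "\n";    // three
echo simple_find($doc, '.hot', 0)->textContent, "\n";   // two
var_dump(simple_find($doc, '#missing', 0));             // NULL
```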

Element methods & properties

Name	Description
string [attribute]	Read or write element's attribute value.
string tag	Read or write the tag name of element.
string outertext	Read or write the outer HTML text of element.
string innertext	Read or write the inner HTML text of element.
string plaintext	Read or write the plain text of element.
mixed find ( string $selector [, int $index] )	Find children by the CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
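
To make the difference between the outer HTML, inner HTML and plain text of an element concrete, here is a small sketch that reads the same three views with PHP's bundled DOM extension. It is an analogy, not Simple HTML DOM itself, and the helpers outer_html and inner_html are hypothetical names.

```php
<?php
// Sketch only: rough DOM-extension counterparts of outertext, innertext and
// plaintext. The helper names are hypothetical.

function outer_html(DOMNode $node): string
{
    // Serialize the node itself, including its own tag.
    return $node->ownerDocument->saveHTML($node);
}

function inner_html(DOMNode $node): string
{
    // Serialize each child in turn and leave out the node's own tag.
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div>foo <b>bar</b></div>');
$div = $doc->getElementsByTagName('div')->item(0);

echo $div->nodeName, "\n";     // div
echo outer_html($div), "\n";   // <div>foo <b>bar</b></div>
echo inner_html($div), "\n";   // foo <b>bar</b>
echo $div->textContent, "\n";  // foo bar
```

One difference worth keeping in mind: in this sketch the three views are read-only helpers, whereas the table above describes the library's properties as readable and writable.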

DOM traversing

Name	Description
mixed $e->children ( [int $index] )	Returns the Nth child object if index is set, otherwise return an array of children.
element $e->parent ()	Returns the parent of element.
element $e->first_child ()	Returns the first child of element, or null if not found.
element $e->last_child ()	Returns the last child of element, or null if not found.
element $e->next_sibling ()	Returns the next sibling of element, or null if not found.
element $e->prev_sibling ()	Returns the previous sibling of element, or null if not found.
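
The traversal calls above can be mimicked with PHP's bundled DOM extension, which may help when comparing the two APIs. The sketch below is not Simple HTML DOM code. element_children and next_element are hypothetical helpers that skip text nodes, because the standard DOM counts text between tags as child nodes.

```php
<?php
// Sketch only: element-only traversal on top of PHP's bundled DOM extension.
// element_children() and next_element() are hypothetical helpers.

function element_children(DOMNode $node): array
{
    $children = [];
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {   // skip text and comment nodes
            $children[] = $child;
        }
    }
    return $children;
}

function next_element(DOMNode $node): ?DOMElement
{
    // Walk forward through the siblings until an element is found.
    for ($n = $node->nextSibling; $n !== null; $n = $n->nextSibling) {
        if ($n instanceof DOMElement) {
            return $n;
        }
    }
    return null;
}

$doc = new DOMDocument();
$doc->loadHTML('<div id="box"> <p id="a">A</p> text <p id="b">B</p> </div>');
$box  = $doc->getElementsByTagName('div')->item(0);
$kids = element_children($box);

echo count($kids), "\n";                                // 2
echo $kids[0]->getAttribute('id'), "\n";                // a
echo next_element($kids[0])->getAttribute('id'), "\n";  // b
echo $kids[0]->parentNode->getAttribute('id'), "\n";    // box
var_dump(next_element($kids[1]));                       // NULL
```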

Camel naming conventions

You can also call methods with W3C STANDARD camel naming conventions.

Method	Mapping
array $e->getAllAttributes ()	array $e->attr
string $e->getAttribute ( $name )	string $e->attribute
void $e->setAttribute ( $name, $value )	void $e->attribute = $value
bool $e->hasAttribute ( $name )	bool isset($e->attribute)
void $e->removeAttribute ( $name )	void $e->attribute = null
element $e->getElementById ( $id )	mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [,$index] )	mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ($name )	mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] )	mixed $e->find ( $name [, int $index] )
element $e->parentNode ()	element $e->parent ()
mixed $e->childNodes ( [$index] )	mixed $e->children ( [int $index] )
element $e->firstChild ()	element $e->first_child ()
element $e->lastChild ()	element $e->last_child ()
element $e->nextSibling ()	element $e->next_sibling ()
element $e->previousSibling ()	element $e->prev_sibling ()
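
One way such a mapping between property-style access and camel-case methods can be built is with PHP's magic methods. The class below is a toy written for this note, not the library's source. ToyElement and its behavior (for example, returning false for a missing attribute and deleting an attribute when null is assigned) are assumptions made for illustration.

```php
<?php
// Sketch only: a toy element class (not the library's implementation) showing
// how property access and camel-case methods can share one attribute array.

class ToyElement
{
    private $attr = [];

    public function __get($name)
    {
        return $this->attr[$name] ?? false;   // missing attribute: false
    }

    public function __set($name, $value)
    {
        if ($value === null) {
            unset($this->attr[$name]);        // assigning null removes it
        } else {
            $this->attr[$name] = $value;
        }
    }

    public function __isset($name)
    {
        return isset($this->attr[$name]);
    }

    // Camel-case methods simply delegate to the magic accessors.
    public function getAllAttributes()          { return $this->attr; }
    public function getAttribute($name)         { return $this->__get($name); }
    public function setAttribute($name, $value) { $this->__set($name, $value); }
    public function hasAttribute($name)         { return $this->__isset($name); }
    public function removeAttribute($name)      { $this->__set($name, null); }
}

$e = new ToyElement();
$e->href = 'index.html';                  // same effect as setAttribute()
echo $e->getAttribute('href'), "\n";      // index.html
var_dump($e->hasAttribute('href'));       // bool(true)
$e->removeAttribute('href');              // same effect as $e->href = null
var_dump(isset($e->href));                // bool(false)
```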