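An earlier post introduced the Beautiful Soup library, which lets us scrape data without writing complicated regular expressions. You may find Beautiful Soup awkward to use, though: its syntax is verbose and hard to remember. This post introduces PyQuery, a flexible and powerful HTML parsing library.
If you are familiar with jQuery syntax, PyQuery is an excellent choice for scraping, because most of the API carries over almost unchanged.
pip install pyquery
All of the examples below use the following string: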
html = '''
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div>
'''
Initialization
(1) Initializing from a string
from pyquery import PyQuery as pq
doc = pq(html)  # build a PyQuery object directly from the string
print(doc('li'))  # the argument is a CSS selector: prefix a class with '.', an id with '#', and write a tag name as-is
print(doc('.item-0')[0].text)  # text of the first element whose class is item-0
Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
first item
(2) Initializing from a URL
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')  # pass a URL; PyQuery requests the page and parses the returned HTML
print(doc('head'))
Output:
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>
(3) Initializing from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')  # name of a local file
print(doc('li'))
Basic CSS selectors
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))  # '#' prefixes an id, '.' prefixes a class, a tag name has no prefix; the selector on the left matches the outer element, the ones to its right match elements nested inside it
Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Finding elements
(1) Descendant elements
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)
Output:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
(2) Direct children:
lis = items.children('.active')  # the argument is an optional selector that filters the children
print(type(lis))
print(lis)
(3) Parent element
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()  # the parent node of .list
print(type(container))
print(container)
Output:
<class 'pyquery.pyquery.PyQuery'>
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
Ancestor elements:
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)
Output:
<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
<div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
</div><div id="container">
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
You can also pass a CSS selector to filter the ancestors:
parent = items.parents('.wrap')
print(parent)
Only the first of the two results above is printed.
Sibling elements
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')  # no space between the two classes: match elements that carry both; only one li qualifies
print(li.siblings())
The output is the other four sibling li elements.
Traversal
from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)  # each li is itself a PyQuery object, so it can be queried further
Getting information
(1) Getting attribute values
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))
print(a.attr.href)
Output:
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html
(2) Getting text
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())
Output:
<a href="link3.html"><span class="bold">third item</span></a>
third item
(3) Getting HTML
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>
DOM manipulation
(1) addClass and removeClass
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
DOM manipulation is really just manipulation of attributes, CSS, classes, and so on.
(2) Setting attributes with attr and styles with css
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')  # add a new attribute/value pair
print(li)
li.css('font-size', '14px')  # add a new CSS property
print(li)
Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>
(3) Removing elements
html = '''
<div class="wrap">
Hello, World
<p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()  # drop the <p> so that only "Hello, World" is left
print(wrap.text())
Output:
Hello, World This is a paragraph.
Hello, World
(4) Pseudo-class selectors
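from pyquery import PyQuery as pq
doc = pq(html)  # html must again be the list markup from the start of the article, not the snippet used for removal
li = doc('li:first-child')  # the first li element
print(li)
li = doc('li:last-child')  # the last li element
print(li)
li = doc('li:nth-child(2)')  # the second li element
print(li)
li = doc('li:gt(2)')  # li elements whose zero-based index is greater than 2
print(li)
li = doc('li:nth-child(2n)')  # the even-numbered li elements
print(li)
li = doc('li:contains(second)')  # li elements that contain the given text
print(li)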
(5) Other