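[Learning Web Scraping Together] PyQuery in Detail

Recap

An earlier post introduced the Beautifulsoup library, which lets us scrape data without writing complicated regular expressions. You may find Beautifulsoup awkward to use, though: its syntax is verbose and hard to remember. This post introduces PyQuery, a flexible and powerful HTML parsing library.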

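What is PyQuery

If you are familiar with jQuery syntax, PyQuery is an excellent choice for scraping: its API closely follows jQuery's, so what you already know carries over.

Installing PyQuery

pip install pyquery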

Using PyQuery

The examples below all work on the following string:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''

(1) Initializing from a string

from pyquery import PyQuery as pq
doc = pq(html)  # build a PyQuery object directly from the string
print(doc('li'))  # the argument is a CSS selector: prefix a class with '.', an id with '#', and write a tag name bare
print(doc('.item-0')[0].text)  # text of the first element whose class is item-0

Output:

<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     
first item

(2) Initializing from a URL

from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')  # pass a URL; PyQuery requests the page and parses the returned HTML
print(doc('head'))

Output:

<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>

(3) Initializing from a file

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')  # name of a local file
print(doc('li'))

Basic CSS selectors

from pyquery import PyQuery as pq
doc = pq(html)
# '#' selects by id, '.' selects by class, and a bare name selects by tag.
# In a space-separated selector, earlier parts match outer elements and later parts match elements nested inside them.
print(doc('#container .list li'))

Output:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

(1) Finding descendant elements

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Output:

<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
 
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

(2) Direct children:

lis = items.children('.active')  # the selector argument filters the children; it is optional
print(type(lis))
print(lis)

(3) Parent element

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
container = items.parent()  # the direct parent of .list
print(type(container))
print(container)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Getting all ancestors:

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents()
print(type(parents))
print(parents)

Output:

<class 'pyquery.pyquery.PyQuery'>
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div><div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>

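A CSS selector can also be passed to filter the ancestors:

parent = items.parents('.wrap')
print(parent)

This prints only the first of the two results above, the div whose class is wrap.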

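Sibling elements

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')  # no space between .item-0 and .active: the element must carry both classes, and only one li does
print(li.siblings())

The output is the other four sibling li elements.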

Iterating

from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li)  # each li is itself a PyQuery object, so it can be queried further

Extracting information

(1) Getting attribute values

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr('href'))
print(a.attr.href)

Output:

<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html

(2) Getting text

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

Output:

<a href="link3.html"><span class="bold">third item</span></a>
third item

(3) Getting HTML

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>

DOM operations

(1) addClass and removeClass

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

DOM operations here simply mean modifying attributes, CSS, classes, and so on.

(2) Setting attributes with attr and styles with css

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name', 'link')  # add a new attribute/value pair
print(li)
li.css('font-size', '14px')  # add a new CSS declaration to the style attribute
print(li)

Output:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
             
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

(3) Removing elements

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()  # remove the <p> so that only "Hello, World" remains
print(wrap.text())

Output:

Hello, World This is a paragraph.
Hello, World

(4) Pseudo-class selectors

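from pyquery import PyQuery as pq
# html here is the list sample from the start of "Using PyQuery", not the string redefined in the removal example
doc = pq(html)
li = doc('li:first-child')  # the first li
print(li)
li = doc('li:last-child')  # the last li
print(li)
li = doc('li:nth-child(2)')  # the second li
print(li)
li = doc('li:gt(2)')  # li elements whose zero-based index is greater than 2
print(li)
li = doc('li:nth-child(2n)')  # the even-numbered li elements
print(li)
li = doc('li:contains(second)')  # li elements whose text contains "second"
print(li)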

(5) Other


(The author is a big data engineer at Toutiao.)