PyQuery in detail: basic PyQuery usage for Python web scraping

 

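PyQuery is another powerful and flexible HTML parsing library. If you have front-end development experience you have almost certainly used jQuery, and in that case PyQuery is an excellent choice: it is a Python library modeled closely on jQuery. Its syntax is almost identical to jQuery's, so there is no new set of unfamiliar methods to memorize.
Official documentation: http://pyquery.readthedocs.io/en/latest/
jQuery reference: http://jquery.cuishifeng.cn/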

 

1. Initializing from a string

from pyquery import PyQuery as pq

html = '''<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
print(doc)
print(type(doc))
print(doc('li'))
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
Output of the code above

 

2. Opening an HTML file

  Mind the file path: a relative filename is resolved against the current working directory.

from pyquery import PyQuery as pq
doc = pq(filename='index.html')
print(doc)
print(doc('head'))
    <title>Title</title>
</head>
<body>
    <div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''
</body>
</html>
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>
Output of the code above

 

3. Opening a website

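from pyquery import PyQuery as pq

# A string starting with http:// or https:// is treated as a URL and fetched
doc = pq('https://www.baidu.com')
# doc1 = pq(url='https://www.baidu.com')
print(doc)
print(doc('head'))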

  

4. Searching with CSS selectors

from pyquery import PyQuery as pq

html = '''<div>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
print(doc)
# The span inside an a, inside an element with class item-0, inside the element with id haha (levels of nesting are separated by spaces)
print(doc('#haha .item-0 a span'))
<div>
    <ul id="haha">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>
<span class="bold">third item</span>
Output of the code above

 

 

5. Starting from an element you have already found, you can look up its child or parent elements without searching from the top again.

from pyquery import PyQuery as pq

html = '''<div class="content">
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
item = doc('div ul')
print(item)
# Starting from the element already found, look up its parent and its children
print(item.parent())
print(item.children())
<ul id="haha">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
<div class="content">
    <ul id="haha">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>
<li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
Output of the code above

 

from pyquery import PyQuery as pq

html = '''<div class="content">
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
item = doc('div ul')
print(item)
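# children() returns the direct children of ul, i.e. the li elements; the selector below keeps those that have a class attribute.
# Replacing class with href would match nothing, because children() only looks at direct children, not at all descendants.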
print(item.children('[class]'))

 

6. Getting attribute values

from pyquery import PyQuery as pq

html = '''<div class="content">
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
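# class="item-0 active" is a single class attribute holding two class names. In a selector a space means
# "descendant", so the two class names must be joined with a dot rather than separated by a space.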
item = doc(".item-0.active a")
print(type(item))
print(item)
# Two ways to read an attribute value
print(item.attr.href)
print(item.attr('href'))
<class 'pyquery.pyquery.PyQuery'>
<a href="link3.html"><span class="bold">third item</span></a>
link3.html
link3.html
Output of the code above

 

7. Getting the text content of elements

from pyquery import PyQuery as pq

html = '''<div class="content">
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
a = doc("a").text()
print(a)
# Interesting result: text() collects the text of every matched element and joins it into one string, like a sentence
second item third item fourth item fifth item
Output of the code above

 

 

8. DOM manipulation

1. Adding and removing classes

from pyquery import PyQuery as pq

html = '''<div class="content">
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
li = doc('.item-0.active')
print(li)
# Remove the class active
print(li.removeClass('active'))
# Add the class haha
print(li.addClass('haha'))
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 haha"><a href="link3.html"><span class="bold">third item</span></a></li>
Output of the code above

 

2. attr and css

  Note: the operations below update the attribute or style if it already exists, and add it if it does not.

from pyquery import PyQuery as pq

html = '''<div class="content">
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.attr('id','id_test'))
print(li.css('font-size','20px'))
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 active" id="id_test"><a href="link3.html"><span class="bold">third item</span></a></li>
         
<li class="item-0 active" id="id_test" style="font-size: 20px"><a href="link3.html"><span class="bold">third item</span></a></li>
Output of the code above

 

 

3. Removing an element. When scraping, the elements or content you pull down often include tags you do not want; a method like the one below removes them.

from pyquery import PyQuery as pq

html = '''<div class='content'>
    <ul id = 'haha'>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul></div>'''

doc = pq(html)
data = doc('.content')
print(data.text())
# Remove all a elements
data.find('a').remove()
# Print again
print(data.text())
first item second item third item fourth item fifth item
first item
Output of the code above