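PyQuery is a powerful and flexible web page parsing library. If you find regular expressions too tedious to write and BeautifulSoup's syntax too hard to remember, but you are familiar with jQuery syntax, PyQuery is an excellent choice.
Installation: pip3 install pyquery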
Initializing from a string

from pyquery import PyQuery as pq

html = '''
<div>
    <ul>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
    </ul>
</div>
'''
doc = pq(html)
# Selection works like CSS selectors: prefix a class with a dot,
# prefix an id with #, and write a tag name with no prefix.
print(doc('li'))

Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>
Initializing from a URL

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))

Output:
<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>百度一下,你就知道</title></head>

Here a URL is passed in; PyQuery requests it automatically and builds a PyQuery object from the page source.
Initializing from a file

from pyquery import PyQuery as pq

doc = pq(filename='1.html')
print(doc('ul'))

Output:
<ul>
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link3.html"><span class="bold">third item</span></a></li>
</ul>

------------------------
Contents of 1.html:
<div>
    <ul>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link3.html'><span class='bold'>third item</span></a></li>
    </ul>
</div>
from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
print(doc('#container .list li'))

Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

These are CSS selectors: prefix an id with #, prefix a class with a dot, and write a tag name with no prefix.
Finding child elements
The find method: searches for elements contained within an element

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
print(type(items))
print(items)
lis = items.find('li')
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<ul class="list">
    <li class="item-0">first item</li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
    <li class="item-1 active"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

The children method returns only direct children, whereas find matches any descendant at any depth. find is the more commonly used of the two.
Finding parent elements

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
print(items.parent())

Output:
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>

There is also the parents method, which returns ancestor nodes: not only the parent, but also the parent's parent, and so on up the tree.
As with element lookups, you can pass an argument (a CSS-selector-like string) to these methods to filter the result further, for example:

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
# Keep only the parent whose id is "container"
print(items.parent('#container'))

Output:
<div id="container">
    <ul class="list">
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul></div>
Sibling elements
The siblings method
In a selector such as doc('.list .item-0.active'), a space means descending level by level, while no space means the conditions apply to the same element: it must have both the item-0 class and the active class.

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list .item-0.active')
print(items)

Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
Calling items.siblings() returns its sibling elements:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list .item-0.active')
print(items.siblings())
# A selector argument filters the siblings further
print(items.siblings('.active'))

Output:
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0">first item</li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>
The items() method: it returns a generator, which you can then iterate over with a for loop

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
lis = doc('li').items()
for li in lis:
    print(li)

Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
Getting attributes
For example, to get an attribute of an element named item:
item.attr('attribute_name'), or:
item.attr.attribute_name

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
# Select the <a> inside the <li> that has both item-0 and active
a = doc('.item-0.active a')
print(a.attr.href)
print(a.attr('href'))

Output:
link3.html
link3.html
Getting text
The text() method
Getting HTML
The html() method, for example:

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<a href="link3.html"><span class="bold">third item</span></a>

# Printing li gives the tag itself together with its contents.
# Calling html() gives only the HTML inside the tag.
DOM manipulation
That is, operations on nodes.
addClass, removeClass: add and remove classes

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.removeClass('active'))
print(li.addClass('active'))

Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
attr, css: modify attributes and styles

from pyquery import PyQuery as pq

html = '''
<div id='container'>
    <ul class='list'>
        <li class='item-0'>first item</li>
        <li class='item-1'><a href='link2.html'>second item</a></li>
        <li class='item-0 active'><a href='link3.html'><span class='bold'>third item</span></a></li>
        <li class='item-1 active'><a href='link4.html'>fourth item</a></li>
        <li class='item-0'><a href='link5.html'>fifth item</a></li>
    </ul>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.attr('name', 'link'))
print(li.css('font-size', '14px'))

Output:
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-0 active" name="link" style="font-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

# There was no name attribute originally, so one is added; if a name attribute had already existed, its value would be overwritten.
# After calling css, a style attribute appears.
remove
from pyquery import PyQuery as pq

html = '''
<div class='wrap'>
    hello world
    <p>this is a paragraph</p>
</div>
'''
doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
print(wrap.find('p'))
wrap.find('p').remove()
print(wrap.text())

Output:
hello world this is a paragraph
<p>this is a paragraph</p>
hello world
Other DOM methods
http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class selectors