Getting to Know Web Scraping: Just How Concise Is Scraping with pyquery, an Excellent Scraping Tool?

Date: 2021-04-07



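Having covered HTML parsing libraries such as BeautifulSoup objects and the XPath syntax of the lxml extension library, let's now talk about pyquery. The name alone looks a lot like jQuery, and in fact pyquery is modelled on jQuery's syntax, so the usage is almost identical. It is a gift for front-end developers who write scrapers: if you already know some jQuery, pyquery will feel very easy to use.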

1. Install and import the pyquery library

pip install -i https://pypi.mirrors.ustc.edu.cn/simple/ pyquery

# -*- coding: UTF-8 -*-

# Import the PyQuery class from the pyquery library
from pyquery import PyQuery

2. Sending web requests with pyquery (rarely used)

'''
A PyQuery object can send the web request directly
and returns the parsed response.
'''
from pyquery import PyQuery

# GET request
print(PyQuery(url='http://www.baidu.com/', data={}, headers={'user-agent': 'pyquery'}, method='get'))

# POST request
print(PyQuery(url='http://httpbin.org/post', data={'name': 'Python 集中營'}, headers={'user-agent': 'pyquery'}, method='post', verify=True))

3. Parsing page source with pyquery (commonly used)

Initializing the parser object

from pyquery import PyQuery

# Start from page source that the page downloader has already fetched.
# Here we simply reuse the official example document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Initialize the parser object
pyquery_obj = PyQuery(html_doc)

Extracting elements and their text with CSS selectors

# Get the a elements and their text
print(pyquery_obj('a'))
print(pyquery_obj('a').text())

# Get the elements with class=story and their text
print(pyquery_obj('.story'))
print(pyquery_obj('.story').text())

# Get the element with id=link3 and its text
print(pyquery_obj('#link3'))
print(pyquery_obj('#link3').text())

# Get the p elements under body and their text
print(pyquery_obj('body p'))
print(pyquery_obj('body p').text())

# Get both the p and the a elements and their text
print(pyquery_obj('p,a'))
print(pyquery_obj('p,a').text())

# Get the elements whose class attribute equals story and their text
print(pyquery_obj("[class='story']"))
print(pyquery_obj("[class='story']").text())

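Extracting further information from selected elements

# Extract element text
print("...... further extraction ......")
print("Text of all a elements", pyquery_obj('a').text())
print("Inner HTML of the first a element", pyquery_obj('a').html())
print("Parent element of the a elements", pyquery_obj('a').parent())
print("Child elements of the a elements", pyquery_obj('a').children())
print("The a element whose id is link3", pyquery_obj('a').filter('#link3'))
print("href attribute value of the first a element", pyquery_obj('a').attr.href)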

DOM manipulation

# attr() function: get an attribute value
print(pyquery_obj('a').filter('#link3').attr('href'))
# attr.<name>: get an attribute value
print(pyquery_obj('a').filter('#link3').attr.href)
print(pyquery_obj('a').filter('#link3').attr.class_)
# Add the class value w
pyquery_obj('a').filter('#link3').add_class('w')
print(pyquery_obj('a').filter('#link3').attr('class'))

# Remove the class value sister
pyquery_obj('a').filter('#link3').remove_class('sister')
print(pyquery_obj('a').filter('#link3').attr('class'))
# Remove all a tags from the document
pyquery_obj('html').remove('a')
print(pyquery_obj)

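For more, follow the WeChat official account 【Python 集中營】, which focuses on the Python technology stack: resources, a discussion community, and practical tips. We look forward to having you join!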