16. Web scraping with lxml in Python

Date: 2019-11-10


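1. lxml is a parsing library, so the package has to be installed before it can be imported. Running pip3 install lxml directly on the command line usually failed with an error in the author's experience. The workaround is to go to https://pypi.org/project/lxml/#files and download the matching lxml wheel yourself.

Choose the file according to your own Python version. The author uses Python 3.6 and therefore downloaded lxml-4.2.4-cp36-cp36m-win32.whl. Switch to the directory that contains the downloaded .whl file and run the following there:

pip3 install lxml-4.2.4-cp36-cp36m-win32.whl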
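After installing, it can help to confirm that the interpreter you are running actually sees the package. The following is a minimal sketch, not part of the original post; the helper name lxml_is_installed is made up for illustration.

```python
import importlib.util


def lxml_is_installed() -> bool:
    """Return True if the current interpreter can locate the lxml package."""
    # find_spec returns None when a top-level package cannot be found
    return importlib.util.find_spec("lxml") is not None


if lxml_is_installed():
    from lxml import etree
    # LXML_VERSION is a tuple of integers describing the installed lxml release
    print("lxml version:", etree.LXML_VERSION)
else:
    print("lxml is not installed for this interpreter")
```

If the check reports that lxml is missing even though pip3 succeeded, pip3 and python probably belong to different Python installations.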

2. The code is as follows:

import requests
from lxml import etree

url = 'https://www.mafengwo.cn/gonglve/ziyouxing/2033.html'

response = requests.get(url)   # returns a Response object
page = response.text           # response body decoded to a string

html = etree.HTML(page)        # parses the string as an HTML document and returns an Element object
content = html.xpath('//h2')

for i in content:
    print(i.text)
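The script above assumes the request always succeeds and that every h2 holds plain text only. A slightly more defensive variant might look like the sketch below. The function names fetch_headings and headings_from_bytes and the 10-second timeout are illustrative assumptions, not part of the original post. It passes the raw bytes (response.content) to etree.HTML so that the parser can pick up the encoding declared in the page itself, instead of relying on the encoding requests guessed for response.text.

```python
import requests
from lxml import etree


def headings_from_bytes(raw):
    """Parse raw HTML bytes and return the text of every <h2> element."""
    html = etree.HTML(raw)
    # itertext() also collects text inside nested tags such as <a> or <b>,
    # where element.text alone would miss it or be None
    return ["".join(h2.itertext()).strip() for h2 in html.xpath("//h2")]


def fetch_headings(url, timeout=10):
    """Download a page and return its <h2> headings (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)  # do not wait forever
    response.raise_for_status()                    # raise on 4xx / 5xx responses
    return headings_from_bytes(response.content)
```

Usage would be along the lines of for title in fetch_headings(url): print(title), with url defined as in the script above.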

3. Code explanation:

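A: Define the URL, then use it to obtain a response object, e.g. url = ''

B: The response object has to be converted to a string: page = response.text

C: Use the parsing library to turn the string into an HTML document, then define an XPath expression according to the content you want to extract.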
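Step C can be tried without any network access by parsing a small inline string. The HTML below is invented purely for illustration, and the XPath expressions are a few common variants rather than the only way to write them.

```python
from lxml import etree

# Made-up sample page, used only to demonstrate the XPath calls
page = """
<html><body>
  <h2>First heading</h2>
  <h2><a href="/guide/1.html">Linked heading</a></h2>
</body></html>
"""

html = etree.HTML(page)

# '//h2' selects the elements themselves; .text only holds the text
# that comes before the first child element
element_texts = [h2.text for h2 in html.xpath('//h2')]
print(element_texts)   # ['First heading', None]

# '//h2//text()' selects every text node below each <h2>, nested tags included
all_texts = html.xpath('//h2//text()')
print(all_texts)       # ['First heading', 'Linked heading']

# '@href' selects an attribute value instead of an element
links = html.xpath('//h2/a/@href')
print(links)           # ['/guide/1.html']
```

The None in the first result shows why print(i.text) can print None on real pages where a heading wraps its text in a link or another tag.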