Python Web Scraping with BeautifulSoup4: Downloading Images from a Website

Date: 2019-11-06



Installation:

pip3 install beautifulsoup4
pip install beautifulsoup4

Use lxml as the parser for BeautifulSoup4: it parses quickly and tolerates malformed HTML well.

Install the parser:

pip install lxml
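
If lxml cannot be installed on every machine, one option is to fall back to Python's built-in html.parser. The helper below is a minimal sketch of that idea; the name make_soup is hypothetical and not part of the original article, and the fallback relies on bs4 raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when available, otherwise with html.parser."""
    try:
        # Preferred parser: fast and lenient with broken markup.
        return BeautifulSoup(html, features="lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser needs no extra package.
        return BeautifulSoup(html, features="html.parser")


# Both parsers close the unclosed <p> tag for us.
soup = make_soup("<p>unclosed paragraph")
print(soup.p.get_text())
```

The two parsers can build slightly different trees for badly broken markup, so results may differ in edge cases.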

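Usage:

Import BeautifulSoup from the bs4 module
Import the urlopen function from urllib.request
Read the page with urlopen; if the page contains Chinese text, decode the bytes as utf-8
Parse the page with BeautifulSoup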

# coding: utf8
# Python 3.7

from bs4 import BeautifulSoup
from urllib.request import urlopen

# read() returns bytes; decode them as UTF-8 so Chinese text is handled correctly
html = urlopen("https://www.anviz.com/product/entries/1.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
# Find every <li> whose class is "product-subcategory-item"
all_li = soup.find_all("li", class_="product-subcategory-item")
for li_title in all_li:
    # get_text() returns the text contained in the tag
    li_item_title = li_title.get_text()
    print(li_item_title)
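
To try the same lookup without touching the network, the extraction can be wrapped in a function and run against an inline string. This is only a sketch: SAMPLE_HTML and its product names are made up for illustration and are not the real anviz markup, extract_titles is a hypothetical helper, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the class name used above.
SAMPLE_HTML = """
<ul>
  <li class="product-subcategory-item"> Access Control </li>
  <li class="product-subcategory-item featured">Time Attendance</li>
  <li class="other">Ignored</li>
</ul>
"""


def extract_titles(html, css_class="product-subcategory-item"):
    """Return the stripped text of every <li> carrying css_class."""
    soup = BeautifulSoup(html, features="html.parser")
    # class_ matches any single class, so "… featured" is matched as well.
    items = soup.find_all("li", class_=css_class)
    return [li.get_text(strip=True) for li in items]


titles = extract_titles(SAMPLE_HTML)
print(titles)
```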

BeautifulSoup4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id13

The methods resemble jQuery's:

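# Get every tag of a given type: soup.find_all('a'). find_all() and find() search only the descendants (children, grandchildren, and so on) of the current node
find_all()
soup.find_all("a")  # find all <a> tags
soup.find_all(re.compile("a"))  # find tags whose name contains "a" (needs import re)
soup.find_all(id="link2")  # find tags whose id is link2
soup.find_all(href=re.compile("elsie"))  # match each tag's href attribute against the pattern
soup.find_all(id=True)  # find tags that have an id attribute, whatever its value
soup.find_all("a", class_="sister")  # find <a> tags whose class is sister
soup.find_all("p", class_="strikeout")
soup.find_all("p", class_="body strikeout")
soup.find_all(text="Elsie")  # find strings that are exactly Elsie (bs4 4.4.0 and later also accept string=)
soup.find_all(text=["Tillie", "Elsie", "Lacie"])
soup.find_all("a", limit=2)  # stop searching once 2 results have been found
# Get the text contained in a tag
get_text()
soup.get_text("|")
soup.get_text("|", strip=True)
# Search the ancestors of the current node
find_parents()
find_parent()
# Search the siblings that follow the current node
find_next_siblings()  # returns all matching following siblings
find_next_sibling()  # returns only the first matching following sibling
# Search the siblings that precede the current node
find_previous_siblings()  # returns all matching preceding siblings
find_previous_sibling()  # returns the first matching preceding sibling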
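
The calls listed above can be tried on a small document. The sketch below assumes a sample page modeled on the "three sisters" example from the BeautifulSoup documentation; HTML_DOC is not from the original article, and html.parser is used so that it runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(HTML_DOC, "html.parser")

# find_all() by tag name, with a limit, by attribute regex, and by name regex.
all_link_ids = [a["id"] for a in soup.find_all("a")]
first_two_ids = [a["id"] for a in soup.find_all("a", limit=2)]
elsie_ids = [a["id"] for a in soup.find_all(href=re.compile("elsie"))]
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Parent and sibling searches start from a single tag.
link2 = soup.find(id="link2")
parent_classes = link2.find_parent("p")["class"]
next_id = link2.find_next_sibling("a")["id"]
prev_id = link2.find_previous_sibling("a")["id"]

# get_text() with a separator and strip=True.
snippet = BeautifulSoup("<p> Hello <b> world </b></p>", "html.parser")
joined = snippet.get_text("|", strip=True)

print(all_link_ids, first_two_ids, elsie_ids, b_names)
print(parent_classes, next_id, prev_id, joined)
```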

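find_all_next()  # returns all matching nodes that come after the current node in the document
find_next()  # returns the first matching node after the current node

find_all_previous()  # returns all matching nodes that come before the current node in the document
find_previous()  # returns the first matching node before the current node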
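
Unlike the sibling searches, these four methods walk the whole document in parse order, forwards or backwards from the starting tag. A small sketch, again on assumed sample markup (HTML_DOC is illustrative and not from the original article):

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Three sisters:
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>.</p>
<p class="story">...</p>
</body></html>
"""
soup = BeautifulSoup(HTML_DOC, "html.parser")
link1 = soup.find(id="link1")
link3 = soup.find(id="link3")

# Everything that starts after link1, in document order.
ids_after_link1 = [a["id"] for a in link1.find_all_next("a")]
# The first <p> that starts after link1 is the second paragraph.
next_p_text = link1.find_next("p").get_text()

# Everything that starts before link3, nearest match first.
ids_before_link3 = [a["id"] for a in link3.find_all_previous("a")]
# The search also reaches tags outside the parent, such as <title>.
title_text = link3.find_previous("title").get_text()

print(ids_after_link1, next_p_text, ids_before_link3, title_text)
```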

Pass a CSS selector string to .select() to find tags using CSS selector syntax:
soup.select("body a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("#link1 ~ .sister")
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select('a[href]')
soup.select('a[href="http://example.com/elsie"]')
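
To see what a few of these selectors return, they can be run against a small sample page. HTML_DOC below is assumed markup modeled on the BeautifulSoup documentation example, not something from the original article; select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Sample markup, assumed for illustration only.
HTML_DOC = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Three sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(HTML_DOC, "html.parser")


def ids(tags):
    """Collect the id attribute of each tag in a result list."""
    return [tag["id"] for tag in tags]


title_text = soup.select_one("head > title").get_text()
child_links = ids(soup.select("p > a"))                 # direct children of <p>
second_link = ids(soup.select("p > a:nth-of-type(2)"))  # second <a> inside its parent
later_sisters = ids(soup.select("#link1 ~ .sister"))    # siblings after #link1
by_href = ids(soup.select('a[href="http://example.com/elsie"]'))

print(title_text, child_links, second_link, later_sisters, by_href)
```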

The .wrap() method wraps the specified element in another tag and returns the new wrapper.
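
A short sketch of wrap(), following the pattern shown in the BeautifulSoup documentation; the sample sentence is illustrative, and new_tag() is used to create the wrapper tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")

# Wrap the text node inside <p> in a new <b> tag; wrap() returns the wrapper.
bold = soup.p.string.wrap(soup.new_tag("b"))

# Wrap the whole <p> in a new <div>.
div = soup.p.wrap(soup.new_tag("div"))

result = str(soup)
print(result)
```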

Demo: scraping the product list images from the anviz website:

Libraries used:

BeautifulSoup

requests

os

# The following modules ship with Python; import them directly when needed
import json
import random     # generate random numbers
import datetime
import time
import os       # create directories
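
As a rough illustration of where these standard-library modules fit in a scraper, the sketch below creates a timestamped output directory, pauses for a random interval between requests, and writes a JSON record of what was saved. All three helper names and the manifest.json file name are hypothetical; none of them appear in the original article.

```python
import datetime
import json
import os
import random
import time


def make_run_dir(base_dir):
    """Create base_dir/<timestamp> and return its path."""
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = os.path.join(base_dir, stamp)
    os.makedirs(run_dir, exist_ok=True)
    return run_dir


def polite_pause(min_s=0.5, max_s=1.5):
    """Sleep for a random time between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay


def save_manifest(run_dir, records):
    """Write records (a list of dicts) to run_dir/manifest.json."""
    path = os.path.join(run_dir, "manifest.json")
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable in the file.
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path
```

A scraper could call make_run_dir() once at startup, polite_pause() between downloads, and save_manifest() at the end.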

# coding: utf8
# Python 3.7

from bs4 import BeautifulSoup
import requests
import os

URL = "https://www.anviz.com/product/entries/2.html"
html = requests.get(URL).text
# Create the output directory if it does not exist yet
os.makedirs("./imgs/", exist_ok=True)
soup = BeautifulSoup(html, features="lxml")

# Each product sits in an <li class="product-subcategory-item">
all_li = soup.find_all("li", class_="product-subcategory-item")
for li in all_li:
    imgs = li.find_all("img")
    for img in imgs:
        # src is assumed to be a site-relative path, so prefix the site root
        imgUrl = "https://www.anviz.com/" + img["src"]
        # stream=True fetches the body in chunks instead of all at once
        r = requests.get(imgUrl, stream=True)
        # Use the last path segment as the file name
        imgName = imgUrl.split('/')[-1]
        with open('./imgs/%s' % imgName, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print('Saved %s' % imgName)
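
Building the image URL and the file name can also be factored into two small helpers that cope with src values that are already absolute or that carry a query string. This is a sketch under assumptions: the helper names are hypothetical, and the example paths are invented rather than taken from the anviz site.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def absolute_image_url(page_url, src):
    """Resolve an <img> src against the page it was found on."""
    # urljoin handles "/path", "relative/path" and full "https://..." values.
    return urljoin(page_url, src)


def image_filename(image_url, default="image"):
    """Derive a file name from the URL path, ignoring any query string."""
    name = posixpath.basename(urlparse(image_url).path)
    return name or default


page = "https://www.anviz.com/product/entries/2.html"
# "/uploads/a.jpg" is a made-up path used only to show the behaviour.
url = absolute_image_url(page, "/uploads/a.jpg")
print(url, image_filename(url + "?v=3"))
```

In the download loop, these would stand in for the string concatenation and the split('/') call.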