Web Scraping: Frequently Used BeautifulSoup Notes

Date: 2019-11-10


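BeautifulSoup(html, "lxml"): the first argument is the HTML itself (a string or an open file-like object), not a URL; the second argument selects the parser, which can be "lxml", "html.parser", or "html5lib".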

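soup.find returns a single object: the first tag that matches the condition (or None if nothing matches).

soup.findAll returns a list containing every tag that matches the condition.

So get_text() can be chained directly after find, but not after findAll; with findAll you have to call get_text() on each element of the list individually.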

For example:

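from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

# Option 1: urlopen returns a file-like response that BeautifulSoup can read directly.
html = urlopen("http://www.biqukan.com/")
a_html = BeautifulSoup(html, 'lxml')
mudi = a_html.find("div", {"class": "r bd"}).get_text()  # find returns a single tag, so get_text() can be called on it directly
print(mudi)

# Option 2: requests returns a Response; its .text attribute holds the decoded HTML.
target = "http://www.biqukan.com/"
req = requests.get(target)
html2 = req.text
bf = BeautifulSoup(html2, 'html.parser')
# findAll returns a list, so get_text() must be called on each element
for mudi2 in bf.find("div", {"class": "r bd"}).findAll("span", {"class": "s2"}):
    print(mudi2.get_text())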
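The same find / findAll distinction can be tried without any network access. The following is a minimal sketch, assuming a made-up HTML snippet (SAMPLE_HTML is hypothetical, not the real page) and using find_all, the current spelling of findAll in bs4:

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<div class="r bd">
  <span class="s2">First title</span>
  <span class="s2">Second title</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find returns the first matching tag, or None when nothing matches.
box = soup.find("div", {"class": "r bd"})
missing = soup.find("div", {"class": "no-such-class"})

# find_all returns a list-like result; get_text() is called on each element.
titles = [span.get_text() for span in box.find_all("span", {"class": "s2"})]

print(titles)
print(missing)
```

Because find may return None, chaining .get_text() straight onto it raises AttributeError when the tag is absent, so checking the result first is the safer habit on pages you do not control.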

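Also note that urlopen and requests work differently: urlopen returns a response object that is passed to BeautifulSoup as is, while requests.get returns a Response whose .text attribute holds the HTML.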
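To see why both styles end up producing the same soup, here is a small offline sketch. It assumes an io.BytesIO object as a stand-in for the file-like response that urlopen returns, and a plain str as a stand-in for requests' .text; PAGE is a hypothetical document, not the real site.

```python
import io

from bs4 import BeautifulSoup

# Hypothetical page content used in place of a real download.
PAGE = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"

# Stand-in for urlopen(...): a file-like object that yields bytes.
file_like = io.BytesIO(PAGE.encode("utf-8"))
soup_from_file = BeautifulSoup(file_like, "html.parser")

# Stand-in for requests.get(...).text: an already-decoded string.
soup_from_text = BeautifulSoup(PAGE, "html.parser")

title_from_file = soup_from_file.title.get_text()
title_from_text = soup_from_text.title.get_text()
print(title_from_file, title_from_text)
```

BeautifulSoup reads from anything that has a read() method and otherwise treats its first argument as markup, which is why the urlopen response can be passed as is while the requests response needs .text.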

bsObj.findAll("h1")  # returns a list of all <h1> tags in the page
bsObj.findAll(["h1","h2","h3","h4","h5"])  # returns a list of all h1 through h5 heading tags

a = bsObj.findAll("h2")
len(a)       # findAll returns a list, so len() can be used to count the matches

bsObj.findAll("", {"class":"green"})  # returns a list of every tag, of any name, whose class attribute is green

bsObj.findAll("span", {"class":"green"})  # returns every span tag whose class attribute is green
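These filters can be checked against a small document. The sketch below is only an illustration: SAMPLE_HTML is invented, find_all is used as the current name for findAll, and attrs= is used in place of the empty-string tag name to express "any tag".

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a few headings and class attributes.
SAMPLE_HTML = """
<h1>Title</h1>
<h2>Part one</h2>
<p><span class="green">Anna</span> met <span class="red">Boris</span></p>
<h2>Part two</h2>
<p class="green">A green paragraph</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A list of names matches any of the listed tags, in document order.
heading_names = [h.name for h in soup.find_all(["h1", "h2", "h3"])]

# The result behaves like a list, so len() counts the matches.
h2_count = len(soup.find_all("h2"))

# Filter on the attribute only: any tag whose class is green.
green_any = [t.name for t in soup.find_all(attrs={"class": "green"})]

# Filter on tag name and attribute together.
green_spans = [s.get_text() for s in soup.find_all("span", {"class": "green"})]

print(heading_names, h2_count, green_any, green_spans)
```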

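for child in bsObj.find("table",{"id":"giftList"}).children:  # iterate over the direct children of the table whose id is giftList
    print(child)

# Using descendants instead of children yields every tag nested under that tag, at any depth
# parent returns the tag's parent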
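The difference between children, descendants, and parent is easier to see on a tiny table. This is a sketch under assumptions: the markup is made up (only the giftList id comes from the notes), and text nodes are filtered out with isinstance so that only tags are listed.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table; written without whitespace so no blank text nodes appear.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Vegetable basket</td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# children: only the tags directly under the table.
child_names = [c.name for c in table.children if isinstance(c, Tag)]

# descendants: every tag nested at any depth, in document order.
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

# parent: one step up from the first row, back to the table.
parent_name = table.tr.parent.name

print(child_names, descendant_names, parent_name)
```

Both children and descendants also yield text nodes (NavigableString objects), which is why real pages with indentation tend to produce blank entries unless they are filtered.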


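for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:  # iterate over the siblings that follow the table's first tr row, not including that row itself
    print(sibling)
# Using previous_siblings instead of next_siblings yields the siblings that come before the tag, walking backward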
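A short sketch of the sibling order, again on invented markup (the row ids header, gift1, and gift2 are hypothetical and exist only to make the order visible):

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical table with three rows identified by id.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr id="header"></tr><tr id="gift1"></tr><tr id="gift2"></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# next_siblings starts after the first row and moves forward.
first_row = table.tr
rows_after = [s["id"] for s in first_row.next_siblings if isinstance(s, Tag)]

# previous_siblings starts before the given row and moves backward.
last_row = table.find("tr", {"id": "gift2"})
rows_before = [s["id"] for s in last_row.previous_siblings if isinstance(s, Tag)]

print(rows_after, rows_before)
```

Skipping the first row this way is a common trick for dropping a table's header row while keeping the data rows.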
