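BeautifulSoup is a class.
b = BeautifulSoup(html)
The b object has various methods and attributes related to the structure of the HTML.
a = b.findAll('a') returns the matching tag objects.
Each tag object in a in turn has its own methods and attributes for working with the tag's attributes.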
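As a minimal sketch of the points above: the HTML string here is a made-up example (not from any real page), and the parser name "html.parser" is passed explicitly, which the original one-liner does not do. It shows a few of the attributes a tag object exposes.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML so the example runs without network access.
html = '<p><a href="/feed.html" class="nav">Feed</a> <a name="top">Top</a></p>'

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")        # find_all is the newer spelling of findAll

first, second = links
print(first.name)                 # the tag name
print(first.attrs)                # dict of all attributes on the tag
print(first["href"])              # indexing raises KeyError if the attribute is missing
print(second.get("href"))         # .get returns None when the attribute is absent
print(first.get_text())           # the text inside the tag
```

Using .get instead of indexing is one way to avoid an exception on tags that have no href.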
Getting all the links on a web page:

from bs4 import BeautifulSoup
import urllib.request

url = 'http://news.163.com/'

# Fetch the page HTML
html = urllib.request.urlopen(url).read()
html = html.decode('gbk')

# Extract the hrefs with BeautifulSoup
a = BeautifulSoup(html, 'html.parser').findAll('a')
count = 0
err_a_list = []
for i in a:
    try:
        if i and i.attrs['href'][0] != 'j':  # skip href="javascript:..."
            print(i.attrs['href'])
    except Exception as e:
        # Raised when the href attribute is missing or its value is empty;
        # caught so that the loop is not interrupted
        print(e)
        err_a_list.append(i)
    count += 1

print("\n" * 8)
for i in err_a_list:
    print(i)
    print()
print(count)
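A possible variation, sketched on a made-up HTML snippet rather than the live page: passing href=True to find_all selects only tags that carry an href attribute, so the missing-attribute case does not need a try/except. The sample markup and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup covering the cases the loop above has to handle.
html = """
<a href="http://news.163.com/world">World</a>
<a href="javascript:void(0)">Menu</a>
<a href="">Empty</a>
<a name="top">No href</a>
<a href="/sports">Sports</a>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for tag in soup.find_all("a", href=True):   # only tags that have an href attribute
    href = tag["href"].strip()
    # Skip empty values and javascript: pseudo-links.
    if not href or href.lower().startswith("javascript:"):
        continue
    hrefs.append(href)

print(hrefs)
```

Checking the full "javascript:" prefix is a little stricter than testing only the first character for 'j', which would also drop any ordinary link that happens to start with that letter.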
Handling hrefs that have no domain, as well as anchor-only hrefs:
http://blog.csdn.net/huangxiongbiao/article/details/45584407
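# Complete anchors such as #comment-text using the page URL http://www.ruanyifeng.com/blog/2015/05/co.html,
# and complete paths such as /feed.html to http://www.ruanyifeng.com/feed.html
# (proto, domain, url and alist are assumed to be defined earlier)
alist = map(lambda i: proto + '://' + domain + i if i.startswith('/') else url + i if i.startswith('#') else i, alist)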
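The same completion can also be sketched with the standard library's urllib.parse.urljoin, which resolves an href against a base URL. The helper name absolutize and the variable page_url are hypothetical, chosen only for this example. The page URL is the one mentioned in the comment above.

```python
from urllib.parse import urljoin

# Page the hrefs were taken from (the URL used in the comment above).
page_url = "http://www.ruanyifeng.com/blog/2015/05/co.html"


def absolutize(href, base=page_url):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


print(absolutize("#comment-text"))         # anchor on the same page
print(absolutize("/feed.html"))            # path relative to the site root
print(absolutize("http://news.163.com/"))  # already absolute, left unchanged
```

Unlike the string concatenation above, urljoin also handles hrefs that are relative to the current directory, such as other.html.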