BeautifulSoup Notes

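BeautifulSoup is a class.

b = BeautifulSoup(html)

The b object has various methods and attributes related to the structure of the HTML document.

a = b.find_all('a') returns the objects for the <a> tags (findAll is the older spelling of the same method).

Each of those tag objects in turn has its own methods and attributes for reading the tag's attributes.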
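
To make the points above concrete, here is a minimal sketch that parses a small inline HTML string instead of a real page. The sample markup and the variable names (links, first, hrefs) are made up for illustration; 'html.parser' is the parser bundled with Python.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from any real page.
html = """
<html><body>
  <a href="/feed.html" class="nav">Feed</a>
  <a href="#comment-text">Comments</a>
  <a name="top">No href here</a>
</body></html>
"""

b = BeautifulSoup(html, "html.parser")
links = b.find_all("a")          # list-like result holding one Tag per <a>

first = links[0]
tag_name = first.name            # the tag's name as a string
first_attrs = first.attrs        # dict of the tag's attributes
first_text = first.get_text()    # text content inside the tag

# Tag.get() returns None instead of raising KeyError when the attribute is missing.
hrefs = [t.get("href") for t in links]
print(tag_name, first_attrs, first_text)
print(hrefs)
```

Using get() rather than attrs['href'] is one way to avoid an exception for tags that have no href at all.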

Getting all the links on a web page:

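from bs4 import BeautifulSoup
import urllib.request

url = 'http://news.163.com/'

# Fetch the page HTML
html = urllib.request.urlopen(url).read()
html = html.decode('gbk')  # change this if the page uses a different encoding

# Extract the hrefs with BeautifulSoup
a = BeautifulSoup(html, 'html.parser').find_all('a')
count = 0
err_a_list = []
for i in a:
    try:
        href = i.attrs['href']  # KeyError if the tag has no href
        if not href:
            raise ValueError('empty href')
        if not href.startswith('javascript:'):  # skip href="javascript:..."
            print(href)
    except (KeyError, ValueError) as e:  # href missing or empty; caught so the loop is not interrupted
        print(e)
        err_a_list.append(i)
        count += 1
print("\n" * 8)
# Print the tags that had no usable href, then how many there were
for i in err_a_list:
    print(i)
    print()
print(count)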
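
As an alternative to catching exceptions, find_all can be asked to return only tags that carry an href by passing href=True. The sketch below is just one possible way to write it; the helper name extract_links and the sample string are hypothetical, and it runs on inline HTML so that it does not depend on the network.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the non-empty, non-javascript hrefs found in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True keeps only <a> tags that have an href attribute.
    for tag in soup.find_all("a", href=True):
        href = tag["href"].strip()
        # Skip empty values and javascript: pseudo-links.
        if not href or href.lower().startswith("javascript:"):
            continue
        links.append(href)
    return links


# Hypothetical sample covering the cases handled above.
sample = (
    '<a href="/a.html">A</a>'
    '<a>no href</a>'
    '<a href="">empty</a>'
    '<a href="javascript:void(0)">js</a>'
    '<a href="http://news.163.com/">home</a>'
)
result = extract_links(sample)
print(result)
```

This version does not collect the tags it skips; if that list is needed for debugging, the try/except loop above is still the simpler choice.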

 

Handling hrefs that have no domain, and hrefs that are only an anchor:

 http://blog.csdn.net/huangxiongbiao/article/details/45584407

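    # Complete an anchor such as #comment-text against the page URL http://www.ruanyifeng.com/blog/2015/05/co.html, and complete a path such as /feed.html to http://www.ruanyifeng.com/feed.html
    # (proto, domain, url and alist are assumed to be defined earlier)
    alist = [proto + '://' + domain + i if i.startswith('/') else url + i if i.startswith('#') else i for i in alist]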
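
The standard library can do the same completion with urllib.parse.urljoin, which also covers relative paths that do not start with a slash. This is a sketch of that alternative, not the method used in the linked post; page_url reuses the example URL from the comment above, and the hrefs list is made up for illustration.

```python
from urllib.parse import urljoin

# The page the links were scraped from (example URL from the comment above).
page_url = "http://www.ruanyifeng.com/blog/2015/05/co.html"

# Hypothetical hrefs: an anchor, a root-relative path, an absolute URL, a relative path.
hrefs = ["#comment-text", "/feed.html", "http://example.com/x", "other.html"]

# urljoin resolves each href against the page URL; absolute URLs pass through unchanged.
resolved = [urljoin(page_url, h) for h in hrefs]
for r in resolved:
    print(r)
```

With urljoin there is no need to split the page URL into protocol and domain by hand.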

 
