Python Web Scraping, Part 2

Date: 2019-12-07

Tags: python, web scraping. Category: Python


1. Regular expressions

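* matches the preceding element zero or more times, e.g. a*b*

+ matches the preceding element at least once

[ ] matches any one character listed inside the brackets

( ) marks a group, so that the enclosed subexpression is treated as a unit

{m,n} matches the preceding element at least m and at most n times

[^ ] matches any one character that is not listed inside the brackets

| means "or": it matches either alternative

. matches any single character (by default, any character except a newline)

^ anchors the match to the start of the string; ^a matches a string that begins with a

\ escapes the following character so that it is taken literally

$ is the counterpart of ^: it anchors the match to the end of the string

(?! ) is a negative lookahead ("does not contain"): the text that follows must not match the enclosed pattern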
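
The operators listed above can be tried out with Python's built-in re module. The following is a minimal sketch; every sample string is invented for illustration, and the helper name fully_matches is hypothetical. Each pattern is paired with one string it should match in full and one it should not.

```python
import re

# Each entry: (pattern, string that matches in full, string that does not).
examples = [
    (r"a*b*", "aabbb", "ba"),          # *     zero or more repetitions
    (r"a+", "aaa", ""),                # +     at least one repetition
    (r"[abc]", "b", "d"),              # [ ]   any one character from the set
    (r"(ab)+", "abab", "aba"),         # ( )   a group treated as a unit
    (r"a{2,3}", "aaa", "aaaa"),        # {m,n} between m and n repetitions
    (r"[^abc]", "d", "a"),             # [^ ]  any one character not in the set
    (r"cat|dog", "dog", "cow"),        # |     either alternative
    (r"a.c", "abc", "ac"),             # .     any single character
    (r"a\.b", "a.b", "axb"),           # \     escape: a literal dot
    (r"((?!:).)*", "/wiki/Python", "/wiki/Talk:Python"),  # (?! ) no colon allowed
]


def fully_matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in examples:
    print(pattern, fully_matches(pattern, good), fully_matches(pattern, bad))

# The anchors ^ and $ are easiest to see with re.search, which scans the string.
print(bool(re.search(r"^a", "apple")), bool(re.search(r"^a", "banana")))
print(bool(re.search(r"a$", "banana")), bool(re.search(r"a$", "apple")))
```

re.fullmatch is used for most rows so that a pattern such as a*b*, which can match the empty string, does not trivially "match" every input.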

2. Getting attributes

Get all attributes of a tag (returned as a dictionary):

myTag.attrs

Get the resource location (src) of an image:

myImgTag.attrs["src"]
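
As a small self-contained illustration of the two expressions above, the sketch below parses an inline HTML snippet instead of a live page. The snippet and its attribute values are invented for illustration, and the code assumes the third-party beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, made up for this example.
html = '<div><img src="/images/logo.png" alt="Logo"></div>'
soup = BeautifulSoup(html, "html.parser")

myImgTag = soup.find("img")

# .attrs is a dictionary holding every attribute of the tag.
print(myImgTag.attrs)

# Indexing that dictionary by "src" gives the image's resource location.
print(myImgTag.attrs["src"])
```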

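A function that collects the article links on a page, followed by a loop that keeps jumping to a randomly chosen link: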

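import datetime
import random
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Seed with a numeric timestamp; Python 3.11+ rejects datetime objects as seeds.
random.seed(datetime.datetime.now().timestamp())

def getLinks(articleUrl):
    html = urlopen("http://en.wikipedia.org" + articleUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    # Keep only article links: href starts with /wiki/ and contains no colon.
    return bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")
while len(links) > 0:
    # Pick a random link on the current page and follow it.
    newArticle = links[random.randint(0, len(links) - 1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)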
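
To see what the pattern ^(/wiki/)((?!:).)*$ lets through, it can be tested on its own without any network access. This is a minimal sketch: the href values in the list are hypothetical examples of the kinds of links found on a Wikipedia page, and the names articleLink, candidates and kept are introduced only for illustration.

```python
import re

# Same pattern as in the crawler: the href must start with /wiki/
# and no character after that may be a colon.
articleLink = re.compile("^(/wiki/)((?!:).)*$")

# Hypothetical href values, invented for this example.
candidates = [
    "/wiki/Kevin_Bacon",               # ordinary article link
    "/wiki/Category:American_actors",  # namespace page, contains a colon
    "/wiki/File:Example.jpg",          # file page, contains a colon
    "//www.wikimedia.org/",            # external link
    "#cite_note-1",                    # in-page anchor
]

kept = [href for href in candidates if articleLink.search(href)]
print(kept)
```

Only the ordinary article link survives. The negative lookahead (?!:) is what rejects namespace pages such as Category: and File:, while the leading ^(/wiki/) rejects external links and in-page anchors.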