Luffycity (路飛學城) - Python Web Scraping Hands-On Intensive - Chapter 2

Date: 2019-11-15

Tags: python, web scraping


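I have finished Chapter 2. Because I lack background in web development and Flask, it was fairly tough going. That is not really a problem, though: my goal in studying is to put things to use, and once the course makes me feel why this knowledge is necessary, I will certainly take the time to catch up on it. The class advisor is very conscientious and even sent us a dedicated set of Flask tutorial videos. My Chapter 1 assignment was praised by the teacher in the group chat and scored 90, which made me quite happy. For me, the biggest differences between studying on my own and studying with a teacher are these: first, the teacher's comments and corrections help me spot the weaknesses in my programs promptly and write more disciplined code; second, I no longer study blindly but can progress step by step, from the basics to the more advanced, while picking up the most important thing of all, a method for learning.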

Notes on Chapter 2:

When a site uses image hotlink protection, add the following request headers:

- Referer

- Cookie

- Host
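As a minimal sketch of the header list above, the following builds (but does not send) a request that carries those three headers. It uses the standard library's `urllib.request` only to stay self-contained; the helper name `build_image_request`, both URLs, and the cookie value are hypothetical placeholders, not taken from the course.

```python
from urllib.parse import urlsplit
from urllib.request import Request


def build_image_request(image_url, page_url, cookie=None):
    """Build a request that looks like a browser loading an image embedded in page_url."""
    headers = {
        "Referer": page_url,                 # the page that embeds the image
        "Host": urlsplit(image_url).netloc,  # the host that serves the image
    }
    if cookie:
        headers["Cookie"] = cookie           # session cookie, if the site requires one
    return Request(image_url, headers=headers)


req = build_image_request(
    "http://img.example.com/pic/1.jpg",      # hypothetical image URL
    "http://www.example.com/gallery.html",   # hypothetical page URL
    cookie="sessionid=abc123",               # hypothetical cookie value
)
print(req.header_items())
```

Passing `req` to `urllib.request.urlopen` would actually send it; whether a given site checks all three headers or only `Referer` varies from site to site.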

Built-in BeautifulSoup4 attributes and methods:

1. name: the tag name

tag = soup.find('a')
name = tag.name  # get
print(name)
tag.name = 'span'  # set
print(soup)

2. attrs: the tag's attributes

tag = soup.find('a')
attrs = tag.attrs  # get
print(attrs)
tag.attrs = {'ik': 123}  # set: replaces the whole attribute dict
tag.attrs['id'] = 'iiiii'  # set: adds or updates a single attribute
print(soup)
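To make the effect of assigning to `name` and `attrs` concrete, here is a small self-contained sketch. The HTML fragment is made up for illustration, and the built-in `html.parser` backend is an assumption chosen so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Made-up fragment, used only for illustration
soup = BeautifulSoup('<a href="/elsie" id="link1">Elsie</a>', "html.parser")
tag = soup.find("a")

original_name = tag.name          # read the tag name
original_attrs = dict(tag.attrs)  # copy the attribute dict before changing it

tag.name = "span"                 # rename the tag in place
tag.attrs["id"] = "renamed"       # update one attribute
del tag.attrs["href"]             # remove another

result = str(soup)
print(original_name, original_attrs)
print(result)
```

Because `tag` is a live node of the tree, the changes show up when the whole `soup` is printed.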

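3. children: all direct children

from bs4.element import Tag

body = soup.find('body')
print(list(body.children))  # includes the newline text nodes
for ele in body.children:  # .children is an iterator, so request it again
    if isinstance(ele, Tag):
        print(ele)  # tags only, newline text nodes skipped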

4. descendants: all descendants (children, grandchildren, and so on)

body = soup.find('body')
v = body.descendants  # generator over every nested node: tags, their child tags, and text
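The difference between `children` and `descendants` is easiest to see side by side. The sketch below uses a made-up HTML fragment and the built-in `html.parser` backend (both assumptions) and keeps only `Tag` nodes so that text nodes do not clutter the result.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up fragment: <div> and <span> are direct children, the <p> tags are nested deeper
html = "<body><div><p>one</p><p>two</p></div><span>three</span></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# Direct children only
child_names = [node.name for node in body.children if isinstance(node, Tag)]
# Every nested tag, in document order
descendant_names = [node.name for node in body.descendants if isinstance(node, Tag)]

print(child_names)
print(descendant_names)
```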

5. clear: remove everything inside a tag (the tag itself is kept)

tag = soup.find('body')
tag.clear()
print(soup)
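A short sketch of what `clear` leaves behind, using a made-up fragment and the built-in `html.parser` backend (both assumptions):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<body><p>x</p><p>y</p></body>", "html.parser")
body = soup.find("body")

body.clear()           # drops both <p> tags and their text
remaining = str(soup)  # the empty <body> element is still in the tree
print(remaining)
```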

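6. decode: convert to a string (including the current tag); decode_contents: same, but excluding the current tag

body = soup.find('body')
v = body.decode()  # markup including <body>...</body>
v = body.decode_contents()  # inner markup only
print(v)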

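7. encode: convert to bytes (including the current tag); encode_contents: same, but excluding the current tag

body = soup.find('body')
v = body.encode()  # bytes including <body>...</body>
v = body.encode_contents()  # bytes of the inner markup only
print(v)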
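The four conversion methods above differ only in the return type (str or bytes) and in whether the enclosing tag is included. A minimal sketch, assuming a made-up fragment and the built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="box"><b>hi</b> there</div>', "html.parser")
div = soup.find("div")

outer_text = div.decode()            # str, includes the <div> itself
inner_text = div.decode_contents()   # str, only what is inside the <div>
outer_bytes = div.encode()           # bytes (UTF-8 by default), includes the <div>
inner_bytes = div.encode_contents()  # bytes, only what is inside the <div>

print(outer_text)
print(inner_text)
```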

8. find: get the first matching tag

tag = soup.find('a')
print(tag)
# string= is the current name of the older text= argument
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, string='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, string='Lacie')  # same filter using class_
print(tag)

9. find_all: get all matching tags

tags = soup.find_all('a')
print(tags)

tags = soup.find_all('a', limit=1)  # limit caps the number of results returned
print(tags)

# string= is the current name of the older text= argument
tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, string='Lacie')
# equivalent filter using the class_ keyword:
# tags = soup.find_all(name='a', class_='sister', recursive=True, string='Lacie')
print(tags)

# ####### Lists: the items are combined with OR #######
v = soup.find_all(name=['a', 'div'])  # name == 'a' or name == 'div'
print(v)

v = soup.find_all(class_=['sister0', 'sister'])
print(v)

v = soup.find_all(string=['Tillie'])
print(v, type(v[0]))

v = soup.find_all(id=['link1', 'link2'])
print(v)

v = soup.find_all(href=['link1', 'link2'])
print(v)

# ####### Regular expressions #######
import re
rep = re.compile('p')  # tag names containing "p"
rep = re.compile('^p')  # tag names starting with "p"
v = soup.find_all(name=rep)
print(v)

rep = re.compile('sister.*')
v = soup.find_all(class_=rep)  # class values containing "sister"
print(v)

rep = re.compile('http://www.oldboy.com/static/.*')
v = soup.find_all(href=rep)
print(v)

# ####### Filtering with a function #######
def func(tag):
    return tag.has_attr('class') and tag.has_attr('id')  # has both a class and an id attribute
v = soup.find_all(name=func)
print(v)

# ## get: read a tag attribute
tag = soup.find('a')
v = tag.get('id')
print(v)
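The filter styles listed above (list, regular expression, function, and `limit`) can be tried on a small document. The HTML below is a made-up sample loosely modelled on the `sister` links used in these notes, and the `html.parser` backend and the helper name `has_class_and_id` are assumptions for this sketch.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample document
html = """
<p class="title" id="t1">Story</p>
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a id="link3" href="http://example.com/tillie">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# List filter: matches <a> OR <p>
by_list = [t.name for t in soup.find_all(name=["a", "p"])]

# Regex filter: href ending in /elsie or /tillie
by_regex = [t["id"] for t in soup.find_all(href=re.compile(r"/(elsie|tillie)$"))]


def has_class_and_id(tag):
    """Function filter: keep tags that carry both a class and an id."""
    return tag.has_attr("class") and tag.has_attr("id")


by_func = [t["id"] for t in soup.find_all(has_class_and_id)]

# limit stops the search after the given number of matches
limited = [t["id"] for t in soup.find_all("a", limit=2)]

print(by_list, by_regex, by_func, limited)
```

Note that the third link has no `class` attribute, so the function filter skips it while the tag-name filters still return it.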

10. get_text: get the text inside a tag

tag = soup.find('a')
v = tag.get_text()  # an optional first argument is a separator string, not an attribute name
print(v)
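The first positional argument of `get_text` is a separator placed between the pieces of text, and `strip=True` trims each piece first. A minimal sketch, assuming a made-up fragment and the built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>big</b> world</p>", "html.parser")
p = soup.find("p")

plain = p.get_text()                  # pieces concatenated as they appear
joined = p.get_text("|", strip=True)  # pieces stripped, then joined with "|"

print(plain)
print(joined)
```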

11. index: find a tag's position among the children of another tag

tag = soup.find('body')
v = tag.index(tag.find('div'))
print(v)

tag = soup.find('body')
for i, v in enumerate(tag):
    print(i, v)
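One thing worth knowing about `index` is that text nodes, including bare newlines between tags, occupy positions too. The sketch below uses a made-up fragment and the built-in `html.parser` backend (both assumptions) to show this.

```python
from bs4 import BeautifulSoup

# The newlines between tags become text nodes in body.contents
html = "<body>\n<p>a</p>\n<div>b</div>\n</body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

position = body.index(body.find("div"))  # position among all child nodes
child_count = len(body.contents)         # 3 newline nodes + 2 tags

print(position, child_count)
```

Here the `<div>` is the second tag but sits at index 3, because two newline text nodes and the `<p>` come before it.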

12. is_empty_element: whether the tag is an empty (void) or self-closing element

It checks whether the tag is one of the following: 'br', 'hr', 'input', 'img', 'meta', 'spacer', 'link', 'frame', 'base'

tag = soup.find('br')
v = tag.is_empty_element
print(v)
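A quick way to see which tags count as empty elements is to compare a few in one fragment. This is a sketch on a made-up fragment with the built-in `html.parser` backend (both assumptions); note that an ordinary tag with no content, such as `<p></p>`, is not reported as an empty element.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><br/><p></p><img src='a.png'/></div>", "html.parser")

# find_all(True) matches every tag inside the <div>
flags = {t.name: t.is_empty_element for t in soup.find("div").find_all(True)}
print(flags)
```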

13. select, select_one: CSS selectors

soup.select("title")
soup.select("p:nth-of-type(3)")
soup.select("body a")  # a space searches all descendants; "> a" searches direct children only
soup.select("html head title")
tag = soup.select("span,a")
soup.select("head > title")
soup.select("p > a")
soup.select("p > a:nth-of-type(2)")
soup.select("p > #link1")
soup.select("body > a")
soup.select("#link1 ~ .sister")  # "#" selects by id (id=link1); "~" matches any later sibling
soup.select("#link1 + .sister")  # "+" matches only the immediately following sibling
soup.select(".sister")
soup.select("[class~=sister]")
soup.select("#link1")
soup.select("a#link2")  # an <a> tag whose id is link2
soup.select('a[href]')  # <a> tags that have an href attribute
soup.select('a[href="http://example.com/elsie"]')
soup.select('a[href^="http://example.com/"]')  # href starts with this URL
soup.select('a[href$="tillie"]')  # "$" means href ends with this value
soup.select('a[href*=".com/el"]')  # "*" means href contains this value
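The combinators in the selectors above are easy to mix up, so here is a small sketch that runs a few of them against a made-up document. The HTML, the helper name `ids`, and the `html.parser` backend are assumptions for illustration; `select` relies on the soupsieve package that is installed alongside recent BeautifulSoup4 releases.

```python
from bs4 import BeautifulSoup

# Made-up sample: three sibling links inside a <p>
html = """
<body>
<p class="story">
<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
<a class="sister" id="link3" href="http://example.com/tillie">Tillie</a>
</p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Return the id of every tag matched by a CSS selector."""
    return [t["id"] for t in soup.select(selector)]


descendants = ids("body a")           # any depth below <body>
direct = ids("body > a")              # direct children of <body> only
later_siblings = ids("#link1 ~ .sister")
next_sibling = ids("#link1 + .sister")
ends_with = ids('a[href$="tillie"]')
second = soup.select_one("p > a:nth-of-type(2)")["id"]  # select_one returns a single tag

print(descendants, direct, later_siblings, next_sibling, ends_with, second)
```

`body > a` returns nothing here because the links are children of `<p>`, not of `<body>`.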