Python Web Crawler Basics

Date: 2019-12-13

Tags: python, web crawler basics


Notes:
I: Basic steps of a simple crawler

1. Before crawling:
 (1) Define the goal
 (2) Find the web page that holds the data
 (3) Analyze the page structure and locate the data within it

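2. Crawler step two: the __fetch_content method
 Simulate an HTTP request, send it to the server, and get the HTML the server returns
 The data we want is then extracted from this HTML with regular expressions (step 3)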
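
The fetch step can be sketched with the standard library alone. This is a minimal sketch, not the author's method: the helper names `build_request` and `fetch_html`, the User-Agent string, the timeout, and the UTF-8 default are all assumptions made for illustration.

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-spider)"):
    """Build a request that carries a browser-like User-Agent header.

    The header value is an arbitrary placeholder, not a required one.
    """
    return request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10, encoding="utf-8"):
    """Send the request and return the response body decoded to str.

    Assumes the page is encoded as `encoding`; undecodable bytes are replaced.
    """
    req = build_request(url)
    with request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode(encoding, errors="replace")
```

Calling `fetch_html(some_url)` performs the network request; `build_request` on its own does not touch the network, which makes it easy to inspect while debugging.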

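3. Crawler step three: the __analysis method
 (1) Find a locating tag or identifier, then use a regular expression to find the required content.
 The principles for choosing it are:
 uniqueness, proximity to the data, and preferring an enclosing parent tag that has a closing tag
 (2) Extract the required data from the matched content; this may take several passes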
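
The two-pass extraction can be tried offline on a small HTML fragment. The sketch below reuses the three patterns from the code listing in this post; the sample HTML, the `video-nickname` class, and the function name `extract` are made up for illustration and only approximate what a real page might contain.

```python
import re

# Hypothetical fragment standing in for a downloaded page.
SAMPLE_HTML = """
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i> Alice </span>
    <span class="video-number">1.2万</span>
</div>
<div class="video-info">
    <span class="video-nickname"><i class="icon"></i> Bob </span>
    <span class="video-number">356</span>
</div>
"""

# [\s\S] matches any character including newlines; *? makes the match non-greedy,
# so each match stops at the first closing tag instead of the last one.
ROOT_PATTERN = r'<div class="video-info">([\s\S]*?)</div>'
NAME_PATTERN = r'</i>([\s\S]*?)</span>'
NUMBER_PATTERN = r'<span class="video-number">([\s\S]*?)</span>'


def extract(html):
    """First pass: isolate each parent block. Second pass: pull fields out of it."""
    blocks = re.findall(ROOT_PATTERN, html)
    return [
        {
            "name": re.findall(NAME_PATTERN, block),
            "number": re.findall(NUMBER_PATTERN, block),
        }
        for block in blocks
    ]


print(extract(SAMPLE_HTML))
```

The enclosing `div` serves as the locating tag here because it is unique per record, sits close to the data, and has a closing tag to anchor the end of the match.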

4. Refine the extracted data
 Use a lambda expression (with map) in place of a for loop
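
To make the replacement concrete, here is the same refinement written both ways. The sample records are invented; they mimic the shape that the analysis step produces (each field is a one-element list with stray whitespace).

```python
# Hypothetical raw records, shaped like the output of the analysis step.
raw_anchors = [
    {"name": ["\n  Alice  \n"], "number": [" 1.5万 "]},
    {"name": [" Bob "], "number": ["356\n"]},
]

# Version 1: an explicit for loop.
refined_loop = []
for anchor in raw_anchors:
    refined_loop.append({
        "name": anchor["name"][0].strip(),
        "number": anchor["number"][0].strip(),
    })

# Version 2: a lambda applied to every record with map.
refine = lambda anchor: {
    "name": anchor["name"][0].strip(),
    "number": anchor["number"][0].strip(),
}
refined_map = list(map(refine, raw_anchors))

print(refined_map == refined_loop)
```

Both versions build the same list; the lambda form simply moves the per-record transformation into one expression.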

5. Process the refined data
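
In the code listing further down, processing means sorting the records by viewer count. A hedged sketch of such a sort key follows; the function name `viewer_count` and the sample strings are assumptions, and treating both 万 and 萬 as "ten thousand" is a guess about what the page may display.

```python
import re


def viewer_count(text):
    """Convert a display string such as '356' or '1.5万' to a number."""
    match = re.search(r"\d+(?:\.\d+)?", text)   # integer or decimal
    if match is None:
        return 0.0                              # no digits found: sort last
    number = float(match.group())
    if "万" in text or "萬" in text:            # unit meaning ten thousand
        number *= 10000
    return number


counts = ["356", "1.5万", "2万"]
ranked = sorted(counts, key=viewer_count, reverse=True)
print(ranked)
```

Sorting on the parsed number rather than on the raw string avoids ranking '356' above '2万'.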

6. Display the processed data

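II: Coding conventions
 1. Comments
 2. Use of blank lines
 3. Keep functions to 10-20 lines
 4. Write flat, same-level methods and call them from one main method; avoid deeply nested method calls!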

IV: Supplement
 Beautiful Soup and the Scrapy crawler framework
 Crawlers, anti-crawlers, and anti-anti-crawlers
 When an IP gets blocked: proxy IPs
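
As one possible illustration of the proxy idea, `urllib.request` can route requests through a proxy via `ProxyHandler`. This is only a sketch: the function name `build_proxy_opener` is hypothetical, and the proxy address in the usage note is a placeholder, not a working proxy.

```python
from urllib import request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return request.build_opener(handler)
```

Usage would look like `opener = build_proxy_opener("http://127.0.0.1:8080")` followed by `opener.open(url)`, with the address replaced by a proxy you actually have access to.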
V: Summary
 (1) Practice regular expressions more
 (2) Practice lambda expressions more!
 (3) Train an object-oriented way of thinking

Code:

"""
This module is used to scrape data.
"""

import re
from urllib import request

# Tip: debug with breakpoints instead of print statements; this is very important!


class Spider:
    """
    This class is used to scrape data.
    """
    url = 'https://www.panda.tv/cate/hearthstone'
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'     # non-greedy match
    name_pattern = r'</i>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        """
        Request the page and return its HTML as a str.
        """
        r = request.urlopen(self.url)   # fetch the HTML
        html_bytes = r.read()
        html = str(html_bytes, encoding='utf-8')

        return html

    def __analysis(self, html):
        root_html = re.findall(self.root_pattern, html)     # list
        # print(root_html[0])   # result of the first match

        anchors = []
        for block in root_html:     # renamed so the loop no longer shadows `html`
            name = re.findall(self.name_pattern, block)
            number = re.findall(self.number_pattern, block)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        # print(anchors[0])

        return anchors

    @staticmethod
    def __refine(anchors):
        refine = lambda anchor: {'name': anchor['name'][0].strip(),  # each list holds one element
                                 'number': anchor['number'][0].strip()
                                 }
        return list(map(refine, anchors))   # a list can be iterated more than once

    def __sort(self, anchors):      # business logic
        anchors = sorted(anchors, key=self.__sort_seek, reverse=True)
        return anchors

    @staticmethod
    def __sort_seek(anchor):
        # Match an integer or a decimal; the original '\d*' could return an
        # empty string and dropped the fractional part.
        r = re.search(r'\d+(?:\.\d+)?', anchor['number'])
        number = float(r.group()) if r else 0.0
        # Unit meaning "ten thousand"; the page may use either character form.
        if '萬' in anchor['number'] or '万' in anchor['number']:
            number *= 10000

        return number

    @staticmethod
    def __show(anchors):
        # for anchor in anchors:
            # print(anchor['name'] + '-----' + anchor['number'])
        for rank, anchor in enumerate(anchors, start=1):
            print('rank' + str(rank)
                  + ' : ' + anchor['name']
                  + '   ' + anchor['number'])

    def go(self):                           # main method (calls the flat helper methods)
        html = self.__fetch_content()       # fetch the text
        anchors = self.__analysis(html)     # analyze the data
        anchors = self.__refine(anchors)    # refine the data
        # print(anchors)
        anchors = self.__sort(anchors)      # sort the data
        self.__show(anchors)                # display the result


spider = Spider()
spider.go()
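
One detail about the refine step is easy to trip over while debugging. In Python 3, `map` returns a one-shot iterator, so if a refine step returns a bare `map` object, a debug line such as `print(list(anchors))` placed before the sort would consume it and leave nothing to sort. The snippet below demonstrates the behavior with made-up strings.

```python
# map returns an iterator that can be consumed only once.
numbers = map(str.strip, [" 356 ", " 1.5万 "])

first_pass = list(numbers)    # consumes the iterator
second_pass = list(numbers)   # nothing is left

print(first_pass)
print(second_pass)
```

Storing the result as a list right away, as in `list(map(...))`, avoids the problem.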
