Implementing a Simple Web Crawler in Python 3

Date: 2019-11-25


This article is based on 蟲師's "Implementing a simple crawler in Python 2", with my own observations added.

# coding=utf-8
import re
import urllib.request

def getHtml(url):
    # Open the URL and read the response body, which arrives as bytes
    page = urllib.request.urlopen(url)
    html = page.read()
    # Decode the bytes into a str; this page declares UTF-8
    html = html.decode('UTF-8')
    return html

def getImg(html):
    # Capture the src of every image tagged class="BDE_Image"
    reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    num = 0
    for imgurl in imglist:
        # Raw string so the backslashes in the Windows path stay literal;
        # the D:\img folder must already exist
        urllib.request.urlretrieve(imgurl, r'D:\img\hardaway%s.jpg' % num)
        num += 1
    # Return the matched URLs so the print() below shows something useful
    return imglist

html = getHtml("http://tieba.baidu.com/p/1569069059")
print(getImg(html))

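re: a module that ships with Python, used for regular-expression operations
https://docs.python.org/3/library/re.html
urllib.request: part of urllib, which also ships with Python's standard library, used for opening URLs
https://docs.python.org/3/installing/index.html
First, a getHtml() function is defined:
urllib.request.urlopen() opens the URL
read() reads the data at that URL, returned as bytes
decode() decodes those bytes into a string using the specified encoding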

The encoding I specify here is UTF-8, which I found by looking at the page's source code.
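The three steps above (open, read, decode) can be sketched as one small helper that also closes the connection when it is done. This is only an illustration: the name `fetch_text`, the `encoding` and `timeout` parameters, and the `data:` URL in the usage lines are assumptions that do not appear in the original script. The `data:` URL is there only so the example runs without network access.

```python
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=10):
    """Open a URL, read the raw bytes, and decode them to str.

    Hypothetical helper: the name and parameters are illustrative.
    """
    # urlopen() returns a response object that works as a context manager,
    # so the connection is closed when the block ends.
    with urllib.request.urlopen(url, timeout=timeout) as page:
        raw = page.read()  # read() returns bytes, not str
    return raw.decode(encoding)  # decode() turns the bytes into str


# A data: URL is handled by urllib itself, so no network is needed here.
sample = fetch_text("data:text/html;charset=utf-8,%3Cp%3Ehello%3C/p%3E")
print(type(sample), sample)
```

For a real page, the call would look like `fetch_text("http://tieba.baidu.com/p/1569069059")`, with `encoding` set to whatever the page declares.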

Next, a getImg() function is defined to pick out the image URLs we need from the full page data.

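The encoding used in the example above was found by viewing the page source. Later I tried using a regular expression to match the encoding declared by charset, and then decoding with whatever encoding was matched.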

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()  # bytes
    # str() gives the text form of the bytes object, which a str pattern can search
    rehtml = str(html)
    reg = r'content="text/html; charset=(.+?)"'
    charsetre = re.compile(reg)
    charsetlist = re.findall(charsetre, rehtml)
    # findall() returns a list; its first item is the encoding name
    code = charsetlist[0]
    html = html.decode(code)
    return html

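A word on the reasoning here: html = page.read() returns a bytes object, and re.findall() cannot apply a string pattern to a bytes object.
So I defined a new variable, rehtml, and used str() to convert the value of html into a string that re.findall() can work with.
I also defined a new variable, code, to hold the encoding name, because re.findall() returns a list and what I need is a string.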
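One alternative to converting with str() is to search the raw bytes directly with a bytes pattern, since re accepts a bytes pattern on bytes input. The sketch below is only an illustration under assumptions: the name `detect_charset`, the fallback to UTF-8, and the sample markup are hypothetical, and the pattern only covers simple `charset=` declarations.

```python
import re

# Bytes pattern: matches charset=UTF-8, charset="gbk", charset='utf-8', ...
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)


def detect_charset(raw, default="utf-8"):
    """Return the charset named in raw HTML bytes, or `default` if none is found."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return default
    # The captured group is bytes; encoding names are ASCII, so decode it.
    return match.group(1).decode("ascii")


# Made-up page content, encoded as GBK to stand in for page.read().
raw = '<meta http-equiv="Content-Type" content="text/html; charset=GBK"><p>你好</p>'.encode("gbk")
code = detect_charset(raw)
html = raw.decode(code)
print(code, html)
```

Falling back to a default avoids the IndexError that indexing an empty findall() result would raise on a page with no charset declaration.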
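In getImg(), the regular expression is written to match the images we want: reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
re.compile() compiles the regular expression into a regular expression object, which is more efficient when the pattern is used several times in a program.
re.findall() returns all non-overlapping matches of the pattern in the page data as a list of strings.
urllib.request.urlretrieve() downloads each image to the local disk, here into the img folder on drive D.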
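The matching and file-naming steps can be tried without touching the network. The sketch below is a hedged illustration: `plan_downloads`, `download_all`, the `img` folder, and the sample markup with example.com URLs are all made up for the demonstration, and only the regular expression comes from the script above.

```python
import re
import urllib.request
from pathlib import Path

# Same pattern as in getImg(): capture the src of BDE_Image images ending in .jpg
IMG_RE = re.compile(r'img class="BDE_Image" src="(.+?\.jpg)"')


def plan_downloads(html, folder, prefix="hardaway"):
    """Pair each matched image URL with a numbered local file path."""
    urls = IMG_RE.findall(html)  # list of captured URL strings
    return [(url, Path(folder) / f"{prefix}{i}.jpg") for i, url in enumerate(urls)]


def download_all(plan):
    """Download every (url, path) pair, creating the target folder if needed."""
    for url, path in plan:
        path.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(path))


# Made-up sample markup; the URLs are placeholders, so nothing is downloaded here.
sample_html = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" width="560">'
    '<img class="other" src="http://example.com/skip.jpg">'
    '<img class="BDE_Image" src="http://example.com/b.jpg">'
)
plan = plan_downloads(sample_html, "img")
for url, path in plan:
    print(url, "->", path)
```

Calling `download_all(plan)` with real image URLs would perform the actual downloads. Building the paths with pathlib sidesteps the backslash-escaping issue of a hand-written Windows path.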
