Notes on Using the urlretrieve() Function in Python Crawlers

Date: 2019-11-13

The urllib module provides the urlretrieve() function, which downloads remote data directly to a local file.

>>> help(urllib.urlretrieve)
Help on function urlretrieve in module urllib:
urlretrieve(url, filename=None, reporthook=None, data=None)

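The filename parameter specifies the local path to save to. If it is omitted, urllib creates a temporary file to hold the data.
The reporthook parameter is a callback function. It is invoked once when the connection to the server is established and again each time a data block has been transferred, so it can be used to display the current download progress.
The data parameter is the data to POST to the server. The function returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers holds the server's response headers.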
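
The return value described above can be sketched as follows. The article's code targets Python 2 (urllib.urlretrieve); this sketch assumes Python 3, where the same function lives at urllib.request.urlretrieve. To stay runnable offline it fetches a local file:// URL in place of a remote server, and the helper name fetch_to and the file names are made up for illustration.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve  # assumption: Python 3 location of urlretrieve


def fetch_to(url, filename):
    """Download url to filename; return (saved path, Content-Length header or None)."""
    saved_path, headers = urlretrieve(url, filename)
    return saved_path, headers.get("Content-Length")


# Stand-in for a remote resource: a small local file addressed through file://
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.html"
source.write_bytes(b"<html>hello</html>")

target = os.path.join(workdir, "copy.html")
saved_path, length = fetch_to(source.as_uri(), target)
print(saved_path)  # the path that was passed as filename
print(length)      # the size reported in the response headers, as a string
```

The first element of the tuple is simply the filename that was passed in; the second behaves like a mapping of header names to values.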

The example below demonstrates the function: it fetches Google's HTML, saves it to D:/google.html, and displays the download progress.

import urllib

def cbk(a, b, c):
    '''Callback function
    @a: number of data blocks downloaded so far
    @b: size of one data block
    @c: size of the remote file (-1 if the server does not report it)
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print '%.2f%%' % per

url = 'http://www.google.com'
local = 'd://google.html'
urllib.urlretrieve(url, local, cbk)

Running it in the Python shell gives the following:

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import urllib
>>> def cbk(a, b, c): 
    '''Callback function
    @a: number of data blocks downloaded so far
    @b: size of one data block
    @c: size of the remote file
    '''
    per = 100.0 * a * b / c 
    if per > 100: 
        per = 100 
    print '%.2f%%' % per
 
>>> url = 'http://www.google.com'
>>> local = 'd://google.html'
>>> urllib.urlretrieve(url, local, cbk)
-0.00%
-819200.00%
-1638400.00%
-2457600.00%
('d://google.html', <httplib.HTTPMessage instance at 0x0000000003450608>)
>>>
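
The negative percentages in the session above occur because the response carried no Content-Length header; in that case urlretrieve() passes -1 as the total size to the callback. Each block in that run is 8192 bytes, which is why one block prints -819200.00%. Below is a minimal sketch of a callback that handles this case. The names progress_text and report are hypothetical, and the code uses Python 3 print syntax, whereas the article uses Python 2.

```python
def progress_text(block_count, block_size, total_size):
    """Format download progress; total_size <= 0 means the size is unknown."""
    downloaded = block_count * block_size
    if total_size <= 0:
        # Unknown total: report the byte count instead of a meaningless percentage.
        return '%d bytes downloaded' % downloaded
    percent = min(100.0, 100.0 * downloaded / total_size)
    return '%.2f%%' % percent


def report(block_count, block_size, total_size):
    """Callback with the (a, b, c) signature that urlretrieve() calls."""
    print(progress_text(block_count, block_size, total_size))


print(progress_text(1, 8192, -1))     # 8192 bytes downloaded
print(progress_text(1, 8192, 16384))  # 50.00%
print(progress_text(3, 8192, 16384))  # 100.00%
```

Passing report as the third argument of urlretrieve() then prints a byte count whenever the server does not announce the file size.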

Below is an example that downloads a file with urlretrieve() and displays the download progress.

#!/usr/bin/python
#encoding:utf-8
import urllib
import os

def Schedule(a, b, c):
    '''
    a: number of data blocks downloaded so far
    b: size of one data block
    c: size of the remote file
    '''
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print '%.2f%%' % per

url = 'http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.bz2'
#local = url.split('/')[-1]
local = os.path.join('/data/software', 'Python-2.7.5.tar.bz2')
urllib.urlretrieve(url, local, Schedule)
######output######
#0.00%
#0.07%
#0.13%
#0.20%
#....
#99.94%
#100.00%
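
The commented-out line `local = url.split('/')[-1]` in the script derives the local file name from the URL. A slightly more defensive variant is sketched below, assuming Python 3's urllib.parse (the article itself uses Python 2). The function name local_name and the fallback name download.bin are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def local_name(url, default='download.bin'):
    """Return the last path component of url, ignoring query string and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # URLs with an empty path or a trailing slash have no file name to reuse.
    return name or default


print(local_name('http://www.python.org/ftp/python/2.7.5/Python-2.7.5.tar.bz2'))
print(local_name('http://example.com/files/a.jpg?size=large'))
print(local_name('http://www.google.com'))
```

Unlike a plain split on '/', this keeps a query string such as `?size=large` out of the file name and falls back to a default when the URL has no path.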

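Building on the exercises above: urlopen() makes it easy to fetch a remote HTML page, Python regular expressions can then parse the page and match the data you need, and urlretrieve() downloads that data to a local file. For remote URLs that restrict access or limit the number of connections, you can connect through proxies. If the remote data is large and a single-threaded download is too slow, you can download with multiple threads. That, in essence, is what people call a crawler.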

# Scrape some small images from a Baidu Tieba post
# urllib.urlretrieve downloads remote data to a local file
import urllib
import urllib2
import re

# Example input: http://tieba.baidu.com/p/3868127254
page_url = raw_input('input url:')
html = urllib2.urlopen(page_url).read()

def getimg(html):
    # Capture the value of every src="..." attribute that follows "img"
    reg = re.compile(r'img.src="(.*?)"')
    urls = reg.findall(html)
    # Save the images as 1.jpg, 2.jpg, ... in the current directory
    for n, img_url in enumerate(urls, 1):
        urllib.urlretrieve(img_url, '%s.jpg' % n)

getimg(html)
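
The multi-threaded download mentioned earlier can be sketched with a thread pool. This is an illustration under stated assumptions, not the article's method: it uses Python 3 (urllib.request and concurrent.futures), the helper names extract_image_urls and download_all are hypothetical, the pattern expects a plain `<img ... src="...">` attribute, and the demo reads local file:// URLs so that it runs offline.

```python
import os
import re
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Assumed pattern: the first src="..." attribute inside an <img> tag.
IMG_SRC = re.compile(r'<img[^>]*?\ssrc="([^"]+)"')


def extract_image_urls(html):
    """Return the src values of all <img> tags found in html, in page order."""
    return IMG_SRC.findall(html)


def download_all(urls, dest_dir, workers=4):
    """Download urls concurrently into dest_dir as 1.jpg, 2.jpg, ...; return the paths."""
    def fetch(job):
        index, url = job
        path = os.path.join(dest_dir, '%d.jpg' % index)
        urlretrieve(url, path)
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the returned paths line up with urls.
        return list(pool.map(fetch, enumerate(urls, start=1)))


# Offline demo: two fake "images" on disk stand in for remote files.
src_dir = tempfile.mkdtemp()
out_dir = tempfile.mkdtemp()
sources = []
for name, payload in (('a.jpg', b'first'), ('b.jpg', b'second')):
    p = Path(src_dir) / name
    p.write_bytes(payload)
    sources.append(p.as_uri())

page = ''.join('<img class="pic" src="%s">' % u for u in sources)
saved = download_all(extract_image_urls(page), out_dir)
print(saved)
```

Each worker thread runs one urlretrieve() call at a time, so the workers argument caps the number of simultaneous connections, which matters for servers that limit connections per client.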