python10min series: a multithreaded downloader

Date: 2019-12-13


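Today someone in a chat group asked about writing a file from multiple threads in Python. It reminded me that this is the entrance exercise for reboot's architect class. Thinking it over, it has quite a few pitfalls and test points and would work well as an interview question, so here is a short write-up of my thinking. The code and comments are on github; stars are much appreciated.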

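This article assumes some Python basics. Ideally you are already familiar with the following:

Python file handling: open, write
A basic understanding of HTTP headers
The os and sys modules
Multithreading with the threading module
Sending requests with the requests module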

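Since the exercise is a multithreaded download, the first thing to solve is the download itself. To keep testing easy we won't use something as large as the QQ installer; we'll use the wise, mighty and rather profound avatar of pc instead, which lives at http://51reboot.com/src/blogimg/pc.jpg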

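Download

Python's requests module wraps HTTP requests nicely, so we use it to send an HTTP GET request and then write the response to a local file. (For details on requests, HTTP and Python file handling, search online or stay tuned; I will write about them later.) With the approach clear, the code practically writes itself.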

# Quick-and-dirty download
import requests

res = requests.get('http://51reboot.com/src/blogimg/pc.jpg')
# Open in binary mode: res.content is raw bytes
with open('pc.jpg', 'wb') as f:
    f.write(res.content)

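After running the code above, a pc.jpg appears in the folder; that is the image you wanted.

The code above does too little, though. Our requirement is a multithreaded download, and this quick-and-dirty version does not meet it at all. To picture multithreading, imagine a warehouse holding many, many bags of Oreo cookies; the boss wants me to carry them all to the office and stack them in their original order.

The code above is roughly me going to the warehouse alone and carrying every bag back in one trip (the original post illustrated this flow with a diagram).

To meet the multithreading requirement, we first have to break the task into subtasks that can run in parallel and whose results can be combined into the final result.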

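Breaking down the task

To do this we first need to know how large the data is, and then fetch it in chunks. That requires a good understanding of the HTTP protocol.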

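Request the resource with the HEAD method; the response contains only the HTTP headers, no body
- The Content-Length header tells us the size of the resource, say 50 bytes
Say we use four threads; each thread fetches roughly 1/4
- 50/4 = 12 with integer division, so the first three threads take 12 bytes each and the last thread takes whatever is left
Each thread fetches its chunk, seeks to the matching position in the file, and writes it there
- file.seek
To make this easier to follow, we first get the whole flow working with a single thread (the original post showed a flow diagram here)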
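The arithmetic in the list above can be tried offline, without touching the network. The sketch below is only an illustration: `split_ranges` is a hypothetical helper name that does not appear in the original code, and it assumes inclusive (start, end) pairs with the last chunk absorbing the remainder.

```python
def split_ranges(total, num):
    """Split `total` bytes into `num` inclusive (start, end) byte ranges.

    Hypothetical helper for illustration; the last range absorbs the remainder.
    """
    if total <= 0 or num <= 0 or num > total:
        raise ValueError("need 0 < num <= total")
    size = total // num  # integer division: 50 // 4 == 12
    ranges = []
    for i in range(num):
        start = i * size
        # Ends are inclusive, so a 12-byte chunk starting at 0 ends at 11.
        end = total - 1 if i == num - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


print(split_ranges(50, 4))  # [(0, 11), (12, 23), (24, 35), (36, 49)]
```

With 50 bytes and four threads, the first three chunks hold 12 bytes each and the last one holds the remaining 14.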

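With the approach clear, the code again practically writes itself. Let's test the Range header first.

The Range header is sent in the request and specifies the position of the first byte and of the last byte, for example 1-12. Both positions are zero-based and inclusive. If the second number is omitted, the range runs to the end of the resource, for example 36-.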
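As a small illustration of that syntax, the sketch below builds Range header values and counts how many bytes a range covers. Both function names are hypothetical helpers introduced here, not part of the original code or of requests.

```python
def range_header(start, end=None):
    """Build a Range header value for bytes start..end (both inclusive).

    Leaving `end` as None asks for everything from `start` to the end.
    """
    if start < 0 or (end is not None and end < start):
        raise ValueError("invalid byte range")
    if end is None:
        return "bytes=%d-" % start
    return "bytes=%d-%d" % (start, end)


def range_length(start, end):
    """Number of bytes covered by an inclusive range."""
    return end - start + 1


print(range_header(1, 12))   # bytes=1-12
print(range_header(36))      # bytes=36-
print(range_length(1, 12))   # 12
```

Because both ends are inclusive, 1-12 covers 12 bytes and 0-12 would cover 13.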

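# Range test
import requests
# HTTP headers asking for the first 15000 bytes (offsets 0-14999, both ends inclusive)
headers = {'Range': 'bytes=0-14999', 'Accept-Encoding': '*'}
res = requests.get('http://51reboot.com/src/blogimg/pc.jpg', headers=headers)

# Binary mode, because res.content is raw bytes
with open('pc.jpg', 'wb') as f:
    f.write(res.content)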

We got the first 15000 bytes of the avatar (the original post showed the partial image here); by eye, Range works as expected.
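Instead of judging by eye, the response itself can confirm that the server honoured the request: a partial response carries status 206 and a Content-Range header of the form `bytes START-END/TOTAL`, while a plain 200 means the Range header was ignored and the whole file came back. The sketch below works on plain values so it runs offline; the two function names are hypothetical, and the header strings are made-up sample values, not output captured from the server. In real use you would pass `res.status_code` and `res.headers.get('Content-Range')`.

```python
import re


def parse_content_range(value):
    """Parse 'bytes START-END/TOTAL' into (start, end, total).

    TOTAL may be '*' when the server does not know the full size;
    it is returned as None in that case.
    """
    match = re.match(r"^bytes (\d+)-(\d+)/(\d+|\*)$", value.strip())
    if match is None:
        raise ValueError("unrecognised Content-Range: %r" % value)
    start, end = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    return start, end, total


def is_expected_chunk(status_code, content_range, start, end):
    """True if the response is a 206 covering exactly bytes start..end."""
    if status_code != 206 or content_range is None:
        return False  # e.g. a 200: the server sent the whole resource
    got_start, got_end, _ = parse_content_range(content_range)
    return (got_start, got_end) == (start, end)


print(is_expected_chunk(206, "bytes 0-14999/21520", 0, 14999))  # True
print(is_expected_chunk(200, None, 0, 14999))                   # False
```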

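Let's keep fleshing out the code

First use requests.head to get the length of the data
Once the number of threads is decided, work out the byte interval each thread should fetch, i.e. the value of its Range header
Write to the file using seek
The functionality is getting more complex, so we organise the code with a class
Write the single-threaded version first, then improve it step by step
Once again, the code practically writes itself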

import requests

# The downloader class
class downloader:
    # Constructor
    def __init__(self):
        # URL of the data to download
        self.url = 'http://51reboot.com/src/blogimg/pc.jpg'
        # Number of chunks (one per thread later on)
        self.num = 8
        # Name of the output file, taken from the end of the URL
        self.name = self.url.split('/')[-1]
        # Request the URL with the HEAD method
        r = requests.head(self.url)
        # Read the length of the data from the headers
        self.total = int(r.headers['Content-Length'])
        print('total is %s' % self.total)

    def get_range(self):
        ranges = []
        # e.g. total is 50 and there are 4 threads: offset is 12
        offset = self.total // self.num
        for i in range(self.num):
            if i == self.num - 1:
                # Last thread: no end position, read to the end
                ranges.append((i * offset, ''))
            else:
                # Interval for each thread; Range ends are inclusive, hence the -1
                ranges.append((i * offset, (i + 1) * offset - 1))
        # ranges is roughly [(0, 11), (12, 23), (24, 35), (36, '')]
        return ranges

    def run(self):
        # Binary mode, because r.content is raw bytes
        f = open(self.name, 'wb')
        for ran in self.get_range():
            # Build the Range header and fetch this chunk
            r = requests.get(self.url, headers={'Range': 'bytes=%s-%s' % ran, 'Accept-Encoding': '*'})
            # Seek to the chunk's position
            f.seek(ran[0])
            # Write the data
            f.write(r.content)
        f.close()

if __name__ == '__main__':
    down = downloader()
    down.run()
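Seeking before each write is what makes the order of the chunks irrelevant, and that is what allows the work to be handed to threads later. The offline sketch below (Python 3) shows this with made-up data: 50 generated bytes stand in for the downloaded resource, slicing stands in for a ranged GET, and the chunks are deliberately written in reverse order.

```python
import os
import tempfile

data = bytes(range(50))  # stand-in for a 50-byte resource
ranges = [(0, 11), (12, 23), (24, 35), (36, 49)]  # inclusive byte ranges

path = os.path.join(tempfile.mkdtemp(), "demo.bin")
with open(path, "wb") as f:
    # Write the chunks in reverse order on purpose.
    for start, end in reversed(ranges):
        chunk = data[start:end + 1]  # what a ranged GET would return
        f.seek(start)                # jump to where this chunk belongs
        f.write(chunk)

with open(path, "rb") as f:
    result = f.read()

print(result == data)  # True
```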

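Multithreading

I won't go into what threads and processes are here; explaining that properly would take an article of its own. It is enough to know that the threading module is the one that handles multithreading. Its basic usage is outlined below; for more detail, search online or watch for later articles.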

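threading.Thread creates a thread and sets the function it runs
start starts the thread
setDaemon marks the thread as a daemon thread
join waits for the thread to finish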
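Those four calls fit into a few lines. The sketch below is a toy example unrelated to downloading: the `worker` function and the squares it computes are invented for illustration. It sets the `daemon` attribute instead of calling `setDaemon`, since `setDaemon` is deprecated in recent Python 3 releases.

```python
import threading

results = []
results_lock = threading.Lock()


def worker(n):
    # Each thread records the square of its argument.
    with results_lock:
        results.append(n * n)


threads = []
for n in range(4):
    t = threading.Thread(target=worker, args=(n,))
    t.daemon = True  # same effect as setDaemon(True); set it before start()
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # block until this thread has finished

print(sorted(results))  # [0, 1, 4, 9]
```

The threads may finish in any order, which is why the results are sorted before printing.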
The complete multithreaded downloader looks like this

import requests
import threading

class downloader:
    def __init__(self):
        self.url = 'http://51reboot.com/src/blogimg/pc.jpg'
        self.num = 8
        self.name = self.url.split('/')[-1]
        r = requests.head(self.url)
        self.total = int(r.headers['Content-Length'])
        print('total is %s' % self.total)

    def get_range(self):
        ranges = []
        offset = self.total // self.num
        for i in range(self.num):
            if i == self.num - 1:
                ranges.append((i * offset, ''))
            else:
                # Range ends are inclusive, hence the -1
                ranges.append((i * offset, (i + 1) * offset - 1))
        return ranges

    def download(self, start, end):
        headers = {'Range': 'bytes=%s-%s' % (start, end), 'Accept-Encoding': '*'}
        res = requests.get(self.url, headers=headers)
        self.fd.seek(start)
        self.fd.write(res.content)

    def run(self):
        # Binary mode, because res.content is raw bytes
        self.fd = open(self.name, 'wb')
        thread_list = []
        n = 0
        for ran in self.get_range():
            start, end = ran
            print('thread %d start:%s,end:%s' % (n, start, end))
            n += 1
            thread = threading.Thread(target=self.download, args=(start, end))
            thread.start()
            thread_list.append(thread)
        for i in thread_list:
            i.join()
        print('download %s load success' % self.name)
        self.fd.close()

if __name__ == '__main__':
    down = downloader()
    down.run()

Running python downloader.py prints the following

total is 21520
thread 0 start:0,end:2689
thread 1 start:2690,end:5379
thread 2 start:5380,end:8069
thread 3 start:8070,end:10759
thread 4 start:10760,end:13449
thread 5 start:13450,end:16139
thread 6 start:16140,end:18829
thread 7 start:18830,end:
download pc.jpg load success

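The run function was changed to add the threading parts, and a new download function is dedicated to fetching one chunk of data. Both functions are explained in detail below.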

def download(self, start, end):
    # Build the Range header; Accept-Encoding: * accepts any content encoding
    headers = {'Range': 'bytes=%s-%s' % (start, end), 'Accept-Encoding': '*'}
    res = requests.get(self.url, headers=headers)
    # Seek to the start position of this chunk
    self.fd.seek(start)
    self.fd.write(res.content)

def run(self):
    # Keep the open file object on the instance (binary mode for raw bytes)
    self.fd = open(self.name, 'wb')
    thread_list = []
    # A counter used to label each thread in the printed output
    n = 0
    for ran in self.get_range():
        start, end = ran
        # Print progress information
        print('thread %d start:%s,end:%s' % (n, start, end))
        n += 1
        # Create the thread, pass the arguments; the target function is download
        thread = threading.Thread(target=self.download, args=(start, end))
        # Start it
        thread.start()
        thread_list.append(thread)
    for i in thread_list:
        # Wait for each thread to finish
        i.join()
    print('download %s load success' % self.name)
    # Close the file
    self.fd.close()
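One thing to watch in `download`: every thread shares the single file object `self.fd`, and seek followed by write is two separate steps, so another thread can move the file position in between. A common way to rule that out is to wrap the pair in a `threading.Lock`. The sketch below (Python 3) is not the author's code; it swaps the network call for slicing a made-up in-memory buffer so that it runs offline, and the names `fd_lock` and `data` are assumptions introduced for the example.

```python
import os
import tempfile
import threading

data = bytes(range(256)) * 100  # 25600 bytes standing in for the remote file
ranges = [(i * 3200, (i + 1) * 3200 - 1) for i in range(8)]  # 8 inclusive ranges

path = os.path.join(tempfile.mkdtemp(), "locked.bin")
fd = open(path, "wb")
fd_lock = threading.Lock()


def download(start, end):
    chunk = data[start:end + 1]  # stands in for requests.get(...).content
    with fd_lock:                # keep seek + write together as one step
        fd.seek(start)
        fd.write(chunk)


threads = [threading.Thread(target=download, args=r) for r in ranges]
for t in threads:
    t.start()
for t in threads:
    t.join()
fd.close()

with open(path, "rb") as f:
    ok = f.read() == data
print(ok)  # True
```

Only the short seek-and-write step is serialised; in the real downloader the slow part, the HTTP request, would still run in parallel outside the lock.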

Things that can still be improved

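Sharing one file descriptor between several threads can cause problems
- The suggestion is to duplicate the descriptor with os.dup and open it with os.fdopen (note that descriptors duplicated with os.dup still share a single file offset, so guarding seek and write with a lock, or calling open() separately in each thread, is the safer choice)
The URL to download and the number of threads should be passed in on the command line
- Use sys.argv to read the command-line arguments
- Support the form python downloader.py url num
- Report an error when the number or format of the arguments is wrong
All kinds of error handling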
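The command-line item can be sketched as follows. This is one possible shape, not the author's implementation: `parse_args` and `USAGE` are hypothetical names, and the checks (an http or https URL, a positive integer thread count) are assumptions about what "wrong format" should mean.

```python
USAGE = "usage: python downloader.py url num"


def parse_args(argv):
    """Return (url, num) from an argv-style list, or raise ValueError."""
    if len(argv) != 3:
        raise ValueError(USAGE)
    url, num_text = argv[1], argv[2]
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if not num_text.isdigit() or int(num_text) < 1:
        raise ValueError("num must be a positive integer")
    return url, int(num_text)


# Simulated command line; a real script would pass sys.argv instead.
print(parse_args(["downloader.py", "http://51reboot.com/src/blogimg/pc.jpg", "8"]))
# ('http://51reboot.com/src/blogimg/pc.jpg', 8)
```

In the script itself, the call would sit under `if __name__ == '__main__':`, catching ValueError and exiting with the message via `sys.exit(str(err))`.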
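As the saying goes: Dior for the ladies, Oreos for the gents. This article is one you deserve.

That's about it. I am still learning Python myself, and this article reflects only my personal views. Mistakes are inevitable, so corrections are welcome; let's learn together. The complete code for this article is on github, and stars are much appreciated.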