[Repost] The GIL, Multiprocessing, and Multithreading in Python

1 GIL (Global Interpreter Lock)


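All else being equal, the execution speed of a Python program is directly tied to the "speed" of the interpreter. No matter how much you optimize your own program, its execution speed still depends on how efficiently the interpreter runs it.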

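At present, multithreading is still the most common way to take advantage of multi-core systems. Although multithreaded programming is a big improvement over "sequential" programming, even careful programmers cannot achieve the best possible concurrency in their code.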

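For any Python program running on CPython, no matter how many processors are available, only one thread executes Python bytecode at any given moment.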

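In fact, this question is asked so often that the Python experts have crafted a standard answer: "Don't use multithreading; use multiprocessing." But that answer is even more confusing than the question.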

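The GIL protects access to things such as the current thread state and the heap-allocated objects used for garbage collection. However, nothing about the Python language itself requires a GIL; it is an artifact of the implementation. There are other Python interpreters (and compilers) that do not use a GIL. For CPython, though, the GIL has been there almost since its inception.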
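To make the effect of the GIL concrete, here is a minimal sketch (Python 3 syntax) that runs the same CPU-bound function twice sequentially and then in two threads. The names count_down, run_sequential, run_threaded, and the iteration count are illustrative assumptions, not part of the original article. On a CPython build with the GIL, the threaded run is typically no faster than the sequential one; actual timings vary by machine and interpreter version.

```python
import threading
import time

def count_down(n):
    # Pure-Python CPU-bound loop; the running thread holds the GIL.
    while n > 0:
        n -= 1

def timed(fn):
    # Return the wall-clock seconds taken by calling fn().
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def run_sequential(n):
    count_down(n)
    count_down(n)

def run_threaded(n):
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    N = 2000000  # illustrative workload size
    print('sequential: %.3fs' % timed(lambda: run_sequential(N)))
    print('threaded:   %.3fs' % timed(lambda: run_threaded(N)))
```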

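Regardless of how one feels about Python's GIL, it remains the most difficult technical challenge in the Python language. Understanding its implementation requires a thorough grasp of operating system design, multithreaded programming, C, interpreter design, and the CPython interpreter implementation. These prerequisites alone deter many developers from studying the GIL in depth.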

2 threading

The threading module provides a higher-level interface built on top of the thread module. If threading cannot be used because thread is missing, dummy_threading can be used instead.

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
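The I/O-bound case mentioned above can be sketched as follows (Python 3 syntax). The helper names fake_io and run_io_tasks are hypothetical, and time.sleep stands in for a blocking I/O call; like real blocking I/O, it releases the GIL, so the waits overlap and the total time is close to the longest single wait rather than the sum.

```python
import threading
import time

def fake_io(delay, results, index):
    # time.sleep releases the GIL, standing in for a blocking I/O call.
    time.sleep(delay)
    results[index] = delay

def run_io_tasks(delays):
    # Start one thread per simulated I/O task and wait for all of them.
    results = [None] * len(delays)
    threads = [threading.Thread(target=fake_io, args=(d, results, i))
               for i, d in enumerate(delays)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.time() - start

if __name__ == '__main__':
    results, elapsed = run_io_tasks([0.2, 0.2, 0.2, 0.2])
    print('finished %d tasks in %.2fs' % (len(results), elapsed))
```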

Example:

import threading, zipfile

class AsyncZip(threading.Thread):
    def __init__(self, infile, outfile):
        threading.Thread.__init__(self)
        self.infile = infile
        self.outfile = outfile
    def run(self):
        f = zipfile.ZipFile(self.outfile, 'w', zipfile.ZIP_DEFLATED)
        f.write(self.infile)
        f.close()
        print 'Finished background zip of: ', self.infile

background = AsyncZip('mydata.txt', 'myarchive.zip')
background.start()
print 'The main program continues to run in foreground.'

background.join()    # Wait for the background task to finish
print 'Main program waited until background was done.'

2.1 Creating threads

import threading
import datetime

class ThreadClass(threading.Thread):
     def run(self):
         now = datetime.datetime.now()
         print "%s says Hello World at time: %s" % (self.getName(), now)

for i in range(2):
    t = ThreadClass()
    t.start()
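Subclassing Thread is one option; threading.Thread also accepts a target callable. The following is a minimal sketch of that alternative in Python 3 syntax. The names say_hello and run_threads are illustrative assumptions, and the messages are collected in a list instead of printed so that the caller can inspect them.

```python
import datetime
import threading

def say_hello(log):
    # Record which thread ran and when; list.append is atomic in CPython.
    name = threading.current_thread().name
    now = datetime.datetime.now()
    log.append('%s says Hello World at time: %s' % (name, now))

def run_threads(count):
    log = []
    threads = [threading.Thread(target=say_hello, args=(log,))
               for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait so that every message has been recorded
    return log

if __name__ == '__main__':
    for line in run_threads(2):
        print(line)
```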

2.2 Using queues with threads

import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
        "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()

            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            #place chunk into out queue
            self.out_queue.put(chunk)

            #signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded parser for fetched pages"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            #grabs chunk from out queue
            chunk = self.out_queue.get()

            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            #signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    #populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()


    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)
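The queue pattern above (workers pulling from one queue, calling task_done, and the main thread blocking on join) can be sketched in Python 3 syntax, where the Queue module is named queue. This is a simplified illustration under stated assumptions: worker and square_all are hypothetical names, and squaring a number stands in for fetching and parsing a page so that the sketch runs without network access.

```python
import queue
import threading

def worker(in_queue, out_queue):
    # Loop forever; the thread is a daemon, so it dies with the main program.
    while True:
        item = in_queue.get()
        try:
            out_queue.put(item * item)  # stand-in for fetch/parse work
        finally:
            in_queue.task_done()        # signal that this job is finished

def square_all(items, num_workers=3):
    in_queue = queue.Queue()
    out_queue = queue.Queue()
    for _ in range(num_workers):
        t = threading.Thread(target=worker, args=(in_queue, out_queue))
        t.daemon = True
        t.start()
    for item in items:
        in_queue.put(item)
    in_queue.join()  # block until every queued item has been processed
    results = []
    while not out_queue.empty():
        results.append(out_queue.get())
    return sorted(results)

if __name__ == '__main__':
    print(square_all([1, 2, 3, 4]))
```

Calling task_done in a finally clause keeps join from hanging if the work raises an exception.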

3 dummy_threading (fallback for threading)

The dummy_threading module provides an exact duplicate of the threading module's interface. If thread is unavailable, this module can be used as a replacement.

Usage:

try:
    import threading as _threading
except ImportError:
    import dummy_threading as _threading

4 thread

In Python 3 this module is called _thread. Prefer the threading module whenever possible.

5 dummy_thread (fallback for thread)

The dummy_thread module provides an exact duplicate of the thread module's interface. If thread is unavailable, this module can be used as a replacement.

In Python 3 this module is called _dummy_thread. Usage:

try:
    import thread as _thread
except ImportError:
    import dummy_thread as _thread

It is better to use dummy_threading instead.

6 multiprocessing (multiple processes with an API modeled on threading)


The multiprocessing module creates subprocesses instead of threads, which sidesteps the problems caused by the GIL.

Example:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))

6.1 The Process class

Processes are created with the Process class:

from multiprocessing import Process

def f(name):
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

6.2 Inter-process communication

Using a Queue:

from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print q.get()    # prints "[42, None, 'hello']"
    p.join()

Using a Pipe:

from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print parent_conn.recv()   # prints "[42, None, 'hello']"
    p.join()                   # wait for the child process to exit

6.3 Synchronization

Adding a lock:

from multiprocessing import Process, Lock

def f(l, i):
    l.acquire()
    print 'hello world', i
    l.release()

if __name__ == '__main__':
    lock = Lock()

    for num in range(10):
        Process(target=f, args=(lock, num)).start()
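A multiprocessing Lock can also be used as a context manager, which releases the lock even if the protected code raises an exception. The sketch below (Python 3 syntax) is a variation on the example above; the function name greet and the process count are illustrative assumptions.

```python
from multiprocessing import Process, Lock

def greet(lock, i):
    # The with-statement acquires the lock and always releases it on exit.
    with lock:
        print('hello world', i)

if __name__ == '__main__':
    lock = Lock()
    procs = [Process(target=greet, args=(lock, num)) for num in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # wait for every child before the parent exits
```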

6.4 Sharing state

Shared state should be avoided as much as possible.

Using shared memory:

from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]

Using a server process:

from multiprocessing import Process, Manager

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.reverse()

if __name__ == '__main__':
    manager = Manager()

    d = manager.dict()
    l = manager.list(range(10))

    p = Process(target=f, args=(d, l))
    p.start()
    p.join()

    print d
    print l

The second approach supports more data types, such as list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value, and Array.

6.5 The Pool class

A pool of worker processes can be created with the Pool class:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result = pool.apply_async(f, [10])    # evaluate "f(10)" asynchronously
    print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"

7 multiprocessing.dummy

The official documentation has only one sentence about it:

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.

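  • multiprocessing.dummy is a complete clone of the multiprocessing module; the only difference is that multiprocessing works with processes, while the dummy module works with threads.
  • You can choose between the two libraries depending on whether the task is I/O-bound or CPU-bound: use multiprocessing.dummy for I/O-bound tasks and multiprocessing for CPU-bound tasks.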
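Because the two modules share one API, switching between threads and processes is often just a change of import. The sketch below (Python 3 syntax) uses the thread-backed pool for a simulated I/O-bound task. The names fake_fetch and fetch_all are hypothetical, and time.sleep stands in for a network request so that the sketch runs offline. For a CPU-bound function, the import could be changed to `from multiprocessing import Pool`, with the pool created under an `if __name__ == '__main__':` guard.

```python
import time
from multiprocessing.dummy import Pool as ThreadPool

def fake_fetch(url):
    # Simulate a blocking request, then return a value derived from the URL.
    time.sleep(0.05)
    return len(url)

def fetch_all(urls, workers=4):
    pool = ThreadPool(workers)
    try:
        # map preserves the order of the input list in its result.
        return pool.map(fake_fetch, urls)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    print(fetch_all(['http://www.python.org', 'http://planet.python.org/']))
```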

Example:

import urllib2 
from multiprocessing.dummy import Pool as ThreadPool 

urls = [
    'http://www.python.org', 
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc.. 
    ]

# Make the Pool of workers
pool = ThreadPool(4) 
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
#close the pool and wait for the work to finish 
pool.close() 
pool.join() 

# For comparison, the equivalent sequential version:
results = []
for url in urls:
   result = urllib2.urlopen(url)
   results.append(result)

8 Afterword

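  • If you choose multithreading, prefer the threading module, and keep the impact of the GIL in mind.
  • If multithreading is not necessary, use the multiprocessing module, which also supports multithreading through multiprocessing.dummy.
  • Analyze whether the specific task is I/O-bound or CPU-bound.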