[Repost] The GIL, Multiprocessing, and Multithreading in Python

Reposted from: http://lesliezhu.github.io/public/2015-04-20-python-multi-process-thread.html

1 GIL (Global Interpreter Lock)


All else being equal, a Python program's execution speed is directly tied to the "speed" of the interpreter. No matter how much you optimize your own program, its execution speed still depends on how efficiently the interpreter runs it.

目前來講,多線程執行仍是利用多核系統最經常使用的方式。儘管多線程編程大大好於「順序」編程,不過即使是仔細的程序員也無法在代碼中將併發性作到最好。git

In any Python program, no matter how many processors there are, only one thread is ever executing at any given time.

In fact, this question is asked so frequently that Python experts have crafted a standard answer: "Don't use multiple threads. Use multiple processes." But that answer is even more confusing than the question.

The GIL protects access to things like the current thread state and the heap-allocated objects used for garbage collection. There is nothing special about the Python language, however, that requires the use of a GIL; it is an artifact of the implementation. There are also other Python interpreters (and compilers) that do not use a GIL. For CPython, though, a GIL has been present ever since its inception.

However one feels about Python's GIL, it remains the most difficult technical challenge in the Python language. Understanding its implementation requires a thorough grasp of operating system design, multithreaded programming, C, interpreter design, and the CPython interpreter's implementation. Those prerequisites alone deter many developers from investigating the GIL more thoroughly.
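
To see the GIL's effect concretely, here is a minimal sketch (not from the original article; it follows the spirit of David Beazley's classic demo) that times a CPU-bound function run twice in a row versus in two threads. Under CPython the threaded version is typically no faster, and often slower, because only one thread can hold the GIL at a time:

import threading, time

def count(n):
    # pure CPU work: never blocks, so it rarely gives up the GIL voluntarily
    while n > 0:
        n -= 1

N = 10000000

# sequential: run the workload twice in a row
start = time.time()
count(N); count(N)
print 'sequential: %.2fs' % (time.time() - start)

# threaded: the same two workloads in two threads
start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
print 'threaded:   %.2fs' % (time.time() - start)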

2 threading

The threading module provides a higher-level interface built on top of the thread module; if threading is unavailable because thread is missing, dummy_threading can be used in its place.

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

Example:

import threading, zipfile

class AsyncZip(threading.Thread):
    def __init__(self, infile, outfile):
        threading.Thread.__init__(self)
        self.infile = infile
        self.outfile = outfile
    def run(self):
        f = zipfile.ZipFile(self.outfile, 'w', zipfile.ZIP_DEFLATED)
        f.write(self.infile)
        f.close()
        print 'Finished background zip of: ', self.infile

background = AsyncZip('mydata.txt', 'myarchive.zip')
background.start()
print 'The main program continues to run in foreground.'

background.join()    # Wait for the background task to finish
print 'Main program waited until background was done.'

 

2.1 Creating threads

import threading
import datetime

class ThreadClass(threading.Thread):
    def run(self):
        now = datetime.datetime.now()
        print "%s says Hello World at time: %s" % (self.getName(), now)

for i in range(2):
    t = ThreadClass()
    t.start()

 

2.2 Using thread queues

import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup

hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
        "http://ibm.com", "http://apple.com"]

queue = Queue.Queue()
out_queue = Queue.Queue()

class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs host from queue
            host = self.queue.get()

            # grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()

            # place chunk into out queue
            self.out_queue.put(chunk)

            # signals to queue job is done
            self.queue.task_done()

class DatamineThread(threading.Thread):
    """Threaded page parse"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs chunk from out queue
            chunk = self.out_queue.get()

            # parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])

            # signals to queue job is done
            self.out_queue.task_done()

start = time.time()
def main():

    # spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()

    # populate queue with data
    for host in hosts:
        queue.put(host)

    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()

    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

 

3 dummy_threading (fallback for threading)

The dummy_threading module provides an interface that exactly replicates the threading module; if thread is unavailable, this module can be used as a drop-in replacement.

Usage:

try:
    import threading as _threading
except ImportError:
    import dummy_threading as _threading

4 thread

In Python 3 it is called _thread; the threading module should be used instead whenever possible.
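
For reference, a minimal sketch of the low-level interface (assuming Python 2 module names): thread.start_new_thread launches a function in a new thread and returns immediately, with no join() at this level, which is one more reason to prefer threading:

import thread, time

def worker(name):
    print 'hello from', name

# start_new_thread returns immediately and provides no handle to wait on
thread.start_new_thread(worker, ('worker-1',))
time.sleep(1)  # crude substitute for join(): give the thread time to finish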

5 dummy_thread (fallback for thread)

The dummy_thread module provides an interface that exactly replicates the thread module; if thread is unavailable, this module can be used as a drop-in replacement.

In Python 3 it is called _dummy_thread. Usage:

try:
    import thread as _thread
except ImportError:
    import dummy_thread as _thread

 

It is better to use dummy_threading instead.

6 multiprocessing (process-based parallelism behind a threading-style interface)

The multiprocessing module sidesteps the problems caused by the GIL by creating subprocesses instead of threads.

Example:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(f, [1, 2, 3]))    # prints [1, 4, 9]

 

6.1 The Process class

Processes are created with the Process class:

from multiprocessing import Process

def f(name):
    print 'hello', name

if __name__ == '__main__':
    p = Process(target=f, args=('bob',))
    p.start()
    p.join()

 

6.2 Inter-process communication

With Queue:

from multiprocessing import Process, Queue

def f(q):
    q.put([42, None, 'hello'])

if __name__ == '__main__':
    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    print q.get()    # prints "[42, None, 'hello']"
    p.join()

 

With Pipe:

from multiprocessing import Process, Pipe

def f(conn):
    conn.send([42, None, 'hello'])
    conn.close()

if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=f, args=(child_conn,))
    p.start()
    print parent_conn.recv()   # prints "[42, None, 'hello']"
    p.join()

 

6.3 Synchronization

Using a lock:

from multiprocessing import Process, Lock

def f(l, i):
    l.acquire()
    print 'hello world', i
    l.release()

if __name__ == '__main__':
    lock = Lock()

    for num in range(10):
        Process(target=f, args=(lock, num)).start()

 

6.4 Shared state

Shared state should be avoided as much as possible.

With shared memory:

from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))

    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()

    print num.value
    print arr[:]

 

With a server process:

from multiprocessing import Process, Manager

def f(d, l):
    d[1] = '1'
    d['2'] = 2
    d[0.25] = None
    l.reverse()

if __name__ == '__main__':
    manager = Manager()

    d = manager.dict()
    l = manager.list(range(10))

    p = Process(target=f, args=(d, l))
    p.start()
    p.join()

    print d
    print l

 

The second approach supports more data types, such as list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value, and Array.
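
For example, a Manager can also hand out synchronization primitives that work across processes. A minimal sketch (a hypothetical counter example, not from the original post) combining a Namespace with a manager Lock:

from multiprocessing import Process, Manager

def f(ns, lock):
    with lock:                   # the lock proxy serializes the read-modify-write
        ns.counter += 1

if __name__ == '__main__':
    manager = Manager()
    ns = manager.Namespace()     # shared attribute container
    ns.counter = 0
    lock = manager.Lock()        # lock usable from any process

    ps = [Process(target=f, args=(ns, lock)) for _ in range(4)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()

    print ns.counter             # prints "4"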

6.5 The Pool class

A pool of worker processes can be created with the Pool class:

from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes
    result = pool.apply_async(f, [10])    # evaluate "f(10)" asynchronously
    print result.get(timeout=1)           # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))          # prints "[0, 1, 4,..., 81]"

 

7 multiprocessing.dummy

The official documentation describes it in a single sentence:

multiprocessing.dummy replicates the API of multiprocessing but is no more than a wrapper around the threading module.

  • multiprocessing.dummy is a complete clone of multiprocessing; the only difference is that multiprocessing works with processes, while the dummy module works with threads;
  • You can pick the library to match the workload: multiprocessing.dummy for I/O-bound tasks, multiprocessing for CPU-bound tasks (a sketch of this switch follows the list).
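
Because the two pools share an API, switching between them is a one-line import change. A minimal sketch (the USE_THREADS flag is purely illustrative):

# I/O-bound -> thread-backed pool; CPU-bound -> process-backed pool
USE_THREADS = True  # illustrative flag: flip it to match the workload

if USE_THREADS:
    from multiprocessing.dummy import Pool  # threads behind the Pool API
else:
    from multiprocessing import Pool        # processes behind the same API

def work(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(4)
    print pool.map(work, range(10))
    pool.close()
    pool.join()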

Example:

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

urls = [
    'http://www.python.org',
    'http://www.python.org/about/',
    'http://www.onlamp.com/pub/a/python/2003/04/17/metaclasses.html',
    'http://www.python.org/doc/',
    'http://www.python.org/download/',
    'http://www.python.org/getit/',
    'http://www.python.org/community/',
    'https://wiki.python.org/moin/',
    'http://planet.python.org/',
    'https://wiki.python.org/moin/LocalUserGroups',
    'http://www.python.org/psf/',
    'http://docs.python.org/devguide/',
    'http://www.python.org/community/awards/'
    # etc..
    ]

# Make the Pool of workers
pool = ThreadPool(4)
# Open the urls in their own threads
# and return the results
results = pool.map(urllib2.urlopen, urls)
# close the pool and wait for the work to finish
pool.close()
pool.join()

# The serial equivalent, for comparison:
results = []
for url in urls:
    result = urllib2.urlopen(url)
    results.append(result)

 

8 Closing notes

  • If you go with multithreading, prefer the threading module, and keep the GIL's impact in mind.
  • If multithreading is not required, use the multiprocessing module; it also supports threads via multiprocessing.dummy.
  • Analyze whether the concrete task is I/O-bound or CPU-bound.