A Brief Analysis of the urlopen Memory Leak

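1. Background

  urllib and urllib2 are client-side implementations of the HTTP protocol. urllib2 is built on the httplib and socket libraries and mainly provides functions such as urlopen, build_opener, and install_opener. In Python 2.7, calling urlopen from urllib2 leaks memory, and the gc module can be used to observe the leak.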

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import socket
import gc

# Check for memory leaks
def get_unreachable_memory_len():
    # With DEBUG_SAVEALL set, every unreachable object is appended to gc.garbage
    # instead of being freed, so it can be inspected. Use this for testing only.
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    unreachableL = []
    for it in gc.garbage:
        unreachableL.append(it)
        #print(str(it))
    print str(unreachableL)

def task():
    try:
        req = urllib2.urlopen('http://www.baidu.com/', timeout=3)
        text = req.read()
        #req.fp._sock.recv = None
        req.close()
    except urllib2.HTTPError, e:
        print e.code
    except urllib2.URLError, e:
        print e.reason
    else:
        print("urlopen success")

if __name__ == '__main__':
    get_unreachable_memory_len()
    print("-------------------------")
    task()
    print("-------------------------")
    get_unreachable_memory_len()

Running the program confirms that urlopen leaks memory.
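The same check can be tried without network access. The following is a minimal sketch, not the author's script: Node and make_cycle are hypothetical names introduced only to create a reference cycle, and the code uses only the standard library. It shows that with gc.DEBUG_SAVEALL set, the objects in an unreachable cycle end up in gc.garbage, which is what the test above relies on.

```python
import gc


class Node(object):
    """Hypothetical class, used only to build a reference cycle."""
    def __init__(self):
        self.other = None


def count_saved_nodes():
    """Run a collection and count the Node instances kept in gc.garbage."""
    gc.collect()
    return sum(1 for obj in gc.garbage if isinstance(obj, Node))


def make_cycle():
    """Create two objects that refer to each other, then drop them."""
    a, b = Node(), Node()
    a.other, b.other = b, a  # a -> b -> a


gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
del gc.garbage[:]               # start from an empty list
before = count_saved_nodes()
make_cycle()
after = count_saved_nodes()

# Restore the default behaviour and release the saved objects.
gc.set_debug(0)
del gc.garbage[:]
print(before, after)
```

Here before is 0 and after is 2: both Node instances became unreachable when make_cycle returned, but only the cycle collector could find them.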

 

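2. Analyzing the Symptom

   Python's garbage collection is based on object reference counting, so the first step is to find the code that creates the reference cycle. The objgraph module can print the object types whose instance counts have grown. Sample code: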

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import socket
import gc
import objgraph

# Check for memory leaks
def get_unreachable_memory_len():
    # With DEBUG_SAVEALL set, every unreachable object is appended to gc.garbage
    # instead of being freed, so it can be inspected. Use this for testing only.
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    unreachableL = []
    for it in gc.garbage:
        unreachableL.append(it)
        #print(str(it))
    print str(unreachableL)

def task():
    try:
        req = urllib2.urlopen('http://www.baidu.com/', timeout=3)
        text = req.read()
        #req.fp._sock.recv = None
        req.close()
    except urllib2.HTTPError, e:
        print e.code
    except urllib2.URLError, e:
        print e.reason
    else:
        print("urlopen success")

#class HTTPResponse(object):
#    pass

if __name__ == '__main__':
    gc.set_debug(gc.DEBUG_SAVEALL)
    objgraph.show_growth()
    print("-------------------------")
    for i in range(5):
        task()
    print("-------------------------")
    objgraph.show_growth()

The output shows three entries whose count increased by 5, and in the earlier run's output the first object listed was httplib.HTTPResponse.

Use objgraph.show_backrefs to analyze httplib.HTTPResponse:
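For readers without objgraph installed, the idea behind a growth report can be approximated with the standard library. This is a rough sketch under stated assumptions, not objgraph's implementation: Leaky, type_counts, and growth are hypothetical names, and the counts come from gc.get_objects(), which only sees objects tracked by the collector.

```python
import gc
from collections import Counter


class Leaky(object):
    """Hypothetical object that keeps itself alive through a cycle."""
    def __init__(self):
        self.me = self  # instance -> itself


def type_counts():
    """Count live gc-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after):
    """Return the type names whose instance count went up, with the increase."""
    return {name: after[name] - before[name]
            for name in after if after[name] > before[name]}


gc.disable()   # no automatic collection while measuring
gc.collect()   # start from a clean state
before = type_counts()
for _ in range(5):
    Leaky()    # dropped immediately, but the cycle keeps it alive
after = type_counts()
delta = growth(before, after)
gc.enable()
print(delta.get("Leaky"))
```

Five calls leave five Leaky instances behind, which mirrors the "+5" entries seen after five calls to task().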

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import socket
import gc
import objgraph

# Check for memory leaks
def get_unreachable_memory_len():
    # With DEBUG_SAVEALL set, every unreachable object is appended to gc.garbage
    # instead of being freed, so it can be inspected. Use this for testing only.
    gc.set_debug(gc.DEBUG_SAVEALL)
    gc.collect()
    unreachableL = []
    for it in gc.garbage:
        unreachableL.append(it)
        #print(str(it))
    print str(unreachableL)

def task():
    try:
        req = urllib2.urlopen('http://www.baidu.com/', timeout=3)
        text = req.read()
        #req.fp._sock.recv = None
        req.close()
    except urllib2.HTTPError, e:
        print e.code
    except urllib2.URLError, e:
        print e.reason
    else:
        print("urlopen success")

#class HTTPResponse(object):
#    pass

if __name__ == '__main__':
    gc.set_debug(gc.DEBUG_SAVEALL)
    print("-------------------------")
    for i in range(5):
        task()
    print("-------------------------")
    objgraph.show_backrefs(objgraph.by_type('HTTPResponse')[0], max_depth = 10, filename = 'obj.dot')

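Convert the generated obj.dot to obj.png with the command dot obj.dot -Tpng -o obj.png. In the resulting graph, note the recv reference and the read method that form the reference cycle.

3. Tracing the Source Code

 PyCharm can generate a UML class diagram of urllib2 automatically. What needs analyzing here is the call flow of urllib2.urlopen, which the pycallgraph module can trace. Sample code: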

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib2
import socket
import gc
from pycallgraph import PyCallGraph
from pycallgraph.output import GraphvizOutput

def task():
    graphviz = GraphvizOutput()
    graphviz.output_file = 'urlopen.png'
    with PyCallGraph(output=graphviz):
        try:
            req = urllib2.urlopen('http://www.baidu.com/', timeout=3)
            #text = req.read()
            #req.fp._sock.recv = None
            #req.close()
        except urllib2.HTTPError, e:
            print e.code
        except urllib2.URLError, e:
            print e.reason
        else:
            print("urlopen success")


if __name__ == '__main__':
    task()

 

A portion of the generated call graph:

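The do_open method of the HTTPHandler class contains this line:

    r.recv = r.read

Here r is the HTTPResponse instance. It has a read method but no recv method, and this reference is not released after the urlopen call returns. Fixing the memory leak requires removing this reference.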
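Why an assignment like r.recv = r.read leaks can be shown in isolation. The sketch below is an illustration under assumptions: FakeResponse is a hypothetical stand-in for httplib.HTTPResponse, and no real connection is involved. Storing a bound method on its own instance makes the instance refer to the method and the method refer back to the instance, so reference counting alone can never free it.

```python
import gc
import weakref


class FakeResponse(object):
    """Hypothetical stand-in for httplib.HTTPResponse: read() but no recv()."""
    def read(self):
        return b"body"


gc.disable()             # rule out an automatic collection during the demo
r = FakeResponse()
r.recv = r.read          # same shape as the assignment in do_open
probe = weakref.ref(r)   # lets us check whether the object is still alive

# The bound method stored on the instance points back at the instance.
points_back = r.recv.__self__ is r

del r
alive_after_del = probe() is not None      # reference counting cannot free it
gc.collect()
alive_after_collect = probe() is not None  # the cycle collector can
gc.enable()

print(points_back, alive_after_del, alive_after_collect)
```

The three values are True, True, and False: the object survives del r and disappears only once the cycle collector runs.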

4. Solutions

1) In the examples above, call gc.collect() after task() to reclaim the memory manually.

2) Manually break the r.recv reference before closing the HTTP connection.

req = urllib2.urlopen('http://www.baidu.com/', timeout=3)
text = req.read()
# When urlopen returns normally, manually undo the r.recv = r.read reference
req.fp._sock.recv = None
req.close()

 Note: this has no effect when urlopen raises urllib2.HTTPError for an error status code. That case requires modifying the urllib2.py source.

 

3) Use the lower-level socket and httplib libraries instead.
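The first two options can be compared side by side without touching the network. This is a minimal sketch, assuming a hypothetical FakeSock class that recreates the recv/read cycle. It is not urllib2 code, and freed_without_gc is an illustrative helper.

```python
import gc
import weakref


class FakeSock(object):
    """Hypothetical stand-in for the response object that do_open patches."""
    def __init__(self):
        self.recv = self.read  # the cycle: instance -> bound method -> instance

    def read(self):
        return b"body"


def freed_without_gc(break_cycle):
    """Return True if the object is freed by reference counting alone."""
    sock = FakeSock()
    probe = weakref.ref(sock)
    if break_cycle:
        sock.recv = None       # option 2: drop the reference before letting go
    del sock
    return probe() is None


gc.disable()                   # keep the cycle collector out of the comparison
leaked = not freed_without_gc(break_cycle=False)
fixed = freed_without_gc(break_cycle=True)
reclaimed = gc.collect() > 0   # option 1: a manual collection finds the leftover cycle
gc.enable()
print(leaked, fixed, reclaimed)
```

All three values are True. Breaking the reference frees the object at once, while gc.collect() recovers it later at the cost of a full collection pass, so option 2 is the cheaper choice wherever the code has access to the object.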

 

