Python urllib.request Pitfalls

Bug log: treasure every pitfall, and try not to fall in the same place twice.

1. Background

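During project development, one requirement was to fetch the image information for a given set of tags, which meant querying the image server. Until now I had been using the following approach: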

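import json
import urllib.request  # "import urllib" alone does not reliably expose urllib.request

url = 'http://127.0.0.1:8080/images/query/?type=%s&tags=%s' % ('yuv', '4,3,6')

print("url: " + str(url))
response = urllib.request.urlopen(url)

# read() returns the whole body as bytes; json.loads accepts bytes on Python 3.6+
download_list = json.loads(response.read())

print(download_list)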

This caused no problems while the responses were small, but with a larger response, 192708 bytes in this case, the following error appeared:

Traceback (most recent call last):
  File "/Users/min/Desktop/workspace/python/Demo/fuck.py", line 12, in <module>
    download_list = json.loads(response.read())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 464, in read
    s = self._safe_read(self.length)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/http/client.py", line 618, in _safe_read
    raise IncompleteRead(b''.join(s), amt)
http.client.IncompleteRead: IncompleteRead(144192 bytes read, 48516 more expected)

2. Analysis and Conclusion

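Digging deeper, I found that in this scenario read() ends up in the following method of http/client.py (the print calls are debug statements I added):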

def _safe_read(self, amt):
    """Read the number of bytes requested, compensating for partial reads.

    Normally, we have a blocking socket, but a read() can be interrupted
    by a signal (resulting in a partial read).

    Note that we cannot distinguish between EOF and an interrupt when zero
    bytes have been read. IncompleteRead() will be raised in this
    situation.

    This function should be used when <amt> bytes "should" be present for
    reading. If the bytes are truly not available (due to EOF), then the
    IncompleteRead exception can be used to detect the problem.
    """
    s = []
    while amt > 0:
        print("1", amt)           # debug: bytes still expected
        chunk = self.fp.read(min(amt, MAXAMOUNT))
        print(chunk)              # debug: data returned by this read
        if not chunk:
            raise IncompleteRead(b''.join(s), amt)
        s.append(chunk)
        print('2', len(chunk))    # debug: size of this chunk
        amt -= len(chunk)
        print('3', amt)           # debug: bytes remaining after this chunk
    return b"".join(s)

As you can see, the method has an inherent limitation, stated in its own docstring: "we cannot distinguish between EOF and an interrupt when zero bytes have been read." The debug output I ended up with was:

1 192708
b'[{"title": "\\u5ba4\\u5185\\u767d\\u8272\\u80cc\\u666f\\u5899+\\u6b63\\u5e38\\u5149+\\u8fd1\\u8ddd(\\u5927\\u8138)+\\u65e0\\u9762\\u90e8\\u7a7f\\u623..... # part of the output omitted here
2 144192
3 48516
1 48516
b''
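The debug output is consistent with the peer closing the connection before all declared bytes arrived. The sketch below tries to reproduce that situation locally. It assumes a hypothetical handler, `TruncatingHandler`, that announces a `Content-Length` of 192708 but writes only 144192 bytes; it is a stand-in for illustration, not the actual image server.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DECLARED = 192708  # Content-Length announced by the demo server
SENT = 144192      # bytes actually written before the connection closes


class TruncatingHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: promises DECLARED bytes but sends only SENT."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(DECLARED))
        self.end_headers()
        self.wfile.write(b"x" * SENT)
        # Returning ends the HTTP/1.0 exchange and closes the socket,
        # so the client hits EOF with bytes still outstanding.

    def log_message(self, *args):
        pass  # keep the demo quiet


def reproduce():
    """Return (bytes_read, bytes_missing) as reported by IncompleteRead."""
    server = HTTPServer(("127.0.0.1", 0), TruncatingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore proxy environment variables for this local-only demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        try:
            opener.open(url).read()
        except http.client.IncompleteRead as exc:
            # exc.partial holds the bytes received, exc.expected the shortfall.
            return len(exc.partial), exc.expected
        return None
    finally:
        server.shutdown()
        server.server_close()


result = reproduce()
print(result)
```

If the sketch behaves as intended, the two numbers match the ones in the traceback, which suggests that the exception is how http.client reports a body shorter than its declared length.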

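That locates the problem: the first read returned 144192 bytes, and the next read returned b'' with 48516 bytes still expected, which _safe_read reports as IncompleteRead. My recommendation is therefore to avoid urllib for large transfers where possible and use requests instead.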
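Where urllib has to stay, one possible stopgap is to catch `IncompleteRead` and retry. This is a minimal sketch and has not been verified against the image server above; the function name `read_json_with_retry` and its parameters `attempts`, `delay` and `opener` are made up for illustration.

```python
import http.client
import json
import time
import urllib.request


def read_json_with_retry(url, attempts=3, delay=0.5,
                         opener=urllib.request.urlopen):
    """Fetch and decode JSON, retrying when the body arrives truncated.

    `attempts`, `delay` and `opener` are illustrative parameters,
    not part of any library API.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with opener(url) as response:
                return json.loads(response.read().decode("utf-8"))
        except http.client.IncompleteRead as exc:
            last_error = exc
            print("attempt %d: got %d bytes, %d missing"
                  % (attempt, len(exc.partial), exc.expected))
            if attempt < attempts:
                time.sleep(delay)  # brief pause before the next attempt
    # Every attempt was truncated: surface the last error to the caller.
    raise last_error
```

A call would look like `read_json_with_retry('http://127.0.0.1:8080/images/query/?type=yuv&tags=4,3,6')`. Retrying only helps when the truncation is intermittent; if the server cuts every large response at the same point, the last `IncompleteRead` is raised again.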

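Also, the response headers obtained through urllib.request seem much coarser than the ones obtained through requests; for example, the crucial Transfer-Encoding header is missing. The details are as follows: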

urllib.request:

Server: nginx/1.14.0 (Ubuntu)
Date: Mon, 17 Sep 2018 10:02:51 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 192708
Connection: close
X-Frame-Options: SAMEORIGIN

requests:

{'Server': 'nginx/1.14.0 (Ubuntu)', 'Date': 'Mon, 17 Sep 2018 09:55:15 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Content-Encoding': 'gzip'}
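One likely reason the two header dumps differ is that the two clients send different requests. By default urllib.request sends `Connection: close` and `Accept-Encoding: identity`, while requests advertises gzip support and keep-alive, so the server may simply choose a different framing for each. The sketch below checks the urllib side against a hypothetical local handler, `EchoHeadersHandler`, that echoes back the request headers it received; it is not the image server.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: replies with the request headers it received."""

    def do_GET(self):
        received = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps(received).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def urllib_request_headers():
    """Return the headers urllib.request sends by default, as seen by a server."""
    server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore proxy environment variables for this local-only demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/" % server.server_port
        with opener.open(url) as response:
            return json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()


sent = urllib_request_headers()
print(sent.get("connection"), sent.get("accept-encoding"))
```

If the request headers are indeed the cause, adding an `Accept-Encoding` header through `urllib.request.Request(url, headers=...)` should change what the server returns, although urllib will not decompress a gzip body on its own.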