python的Requests(http://docs.python-requests.org/en/latest/)庫在處理http/https請求時仍是比較方便的,應用也比較普遍。
但其在處理response時有一些地方須要特別注意,簡單來講就是Response對象的content方法和text方法的區別,具體代碼以下:python
@property def content(self): """Content of the response, in bytes.""" if self._content is False: # Read the contents. try: if self._content_consumed: raise RuntimeError( 'The content for this response was already consumed') if self.status_code == 0: self._content = None else: self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes() except AttributeError: self._content = None self._content_consumed = True # don't need to release the connection; that's been handled by urllib3 # since we exhausted the data. return self._content @property def text(self): """Content of the response, in unicode. if Response.encoding is None and chardet module is available, encoding will be guessed. """ # Try charset from content-type content = None encoding = self.encoding if not self.content: return str('') # Fallback to auto-detected encoding. if self.encoding is None: encoding = self.apparent_encoding # Decode unicode from given encoding. try: content = str(self.content, encoding, errors='replace') except (LookupError, TypeError): # A LookupError is raised if the encoding was not found which could # indicate a misspelling or similar mistake. # # A TypeError can be raised if encoding is None # # So we try blindly encoding. content = str(self.content, errors='replace') return content
@property
def apparent_encoding(self):
"""The apparent encoding, provided by the lovely Charade library
(Thanks, Ian!)."""
return chardet.detect(self.content)['encoding']
能夠看出text方法中對原始數據作了編碼操做
其中response的encoding屬性是在adapters.py中的HTTPAdapter中的build_response中進行賦值,具體代碼以下:cookie
def build_response(self, req, resp): """Builds a :class:`Response <requests.Response>` object from a urllib3 response. This should not be called from user code, and is only exposed for use when subclassing the :class:`HTTPAdapter <requests.adapters.HTTPAdapter>` :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response. :param resp: The urllib3 response object. """ response = Response() # Fallback to None if there's no status_code, for whatever reason. response.status_code = getattr(resp, 'status', None) # Make headers case-insensitive. response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {})) # Set encoding. response.encoding = get_encoding_from_headers(response.headers) response.raw = resp response.reason = response.raw.reason if isinstance(req.url, bytes): response.url = req.url.decode('utf-8') else: response.url = req.url # Add new cookies from the server. extract_cookies_to_jar(response.cookies, req, resp) # Give the Response some context. response.request = req response.connection = self return response
從上述代碼(response.encoding = get_encoding_from_headers(response.headers))中能夠看出,具體的encoding是經過解析headers獲得的,app
def get_encoding_from_headers(headers): """Returns encodings from given HTTP Header Dict. :param headers: dictionary to extract encoding from. """ content_type = headers.get('content-type') if not content_type: return None content_type, params = cgi.parse_header(content_type) if 'charset' in params: return params['charset'].strip("'\"") if 'text' in content_type: return 'ISO-8859-1'
爲避免Requests採用chardet去猜想response的編碼,請慎用text屬性,直接使用content屬性便可,再根據實際須要進行編碼。
對於服務端沒有顯式指明charset的response來講,採用text和content的差異以下所示:
代碼:ide
print time.time() print 'begin request' r = requests.get(r'http://www.sina.com.cn') # erase response encoding r.encoding = None r.text #r.content print 'request end' print time.time()
採用text時的耗時:
採用content時的耗時:
ui