Python 3: urllib fails with "urlopen error EOF occurred in violation of protocol (_ssl.c:841)"

Python 3 source:

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("http://php.net/")
html = response.read()
soup=BeautifulSoup(html, "html5lib")
text=soup.get_text(strip=True)
print(text)

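The code is simple: it fetches the http://php.net/ page and then uses the BeautifulSoup module to strip the HTML tags, leaving only the text. The first run appeared to succeed, but every later run hung for a while and then failed with: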

File "C:\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Python36\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Python36\lib\urllib\request.py", line 1320, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:841)>

Yet the same page opens without problems in Google Chrome.

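Although the URL uses http://, the traceback passes through https_open, so the request was evidently redirected to HTTPS and the failure happens during the SSL/TLS handshake. The likely cause is that the web server has disabled SSLv2, while older Python libraries (Python 2.x) try to connect with PROTOCOL_SSLv23 by default. In that situation the SSL version used for the request has to be chosen explicitly.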
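Since the original script uses urllib, one way to choose the protocol version without leaving the standard library is to pass an ssl.SSLContext to urlopen. The following is only a minimal sketch: the helper names make_tls_context and fetch are hypothetical, and the TLS 1.2 minimum is an assumption rather than something the original post tested. It needs Python 3.7 or later for ssl.TLSVersion.

```python
import ssl
import urllib.request


def make_tls_context(minimum=ssl.TLSVersion.TLSv1_2):
    """Build a certificate-verifying context that refuses older protocol versions.

    The TLS 1.2 default is an assumption made for this sketch.
    """
    context = ssl.create_default_context()
    context.minimum_version = minimum
    return context


def fetch(url, context=None, timeout=10):
    """Fetch a URL with urllib, using the given (or a default) TLS context."""
    if context is None:
        context = make_tls_context()
    # The context applies to HTTPS connections, including any that result
    # from following a redirect from an http:// URL.
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://php.net/") would then return the page bytes, which can be handed to BeautifulSoup exactly as in the original script.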

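With the requests library, the SSL version used for HTTPS is changed by subclassing HTTPAdapter and mounting the subclass on a Session object. For example, to force TLSv1, the new transport adapter looks like this: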

import ssl

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class MyAdapter(HTTPAdapter):
    # Transport adapter that forces TLSv1 for HTTPS connections.
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)

It can then be mounted on a Requests Session object:

import requests

s = requests.Session()
s.mount('https://', MyAdapter())
# Send the request through the session; urllib.request.urlopen would bypass the adapter.
response = s.get("http://php.net/")

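Writing a generic transport adapter is just as easy: its constructor accepts any protocol constant from the ssl package and uses it for the connection pool.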

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class SSLAdapter(HTTPAdapter):
    '''An HTTPS Transport Adapter that uses an arbitrary SSL version.'''

    def __init__(self, ssl_version=None, **kwargs):
        self.ssl_version = ssl_version
        super(SSLAdapter, self).__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=self.ssl_version)
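The post does not show the generic adapter in use, so here is a hedged sketch of how it might be mounted. The helper build_session and the choice of ssl.PROTOCOL_TLSv1_2 are assumptions for illustration. The class is repeated only so that the snippet runs on its own, and PoolManager is imported directly from urllib3, which is the same class that requests.packages re-exports.

```python
import ssl

import requests
from requests.adapters import HTTPAdapter
from urllib3.poolmanager import PoolManager


class SSLAdapter(HTTPAdapter):
    """Same generic adapter as above, repeated so the sketch is self-contained."""

    def __init__(self, ssl_version=None, **kwargs):
        # Must be set before HTTPAdapter.__init__, which calls init_poolmanager.
        self.ssl_version = ssl_version
        super().__init__(**kwargs)

    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=self.ssl_version)


def build_session(ssl_version=ssl.PROTOCOL_TLSv1_2):
    """Return a Session whose https:// requests go through SSLAdapter.

    The TLS 1.2 default is an assumption for this sketch.
    """
    session = requests.Session()
    session.mount("https://", SSLAdapter(ssl_version))
    return session
```

A request such as build_session().get("http://php.net/") is then routed through the adapter as soon as it reaches an https:// URL. Recent urllib3 releases may emit a deprecation warning for ssl_version when the connection is opened.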

The failing script from the beginning, after the fix:

import ssl

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.poolmanager import PoolManager


class MyAdapter(HTTPAdapter):
    # Transport adapter that forces TLSv1 for HTTPS connections.
    def init_poolmanager(self, connections, maxsize, block=False):
        self.poolmanager = PoolManager(num_pools=connections,
                                       maxsize=maxsize,
                                       block=block,
                                       ssl_version=ssl.PROTOCOL_TLSv1)


s = requests.Session()
s.mount('https://', MyAdapter())
# The request must go through the session so that the mounted adapter is used.
response = s.get("http://php.net/")
html = response.content
soup = BeautifulSoup(html, "html5lib")
text = soup.get_text(strip=True)
print(text)

With this change the page text can be fetched normally.
