1. First, install builtwith by running pip install builtwith:
C:\Users\Administrator>pip install builtwith
Collecting builtwith
  Downloading builtwith-1.3.2.tar.gz
Installing collected packages: builtwith
  Running setup.py install for builtwith ... done
Successfully installed builtwith-1.3.2
2. Create a new project in PyCharm and enter the following test code:
import builtwith
tech_used = builtwith.parse('http://www.baidu.com')
print(tech_used)
Running it produces the following error:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 1, in <module>
    import builtwith
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 43
    except Exception, e:
                    ^
SyntaxError: invalid syntax

Process finished with exit code 1
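The cause is that builtwith 1.3.2 was written for Python 2.x, so several places need to be changed. Double-click the offending file in PyCharm's error output to open it, then make the following three kinds of changes:
1. The Python 2 syntax "except Exception, e" is no longer supported and must be changed to "except Exception as e".
2. In Python 3, print is a function, so the expression after print must be wrapped in parentheses.
3. builtwith uses the Python 2 urllib2 package, which does not exist in Python 3, so the urllib2-related code must be rewritten.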
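As a small illustration of the first two changes, here is a minimal Python 3 sketch. The function name safe_divide is hypothetical and is not part of builtwith; it only shows the syntax that replaces the Python 2 forms.

```python
def safe_divide(a, b):
    """Return a / b, or None if the division fails."""
    try:
        return a / b
    # Python 2 wrote this as "except Exception, e:", a SyntaxError in Python 3.
    except Exception as e:
        # Python 2 wrote this as: print 'Error:', e
        print('Error:', e)
        return None


print(safe_divide(6, 3))
print(safe_divide(1, 0))
```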
Changes 1 and 2 are easy to make, so the rest of this post focuses on change 3:
First, replace import urllib2 with the following code:
import urllib.request
import urllib.error
Then replace the urllib2 calls as follows:
request = urllib.request.Request(url, None, {'User-Agent': user_agent})
response = urllib.request.urlopen(request)
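To see how the replaced calls fit together outside builtwith, here is a minimal standalone sketch. The helper names build_request and fetch_html and the default user agent string are assumptions made for illustration; they are not taken from builtwith's source.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent='builtwith'):
    """Build a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, None, {'User-Agent': user_agent})


def fetch_html(url, user_agent='builtwith'):
    """Return the raw response body as bytes, or None if the request fails."""
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.URLError as e:
        # HTTPError is a subclass of URLError, so both are caught here.
        print('Error:', e)
        return None
```

Calling fetch_html('http://www.baidu.com') would return bytes rather than a string.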
Running the project again produces the following error:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 3, in <module>
    builtwith.parse('http://www.baidu.com')
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 62, in builtwith
    if contains(html, snippet):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 105, in contains
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object

Process finished with exit code 1
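This happens because in Python 3 response.read() returns bytes rather than a string, so the data has to be decoded first. Change the following code: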
if html is None:
    html = response.read()
to:
if html is None:
    html = response.read()
    html = html.decode('utf-8')
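The TypeError can be reproduced without any network access. The sketch below uses a made-up HTML snippet (an assumption, not real page content) to show that a string pattern fails on bytes and works once the bytes are decoded.

```python
import re

pattern = re.compile('jquery', flags=re.IGNORECASE)
# Stand-in for what response.read() returns in Python 3: bytes, not str.
raw = b'<script src="/js/jQuery.min.js"></script>'

try:
    pattern.search(raw)
    bytes_search_failed = False
except TypeError:
    # A str pattern cannot be applied to a bytes object.
    bytes_search_failed = True

# After decoding, the same pattern matches normally.
text = raw.decode('utf-8')
match = pattern.search(text)
print(bytes_search_failed, match.group(0))
```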
Running it again gives the final result:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
{'javascript-frameworks': ['jQuery']}

Process finished with exit code 0
However, if the site is changed to 'www.163.com', running it fails again:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
Error: 'utf-8' codec can't decode byte 0xcd in position 500: invalid continuation byte
Traceback (most recent call last):
  File "F:/python/first/FirstPy", line 2, in <module>
    tech_used = builtwith.parse('http://www.163.com')
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 63, in builtwith
    if contains(html, snippet):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\builtwith\__init__.py", line 106, in contains
    return re.compile(regex.split('\\;')[0], flags=re.IGNORECASE).search(v)
TypeError: cannot use a string pattern on a bytes-like object

Process finished with exit code 1
This still looks like an encoding problem. Changing the encoding to 'GBK' makes it run successfully:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe F:/python/first/FirstPy
{'web-servers': ['Nginx']}

Process finished with exit code 0
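The decoding failure can also be reproduced locally. In this sketch the sample text is an arbitrary Chinese string chosen for illustration (an assumption, not content from the actual page): bytes encoded as GBK are rejected by the utf-8 codec but decode cleanly as GBK.

```python
# Bytes as a GBK-encoded page would send them.
sample = '网易新闻'.encode('gbk')

try:
    sample.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as e:
    # GBK byte pairs are generally not valid UTF-8 sequences.
    utf8_ok = False
    print('Error:', e)

# Decoding with the encoding the bytes were written in succeeds.
decoded = sample.decode('gbk')
print(utf8_ok, decoded == '网易新闻')
```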
So do different websites need different decodings? Here is one way to detect a site's encoding.
We need to install a package called chardet:
C:\Users\Administrator>pip install chardet
Collecting chardet
  Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
    100% |████████████████████████████████| 184kB 616kB/s
Installing collected packages: chardet
Successfully installed chardet-2.3.0

C:\Users\Administrator>
Passing the byte data to chardet's detect method returns a dict with two values: the detected encoding and a confidence score.
{'encoding': 'utf-8', 'confidence': 0.99}
Modify the corresponding builtwith code as follows:
encode_type = chardet.detect(html)
if encode_type['encoding'] == 'utf-8':
    html = html.decode('utf-8')
else:
    html = html.decode('gbk')
Remember to import chardet!
With chardet detecting the character encoding, the code can now handle sites served as either UTF-8 or GBK.
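The utf-8/gbk branch above covers only those two encodings. A slightly more general variant, sketched below, decodes with whatever encoding was detected. The helper decode_html and its fallback parameter are hypothetical names, and detected is assumed to be the dict returned by chardet.detect(raw). The helper takes that dict as an argument so the sketch itself runs without chardet installed.

```python
def decode_html(raw, detected, fallback='utf-8'):
    """Decode raw bytes using a chardet-style detection result.

    `detected` is assumed to be the dict returned by chardet.detect(raw),
    for example {'encoding': 'utf-8', 'confidence': 0.99}.
    """
    # The detector may report None when it cannot guess, so fall back.
    encoding = detected.get('encoding') or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Wrong guess or unknown codec name: decode with replacement characters.
        return raw.decode(fallback, errors='replace')


# Possible call inside builtwith (requires `import chardet`):
#     html = decode_html(html, chardet.detect(html))
print(decode_html(b'hello', {'encoding': None, 'confidence': 0.0}))
```

Falling back with errors='replace' trades exactness for robustness: a badly detected page still yields a string that the regex checks can run against.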