Python Crawler Series 1: Using the urllib.request and chardet Packages

Date: 2020-01-09


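1. References

1. Python Web Data Collection (《Python網絡數據採集》), 圖靈工業出版社

2. Mastering the Python Crawler Framework Scrapy (《精通Python爬蟲框架Scrapy》), Posts & Telecom Press (人民郵電出版社)

3. [Scrapy official tutorial](http://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/tutorial.html)

4. [Python3 web crawler](http://blog.csdn.net/c406495762/article/details/72858983)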

2. Prerequisites

URLs, the HTTP protocol, web front-end basics (HTML, CSS, JS), Ajax, regular expressions (re), XPath, and XML.

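3. Fundamentals

1. Introduction to crawlers

Definition: a web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Other, less common names include ant, automatic indexer, emulator, and worm.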

2. Two main characteristics

(1) It can download the data or content its author asks for.

(2) It can move around the network on its own.

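3. Three main steps

(1) Download a web page.

(2) Extract the correct information from it.

(3) Following a set of rules, automatically jump to another page and repeat the previous two steps.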
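The three steps above can be sketched as a small loop. This is only an illustration, not a production crawler: `LinkParser`, `crawl`, and the in-memory `FAKE_SITE` are hypothetical names introduced here, and the download step is passed in as a `fetch` function so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the page title and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` must return HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)              # step 1: download the page
        if html is None:
            continue
        parser = LinkParser()
        parser.feed(html)              # step 2: extract the information
        titles[url] = parser.title.strip()
        for href in parser.links:      # step 3: follow links to other pages
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return titles


# Hypothetical in-memory "site" so the sketch runs offline.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.test/b": '<title>Page B</title><a href="/a">A</a>',
}

if __name__ == "__main__":
    print(crawl("http://example.test/", FAKE_SITE.get))
```

The `seen` set keeps the crawler from revisiting a page, and `max_pages` bounds how far it wanders. To crawl real pages, `fetch` could be replaced by a function built on `urllib.request.urlopen`.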

4. Types of crawlers

(1) General-purpose crawlers

(2) Special-purpose (focused) crawlers

5. Overview of Python networking packages

Python 2: urllib, urllib2, urllib3, httplib, httplib2, requests

Python 3.x: urllib, urllib3, httplib2, requests

In Python 2, urllib and urllib2 are used together, or requests is used instead.

In Python 3, use urllib.request.

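6. urllib

It contains the following modules:

urllib.request: opens and reads URLs

urllib.error: contains the common exceptions raised by urllib.request; catch them with try

urllib.parse: contains functions for parsing URLs

urllib.robotparser: parses robots.txt files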
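As a rough illustration of two of the modules listed above, the sketch below uses `urllib.parse` to take a URL apart and `urllib.robotparser` to check some robots.txt rules. The URL, the crawler name `my-crawler`, and the rules are made up for the example, and the rules are passed in directly rather than downloaded, so nothing here touches the network.

```python
from urllib import parse, robotparser

# Hypothetical URL used only for illustration.
url = "https://example.test/cgi-bin/home?t=home/index&lang=zh_CN"

# urllib.parse: split a URL into components and decode its query string.
parts = parse.urlparse(url)
query = parse.parse_qs(parts.query)
print(parts.scheme, parts.netloc, parts.path)
print(query)

# urllib.parse: build a percent-encoded query string from a dict.
encoded = parse.urlencode({"lang": "zh_CN", "q": "爬蟲"})
print(encoded)

# urllib.robotparser: parse hypothetical robots.txt lines given as a list.
rules = robotparser.RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed_index = rules.can_fetch("my-crawler", "https://example.test/index.html")
allowed_private = rules.can_fetch("my-crawler", "https://example.test/private/data.html")
print(allowed_index, allowed_private)
```

Against a real site, `set_url()` followed by `read()` would download the site's own robots.txt instead of parsing a hand-written list.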

 

"""
Use urllib.request to fetch a web page and print its content.
"""
from urllib import request

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
    # Open the URL; the response object for the page is returned
    rsp = request.urlopen(url)
    # Read the response body
    html = rsp.read()
    print(type(html))  # <class 'bytes'>
    # Decode the bytes into a str (decode() defaults to UTF-8)
    html = html.decode()
    print(html)
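A request like this can fail, and section 6 notes that urllib.error holds the exceptions to catch with try. One possible way to wrap the call is sketched below; `fetch_text` is a hypothetical helper name, and the `data:` URL is used only so the demonstration runs offline.

```python
from urllib import error, request


def fetch_text(url, timeout=10, encoding="utf-8"):
    """Return the decoded body of `url`, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as rsp:
            return rsp.read().decode(encoding, errors="replace")
    except error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        print(f"HTTP error {exc.code} for {url}")
    except error.URLError as exc:
        # The request never completed (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {exc.reason}")
    return None


if __name__ == "__main__":
    # A data: URL keeps the demonstration offline; a page URL is used the same way.
    print(fetch_text("data:text/plain;charset=utf-8,hello%20crawler"))
```

`HTTPError` is a subclass of `URLError`, so it has to be caught first; otherwise the more general handler would swallow it.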

7. Detecting a page's encoding with the chardet package

 

"""
Use urllib.request to fetch a web page, detect its encoding with chardet, and decode it.
"""
from urllib import request

import chardet

if __name__ == "__main__":
    url = "https://mp.weixin.qq.com/cgi-bin/home?t=home/index&lang=zh_CN&token=984602018"
    # Open the URL; the response object for the page is returned
    rsp = request.urlopen(url)
    # Read the response body
    html = rsp.read()
    print(type(html))  # <class 'bytes'>
    print("=========================")

    # Use chardet to detect which encoding the page uses
    cs = chardet.detect(html)
    print(cs)
    print(type(cs))  # <class 'dict'>
    # chardet sets "encoding" to None when it cannot tell, so fall back to
    # UTF-8 in that case instead of letting decode() raise and crash the program
    html = html.decode(cs.get("encoding") or "utf-8")
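A detected encoding is a guess, and it may be missing or may name a codec Python does not know. The sketch below shows one defensive way to choose an encoding from several candidates; `pick_encoding` and `decode_body` are hypothetical helpers, not part of chardet or urllib, and they use only the standard library so they run without chardet installed.

```python
import codecs


def pick_encoding(declared=None, detected=None, default="utf-8"):
    """Return the first candidate that names a codec Python knows."""
    for name in (declared, detected, default):
        if not name:
            continue                   # skip None and empty strings
        try:
            codecs.lookup(name)        # raises LookupError for unknown codecs
        except LookupError:
            continue
        return name
    return default


def decode_body(raw, declared=None, detected=None):
    """Decode bytes with the chosen encoding, replacing undecodable bytes."""
    return raw.decode(pick_encoding(declared, detected), errors="replace")


if __name__ == "__main__":
    raw = "中文網頁".encode("gbk")
    print(decode_body(raw, detected="gbk"))
```

With a real response, `declared` could come from `rsp.headers.get_content_charset()` and `detected` from `chardet.detect(raw)["encoding"]`. Preferring the declared charset is a design choice made for this sketch; some pages declare the wrong one.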