Python Web Crawler Study Notes (Part 1)

Date: 2021-01-29


Concept:

A crawler uses code to imitate a user: it sends network requests in bulk and collects the returned data in bulk.

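Robots protocol:

The robots protocol, also known as robots.txt (always lowercase), is an ASCII-encoded text file stored in a website's root directory. It tells search-engine crawlers (also called web spiders) which content on the site they should not fetch and which content they may fetch.

Because URLs are case-sensitive on some systems, the file name robots.txt should always be lowercase, and the file should be placed in the site's root directory.
To define crawler behavior for subdirectories separately, either merge the custom settings into the robots.txt in the root directory or use robots metadata (meta tags).
The robots protocol is a convention rather than an enforced specification, so it cannot guarantee a site's privacy.

In short, robots.txt states whether crawlers (general-purpose crawlers) are allowed to fetch certain content.

Note: focused crawlers are not bound by robots.txt, since nothing technically enforces it.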
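As a rough illustration of how a crawler could consult these rules, here is a minimal sketch using the standard library's urllib.robotparser. The robots.txt content, the user-agent name "my-crawler", and the example.com URLs are all made up for this example; a real crawler would load the file from the target site instead.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here instead of a real site's file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "my-crawler" is an assumed user-agent name.
print(parser.can_fetch("my-crawler", "http://example.com/index.html"))
print(parser.can_fetch("my-crawler", "http://example.com/private/data.html"))
```

With these sample rules, the first URL is allowed and the second falls under the Disallow rule. To check a live site, RobotFileParser also offers set_url() and read() to download the file first.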


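Crawling workflow:

In most cases the requirement calls for a focused crawler, which extracts specific pieces of data from a page rather than the whole page.

1. Specify the URL
2. Send the request
3. Receive the response data
4. Parse the data
5. Store the data persistently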
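The five steps above can be sketched as one small function. This is only an illustration under stated assumptions: the helper names (fetch, parse_title, crawl, fake_fetch), the choice of extracting the page title as the "specific data", and the regex-based parsing are not from the original notes. The demo uses an in-memory page so it runs without network access.

```python
import re
import tempfile
import urllib.request
from pathlib import Path


def fetch(url: str) -> bytes:
    """Steps 2-3: send the request and return the raw response bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def parse_title(html: str) -> str:
    """Step 4: extract only the data we care about (here, the page title)."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def crawl(url: str, out_path: str, fetcher=fetch) -> str:
    """Run the whole workflow for one URL (step 1 is the url argument)."""
    raw = fetcher(url)
    html = raw.decode("utf-8", errors="replace")
    title = parse_title(html)
    Path(out_path).write_text(title, encoding="utf-8")  # Step 5: persist
    return title


def fake_fetch(url: str) -> bytes:
    """Stand-in for fetch() so the demo needs no network (hypothetical page)."""
    return b"<html><head><title> Demo page </title></head></html>"


with tempfile.TemporaryDirectory() as tmp_dir:
    print(crawl("http://www.baidu.com/", str(Path(tmp_dir) / "title.txt"), fetcher=fake_fetch))
```

Passing the fetcher as a parameter keeps the parsing and storage steps testable offline; calling crawl() without the fetcher argument would use the real urllib request.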

Test:

Test1:

    import urllib.request


    def load_data():
        url = "http://www.baidu.com/"
        # Send an HTTP GET request.
        # response: the HTTP response object
        response = urllib.request.urlopen(url)
        print(response)


    load_data()

urllib

Test2:

    import urllib.request


    def load_data():
        url = "http://www.baidu.com/"
        # Send an HTTP GET request.
        # response: the HTTP response object
        response = urllib.request.urlopen(url)
        print(response)
        # Read the body; the result is of type bytes.
        data = response.read()
        print(data)


    load_data()

Reading the content

Test3:

    import urllib.request


    def load_data():
        url = "http://www.baidu.com/"
        # Send an HTTP GET request.
        # response: the HTTP response object
        response = urllib.request.urlopen(url)
        # print(response)
        # Read the body; the result is of type bytes.
        data = response.read()
        # print(data)
        # Convert the fetched bytes to a string.
        str_data = data.decode("UTF-8")
        print(str_data)


    load_data()

Reading the content as a string

Test4:

    import urllib.request


    def load_data():
        url = "http://www.baidu.com/"
        # Send an HTTP GET request.
        # response: the HTTP response object
        response = urllib.request.urlopen(url)
        # print(response)
        # Read the body; the result is of type bytes.
        data = response.read()
        # print(data)
        # Convert the fetched bytes to a string.
        str_data = data.decode("UTF-8")
        # print(str_data)
        # Write the data to a file.
        with open("baidu.html", "w", encoding="utf-8") as f:
            f.write(str_data)


    load_data()

Writing the data to a file

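Note:

In these tests the https URL could not be opened, while the http URL could. urllib itself supports HTTPS, so the failure most likely comes from the security checks around the connection (for example certificate verification) rather than from https being unsupported.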

Test5:

    str_name = "baidu"
    bytes_name = str_name.encode("utf-8")
    # Print the converted value to see the bytes result.
    print(bytes_name)

Converting a string to bytes

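Note:

Data fetched by a Python crawler comes in two types: str and bytes.

If the fetched data is bytes but a string is needed for writing, call decode("utf-8").

If the fetched data is str but bytes are needed for writing, call encode("utf-8").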
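To make the two conversions concrete, here is a small sketch that round-trips a string through bytes and saves the same content both ways. The function names (save_as_text, save_as_bytes), the sample text, and the file names are made up for this example.

```python
import tempfile
from pathlib import Path


def save_as_text(path: Path, raw: bytes) -> None:
    """bytes -> str with decode(), then write in text mode."""
    path.write_text(raw.decode("utf-8"), encoding="utf-8")


def save_as_bytes(path: Path, text: str) -> None:
    """str -> bytes with encode(), then write in binary mode."""
    path.write_bytes(text.encode("utf-8"))


text = "baidu 爬蟲"               # sample string (an assumption for the demo)
raw = text.encode("utf-8")        # str -> bytes
restored = raw.decode("utf-8")    # bytes -> str

with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = Path(tmp_dir) / "from_bytes.html"
    bytes_path = Path(tmp_dir) / "from_str.html"
    save_as_text(text_path, raw)
    save_as_bytes(bytes_path, text)
    # Both files should hold identical bytes on disk.
    same_on_disk = text_path.read_bytes() == bytes_path.read_bytes()

print(type(raw).__name__, type(restored).__name__, same_on_disk)
```

Either route produces the same file as long as the same encoding is used on both sides; mixing encodings is the usual cause of garbled output.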

Code for Test1 to Test5:

    import urllib.request


    def load_data():
        url = "http://www.baidu.com/"
        # Send an HTTP GET request.
        # response: the HTTP response object
        response = urllib.request.urlopen(url)
        # print(response)
        # Read the body; the result is of type bytes.
        data = response.read()
        # print(data)
        # Convert the fetched bytes to a string.
        str_data = data.decode("UTF-8")
        # print(str_data)
        # Write the data to a file.
        with open("baidu.html", "w", encoding="utf-8") as f:
            f.write(str_data)


    # Convert a string to bytes.
    str_name = "baidu"
    bytes_name = str_name.encode("utf-8")
    print(bytes_name)

    load_data()


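Test6:

    import string
    import urllib.parse
    import urllib.request


    def get_method_params():
        url = "http://www.baidu.com/?wd="
        # Append the search keyword (Chinese characters).
        name = "爬蟲"
        final_url = url + name
        # print(final_url)

        # The URL now contains Chinese characters, which ASCII does not have.
        # Sending it unescaped fails with:
        # UnicodeEncodeError: 'ascii' codec can't encode characters in
        # position 15-16: ordinal not in range(128)
        # response = urllib.request.urlopen(final_url)

        # Percent-encode the non-ASCII characters before sending the request.
        encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
        print(encode_new_url)

        # Why the error occurs: the HTTP request line is encoded as ASCII
        # (code points 0-127), which cannot represent Chinese characters.
        # Python strings themselves are Unicode; only the URL must be escaped.


    get_method_params()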

GET - params

    import string
    import urllib.parse
    import urllib.request


    def get_method_params():
        url = "http://www.baidu.com/?wd="
        # Append the search keyword (Chinese characters).
        name = "爬蟲"
        final_url = url + name
        # print(final_url)

        # The URL contains Chinese characters, which ASCII does not have,
        # so percent-encode it before sending the request.
        encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
        response = urllib.request.urlopen(encode_new_url)
        print(response)

        # Without the escaping step, urlopen(final_url) raises:
        # UnicodeEncodeError: 'ascii' codec can't encode characters in
        # position 15-16: ordinal not in range(128)
        # The HTTP request line is encoded as ASCII (0-127), which cannot
        # represent Chinese characters.


    get_method_params()

Running it directly

    import string
    import urllib.parse
    import urllib.request


    def get_method_params():
        url = "http://www.baidu.com/?wd="
        # Append the search keyword (Chinese characters).
        name = "爬蟲"
        final_url = url + name
        # print(final_url)

        # The URL contains Chinese characters, which ASCII does not have,
        # so percent-encode it before sending the request.
        encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
        response = urllib.request.urlopen(encode_new_url)
        print(response)

        # Read the body and decode it to a string (UTF-8 by default).
        data = response.read().decode()
        print(data)

        # Without the escaping step, urlopen(final_url) raises:
        # UnicodeEncodeError: 'ascii' codec can't encode characters in
        # position 15-16: ordinal not in range(128)
        # The HTTP request line is encoded as ASCII (0-127), which cannot
        # represent Chinese characters.


    get_method_params()

Reading the content

    import string
    import urllib.parse
    import urllib.request


    def get_method_params():
        url = "http://www.baidu.com/?wd="
        # Append the search keyword (Chinese characters).
        name = "爬蟲"
        final_url = url + name
        # print(final_url)

        # The URL contains Chinese characters, which ASCII does not have,
        # so percent-encode it before sending the request.
        encode_new_url = urllib.parse.quote(final_url, safe=string.printable)
        response = urllib.request.urlopen(encode_new_url)
        print(response)

        # Read the body and decode it to a string (UTF-8 by default).
        data = response.read().decode()
        print(data)

        # Save the result locally.
        with open("encode_test.html", "w", encoding="utf-8") as f:
            f.write(data)

        # Without the escaping step, urlopen(final_url) raises:
        # UnicodeEncodeError: 'ascii' codec can't encode characters in
        # position 15-16: ordinal not in range(128)
        # The HTTP request line is encoded as ASCII (0-127), which cannot
        # represent Chinese characters.


    get_method_params()
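As an alternative to quoting the whole URL, the query string can be built from a dictionary with urllib.parse.urlencode, which percent-encodes each value. This is a different technique from the quote() call used in the notes, shown only as a sketch; the helper name build_search_url is made up for this example, and the block only builds the URL without sending a request.

```python
import urllib.parse


def build_search_url(base_url: str, params: dict) -> str:
    """Return base_url with params appended as a percent-encoded query string."""
    query = urllib.parse.urlencode(params)  # non-ASCII values are encoded as UTF-8
    return f"{base_url}?{query}"


url = build_search_url("http://www.baidu.com/", {"wd": "爬蟲"})
print(url)

# Decoding the query string again recovers the original keyword.
decoded = urllib.parse.parse_qs(urllib.parse.urlsplit(url).query)
print(decoded)
```

The resulting URL contains only ASCII characters, so it can be passed to urllib.request.urlopen without triggering the UnicodeEncodeError described above. With several parameters, urlencode also takes care of joining them with "&".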