Simulating a Zhihu Login with Python

Date: 2019-11-07

Tags: python, simulated login. Category: Python



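Having finished the earlier stages, we can now try simulating a login to a real website. Here we use Zhihu as the experiment. It took me a little over three days in total, and below I share how I got the simulated login working step by step. I hope you can learn from my mistakes.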

A first attempt at the simulated login

The code below is what I wrote at first. Let's go through it slowly.

import requests
from bs4 import BeautifulSoup as bs
ssesion = requests.session()
headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Host': 'www.zhihu.com',
}

login_data = {'username': '',  # replace with your account
              'password': '',  # replace with your password
              'remember_me': 'true',
              'Referer': 'https://www.baidu.com/',
              }

response = bs(requests.get('http://www.zhihu.com/#signin').content, 'html.parser')
xsrf = response.find('input',attrs={'name':'_xsrf'})['value']
login_data['_xsrf'] =xsrf
responed = ssesion.post('http://www.zhihu.com/login/email',headers=headers,data=login_data)

print(responed)

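When I first set out to simulate the Zhihu login, I captured the traffic and found that the cookie carries an attribute called _xsrf, which works like a token. Because of it, the simulated login has to send this value as a parameter along with the request. So how do we get hold of it? That looked like another problem.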

The approach I came up with is to visit any page, locate the _xsrf field among the page elements, grab its value, add it to data, and send it along with the request.
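The lookup described above can be tried offline. The snippet below is a minimal sketch, not the code used in this post: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages, and SAMPLE_PAGE is a made-up stand-in for the real page, whose markup may differ. The names HiddenFieldFinder and find_hidden_field are hypothetical helpers introduced only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the login page; the real markup may differ.
SAMPLE_PAGE = """
<form method="post">
  <input type="hidden" name="_xsrf" value="0123456789abcdef">
  <input type="text" name="username">
</form>
"""


class HiddenFieldFinder(HTMLParser):
    """Record the value of the first <input> whose name matches."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.value = None

    def handle_starttag(self, tag, attrs):
        # Ignore other tags, and stop looking once a value has been found.
        if tag != "input" or self.value is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.field_name:
            self.value = attributes.get("value")


def find_hidden_field(html, field_name):
    """Return the field's value, or None when the page has no such input."""
    finder = HiddenFieldFinder(field_name)
    finder.feed(html)
    return finder.value


login_data = {"username": "", "password": ""}
token = find_hidden_field(SAMPLE_PAGE, "_xsrf")
if token is not None:
    login_data["_xsrf"] = token  # send the token along with the form data
print(login_data)
```

Returning None for a missing field leaves the decision to the caller, who can report a clear error rather than fail on a subscript.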

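As for why the request goes through a session (the ssesion object): on Zhihu the xsrf value keeps changing, and it is different on every request. A session keeps the cookies from earlier responses, whereas each plain requests call starts fresh, so if we sent the request with plain requests the login would fail.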
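To make the role of the session concrete, here is a minimal sketch that needs no network access. It assumes the requests package is installed. The cookie value and the shortened User-Agent are invented, and prepare_request is used only to inspect what the session would send, not to contact Zhihu.

```python
import requests

session = requests.Session()

# Headers set on the session are sent with every request made through it.
session.headers.update({"User-Agent": "Mozilla/5.0 (example)"})

# Pretend an earlier response stored this cookie; the value is invented.
session.cookies.set("_xsrf", "0123456789abcdef", domain="www.zhihu.com", path="/")

# Build the request without sending it, to see what would go on the wire.
request = requests.Request(
    "POST",
    "http://www.zhihu.com/login/email",
    data={"username": "", "password": ""},
)
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

A bare requests.get or requests.post call starts with an empty cookie jar each time, so cookies picked up by one call are available to the next only when both go through the same session.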

The code above should now do what we need. Let's run it and look at the result.

Traceback (most recent call last):
  File "C:/Users/Administrator/PycharmProjects/Practice/Login_zhihu.py", line 20, in <module>
    xsrf = response.find('input',attrs={'name':'_xsrf'})['value']
TypeError: 'NoneType' object is not subscriptable

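It fails: no xsrf was found. What now? The message is a TypeError: find returned None, and subscripting None with ['value'] raised the error. So the problem lies in the part that fetches xsrf. Let's pull that part out and run it on its own.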

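Locating and fixing the error

Now that we know where the error comes from, let's see what exactly went wrong and how to fix it.

First, I took the xsrf-fetching code out and ran it by itself.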

import requests
from bs4 import BeautifulSoup as bs
response = bs(requests.get('http://www.zhihu.com/#signin').content, 'html.parser')
print(response)

xsrf = response.find('input',attrs={'name':'_xsrf'})['value']

print(xsrf)

I print the two values separately so that we can see which step fails.

Running this code gives the following output:

Traceback (most recent call last):
  File "C:/Users/Administrator/PycharmProjects/Practice/Login_zhihu.py", line 6, in <module>
    xsrf = response.find('input',attrs={'name':'_xsrf'})['value']
TypeError: 'NoneType' object is not subscriptable
<html><body><h1>500 Server Error</h1>
An internal server error occured.
</body></html>

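The page is a 500 error, which means the GET request itself had already failed, and so the xsrf lookup below it found nothing. My first instinct was to fix the empty xsrf, but that was a wrong turn. The lookup failed only because the request failed, so there was no xsrf in the response to find. The 500 has to be solved first.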
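One way to avoid this trap is to check the HTTP status before parsing anything. The sketch below is an illustration under stated assumptions: FakeResponse is a hypothetical stand-in exposing the status_code and text attributes that a requests response has, so the example runs offline, and require_success is a made-up helper name. With a real requests response, calling raise_for_status() performs a similar check for 4xx and 5xx statuses.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a requests response (offline demo only)."""
    status_code: int
    text: str


def require_success(response):
    """Return the body on HTTP 200, otherwise fail with a clear message."""
    if response.status_code != 200:
        snippet = response.text[:60]  # enough of the body to recognise an error page
        raise RuntimeError(
            f"request failed with HTTP {response.status_code}: {snippet!r}"
        )
    return response.text


error_page = FakeResponse(500, "<html><body><h1>500 Server Error</h1></body></html>")
try:
    require_success(error_page)
except RuntimeError as exc:
    print(exc)
```

With a check like this in front of the parser, a failed request is reported as a failed request instead of surfacing later as a TypeError on a missing element.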

So how do we fix the 500?

On the advice of a more experienced colleague, I added headers to the GET request and ran it again.

import requests
from bs4 import BeautifulSoup as bs
headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Host': 'www.zhihu.com',
}

login_data = {'username': '',  # replace with your account
              'password': '',  # replace with your password
              'remember_me': 'true',
              'Referer': 'https://www.baidu.com/',
              }

response = bs(requests.get('http://www.zhihu.com/#signin',headers=headers).content, 'html.parser')
xsrf = response.find('input',attrs={'name':'_xsrf'})['value']

print(xsrf)

Let's run it again:

899ce2556d7e705ca9bbf2b818a48d40

Good. This time the xsrf value was fetched successfully, so let's put this change back into the earlier login code.

A successful simulated login to Zhihu

import requests
from bs4 import BeautifulSoup as bs
ssesion = requests.session()
headers = {
    'Connection': 'keep-alive',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Host': 'www.zhihu.com',
}

login_data = {'username': '',  # replace with your account
              'password': '',  # replace with your password
              'remember_me': 'true',
              'Referer': 'https://www.baidu.com/',
              }

response = bs(requests.get('http://www.zhihu.com/#signin',headers=headers).content, 'html.parser')
xsrf = response.find('input',attrs={'name':'_xsrf'})['value']
login_data['_xsrf'] =xsrf
responed = ssesion.post('http://www.zhihu.com/login/email',headers=headers,data=login_data)

print(responed)

Running this code gives:

<Response [200]>

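The status is 200, so the request went through and I count the simulated login as successful. Strictly speaking, 200 only shows that the server accepted the request; the response body is what confirms the login. It took quite a few setbacks to get here. Locating the error alone cost me a whole evening, and several programmers I asked could not solve it either. A word of advice: do not repeat my mistake. When writing a crawler, always remember to send header information with your requests.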
