Zhihu content can only be viewed after logging in, so unlike the previous examples, here we have to submit login information to the site.
First, try fetching the Zhihu login page:
```python
import requests

def getHtmlText(url):
    try:
        r = requests.get(url)
        r.encoding = 'utf-8'
        return r.text
    except:
        return ''

url = 'https://www.zhihu.com/'
getHtmlText(url)
# '<html><body><h1>500 Server Error</h1>\nAn internal server error occured.\n</body></html>\n'
```
This returns a 500 Server Error. The fix is to pass headers={...} so that the User-Agent is changed to a browser's:
```python
def getHtmlText(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'}
    try:
        r = requests.get(url, headers=headers)
        r.encoding = 'utf-8'
        return r.text
    except:
        return ''
```
On the Zhihu login page, open Chrome DevTools with F12 and tick the "Preserve log" box so that the requests captured so far are not overwritten when the page redirects after login. Then enter the account and password, click Log in, and you can see the form data that has to be submitted.
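What shows up under Form Data should look roughly like the sketch below; the values are placeholders, and the fields mirror what the login script that follows submits.

```python
# A sketch of the captured Form Data (values are placeholders, not real credentials):
postdata = {
    '_xsrf': '<token taken from a hidden input on the login page>',
    'password': '<your password>',
    'phone_num': '<your phone number>',   # or 'email': '<your email address>'
    'remember_me': 'true',
    # 'captcha': '<captcha text>'         # only present when a captcha is required
}
```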
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
import re
import time
from PIL import Image
from bs4 import BeautifulSoup
import json

# Build the request headers
# Login page URL
logn_url = 'http://www.zhihu.com/#signin'

session = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
}
content = session.get(logn_url, headers=headers).content
soup = BeautifulSoup(content, 'html.parser')


def getxsrf():
    # The _xsrf token sits in a hidden <input> on the login page
    return soup.find('input', attrs={'name': "_xsrf"})['value']


# Fetch the captcha image and ask the user to type it in
def get_captcha():
    t = str(int(time.time() * 1000))
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + "&type=login"
    r = session.get(captcha_url, headers=headers)
    with open('captcha.jpg', 'wb') as f:
        f.write(r.content)
    im = Image.open('captcha.jpg')
    im.show()
    im.close()
    captcha = input("please input the captcha\n>")
    return captcha


def isLogin():
    # Check the profile settings page to decide whether we are already logged in
    url = "https://www.zhihu.com/settings/profile"
    login_code = session.get(url, allow_redirects=False).status_code
    if login_code == 200:
        return True
    else:
        return False


def login(secret, account):
    # Decide from the entered account whether it is a phone number
    if re.match(r"^1\d{10}$", account):
        print("Logging in with a phone number\n")
        post_url = 'http://www.zhihu.com/login/phone_num'
        postdata = {
            '_xsrf': getxsrf(),
            'password': secret,
            'remember_me': 'true',
            'phone_num': account,
        }
    else:
        print("Logging in with an email address\n")
        post_url = 'http://www.zhihu.com/login/email'
        postdata = {
            '_xsrf': getxsrf(),
            'password': secret,
            'remember_me': 'true',
            'email': account,
        }
    try:
        # Log in directly when no captcha is required
        login_page = session.post(post_url, data=postdata, headers=headers)
        login_code = login_page.text
        print(login_page.status_code)
        print(login_code)
    except:
        # A captcha has to be entered before the login succeeds
        postdata["captcha"] = get_captcha()
        login_page = session.post(post_url, data=postdata, headers=headers)
        login_code = json.loads(login_page.text)
        print(login_code['msg'])


if __name__ == '__main__':
    if isLogin():
        print('You are already logged in')
    else:
        account = input('Please enter your username\n> ')
        secret = input("Please enter your password\n> ")
        login(secret, account)
```
Known issue: if the captcha encountered is of the "click the inverted characters in the image" or "drag the slider" type, the login fails and you are redirected back to the login page. Reportedly this can be solved with a captcha-solving service.
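Purely as a hypothetical sketch of that idea: get_captcha() in the script above could be adapted to send the image to a solving service instead of asking for manual input. The service URL and the response format here are invented placeholders, not a real API, and an image-click or slider captcha would need the provider's own workflow.

```python
import time
import requests

# Hypothetical sketch only: SOLVER_URL and the JSON response shape are invented placeholders
# for whatever captcha-solving service is used; `session` and `headers` are the ones created
# in the login script above.
SOLVER_URL = 'https://captcha-solver.example/api'

def get_captcha_via_service(session, headers):
    t = str(int(time.time() * 1000))
    captcha_url = 'http://www.zhihu.com/captcha.gif?r=' + t + '&type=login'
    image = session.get(captcha_url, headers=headers).content          # fetch the captcha image
    resp = requests.post(SOLVER_URL, files={'image': ('captcha.jpg', image)})
    return resp.json().get('text', '')                                 # value for postdata['captcha']
```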
Open the page you want to scrape, send a request, and check which headers it needs. The important ones are 'User-Agent' and 'Cookie'; add them accordingly:
```python
import requests
import re

url = 'https://www.zhihu.com/question/22591304/followers'
headers = {'User-Agent': '',   # fill in your own values
           'Cookie': ''}
page = requests.get(url, headers=headers).text
imgs = re.findall(r'<img src=\"(.*?)_m.jpg', page)
```
Inspect imgs to see the matched image URLs.
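As a quick follow-up (assuming imgs comes from the GET request above), each matched prefix can be turned into the full-size image URL by dropping the _m suffix:

```python
# Each match is the URL prefix before "_m.jpg"; appending ".jpg" gives the full-size image,
# which is also what the avatar-crawling script below does.
pics = [img + '.jpg' for img in imgs]
print(pics[:5])   # quick look at the first few avatar URLs
```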
Below is a piece of code that crawls Zhihu avatars:
```python
# -*- coding: utf-8 -*-
# py3.6
import requests
import urllib.request
import re
import random
from time import sleep


def main():
    url = 'https://www.zhihu.com/question/22591304/followers'
    headers = {'User-Agent': '',
               'Cookie': ''}
    i = 1
    for x in range(20, 40, 20):
        data = {'start': '0',
                'offset': str(i),
                '_xsrf': '2e65c02ceeaaa1ac16d193415cf8d5be'}
        page = requests.post(url, headers=headers, data=data, timeout=50).text
        # Use a regex on the fetched JSON to extract image addresses; dropping _m gives the full-size image
        imgs = re.findall(r'<img src=\\"(.*?)_m.jpg', page)
        for img in imgs:
            try:
                img = img.replace('\\', '')              # strip the interfering backslash characters
                pic = img + '.jpg'
                path = 'd:\\zhihu\\' + str(i) + '.jpg'   # storage path and file name
                urllib.request.urlretrieve(pic, path)    # download the image
                print('Downloaded image No. ' + str(i))
                i += 1
                sleep(random.uniform(0.5, 1))            # sleep to avoid getting the IP banned for crawling too fast
            except:
                pass
        sleep(random.uniform(0.5, 1))


if __name__ == '__main__':
    main()
```
It seems the .text returned by the get and post methods differs in form: the post response represents " with the escape \\", while the get response does not.
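A minimal sketch of one way to cope with this, assuming page holds the response text of either kind: strip the escaping first so a single pattern works for both.

```python
import re

# Normalize the POST-style escaping first, so the same pattern matches the .text of
# both GET and POST responses (assumes `page` already holds the raw response text).
normalized = page.replace('\\"', '"').replace('\\/', '/')
imgs = re.findall(r'<img src="(.*?)_m\.jpg', normalized)
```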
Also, the '_xsrf' field in data does not show up under Form Data in F12, but it is indispensable. Is it hidden somewhere? Do we need a packet-capture tool? In fact it sits in a hidden <input name="_xsrf"> in the page HTML, which is exactly what getxsrf() in the login script extracts, so a plain GET of the page is enough to obtain it.
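A minimal sketch of fetching it that way instead of hardcoding it (the header values are placeholders, and the exact attribute order in the markup may differ):

```python
import re
import requests

headers = {'User-Agent': '', 'Cookie': ''}   # fill in your own values
url = 'https://www.zhihu.com/question/22591304/followers'
html = requests.get(url, headers=headers).text
match = re.search(r'name="_xsrf"\s+value="(.*?)"', html)
xsrf = match.group(1) if match else ''       # use this instead of the hardcoded value in data
```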