Scrapy Distributed Crawler to Build a Search Engine (imooc) -- Crawling Zhihu (Part 1)

 Section 1: How sessions and cookies work

  The difference between a session and a cookie

  • A cookie is the browser's local storage mechanism (data is kept as key-value pairs)
  • HTTP is a stateless protocol: the server simply returns a response to each request, without knowing who sent it (a stateless request)

 

  •  Stateful request: the server identifies the client across requests, typically through a session ID that the browser sends back in its cookie (see the sketch right below)
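  •  A minimal sketch of the stateless vs. stateful point above, using requests (the httpbin.org echo service is my own choice for illustration, not part of the course): two independent requests.get() calls share no cookies, while a requests session keeps the cookie jar and replays it automatically.
    #coding:utf-8
    import requests

    # Two plain requests are independent: the cookie set by the first one is not
    # sent with the second, because HTTP itself carries no state.
    requests.get("https://httpbin.org/cookies/set/sid/abc123")
    print requests.get("https://httpbin.org/cookies").json()   # {u'cookies': {}}

    # A session stores the cookies the server sets and sends them back on later
    # requests, which is what turns stateless HTTP into stateful requests.
    session = requests.session()
    session.get("https://httpbin.org/cookies/set/sid/abc123")
    print session.get("https://httpbin.org/cookies").json()    # {u'cookies': {u'sid': u'abc123'}}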


Section 2: The browser

  • Status codes: the numeric code in the HTTP response that tells the client how the request went (200 OK, 3xx redirect, 4xx client error, 5xx server error); a small sketch follows
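  • A quick way to look at the status code with requests, as a sketch (what zhihu.com actually returns can of course change over time):
    #coding:utf-8
    import requests

    response = requests.get("https://www.zhihu.com")
    print response.status_code   # the numeric status code, e.g. 200 or 500
    print response.history       # any redirect responses that requests followed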

 

  •  zhihu_login_requests.py, source code version 1:
    #coding:utf-8

    import re
    import requests
    try:
        import cookielib
    except:
        import http.cookiejar as cookielib

    def get_xsrf():
        # first attempt at getting the xsrf code: just print the home page
        response = requests.get("https://www.zhihu.com")
        print (response.text)
        return ""

    def zhihu_login(account, password):
        # log in to Zhihu
        if re.match("^1\d{10}", account):
            print "Logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": "",
                "phone_num": account,
                "password": password
            }

    get_xsrf()

 

 

  •  Result of running it (a 500 error is returned, because the request goes out with the default local request headers rather than browser headers):
    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    <html><body><h1>500 Server Error</h1>
    An internal server error occured.
    </body></html>

 

  • How to fix the 500 error: add browser request headers (a quick check with these headers is sketched right after the snippet)
    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }
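
  • A sketch of the quick check mentioned above (the User-Agent string here is the truncated one from these notes, and the exact response depends on Zhihu at the time): pass the header dict to requests and the home page should come back 200 instead of 500.
    #coding:utf-8
    import requests

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"  # truncated UA from the notes
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    response = requests.get("https://www.zhihu.com", headers=header)
    print response.status_code   # expected to be 200 once browser-style headers are sent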

 

  • Connecting through a session (note: response.text has to be encoded as UTF-8 before being written to the file; a note on reusing the saved cookie.txt follows the run result below)
    #coding:utf-8

    import re
    import requests
    try:
        import cookielib
    except:
        import http.cookiejar as cookielib

    session = requests.session()
    session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
    try:
        session.cookies.load(ignore_discard=True)
    except:
        print "cookie could not be loaded"

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    def get_xsrf():
        # fetch the xsrf code from the home page
        response = session.get("https://www.zhihu.com", headers=header)
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
        if match_obj:
            print (match_obj.group(1))
            return match_obj.group(1)
        else:
            return ""

    def get_index():
        # save the home page seen by the current session to a local file
        response = session.get("https://www.zhihu.com", headers=header)
        with open("index_page.html", "wb") as f:
            f.write(response.text.encode("utf-8"))
        print "ok"

    def zhihu_login(account, password):
        # log in to Zhihu
        if re.match("^1\d{10}", account):
            print "Logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": get_xsrf(),
                "phone_num": account,
                "password": password
            }

            response = session.post(post_url, data=post_data, headers=header)

            session.cookies.save()

    zhihu_login("15603367590", "0019wan,.WEI3618")
    get_index()

     

  • Result of running it (two new files are created: cookie.txt and index_page.html)
    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    Logging in with a phone number
    ok
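
  • A sketch of what the saved cookie.txt buys on a later run (assuming a previous run logged in successfully and session.cookies.save() wrote the file): the LWP cookie jar can be loaded into a fresh session, so the logged-in state is reused without posting the account and password again.
    #coding:utf-8
    import requests
    try:
        import cookielib
    except:
        import http.cookiejar as cookielib

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    session = requests.session()
    session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
    session.cookies.load(ignore_discard=True)   # raises IOError if cookie.txt is missing

    # The cookies loaded from disk go out with this request automatically,
    # so the response is the page as seen by the already logged-in user.
    response = session.get("https://www.zhihu.com", headers=header)
    print response.status_code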
