Scrapy Distributed Crawler to Build a Search Engine (imooc) -- Crawling Zhihu (Part 1)

 Section 1: How sessions and cookies work

  The difference between a session and a cookie

  • A cookie is the browser's local storage mechanism (data is kept as key-value pairs)
  • HTTP is a stateless protocol: the server simply returns a response to each request, without knowing who sent it (a stateless request)

 

  •  Stateful request: the server identifies the client across requests, typically through a session ID that the browser sends back in its cookie (see the sketch right below)
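  •  A minimal sketch of the stateless vs. stateful point above, using requests (the httpbin.org echo service is my own choice for illustration, not part of the course): two independent requests.get() calls share no cookies, while a requests session keeps the cookie jar and replays it automatically.
    #coding:utf-8
    import requests

    # Two plain requests are independent: the cookie set by the first one is not
    # sent with the second, because HTTP itself carries no state.
    requests.get("https://httpbin.org/cookies/set/sid/abc123")
    print requests.get("https://httpbin.org/cookies").json()   # {u'cookies': {}}

    # A session stores the cookies the server sets and sends them back on later
    # requests, which is what turns stateless HTTP into stateful requests.
    session = requests.session()
    session.get("https://httpbin.org/cookies/set/sid/abc123")
    print session.get("https://httpbin.org/cookies").json()    # {u'cookies': {u'sid': u'abc123'}}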


Section 2: The browser

  • Status codes: the numeric code in the HTTP response that tells the client how the request went (200 OK, 3xx redirect, 4xx client error, 5xx server error); a small sketch follows
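  • A quick way to look at the status code with requests, as a sketch (what zhihu.com actually returns can of course change over time):
    #coding:utf-8
    import requests

    response = requests.get("https://www.zhihu.com")
    print response.status_code   # the numeric status code, e.g. 200 or 500
    print response.history       # any redirect responses that requests followed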

 

  •  zhihu_login_requests.py, source code version 1:
    #coding:utf-8

    import re
    import requests
    try:
        import cookielib
    except:
        import http.cookiejar as cookielib

    def get_xsrf():
        # first attempt at getting the xsrf code: just print the home page
        response = requests.get("https://www.zhihu.com")
        print (response.text)
        return ""

    def zhihu_login(account, password):
        # log in to Zhihu
        if re.match("^1\d{10}", account):
            print "Logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": "",
                "phone_num": account,
                "password": password
            }

    get_xsrf()

 

 

  •  Result of running it (a 500 error is returned, because the request goes out with the default local request headers rather than browser headers):
    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    <html><body><h1>500 Server Error</h1>
    An internal server error occured.
    </body></html>

 

  • How to fix the 500 error: add browser request headers (a quick check with these headers is sketched right after the snippet)
    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }
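
  • A sketch of the quick check mentioned above (the User-Agent string here is the truncated one from these notes, and the exact response depends on Zhihu at the time): pass the header dict to requests and the home page should come back 200 instead of 500.
    #coding:utf-8
    import requests

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"  # truncated UA from the notes
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    response = requests.get("https://www.zhihu.com", headers=header)
    print response.status_code   # expected to be 200 once browser-style headers are sent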

 

  • Connecting through a session (note: response.text has to be encoded as UTF-8 before being written to the file; a note on reusing the saved cookie.txt follows the run result below)
    #coding:utf-8

    import re
    import requests
    try:
        import cookielib
    except:
        import http.cookiejar as cookielib

    session = requests.session()
    session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
    try:
        session.cookies.load(ignore_discard=True)
    except:
        print "cookie could not be loaded"

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    def get_xsrf():
        # fetch the xsrf code from the home page
        response = session.get("https://www.zhihu.com", headers=header)
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text)
        if match_obj:
            print (match_obj.group(1))
            return match_obj.group(1)
        else:
            return ""

    def get_index():
        # save the home page seen by the current session to a local file
        response = session.get("https://www.zhihu.com", headers=header)
        with open("index_page.html", "wb") as f:
            f.write(response.text.encode("utf-8"))
        print "ok"

    def zhihu_login(account, password):
        # log in to Zhihu
        if re.match("^1\d{10}", account):
            print "Logging in with a phone number"
            post_url = "https://www.zhihu.com/signup?next=%2F"
            post_data = {
                "_xsrf": get_xsrf(),
                "phone_num": account,
                "password": password
            }

            response = session.post(post_url, data=post_data, headers=header)

            session.cookies.save()

    zhihu_login("15603367590", "0019wan,.WEI3618")
    get_index()

     

  • Result of running it (two new files are created: cookie.txt and index_page.html)
    C:\Python27\python.exe F:/everyday/Zhihu/zhihu_login_requests.py
    Logging in with a phone number
    ok
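
  • A sketch of what the saved cookie.txt buys on a later run (assuming a previous run logged in successfully and session.cookies.save() wrote the file): the LWP cookie jar can be loaded into a fresh session, so the logged-in state is reused without posting the account and password again.
    #coding:utf-8
    import requests
    try:
        import cookielib
    except:
        import http.cookiejar as cookielib

    agent = "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/57.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    session = requests.session()
    session.cookies = cookielib.LWPCookieJar(filename="cookie.txt")
    session.cookies.load(ignore_discard=True)   # raises IOError if cookie.txt is missing

    # The cookies loaded from disk go out with this request automatically,
    # so the response is the page as seen by the already logged-in user.
    response = session.get("https://www.zhihu.com", headers=header)
    print response.status_code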
