Part 10: Simulated Login (Logging in to GitHub and Scraping It; Building a Cookies Pool)

Date: 2019-12-17


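Preface: Some pages can only be viewed after logging in, so a crawler has to log in before it can scrape them. When you log in through a browser, Cookies are created on the client, and those Cookies carry the SessionID. Every request after login sends the Cookies back to the server. The server reads the SessionID from the Cookies and looks up the corresponding session. If that session is still valid, the server treats the user as logged in and returns the requested page, which is why the logged-in view becomes visible.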
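The session lookup described above can be sketched in a few lines. This is only an illustration built on Python's standard library, not how GitHub or any real server implements it; the cookie name `session_id`, the `SESSIONS` store, and both helper functions are hypothetical.

```python
from http.cookies import SimpleCookie
import secrets

# Hypothetical in-memory session store: SessionID -> session data.
SESSIONS = {}


def log_in(username):
    """Create a session and return the cookie string the server would set."""
    session_id = secrets.token_hex(16)
    SESSIONS[session_id] = {'user': username}
    cookie = SimpleCookie()
    cookie['session_id'] = session_id
    return cookie['session_id'].OutputString()


def current_user(cookie_header):
    """Return the logged-in user for a request's Cookie header, or None."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get('session_id')
    if morsel is None:
        return None
    session = SESSIONS.get(morsel.value)
    return session['user'] if session else None


sent_back = log_in('alice')               # the client stores this cookie
print(current_user(sent_back))            # session found: user is logged in
print(current_user('session_id=bogus'))   # unknown SessionID: not logged in
```

A crawler that holds a valid cookie string is, from the server's point of view, indistinguishable from the browser that logged in, which is what the rest of this part relies on.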

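The key step is therefore obtaining the Cookies issued after login. You could type the username and password into a browser by hand and copy the Cookies out, but that adds manual work. The point of a crawler is automation, so this step should be done by a program, that is, by simulating the login in code. The sections below cover how to simulate a login and how to maintain a Cookies pool.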

I. Simulating Login and Scraping GitHub
Simulated login works by obtaining and maintaining the Cookies issued after login.

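This section walks through logging in to GitHub programmatically and then scraping pages that are only accessible after login, such as the activity feed of the people you follow and your own profile.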

The libraries required are requests and lxml.

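1. Analyzing the login process
Open the GitHub login page at https://github.com/login, enter the username and password, open the developer tools, and tick the Preserve Log option so that the network log is kept across page navigations. Click the sign-in button; the developer tools then list every request that is made. Click the first request (session) to open its details, as shown in Figure 1-1.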

Figure 1-1: Details of the session request
The figure shows that the request URL is https://github.com/session and that the request method is POST. Further down are the Request Headers and the Form Data, shown in Figure 1-2.

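Figure 1-2: Request Headers and Form Data
The headers include Cookies, Host, Origin, Referer, User-Agent, and so on. The Form Data contains six fields: commit is the fixed string Sign in; utf8 is a check-mark character; authenticity_token is a long string that looks like Base64-encoded data (Base64 is an encoding, not encryption); login is the username; password is the password; and webauthn-support relates to web authentication support and defaults to supported.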

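Of these, the two items we cannot construct yet are the Cookies and authenticity_token. Before logging in, the browser loads the login page with a GET request. When the username and password are submitted, the browser already sends both items along with them, so the Cookies and authenticity_token must have been set when the login page was loaded.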

Log out, clear the Cookies, and return to the login page. Log in again and capture the requests that occur, as shown in Figure 1-3.

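Figure 1-3: Captured requests
In the captured requests, the Response Headers contain a Set-Cookie field; this is where the Cookies are set. The Response Headers contain nothing related to authenticity_token, so the token is either hidden somewhere else or computed. Searching the page source for authenticity_token shows that it is embedded there, in a hidden form input element, as shown in Figure 1-4.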

Figure 1-4: The authenticity_token form element
All the required information is now available, so the simulated login can be implemented.
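The idea of pulling the token out of a hidden form input can be tried offline first. The sketch below uses the standard library's `html.parser` rather than lxml so that it runs without extra dependencies; the `TokenParser` class, the `extract_token` helper, and the sample markup are all made up for illustration and only imitate the shape of a login form.

```python
from html.parser import HTMLParser


class TokenParser(HTMLParser):
    """Remember the value of the hidden input named authenticity_token."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == 'input' and attributes.get('name') == 'authenticity_token':
            self.token = attributes.get('value')


def extract_token(html):
    """Return the token found in the markup, or None if there is none."""
    parser = TokenParser()
    parser.feed(html)
    return parser.token


# Made-up sample markup shaped like a login form with a hidden token field.
SAMPLE = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="abc123==" />
  <input type="text" name="login" />
</form>
'''
print(extract_token(SAMPLE))
```

Selecting the input by its name attribute, rather than by its position in the page, keeps the extraction working if other inputs are added around it.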

2. Simulated login code example
Start by defining a Login class and initializing some variables:

    import requests
    from lxml import etree


    class Login():
        """Login class; initializes some variables."""

        def __init__(self):
            self.headers = {
                'Referer': 'https://github.com/login',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
                'Host': 'github.com',
            }
            self.login_url = 'https://github.com/login'
            self.post_url = 'https://github.com/session'
            self.logined_url = 'https://github.com/settings/profile'    # page to fetch after a successful login
            self.session = requests.Session()

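The most important variable here is the requests Session. It keeps a session alive and handles Cookies automatically, so you do not have to manage them yourself. Next, the login page has to be visited for two purposes: to obtain the initial Cookies and to extract authenticity_token. The token() method below does both: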

        # Method of the Login class
        def token(self):
            response = self.session.get(self.login_url, headers=self.headers)
            selector = etree.HTML(response.text)
            # Note: xpath() returns a list
            token = selector.xpath('//div//input[2]/@value')
            return token

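This uses the Session object's get() method to load the GitHub login page and then uses XPath to extract the authenticity_token needed for login. With the initial Cookies and authenticity_token in hand, the login itself can be simulated by a login() method: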

        # Method of the Login class
        def login(self, email, password):
            post_data = {
                'commit': 'Sign in',
                'utf8': '✓',
                'authenticity_token': self.token()[0],
                'login': email,
                'password': password,
                'webauthn-support': 'supported'
            }
            # Submit the login form; requests follows the redirect to the home page
            response = self.session.post(self.post_url, data=post_data, headers=self.headers)
            if response.status_code == 200:
                self.dynamics(response.text)

            # Fetch the profile page with the same (now logged-in) session
            response = self.session.get(self.logined_url, headers=self.headers)
            if response.status_code == 200:
                self.profile(response.text)

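The method first builds the form by copying each field seen in the browser, with email and password passed in as arguments, and then simulates the login with the Session object's post() method. Because requests follows redirects automatically, a successful login lands directly on the home page, which shows the activity of the people you follow; the response is handed to dynamics() for processing. The Session object then requests the profile page, and profile() processes that page. The dynamics() and profile() methods are implemented as follows: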

        # Methods of the Login class
        def dynamics(self, html):
            """Process the page shown after a successful login, i.e. the home page."""
            # The page has already been redirected, so this code prints no feed items
            selector = etree.HTML(html)
            print(html)
            dynamics = selector.xpath('//div[contains(@class, "news")]//div[contains(@class, "Box")]')
            for item in dynamics:
                dynamic = ' '.join(item.xpath('.//div[@class="title"]//text()')).strip()
                print(dynamic)

        def profile(self, html):
            """Process the profile page fetched after a successful login."""
            selector = etree.HTML(html)
            # Every value extracted below is a list
            name = selector.xpath('//input[@id="user_profile_name"]/@value')
            url = selector.xpath('//input[@id="user_profile_blog"]/@value')
            company = selector.xpath('//input[@id="user_profile_company"]/@value')
            location = selector.xpath('//input[@id="user_profile_location"]/@value')
            email = selector.xpath('//select[@id="user_profile_email"]/option[@value!=""]/text()')
            print(name, email, url, company, location)


    if __name__ == '__main__':
        login = Login()
        login.login(email='email or username', password='password')

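Both methods extract information with XPath. dynamics() is meant to extract and print all feed items, but because the URL has been redirected it prints nothing. profile() extracts the personal information and prints it. That completes the class. In the final if block, a Login object is created and login() is called with the username and password; running the program logs in successfully and prints the user's profile information.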

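When simulating a login with a requests Session, the analysis matters most: once every parameter has been obtained correctly, the login itself is straightforward. A successful login effectively establishes a session; the Session object keeps the Cookies, so any further request through it returns the logged-in version of the page.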
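To see what the form built in login() looks like on the wire, the sketch below assembles the same six fields and URL-encodes them with the standard library. `build_login_form` is a hypothetical helper and the token, username, and password are placeholder values; requests performs an equivalent encoding internally when a dict is passed as `data`.

```python
from urllib.parse import urlencode


def build_login_form(token, login, password):
    """Assemble the six form fields observed in the browser's Form Data."""
    return {
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': login,
        'password': password,
        'webauthn-support': 'supported',
    }


# Placeholder values, not real credentials.
form = build_login_form('abc123==', 'user@example.com', 'secret')
body = urlencode(form)   # the application/x-www-form-urlencoded POST body
print(body)
```

Printing the body makes it easy to compare field by field against the Form Data shown in the developer tools when a login attempt fails.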

II. Building a Cookies Pool

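Scraping a site without logging in can run into the following limits:
(1) Pages that require login cannot be scraped at all. For example, some forums show resources only to logged-in users, and some blogs show the full text only after login.
(2) Some pages throttle requests or block the IP when requests come too frequently, while logged-in requests are not restricted in the same way, so a logged-in crawler is less likely to trigger anti-scraping measures.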

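Take the Ajax API of the official Sina Finance Weibo account, https://m.weibo.cn/api/container/getIndex?uid=1804544030&type=uid&page=1&containerid=1076031804544030. Opening this URL directly in a browser returns JSON, which can be parsed to extract the data. Without login, however, the API checks request frequency: too many requests within a short period are throttled, with a message saying that requests are too frequent.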

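Open a new browser window, go to https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/, log in to a Weibo account, and reopen the API URL; it now displays normally. Even so, sending frequent requests from a single logged-in account may get that account banned. Large-scale scraping therefore needs many accounts, with one picked at random for each request, so that each account's request rate, and with it the risk of a ban, stays low. Maintaining the login state of many accounts is the job of a Cookies pool, whose construction is covered next.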

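Sina Weibo serves as the example for building the pool. The Cookies pool stores many Weibo accounts together with the Cookies obtained after login, and it must check each set of Cookies for validity at regular intervals. If a set of Cookies is no longer valid, the pool deletes it and generates new Cookies by simulating the login again. The pool also needs one essential interface: an endpoint that returns a random set of Cookies. Once the pool is running, a crawler only has to request that endpoint to receive random Cookies to scrape with. In short, the Cookies pool has to generate Cookies automatically, test them periodically, and serve random Cookies.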

Prerequisites: a running Redis database; the Python libraries redis-py, requests, Selenium, and Flask; and the Chrome browser with ChromeDriver installed and configured.

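1. Cookies pool architecture
The Cookies pool consists of four basic modules: storage, generation, detection, and interface. Their responsibilities are:
(1) The storage module stores each account's username and password together with the Cookies for that account, and provides methods for convenient reads and writes.
(2) The generation module produces new Cookies. It fetches an account's username and password from the storage module, simulates a login to the target site, and, if the login succeeds, hands the Cookies to the storage module to be saved.
(3) The detection module periodically checks the Cookies in the database. Each site has its own test URL. The detection module takes the Cookies of each account in turn and requests the test URL with them. If the response indicates a valid state, the Cookies are still good; otherwise they have expired and are removed, after which the generation module regenerates them.
(4) The interface module exposes the pool through an API. Since several sets of Cookies are available, the endpoint returns one at random so that every set has a chance of being used. The more Cookies there are, the less often each one is picked and the lower the risk of a ban.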

2. Implementing the Cookies pool
The implementation of each module is described below.

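(1) Storage module
Two kinds of data are stored: account information and Cookies. An account consists of a username and a password, stored in the database as a mapping from username to password. Cookies are stored as a JSON string and must also be tied to a username, so they too form a mapping, from username to Cookies. Redis's Hash structure fits this well. Two Hashes are needed: one for usernames and passwords, and one for usernames and Cookies.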

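In each Hash, the key is the account name and the value is either the password or the Cookies. The pool should also be extensible: the stored accounts and Cookies need not belong only to Sina Weibo, and other sites should be able to plug into the same pool. The Hash names are therefore split into two levels. The Hash of Weibo accounts can be named accounts:weibo and the Hash of Weibo Cookies cookies:weibo. To add a pool for Zhihu, use accounts:zhihu and cookies:zhihu.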
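The two-level naming can be tried without a Redis server. In the sketch below a plain dictionary stands in for Redis, and `hash_name` and `hset` are hypothetical helpers that only imitate the layout described above; the account names and values are made up.

```python
import json

# Stand-in for Redis: hash name -> {field: value}. A real pool would use Redis.
store = {}


def hash_name(kind, website):
    """Build a two-level Hash name such as 'accounts:weibo'."""
    return '{}:{}'.format(kind, website)


def hset(name, field, value):
    """Set one field of the named Hash, creating the Hash if needed."""
    store.setdefault(name, {})[field] = value


hset(hash_name('accounts', 'weibo'), 'michael', 'python')
hset(hash_name('cookies', 'weibo'), 'michael', json.dumps({'MLOGIN': '1'}))
hset(hash_name('accounts', 'zhihu'), 'alice', 'secret')

for name in sorted(store):
    print(name, store[name])
```

Because the site name is part of the Hash name, adding a new site never touches the data of an existing one.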

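The code below creates a storage-module class that provides the basic Hash operations. Before that, the shared settings go into a config.py file so that they are not scattered across the modules. config.py looks like this: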

    # Redis host
    REDIS_HOST = '192.168.64.50'

    # Redis port
    REDIS_PORT = 6379

    # Redis password; None if no password is set
    REDIS_PASSWORD = None

    # Browser used by the generator
    BROWSER_TYPE = 'Chrome'

    # Generator classes; to support another site, register it here
    GENERATOR_MAP = {
        'weibo': 'WeiboCookiesGenerator',
    }

    # Tester classes; to support another site, register it here
    TESTER_MAP = {
        'weibo': 'WeiboValidTester',
    }

    # URL requested by the tester for each site
    TEST_URL_MAP = {
        'weibo': 'https://m.weibo.cn/api/container/getIndex?uid=1804544030&type=uid&page=1&containerid=1076031804544030',
    }

    # Cycle, in seconds, of the generator and tester loops
    CYCLE = 120

    # API host and port
    API_HOST = '0.0.0.0'
    API_PORT = 5000

    # Generator switch: simulate logins and add Cookies
    GENERATOR_PROCESS = False
    # Tester switch: repeatedly check the Cookies in the database and delete unusable ones
    VALID_PROCESS = False
    # API service switch
    API_PROCESS = True

The storage module itself is implemented as follows:

    import random
    import redis
    from cookiespool.config import *


    class RedisClient():
        def __init__(self, type, website, host=REDIS_HOST, port=REDIS_PORT, password=REDIS_PASSWORD):
            """
            Initialize the Redis connection.
            :param type: Hash type, e.g. accounts or cookies
            :param website: site name, e.g. weibo
            :param host: host
            :param port: port
            :param password: password
            """
            self.db = redis.StrictRedis(host=host, port=port, password=password, decode_responses=True)
            self.type = type
            self.website = website

        def name(self):
            """
            Get the name of the Hash.
            :return: Hash name
            """
            return "{type}:{website}".format(type=self.type, website=self.website)

        def set(self, username, value):
            """
            Set a key-value pair.
            :param username: username
            :param value: password or Cookies
            :return:
            """
            return self.db.hset(self.name(), username, value)

        def get(self, username):
            """
            Get the value for a key.
            :param username: username
            :return:
            """
            return self.db.hget(self.name(), username)

        def delete(self, username):
            """
            Delete a key-value pair by key.
            :param username: username
            :return: result of the deletion
            """
            return self.db.hdel(self.name(), username)

        def count(self):
            """
            Get the number of entries.
            :return: number of entries
            """
            return self.db.hlen(self.name())

        def random(self):
            """
            Get a random value; used to fetch random Cookies.
            :return: random Cookies
            """
            return random.choice(self.db.hvals(self.name()))

        def usernames(self):
            """
            Get all account names.
            (Named usernames() because the generator module calls it under that name.)
            :return: all usernames
            """
            return self.db.hkeys(self.name())

        def all(self):
            """
            Get all key-value pairs.
            :return: mapping of usernames to passwords or Cookies
            """
            return self.db.hgetall(self.name())


    if __name__ == '__main__':
        conn = RedisClient('accounts', 'weibo')
        result = conn.set('michael', 'python')
        print(result)

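The RedisClient class comes first. Its __init__() method takes two key parameters, type and website, which stand for the kind of data and the site name; these are the two fields joined to form the Hash name. For the Hash that stores accounts, type is accounts and website is weibo; for the Hash that stores Cookies, type is cookies and website is weibo. The remaining parameters are the Redis connection settings, which are used to create the StrictRedis object and open the connection.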

name() joins type and website to form the Hash name. set(), get(), and delete() set, read, and remove a single key-value pair in the Hash, and count() returns the number of entries in the Hash.

random() picks one set of Cookies from the Hash at random and returns it. Each call returns a random set, and the interface module uses this method to serve random Cookies.
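One edge case is worth noting: `random.choice` raises IndexError on an empty sequence, so random() fails while the Cookies Hash is still empty. The sketch below shows one possible way to guard against that, using a plain dictionary in place of the Redis Hash; `random_cookies` is a hypothetical helper, not part of the original module.

```python
import random


def random_cookies(cookies_hash):
    """Return one random Cookies value, or None if the Hash is empty."""
    values = list(cookies_hash.values())
    if not values:
        return None
    return random.choice(values)


print(random_cookies({}))                                   # nothing stored yet
print(random_cookies({'alice': '{"MLOGIN": "1"}'}))         # only one candidate
```

Returning None lets the caller decide what to do, for example answering with an error status instead of crashing the request.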

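(2) Generation module
The generation module reads the account information, simulates the logins, and then generates and saves the Cookies. It first loads both Hashes and compares the accounts Hash with the Cookies Hash to find the accounts that have no Cookies yet, then iterates over those accounts and generates Cookies for each. The detailed code is as follows: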

    import time
    from io import BytesIO
    from os import listdir
    from os.path import abspath, dirname

    from PIL import Image
    from selenium import webdriver
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Folder holding the captcha template images
    TEMPLATER_FOLDER = dirname(abspath(__file__)) + '/templates/'


    class WeiboCookies():
        def __init__(self, username, password, browser):
            self.url = 'https://passport.weibo.cn/signin/login?entry=mweibo&r=https://m.weibo.cn/'
            self.browser = browser
            self.wait = WebDriverWait(self.browser, 20)
            self.username = username
            self.password = password

        def open(self):
            """
            Open the page, enter the username and password, and click the button.
            :return: None
            """
            self.browser.delete_all_cookies()       # clear the Cookies cached by the browser first
            self.browser.get(self.url)
            username = self.wait.until(EC.presence_of_element_located((By.ID, 'loginName')))
            password = self.wait.until(EC.presence_of_element_located((By.ID, 'loginPassword')))
            submit = self.wait.until(EC.element_to_be_clickable((By.ID, 'loginAction')))
            username.send_keys(self.username)
            password.send_keys(self.password)
            time.sleep(1)
            submit.click()

        def password_error(self):
            """
            Check whether the password was rejected.
            :return:
            """
            try:
                # The text below is matched against the page's own error message, so it is kept as-is
                return WebDriverWait(self.browser, 5).until(
                    EC.text_to_be_present_in_element((By.ID, 'errorMsg'), '用戶名或密碼錯誤')
                )
            except TimeoutException:
                return False

        def login_successfully(self):
            """
            Check whether the login succeeded.
            :return:
            """
            try:
                return bool(
                    WebDriverWait(self.browser, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'lite-iconf-profile'))))
            except TimeoutException:
                return False

        def get_position(self):
            """
            Get the position of the captcha.
            :return: tuple with the captcha position
            """
            try:
                img = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'patt-shadow')))
            except TimeoutException:
                print('Captcha did not appear')
                # Reload the login page and wait again; the original left img undefined on this path
                self.open()
                img = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'patt-shadow')))
            time.sleep(2)
            location = img.location
            size = img.size
            top, bottom, left, right = location['y'], location['y'] + size['height'], location['x'], location['x'] + size['width']
            return (top, bottom, left, right)

        def get_screenshot(self):
            """
            Take a screenshot of the page.
            :return: screenshot object
            """
            screenshot = self.browser.get_screenshot_as_png()
            screenshot = Image.open(BytesIO(screenshot))
            return screenshot

        def get_image(self):
            """
            Get the captcha image.
            :return: image object
            """
            top, bottom, left, right = self.get_position()
            print('Captcha position', top, bottom, left, right)
            screenshot = self.get_screenshot()
            captcha = screenshot.crop((left, top, right, bottom))
            return captcha

        def is_pixel_equal(self, image1, image2, x, y):
            """
            Check whether two pixels are the same.
            :param image1: image 1
            :param image2: image 2
            :param x: x position
            :param y: y position
            :return: whether the pixels match
            """
            # Read the pixel at (x, y) from both images
            pixel1 = image1.load()[x, y]
            pixel2 = image2.load()[x, y]
            threshold = 20
            # The pixels match if every RGB channel differs by less than the threshold
            return (abs(pixel1[0] - pixel2[0]) < threshold
                    and abs(pixel1[1] - pixel2[1]) < threshold
                    and abs(pixel1[2] - pixel2[2]) < threshold)

        def same_image(self, image, template):
            """
            Recognize a captcha by similarity.
            :param image: captcha to recognize
            :param template: template
            :return:
            """
            # Similarity threshold
            threshold = 0.99
            count = 0
            for x in range(image.width):
                for y in range(image.height):
                    # Count the pixels that match
                    if self.is_pixel_equal(image, template, x, y):
                        count += 1
            result = float(count) / (image.width * image.height)
            if result > threshold:
                print('Match found')
                return True
            return False

        def detect_image(self, image):
            """
            Match the image against the templates.
            :param image: image
            :return: drag order
            """
            for template_name in listdir(TEMPLATER_FOLDER):
                print('Matching against', template_name)
                template = Image.open(TEMPLATER_FOLDER + template_name)
                if self.same_image(image, template):
                    # The template's file name encodes the drag order
                    numbers = [int(number) for number in list(template_name.split('.')[0])]
                    print('Drag order', numbers)
                    return numbers

        def move(self, numbers):
            """
            Drag across the points in the given order.
            :param numbers: drag order
            :return: False on failure
            """
            try:
                # Get the four points
                circles = self.browser.find_elements(By.CSS_SELECTOR, '.patt-wrap .patt-circ')
                dx = dy = 0
                for index in range(4):
                    circle = circles[numbers[index] - 1]
                    if index == 0:
                        # First iteration: press and hold on the first point
                        ActionChains(self.browser) \
                            .move_to_element_with_offset(circle, circle.size['width'] / 2, circle.size['height'] / 2) \
                            .click_and_hold().perform()
                    else:
                        # Number of small steps per drag
                        times = 30
                        # Drag
                        for i in range(times):
                            ActionChains(self.browser).move_by_offset(dx / times, dy / times).perform()
                            time.sleep(1 / times)
                    if index == 3:
                        # Last iteration: release the mouse
                        ActionChains(self.browser).release().perform()
                    else:
                        # Compute the offset to the next point (indexing circles, not circle)
                        dx = circles[numbers[index + 1] - 1].location['x'] - circle.location['x']
                        dy = circles[numbers[index + 1] - 1].location['y'] - circle.location['y']
            except Exception:
                return False

        def get_cookies(self):
            """
            Get the Cookies.
            :return:
            """
            return self.browser.get_cookies()

        def main(self):
            """
            Entry point.
            :return:
            """
            self.open()
            if self.password_error():
                return {
                    'status': 2,
                    'content': 'Wrong username or password'
                }
            # Logged in directly, no captcha required
            if self.login_successfully():
                cookies = self.get_cookies()
                return {
                    'status': 1,
                    'content': cookies
                }
            # Get the captcha image
            image = self.get_image()
            numbers = self.detect_image(image)
            self.move(numbers)
            if self.login_successfully():
                cookies = self.get_cookies()    # the value of content is a list of dicts
                return {
                    'status': 1,
                    'content': cookies
                }
            else:
                return {
                    'status': 3,
                    'content': 'Login failed'
                }


    if __name__ == '__main__':
        browser = webdriver.Chrome()
        result = WeiboCookies('qq_number@qq.com', 'password', browser).main()
        print(result)
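The template matching done by is_pixel_equal() and same_image() can be illustrated without Pillow or a browser. The sketch below represents an image as rows of RGB tuples; the function names and the two tiny sample "images" are made up for illustration, and only the comparison rule (every channel differing by less than 20) follows the listing above.

```python
def pixels_close(p1, p2, threshold=20):
    """True if every RGB channel differs by less than the threshold."""
    return all(abs(a - b) < threshold for a, b in zip(p1[:3], p2[:3]))


def similarity(image, template):
    """Fraction of positions whose pixels are close in both images."""
    total = matched = 0
    for row_a, row_b in zip(image, template):
        for p1, p2 in zip(row_a, row_b):
            total += 1
            matched += pixels_close(p1, p2)
    return matched / total


# Two made-up 2x2 images given as rows of RGB tuples.
image = [[(10, 10, 10), (200, 200, 200)], [(0, 0, 0), (255, 255, 255)]]
template = [[(15, 12, 9), (190, 205, 199)], [(0, 0, 0), (100, 100, 100)]]

score = similarity(image, template)
print(score, score > 0.99)   # three of four pixels match, below the 0.99 bar
```

With the 0.99 threshold used in same_image(), a template is accepted only when almost every pixel agrees, which is why the captcha screenshot and the stored templates need to be captured at the same size.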

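The WeiboCookies class first deals with Sina Weibo's four-point pattern captcha. Its main() method calls the Cookies-fetching method and returns a different result for each situation. The result is a dictionary that carries a status code, status, so the generation module can react differently to each code. Status 1 means the Cookies were obtained successfully and only need to be saved to the database. Status 2 means the username or password is wrong, in which case the account stored in the database should be deleted. Status 3 means the login failed; it is then unclear whether the credentials are wrong, and no Cookies were obtained, so the module just prints a notice and moves on to the next account. The complete implementation is as follows: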

    import json
    from selenium import webdriver
    from selenium.webdriver import DesiredCapabilities
    from cookiespool.config import *
    from redisdb import RedisClient
    from login.weibo.cookies import WeiboCookies


    class CookiesGenerator():
        def __init__(self, website='default'):
            """
            Base class; initializes some objects.
            :param website: site name
            """
            self.website = website
            self.cookies_db = RedisClient('cookies', self.website)      # Redis connections; the arguments form the Hash name
            self.accounts_db = RedisClient('accounts', self.website)
            self.init_browser()

        def __del__(self):
            self.close()

        def init_browser(self):
            """
            Initialize the shared browser used for simulated logins, based on BROWSER_TYPE.
            :return:
            """
            if BROWSER_TYPE == 'PhantomJS':
                caps = DesiredCapabilities.PHANTOMJS
                caps["phantomjs.page.settings.userAgent"] = \
                    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
                self.browser = webdriver.PhantomJS(desired_capabilities=caps)
                self.browser.set_window_size(1300, 500)
            elif BROWSER_TYPE == 'Chrome':
                self.browser = webdriver.Chrome()

        def new_cookies(self, username, password):
            """
            Generate new Cookies; subclasses must override this.
            :param username: username
            :param password: password
            :return:
            """
            raise NotImplementedError

        def process_cookies(self, cookies):
            """
            Reduce the browser's cookie records to a name -> value dict.
            :param cookies: list of cookie dicts
            :return: dict of cookie names and values
            """
            cookies_dict = {}
            for cookie in cookies:
                cookies_dict[cookie['name']] = cookie['value']
            return cookies_dict

        def run(self):
            """
            Get all account names, then log in to each one in turn.
            :return:
            """
            accounts_usernames = self.accounts_db.usernames()
            cookies_usernames = self.cookies_db.usernames()

            for username in accounts_usernames:
                if username not in cookies_usernames:
                    password = self.accounts_db.get(username)
                    print('Generating Cookies', 'account', username, 'password', password)
                    result = self.new_cookies(username, password)
                    # Success
                    if result.get('status') == 1:
                        cookies = self.process_cookies(result.get('content'))
                        print('Got Cookies', cookies)
                        if self.cookies_db.set(username, json.dumps(cookies)):
                            print('Cookies saved')
                    # Wrong password: remove the account
                    elif result.get('status') == 2:
                        print(result.get('content'))
                        if self.accounts_db.delete(username):
                            print('Account deleted')
                    else:
                        print(result.get('content'))
            else:
                # for/else: runs once the loop has finished
                print('All accounts already have Cookies')

        def close(self):
            """
            Close the browser.
            :return:
            """
            try:
                print('Closing Browser')
                self.browser.close()
                del self.browser
            except (TypeError, AttributeError):
                # AttributeError covers a browser that was never opened or already removed
                print('Browser not opened')


    class WeiboCookiesGenerator(CookiesGenerator):
        def __init__(self, website='weibo'):
            """
            Initialization.
            :param website: site name
            """
            CookiesGenerator.__init__(self, website)
            self.website = website

        def new_cookies(self, username, password):
            """
            Generate Cookies.
            :param username: username
            :param password: password
            :return: result dict with status and content
            """
            # Uses WeiboCookies from cookies.py in the login module; self.browser comes from the base class
            return WeiboCookies(username, password, self.browser).main()


    if __name__ == '__main__':
        # website is the site name used in the Hash names (accounts:weibo, cookies:weibo), not a URL
        generator = WeiboCookiesGenerator(website='weibo')
        generator.run()

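To support another site, implement its new_cookies() method and return the simulated-login result in the same format, for example 1 for success and 2 for a wrong username or password.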
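The run() method above comes down to three steps: find the accounts that have no Cookies yet, reduce the browser's cookie records to a name/value mapping, and act on the status code. The sketch below replays those steps with plain dictionaries in place of Redis and made-up login results in place of a real browser; all function names and sample data are hypothetical.

```python
import json


def pending_accounts(accounts, cookies):
    """Usernames that have a stored password but no Cookies yet."""
    return [name for name in accounts if name not in cookies]


def to_cookie_dict(browser_cookies):
    """Reduce Selenium-style cookie records to a name -> value mapping."""
    return {c['name']: c['value'] for c in browser_cookies}


def handle_result(username, result, accounts, cookies):
    """Apply the status convention: 1 = save, 2 = drop the account, else skip."""
    if result['status'] == 1:
        cookies[username] = json.dumps(to_cookie_dict(result['content']))
    elif result['status'] == 2:
        accounts.pop(username, None)


accounts = {'alice': 'pw1', 'bob': 'pw2', 'carol': 'pw3'}
cookies = {'alice': '{"MLOGIN": "1"}'}
# Made-up login results standing in for a site-specific new_cookies() call.
results = {
    'bob': {'status': 1, 'content': [{'name': 'MLOGIN', 'value': '1', 'path': '/'}]},
    'carol': {'status': 2, 'content': 'Wrong username or password'},
}
for username in pending_accounts(accounts, cookies):
    handle_result(username, results[username], accounts, cookies)

print(sorted(accounts), sorted(cookies))
```

A generator for a new site only has to produce result dictionaries of this shape; the bookkeeping around them stays the same.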

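(3) Detection module
Cookies can expire with age, and Cookies that are used too often can stop working for normal page requests. Such Cookies need to be cleaned up or replaced promptly. A periodic detection module therefore iterates over all Cookies in the pool and requests a test URL with each of them. If the request succeeds, that is, the status code is acceptable, the Cookies are valid. If the request fails or does not return normal data, for example because it is redirected to a login or verification page, the Cookies are invalid and must be removed from the database.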

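Once Cookies have been removed, the generation module notices that the Cookies Hash lacks an entry that the accounts Hash has. It treats that account as not having Cookies yet and logs in with it again, which refreshes the account's Cookies.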

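The detection module's main job is to detect expired Cookies and remove them from the database. To keep it general and extensible, a base tester class declares the shared parts first: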

    import json
    import requests
    from requests.exceptions import ConnectionError
    from cookiespool.config import *
    from redisdb import *


    class ValidTester():
        def __init__(self, website='default'):
            self.website = website
            self.cookies_db = RedisClient('cookies', self.website)
            self.accounts_db = RedisClient('accounts', self.website)

        def test(self, username, cookies):
            """Implemented by subclasses so that new sites are easy to add."""
            raise NotImplementedError

        def run(self):
            cookies_groups = self.cookies_db.all()
            for username, cookies in cookies_groups.items():
                self.test(username, cookies)        # test() is provided by the subclass


    class WeiboValidTester(ValidTester):
        """Tester for Weibo; to test another site, create a matching class that inherits from ValidTester."""

        def __init__(self, website='weibo'):
            ValidTester.__init__(self, website)

        def test(self, username, cookies):
            print('Testing Cookies', 'username', username)
            try:
                cookies = json.loads(cookies)
            except (TypeError, ValueError):
                # json.loads raises ValueError (JSONDecodeError) for malformed JSON
                print('Malformed Cookies', username)
                self.cookies_db.delete(username)
                print('Deleted Cookies', username)
                return
            # Reached only if the Cookies parsed without an exception
            try:
                test_url = TEST_URL_MAP[self.website]
                response = requests.get(test_url, cookies=cookies, timeout=5, allow_redirects=False)
                if response.status_code == 200:
                    print('Cookies valid', username)
                else:
                    print(response.status_code, response.headers)
                    print('Cookies expired', username)
                    self.cookies_db.delete(username)
                    print('Deleted Cookies', username)
            except ConnectionError as e:
                print('Exception occurred', e.args)


    if __name__ == '__main__':
        WeiboValidTester().run()

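This code defines a base class, ValidTester. Its __init__() method records the site name, website, and creates two storage-module connections, cookies_db and accounts_db, which operate on the Cookies Hash and the accounts Hash respectively. run() is the entry point: it iterates over all Cookies and calls test() on each. test() is implemented by subclasses, each responsible for one site. For Weibo the subclass is WeiboValidTester, which inherits from ValidTester and implements its own test() method to check whether Weibo Cookies are valid and act accordingly.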

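The subclass's test() method first converts the Cookies into a dictionary, which doubles as a format check. If the format is wrong, the Cookies are deleted immediately; otherwise they are used to request the test URL. Here the site under test is Weibo, and the test URL can be one of its Ajax endpoints. To keep this configurable, the test URLs are also defined as a dictionary, for example:
    TEST_URL_MAP = {'weibo': 'https://m.weibo.cn/'}
To test other sites, add them to the same dictionary. For Weibo, the target is requested with the Cookies, with redirects disabled and a timeout set, and the status code of the response is checked. A status of 200 means the Cookies are valid. A 302 redirect or similar usually leads to the login page and means the Cookies have expired, in which case they are removed from the Cookies Hash.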
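The decision logic of test() can be separated from the network call, which makes it easy to try out. In the sketch below, `check_cookies` is a hypothetical helper and `fake_fetch` is a made-up stand-in for the real HTTP request; a real `fetch_status` would send the Cookies to the test URL without following redirects and return the status code.

```python
import json


def check_cookies(raw_cookies, fetch_status):
    """Classify stored Cookies as 'valid', 'expired', or 'malformed'."""
    try:
        cookies = json.loads(raw_cookies)
    except (TypeError, ValueError):
        return 'malformed'
    return 'valid' if fetch_status(cookies) == 200 else 'expired'


def fake_fetch(cookies):
    """Pretend the site redirects (302) unless a login marker is present."""
    return 200 if cookies.get('MLOGIN') == '1' else 302


for raw in ('{"MLOGIN": "1"}', '{"MLOGIN": "0"}', 'not json', None):
    print(repr(raw), '->', check_cookies(raw, fake_fetch))
```

In the pool, both 'expired' and 'malformed' lead to the same action: delete the entry so that the generation module logs in again.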

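(4) Interface module
With the generation and detection modules running on a schedule, the Cookies are checked and refreshed continuously. In the end, though, the Cookies are meant for crawlers, and one pool may serve several crawlers, so a web interface is needed: a crawler requests the interface and receives a random set of Cookies. The interface is built with Flask: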

    import json
    from flask import Flask, g
    from cookiespool.config import *
    from redisdb import *

    __all__ = ['app']

    app = Flask(__name__)


    @app.route('/')
    def index():
        return '<h2>Welcome to Cookie Pool System</h2>'


    def get_conn():
        """
        Attach a Redis client for each configured site to Flask's g object.
        :return: g
        """
        for website in GENERATOR_MAP:
            print(website)
            # Check the attribute that is actually set below
            if not hasattr(g, website + '_cookies'):
                setattr(g, website + '_cookies', RedisClient('cookies', website))
                setattr(g, website + '_accounts', RedisClient('accounts', website))
        return g


    @app.route('/<website>/random')
    def random(website):
        """
        Get random Cookies, e.g. via /weibo/random
        :param website: site name
        :return: random Cookies
        """
        g = get_conn()
        cookies = getattr(g, website + '_cookies').random()
        return cookies


    @app.route('/<website>/add/<username>/<password>')
    def add(website, username, password):
        """
        Add an account, e.g. via /weibo/add/user/password
        :param website: site name
        :param username: username
        :param password: password
        :return:
        """
        g = get_conn()
        print(username, password)
        getattr(g, website + '_accounts').set(username, password)
        return json.dumps({'status': '1'})


    @app.route('/<website>/count')
    def count(website):
        """
        Get the total number of Cookies.
        """
        g = get_conn()
        count = getattr(g, website + '_cookies').count()
        return json.dumps({'status': '1', 'count': count})


    if __name__ == '__main__':
        app.run(host='127.0.0.1')

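The random() view is written generically so that it can serve different sites. The first segment of the URL path is the site name and the second is the operation: /weibo/random returns random Weibo Cookies and /zhihu/random returns random Zhihu Cookies.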
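On the crawler side, using the interface means building the endpoint URL and turning the returned JSON string into something a request can carry. The sketch below shows both steps with the standard library only; `random_cookies_url` and `cookie_header` are hypothetical helpers, the base URL assumes the API is listening locally on port 5000, and the response body is a made-up example shaped like what the storage module saves.

```python
import json


def random_cookies_url(api_base, website):
    """URL of the pool's random-Cookies endpoint for one site."""
    return '{}/{}/random'.format(api_base.rstrip('/'), website)


def cookie_header(cookies_json):
    """Turn the JSON string returned by the pool into a Cookie header value."""
    cookies = json.loads(cookies_json)
    return '; '.join('{}={}'.format(k, v) for k, v in cookies.items())


print(random_cookies_url('http://127.0.0.1:5000/', 'weibo'))
# Made-up response body for illustration.
print(cookie_header('{"MLOGIN": "1", "SUB": "abc"}'))
```

A crawler that uses requests could instead pass the parsed dictionary straight to the `cookies` argument of its request call.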

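(5) Scheduling module
Finally, a scheduling module makes the other modules work together. Its main job is to drive them on a schedule, with each module running in its own process: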

    import time
    from multiprocessing import Process

    from cookiesapi import app
    from cookiespool.config import *
    from cookiespool.generator import *
    from cookiespool.tester import *


    class Scheduler(object):

        @staticmethod
        def valid_cookie(cycle=CYCLE):
            while True:
                print('Cookies tester process started')
                try:
                    for website, cls in TESTER_MAP.items():
                        # Builds e.g. WeiboValidTester(website="weibo"); the stray extra quote is removed
                        tester = eval(cls + '(website="' + website + '")')
                        tester.run()
                        print('Cookies testing finished')
                        del tester
                        time.sleep(cycle)
                except Exception as e:
                    print(e.args)

        @staticmethod
        def generate_cookie(cycle=CYCLE):
            while True:
                print('Cookies generator process started')
                try:
                    for website, cls in GENERATOR_MAP.items():
                        generator = eval(cls + '(website="' + website + '")')
                        generator.run()
                        print('Cookies generation finished')
                        generator.close()
                        time.sleep(cycle)
                except Exception as e:
                    print(e.args)

        @staticmethod
        def api():
            print('API service started')
            app.run(host=API_HOST, port=API_PORT)

        def run(self):
            # Start one process per enabled module
            if API_PROCESS:
                api_process = Process(target=Scheduler.api)
                api_process.start()

            if GENERATOR_PROCESS:
                generate_process = Process(target=Scheduler.generate_cookie)
                generate_process.start()

            if VALID_PROCESS:
                valid_process = Process(target=Scheduler.valid_cookie)
                valid_process.start()

Two important settings used in this code are the dictionaries that map each site to its generator class and its tester class. They live in the config module:

    # Generator classes; to support another site, register it here
    GENERATOR_MAP = {
        'weibo': 'WeiboCookiesGenerator',
    }

    # Tester classes; to support another site, register it here
    TESTER_MAP = {
        'weibo': 'WeiboValidTester',
    }

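This layout makes dynamic extension easy: each key is a site name and each value is a class name. To configure another site, add it to the dictionary. For example, to add a generator for Zhihu: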

    GENERATOR_MAP = {
        'weibo': 'WeiboCookiesGenerator',
        'zhihu': 'ZhihuCookiesGenerator',
    }

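The Scheduler class iterates over these dictionaries, uses eval() to create an object of each class, and calls the object's run() entry point to run the module. The modules run in separate processes through the Process class from multiprocessing; calling start() on each Process launches it.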
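Because eval() executes whatever string it is given, some projects prefer to look classes up in an explicit table instead. The sketch below shows that alternative under stated assumptions: the `REGISTRY` dictionary and `create` helper are hypothetical, and `WeiboValidTester` here is a tiny stand-in class, not the real tester.

```python
class WeiboValidTester:
    """Stand-in for the real tester class, for illustration only."""

    def __init__(self, website='weibo'):
        self.website = website

    def run(self):
        return 'tested ' + self.website


TESTER_MAP = {'weibo': 'WeiboValidTester'}

# Explicit table of the classes that config strings may refer to.
REGISTRY = {'WeiboValidTester': WeiboValidTester}


def create(class_name, website):
    """Instantiate a configured class by name without calling eval()."""
    try:
        cls = REGISTRY[class_name]
    except KeyError:
        raise ValueError('unknown class in config: ' + class_name)
    return cls(website=website)


for website, class_name in TESTER_MAP.items():
    print(create(class_name, website).run())
```

The trade-off is one extra line per class in the registry, in exchange for a clear error when the configuration names a class that does not exist.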

Finally, each module needs a switch that can be turned on or off in the configuration file:

    # Generator switch: simulate logins and add Cookies
    GENERATOR_PROCESS = False
    # Tester switch: repeatedly check the Cookies in the database and delete unusable ones
    VALID_PROCESS = False
    # API service switch
    API_PROCESS = True

A value of True turns a module on and False turns it off. For the code to do anything useful, accounts and passwords also have to be imported, so one more module handles that:

    from redisdb import RedisClient

    conn = RedisClient('accounts', 'weibo')


    def set(account, sep='----'):
        # Each input line has the form username----password
        username, password = account.split(sep)
        result = conn.set(username, password)
        print('account', username, 'password', password)
        print('Saved' if result else 'Save failed')


    def scan():
        print('Enter account/password pairs; type exit to stop')
        while True:
            account = input()
            if account == 'exit':
                break
            set(account)


    if __name__ == '__main__':
        scan()
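One caveat about set() in the listing above: unpacking the result of `account.split(sep)` into two names raises ValueError whenever the line does not contain exactly one separator. The sketch below shows a more forgiving parser; `parse_account` is a hypothetical helper, and treating everything after the first separator as the password is an assumption made for this example.

```python
def parse_account(line, sep='----'):
    """Split 'username----password' into a pair, or return None if malformed."""
    username, found, password = line.strip().partition(sep)
    if not found or not username or not password:
        return None
    return username, password


for line in ('michael----python', 'no-separator', '----missing-user'):
    print(repr(line), '->', parse_account(line))
```

A caller can then skip lines that come back as None instead of stopping the whole import on the first typo.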

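Running the import module stores the entered accounts and passwords in the Redis database. The last piece is an entry-point module for the whole program. It is very simple and mainly calls the scheduling module's run() method: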

    from cookiespool.scheduler import Scheduler


    def main():
        s = Scheduler()
        s.run()


    if __name__ == '__main__':
        main()

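In testing, the code ran successfully and every module started normally. The tester module checked the Cookies one by one, the generation module obtained Cookies for the accounts that did not have any yet, and the modules ran in parallel without interfering with one another. The test used a single account, and the console output (translated) was as follows: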

    Cookies tester process started
    API service started
     * Serving Flask app "cookiesapi" (lazy loading)
     * Environment: production
       WARNING: Do not use the development server in a production environment.
       Use a production WSGI server instead.
     * Debug mode: off
    Cookies testing finished
    Cookies generator process started
     * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
    Generating Cookies account 1234567890 password abcd1234 (the account and password shown here are not the real ones from the output)
    Got Cookies {'M_WEIBOCN_PARAMS': 'uicode%3D10000011%26fid%3D102803', 'MLOGIN': '1', ...(rest omitted)}
    Cookies saved
    All accounts already have Cookies
    Cookies generation finished
    Closing Browser