Let's try scraping the article history of a WeChat official account.
The main fields to collect: title, description, author, comment count, read count, "Wow" (zai-kan) count, publish time, and article link.
Fiddler
WeChat PC client
Python 3 is used; set up the environment yourself.
First, open Fiddler, use your WeChat client a bit and visit a few official accounts, then look at the captured requests. Here I open the article history of the 菜鳥教程 (Runoob) account; the screen looks like this:
Find the WeChat-related requests in Fiddler, as shown below:
If there are too many requests, set a filter on "mp.weixin.qq.com" so that only traffic from WeChat is shown.
Going through each of the requests found above, we can see that the second one, "/mp/profile_ext?action=home...", contains some data.
This is the page in the screenshot above; it contains a JS variable named 'msgList' that seems to hold the page content as JSON.
In other words, once we fetch this page we have the article list.
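Extracting that variable is a one-regex job. Below is a minimal sketch; the quoting and HTML-entity escaping are assumptions based on how the page looked in Fiddler, and the sample string is fabricated for illustration:

```python
import html
import json
import re

def extract_msg_list(page_html):
    # msgList is embedded as a single-quoted JS string; this pattern is an
    # assumption about the page layout and may need adjusting.
    m = re.search(r"var msgList = '(.*?)';", page_html, re.S)
    if not m:
        return None
    raw = html.unescape(m.group(1))  # the embedded JSON is entity-escaped
    return json.loads(raw)

# Fabricated stand-in for the real page source:
sample = ("var msgList = '{&quot;list&quot;:"
          "[{&quot;app_msg_ext_info&quot;:{&quot;title&quot;:&quot;demo&quot;}}]}';")
print(extract_msg_list(sample)["list"][0]["app_msg_ext_info"]["title"])  # demo
```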
**Note: on the TextView tab, Fiddler may tell you the body has not been decoded; just click to decode it.
So can we start writing code now? Not so fast.
We know this page contains data, but not yet how to get the next page, so first scroll the article list down to the second page. (You need to follow the account at this point.)
We can see that the second-page request to the same URL now returns JSON.
JSON data seems much more convenient than parsing a web page.
Let's look at this request in detail.
By inspection we can tell it is a GET request, and with some experimenting the meaning of its parameters becomes clear:
**As a rule, for parameters you don't understand, try dropping them or hard-coding them; in many cases you can still get the data.
Now for the code.
```python
# encoding=utf-8
# date: 2019/4/26
__author__ = "Masako"

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    'Host': 'mp.weixin.qq.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI '
                  'WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
    'X-Requested-With': 'XMLHttpRequest',
}

biz = 'MzA5NDIzNzY1OQ=='  # official account id
uin = 'MjM4NTIzNzQ5MQ=='  # user id
key = '08039a5457341b11f0c0b7e68e3cda9f6cbf593f925e8716293a13998bece633ea775eeb0159' \
      'a1183ca88d27b3060f6fc2c3428ef633f851029a64fa0638e41d111e13dce78055e01a39d3d0fdd2f657'  # changes over time
pass_ticket = 'dKBE2K1SSAJHmrnd8fMJpWD6j52ASjpQfBiMjm74DyZd1Y7TsoOD/25GgM80trTX'  # does not seem to matter much

offset = 0
pagesize = 10

proxies = {
    'https': '218.86.87.171:53281'
}

url = "https://mp.weixin.qq.com/mp/profile_ext"
params = {
    "action": "getmsg",
    "__biz": biz,
    "f": "json",
    "offset": offset,
    "count": pagesize,
    "is_ok": 1,
    "scene": 124,
    "uin": uin,
    "key": key,
    "pass_ticket": pass_ticket,
    "wxtoken": "",
    # "appmsg_token": appmsg_token,
}

response = requests.get(url, params=params, headers=headers, proxies=proxies, verify=False)
print(response.text)
```
The parameters here were copied straight from Fiddler. Since Fiddler is still running, verify=False is passed, and urllib3 is told to suppress the resulting warning.
Most of the parameters in this code can stay as they are; the exception is key, which expires after roughly ten-odd minutes. Also, once you have collected a fair number of articles (a few hundred, it seems?) the IP gets banned, and an IP proxy is needed; that is what proxies is for.
To turn pages, simply change offset.
You can wrap this up in functions however you like.
The result of a run:
We get a JSON response; parsing the article list inside it yields each article's title, cover image, author, link, and so on.
The JSON also contains "next_offset" (where the next page starts) and "can_msg_continue" (whether there are more pages to fetch), which help with paginated crawling.
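Those two fields are enough to drive a pagination loop: feed next_offset back in until can_msg_continue goes to 0. A sketch, exercised with a stub fetcher instead of the real request (the field names come from the response above; the stub data is invented):

```python
import json

def crawl_all_pages(fetch_page, start_offset=0):
    """fetch_page(offset) -> parsed response dict; returns every list entry."""
    articles = []
    offset = start_offset
    while True:
        ret = fetch_page(offset)
        # general_msg_list is itself a JSON string inside the response
        articles.extend(json.loads(ret["general_msg_list"])["list"])
        if not ret.get("can_msg_continue"):
            break
        offset = ret["next_offset"]
    return articles

# Stub simulating two pages with one article each:
def fake_fetch(offset):
    return {
        "general_msg_list": json.dumps({"list": [{"id": offset}]}),
        "can_msg_continue": 1 if offset == 0 else 0,
        "next_offset": offset + 10,
    }

print(len(crawl_all_pages(fake_fetch)))  # 2
```

In the real crawler a pause between pages is advisable, since frequent access gets the IP banned, as noted above.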
At this point we only have each article's basic info and its link. Fetching that link with a GET also returns the article content, but still not the read count, so more analysis is needed.
Now pick any article in the history list and open it.
We can see that the article link is requested first, the same link obtained from the article list in the previous step. It contains the article content, and we can request it ourselves.
Below the article request there is one whose URL contains "getappmsgext". Opening it shows a JSON response, and in that JSON we find read_num.
Comparing with the page confirms that read_num is the read count. There are also like_num (the "Wow" count), comment_count (the comment count; this seems to be the total rather than the number displayed), and more.
So this is the request we want. The analysis is the same as for the list above, so here is the request code directly:
```python
# encoding=utf-8
# date: 2019/5/15
__author__ = "Masako"

import time
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

proxies = {
    'https': '218.86.87.171:53281'
}

headers = {
    'CSP': 'active',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat '
                  'QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
    'X-Requested-With': 'XMLHttpRequest',
}

biz = 'MzA5NDIzNzY1OQ=='  # official account id
uin = 'MjM4OTIzNzY1OQ=='  # user id
key = '333b7957c9b8367188f9a405069beed8a92625eae5e601ffda55443a53b7779af3d96bcd7' \
      'f992fb9f12557105abab467a55862681e76178e39b239a57d0c9aef7b324eb5fd1ae706b3aeef6c8f9d31a4'
pass_ticket = 'dKBE2K1SSAJHmrnd8fMJpWD6j52ASjpQfBiMjm74DyZd1Y7TsoOD/25GgM80trTX'

url = "https://mp.weixin.qq.com/mp/getappmsgext"
params = {
    "mock": "",
    "f": "json",
    "uin": uin,
    "key": key,
    "pass_ticket": pass_ticket,
    "wxtoken": "777",
    "devicetype": "Windows%26nbsp%3B10",
    # "appmsg_token": appmsg_token,
}

t = int(time.time())

# The values below are copied for now; how to obtain them comes later.
appmsg_type = "9"
msg_title = "%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84%E6%97%A5%E5%B8%B8%E5%A4%A7%E6%8F%AD%E9%9C%B2%EF%BC%8C" \
            "%E5%A4%AA%E7%9C%9F%E5%AE%9E%E4%BA%86%EF%BC%81"
req_id = "1516dM576eEqb9OJ50G0ECvJ"
comment_id = "802341523856785408"
mid = "2735613806"
sn = "48862f1fb98b5d1a0550ce27594f1361"
idx = "1"
scene = "38"
appmsg_like_type = "2"

data = {
    # "r": "0.48046619608066976",
    "__biz": biz,                # official account id
    "appmsg_type": appmsg_type,  # message type
    "mid": mid,                  # a parameter
    "sn": sn,                    # a parameter
    "idx": idx,
    "scene": scene,              # a number
    "title": msg_title,          # article title
    "comment_id": comment_id,    # comment id
    "ct": t,                     # timestamp
    "pass_ticket": pass_ticket,  # a parameter
    "req_id": req_id,            # a parameter
    "abtest_cookie": "",
    "devicetype": "Windows+10",
    "version": "62060728",
    "is_need_ticket": "0",
    # the following flags can simply be hard-coded
    "is_need_ad": "0",
    "is_need_reward": "1",
    "both_ad": "0",
    "send_time": "",
    "msg_daily_idx": "1",
    "is_original": "0",
    "is_only_read": "1",
    "is_temp_url": "0",
    "item_show_type": "0",
    "tmp_version": "1",
    "more_read_type": "0",
    "appmsg_like_type": appmsg_like_type,
}

response = requests.post(url, params=params, data=data, headers=headers, proxies=proxies, verify=False)
print(response.text)
```
Running this code returns a JSON response.
The read count and the other figures are all in there. Next, let's fix one problem with the code above: the parameters.
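Pulling the counts out of that JSON can be isolated into a small parser. A sketch against a fabricated response; the nesting of read_num and like_num under appmsgstat matches what the integrated code later relies on, but the top-level position of comment_count is an assumption here, and the numbers are made up:

```python
import json

def parse_stats(resp_text):
    """Extract the counts from a getappmsgext-style JSON response."""
    data = json.loads(resp_text)
    stat = data.get("appmsgstat", {})  # read/like counts live under appmsgstat
    return {
        "read_num": stat.get("read_num", 0),
        "like_num": stat.get("like_num", 0),
        "comment_count": data.get("comment_count", 0),  # top-level location is an assumption
    }

# Fabricated response for illustration:
sample_resp = json.dumps({
    "appmsgstat": {"read_num": 4523, "like_num": 36, "show_read": 1},
    "comment_count": 7,
})
print(parse_stats(sample_resp))  # {'read_num': 4523, 'like_num': 36, 'comment_count': 7}
```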
Some of the parameters in that code are variables that differ from article to article. In the example they were simply copied over; in practice they have to be obtained fresh each time, since copying them by hand would defeat the purpose.
Go back to the article content page from before.
Pick any of the parameters and search for it in the page source: the definitions are all there, e.g. comment_id in the screenshot above.
The same goes for the other parameters, and some of them also appear in the article link itself, so either place works as a source.
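For the ones that appear in the link (mid, sn, idx and friends), the standard library can read them straight off the query string, no regex required. A sketch using a shortened, made-up article URL of the same shape as the real ones:

```python
from urllib.parse import urlparse, parse_qs

# Shortened example link; real links carry many more query parameters.
art_url = ("https://mp.weixin.qq.com/s?__biz=MzA5NDIzNzY1OQ==&mid=2735613806"
           "&idx=1&sn=48862f1fb98b5d1a0550ce27594f1361&scene=38")

qs = parse_qs(urlparse(art_url).query)  # each value comes back as a list
mid, sn, idx = qs["mid"][0], qs["sn"][0], qs["idx"][0]
print(mid, sn, idx)  # 2735613806 48862f1fb98b5d1a0550ce27594f1361 1
```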
Since these are all JavaScript variable definitions, we choose to fetch the page and pull them out directly with regular expressions. Code below:
```python
# encoding=utf-8
# date: 2019/5/15
__author__ = "Masako"

import re
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    'Host': 'mp.weixin.qq.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI '
                  'WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
    'X-Requested-With': 'XMLHttpRequest',
}

# link obtained from the article list, with the domain prepended
art_url = 'https://mp.weixin.qq.com/s?__biz=MzA5NDIzNzY1OQ==&mid=2735613806&idx=1&sn=48862f1fb98b5d1a0550ce27594f1361&chksm=b6ab21da81dca8cca0ed20d529a9f550a98f751b326754ef57cd02e5b261ce43fc628d5bf9db&scene=38&key=33ba9b7dde092b04c3cefb3cd24fa4be6815ea9c7ca566093b935014bef02d21bc4c1c28ba937ffdae3935020224da51188ae48f135981b067d3bf1ac5397375ef58670a5e9fcffdeefb069b04876363&ascene=7&uin=MjM4OTI0MzQ5MQ%3D%3D&devicetype=Windows+10&version=62060739&lang=zh_CN&pass_ticket=zGPZpVX8Mp%2BRMvVPKZF6Ci4MecfwbAppLGWvSu3bNP01O8gMXkV7%2B4pMIzep9g30&winzoom=1'

# fetch the page
response = requests.get(art_url, headers=headers, verify=False)
content = response.text

# extract the needed parameters with regular expressions
appmsg_type = re.findall(r'appmsg_type = "(\d+)"', content)[0]
msg_title = re.findall(r'msg_title = "(.*?)"', content)[0]
req_id = re.findall(r"req_id = '(.*?)'", content)[0]
comment_id = re.findall(r'comment_id = "(.*?)"', content)[0]
appmsg_like_type = re.findall(r'appmsg_like_type = "(.*?)"', content)[0]
scene = re.findall(r'var source = "(.*?)"', content)[0]

print(msg_title)
```
(The code does no error handling; the regexes may raise.)
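A thin wrapper keeps a single missing variable from killing the run: return a default instead of indexing into an empty findall result. A sketch (the sample page fragment is fabricated):

```python
import re

def find_first(pattern, text, default=""):
    """Like re.findall(...)[0], but returns `default` when nothing matches
    instead of raising IndexError."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else default

page = 'var comment_id = "802341523856785408" || "" ;'  # fabricated fragment
print(find_first(r'comment_id = "(.*?)"', page))    # 802341523856785408
print(find_first(r"req_id = '(.*?)'", page, "N/A"))  # N/A
```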
This code prints out some of the values.
Now let's put the pieces together.
```python
# encoding=utf-8
# date: 2019/5/15
__author__ = "Masako"

import re
import json
import time
import html
import requests

from Elise.crawler import Crawler  # the author's own crawler helper module

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


class GZHSpider:

    def __init__(self):
        self.biz = ""
        self.uin = ""
        self.key = ""
        self.pass_ticket = ""
        self.proxies = {}
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
                          ' Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI '
                          'WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
        }

    def get_art_list(self, offset=0, pagesize=10):
        """
        Fetch the article list.
        Only the paging arguments vary per call; the rest is fixed and set at init time.
        :param offset: int, offset (works like a page index), taken from the previous page
        :param pagesize: int, items per page, default 10
        :return: the JSON data returned by the request
        """
        url = "https://mp.weixin.qq.com/mp/profile_ext"
        result = {}
        # offset = page * pagesize
        params = {
            "action": "getmsg",
            "__biz": self.biz,
            "f": "json",
            "offset": offset,
            "count": pagesize,
            "is_ok": 1,
            "scene": '38',
            "uin": self.uin,
            "key": self.key,
            "pass_ticket": self.pass_ticket,
            "wxtoken": "",
        }
        try:
            response = requests.get(url, params=params, headers=self.headers, proxies=self.proxies, verify=False)
        except Exception as e:
            result['code'] = 1
            result['msg'] = str(e)
            return result

        try:
            data = json.loads(response.text)
            data['code'] = 0
            return data
        except json.decoder.JSONDecodeError as e:
            result['code'] = 2
            result['msg'] = str(e)
            return result

    def get_art_page(self, art_url):
        """
        Fetch the article page to get the data needed for the read-count request.
        :param art_url: str, article link
        :return:
        """
        result = {}
        try:
            response = requests.get(art_url, headers=self.headers, proxies=self.proxies, verify=False)
            # print(response.text)
        except Exception as e:
            result['code'] = 1
            result['msg'] = str(e)
            return result

        # handle article-page errors
        try:
            if '訪問過於頻繁' in response.text:  # "access too frequent": switch ip
                result['code'] = 4
                result['msg'] = "ip banned"
                return result
            if '沒法查看' in response.text:  # article deleted, or taken down after reports
                result['code'] = 5
                result['msg'] = "content violation"
                return result
            data = self.parse_art_page(response.text)
        except Exception as e:  # any other error that breaks parsing
            result['code'] = 2
            result['msg'] = str(e)
            return result

        result['data'] = data
        result['code'] = 0
        return result

    @staticmethod
    def parse_art_page(content):
        """
        Parse the article html.
        :param content: str, the html of the article page
        :return:
        """
        def get_value(s, name):
            value_str = re.findall(r'var %s = (.*?);' % name, s)[0]
            pattern = re.compile(r'"(.*?)"')
            r_list = re.findall(pattern, value_str)
            for i in r_list:
                if i:
                    return i
            else:
                return ''

        # grab them directly with regexes
        appmsg_type = re.findall(r'appmsg_type = "(\d+)"', content)[0]
        msg_title = re.findall(r'msg_title = "(.*?)"', content)[0]
        req_id = re.findall(r"req_id = '(.*?)'", content)[0]
        comment_id = re.findall(r'comment_id = "(.*?)"', content)[0]

        mid = get_value(content, 'mid')
        sn = get_value(content, 'sn')
        idx = get_value(content, 'idx')
        scene = re.findall(r'var source = "(.*?)"', content)[0]
        publish_time = re.findall(r'var publish_time = "(.*?)"', content)[0]

        appmsg_like_type = re.findall(r'appmsg_like_type = "(.*?)"', content)[0]

        params = {
            "appmsg_type": appmsg_type,
            "msg_title": msg_title,
            "publish_time": publish_time,
            "mid": mid,
            "sn": sn,
            "idx": idx,
            "scene": scene,
            "req_id": req_id,
            "comment_id": comment_id,
            "appmsg_like_type": appmsg_like_type,
        }
        return params

    def get_art_about(self, params_data):
        """
        Fetch the read count, like count and related info.
        :param params_data: dict, the per-article parameters
        :return:
        """
        url = "https://mp.weixin.qq.com/mp/getappmsgext"
        result = {}
        params = {
            "mock": "",
            "f": "json",
            "uin": self.uin,
            "key": self.key,
            "pass_ticket": self.pass_ticket,
            "wxtoken": "777",
            "devicetype": "Windows%26nbsp%3B10",
            # "appmsg_token": appmsg_token,
        }
        t = int(time.time())
        # title = requests.utils.quote(title)
        data = {
            # "r": "0.48046619608066976",
            "__biz": self.biz,
            "appmsg_type": "9",  # copied value, overwritten below
            "mid": "",
            "sn": "",
            "idx": "1",
            "scene": "",
            "title": "",  # left empty, overwritten below
            "ct": t,
            "abtest_cookie": "",
            "devicetype": "Windows+10",
            "version": "62060728",
            "is_need_ticket": "0",
            "is_need_ad": "0",
            "comment_id": "",
            "is_need_reward": "1",
            "both_ad": "0",
            "send_time": "",
            "msg_daily_idx": "1",
            "is_original": "0",
            "is_only_read": "1",
            "pass_ticket": self.pass_ticket,  # could also be hard-coded
            "is_temp_url": "0",
            "item_show_type": "0",
            "tmp_version": "1",
            "more_read_type": "0",
            "appmsg_like_type": "2"
        }
        if isinstance(params_data, dict):  # merge the passed-in parameters over the hard-coded ones
            data.update(params_data)
        headers = {
            'CSP': "active",
            'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
                          ' Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI '
                          'WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
        }
        try:
            response = requests.post(url, params=params, data=data, headers=headers, proxies=self.proxies, verify=False)
        except Exception as e:
            result['code'] = 1
            result['msg'] = str(e)
            return result

        try:
            data = json.loads(response.text)
            appmsgstat = data.get('appmsgstat')
            if appmsgstat:
                result['code'] = 0
                result['data'] = data
                return result
            # {'base_resp': {'ret': 302, 'errmsg': 'default'}}
            resp = data.get('base_resp', {})
            ret = resp.get('ret')
            if ret == 302:
                result['code'] = 0  # store it for now
                result['data'] = data
                return result
        except json.decoder.JSONDecodeError as e:
            result['code'] = 2
            result['msg'] = str(e)
            return result

        result['code'] = 3  # login info expired
        result['data'] = data
        return result

    def get_art_by_url(self, art_url):
        """
        Wrap the whole read-count flow.
        :param art_url: str, article link
        :return:
        """
        r_0 = self.get_art_page(art_url)
        code = r_0.get('code')
        if code != 0:
            return r_0
        data = r_0.get('data', {})
        r_1 = self.get_art_about(data)
        code = r_1.get('code')
        if code != 0:
            return r_1
        result = r_1
        result['data']['pre_info'] = data
        # record the crawl time
        t = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
        result['data']['c_time'] = t
        return result


class GZHCrawler(Crawler):
    def __init__(self, spider):
        Crawler.__init__(self, spider)

    def _stop(self):
        # self.input_que.clear()
        self.input_que.unfinished_tasks = 0  # reset the queue's task counter

    def crawl_list(self):
        while True:
            try:
                offset = self.input_que.get()
                print(offset)  # print the offset so progress is visible
            except Exception as e:
                time.sleep(1)
                continue

            ret = self.spider.get_art_list(offset=offset)
            code = ret.get('code')
            if code != 0:
                self.input_que.put(offset)
                self.input_que.task_done()
                continue

            status = ret.get('ret')
            if status == -3:  # cookie expired
                print(offset)
                print(ret)

            data_list_str = ret.get('general_msg_list')
            try:
                data = json.loads(data_list_str)
            except Exception as e:
                self.input_que.task_done()
                continue

            art_list = data.get('list')
            for a in art_list:
                # self.out_que.put(a)
                data_info = a.get('app_msg_ext_info', {})
                title = data_info.get('title', '')
                digest = data_info.get('digest', '')
                content_url = data_info.get('content_url', '')
                content_url = html.unescape(content_url)
                fileid = data_info.get('fileid', '')
                author = data_info.get('author', '')
                d = {
                    "title": title,
                    "digest": digest,
                    "content_url": content_url,
                    "fileid": fileid,
                    "author": author,
                    "head": 1
                }
                # print(d)  # print the item to check
                if fileid:
                    self.out_que.put(d)
                multi_app_msg_item_list = data_info.get('multi_app_msg_item_list', [])
                for i in multi_app_msg_item_list:
                    title = i.get('title', '')
                    digest = i.get('digest', '')
                    content_url = i.get('content_url', '')
                    content_url = html.unescape(content_url)
                    fileid = i.get('fileid', '')
                    author = i.get('author', '')
                    if fileid:
                        d = {
                            "title": title,
                            "digest": digest,
                            "content_url": content_url,
                            "fileid": fileid,
                            "author": author,
                            "head": 0
                        }
                        self.out_que.put(d)

            is_not_end = ret.get("can_msg_continue", 0)
            next_page = ret.get("next_offset")
            if is_not_end:
                self.input_que.put(next_page)

            self.input_que.task_done()
            time.sleep(5)

    def crawl(self):
        while True:
            try:
                params = self.input_que.get(timeout=0.2)
                print(params)
            except Exception as e:
                time.sleep(1)
                continue

            url = params.get('content_url', '')
            result_data = self.spider.get_art_by_url(url)
            data = result_data.get('data', {})
            code = result_data.get('code')
            if code == 3:  # login info wrong, stop
                # self.input_que.task_done()
                self._stop()
            if code == 4 or code == 5:  # ip banned / content violation, drop it
                self.input_que.task_done()
                continue
            if code != 0:  # any other error, put it back for a retry
                self.input_que.put(params)
                self.input_que.task_done()
                continue

            if data:
                data.update(params)
                self.out_que.put(data)
            self.input_que.task_done()
            time.sleep(3)


def test_spider():
    spider = GZHSpider()
    spider.biz = 'MzA5NDIzNzY1OQ=='  # official account id
    spider.uin = 'MjM4OTIzNzY1OQ=='  # WeChat user id
    spider.key = '014a8898c5f07cd6845f41fa83ff9b4edfa4556f8e3371f1e7d5081b24b931f317f94c48a4e42931b2a6' \
                 'ae5fe846ddc59749d081e5bbf45fc5ac93ebde78d13e7480dcf0b952752b993ac8158e936dbf'
    spider.pass_ticket = 'nEfY/UYG8sVbejI2/vtgkoMsxh5cw4FgVeJpRIrQLOAbRTyczaZCoBRr97c9HsCi'

    result_1 = spider.get_art_list()
    # print the list we got
    print(json.dumps(result_1))
    general_msg_list = result_1.get('general_msg_list', {})
    data_list_json = json.loads(general_msg_list)
    art_list = data_list_json.get('list')
    for article in art_list:
        data_info = article.get('app_msg_ext_info', {})
        content_url = data_info.get('content_url', '')
        content_url = html.unescape(content_url)
        print(content_url)
        result_2 = spider.get_art_by_url(content_url)
        # print the article info we got
        print(json.dumps(result_2))
        break


def test_crawler():
    s = GZHSpider()
    s.biz = 'MzA5NDIzNzY1OQ=='  # official account id
    s.uin = 'MjM4OTIzNzY1OQ=='  # WeChat user id
    s.key = '014a8898c5f07cd6845f41fa83ff9b4edfa4556f8e3371f1e7d5081b24b931f317f94c48a4e42931b2a6' \
            'ae5fe846ddc59749d081e5bbf45fc5ac93ebde78d13e7480dcf0b952752b993ac8158e936dbf'
    s.pass_ticket = 'nEfY/UYG8sVbejI2/vtgkoMsxh5cw4FgVeJpRIrQLOAbRTyczaZCoBRr97c9HsCi'
    s.proxies = {  # set the ip proxy
        'https': '218.86.87.171:53281'
    }

    # crawl the article list
    crawler = GZHCrawler(s)
    crawler.thd_num = 1
    crawler.crawl_func = crawler.crawl_list
    crawler.start_page_list = [0]
    crawler.out_file = 'runoob_list.json'
    crawler.run()

    # crawl per-article data
    crawler.crawl_func = crawler.crawl
    crawler.input_file = 'runoob_list.json'
    crawler.out_file = 'runoob_detail.json'
    crawler.run()


if __name__ == "__main__":
    test_crawler()
```
With this we can automatically fetch and save the article list, and then fetch and save the read counts and related data for each article.
The drawback is that this is not fully automated, because the key expires; in testing it lasted about three or four hundred items (ten-odd minutes?). Frequent access also gets the IP banned, so an IP proxy is required.
If you want to run any of the code above, be sure to replace uin and key with your own.