微信公衆號文章信息採集

嘗試一下抓取微信公衆號歷史文章。html

採集的主要信息有:標題、描述、做者、評論數、閱讀數、在看數、發佈時間、文章連接python

主要有這幾個步驟:

 須要準備工具:

fiddlerjson

微信PC客戶端微信

使用python3,相應環境本身搭建。cookie

分析流程:

首先,打開fiddler,操做一下本身的微信,訪問一些公衆號,而後看一下請求,這裏我訪問菜鳥教程的歷史文章,界面以下app

 在fiddler裏面找到微信相關的請求。以下ide

若是請求太多,能夠設置過濾「mp.weixin.qq.com」,就能夠只看微信這邊過來的請求了。函數

 

咱們將上面發現的幾個請求每一個都看一遍,能夠發如今第二個請求「/mp/profile_ext?action=home...」裏面有一些數據工具

 

能夠看到這個就是上面截圖中的頁面,裏面有個'msgList'的js變量,存的貌似是頁面內容的json。post

也就是說咱們請求到這個頁面就能夠獲取這個文章列表了。

**注意到TextView這個tab的時候可能會告訴你頁面沒有解碼,點擊解碼就能夠了。

到如今是否是就能夠着手開發代碼了?別慌。

咱們只知道這個頁面有數據,但還不知道怎麼下一頁的數據怎麼獲取,先將界面上的文章拉到第二頁看一下。(這個時候須要先關注公衆號)

咱們能夠看到第二頁相同連接的請求已經變成了json。

json數據,比解析網頁彷佛要方便許多。

 分析請求:

看一下找到的這個請求的具體內容。

經過觀察,咱們能夠知道這是一個get請求,經過研究,摸清了一些參數的含義,以下:

**通常不懂的參數能夠試試不要,或者直接寫死,不少狀況是能夠採集到數據的。

接下來寫代碼。

# encoding=utf-8
# date: 2019/4/26
__author__ = "Masako"

import json
import requests

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    'Host': 'mp.weixin.qq.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
    'X-Requested-With': 'XMLHttpRequest',
}
biz = 'MzA5NDIzNzY1OQ=='   # 公衆號id
uin = 'MjM4NTIzNzQ5MQ=='   # 用戶id
key = '08039a5457341b11f0c0b7e68e3cda9f6cbf593f925e8716293a13998bece633ea775eeb0159' \
      'a1183ca88d27b3060f6fc2c3428ef633f851029a64fa0638e41d111e13dce78055e01a39d3d0fdd2f657'  # 是個變量
pass_ticket = 'dKBE2K1SSAJHmrnd8fMJpWD6j52ASjpQfBiMjm74DyZd1Y7TsoOD/25GgM80trTX'  # 彷佛用處不大

offset = 0
pagesize = 10

proxies = {
    'https': '218.86.87.171:53281'
}

url = "https://mp.weixin.qq.com/mp/profile_ext"
params = {
    "action": "getmsg",
    "__biz": biz,
    "f": "json",
    "offset": offset,
    "count": pagesize,
    "is_ok": 1,
    "scene": 124,
    "uin": uin,
    "key": key,
    "pass_ticket": pass_ticket,
    "wxtoken": "",
    # "appmsg_token": appmsg_token,
}
response = requests.get(url, params=params, headers=headers, proxies=proxies, verify=False)

print(response.text)

這裏面的參數都是直接從fiddler裏面複製過來的,因爲開着fiddler,因此使用了verify=False,而後使用urllib3關閉了告警。

這份代碼裏面的參數基本能夠不變,除了key大約十幾分鍾會過時。而後採集文章比較多的時候(好像幾百條?)會封ip,須要使用ip代理。也就是設置proxies。

翻頁的時候改變offset便可。

具體封裝能夠本身寫函數。

運行結果:

獲得json的返回,解析其中的文章列表便可獲得文章標題,頭圖,做者,連接等信息。

在這個json中還有"next_offset"(下一頁的起始位置),"can_msg_continue"(是否能夠繼續翻頁),等相關信息,能夠幫助翻頁採集。

更詳細的數據

咱們在這裏只獲取到了文章的基本信息,和連接,將連接拿下來使用get訪問也能夠獲取到文章內容,但還獲取不到閱讀數,因此須要進一步分析。

如今咱們在歷史消息列表中隨便找一篇文章,點擊進入。

能夠看到先請求了一個文章連接,就是上一步文章列表中獲取到的連接,裏面包含文章內容,咱們本身也能夠請求到。

在文章連接下面有一條包含「getappmsgext」的請求,點進去能夠看到是一個json,咱們解析這個json,能夠看到read_num,

對比頁面數據能夠知道這個read_num就是閱讀量了,這裏還包括like_num在看數,comment_count評論數等(這個評論數彷佛不是顯示的評論,而是總數)。

那麼咱們要獲取的就是這個請求了,分析過程同上面分析列表同樣,接下來直接上請求代碼:

# encoding=utf-8
# date: 2019/5/15
__author__ = "Masako"

import time
import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

proxies = {
    'https': '218.86.87.171:53281'
}

headers = {
    'CSP': 'active',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat '
                  'QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
    'X-Requested-With': 'XMLHttpRequest',
}
biz = 'MzA5NDIzNzY1OQ=='  # 公衆號id
uin = 'MjM4OTIzNzY1OQ=='  # 用戶id
key = '333b7957c9b8367188f9a405069beed8a92625eae5e601ffda55443a53b7779af3d96bcd7' \
      'f992fb9f12557105abab467a55862681e76178e39b239a57d0c9aef7b324eb5fd1ae706b3aeef6c8f9d31a4'
pass_ticket = 'dKBE2K1SSAJHmrnd8fMJpWD6j52ASjpQfBiMjm74DyZd1Y7TsoOD/25GgM80trTX'

url = "https://mp.weixin.qq.com/mp/getappmsgext"

params = {
    "mock": "",
    "f": "json",
    "uin": uin,
    "key": key,
    "pass_ticket": pass_ticket,
    "wxtoken": "777",
    "devicetype": "Windows%26nbsp%3B10",
    # "appmsg_token": appmsg_token,
}

t = int(time.time())
# 如下參數先使用複製的,後續再說獲取
appmsg_type = "9"
msg_title = "%E7%A8%8B%E5%BA%8F%E5%91%98%E7%9A%84%E6%97%A5%E5%B8%B8%E5%A4%A7%E6%8F%AD%E9%9C%B2%EF%BC%8C" \
            "%E5%A4%AA%E7%9C%9F%E5%AE%9E%E4%BA%86%EF%BC%81"
req_id = "1516dM576eEqb9OJ50G0ECvJ"
comment_id = "802341523856785408"
mid = "2735613806"
sn = "48862f1fb98b5d1a0550ce27594f1361"
idx = "1"
scene = "38"
appmsg_like_type = "2"

data = {
    # "r": "0.48046619608066976",
    "__biz": biz,  # 公衆號id
    "appmsg_type": appmsg_type,  # 信息類型
    "mid": mid,  # 一個參數
    "sn": sn,  # 一個參數
    "idx": "1",
    "scene": scene,  # 一個數字
    "title": msg_title,  # 文章標題
    "comment_id": comment_id,  # 評論id
    "ct": t,  # 時間戳
    "pass_ticket": pass_ticket,  # 一個參數
    "req_id": req_id,  # 一個參數
    "abtest_cookie": "",
    "devicetype": "Windows+10",
    "version": "62060728",
    "is_need_ticket": "0",  # 後面一些標識直接寫死
    "is_need_ad": "0",
    "is_need_reward": "1",
    "both_ad": "0",
    "send_time": "",
    "msg_daily_idx": "1",
    "is_original": "0",
    "is_only_read": "1",
    "is_temp_url": "0",
    "item_show_type": "0",
    "tmp_version": "1",
    "more_read_type": "0",
    "appmsg_like_type": "2"
}

response = requests.post(url, params=params, data=data, headers=headers, proxies=proxies, verify=False)

print(response.text)

跑這份代碼能夠獲得一個json

能夠看到閱讀數等信息都在這裏面了。接下來咱們解決上述代碼中的一個問題——參數。

獲取必要參數

上述代碼中的部分參數是變量,每篇文章不一樣,示例中是直接複製下來使用,實際狀況須要每次去獲取的,若是每次都手動複製就沒有意義了。

回到以前的文章內容頁面

咱們隨便選一個參數在頁面搜索一下就能夠看到,有這些參數的定義,好比上圖中的comment_id。

一樣,其餘參數也有,有的參數在連接中就有,均可以做爲獲取途徑。

看這些參數都是JavaScript的定義,選擇先訪問這個頁面直接用正則獲取這些參數。代碼以下:

# encoding=utf-8
# date: 2019/5/15
__author__ = "Masako"

import re
import requests

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    'Host': 'mp.weixin.qq.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
    'X-Requested-With': 'XMLHttpRequest',
}

# 從文章列表獲取到的連接,拼接域名
art_url = 'https://mp.weixin.qq.com/s?__biz=MzA5NDIzNzY1OQ==&mid=2735613806&idx=1&sn=48862f1fb98b5d1a0550ce27594f1361&chksm=b6ab21da81dca8cca0ed20d529a9f550a98f751b326754ef57cd02e5b261ce43fc628d5bf9db&scene=38&key=33ba9b7dde092b04c3cefb3cd24fa4be6815ea9c7ca566093b935014bef02d21bc4c1c28ba937ffdae3935020224da51188ae48f135981b067d3bf1ac5397375ef58670a5e9fcffdeefb069b04876363&ascene=7&uin=MjM4OTI0MzQ5MQ%3D%3D&devicetype=Windows+10&version=62060739&lang=zh_CN&pass_ticket=zGPZpVX8Mp%2BRMvVPKZF6Ci4MecfwbAppLGWvSu3bNP01O8gMXkV7%2B4pMIzep9g30&winzoom=1'
# 請求到頁面
response = requests.get(art_url, headers=headers, verify=False)
content = response.text

# 正則獲取必要參數
appmsg_type = re.findall('appmsg_type = "(\d+)"', content)[0]
msg_title = re.findall('msg_title = "(.*?)"', content)[0]
req_id = re.findall("req_id = '(.*?)'", content)[0]
comment_id = re.findall('comment_id = "(.*?)"', content)[0]
appmsg_like_type = re.findall('appmsg_like_type = "(.*?)"', content)[0]
scene = re.findall('var source = "(.*?)"', content)[0]

print(msg_title)

(代碼未作錯誤處理,正則可能會報錯。)

這份代碼能夠打印一些信息。

整合一下

 將各個部分的代碼整合一下。

  1 # encoding=utf-8
  2 # date: 2019/5/15
  3 __author__ = "Masako"
  4 
  5 import re
  6 import json
  7 import time
  8 import html
  9 import requests
 10 
 11 from Elise.crawler import Crawler
 12 
 13 import urllib3
 14 urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
 15 
 16 
 17 class GZHSpider:
 18 
 19     def __init__(self):
 20         self.biz = ""
 21         self.uin = ""
 22         self.key = ""
 23         self.pass_ticket = ""
 24         self.proxies = {}
 25         self.headers = {
 26             'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
 27                           ' Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI '
 28                           'WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
 29         }
 30 
 31     def get_art_list(self, offset=0, pagesize=10):
 32         """
 33         獲取文章列表
 34         所需參數是調用時的變量,其餘參數能夠固定,在初始化時設置
 35         :param offset: int, 偏移量,至關於頁碼, 可由上一頁的位置獲得
 36         :param pagesize: int, 每頁條數,默認爲10
 37         :return: 訪問到的json數據
 38         """
 39         url = "https://mp.weixin.qq.com/mp/profile_ext"
 40         result = {}
 41         # offset = page * pagesize
 42         params = {
 43             "action": "getmsg",
 44             "__biz": self.biz,
 45             "f": "json",
 46             "offset": offset,
 47             "count": pagesize,
 48             "is_ok": 1,
 49             "scene": '38',
 50             "uin": self.uin,
 51             "key": self.key,
 52             "pass_ticket": self.pass_ticket,
 53             "wxtoken": "",
 54         }
 55         try:
 56             response = requests.get(url, params=params, headers=self.headers, proxies=self.proxies, verify=False)
 57         except Exception as e:
 58             result['code'] = 1
 59             result['msg'] = str(e)
 60             return result
 61 
 62         try:
 63             data = json.loads(response.text)
 64             data['code'] = 0
 65             return data
 66         except json.decoder.JSONDecodeError as e:
 67             result['code'] = 2
 68             result['msg'] = str(e)
 69             return result
 70 
 71     def get_art_page(self, art_url):
 72         """
 73         從文章頁面獲取採集閱讀量須要的數據
 74         :param art_url: str, 文章連接
 75         :return:
 76         """
 77         result = {}
 78         try:
 79             response = requests.get(art_url, headers=self.headers, proxies=self.proxies, verify=False)
 80             # print(response.text)
 81         except Exception as e:
 82             result['code'] = 1
 83             result['msg'] = str(e)
 84             return result
 85 
 86         # 處理文章錯誤
 87         try:
 88             if '訪問過於頻繁' in response.text:  # 訪問頻繁,需換ip
 89                 result['code'] = 4
 90                 result['msg'] = "ip banned"
 91                 return result
 92             if '沒法查看' in response.text:  # 沒法查看,被刪除或者被違規被舉報
 93                 result['code'] = 5
 94                 result['msg'] = "content violation"
 95                 return result
 96             data = self.parse_art_page(response.text)
 97         except Exception as e:  # 其餘錯誤致使解析失敗
 98             result['code'] = 2
 99             result['msg'] = str(e)
100             return result
101 
102         result['data'] = data
103         result['code'] = 0
104         return result
105 
106     @staticmethod
107     def parse_art_page(content):
108         """
109         解析文章 html
110         :param content: 文章頁面的html, 字符串
111         :return:
112         """
113         def get_value(s, name):
114             value_str = re.findall('var %s = (.*?);' % name, s)[0]
115             patten = re.compile('"(.*?)"')
116             r_list = re.findall(patten, value_str)
117             for i in r_list:
118                 if i:
119                     return i
120             else:
121                 return ''
122 
123         # 直接正則獲取了
124         appmsg_type = re.findall('appmsg_type = "(\d+)"', content)[0]
125         msg_title = re.findall('msg_title = "(.*?)"', content)[0]
126         req_id = re.findall("req_id = '(.*?)'", content)[0]
127         comment_id = re.findall('comment_id = "(.*?)"', content)[0]
128 
129         mid = get_value(content, 'mid')
130         sn = get_value(content, 'sn')
131         idx = get_value(content, 'idx')
132         scene = re.findall('var source = "(.*?)"', content)[0]
133         publish_time = re.findall('var publish_time = "(.*?)"', content)[0]
134 
135         appmsg_like_type = re.findall('appmsg_like_type = "(.*?)"', content)[0]
136 
137         params = {
138             "appmsg_type": appmsg_type,
139             "msg_title": msg_title,
140             "publish_time": publish_time,
141             "mid": mid,
142             "sn": sn,
143             "idx": idx,
144             "scene": scene,
145             "req_id": req_id,
146             "comment_id": comment_id,
147             "appmsg_like_type": appmsg_like_type,
148         }
149         return params
150 
151     def get_art_about(self, params_data):
152         """
153         獲取閱讀量點贊數等相關信息
154         :param params_data: dict, 須要的參數
155         :return:
156         """
157         url = "https://mp.weixin.qq.com/mp/getappmsgext"
158         result = {}
159         # offset = page * pagesize
160         params = {
161             "mock": "",
162             "f": "json",
163             "uin": self.uin,
164             "key": self.key,
165             "pass_ticket": self.pass_ticket,
166             "wxtoken": "777",
167             "devicetype": "Windows%26nbsp%3B10",
168             # "appmsg_token": appmsg_token,
169         }
170         t = int(time.time())
171         # title = requests.utils.quote(title)
172         data = {
173             # "r": "0.48046619608066976",
174             "__biz": self.biz,
175             "appmsg_type": "9",  # 複製下來的值,會被覆蓋掉
176             "mid": "",
177             "sn": "",
178             "idx": "1",
179             "scene": "",
180             "title": "",   # 爲空,後面覆蓋
181             "ct": t,
182             "abtest_cookie": "",
183             "devicetype": "Windows+10",
184             "version": "62060728",
185             "is_need_ticket": "0",
186             "is_need_ad": "0",
187             "comment_id": "",
188             "is_need_reward": "1",
189             "both_ad": "0",
190             "send_time": "",
191             "msg_daily_idx": "1",
192             "is_original": "0",
193             "is_only_read": "1",
194             "pass_ticket": self.pass_ticket,  # 也能夠寫死
195             "is_temp_url": "0",
196             "item_show_type": "0",
197             "tmp_version": "1",
198             "more_read_type": "0",
199             "appmsg_like_type": "2"
200         }
201         if isinstance(params_data, dict):   # 將傳進來的參數和一些寫死的參數合併到一個字典
202             data.update(params_data)
203         headers = {
204             'CSP': "active",
205             'Content-Type': "application/x-www-form-urlencoded; charset=UTF-8",
206             'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko)'
207                           ' Chrome/39.0.2171.95 Safari/537.36 MicroMessenger/6.5.2.501 NetType/WIFI '
208                           'WindowsWechat QBCore/3.43.901.400 QQBrowser/9.0.2524.400',
209         }
210         try:
211             response = requests.post(url, params=params, data=data, headers=headers, proxies=self.proxies, verify=False)
212         except Exception as e:
213             result['code'] = 1
214             result['msg'] = str(e)
215             return result
216 
217         try:
218             data = json.loads(response.text)
219             appmsgstat = data.get('appmsgstat')
220             if appmsgstat:
221                 result['code'] = 0
222                 result['data'] = data
223                 return result
224             # {'base_resp': {'ret': 302, 'errmsg': 'default'}}
225             resp = data.get('base_resp', {})
226             ret = resp.get('ret')
227             if ret == 302:
228                 result['code'] = 0  # 先存下來再說
229                 result['data'] = data
230                 return result
231         except json.decoder.JSONDecodeError as e:
232             result['code'] = 2
233             result['msg'] = str(e)
234             return result
235 
236         result['code'] = 3  # 表示登陸信息過時
237         result['data'] = data
238         return result
239 
240     def get_art_by_url(self, art_url):
241         """
242         整合一下獲取閱讀量的過程
243         :param art_url: str, 文章連接
244         :return:
245         """
246         r_0 = self.get_art_page(art_url)
247         code = r_0.get('code')
248         if code != 0:
249             return r_0
250         data = r_0.get('data', {})
251         r_1 = self.get_art_about(data)
252         code = r_1.get('code')
253         if code != 0:
254             return r_1
255         result = r_1
256         result['data']['pre_info'] = data
257         # 記錄採集時間
258         t = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
259         result['data']['c_time'] = t
260         return result
261 
262 
263 class GZHCrawler(Crawler):
264     def __init__(self, spider):
265         Crawler.__init__(self, spider)
266 
267     def _stop(self):
268         # self.input_que.clear()
269         self.input_que.unfinished_tasks = 0  # 清空隊列的計數器
270 
271     def crawl_list(self):
272         while True:
273             try:
274                 offset = self.input_que.get()
275                 print(offset)  #  打印頁碼,能夠直觀看到進度
276             except Exception as e:
277                 time.sleep(1)
278                 continue
279 
280             ret = self.spider.get_art_list(offset=offset)
281             code = ret.get('code')
282             if code != 0:
283                 self.input_que.put(offset)
284                 self.input_que.task_done()
285                 continue
286 
287             status = ret.get('ret')
288             if status == -3:  # cookie過時
289                 print(offset)
290                 print(ret)
291 
292             data_list_str = ret.get('general_msg_list')
293             try:
294                 data = json.loads(data_list_str)
295             except Exception as e:
296                 self.input_que.task_done()
297                 continue
298 
299             art_list = data.get('list')
300             for a in art_list:
301                 # self.out_que.put(a)
302                 data_info = a.get('app_msg_ext_info', {})
303                 title = data_info.get('title', '')
304                 digest = data_info.get('digest', '')
305                 content_url = data_info.get('content_url', '')
306                 content_url = html.unescape(content_url)
307                 fileid = data_info.get('fileid', '')
308                 author = data_info.get('author', '')
309                 d = {
310                     "title": title,
311                     "digest": digest,
312                     "content_url": content_url,
313                     "fileid": fileid,
314                     "author": author,
315                     "head": 1
316                 }
317                 # print(d)  # 打印結果看看
318                 if fileid:
319                     self.out_que.put(d)
320                 multi_app_msg_item_list = data_info.get('multi_app_msg_item_list', [])
321                 for i in multi_app_msg_item_list:
322                     title = i.get('title', '')
323                     digest = i.get('digest', '')
324                     content_url = i.get('content_url', '')
325                     content_url = html.unescape(content_url)
326                     fileid = i.get('fileid', '')
327                     author = i.get('author', '')
328                     if fileid:
329                         d = {
330                             "title": title,
331                             "digest": digest,
332                             "content_url": content_url,
333                             "fileid": fileid,
334                             "author": author,
335                             "head": 0
336                         }
337                         self.out_que.put(d)
338 
339             is_not_end = ret.get("can_msg_continue", 0)
340             next_page = ret.get("next_offset")
341             if is_not_end:
342                 self.input_que.put(next_page)
343 
344             self.input_que.task_done()
345             time.sleep(5)
346 
347     def crawl(self):
348         while True:
349             try:
350                 params = self.input_que.get(timeout=0.2)
351                 print(params)
352             except Exception as e:
353                 time.sleep(1)
354                 continue
355 
356             url = params.get('content_url', '')
357             result_data = self.spider.get_art_by_url(url)
358             data = result_data.get('data', {})
359             code = result_data.get('code')
360             if code == 3:  # 登陸信息錯誤,就退出
361                 # self.input_que.task_done()
362                 self._stop()
363             if code == 4 or code == 5:  # ip被封禁; 內容違規, 就丟棄
364                 self.input_que.task_done()
365                 continue
366             if code != 0:  # 其餘錯誤從新採集
367                 self.input_que.put(params)
368                 self.input_que.task_done()
369                 continue
370 
371             if data:
372                 data.update(params)
373                 self.out_que.put(data)
374             self.input_que.task_done()
375             time.sleep(3)
376 
377 
378 def test_spider():
379     spider = GZHSpider()
380     spider.biz = 'MzA5NDIzNzY1OQ=='  # 公衆號id
381     spider.uin = 'MjM4OTIzNzY1OQ=='  # 微信號id
382     spider.key = '014a8898c5f07cd6845f41fa83ff9b4edfa4556f8e3371f1e7d5081b24b931f317f94c48a4e42931b2a6ae5fe846ddc59749d081e5bbf45fc5ac93ebde78d13e7480dcf0b952752b993ac8158e936dbf'
383     spider.pass_ticket = 'nEfY/UYG8sVbejI2/vtgkoMsxh5cw4FgVeJpRIrQLOAbRTyczaZCoBRr97c9HsCi'
384 
385     result_1 = spider.get_art_list()
386     # 打印獲取到的列表
387     print(json.dumps(result_1))
388     general_msg_list = result_1.get('general_msg_list', {})
389     data_list_json = json.loads(general_msg_list)
390     art_list = data_list_json.get('list')
391     for article in art_list:
392         data_info = article.get('app_msg_ext_info', {})
393         content_url = data_info.get('content_url', '')
394         content_url = html.unescape(content_url)
395         print(content_url)
396         result_2 = spider.get_art_by_url(content_url)
397         # 打印一下獲取到的文章信息
398         print(json.dumps(result_2))
399         break
400 
401 
402 def test_crawler():
403     s = GZHSpider()
404     s.biz = 'MzA5NDIzNzY1OQ=='  # 公衆號id
405     s.uin = 'MjM4OTIzNzY1OQ=='  # 微信號id
406     s.key = '014a8898c5f07cd6845f41fa83ff9b4edfa4556f8e3371f1e7d5081b24b931f317f94c48a4e42931b2a6' \
407             'ae5fe846ddc59749d081e5bbf45fc5ac93ebde78d13e7480dcf0b952752b993ac8158e936dbf'
408     s.pass_ticket = 'nEfY/UYG8sVbejI2/vtgkoMsxh5cw4FgVeJpRIrQLOAbRTyczaZCoBRr97c9HsCi'
409     s.proxies = {   # 設置ip代理
410         'https': '218.86.87.171:53281'
411     }
412 
413     # 採集文章列表
414     crawler = GZHCrawler(s)
415     crawler.thd_num = 1
416     crawler.crawl_func = crawler.crawl_list
417     crawler.start_page_list = [0]
418     crawler.out_file = 'runoob_list.json'
419     crawler.run()
420 
421     # 採集文章數據
422     crawler.crawl_func = crawler.crawl
423     crawler.input_file = 'runoob_list.json'
424     crawler.out_file = 'runoob_detail.json'
425     crawler.run()
426 
427 
428 if __name__ == "__main__":
429     test_crawler()
View Code

這樣就能夠自動獲取文章列表並保存,隨後獲取文章閱讀數等相關信息並保存。

這個缺點是並不徹底自動化,由於key會過時,測試大約採集三四百條(十幾分鍾?)就過時了。訪問頻繁的時候也會封ip,因此須要ip代理。

以上全部代碼要跑的話,務必更換 uinkey.

相關文章
相關標籤/搜索