WeChat Official Account Article Crawler

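Many WeChat official accounts publish high-quality articles, so I wanted to build a crawler that fetches every article from the accounts I like. Crawling all of an account's articles requires two key parameters: the account's unique ID (__biz) and wap_sid2, the permission value for reading a single account's articles. The approach is described below.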

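Crawling approach:

 To crawl an official account you first have to identify it uniquely, so you need the account's id value (__biz). Many of the articles I read obtain __biz rather mechanically, by copying it by hand. Sogou search is now integrated with WeChat official accounts, which gives us a better route: the page source of an official account on Sogou contains that account's __biz value. Sogou does limit what it shows for an official account to the 10 most recent articles, so we use Sogou only for two things: getting the __biz value, and getting the list of official accounts that match an arbitrary search keyword.

 Below is the Sogou URL for searching WeChat official accounts. The query value python is the search keyword; the other parameters can stay unchanged.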

http://weixin.sogou.com/weixin?type=1&s_from=input&query=python&ie=utf8&_sug_=n&_sug_type_=
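As a minimal sketch of the URL above, the search address can be built per keyword with the standard library so that non-ASCII keywords are percent-encoded correctly. The function name `sogou_account_search_url` is my own; the parameter values are simply copied from the captured URL.

```python
from urllib.parse import urlencode

SOGOU_SEARCH = "http://weixin.sogou.com/weixin"


def sogou_account_search_url(keyword: str) -> str:
    """Build the Sogou official-account search URL for one keyword."""
    params = {
        "type": 1,            # copied unchanged from the captured URL
        "s_from": "input",
        "query": keyword,     # the only parameter that varies
        "ie": "utf8",
        "_sug_": "n",
        "_sug_type_": "",
    }
    return SOGOU_SEARCH + "?" + urlencode(params)


print(sogou_account_search_url("python"))
```

Passing a Chinese keyword works the same way; `urlencode` encodes it as UTF-8 before percent-escaping.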

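The search results page:

[Figure: search results]

View the page source:

In the source, the link to each official account is the href attribute of an a tag under the element whose id is sogou_vr_11002301_box_n (n is an integer such as 1, 2, 3). It can be selected with XPath, stepping through n in order: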

//*[@id="sogou_vr_11002301_box_n"]/div/div[2]/p[1]/a
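The loop over n can be sketched as follows. This is only an illustration: `SAMPLE` is a hypothetical, heavily trimmed stand-in for the Sogou markup, and it uses the standard library's `xml.etree.ElementTree`, which needs well-formed input. Against the real page you would run the same expression through an HTML-aware selector instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the result list (not the real markup).
SAMPLE = """
<ul>
  <li id="sogou_vr_11002301_box_0">
    <div><div>logo</div><div><p><a href="http://mp.weixin.qq.com/profile?src=3&amp;ver=1">A</a></p></div></div>
  </li>
  <li id="sogou_vr_11002301_box_1">
    <div><div>logo</div><div><p><a href="http://mp.weixin.qq.com/profile?src=3&amp;ver=2">B</a></p></div></div>
  </li>
</ul>
"""

# Same path as in the article, with n left as a placeholder.
XPATH = ".//*[@id='sogou_vr_11002301_box_{n}']/div/div[2]/p[1]/a"


def account_links(root, max_results=10):
    """Probe box_0 .. box_{max_results-1} and collect the hrefs that exist."""
    links = []
    for n in range(max_results):
        node = root.find(XPATH.format(n=n))
        if node is not None:
            links.append(node.get("href"))
    return links


links = account_links(ET.fromstring(SAMPLE))
print(links)
```

Probing a range and skipping missing ids means the helper does not depend on whether the numbering starts at 0 or 1. In a Scrapy callback the equivalent lookup would be `response.xpath(expr + "/@href").get()`.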

The address of a single official account looks like this:

http://mp.weixin.qq.com/profile?src=3&timestamp=1508003829&ver=1&signature=Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==

Open that official account link, fetch the page source, and read the account's id value from it:

https://user-gold-cdn.xitu.io/2019/5/18/16ac8a25c44f9ca3?w=906&h=606&f=jpeg&s=44262

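// The biz value is the official account's unique id. Most of the code before and after this is omitted. This snippet sits inside a script tag, which also holds the data for the 10 most recent articles; if you only want those 10, you can extract them directly with a regular expression.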
var biz = "MzIwNDA1OTM4NQ==" || "";
var src = "3" ; 
var ver = "1" ; 
var timestamp = "1508003829" ; 
var signature = "Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==" ; 
var name="python6359"||"python";
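A minimal sketch of pulling `biz` out of that script block with a regular expression. The pattern is written against the snippet shown above and assumes the variable keeps the `var biz = "..."` form; `extract_biz` is a name I made up for illustration.

```python
import re

# Matches: var biz = "MzIwNDA1OTM4NQ==" || "";
BIZ_RE = re.compile(r'var\s+biz\s*=\s*"([^"]*)"')


def extract_biz(page_source):
    """Return the __biz value embedded in the page, or None if absent or empty."""
    match = BIZ_RE.search(page_source)
    if match and match.group(1):
        return match.group(1)
    return None


sample = 'var biz = "MzIwNDA1OTM4NQ==" || "";\nvar src = "3" ;'
print(extract_biz(sample))
```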

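With the official account's id in hand, the next step is to get the wap_sid2 value (the permission value for a single account's articles). This has to come from the WeChat client, so we capture it with the Fiddler packet capture tool. If you have not set up a capture environment before, see the article "Capturing Mobike packets with Fiddler".

The URL that returns the article permission value: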

GET /mp/profile_ext?action=home&__biz=MjM5MDI1ODUyMA==&scene=124&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&a8scene=3&fontScale=100&pass_ticket=ji%2B3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo%2Bgun%2By2V3lxc34GQy3W5u8mE&wx_header=1 HTTP/1.1
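For illustration, the request above can be rebuilt for any account from its `__biz` and a `pass_ticket`. This is a sketch under the assumption that the remaining query parameters can be reused as captured; `history_home_url` is a hypothetical helper name. Note that the `pass_ticket` appears with `+` in cookies but as `%2B` in the URL, which is exactly what the encoding step below produces.

```python
from urllib.parse import urlencode


def history_home_url(biz, pass_ticket):
    """Rebuild the captured profile_ext?action=home URL for one account."""
    params = {
        "action": "home",
        "__biz": biz,
        "scene": 124,
        "devicetype": "iOS10.0.1",
        "version": 16051220,
        "lang": "zh_CN",
        "nettype": "WIFI",
        "a8scene": 3,
        "fontScale": 100,
        "pass_ticket": pass_ticket,
        "wx_header": 1,
    }
    # safe="=" keeps the base64 padding of __biz literal, as in the capture;
    # "+" is still escaped to %2B.
    return "https://mp.weixin.qq.com/mp/profile_ext?" + urlencode(params, safe="=")


print(history_home_url(
    "MjM5MDI1ODUyMA==",
    "ji+3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo+gun+y2V3lxc34GQy3W5u8mE",
))
```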

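The corresponding request headers are below. x-wechat-key changes every so often, so it has to be refreshed periodically; X-WECHAT-UIN can stay the same, and pass_ticket can also be left unchanged for a while: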

'Host':'mp.weixin.qq.com',
# 'X-WECHAT-KEY': 'a83687cde3ca46be517cdbcba60732159f229a03507e9afa1e0dfee00e3cf00562aee022e84b9011924fdbb0c7af8c647c33b1338b11ebdc8893d5df41dd34a536e1af5b48d15c87b4aef629ad8685f3',
'X-WECHAT-KEY': '33c1fdebcfc1d1ecd9df5003dc9d9ccb6a1f5458eb704e58a05e80c73e8793dede6b52115a74a515d4d12c9a6f2d8f00238afe17cca3635d80d661a612a4a0bf48a2547516b12030efd8a224548636d2',
'X-WECHAT-UIN':'MTU2MzIxNjQwMQ%3D%3D',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
'Accept-Language':'zh-cn',
'Accept-Encoding':'gzip, deflate',
'Connection':'keep-alive',
'Cookie':'wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208',
'Referer':"https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MzI5MTQ1Mg==&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&ascene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"

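The response headers returned for the request URL above set wap_sid2, the permission value for fetching a single account's articles. What we need is the wap_sid2 value from the Set-Cookie headers: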

HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: no-cache, must-revalidate
Strict-Transport-Security: max-age=15552000
Set-Cookie: wxuin=1563216401; Path=/; HttpOnly
Set-Cookie: pass_ticket=ji+3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo+gun+y2V3lxc34GQy3W5u8mE; Path=/; HttpOnly
Set-Cookie: wap_sid2=CJGUs+kFElxER01KN1ZkVElJMUdhTktDUUk2LUZHNkFwT1Rzc1EwUWpWaW5ZMHlFQi15cUo1VWFjamNLM3pjdzNCbDc2ZFZpOW0xeDdPb0czWXNuQUdmbVdyOFZiNTREQUFBfjC+7YvPBTgMQJRO; Path=/; HttpOnly
Connection: keep-alive
Content-Length: 37211
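One way to read a named cookie out of several `Set-Cookie` headers is the standard library's `http.cookies.SimpleCookie`. This is a sketch rather than what the project does; `cookie_from_set_cookie` is a hypothetical helper, and the sample headers are the ones from the response above.

```python
from http.cookies import SimpleCookie

SET_COOKIE_HEADERS = [
    "wxuin=1563216401; Path=/; HttpOnly",
    "pass_ticket=ji+3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo+gun+y2V3lxc34GQy3W5u8mE; Path=/; HttpOnly",
    "wap_sid2=CJGUs+kFElxER01KN1ZkVElJMUdhTktDUUk2LUZHNkFwT1Rzc1EwUWpWaW5ZMHlFQi15cUo1VWFjamNLM3pjdzNCbDc2ZFZpOW0xeDdPb0czWXNuQUdmbVdyOFZiNTREQUFBfjC+7YvPBTgMQJRO; Path=/; HttpOnly",
]


def cookie_from_set_cookie(headers, name):
    """Return the value of cookie `name` from a list of Set-Cookie strings."""
    for raw in headers:
        jar = SimpleCookie()
        jar.load(raw)          # parses one header, including Path/HttpOnly
        if name in jar:
            return jar[name].value
    return None


print(cookie_from_set_cookie(SET_COOKIE_HEADERS, "wap_sid2"))
```

Looking the cookie up by name avoids depending on the order in which the server happens to send the three headers.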

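Getting the official account list data

Getting the wap_sid2 permission value

Once we have the account id __biz and the permission value wap_sid2, we can construct requests for the article list. In the spider below, the MongoDB operations read the stored account ids; for each id the spider fetches the wap_sid2 value, and the id and its wap_sid2 are then written to the database as a pair.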
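The project's own `MongoOperate` class is not shown in this article, so the following is only a sketch of how the id and wap_sid2 pair might be stored. It assumes a pymongo-style collection object; `update_one(filter, update, upsert=True)` is the pymongo call for "insert or overwrite". `save_session` and `RecordingCollection` are hypothetical names, and the latter is just an in-memory stand-in so the example runs without a database.

```python
def save_session(collection, biz, wap_sid2):
    """Upsert one biz -> wap_sid2 pair so a refreshed value replaces the old one."""
    collection.update_one(
        {"biz": biz},                       # match on the account id
        {"$set": {"wap_sid2": wap_sid2}},   # overwrite only the permission value
        upsert=True,                        # create the document if it is missing
    )


class RecordingCollection:
    """In-memory stand-in that mimics only the update_one call used above."""

    def __init__(self):
        self.docs = {}

    def update_one(self, flt, update, upsert=False):
        key = flt["biz"]
        if key in self.docs or upsert:
            self.docs.setdefault(key, {"biz": key}).update(update["$set"])


coll = RecordingCollection()
save_session(coll, "MjM5MDI1ODUyMA==", "old-value")
save_session(coll, "MjM5MDI1ODUyMA==", "new-value")
print(coll.docs)
```

Because wap_sid2 expires, an upsert keyed on `biz` keeps exactly one current value per account instead of accumulating stale rows.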

# -*- coding: utf-8 -*-
from scrapy import Spider,Request
from .mongo import MongoOperate
import re
from wechatSpider.items import GetsessionspiderItem
from .settings import *
class GetsessionSpider(Spider):
    name = "getSession"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    headers={
        'Host':'mp.weixin.qq.com',
        # 'X-WECHAT-KEY': 'a83687cde3ca46be517cdbcba60732159f229a03507e9afa1e0dfee00e3cf00562aee022e84b9011924fdbb0c7af8c647c33b1338b11ebdc8893d5df41dd34a536e1af5b48d15c87b4aef629ad8685f3',
        'X-WECHAT-KEY': '33c1fdebcfc1d1ecd9df5003dc9d9ccb6a1f5458eb704e58a05e80c73e8793dede6b52115a74a515d4d12c9a6f2d8f00238afe17cca3635d80d661a612a4a0bf48a2547516b12030efd8a224548636d2',
        'X-WECHAT-UIN':'MTU2MzIxNjQwMQ%3D%3D',
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
        'Accept-Language':'zh-cn',
        'Accept-Encoding':'gzip, deflate',
        'Connection':'keep-alive',
        'Cookie':'wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208',
        'Referer':"https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MzI5MTQ1Mg==&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&ascene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"
    }
    # History message list page; we request it to capture the `wap_sid2` cookie that grants access
    url="https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={biz}&scene=124&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&a8scene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"
    def start_requests(self):
        MongoObj=MongoOperate(MONGO_URI,MONGO_DATABASE,MONGO_USER,MONGO_PASS,WECHATID)
        MongoObj.connect()
        items=MongoObj.finddata()
        for item in items:
            biz=item["wechatID"]
            yield Request(url=self.url.format(biz=biz),dont_filter=True,headers=self.headers,callback=self.parse,meta={"proxy":"http://127.0.0.1:8888","biz":biz})
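    def parse(self, response):
        # Scan every Set-Cookie header by name instead of relying on wap_sid2
        # being the last one, and split on the first "=" only so that a value
        # containing "=" is not truncated.
        wap_sid2 = None
        for raw in response.headers.getlist("Set-Cookie"):
            name, _, value = raw.decode("utf-8").split(";")[0].partition("=")
            if name.strip() == "wap_sid2":
                wap_sid2 = value
                break
        if wap_sid2 is None:
            # Usually means X-WECHAT-KEY has expired and needs refreshing
            self.logger.warning("wap_sid2 not set for biz=%s", response.meta["biz"])
            return
        item = GetsessionspiderItem()
        item["biz"] = response.meta["biz"]
        item["wap_sid2"] = wap_sid2
        yield item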

https://user-gold-cdn.xitu.io/2019/5/18/16ac8a25c60c9c92?w=1089&h=482&f=png&s=17087

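Getting the article data list

MongoDB now holds each official account's id together with its wap_sid2 value. The next step is to build the article request, that is, the URL that returns the account's article list.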

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from .mongo import MongoOperate
import json
from .settings import *
class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    count=10
    url="https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={biz}&f=json&offset={index}&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket=ULeI%2BILkTLA2IpuIDqbIla4jG6zBTm1jj75UIZCgIUAFzOX29YQeTm5UKYuXU6JY&wxtoken=&appmsg_token=925_%252B4oEmoVo6AFzfOotcwPrPnBvKbEdnLNzg5mK8Q~~&x5=0&f=json"
    def start_requests(self):
        MongoObj=MongoOperate(MONGO_URI,MONGO_DATABASE,MONGO_USER,MONGO_PASS,RESPONSE)
        MongoObj.connect()
        items=MongoObj.finddata()
        for item in items:
            headers={
                'Accept-Encoding':'gzip, deflate',
                'Connection':'keep-alive',
                'Accept':'*/*',
                'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
                'Accept-Language': 'zh-cn',
                'X-Requested-With': 'XMLHttpRequest',
                'X-WECHAT-KEY': '62526065241838a5d44f7e7e14d5ffa3e87f079dc50a66e615fe9b6169c8fdde0f7b9f36f3897212092d73a3a223ffd21514b690dd8503b774918d8e86dfabbf46d1aedb66a2c7d29b8cc4f017eadee6',
                'X-WECHAT-UIN': 'MTU2MzIxNjQwMQ%3D%3D',
                'Cookie':';wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208'

            }
            biz=item["biz"]
            # The server mainly validates wap_sid2; a different pass_ticket does not matter
            headers["Cookie"]="wap_sid2="+item["wap_sid2"]+headers["Cookie"]
            yield Request(url=self.url.format(biz=biz,index=self.count),headers=headers,callback=self.parse,dont_filter=True,meta={"biz":biz,"headers":headers,"offset":self.count})
    def parse(self, response):
        biz=response.meta["biz"]
        headers=response.meta["headers"]
        offset=response.meta["offset"]
        resText=json.loads(response.text)
        # general_msg_list is itself a JSON-encoded string, so decode it a second time
        msg_list=json.loads(resText["general_msg_list"])
        yield msg_list
        if resText["can_msg_continue"]==1:
            # Carry the offset per account in meta; a shared self.count would be
            # advanced by every account at once and skip pages
            next_offset=offset+10
            yield Request(url=self.url.format(biz=biz,index=next_offset),headers=headers,callback=self.parse,dont_filter=True,meta={"biz":biz,"headers":headers,"offset":next_offset})
        else:
            self.logger.info("no more messages for biz=%s",biz)
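Stripped of Scrapy, the pagination logic in the spider comes down to "request an offset, decode the doubly encoded list, continue while can_msg_continue is 1". The sketch below shows just that loop. `fetch_page` is a hypothetical callable that returns the raw JSON text for a given offset, and `fake_fetch` fabricates three pages purely so the example can run; the `"list"` key inside it is illustrative.

```python
import json


def iter_message_pages(fetch_page, start_offset=10, step=10):
    """Yield each decoded general_msg_list until the server says to stop."""
    offset = start_offset
    while True:
        payload = json.loads(fetch_page(offset))
        # general_msg_list is a JSON string embedded in the JSON response
        yield json.loads(payload["general_msg_list"])
        if payload.get("can_msg_continue") != 1:
            break
        offset += step


def fake_fetch(offset):
    """Stand-in for the HTTP call: pages at offsets 10 and 20 continue, 30 ends."""
    inner = json.dumps({"list": [{"offset": offset}]})
    return json.dumps({
        "general_msg_list": inner,
        "can_msg_continue": 1 if offset < 30 else 0,
    })


pages = list(iter_message_pages(fake_fetch))
print(len(pages))
```

Keeping the offset as a local variable per account is the same idea as passing it through `meta` in the spider: each account advances its own cursor.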

The retrieved data looks like this:

https://user-gold-cdn.xitu.io/2019/5/18/16ac8a25c61d522b?w=1084&h=492&f=png&s=16583

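Another way to get wap_sid2

While crawling, you sometimes capture a redirected page and want its response headers, but the response cookies are marked read only. To get the permission value from there, you can add a rule in Fiddler that saves the response to a file. For the WeChat crawler I first tried to get the permission value this way, and then realized the only reason I could not get it directly was that I had left out the x-wechat-key and x-wechat-uin request headers. So this technique is not actually needed in this project. It is still a useful way to capture the response headers of a page that sets cookies dynamically and then redirects to a new page, for example mp.weixin.qq.com/mp/profile_…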

https://user-gold-cdn.xitu.io/2019/5/18/16ac8a25c62088b2?w=1240&h=633&f=png&s=290729

Add the following code in Fiddler. It writes a 2.txt file to the desktop containing the returned response headers:

static function OnBeforeResponse(oSession: Session) {
   if (oSession.HostnameIs("mp.weixin.qq.com") && oSession.uriContains("/mp/profile_ext?action=home")) {
       oSession["ui-color"] = "orange";
       oSession.SaveResponse("C:\\Users\\Administrator\\Desktop\\2.txt",false);
       //oSession.SaveResponseBody("C:\\Users\\Administrator\\Desktop\\1.txt")
   }
   if (m_Hide304s && oSession.responseCode == 304) {
       oSession["ui-hide"] = "true";
   }
}
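Once the rule has written the file, the cookies can be read back with a few lines of Python. This sketch assumes the saved file is a raw HTTP response (status line, headers, blank line, body), which is what the rule above is expected to produce; I have not verified the exact encoding Fiddler writes, and `set_cookies_from_saved_response` is a hypothetical helper name.

```python
SAVED = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Set-Cookie: wxuin=1563216401; Path=/; HttpOnly\r\n"
    "Set-Cookie: wap_sid2=CJGUs+kFElxE; Path=/; HttpOnly\r\n"
    "\r\n"
    "<html></html>"
)


def set_cookies_from_saved_response(raw_text):
    """Return {cookie name: value} from the header part of a saved response."""
    head, _, _body = raw_text.replace("\r\n", "\n").partition("\n\n")
    cookies = {}
    for line in head.split("\n")[1:]:            # skip the status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "set-cookie":
            # keep only "name=value", dropping Path/HttpOnly attributes
            key, _, val = value.strip().split(";")[0].partition("=")
            cookies[key.strip()] = val
    return cookies


print(set_cookies_from_saved_response(SAVED))
```

For a real file, read it first, for example with `open(path, encoding="utf-8", errors="replace").read()`, and pass the text to the function.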

Data captured in Fiddler

GitHub repository:
github.com/Harhao/wech…
Reference article:
WeChat client official account crawler
