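Many WeChat official accounts publish high-quality articles, and I wanted a crawler that fetches every article from the accounts I like. Fetching all of an account's articles requires two key parameters: the account's unique ID (__biz) and wap_sid2, the permission value that grants access to a single account's articles. The approach is described below.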
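To crawl an official account, we first have to identify it uniquely, so we need its id value (that is, __biz). In most of the articles I read, __biz is obtained mechanically, by copying it by hand. Sogou search is now integrated with WeChat official accounts, which gives us a better route: the source of an account's page contains that account's __biz value. However, Sogou limits what it shows for an account to the 10 most recent articles, so we use Sogou for two things only: obtaining the __biz value, and obtaining the list of official accounts that match an arbitrary search keyword.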
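Below is the Sogou URL for searching official accounts. The value python of the query parameter is the search keyword; the other parameters can stay unchanged.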
http://weixin.sogou.com/weixin?type=1&s_from=input&query=python&ie=utf8&_sug_=n&_sug_type_=
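The search URL can also be assembled per keyword. The following is a minimal sketch using only the standard library; the function name build_search_url is hypothetical, and the parameter values are simply copied from the URL above.

```python
from urllib.parse import urlencode

SOGOU_SEARCH = "http://weixin.sogou.com/weixin"


def build_search_url(keyword):
    """Return the Sogou official-account search URL for one keyword."""
    params = {
        "type": "1",        # kept exactly as in the URL above
        "s_from": "input",
        "query": keyword,   # the only value that changes between searches
        "ie": "utf8",
        "_sug_": "n",
        "_sug_type_": "",
    }
    # urlencode percent-encodes non-ASCII keywords as UTF-8, matching ie=utf8
    return SOGOU_SEARCH + "?" + urlencode(params)


print(build_search_url("python"))
```

Passing the keyword through urlencode means a Chinese keyword is escaped correctly without any manual work.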
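In the page source, each official account's link is the href attribute of an a tag under the element whose id is sogou_vr_11002301_box_n (n is an integer such as 1, 2, 3). It can be selected with the XPath expression below, substituting the values of n in order: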
//*[@id="sogou_vr_11002301_box_n"]/div/div[2]/p[1]/a
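As a rough illustration of walking the values of n, the sketch below applies the same path to a small, well-formed sample with the standard library's ElementTree. The sample markup, the helper name account_links, and the choice to probe n from 0 upward are all assumptions; a real Sogou page is not well-formed XML, so in a spider the expression would be evaluated with an HTML-tolerant selector (for example Scrapy's response.xpath) instead.

```python
import xml.etree.ElementTree as ET

# Same path as above, written in the XPath subset ElementTree understands
LINK_XPATH = ".//*[@id='sogou_vr_11002301_box_{n}']/div/div[2]/p[1]/a"

# Hypothetical, simplified stand-in for the search result markup
SAMPLE = """<ul>
<li id="sogou_vr_11002301_box_0"><div><div>logo</div><div>
  <p><a href="http://mp.weixin.qq.com/profile?src=3&amp;ver=1">first</a></p>
  <p>intro</p></div></div></li>
<li id="sogou_vr_11002301_box_1"><div><div>logo</div><div>
  <p><a href="http://mp.weixin.qq.com/profile?src=3&amp;ver=2">second</a></p>
</div></div></li>
</ul>"""


def account_links(root, max_n=10):
    """Collect the href of every account link found for n in [0, max_n)."""
    links = []
    for n in range(max_n):
        anchor = root.find(LINK_XPATH.format(n=n))
        if anchor is not None and anchor.get("href"):
            links.append(anchor.get("href"))
    return links


print(account_links(ET.fromstring(SAMPLE)))
```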
The address of a single official account looks like this:
http://mp.weixin.qq.com/profile?src=3&timestamp=1508003829&ver=1&signature=Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==
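Open the account's link, fetch the page source, and read the official account's id value from it:
// The biz value is the official account's unique id. Most of the surrounding code is omitted; this snippet sits inside a script tag. The same script also holds the data for the 10 most recent articles, so if you only want those, you can extract them directly with a regular expression.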
var biz = "MzIwNDA1OTM4NQ==" || "";
var src = "3" ;
var ver = "1" ;
var timestamp = "1508003829" ;
var signature = "Eu9LOYSA47p6WE0mojhMtFR-gSr7zsQOYo6*w5VxrUgy7RbCsdkuzfFQ1RiSgM3i9buMZPrYzmOne6mJxCtW*g==" ;
var name="python6359"||"python";
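Pulling biz out of that script can be done with a regular expression. This is a minimal sketch, assuming the declaration keeps the `var biz = "..."` shape shown above; the names BIZ_PATTERN and extract_biz are hypothetical.

```python
import re

# Matches: var biz = "MzIwNDA1OTM4NQ==" || "";
BIZ_PATTERN = re.compile(r'var\s+biz\s*=\s*"([^"]*)"')


def extract_biz(page_source):
    """Return the __biz value found in the page source, or None if absent."""
    match = BIZ_PATTERN.search(page_source)
    if match and match.group(1):
        return match.group(1)
    return None


sample = 'var biz = "MzIwNDA1OTM4NQ==" || "";\nvar src = "3" ;'
print(extract_biz(sample))
```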
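Once we have the account's id value, the next step is to obtain wap_sid2 (the permission value for a single account's articles). This value comes from the WeChat client, so we capture it with the Fiddler packet-capture tool. If you do not know how to set up the capture environment, see the article "Capturing Mobike data packets with Fiddler". The captured request is: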
GET /mp/profile_ext?action=home&__biz=MjM5MDI1ODUyMA==&scene=124&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&a8scene=3&fontScale=100&pass_ticket=ji%2B3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo%2Bgun%2By2V3lxc34GQy3W5u8mE&wx_header=1 HTTP/1.1
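The captured request can be rebuilt for any account by swapping in its __biz. The sketch below is only an illustration: build_home_url is a hypothetical helper, the fixed parameters are copied from the capture above, and it assumes pass_ticket is supplied in its raw form (with literal + characters), which is then percent-encoded as in the captured URL.

```python
from urllib.parse import urlencode

PROFILE_EXT = "https://mp.weixin.qq.com/mp/profile_ext"


def build_home_url(biz, pass_ticket):
    """Rebuild the action=home request URL for one official account."""
    params = {
        "action": "home",
        "__biz": biz,
        "scene": "124",
        "devicetype": "iOS10.0.1",
        "version": "16051220",
        "lang": "zh_CN",
        "nettype": "WIFI",
        "a8scene": "3",
        "fontScale": "100",
        "pass_ticket": pass_ticket,  # raw value; '+' is encoded as %2B
        "wx_header": "1",
    }
    # safe="=" keeps the base64 padding of __biz readable, as in the capture
    return PROFILE_EXT + "?" + urlencode(params, safe="=")


print(build_home_url("MjM5MDI1ODUyMA==", "ji+3JbA2"))
```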
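The corresponding request headers are shown below. x-wechat-key changes every so often, so it has to be refreshed periodically; X-WECHAT-UIN can stay the same, and pass_ticket can also be left unchanged for a while: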
'Host':'mp.weixin.qq.com',
# 'X-WECHAT-KEY': 'a83687cde3ca46be517cdbcba60732159f229a03507e9afa1e0dfee00e3cf00562aee022e84b9011924fdbb0c7af8c647c33b1338b11ebdc8893d5df41dd34a536e1af5b48d15c87b4aef629ad8685f3',
'X-WECHAT-KEY': '33c1fdebcfc1d1ecd9df5003dc9d9ccb6a1f5458eb704e58a05e80c73e8793dede6b52115a74a515d4d12c9a6f2d8f00238afe17cca3635d80d661a612a4a0bf48a2547516b12030efd8a224548636d2',
'X-WECHAT-UIN':'MTU2MzIxNjQwMQ%3D%3D',
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
'Accept-Language':'zh-cn',
'Accept-Encoding':'gzip, deflate',
'Connection':'keep-alive',
'Cookie':'wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208',
'Referer':"https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MzI5MTQ1Mg==&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&ascene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"
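The response headers returned for the request URL above set wap_sid2, the permission value for fetching a single account's articles. What we need is the wap_sid2 value in the Set-Cookie headers: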
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Cache-Control: no-cache, must-revalidate
Strict-Transport-Security: max-age=15552000
Set-Cookie: wxuin=1563216401; Path=/; HttpOnly
Set-Cookie: pass_ticket=ji+3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo+gun+y2V3lxc34GQy3W5u8mE; Path=/; HttpOnly
Set-Cookie: wap_sid2=CJGUs+kFElxER01KN1ZkVElJMUdhTktDUUk2LUZHNkFwT1Rzc1EwUWpWaW5ZMHlFQi15cUo1VWFjamNLM3pjdzNCbDc2ZFZpOW0xeDdPb0czWXNuQUdmbVdyOFZiNTREQUFBfjC+7YvPBTgMQJRO; Path=/; HttpOnly
Connection: keep-alive
Content-Length: 37211
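Because the response carries several Set-Cookie headers, it helps to pick the cookie by name instead of by position. A minimal sketch follows; cookie_from_set_cookie is a hypothetical helper, and the sample values are the ones from the response above.

```python
def cookie_from_set_cookie(set_cookie_values, name):
    """Return the value of the named cookie from a list of Set-Cookie values."""
    for value in set_cookie_values:
        # The first "name=value" pair precedes attributes such as Path
        pair = value.split(";", 1)[0].strip()
        # partition splits at the first '=', so '=' inside the value survives
        key, sep, cookie_value = pair.partition("=")
        if sep and key == name:
            return cookie_value
    return None


SET_COOKIES = [
    "wxuin=1563216401; Path=/; HttpOnly",
    "pass_ticket=ji+3JbA2NNExGwdNCoIa91sbgwDmSmHsdZhHP5eo+gun+y2V3lxc34GQy3W5u8mE; Path=/; HttpOnly",
    "wap_sid2=CJGUs+kFElxER01KN1ZkVElJMUdhTktDUUk2LUZHNkFwT1Rzc1EwUWpWaW5ZMHlFQi15cUo1VWFjamNLM3pjdzNCbDc2ZFZpOW0xeDdPb0czWXNuQUdmbVdyOFZiNTREQUFBfjC+7YvPBTgMQJRO; Path=/; HttpOnly",
]
print(cookie_from_set_cookie(SET_COOKIES, "wap_sid2"))
```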
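(Figure: the wap_sid2 permission value)
With the account id __biz and the permission value wap_sid2 in hand, we can construct the requests for the article list. The MongoDB operations in the spider below read the stored account ids; the spider then fetches the wap_sid2 value for each id and stores the id and its wap_sid2 together in the database.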
# -*- coding: utf-8 -*-
import re

from scrapy import Spider, Request

from .mongo import MongoOperate
from wechatSpider.items import GetsessionspiderItem
from .settings import *


class GetsessionSpider(Spider):
    name = "getSession"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    headers = {
        'Host':'mp.weixin.qq.com',
        # 'X-WECHAT-KEY': 'a83687cde3ca46be517cdbcba60732159f229a03507e9afa1e0dfee00e3cf00562aee022e84b9011924fdbb0c7af8c647c33b1338b11ebdc8893d5df41dd34a536e1af5b48d15c87b4aef629ad8685f3',
        'X-WECHAT-KEY': '33c1fdebcfc1d1ecd9df5003dc9d9ccb6a1f5458eb704e58a05e80c73e8793dede6b52115a74a515d4d12c9a6f2d8f00238afe17cca3635d80d661a612a4a0bf48a2547516b12030efd8a224548636d2',
        'X-WECHAT-UIN':'MTU2MzIxNjQwMQ%3D%3D',
        'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent':'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
        'Accept-Language':'zh-cn',
        'Accept-Encoding':'gzip, deflate',
        'Connection':'keep-alive',
        'Cookie':'wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208',
        'Referer':"https://mp.weixin.qq.com/mp/getmasssendmsg?__biz=MjM5MzI5MTQ1Mg==&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&ascene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"
    }
    # History message list; we need to capture `wap_sid2` here to gain access
    url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={biz}&scene=124&devicetype=iOS10.0.1&version=16051220&lang=zh_CN&nettype=WIFI&a8scene=3&fontScale=100&pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU&wx_header=1"

    def start_requests(self):
        # Read the stored official-account ids (__biz) from MongoDB
        mongo = MongoOperate(MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS, WECHATID)
        mongo.connect()
        for item in mongo.finddata():
            biz = item["wechatID"]
            # Requests go through a local proxy (127.0.0.1:8888 is Fiddler's default address)
            yield Request(url=self.url.format(biz=biz), dont_filter=True, headers=self.headers,
                          callback=self.parse, meta={"proxy": "http://127.0.0.1:8888", "biz": biz})

    def parse(self, response):
        biz = response.meta["biz"]
        # The response carries several Set-Cookie headers; pick wap_sid2 by name
        wap_sid2 = None
        for raw in response.headers.getlist("Set-Cookie"):
            match = re.match(r"\s*wap_sid2=([^;]*)", raw.decode("utf-8"))
            if match:
                wap_sid2 = match.group(1)
                break
        if wap_sid2 is None:
            self.logger.warning("no wap_sid2 cookie returned for %s", biz)
            return
        item = GetsessionspiderItem()
        item["biz"] = biz
        item["wap_sid2"] = wap_sid2
        yield item
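MongoDB now holds each official account's id value together with its wap_sid2 value. Next we construct the article requests, that is, the URL that returns the account's article list.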
# -*- coding: utf-8 -*-
import json

import scrapy
from scrapy import Request

from .mongo import MongoOperate
from .settings import *


class DataSpider(scrapy.Spider):
    name = "data"
    allowed_domains = ["mp.weixin.qq.com"]
    start_urls = ['https://mp.weixin.qq.com/']
    page_size = 10
    url = "https://mp.weixin.qq.com/mp/profile_ext?action=getmsg&__biz={biz}&f=json&offset={index}&count=10&is_ok=1&scene=124&uin=777&key=777&pass_ticket=ULeI%2BILkTLA2IpuIDqbIla4jG6zBTm1jj75UIZCgIUAFzOX29YQeTm5UKYuXU6JY&wxtoken=&appmsg_token=925_%252B4oEmoVo6AFzfOotcwPrPnBvKbEdnLNzg5mK8Q~~&x5=0&f=json"

    def start_requests(self):
        # Read the stored (biz, wap_sid2) pairs from MongoDB
        mongo = MongoOperate(MONGO_URI, MONGO_DATABASE, MONGO_USER, MONGO_PASS, RESPONSE)
        mongo.connect()
        for item in mongo.finddata():
            headers = {
                'Accept-Encoding':'gzip, deflate',
                'Connection':'keep-alive',
                'Accept':'*/*',
                'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0_1 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Mobile/14A403 MicroMessenger/6.5.18 NetType/WIFI Language/zh_CN',
                'Accept-Language': 'zh-cn',
                'X-Requested-With': 'XMLHttpRequest',
                'X-WECHAT-KEY': '62526065241838a5d44f7e7e14d5ffa3e87f079dc50a66e615fe9b6169c8fdde0f7b9f36f3897212092d73a3a223ffd21514b690dd8503b774918d8e86dfabbf46d1aedb66a2c7d29b8cc4f017eadee6',
                'X-WECHAT-UIN': 'MTU2MzIxNjQwMQ%3D%3D',
                'Cookie':';wxuin=1563216401;pass_ticket=oQDl45NRtfvQIxv2j2pYDSOOeflIXU7V3x1TUaOTpi6SkMp2B3fJwF6TE40ATCpU;ua_id=Wz1u21T8nrdNEyNaAAAAAOcFaBcyz4SH5DoQIVDcnao=;pgv_pvid=7103943278;sd_cookie_crttime=1501115135519;sd_userid=8661501115135519;3g_guest_id=-8872936809911279616;tvfe_boss_uuid=8ed9ed1b3a838836;mobileUV=1_15c8d374ca8_da9c8;pgv_pvi=8005854208'
            }
            biz = item["biz"]
            # wap_sid2 is what is actually verified; a different pass_ticket does not matter
            headers["Cookie"] = "wap_sid2=" + item["wap_sid2"] + headers["Cookie"]
            # Start at offset 10, as in the original request
            yield Request(url=self.url.format(biz=biz, index=10), headers=headers, callback=self.parse,
                          dont_filter=True, meta={"biz": biz, "headers": headers, "offset": 10})

    def parse(self, response):
        biz = response.meta["biz"]
        headers = response.meta["headers"]
        offset = response.meta["offset"]
        res = json.loads(response.text)
        # general_msg_list is itself a JSON-encoded string, so decode it a second time
        msg_list = json.loads(res["general_msg_list"])
        yield msg_list
        if res["can_msg_continue"] == 1:
            # Carry the offset in meta so that several accounts do not share one counter
            next_offset = offset + self.page_size
            yield Request(url=self.url.format(biz=biz, index=next_offset), headers=headers, callback=self.parse,
                          dont_filter=True, meta={"biz": biz, "headers": headers, "offset": next_offset})
        else:
            self.logger.info("end of article list for %s", biz)
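The response body is JSON whose general_msg_list field is itself a JSON string, so it has to be decoded twice. The sketch below shows that step in isolation. Only general_msg_list and can_msg_continue appear in the code above; the inner field names list, app_msg_ext_info, title, and content_url, as well as the helper name parse_msg_page, are assumptions about the payload and should be checked against a real response.

```python
import json


def parse_msg_page(body_text):
    """Return (articles, has_more) for one getmsg response body."""
    page = json.loads(body_text)
    # Second decode: general_msg_list is a JSON document stored as a string
    inner = json.loads(page["general_msg_list"])
    articles = []
    for entry in inner.get("list", []):          # assumed field name
        info = entry.get("app_msg_ext_info")     # assumed field name
        if not info:
            continue  # skip entries without article data
        articles.append({
            "title": info.get("title"),          # assumed field name
            "url": info.get("content_url"),      # assumed field name
        })
    return articles, page.get("can_msg_continue") == 1


# Hypothetical sample shaped like the assumed payload
sample_inner = {"list": [{"app_msg_ext_info": {"title": "t1", "content_url": "http://example.com/1"}}]}
sample_body = json.dumps({"general_msg_list": json.dumps(sample_inner), "can_msg_continue": 1})
print(parse_msg_page(sample_body))
```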
The retrieved data is shown in the figure below:
(Figure: wap_sid)
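Another approach: while crawling, you sometimes capture a redirected page and want its response headers, but the cookies in those headers are marked read-only. To read the permission value there, you can add a Fiddler rule that saves the response to a file. I first tried to obtain the permission value this way for the WeChat article crawler, then found that I had simply left out the x-wechat-key and x-wechat-uin request headers, which is why the value was not returned. So this project does not actually need the technique. It is still a useful way to get the response headers of a page that sets cookies dynamically and then redirects, for example mp.weixin.qq.com/mp/profile_…
Add the following code in Fiddler; it creates a file named 2.txt on the desktop that holds the returned response headers: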
static function OnBeforeResponse(oSession: Session) {
    if (oSession.HostnameIs("mp.weixin.qq.com") && oSession.uriContains("/mp/profile_ext?action=home")) {
        oSession["ui-color"] = "orange";
        oSession.SaveResponse("C:\\Users\\Administrator\\Desktop\\2.txt",false);
        //oSession.SaveResponseBody("C:\\Users\\Administrator\\Desktop\\1.txt")
    }
    if (m_Hide304s && oSession.responseCode == 304) {
        oSession["ui-hide"] = "true";
    }
}
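The saved file can then be read back outside Fiddler. This is a sketch under the assumption that the file starts with the HTTP status line followed by the headers; cookies_from_raw_response is a hypothetical helper, and the demo bytes are a shortened stand-in for the real file.

```python
import http.client
import io


def cookies_from_raw_response(fp):
    """Parse a saved raw HTTP response (binary file object) into a cookie dict."""
    fp.readline()                            # skip the status line, e.g. HTTP/1.1 200 OK
    headers = http.client.parse_headers(fp)  # reads up to the blank line before the body
    cookies = {}
    for value in headers.get_all("Set-Cookie", []):
        key, sep, cookie_value = value.split(";", 1)[0].strip().partition("=")
        if sep:
            cookies[key] = cookie_value
    return cookies


# Shortened, hypothetical stand-in for the contents of 2.txt
demo = io.BytesIO(
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=UTF-8\r\n"
    b"Set-Cookie: wxuin=1563216401; Path=/; HttpOnly\r\n"
    b"Set-Cookie: wap_sid2=CJGUs+kF; Path=/; HttpOnly\r\n"
    b"\r\n"
    b"<html></html>"
)
print(cookies_from_raw_response(demo))
```

For the real file, open it in binary mode, for example `open(r"C:\Users\Administrator\Desktop\2.txt", "rb")`, and pass the file object to the function.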
GitHub repository:
github.com/Harhao/wech…
Reference article:
WeChat client official-account crawler (微信客戶端公衆號爬蟲)