Simulating a Sina Weibo login

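For project reasons, my boss assigned me to crawl some information about Weibo users. As a Python non-expert with almost zero crawler experience, I set off down a long road of research...

After getting to know the basic crawler tools and the well-known framework Scrapy,

I still decided to write a login script myself, using the scripts of various experts online as a reference...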

 

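Environment

Tools

1. Chrome and its developer tools

2. Charles (a Mac alternative to Fiddler; it is paid software, although cracked copies circulate online, and it is much more comfortable to use than the Mac build of Fiddler)

3. Python 3.6

4. PyCharm

While researching, I found that because the Weibo login goes through several redirects, many experts recommend enabling Preserve log in the developer tools.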

 

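Libraries used in Python 3.6


1. urllib.request, urllib.error, urllib.parse

2. re: regular expressions

3. rsa, base64

4. json

5. binascii: hex-encodes the encrypted data

PS: I am using the libraries bundled with Anaconda, and rsa had to be installed separately with pip. (base64 is part of the Python standard library, so it needs no separate install.)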

 

System

Mac OS 10.13.2

 

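Logging in to weibo.com

Once I am logged in to Weibo, a push_count.json request shows up every so often. When we click into the username field, a prelogin.php request appears, and that caught our attention.

Open it and you will find some very suspicious things, such as su.

Let's try decoding it with base64: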

import base64
print(base64.b64decode('MzU4NTEwMjQ5JTQwcXEuY29t'))

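Output: b'358510249%40qq.com'

Sure enough, it's the username!!!

Note that a username may contain characters such as @. In the su value we just decoded, @ has become %40, which is simply URL encoding. In other words, su is URL-encoded and then base64-encoded, not really encrypted.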
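As a quick check of that two-step encoding, here is a minimal sketch that builds su from a plain username and reverses it again. The helper names encode_su and decode_su are made up for illustration; only the standard library is used.

```python
import base64
from urllib.parse import quote, unquote

def encode_su(username: str) -> str:
    """URL-encode the username, then base64-encode the result."""
    url_encoded = quote(username)  # '@' becomes '%40'
    return base64.b64encode(url_encoded.encode("utf-8")).decode("ascii")

def decode_su(su: str) -> str:
    """Reverse the two steps: base64-decode, then URL-decode."""
    return unquote(base64.b64decode(su).decode("utf-8"))

print(encode_su("358510249@qq.com"))
print(decode_su("MzU4NTEwMjQ5JTQwcXEuY29t"))
```

The first call should reproduce the su value seen in the captured request, which suggests the browser-side script does the same two steps.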

 

Then, for easier reading, let's switch to Charles and look at the body of prelogin.php:

sinaSSOController.preloginCallBack({
    "retcode": 0,
    "servertime": 1515836591,
    "pcid": "gz-cd9bccf44f515b8765496d8694e51ba7c996",
    "nonce": "JLT53P",
    "pubkey": "EB2A38568661887FA180BDDB5CABD5F21C7BFD59C090CB2D245A87AC253062882729293E5506350508E7F9AA3BB77F4333231490F915F6D63C55FE2F08A49B353F444AD3993CACC02DB784ABBB8E42A9B1BBFFFB38BE18D78E87A0E41B9B8F73A928EE0CCEE1F6739884B9777E4FE9E88A1BBE495927AC4A799B3181D6442443",
    "rsakv": "1330428213",
    "is_openlock": 0,
    "showpin": 0,
    "exectime": 5
})

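At a glance, wow, it looks a lot like JSON. Happy!

The useful fields seem to be servertime, nonce, rsakv and this long, long pubkey... what on earth is that!!

A quick search: oh great, asymmetric encryption, how lovely... NOT!!! QAQ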
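The body is JSON wrapped in a callback (JSONP), so it can be turned into a dict by stripping the wrapper first. A minimal sketch, assuming the response always has the shape name({...}); parse_jsonp is a hypothetical helper, and the sample is a trimmed copy of the capture above with pubkey left out for brevity.

```python
import json
import re

def parse_jsonp(body: str) -> dict:
    """Strip the callback wrapper name({...}) and parse the JSON inside."""
    # DOTALL lets '.' span line breaks, since the body may be pretty-printed.
    match = re.search(r"\((.*)\)", body, re.DOTALL)
    if match is None:
        raise ValueError("no JSONP payload found")
    return json.loads(match.group(1))

sample = '''sinaSSOController.preloginCallBack({
    "retcode": 0,
    "servertime": 1515836591,
    "nonce": "JLT53P",
    "rsakv": "1330428213"
})'''

args = parse_jsonp(sample)
print(args["servertime"], args["nonce"], args["rsakv"])
```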

 

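Logging in to Weibo

Here we need Charles to capture the chain of redirects.

In that capture, our target is the POST form that appears right after prelogin: login.php?client=ssologin.js(v1.4.19)

Look at what it contains: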

entry:weibo
gateway:1
from:
savestate:7
qrcode_flag:false
useticket:1
pagerefer:https://login.sina.com.cn/crossdomain2.php?action=logout&r=https%3A%2F%2Fweibo.com%2Flogout.php%3Fbackurl%3D%252F%252Fs.weibo.com
vsnf:1
su:MzU4NTEwMjQ5JTQwcXEuY29t
service:miniblog
servertime:1515895583
nonce:JLT53P
pwencode:rsa2
rsakv:1330428213
sp:02ca1b627293c21e098882de3e276def93654ffba9817d0d95174b11c403e46e8016bf66ed421198fffaaa691fb0c9d03d45da676de0282a30aef899855262e09164dfef35eb6820ba017ecf8f437643fe94eaf0632095ffcc647ada27b23c9ed1b1c8f7d1d87ce2c69ed4f9997fb9283c42622c677dbecfe60a802f4b621ee3
sr:1680*1050
encoding:UTF-8
prelt:31
url:https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack
returntype:META

It's easy to see that su is the username and sp is the password, and that sp has clearly been RSA-encrypted.

 

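To work out how the encrypted sp is produced, we first need to analyze the JS.

First, at login time there is the POST form login.php?client=ssologin.js(v1.4.19), followed by a file called ssologin.js. Open it and we find a dense wall of code.

From what we already know (RSA encryption and a parameter called pubkey), a quick search in that file immediately turns up the information we want.

There, 10001 is the exponent used for the RSA encryption. Note that it is hexadecimal, so we still need to convert it to decimal.

The other piece of information is how our password is handled: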
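The conversion itself is one call to int with base 16. A tiny worked example:

```python
# '10001' read as hexadecimal is 1 * 16**4 + 1 = 65537, the usual RSA public exponent.
rsa_e = int("10001", 16)
print(rsa_e)  # 65537
print(rsa_e == 0x10001 == 2 ** 16 + 1)  # True
```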

password=RSAKey.encrypt([me.servertime,me.nonce].join("\t")+"\n"+password)}

The corresponding Python encryption code is as follows:

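import binascii
import rsa  # third-party package: pip install rsa

def get_encrypted_pw(password, servertime, nonce, pubkey):
    # Public exponent from ssologin.js; '10001' is hexadecimal (65537).
    rsa_e = int('10001', 16)
    # Same layout as the JS: servertime, tab, nonce, newline, password.
    pw_string = str(servertime) + '\t' + str(nonce) + '\n' + str(password)
    # pubkey is the hex modulus returned by prelogin.php.
    key = rsa.PublicKey(int(pubkey, 16), rsa_e)
    pw_encrypted = rsa.encrypt(pw_string.encode('utf-8'), key)
    # Hex-encode the ciphertext bytes, which is the form the sp field uses.
    passwd = binascii.b2a_hex(pw_encrypted)
    print(passwd)
    return passwd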

 

Final code

# Import the required modules
import base64
import binascii
import http.cookiejar  # formerly cookielib
import json
import re
import urllib.error
import urllib.parse
import urllib.request

import rsa  # third-party package: pip install rsa


# Define the Launcher class
class Launcher():
    # Store the username and password
    def __init__(self, username, password):
        self.username = username
        self.password = password

    # get_encrypted_name returns the base64-encoded username
    def get_encrypted_name(self):
        # URL-encode the string first ('@' becomes '%40')
        username_urllike = urllib.parse.quote(self.username)
        username_encrypted = base64.b64encode(bytes(username_urllike, encoding='utf-8'))
        return username_encrypted.decode('utf-8')  # convert the bytes object to str

    def get_prelogin_args(self):
        '''
        Simulate the pre-login request and return the nonce, servertime,
        pubkey, etc. sent back by the server as a dict (None on failure).
        '''
        json_pattern = re.compile(r'\((.*)\)')
        url = 'http://login.sina.com.cn/sso/prelogin.php?entry=weibo&callback=sinaSSOController.preloginCallBack&su=' + self.get_encrypted_name() + '&rsakt=mod&client=ssologin.js(v1.4.19)'
        try:
            request = urllib.request.Request(url)
            response = urllib.request.urlopen(request)
            raw_data = response.read().decode('utf-8')
            # Use the regex to pull the JSON out of the callback wrapper
            json_data = json_pattern.search(raw_data).group(1)
            # Parse the JSON into a dict
            data = json.loads(json_data)
            # print(data)
            return data
        except urllib.error.URLError as e:
            # urllib.error is a module; URLError (and its subclass HTTPError) is the exception
            print(e.reason)
            return None

    # get_encrypted_pw builds the RSA-encrypted password from the prelogin data
    def get_encrypted_pw(self, data):
        rsa_e = int('10001', 16)  # 0x10001
        pw_string = str(data['servertime']) + '\t' + str(data['nonce']) + '\n' + str(self.password)
        key = rsa.PublicKey(int(data['pubkey'], 16), rsa_e)
        pw_encrypted = rsa.encrypt(pw_string.encode('utf-8'), key)
        self.password = ''  # clear the plaintext password for safety
        passwd = binascii.b2a_hex(pw_encrypted)
        print(passwd)
        return passwd

    def enableCookies(self):
        # Create a cookie container
        cookie_container = http.cookiejar.CookieJar()
        # Bind the cookie container to an HTTP cookie processor
        cookie_support = urllib.request.HTTPCookieProcessor(cookie_container)
        # Build an opener with a handler for opening HTTP URLs
        opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
        # Install the opener; later urlopen() calls will use it
        urllib.request.install_opener(opener)

    # build_post_data packs the data needed for the POST request
    def build_post_data(self, raw):
        post_data = {
            "entry": "weibo",
            "gateway": "1",
            "from": "",
            "savestate": "7",
            "qrcode_flag": 'false',
            "useticket": "1",
            "pagerefer": "https://login.sina.com.cn/crossdomain2.php?action=logout&r=https%3A%2F%2Fweibo.com%2Flogout.php%3Fbackurl%3D%252F",
            "vsnf": "1",
            "su": self.get_encrypted_name(),
            "service": "miniblog",
            "servertime": raw['servertime'],
            "nonce": raw['nonce'],
            "pwencode": "rsa2",
            "rsakv": raw['rsakv'],
            "sp": self.get_encrypted_pw(raw),
            "sr": "1680*1050",
            "encoding": "UTF-8",
            "prelt": "194",
            "url": "https://weibo.com/ajaxlogin.php?framelogin=1&callback=parent.sinaSSOController.feedBackUrlCallBack",
            "returntype": "META"
        }
        data = urllib.parse.urlencode(post_data).encode('utf-8')
        return data

    # Log in; note that three redirects have to be followed here
    def login(self):
        url = 'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19)'
        self.enableCookies()
        data = self.get_prelogin_args()
        if data is None:
            print('Login error!')
            return 0
        post_data = self.build_post_data(data)
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
        }
        try:
            request = urllib.request.Request(url=url, data=post_data, headers=headers)
            response = urllib.request.urlopen(request)
            # Decoding as utf-8 gave ugly output in which "GBK" could be made out,
            # so the page is decoded as GBK directly.
            html = response.read().decode('GBK')
            # print(html)
        except urllib.error.URLError as e:
            print(e.reason)
            return 0

        p = re.compile(r'location\.replace\("(.*?)"\)')
        p2 = re.compile(r"location\.replace\('(.*?)'\)")
        p3 = re.compile(r'"userdomain":"(.*?)"')
        try:
            # First redirect: URL in double quotes
            login_url = p.search(html).group(1)
            request = urllib.request.Request(login_url)
            response = urllib.request.urlopen(request)
            page = response.read().decode('GBK')
            # print(page)
            # Second redirect: URL in single quotes
            login_url2 = p2.search(page).group(1)
            request = urllib.request.Request(login_url2)
            response = urllib.request.urlopen(request)
            page2 = response.read().decode('utf-8')
            # print(page2)
            # Third redirect: append userdomain to the site root
            login_url = 'http://weibo.com/' + p3.search(page2).group(1)
            request = urllib.request.Request(login_url)
            response = urllib.request.urlopen(request)
            final = response.read().decode('utf-8')
            print(final)

            print("Login success!")
        except Exception:
            print('Login error!')
            return 0

 

It is worth noting that in the final login step we first try posting the form directly and look at what comes back.

def login(self):
    url = 'https://login.sina.com.cn/sso/login.php?client=ssologin.js(v1.4.19)'
    self.enableCookies()
    data = self.get_prelogin_args()
    post_data = self.build_post_data(data)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    try:
        request = urllib.request.Request(url=url, data=post_data, headers=headers)
        response = urllib.request.urlopen(request)
        # Decoding as utf-8 gave ugly output in which "GBK" could be made out,
        # so the page is decoded as GBK directly.
        html = response.read().decode('GBK')
        print(html)
        print('-------------------------')
    except urllib.error.URLError as e:
        # urllib.error is a module; catch its URLError exception class instead
        print(e.reason)

Great, what we get back is indeed a pile of strange stuff!!

<html>
<head>
<title>新浪通行證</title>
<meta http-equiv="refresh" content="0; url=&#39;https://login.sina.com.cn/crossdomain2.php?action=login&entry=weibo&r=https%3A%2F%2Fpassport.weibo.com%2Fwbsso%2Flogin%3Fssosavestate%3D1547533996%26url%3Dhttps%253A%252F%252Fweibo.com%252Fajaxlogin.php%253Fframelogin%253D1%2526callback%253Dparent.sinaSSOController.feedBackUrlCallBack%26display%3D0%26ticket%3DST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-5F219439701347BC4686F4A3E10C79C9-1%26retcode%3D0&sr=1680%2A1050&#39;"/>
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
</head>
<body bgcolor="#ffffff" text="#000000" link="#0000cc" vlink="#551a8b" alink="#ff0000">
<script type="text/javascript" language="javascript">
location.replace("https://login.sina.com.cn/crossdomain2.php?action=login&entry=weibo&r=https%3A%2F%2Fpassport.weibo.com%2Fwbsso%2Flogin%3Fssosavestate%3D1547533996%26url%3Dhttps%253A%252F%252Fweibo.com%252Fajaxlogin.php%253Fframelogin%253D1%2526callback%253Dparent.sinaSSOController.feedBackUrlCallBack%26display%3D0%26ticket%3DST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-5F219439701347BC4686F4A3E10C79C9-1%26retcode%3D0&sr=1680%2A1050");
</script>
</body>
</html>

 

According to the experts' explanations, this is redirect code, and the redirect URL is written after location.replace, so we need a regular expression to extract that URL.

# Match the double-quoted URL passed to location.replace(...)
p = re.compile(r'location\.replace\("(.*?)"\)')
try:
    login_url = p.search(html).group(1)
    request = urllib.request.Request(login_url)
    response = urllib.request.urlopen(request)
    page = response.read().decode('GBK')
    print(page)
except Exception:
    print('Login error!')
    return 0

Let's look at the result:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=GBK" />
<title>新浪通行證</title>


<script charset="utf-8" src="https://i.sso.sina.com.cn/js/ssologin.js"></script>
</head>
<body>
正在登陸 ...
<script>
try{sinaSSOController.setCrossDomainUrlList({"retcode":0,"arrURL":["https:\/\/passport.weibo.com\/wbsso\/login?ticket=ST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-0D1D8222688249D4F950E05810AD22DD-1&ssosavestate=1547533996","https:\/\/passport.97973.com\/sso\/crossdomain?action=login&savestate=1547533996","https:\/\/passport.krcom.cn\/sso\/crossdomain?service=krvideo&savestate=1&ticket=ST-MjQ2Nzk2MDk3Mg%3D%3D-1515997996-gz-94884ABE2B9E4113CE7B809F4B5C92DC-1&ssosavestate=1547533996","https:\/\/passport.weibo.cn\/sso\/crossdomain?action=login&savestate=1"]});}
        catch(e){
            var msg = e.message;
            var img = new Image();
            var type = 1;
            img.src = 'https://login.sina.com.cn/sso/debuglog?msg=' + msg +'&type=' + type;
        }try{sinaSSOController.crossDomainAction('login',function(){location.replace('https://passport.weibo.com/wbsso/login?ssosavestate=1547533996&url=https%3A%2F%2Fweibo.com%2Fajaxlogin.php%3Fframelogin%3D1%26callback%3Dparent.sinaSSOController.feedBackUrlCallBack&display=0&ticket=ST-MjQ2Nzk2MDk3Mg==-1515997996-gz-5F219439701347BC4686F4A3E10C79C9-1&retcode=0');});}
        catch(e){
            var msg = e.message;
            var img = new Image();
            var type = 2;
            img.src = 'https://login.sina.com.cn/sso/debuglog?msg=' + msg +'&type=' + type;
        }
</script>
</body>
</html>

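Great, yet another pile of strange stuff! (ノಠ益ಠ)ノ彡┻━┻ But look closely: doesn't it seem familiar??

OMG!! location.replace again!!! Only this time the link after it seems to be passport.weibo.com. Ooh, doesn't that look a lot like the real login~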

 

Without further ado, let's pull this URL out with a regular expression first!

# Same idea as before, but this URL is wrapped in single quotes
p2 = re.compile(r"location\.replace\('(.*?)'\)")
try:
    login_url2 = p2.search(page).group(1)
    request = urllib.request.Request(login_url2)
    response = urllib.request.urlopen(request)
    page2 = response.read().decode('utf-8')
    print(page2)
except Exception:
    print('Login error!')
    return 0
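The two patterns used so far differ only in the quote character. As an aside, here is a sketch of a single pattern that accepts either style; find_redirect is a hypothetical helper and the example.com URLs are placeholders, not values from the capture.

```python
import re

# Group 1 captures the opening quote; the backreference \1 requires the
# same character to close the string, so both quote styles are accepted.
REDIRECT = re.compile(r"""location\.replace\((['"])(.*?)\1\)""")

def find_redirect(html: str):
    """Return the URL passed to location.replace(), or None if absent."""
    match = REDIRECT.search(html)
    return match.group(2) if match else None

print(find_redirect('location.replace("https://example.com/a?x=1");'))
print(find_redirect("location.replace('https://example.com/b');"))
```

Returning None instead of raising keeps the caller's failure check explicit, which avoids relying on a broad except.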

I thought this time it was in the bag, but what I actually got was...

<html><head><script language='javascript'>parent.sinaSSOController.feedBackUrlCallBack({"result":true,"userinfo":{"uniqueid":"2467960972","userid":null,"displayname":null,"userdomain":"?wvr=5&lf=reg"}});</script></head><body></body></html>

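Heh, yet another redirect = =

This time, though, it is easy to spot the field "?wvr=5&lf=reg", which looks very familiar. Checking the packets captured from the manual login confirms it: this is part of the final URL.

So write one more regular expression to pull that field out, concatenate it into the final URL, and the simulated login goes through easily and happily!

That's all~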
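That last step can be sketched on its own. This version parses the JSON argument of the callback rather than matching the userdomain field directly, which is a swapped-in variation on the regex approach described above. The helper name final_url is made up, the base URL follows the one used in the final code, and page2 is the response shown earlier.

```python
import json
import re

def final_url(page: str, base: str = "http://weibo.com/") -> str:
    """Read the JSON passed to feedBackUrlCallBack(...) and append userdomain to base."""
    match = re.search(r"feedBackUrlCallBack\((\{.*\})\)", page, re.DOTALL)
    if match is None:
        raise ValueError("callback payload not found")
    info = json.loads(match.group(1))
    if not info.get("result"):
        raise RuntimeError("login was not accepted")
    return base + info["userinfo"]["userdomain"]

page2 = ('<html><head><script language=\'javascript\'>'
         'parent.sinaSSOController.feedBackUrlCallBack({"result":true,'
         '"userinfo":{"uniqueid":"2467960972","userid":null,'
         '"displayname":null,"userdomain":"?wvr=5&lf=reg"}});'
         '</script></head><body></body></html>')

print(final_url(page2))
```

Checking the result flag before building the URL gives an early, explicit failure if the server rejects the login.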
