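At the end of every year I have a habit of sorting and archiving my files. This time I found that all my phone photos were backed up in Huawei Cloud, and after a look around the official site I could not find an official PC tool for syncing photos.
So I dug out the program I wrote last time to see whether it could still scrape the data. Sure enough, it no longer worked, because Huawei has added more verification to the login flow, such as account protection.
I captured the traffic and found that the logic had become much more complicated, with part of it wrapped inside JavaScript.
Forget it, I couldn't be bothered to work it all out. I'll just use selenium.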
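1. Use Python + selenium + a browser to log in by hand, then save the cookies and the signature information (the CSRFToken).
2. Call requests with the cookies and signature saved in step 1 and send POST requests straight to the backend to fetch the data.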
With the approach settled, time to get to work.
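The hand-off between the two steps can be sketched roughly as follows. This is a minimal illustration, not the full program: `build_headers` is a hypothetical helper name, and the cookie values are made up. It only assumes that selenium returns cookies as a list of dicts with `name` and `value` keys, which is what `get_cookies()` does.

```python
def build_headers(cookies, csrf_token, user_agent):
    """Turn selenium-style cookie dicts plus a CSRF token into request headers."""
    cookie_str = "; ".join(f"{c['name']}={c['value']}" for c in cookies)
    return {
        "User-Agent": user_agent,
        "CSRFToken": csrf_token,
        "Cookie": cookie_str,
    }

# Hypothetical values standing in for what driver.get_cookies() returns.
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123"},
    {"name": "token", "value": "xyz"},
]
headers = build_headers(sample_cookies, "csrf-demo", "Mozilla/5.0")
print(headers["Cookie"])
```

The resulting dict is what would be passed as `headers=` to a `requests` session when posting to the backend.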
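1. Python 3.6. A recent project kept running into Chinese encoding problems, which got so annoying that I upgraded my development environment to py3. It really is much more convenient.
As for moving from py2 to py3: some syntax still needs adjusting and some packages are not supported under py3, but overall the migration was smooth, and the syntax issues can mostly be solved with a quick Baidu search.
I use the Python distribution that ships with Anaconda.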
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
Type "help", "copyright", "credits" or "license" for more information.
2. selenium 3.9.0, freshly installed with conda.
conda install selenium
3. Browsers. I tried firefox, edge, chrome, and phantomjs, in the following versions:
firefox: 58.0.2 (64-bit)
edge: Microsoft Edge 41.16299.248.0
chrome: version 63.0.3239.132 (official build) (32-bit)
phantomjs: 2.1.1
Operating system: Microsoft Windows [Version 10.0.16299.248]
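4. Browser drivers:
firefox driver: https://github.com/mozilla/geckodriver/releases/ , supports Firefox 55 and later.
edge driver: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/#downloads , latest version Release 16299, Version: 5.16299, Edge version supported: 16.16299. Note that the edge driver only runs properly when the edge browser is not already running; otherwise it raises an error.
chrome driver: https://sites.google.com/a/chromium.org/chromedriver/downloads . Note that the latest version is 2.35 (not 2.9), and only 2.35 supports chrome 61-63.
phantomjs: http://phantomjs.org/download.html . phantomjs can be thought of as a browser without a UI, so the driver and the browser are one and the same.
Be sure to pick the right driver version, or you will run into all sorts of strange problems.
huaweiphoto_sele.py is as follows: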
# -*- coding: utf-8 -*-
# Created by   : zhongtang
# Created date : 2018.2.28

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.common.proxy import ProxyType
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from PIL import Image
import json,re,os,time,requests,socket

# Download helper
from huaweiphoto_py3 import HuaWei

class hwSele:
    SeleBrowser=None
    TimeOUT=30
    Headers=None
    Username='*****'
    Passwd='****'
    DriverType="Edge".lower()

    def __init__(self,ip=None,port=None,SeleDriver="Edge",SeleHeader=None):
        print (u'proxy %s %s...' %(ip,port))
        if not SeleHeader :
            self.Headers = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:56.0) Gecko/20100101 Firefox/56.0"
        else:
            self.Headers = SeleHeader
        if SeleDriver:
            self.DriverType= SeleDriver.lower()

        # The proxy is only there to make capturing traffic easier.
        if self.DriverType=='chrome' :
            chromeOptions = webdriver.ChromeOptions()
            if ip:
                chromeOptions.add_argument('--proxy-server=http://%s:%s' %(ip,port))
                self.SeleBrowser = webdriver.Chrome(chrome_options=chromeOptions)
            else:
                self.SeleBrowser = webdriver.Chrome()
        elif self.DriverType=='phantomjs':
            # Set the userAgent
            dcap = dict(DesiredCapabilities.PHANTOMJS)
            dcap["phantomjs.page.settings.userAgent"] = (self.Headers)
            self.SeleBrowser = webdriver.PhantomJS(executable_path=r'D:\python\toupiao\phantomjs\bin\phantomjs.exe',desired_capabilities=dcap)
            if ip:
                proxy=webdriver.Proxy()
                proxy.proxy_type=ProxyType.MANUAL
                proxy.http_proxy='%s:%s' %(ip,port)
                proxy.add_to_capabilities(webdriver.DesiredCapabilities.PHANTOMJS)
            else:
                self.SeleBrowser.start_session(webdriver.DesiredCapabilities.PHANTOMJS)
        elif self.DriverType=='edge':
            # For edge, kill any browser that is already running first.
            self.KillSeleProc()
            self.SeleBrowser = webdriver.Edge()
        elif self.DriverType=='firefox':
            webdriver.DesiredCapabilities.FIREFOX['firefox.page.settings.userAgent'] = self.Headers
            profile = webdriver.FirefoxProfile()
            if ip:
                # Default 0 means a direct connection; 1 means a manually configured proxy.
                profile.set_preference('network.proxy.type', 1)
                profile.set_preference('network.proxy.http', ip)
                profile.set_preference('network.proxy.http_port', port)
                profile.set_preference('network.proxy.ssl', ip)
                profile.set_preference('network.proxy.ssl_port', port)
                profile.update_preferences()
                self.SeleBrowser = webdriver.Firefox(profile)
            else:
                self.SeleBrowser = webdriver.Firefox()

        socket.setdefaulttimeout(self.TimeOUT)
        # Page-load timeout of TimeOUT seconds, similar to the timeout option of
        # requests.get(); driver.get() itself has no timeout option.
        # I once hit a case where driver.get(url) never returned and never raised,
        # which hung the program; setting a timeout solves that.
        self.SeleBrowser.set_page_load_timeout(self.TimeOUT)
        # Script timeout of TimeOUT seconds
        self.SeleBrowser.set_script_timeout(self.TimeOUT)
        # Implicit wait of TimeOUT seconds; adjust as needed
        self.SeleBrowser.implicitly_wait(self.TimeOUT)

    def KillSeleProc(self):
        command = None
        if self.DriverType=='edge':
            # Kill the edge driver and browser processes
            command = 'taskkill /F /IM MicrosoftWebDriver.exe & taskkill /F /IM MicrosoftEdge.exe'
        elif self.DriverType=='chrome':
            command = 'taskkill /F /IM chromedriver.exe & taskkill /F /IM chrome.exe'
        elif self.DriverType=='firefox':
            command = 'taskkill /F /IM geckodriver.exe & taskkill /F /IM firefox.exe'
        elif self.DriverType=="phantomjs":
            command = 'taskkill /F /IM phantomjs.exe '
        if command:
            os.system(command)

    def QuitSele(self,e,mess=None,iRet= -1):
        print (mess,e)
        if self.SeleBrowser:
            self.SeleBrowser.save_screenshot('error.png')
            self.SeleBrowser.close()
        self.KillSeleProc()
        return iRet

    def LoginHW(self):
        '''
        try:
            element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "loadedButton")))
        finally:
            print(driver.find_element_by_id("content").text)
            driver.close()

        # Wait for the page to finish loading, option 1: explicit wait
        try:
            auth_img = WebDriverWait(self.SeleBrowser, 5).until(EC.presence_of_element_located((By.ID, "randomCodeImg")))
        except Exception as e:
            print (u'Timed out loading the captcha...',e)
            SeleBrowser.save_screenshot(r'd:\python\toupiao\error.jpg')
            self.SeleBrowser.close()
            return -1

        # Wait for the page to finish loading, option 2: WebDriverWait with a lambda condition
        dr=WebDriverWait(self.SeleBrowser,20,0.5)
        dr.until(lambda the_driver:the_driver.find_element_by_xpath("//img[@id='randomCodeImg']").is_displayed())
        '''
        try:
            self.SeleBrowser.get('http://cloud.huawei.com')
        except Exception as e:
            return self.QuitSele(e,"Failed to open the home page!")

        try:
            # Wait for the page to finish loading
            dr=WebDriverWait(self.SeleBrowser,self.TimeOUT,0.5)
            dr.until(lambda the_driver:the_driver.find_element_by_id("randomCodeImg").is_displayed())
        except Exception as e:
            return self.QuitSele(e,"Timed out loading the captcha!")

        elem_user = self.SeleBrowser.find_element_by_id("login_userName")
        elem_user.clear()
        elem_user.send_keys(self.Username)
        elem_pwd = self.SeleBrowser.find_element_by_id("login_password")
        elem_pwd.clear()
        elem_pwd.send_keys(self.Passwd)

        auth_img = self.SeleBrowser.find_element_by_id("randomCodeImg")
        if not auth_img.is_displayed() :
            return self.QuitSele(None,"The captcha is not displayed properly!")

        if self.DriverType=='firefox':
            # The firefox driver can save a single element as an image directly.
            auth_img.screenshot("captcha.png")
            im = Image.open('captcha.png')
        else:
            # chrome and edge do not support that, and phantomjs still saves the whole window,
            # so take a full screenshot and crop out the captcha.
            self.SeleBrowser.save_screenshot('captcha.png')
            im = Image.open('captcha.png')
            x= eval(auth_img.get_attribute("x"))
            y= eval(auth_img.get_attribute("y"))
            width= eval(auth_img.get_attribute("width"))
            height= eval(auth_img.get_attribute("height"))
            im = im.crop((x, y, x+width, y+height))

        # Use the most primitive and most accurate method: show the image and let a human read it ^_^.
        # A third-party recognition API such as pytesseract or Tencent's image recognition API
        # would also work; it is not complicated, but I couldn't be bothered to write it.
        im.show()
        authCode= input(u'Enter the captcha: ')

        # Focus first, then set the value, then click login
        '''
        js= '$("#randomCode").attr("value","%s");$("#randomCode").trigger("onchange");' %authCode
        self.SeleBrowser.execute_script(js)
        js= '$("#btnLogin").trigger("click");'
        self.SeleBrowser.execute_script(js)
        '''
        randomCode = self.SeleBrowser.find_element_by_id("randomCode")
        randomCode.clear()
        randomCode.send_keys(authCode)
        # Sleep five seconds so the background pre-validation can finish
        time.sleep(5)
        btnLogin = self.SeleBrowser.find_element_by_id("btnLogin")
        btnLogin.click()

        # Account protection sometimes shows a prompt like this:
        '''
        <div class="global_dialog_confirm_main" style="display: block; margin-top: -163.5px;">
        <div class="global_dialog_confirm_title">
        <h3 class="ellipsis" title="賬號保護">賬號保護</h3>
        </div>
        <div class="global_dialog_confirm_content" style="padding-bottom: 0px;"><div>
        <div id="authenDialog"><p class="inptips2">您已開啓賬號保護,請輸入驗證碼以完成登陸。</p>
        <div class="margin10-EMUI5"><div id="accountDiv" class="fixAccountDrt ddrop-EMU5">
        '''
        try :
            #loginConfirm = self.SeleBrowser.find_element_by_class_name("global_dialog_confirm_main")
            loginConfirm =WebDriverWait(self.SeleBrowser, 5, 0.5).until(EC.presence_of_element_located((By.CLASS_NAME, 'global_dialog_confirm_main') ))
            # Verification is required. I couldn't be bothered to implement this part,
            # so sleep for 60 seconds and handle it by hand.
            if loginConfirm.is_displayed():
                time.sleep(self.TimeOUT*2)
        except Exception:
            # No verification required; go straight to the next step
            pass

        # Wait for the page to finish loading
        '''
        <span class="index-span" data-bind="lang.common.album">圖庫</span>
        '''
        try :
            success =WebDriverWait(self.SeleBrowser, 20, 0.5).until(EC.presence_of_element_located((By.XPATH, '//span[@data-bind="lang.common.album"]') ))
        except Exception as e:
            # Login failed
            return self.QuitSele(e,"Login failed!",iRet=-999)

        # Check the login result
        if not success.is_displayed():
            return self.QuitSele(None,"Login failed!",iRet=-999)

        # Check once more as an extra safeguard
        source_code =self.SeleBrowser.page_source
        if '聯繫人' not in source_code or '圖庫' not in source_code :
            return self.QuitSele(None,"Login failed!",iRet = -9999 )

        cookie = [item["name"] + "=" + item["value"] for item in self.SeleBrowser.get_cookies()]
        cookiestr = ';'.join(item for item in cookie)

        # Save the CSRFToken
        pattern = re.compile('CSRFToken = "(.*?)"',re.S)
        content = re.search(pattern,source_code)
        if content :
            CSRFToken = content.group(1)
        else :
            print ('Failed to get the CSRFToken!')
            CSRFToken = ''

        self.Headers={
            'User-Agent': '%s' %self.Headers,
            'CSRFToken': '%s' %CSRFToken,
            'Cookie': '%s' %cookiestr
        }
        return 1

if __name__ == '__main__':
    photohw= HuaWei()
    count =0
    while (count <100):
        count += 1
        selehw= hwSele(SeleDriver='edge')
        iRet = selehw.LoginHW()
        if iRet !=1:
            print( 'Huawei login failed!!!\n\n')
            continue
        photohw.loginHeaders = selehw.Headers
        page = photohw.getAlbumList()
        if page=='' :
            print( 'Failed to fetch the album list!!!\n\n')
            break
        # Save the album list
        iRet = photohw.getFileList(page,'albumList','albumId')
        if iRet <=0 :
            print('Error saving albums, logging in again')
            continue
        # Save the shared album list
        iRet = photohw.getFileList(page,'ownShareList','shareId')
        # getFileList returns 1 on success and -1 on failure
        if iRet >0 :
            print('Finished. Open the album files with Xunlei to batch-download the photos locally!!!\n\n')
            # Done
            selehw.QuitSele(None)
            break
        else:
            continue
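The CSRFToken step in LoginHW is worth isolating, since the later requests depend on it. A minimal sketch, assuming the token still appears in the page source as `CSRFToken = "..."` (the pattern the script searches for); `extract_csrf_token` is a hypothetical helper name and the sample markup is invented for illustration:

```python
import re

# Same pattern the login script searches for in page_source.
CSRF_PATTERN = re.compile(r'CSRFToken = "(.*?)"', re.S)

def extract_csrf_token(page_source):
    """Return the CSRFToken embedded in the page, or None if it is absent."""
    match = CSRF_PATTERN.search(page_source)
    return match.group(1) if match else None

# Hypothetical page fragment; the real page layout may differ.
sample = '<script>var CSRFToken = "0123abcd"; var other = "x";</script>'
token = extract_csrf_token(sample)
missing = extract_csrf_token("<html></html>")
print(token, missing)
```

Returning None for a missing token lets the caller decide whether to treat it as a failed login rather than sending requests with an empty signature.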
huaweiphoto_py3.py is as follows:
# -*- coding: utf-8 -*-
# Created by   : zhongtang
# Created date : 2018.2.28

import json
import requests
from requests.adapters import HTTPAdapter
import html

class HuaWei:
    # Huawei cloud service login
    def __init__(self):
        self.getalbumsUrl= 'https://www.hicloud.com/album/getCloudAlbums.action'
        self.getalbumfileUrl = 'https://www.hicloud.com/album/getCloudFiles.action'
        self.loginHeaders = { }
        self.SReq=requests.session()
        self.SReq.mount('http://', HTTPAdapter(max_retries=3))
        self.SReq.mount('https://', HTTPAdapter(max_retries=3))
        self.OnceMaxFile=100   # maximum number of files fetched per request
        self.FileNum=0
        self.AlbumList={}

    # Save the album photo URLs to a file; each album goes to its own file
    def saveFileList2Txt(self,filename,hjsondata,flag):
        if len(hjsondata)<= 0 :
            return -1
        hjson2 = {}
        try:
            hjson2 = json.loads(hjsondata)
        except Exception:
            print('Error fetching album details\n')
            return -1
        lfilename = filename+u".txt"
        if flag == 0 :
            # Create a new file
            print( u'Creating album file '+lfilename+"\n")
            # A new file means a new album, so restart the count
            self.FileNum = 0
            f = open(lfilename, 'w')
        else:
            # Append to the existing file
            f = open(lfilename, 'a')
        i = 0
        if hjson2.get("fileList"):
            for each in hjson2["fileList"]:
                fileurl= html.unescape(hjson2["fileList"][i]["fileUrl"])
                f.write(fileurl+"\n")
                # Insert a page marker every thousand lines
                self.FileNum += 1
                if self.FileNum%1000 ==0 :
                    f.write('\n\n\n\n\n\n--------------------page %s ------------------\n\n\n\n\n\n' %(int(self.FileNum/1000)))
                i += 1
        f.close()
        return i

    # Loop over the albums and read their files
    def getFileList(self,hjsondata,parentkey,childkey):
        # step 3: getCoverFiles.action, fetch the album file list in a loop, at most 100 records per request.
        # In the captured traffic, count is always the maximum 49 regardless of how many files remain;
        # currentNum increases each time until an empty list is returned.
        #albumIds[]=default-album-2&ownerId=220086000029851117&height=300&width=300&count=49&currentNum=0&thumbType=imgcropa&fileType=0
        #albumIds[]=default-album-1&ownerId=220086000029851117&height=300&width=300&count=49&currentNum=49&thumbType=imgcropa&fileType=0
        #albumIds[]=default-album-1&ownerId=220086000029851117&height=300&width=300&count=49&currentNum=98&thumbType=imgcropa&fileType=0
        #albumIds[]=default-album-2&ownerId=220086000029851117&height=300&width=300&count=49&currentNum=101&thumbType=imgcropa&fileType=0
        # The last request returns an empty list:
        #{"albumSortFlag":true,"code":0,"info":"success!","fileList":[]}
        # On the first fetch, count is still the maximum 49 even if the album only holds 2 files.
        #albumIds[]=default-album-102-220086000029851117&ownerId=220086000029851117&height=300&width=300&count=49&currentNum=0&thumbType=imgcropa&fileType=0
        #[{u'photoNum': 2518, u'albumName': u'default-album-1', u'iversion': -1, u'albumId': u'default-album-1', u'flversion': -1, u'createTime': 1448065264550L, u'size': 0},
        #{u'photoNum': 100, u'albumName': u'default-album-2', u'iversion': -1, u'albumId': u'default-album-2', u'flversion': -1, u'createTime': 1453090781646L, u'size': 0}]
        try:
            hjson = json.loads(hjsondata)
        except Exception:
            print ('Error loading json!')
            return -1
        # Dictionary lookup failed
        if not hjson.get(parentkey):
            print ('Error loading json root node [%s]!' %parentkey)
            return -1

        # Initialise the global album list
        if not self.AlbumList :
            self.AlbumList=hjson
        for idx,album in enumerate(self.AlbumList[parentkey]):
            if 'currentNum' not in self.AlbumList[parentkey][idx].keys():
                self.AlbumList[parentkey][idx]['currentNum']=0

        # Loop over the albums and save each one
        for each in hjson[parentkey]:
            paraAlbum={}
            paraAlbum['albumIds[]'] = each[childkey]
            paraAlbum['ownerId'] = hjson['ownerId']
            paraAlbum['height'] = '300'
            paraAlbum['width'] = '300'
            paraAlbum['count'] = self.OnceMaxFile
            paraAlbum['thumbType'] = 'imgcropa'
            paraAlbum['fileType'] = '0'
            itotal= each['photoNum']
            # Look up how far this album has already been fetched
            for idx,album in enumerate(self.AlbumList[parentkey]):
                if each[childkey]==album[childkey]:
                    icurrentnum = self.AlbumList[parentkey][idx]['currentNum']
                    break
            # Save every file in the album
            while icurrentnum<itotal:
                paraAlbum['currentNum'] = icurrentnum
                response=self.SReq.post(self.getalbumfileUrl,headers=self.loginHeaders,data=paraAlbum,verify=False)
                page = response.text
                # Save the download URLs to a text file, without downloading the files
                iret = self.saveFileList2Txt(each[childkey],page,icurrentnum)
                if iret >0 :
                    self.AlbumList[parentkey][idx]['currentNum'] += iret
                    icurrentnum = self.AlbumList[parentkey][idx]['currentNum']
                else:
                    # Error!!!
                    return -1
        return 1

    # step 1: getCloudAlbums, fetch the album list
    def getAlbumList(self):
        response=self.SReq.post(self.getalbumsUrl,headers=self.loginHeaders,verify=False)
        page=response.text
        '''# Response body
        {"ownerId":"220086000029851117","code":0,
        "albumList":[{"albumId":"default-album-1","albumName":"default-album-1","createTime":1448065264550,"photoNum":2521,"flversion":-1,"iversion":-1,"size":0},
        {"albumId":"default-album-2","albumName":"default-album-2","createTime":1453090781646,"photoNum":101,"flversion":-1,"iversion":-1,"size":0}],
        "ownShareList":[{"ownerId":"220086000029851117","resource":"album","shareId":"default-album-102-220086000029851117","shareName":"微信","photoNum":2,"flversion":-1,"iversion":-1,"createTime":1448070407055,"source":"HUAWEI MT7-TL00","size":0,"ownerAcc":"****","receiverList":[]}],
        "recShareList":[]}
        '''
        if len(page)<=0 :
            print( u'Error fetching the album list: empty response!!!\n\n')
        return page
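The paging logic in getFileList boils down to advancing an offset by however many records came back, until the album total is reached or a page comes back empty. A stripped-down sketch with no network access: `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` merely stands in for the POST to getCloudFiles.action.

```python
def fetch_all(fetch_page, total, page_size=100):
    """Collect items by advancing an offset until `total` is reached or a page is empty."""
    items = []
    current = 0
    while current < total:
        batch = fetch_page(current, page_size)
        if not batch:
            break  # empty page: stop instead of looping forever
        items.extend(batch)
        current += len(batch)
    return items

# Hypothetical album contents used in place of the real backend.
fake_album = [f"IMG_{n:04d}.jpg" for n in range(230)]

def fake_fetch(current_num, count):
    # Mimics a server that returns at most `count` records starting at current_num.
    return fake_album[current_num:current_num + count]

result = fetch_all(fake_fetch, total=len(fake_album))
print(len(result))
```

Advancing by `len(batch)` rather than by the requested page size keeps the offset correct even when the server returns fewer records than asked for.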
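The program writes files into the current directory containing the download URLs of the photos in each Huawei Cloud album. The contents look like this: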
https://d167.g03.dbankcloud.com/file/MDAwMTZBODissQaaaaaaaaaaaaaaaaaaaaQc2CR-znjyRnw../162807b277aaaaaaaaaaaaaaaaaa9ee1/IMG_20170606_141952.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaaWNLIosPR_EKv8VQ..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODhhoFaaaaaaaaaaaaaaaaaaaa7r6jPU67bWTQA../4039a1be5caaaaaaaaaaaaaaaaaac726/IMG_20170605_203519.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaajvduIL8cXufhNhQ..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODgpoWaaaaaaaaaaaaaaaaaaaaaciAQlIVHRbXg../9e336da286aaaaaaaaaaaaaaaaaaf89d/IMG_20170604_171032.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaaThkDgJKHpBtiG5w..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODgUz2aaaaaaaaaaaaaaaaaaaatyFMDr71YpXGg../b3c17582ccaaaaaaaaaaaaaaaaac278b/IMG_20170603_134831.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaaTI_xPSzF_VzUsJA..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODgGfwaaaaaaaaaaaaaaaaaaaad7DLSHH4rwKVA../2722df087baaaaaaaaaaaaaaaaa915e4/IMG_20170603_133833.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaaluTQ8grDHok9BzQ..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODgsrHaaaaaaaaaaaaaaaaaaaatq2yOJ-OnkDtQ../77e0ef0560aaaaaaaaaaaaaaaaa44702/IMG_20170602_183736.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaa20WJbfxn-qoqIeQ..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODiAcDaaaaaaaaaaaaaaaaaaaaEXW-ONoF0Shuw../df033e69ffaaaaaaaaaaaaaaaaa8c1b1/IMG_20170601_185446.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaaxIC-spleDG_xxVg..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODjY8EaaaaaaaaaaaaaaaaaaaaVbr9kC-JU0M8g../d5230d2032aaaaaaaaaaaaaaaaa903b3/IMG_20170601_102059.jpg?key=AWqIQFqVkEaaaaaaaaaaaaaaaaaaaaaa49iM03bK-Cm-Z9g..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
https://d167.g03.dbankcloud.com/file/MDAwMTZBODi04Caaaaaaaaaaaaaaaaaaaaaaw41bSNB4pxBw../6ee510e28aaaaaaaaaaaaaaaaa57cb5a/IMG_20170601_102042.jpg?key=AWqIQFqVkaaaaaaaaaaaaaaaaaaaaaaxlapdsHLoRCSITVw..&a=220086000029851117-3da1ab76-92808-5840&nsp_ver=3.0
Copy the download links above into Xunlei and add them as a batch task to download the pictures to local disk.
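If you would rather stay in Python than use Xunlei, the album file first needs its blank lines and page markers filtered out, and each URL needs a local file name. A minimal sketch of just that preparation step: `iter_urls` and `filename_from_url` are hypothetical helpers, and the sample lines use a placeholder host rather than real links.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def iter_urls(lines):
    """Yield only the download URLs, skipping blank lines and page separator lines."""
    for line in lines:
        line = line.strip()
        if line.startswith(("http://", "https://")):
            yield line

def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return posixpath.basename(unquote(urlsplit(url).path))

# Hypothetical, shortened sample of an album file.
sample_lines = [
    "https://example.com/file/aaa/bbb/IMG_20170606_141952.jpg?key=k1&nsp_ver=3.0\n",
    "\n",
    "--------------------page 1 ------------------\n",
    "https://example.com/file/ccc/ddd/IMG_20170605_203519.jpg?key=k2&nsp_ver=3.0\n",
]
names = [filename_from_url(u) for u in iter_urls(sample_lines)]
print(names)
```

Each URL could then be fetched with `requests.get(url, stream=True)` and written to the derived file name, assuming the signed links are still valid at download time.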
That's all. -- End --