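Ever since I followed a few big names on Zhihu, the tone of my Zhihu home feed has changed abruptly. It keeps showing questions like
What is it like to be a good-looking ***? What is it like to look good without ***? What kind of *** works as an avatar? ...
and other Q&A threads of that sort. Clicking in, I found they really are quite good. The big names are big names for a reason: what they pay attention to is on another level.
After a few rounds of this, it got tedious. As a gay guy, no wait, a straight guy, I don't really care about the stuff in between (the text). I just like looking at the pictures, so I needed a quick and convenient way to browse, no, to appreciate them.
Python really is a great tool. A few lines of code are enough to download the images on a page: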
#coding=utf-8
# Python 2 script: download every image found on one question page.
import urllib
import re

def getHtml(url):
    # Fetch the raw HTML of the page.
    page = urllib.urlopen(url)
    html = page.read()
    return html

def getImg(html):
    # The full-size image URL sits in the original="..." attribute.
    reg = r'original="([0-9a-zA-Z:/._]+?)" data-actualsrc'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    x = 0
    for imgurl in imglist:
        print imgurl
        # Take the file extension from the end of the URL.
        subreg = r'\.([a-z]+?$)'
        subre = re.compile(subreg)
        subs2 = re.findall(subre,imgurl)
        name = 'e://pics/%s.%s' % (x, subs2[0])
        urllib.urlretrieve(imgurl, name)
        x += 1

def getPage(text):
    # Read the data-pagesize attribute from the page.
    reg = r'data-pagesize="([0-9]+?)"'
    rec = re.compile(reg)
    matches = re.findall(rec,text)
    return matches[0]

url = "https://www.zhihu.com/question/****" # paste the question URL here
html = getHtml(url)
getImg(html)
print "page=%s" % getPage(html)
print "done!"
(Hmm, something looks off.)
Why only a handful of images? The original thread should have plenty.
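One way to rule out the extraction step is to run the same two regular expressions against a small string offline. The following is only a minimal sketch in Python 3 (the script above is Python 2); the function name `extract_images` and the sample markup are made up for illustration and are not copied from a real Zhihu page.

```python
import re

# Same patterns as in the script above.
IMG_RE = re.compile(r'original="([0-9a-zA-Z:/._]+?)" data-actualsrc')
EXT_RE = re.compile(r'\.([a-z]+?$)')


def extract_images(html):
    """Return (url, extension) pairs for every image the pattern matches."""
    pairs = []
    for img_url in IMG_RE.findall(html):
        ext = EXT_RE.findall(img_url)
        # Skip URLs without a lowercase extension instead of raising IndexError.
        if ext:
            pairs.append((img_url, ext[0]))
    return pairs


# Hypothetical markup, invented for this example.
SAMPLE_HTML = (
    '<img data-original="https://pic1.example.com/abc_b.jpg" data-actualsrc="x">'
    '<img data-original="https://pic2.example.com/def_b.png" data-actualsrc="y">'
)

if __name__ == "__main__":
    for url, ext in extract_images(SAMPLE_HTML):
        print(url, ext)
```

If the patterns pick up every image in a sample like this, the missing pictures are more likely absent from the fetched HTML than dropped by the regular expressions.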
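A bit of debugging shows that the page does not load all the answers at once. The server returns the remaining content only when you click the [More] button at the bottom of the page. So the script needs some changes: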
#coding=utf-8
# Python 2 script: also request the answers hidden behind the [More] button.
import requests
import shutil
import re
import urllib
import ast

count=0

def getHtml(url):
    r = requests.get(url)
    return r.text

def saveImage(url, path):
    # Stream the image to disk instead of holding it in memory.
    r = requests.get(url, stream=True)
    if r.status_code == 200:
        with open(path, 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)
    del r
    return 0

def getImg(html):
    global count
    reg = r'original="([0-9a-zA-Z:/._]+?)" data-actualsrc'
    imgre = re.compile(reg)
    imglist = re.findall(imgre,html)
    for imgurl in imglist:
        count += 1
        subreg = r'\.([a-z]+?$)'
        subre = re.compile(subreg)
        subs2 = re.findall(subre,imgurl)
        path = 'e://pics/%s.%s' % (count, subs2[0])
        saveImage(imgurl, path)
        print '%s --> %s ' % (count, imgurl)

def getPage(text):
    reg = r'data-pagesize="([0-9]+?)"'
    rec = re.compile(reg)
    matches = re.findall(rec,text)
    return matches[0]

question = 27203*** # question ID
url = "https://www.zhihu.com/question/%s" % (question)
html = getHtml(url)
getImg(html)
# The data-pagesize value is used below both as the loop bound and as "pagesize".
page = int(getPage(html))
next_url = "https://www.zhihu.com/node/QuestionAnswerListV2"
if page > 1:
    for i_page in range(2, page):
        next_page = i_page * 10
        params = '{"url_token":%s, "pagesize":%s, "offset": %s}' % (question, page, next_page)
        post_data = {'method':'next', 'params':params, '_xsrf': '521beffc0ca2d5747d6d981c6cc25dea'}
        data=urllib.urlencode(post_data)
        headers = {'Content-Type':'application/x-www-form-urlencoded'}
        r = requests.post(next_url, data=data, headers=headers)
        # The response carries a list of HTML fragments under 'msg'.
        text = r.text
        text = ast.literal_eval(text)
        text = text['msg']
        text = ''.join(text)
        # Drop the backslashes left over from the escaped response text.
        text = text.replace('\\', '')
        getImg(text)
print "page=%s" % page
print "Down %s pics !!!" % count
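The request body and the response handling in the script above can also be tried offline, without touching the network. This is a rough Python 3 sketch under a few assumptions: the helper names `build_form` and `decode_fragments`, the question ID `12345`, the token string, and the sample response body are all hypothetical, and `json` is swapped in for the `ast.literal_eval` plus backslash stripping that the original uses.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_form(question_id, pagesize, offset, xsrf):
    """Form body for one 'load more' request, mirroring the fields used above."""
    params = json.dumps(
        {"url_token": question_id, "pagesize": pagesize, "offset": offset}
    )
    return urlencode({"method": "next", "params": params, "_xsrf": xsrf})


def decode_fragments(body):
    """Join the HTML fragments found in the 'msg' list of a JSON body."""
    # json.loads unescapes \" and \/ on its own, so no manual
    # backslash stripping is needed afterwards.
    return "".join(json.loads(body)["msg"])


# Hypothetical response body, invented for this example.
SAMPLE_BODY = (
    r'{"msg": ["<img data-original=\"https:\/\/pic1.example.com\/a.jpg\" '
    r'data-actualsrc=\"x\">"]}'
)

if __name__ == "__main__":
    form = build_form(12345, 10, 20, "hypothetical-token")
    print(parse_qs(form)["params"][0])
    print(decode_fragments(SAMPLE_BODY))
```

Parsing the body as JSON also copes with `true`, `false` and `null`, which `ast.literal_eval` would reject; whether the real endpoint ever returns those is not something the original states.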
Now it looks right: the script successfully scraped all the images across the 10 pages.
Well, I'm off to appreciate the pictures. Bye-bye.