I stumbled upon the drift-bottle feature that Tieba has launched. Flipping through it, I found quite a few photos of pretty girls, so with nothing better to do I decided to write a crawler to grab all the images.
Here is the Tieba drift-bottle address:
http://tieba.baidu.com/bottle...
First, open the packet-capture tool Fiddler, then open the drift-bottle home page and load a few pages. After filtering out image data and the noise of non-HTTP-200 responses in Fiddler, it turns out each page is fetched in a very regular pattern, which makes scraping easy. The URL for fetching one page of content is:
http://tieba.baidu.com/bottle...
The parameters are easy to read: page_number is the current page number, and page_size is the number of drift bottles on that page.
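Given those two parameters, the request URL for any page can be built with plain string formatting; a trivial sketch, using the endpoint captured in Fiddler:

```python
# Endpoint captured in Fiddler; 30 bottles per page matches what the site loads.
BOTTLES_URL = "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=%d"


def page_url(page_number, page_size=30):
    """Build the URL that fetches one page of drift bottles."""
    return BOTTLES_URL % (page_number, page_size)


print(page_url(1))
```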
The response is JSON, with a structure roughly like this:
```json
{
  "error_code": 0,
  "error_msg": "success",
  "data": {
    "has_more": 1,
    "bottles": [
      {
        "thread_id": "5057974188",
        "title": "美得不可一世",
        "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"
      },
      ...
    ]
  }
}
```
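Assuming a response shaped like the structure above, pulling the fields out is straightforward. A minimal sketch using the standard json module on a hard-coded sample:

```python
import json

# Hard-coded sample mirroring the structure returned by the bottles endpoint.
raw = """{
  "error_code": 0,
  "error_msg": "success",
  "data": {
    "has_more": 1,
    "bottles": [
      {"thread_id": "5057974188",
       "title": "美得不可一世",
       "img_url": "http://imgsrc.baidu.com/forum/pic/item/a8c87dd062d9f2d3f0113c2ea0ec8a136227cca9.jpg"}
    ]
  }
}"""

resp = json.loads(raw)
if resp["error_code"] == 0:
    for bottle in resp["data"]["bottles"]:
        # Each bottle carries the id, the caption, and the real photo URL.
        print(bottle["thread_id"], bottle["img_url"])
```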
The content is self-explanatory: the entries in bottles are exactly what we want (thread_id is the bottle's id, title is the poster's comment, img_url is the real photo address), so iterating over bottles gives every drift bottle on the current page. (What we get at this point is actually only the cover image; opening a specific bottle holds more surprises. I was too lazy to write that up, but I did analyze the inner data; the URL is: http://tieba.baidu.com/bottle...<bottle thread_id>)
There is also a parameter, has_more, which presumably indicates whether a next page exists.
At this point the collection strategy is settled: start from page one and keep looping through the pages, stopping once has_more is no longer 1.
The job is done with python2.7 + urllib2 + demjson. urllib2 ships with Python 2.7; demjson has to be installed yourself. (Normally the built-in json library is enough for parsing, but many sites nowadays serve JSON that is not strictly valid, and that is where the built-in library gives up.)
To install demjson (no sudo needed on Windows):

```
sudo pip install demjson
```

or

```
sudo easy_install demjson
```
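To see why the built-in json library can fall short: it rejects common sloppiness such as trailing commas or single-quoted strings, which demjson's default non-strict mode tolerates. A quick illustration using only the standard library (demjson itself is not imported here):

```python
import json

# A trailing comma makes this invalid under strict JSON rules.
sloppy = '{"has_more": 1, "bottles": [],}'

try:
    json.loads(sloppy)
    ok = True
except ValueError:  # the stdlib parser raises ValueError on malformed input
    ok = False

print(ok)  # False: the strict parser rejects what a lenient one would accept
```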
```python
def bottlegen():
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            json = demjson.decode(data)
            if json["error_code"] == 0:
                data = json["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                if has_more != 1:
                    break
                page_number += 1
        except:
            # log the error, wait a bit, then retry the same page
            print("bottlegen exception")
            time.sleep(5)
```
A Python generator is used here so the parsed results stream out continuously.
```python
for thread_id, title, img_url in bottlegen():
    filename = os.path.basename(img_url)
    pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
    print filename
    with open(pathname, "wb") as f:
        f.write(urllib2.urlopen(img_url).read())
```
```python
# -*- encoding: utf-8 -*-
import urllib2
import demjson
import time
import re
import os


def bottlegen():
    page_number = 1
    while True:
        try:
            data = urllib2.urlopen(
                "http://tieba.baidu.com/bottle/bottles?page_number=%d&page_size=30" % page_number).read()
            json = demjson.decode(data)
            if json["error_code"] == 0:
                data = json["data"]
                has_more = data["has_more"]
                bottles = data["bottles"]
                for bottle in bottles:
                    thread_id = bottle["thread_id"]
                    title = bottle["title"]
                    img_url = bottle["img_url"]
                    yield (thread_id, title, img_url)
                if has_more != 1:
                    break
                page_number += 1
        except:
            # log the error, wait a bit, then retry
            print("bottlegen exception")
            time.sleep(5)


def imggen(thread_id):
    try:
        data = urllib2.urlopen(
            "http://tieba.baidu.com/bottle/photopbPage?thread_id=%s" % thread_id).read()
        # the inner page embeds its data in a JS call; pull out the arguments
        match = re.search(
            r"\_\.Module\.use\(\'encourage\/widget\/bottle\',(.*?),function\(\)\{\}\);", data)
        data = match.group(1)
        json = demjson.decode(data)
        json = demjson.decode(json[1].replace("\r\n", ""))
        for i in json:
            thread_id = i["thread_id"]
            text = i["text"]
            img_url = i["img_url"]
            yield (thread_id, text, img_url)
    except:
        print("imggen exception")


try:
    os.makedirs("tieba/bottles")
except OSError:
    pass

for thread_id, _, _ in bottlegen():
    for _, title, img_url in imggen(thread_id):
        filename = os.path.basename(img_url)
        pathname = "tieba/bottles/%s_%s" % (thread_id, filename)
        print filename
        with open(pathname, "wb") as f:
            f.write(urllib2.urlopen(img_url).read())
```
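The inner bottle page embeds its data as arguments to a JavaScript _.Module.use(...) call, which is why imggen resorts to a regex before decoding. A minimal sketch of that extraction, run on a fabricated snippet standing in for the real page source:

```python
import re

# Fabricated fragment shaped like the call the script matches; on the real
# page the second argument carries the bottle data.
html = "_.Module.use('encourage/widget/bottle',[0,[1,2]],function(){});"

# Non-greedy capture of everything between the module name and the callback.
match = re.search(
    r"_\.Module\.use\('encourage/widget/bottle',(.*?),function\(\)\{\}\);", html)
print(match.group(1))  # [0,[1,2]]
```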
When run, it first collects every bottle on each page, then fetches all the images inside each bottle and writes them to tieba/bottles/xxxxx.jpg. (I was too lazy to add error handling, my apologies ^_^,,,)
The conclusion is,,, it was all a lie, only the front page has a few good-looking photos - -,,, dammit,,,