Crawling all the images under a Zhihu question

Date: 2019-11-25

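Recently, while browsing Zhihu, I came across this question (https://www.zhihu.com/question/37787176).

The author of the top-voted answer wrote a crawler and downloaded every photo in the thread.

Heh heh, the power of technology.

I happen to be learning this myself, and that answer is old and Zhihu has since been redesigned, so I decided to write my own version in Python 3 as practice (definitely not in order to download the photos)....

A small goal: crawl the photos posted by all the "female" programmers.

The first step is to find the total number of answers, which is fairly simple: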

url = "https://www.zhihu.com/question/37787176"
html = requests.get(url, headers=headers).text
answer = BeautifulSoup(html, "lxml").find("h4", class_="List-headerText").find("span").get_text()
# Strip the label that follows the number; the raw string avoids an invalid-escape warning.
answer_num = int(re.sub(r"\s\S+", "", answer))
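
The count extraction above can be made slightly more tolerant. The following is a minimal sketch, assuming the header text is a number followed by a label; the helper name `parse_answer_count` and both sample strings are illustrative assumptions, not values taken from the live page.

```python
import re


def parse_answer_count(header_text):
    """Return the first integer found in header_text (hypothetical helper)."""
    match = re.search(r"\d[\d,]*", header_text)
    if match is None:
        raise ValueError(f"no number found in {header_text!r}")
    # Drop thousands separators before converting to int.
    return int(match.group().replace(",", ""))


# Assumed sample strings; the real header format may differ.
print(parse_answer_count("1234 個回答"))
print(parse_answer_count("1,234 個回答"))
```

Searching for the digits directly, rather than deleting everything after the first space, keeps working if the page ever adds a thousands separator or changes the label.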

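Zhihu loads more content when you click "More", which brings in another 20 answers each time. Simulating a login with selenium is too slow and too much trouble, so inspecting Zhihu's Ajax requests is the more reliable approach. Thanks to Cui for his tutorial (http://cuiqingcai.com/4380.html).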
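
Loading 20 answers per request amounts to stepping an offset through the answer list. The sketch below shows that paging logic in isolation, with no network access. The helper names are hypothetical, and the URL prefix is a shortened stand-in for the much longer answers endpoint used in the full code at the end of this post.

```python
def page_offsets(total_answers, page_size=20):
    """Return the offsets 0, 20, 40, ... that stay below total_answers."""
    return range(0, total_answers, page_size)


def page_url(url_prefix, offset):
    """Append the offset to a prefix that ends with 'offset='."""
    return url_prefix + str(offset)


# Shortened prefix for illustration; the real one carries a long "include" parameter.
prefix = "https://www.zhihu.com/api/v4/questions/37787176/answers?limit=20&offset="

for offset in page_offsets(45):
    print(page_url(prefix, offset))
```

With 45 answers this yields three requests, at offsets 0, 20, and 40.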

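In the browser you can see that each click on "More" triggers a request of type "fetch" plus requests for the related images (jpeg). The "fetch" response contains the answerers' information and the answer content.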

After parsing the response as JSON, the gender field gives the answerer's gender (0 for female, 1 for male).
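
As a rough illustration of that filtering step, the sketch below parses a small hand-written payload. The sample is an assumption: it mirrors only the keys the crawler reads ("data", "author", "gender", "content"), and the helper name `answers_by_gender` is hypothetical.

```python
import json

# Hand-written sample mirroring only the fields the crawler reads (assumed shape).
sample = """
{"data": [
  {"author": {"gender": 0}, "content": "<p>first</p>"},
  {"author": {"gender": 1}, "content": "<p>second</p>"},
  {"author": {}, "content": "<p>third</p>"}
]}
"""


def answers_by_gender(payload_text, gender=0):
    """Return the 'content' of every answer whose author has the given gender."""
    rows = json.loads(payload_text)["data"]
    return [
        row["content"]
        for row in rows
        # .get() keeps a row without a gender field from raising KeyError.
        if row.get("author", {}).get("gender") == gender
    ]


print(answers_by_gender(sample))
```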

Then collect the image links from every src attribute under "content", and that's it.
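
The extraction step can be tried without any third-party packages. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs as-is; the sample HTML, the example.com URL, and the helper names are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag whose src starts with 'http'."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src and src.startswith("http"):
                self.sources.append(src)


def image_urls(content_html):
    collector = ImgSrcCollector()
    collector.feed(content_html)
    return collector.sources


def file_name(url):
    """Keep only the part of the URL after the last slash."""
    return re.sub(".*/", "", url)


# Assumed sample: one real-looking image link and one inline placeholder.
sample = (
    '<p>hello</p><img src="https://pic1.example.com/abc.jpg">'
    '<img src="data:image/svg+xml;utf8,placeholder"/>'
)
for link in image_urls(sample):
    print(link, file_name(link))
```

The `startswith("http")` check matches the one in the full code and skips inline placeholder images that have no downloadable URL.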

Note: the request headers must include an "authorization" entry.
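
One way to avoid hard-coding that value is to read it from the environment. This is only a sketch: the variable name `ZHIHU_AUTHORIZATION` and the helper `build_headers` are hypothetical, the User-Agent is shortened, and the token itself is assumed to be copied from the request headers that the browser shows for the "fetch" request.

```python
import os


def build_headers(authorization=None):
    """Build request headers; ZHIHU_AUTHORIZATION is a hypothetical variable name."""
    token = authorization or os.environ.get("ZHIHU_AUTHORIZATION", "")
    if not token:
        raise ValueError("no authorization value supplied")
    return {
        "User-Agent": "Mozilla/5.0",  # shortened for illustration
        "authorization": token,
    }


# "Bearer abc" is a placeholder, not a working token.
print(build_headers("Bearer abc")["authorization"])
```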

Here is the full code:

import json
import os
import re
import time

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36",
    "Connection": "keep-alive",
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "authorization": "Bearer Mi4wQUFEQVB4VkNBQUFBVU1MWktySDJDeGNBQUFCaEFsVk5TZ0YyV1FBaGsxRnJlTFd3ZGR6QzZrTXptcDFuWGNOQk5B|1498313802|2d5466ef4550588f5fc28553ea8981e7a4e398ad",
}

# Create the download directory if it does not exist, then work inside it.
save_dir = "D:/crawler_study/zhihu"
os.makedirs(save_dir, exist_ok=True)
os.chdir(save_dir)

# Read the total number of answers from the question page.
url = "https://www.zhihu.com/question/37787176"
html = requests.get(url, headers=headers).text
answer = BeautifulSoup(html, "lxml").find("h4", class_="List-headerText").find("span").get_text()
answer_num = int(re.sub(r"\s\S+", "", answer))

url_prefix = "https://www.zhihu.com/api/v4/questions/37787176/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset="

# Page through the answers API, 20 answers per request.
offset = 0
while offset < answer_num:
    answer_url = url_prefix + str(offset)
    html = requests.get(answer_url, headers=headers).text
    content = json.loads(html)["data"]
    for row in content:
        gender = row["author"]["gender"]
        if gender == 0:  # 0 = female, 1 = male
            answer = row["content"]
            pic_list = BeautifulSoup(answer, "lxml").find_all("img")
            for pic in pic_list:
                down_url = pic.get("src", "")
                if down_url.startswith("http"):
                    # Use the last path segment of the URL as the file name.
                    name = re.sub(".*/", "", down_url)
                    print("Downloading:", name)
                    # "wb" overwrites on a rerun instead of appending to an existing file.
                    with open(name, "wb") as file:
                        file.write(requests.get(down_url).content)
                    print("Finished:", name)
    offset += 20
    time.sleep(3)  # pause between pages
