[Web Scraping] Art Homework, Web Scraping, and Baidu Images

Date: 2020-02-28


While I was studying probability theory, this suddenly appeared in the QQ group:

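But my drawing tablet had not arrived yet, and the assignment was due the next day. However much I rushed, I could not produce a presentable piece.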

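Then I remembered an article I had once read on Zhihu (https://www.zhihu.com/question/27621722). I had also recently learned web scraping, and my school is tolerant and open-minded, so I came up with this idea: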

　　Stitch related images together to form the content.

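No sooner said than done. After some research, I chose the old version of Baidu Images (it is easy to work with, and I ran into no scraping warnings or anti-scraping mechanisms). Analyzing the URL, we find: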

https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%AD%A6%E6%B1%89%E5%8A%A0%E6%B2%B9&pn=20&gsm=3c&ct=&ic=0&lm=-1&width=0&height=0

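For a given keyword (the part after word=, here the percent-encoded form of 「武汉加油」, "Stay strong, Wuhan"), Baidu collects the related images. The pn parameter that follows is the offset: the old Baidu Images puts 20 images on each page, so pn=20 is equivalent to turning one page. (Honestly, I think this old design is much better. The new version keeps loading more images as you scroll, which feels uncomfortable and awkward.)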
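As a minimal sketch of the URL structure described above, the search URL can be assembled from a keyword and a page index instead of being pasted by hand. The helper name `build_search_url` and the constant names are hypothetical; the parameter names and values are copied from the example URL, and whether Baidu actually needs every one of them is not verified here.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/flip"
PAGE_SIZE = 20  # the old Baidu Images layout shows 20 images per page


def build_search_url(keyword: str, page: int) -> str:
    """Return the search URL for a keyword and a zero-based page index."""
    params = {
        "tn": "baiduimage",
        "ie": "utf-8",
        "word": keyword,         # urlencode percent-encodes this as UTF-8
        "pn": page * PAGE_SIZE,  # offset: 0, 20, 40, ...
        # The remaining values are copied unchanged from the example URL.
        "gsm": "3c",
        "ct": "",
        "ic": 0,
        "lm": -1,
        "width": 0,
        "height": 0,
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("武汉加油", 1))
```

With page set to 1 this reproduces the example URL, and page 0 gives the first page of results (pn=0).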

Next comes getting the image URLs. Given how Baidu structures the page, it is not hard to see:

Matching with the regular expression "objURL":"(.*?)" is enough here, and it works well.
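To show what that pattern extracts, here is a small sketch run against a made-up fragment. The `sample` string and its example.com URLs are hypothetical and only imitate the shape of the data embedded in the page; the real page source is much longer.

```python
import re

# Hypothetical fragment shaped like the page's embedded data; URLs are made up.
sample = (
    '{"thumbURL":"https://example.com/t1.jpg",'
    '"objURL":"https://example.com/full1.jpg"},'
    '{"objURL":"https://example.com/full2.png",'
    '"fromURL":"https://example.com/page"}'
)

# Non-greedy (.*?) stops at the first closing quote after "objURL":"
OBJ_URL_PATTERN = re.compile(r'"objURL":"(.*?)"')

urls = OBJ_URL_PATTERN.findall(sample)
print(urls)
```

The non-greedy quantifier matters: a greedy (.*) would run on to the last double quote in the text and merge several entries into a single match.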

Code:

import os
import re

import requests

path = "picture"  # output folder
tot = 0           # count of download attempts, also used as the file name

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}

####################################### Create the folder
def mkdir(path):
    os.makedirs(path, exist_ok=True)

####################################### Save an image
def save(content):
    mkdir(path)
    # Every file gets a .png extension, whatever the real image format is.
    with open(os.path.join(path, str(tot) + ".png"), "wb") as file:
        file.write(content)

####################################### Download an image
def download(url):
    global tot
    tot = tot + 1
    try:
        html = requests.get(url, timeout=2)
        html.raise_for_status()  # do not save error pages as images
        save(html.content)
        print(tot, "succeeded")
    except (requests.RequestException, OSError):
        print(tot, "failed")

####################################### Fetch the search result page
def getHtml(url):
    html = requests.get(url, headers=headers)
    html.encoding = "utf-8"
    return html.text  # decoded text, so the regex runs on a real string

####################################### Main function
def main():
    # pn = pages * 20; starting at 1 means the first page (pn=0) is skipped.
    for pages in range(1, 30):
        print("Now page", pages)
        url = "https://image.baidu.com/search/flip?tn=baiduimage&ie=utf-8&word=%E6%AD%A6%E6%B1%89%E5%8A%A0%E6%B2%B9&pn=" + str(pages * 20) + "&gsm=3c&ct=&ic=0&lm=-1&width=0&height=0"
        html = getHtml(url)
        pat = '"objURL":"(.*?)"'
        result = re.compile(pat).findall(html)
        for i in result:
            print(i)
            download(i)

#######################################
if __name__ == "__main__":
    main()