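The United States Department of Agriculture (USDA) has produced roughly 7,500 watercolor "ID photos" of the world's known fruits and offers them as high-resolution downloads.
The goal of this crawler is to save those portraits to the local disk.
The site lists 7,584 images in total, spread over 380 pages with 20 entries per page (the last page holds the remaining 4).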
First page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=0
Second page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=20
... and so on. The pagination is fairly simple: the start parameter grows by 20 with each page.
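As a minimal sketch of the pagination described above, the page URLs can be generated from the totals quoted in this post. The names BASE_URL, TOTAL_IMAGES, PAGE_SIZE and page_urls are illustrative and not part of the original script.

```python
import math

# Values taken from the post; the constant names are illustrative.
BASE_URL = "https://usdawatercolors.nal.usda.gov/pom/search.xhtml"
TOTAL_IMAGES = 7584
PAGE_SIZE = 20


def page_urls(total=TOTAL_IMAGES, page_size=PAGE_SIZE):
    """Return the search URL of every result page, in order."""
    # 7584 / 20 = 379.2, so the last, partially filled page needs rounding up
    page_count = math.ceil(total / page_size)
    return ["{}?start={}".format(BASE_URL, page * page_size) for page in range(page_count)]


urls = page_urls()
print(len(urls))   # number of pages
print(urls[0])     # first page, start=0
print(urls[-1])    # last page
```

Deriving the page count from the totals avoids hard-coding 380, which would go stale if the collection grew.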
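Each entry on a result page is a div.document element: its dl.defList holds the metadata, and div.thumb-frame contains the thumbnail image.
From it we can extract the artist, year, scientific name, common name, and thumbnail URL.
Clicking the image opens the detail page, and clicking the "click to enlarge" button there gives us the original image.
However, that would mean opening an extra page for every image. It turns out that the thumbnail URL and the original image URL are related: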
../download/POM00007435/thumbnail
https://usdawatercolors.nal.usda.gov/download/POM00007435/screen
So we only need to extract the ID POM00007435 from the thumbnail URL to assemble the corresponding original image URL.
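The mapping between the two URLs can be sketched as a small standalone helper. The function name original_image_url is illustrative. Unlike a fixed split index, this version looks for the path segment that follows "download", which is an assumption about the URL shape based on the single example above.

```python
def original_image_url(thumb_src):
    """Build the full-size image URL from a thumbnail src.

    Expects a value such as '../download/POM00007435/thumbnail'.
    Raises ValueError if the path has no 'download' segment.
    """
    parts = thumb_src.strip("/").split("/")
    # the image ID is the path segment right after 'download'
    img_id = parts[parts.index("download") + 1]
    return "https://usdawatercolors.nal.usda.gov/download/{}/screen".format(img_id)


print(original_image_url("../download/POM00007435/thumbnail"))
```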
import os

import requests
from bs4 import BeautifulSoup

IMG_FOLDER = 'fruit_images/'
PAGE_SIZE = 20


def run():
    # create the output folder up front so that open() does not fail later
    os.makedirs(IMG_FOLDER, exist_ok=True)
    for page in range(380):
        resp = requests.get(
            'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}&searchText=&searchField=&sortField='.format(
                page * PAGE_SIZE))
        soup = BeautifulSoup(resp.text, 'html.parser')
        for (div_idx, div) in enumerate(soup.select('div.document')):
            doc = div.select_one('dl.defList')
            artist = doc.select_one(':nth-child(2)>a').get_text()
            year = doc.select_one(':nth-child(4)>a').get_text()
            # some entries have no scientific or common name; use 'none' so the crawl keeps going
            scientific_name_tag = doc.select_one(':nth-child(6)>a')
            scientific_name = 'none' if scientific_name_tag is None else scientific_name_tag.get_text()
            common_name_tag = doc.select_one(':nth-child(8)>a')
            common_name = 'none' if common_name_tag is None else common_name_tag.get_text()
            thumb_pic_src = div.select_one('div.thumb-frame>a>img')['src']
            # 1-based running number across all pages (the first page yields 1..20)
            id = page * PAGE_SIZE + div_idx + 1
            info = FruitInfo(id, artist, year, scientific_name, common_name, thumb_pic_src)
            print(info)
            info.download_and_save()


class FruitInfo:
    def __init__(self, id, artist, year, scientific_name, common_name, thumb_pic_url):
        self.id = id
        self.artist = artist
        self.year = year
        self.scientific_name = scientific_name
        self.common_name = common_name
        self.thumb_pic_url = thumb_pic_url

    def download_and_save(self):
        filename = '{}-{}-{}-{}.png'.format(self.id, self.common_name, self.year, self.artist).replace(' ', '_')
        print('filename = ', filename)
        ori_img_url = self.__parse_ori_img_url()
        print('original img url = ', ori_img_url)
        resp = requests.get(ori_img_url)
        with open(IMG_FOLDER + filename, 'wb') as f:
            f.write(resp.content)
        print('saved...', filename)

    def __parse_ori_img_url(self) -> str:
        # '../download/POM00007435/thumbnail' -> ['..', 'download', 'POM00007435', 'thumbnail']
        img_id = self.thumb_pic_url.split('/')[2]
        print('img id = ', img_id)
        return 'https://usdawatercolors.nal.usda.gov/download/{}/screen'.format(img_id)

    def __str__(self):
        return 'FruitInfo(artist={},year={},scientific_name={},common_name={},thumb_pic_url={})'.format(
            self.artist, self.year, self.scientific_name, self.common_name, self.thumb_pic_url)


if __name__ == '__main__':
    run()
When running the script locally you may need to configure a proxy; otherwise the USDA website may not be reachable.
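One way to route the requests through a proxy is the proxies argument of requests.get. The sketch below only builds that mapping. The address in PROXY_URL is a hypothetical placeholder for whatever proxy runs on your machine, and proxy_settings is an illustrative helper name.

```python
# Hypothetical local proxy address; replace it with your own.
PROXY_URL = "http://127.0.0.1:1080"


def proxy_settings(proxy_url=PROXY_URL):
    """Return a mapping in the shape accepted by the `proxies` argument of requests.get."""
    return {"http": proxy_url, "https": proxy_url}


print(proxy_settings())

# Usage inside the crawler (needs network access and a running proxy):
#   resp = requests.get(url, proxies=proxy_settings(), timeout=30)
```

Alternatively, requests picks up the HTTP_PROXY and HTTPS_PROXY environment variables by default, so exporting them before starting the script works without touching the code.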