Saving the fruit images from the USDA website with a Python crawler

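Introduction

The U.S. Department of Agriculture has commissioned some 7,500 watercolor "ID photos" of the world's known fruits and offers them for high-resolution download at usdawatercolors.nal.usda.gov.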

Strawberry

The goal of this crawler is to save those ID photos to the local disk.

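Analysis

The site lists 7,584 images across 380 pages, with 20 records per page.

  • First page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=0
  • Second page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=20

... and so on. The start parameter simply grows by 20 per page, so the pagination is easy to reproduce.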
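The pagination rule can be sketched in a few lines. This is only an illustration of the arithmetic, assuming the totals quoted above (7,584 images, 20 per page) still hold; the names page_urls, TOTAL_IMAGES and PAGE_SIZE are made up for this example.

```python
import math

BASE_URL = 'https://usdawatercolors.nal.usda.gov/pom/search.xhtml'
TOTAL_IMAGES = 7584  # figure quoted in the text; may change on the live site
PAGE_SIZE = 20       # records per result page


def page_urls(total=TOTAL_IMAGES, page_size=PAGE_SIZE):
    """Yield the search URL of every result page, in order."""
    # 7584 / 20 = 379.2, so the last (380th) page is only partly filled
    page_count = math.ceil(total / page_size)
    for page in range(page_count):
        yield '{}?start={}'.format(BASE_URL, page * page_size)
```

Calling list(page_urls()) gives 380 URLs, the last one with start=7580, which leaves 4 records for the final page.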

Each record on a result page is rendered as its own HTML element.

From each record we can extract:

  • artist
  • year
  • scientific name
  • common name
  • thumbnail URL

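Clicking an image opens its detail page.

Clicking the "click to enlarge" button there gives us the original image.

But going that route would mean opening a new page for every single image. It turns out the thumbnail URL and the original image URL are related: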

  • Thumbnail URL: ../download/POM00007435/thumbnail
  • Original image URL: https://usdawatercolors.nal.usda.gov/download/POM00007435/screen

So once we pull POM00007435 out of the thumbnail URL, we can assemble the matching original image URL.
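As a minimal sketch of that mapping: the helper below extracts the ID with a regular expression and fills it into the download URL. It assumes every ID has the form "POM" followed by digits, which is inferred from the single example above and not verified against the whole collection; the function name is made up for this example.

```python
import re

ORIGINAL_URL_TEMPLATE = 'https://usdawatercolors.nal.usda.gov/download/{}/screen'
# Assumption: image IDs look like "POM" + digits, as in the one example shown
IMG_ID_PATTERN = re.compile(r'/download/(POM\d+)/thumbnail')


def original_url_from_thumbnail(thumb_url):
    """Return the full-size image URL for a thumbnail URL, or None if no ID is found."""
    match = IMG_ID_PATTERN.search(thumb_url)
    if match is None:
        return None
    return ORIGINAL_URL_TEMPLATE.format(match.group(1))
```

Returning None for an unexpected URL lets the caller skip a record instead of crashing halfway through a long crawl.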

Crawler

Dependencies

  1. requests
  2. beautifulsoup4

Source code

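  1. Loop 380 times, once per page
  2. On each page, select the HTML elements of the 20 records
  3. For each of those elements:
  4. read artist, year and the other fields
  5. build the original image URL from the thumbnail URL
  6. download the original image and save it to disk

import os

import requests
from bs4 import BeautifulSoup

IMG_FOLDER = 'fruit_images/'


def run():
    # make sure the output folder exists before any image is written
    os.makedirs(IMG_FOLDER, exist_ok=True)
    # 380 result pages with 20 records each; page idx starts at offset idx * 20
    for idx in range(380):
        resp = requests.get(
            'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}&searchText=&searchField=&sortField='.format(
                idx * 20))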
        soup = BeautifulSoup(resp.text, 'html.parser')
        for (div_idx, div) in enumerate(soup.select('div.document')):
            doc = div.select_one('dl.defList')
            artist = doc.select_one(':nth-child(2)>a').get_text()
            year = doc.select_one(':nth-child(4)>a').get_text()
            # cannot parse scientific name or common name for some pictures, just use 'none' instead to avoid terminating
            scientific_name = 'none' if doc.select_one(':nth-child(6)>a') is None else doc.select_one(
                ':nth-child(6)>a').get_text()
            common_name = 'none' if doc.select_one(':nth-child(8)>a') is None else doc.select_one(
                ':nth-child(8)>a').get_text()
            thumb_pic_src = div.select_one('div.thumb-frame>a>img')['src']
            # 1-based running number across all pages: page idx holds records idx * 20 + 1 .. idx * 20 + 20
            id = idx * 20 + div_idx + 1
            info = FruitInfo(id, artist, year, scientific_name, common_name, thumb_pic_src)
            print(info)
            info.download_and_save()


class FruitInfo:
    def __init__(self, id, artist, year, scientific_name, common_name, thumb_pic_url):
        self.id = id
        self.artist = artist
        self.year = year
        self.scientific_name = scientific_name
        self.common_name = common_name
        self.thumb_pic_url = thumb_pic_url

    def download_and_save(self):
        filename = '{}-{}-{}-{}.png'.format(self.id, self.common_name, self.year, self.artist).replace(' ', '_')
        print('filename = ', filename)
        ori_img_url = self.__parse_ori_img_url()
        print('original img url = ', ori_img_url)
        resp = requests.get(ori_img_url)
        with open(IMG_FOLDER + filename, 'wb') as f:
            f.write(resp.content)
            print('saved...', filename)

    def __parse_ori_img_url(self) -> str:
        img_id = self.thumb_pic_url.split('/')[2]
        print('img id = ', img_id)
        return 'https://usdawatercolors.nal.usda.gov/download/{}/screen'.format(img_id)

    def __str__(self):
        return 'FruitInfo(artist={},year={},scientific_name={},common_name={},thumb_pic_url={})'.format(
            self.artist, self.year, self.scientific_name, self.common_name, self.thumb_pic_url)


if __name__ == '__main__':
    run()
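One thing the listing does not guard against: the file name is assembled from scraped text, so a field containing a path separator or other special character could make open() fail. I have not checked whether any record actually contains one, so treat the following as a precautionary sketch; safe_filename is a hypothetical helper, not part of the original crawler.

```python
import re


def safe_filename(record_id, common_name, year, artist, ext='.png'):
    """Build a file name from scraped fields, replacing risky characters with '_'."""
    raw = '{}-{}-{}-{}'.format(record_id, common_name, year, artist)
    # Keep word characters, '.' and '-'; collapse every other run into one underscore
    cleaned = re.sub(r'[^\w.-]+', '_', raw).strip('_')
    return cleaned + ext
```

With made-up sample values, safe_filename(7, 'apples/pears', '1910', 'Jane Doe') yields '7-apples_pears-1910-Jane_Doe.png'. The default '.png' extension simply mirrors the listing above.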

To run this locally I had to configure a proxy; without one, the USDA website would not open.
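One way to route the crawler's requests through a proxy is the proxies argument of requests.get. The sketch below is an assumption about how you might wire it in, not the setup used for the original run; build_proxies and fetch are hypothetical helper names, and the proxy address in the usage note is a placeholder.

```python
def build_proxies(proxy_url):
    """Return a mapping in the shape accepted by the `proxies` argument of requests.get."""
    return {'http': proxy_url, 'https': proxy_url}


def fetch(url, proxy_url, timeout=30):
    """Fetch a URL through the given proxy and raise on HTTP error status codes."""
    import requests  # imported here so build_proxies can be used without the dependency
    resp = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    resp.raise_for_status()
    return resp
```

For example, fetch('https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=0', 'http://127.0.0.1:8080') would send the request via a proxy listening on that placeholder address. Alternatively, requests picks up the HTTP_PROXY and HTTPS_PROXY environment variables by default, so the crawler can also be run unchanged with those set.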

Github

usda-fruit-img-spider

Packaged images.zip, 1.1 GB
