Saving the USDA's fruit images with a Python crawler

Date: 2019-11-06

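Introduction

The USDA has produced roughly 7,500 watercolor "ID photos" of the world's known fruits and offers them for high-resolution download on its website, usdawatercolors.nal.usda.gov.

The goal of this crawler is to save those ID photos to the local disk.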

Analysis

The site lists 7,584 images across 380 pages, 20 records per page.

First page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=0
Second page: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=20
... and so on, so the paging is fairly simple.
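
The paging arithmetic above can be sketched as follows. This is a minimal illustration rather than part of the original script: `page_urls` and its parameters are hypothetical names, and the totals are simply the figures quoted above.

```python
import math

BASE_URL = 'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}'


def page_urls(total_records=7584, per_page=20):
    """Return one search URL per results page (hypothetical helper)."""
    # 7584 / 20 is not a whole number, so the last page is only partly filled.
    page_count = math.ceil(total_records / per_page)
    return [BASE_URL.format(page * per_page) for page in range(page_count)]


urls = page_urls()
print(len(urls))  # number of result pages
print(urls[1])    # URL of the second page
```

Deriving the page count from the record total keeps the loop correct if the collection grows, instead of hard-coding 380.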

From the HTML element of each record we can extract:

artist
year
scientific name
common name
thumbnail URL

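Clicking an image opens its detail page, and clicking the "click to enlarge" button there gives us the original image.

However, that would mean opening a new page for every image. It turns out that the thumbnail URL and the original image URL are related: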

Thumbnail URL: ../download/POM00007435/thumbnail
Original image URL: https://usdawatercolors.nal.usda.gov/download/POM00007435/screen

So we only need to extract POM00007435 from the thumbnail URL to build the matching original image URL.
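
The URL mapping described above can be sketched in isolation. This is a minimal sketch that assumes every thumbnail URL ends in `/download/<id>/thumbnail`, as in the example; `original_url_from_thumbnail` is a hypothetical name, not something the site or the original script provides.

```python
import re

ORIGINAL_URL = 'https://usdawatercolors.nal.usda.gov/download/{}/screen'


def original_url_from_thumbnail(thumb_url):
    """Build the full-size image URL from a thumbnail URL (hypothetical helper)."""
    # Assumption: the identifier is the path segment between 'download' and 'thumbnail'.
    match = re.search(r'/download/([^/]+)/thumbnail$', thumb_url)
    if match is None:
        raise ValueError('unexpected thumbnail URL: {}'.format(thumb_url))
    return ORIGINAL_URL.format(match.group(1))


print(original_url_from_thumbnail('../download/POM00007435/thumbnail'))
```

Matching on the surrounding path segments, rather than on a fixed position after splitting on `/`, makes the helper fail loudly if the site ever changes the thumbnail path.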

Crawler

Dependencies

requests
beautifulsoup4

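Source code

Loop 380 times, once per page
For each page, get the HTML tags of its 20 records
For each HTML tag:
    Get the artist, year and other fields
    Build the original image URL from the thumbnail URL
    Download the original image and save it locally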

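import os

import requests
from bs4 import BeautifulSoup

IMG_FOLDER = 'fruit_images/'
# Create the output folder up front so that open() in download_and_save() does not fail.
os.makedirs(IMG_FOLDER, exist_ok=True)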


def run():
    for page in range(380):
        # Each results page holds 20 records; 'start' is the offset of the first one.
        resp = requests.get(
            'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}&searchText=&searchField=&sortField='.format(
                page * 20))
        soup = BeautifulSoup(resp.text, 'html.parser')
        for (div_idx, div) in enumerate(soup.select('div.document')):
            doc = div.select_one('dl.defList')
            artist = doc.select_one(':nth-child(2)>a').get_text()
            year = doc.select_one(':nth-child(4)>a').get_text()
            # Some pictures have no parsable scientific or common name; fall back to 'none' so the run does not stop.
            scientific_name = 'none' if doc.select_one(':nth-child(6)>a') is None else doc.select_one(
                ':nth-child(6)>a').get_text()
            common_name = 'none' if doc.select_one(':nth-child(8)>a') is None else doc.select_one(
                ':nth-child(8)>a').get_text()
            thumb_pic_src = div.select_one('div.thumb-frame>a>img')['src']
            # 1-based running number across all pages (the first page yields 1..20).
            record_id = page * 20 + div_idx + 1
            info = FruitInfo(record_id, artist, year, scientific_name, common_name, thumb_pic_src)
            print(info)
            info.download_and_save()


class FruitInfo:
    def __init__(self, id, artist, year, scientific_name, common_name, thumb_pic_url):
        self.id = id
        self.artist = artist
        self.year = year
        self.scientific_name = scientific_name
        self.common_name = common_name
        self.thumb_pic_url = thumb_pic_url

    def download_and_save(self):
        filename = '{}-{}-{}-{}.png'.format(self.id, self.common_name, self.year, self.artist).replace(' ', '_')
        print('filename = ', filename)
        ori_img_url = self.__parse_ori_img_url()
        print('original img url = ', ori_img_url)
        resp = requests.get(ori_img_url)
        with open(IMG_FOLDER + filename, 'wb') as f:
            f.write(resp.content)
            print('saved...', filename)

    def __parse_ori_img_url(self) -> str:
        img_id = self.thumb_pic_url.split('/')[2]
        print('img id = ', img_id)
        return 'https://usdawatercolors.nal.usda.gov/download/{}/screen'.format(img_id)

    def __str__(self):
        return 'FruitInfo(artist={},year={},scientific_name={},common_name={},thumb_pic_url={})'.format(
            self.artist, self.year, self.scientific_name, self.common_name, self.thumb_pic_url)


if __name__ == '__main__':
    run()
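
One caveat with the script above: the filename is built directly from scraped text, so a name containing a path separator such as `/` would make `open()` fail or write to an unexpected location. Whether any record actually contains such characters is an assumption and has not been verified here. A minimal sketch of a sanitizing step follows; `safe_filename` is a hypothetical helper, and the sample values are made up for illustration.

```python
import re


def safe_filename(record_id, common_name, year, artist, ext='png'):
    """Build a filename from record fields, replacing risky characters (hypothetical helper)."""
    raw = '{}-{}-{}-{}'.format(record_id, common_name, year, artist)
    # Keep word characters, dots and hyphens; collapse any other run of characters to '_'.
    cleaned = re.sub(r'[^\w.-]+', '_', raw).strip('_')
    return '{}.{}'.format(cleaned, ext)


# Made-up sample values, not taken from the collection.
print(safe_filename(1, 'apples/pears', 1905, 'Jane Doe'))
```

This also covers the space-to-underscore replacement that `download_and_save` already performs, so it could stand in for the `.replace(' ', '_')` call there.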