Yes, today we're going to scrape the images on this site. They are mostly high-resolution wallpapers, including plenty of beautiful girls, so I want to pull them down and use them as desktop backgrounds.
A few Python modules are used along the way: requests, bs4 (BeautifulSoup), re, os, json, platform, and multiprocessing.
Only bs4 gets a brief introduction below.
Of the object types bs4 defines, the first three are used here: Tag, NavigableString, and BeautifulSoup.
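To get a feel for these types, here is a minimal sketch on a throwaway snippet (it is not part of the article's script; it only prints the type of each object):

from bs4 import BeautifulSoup

mini = BeautifulSoup('<p class="title"><b>Hello</b></p>', 'html.parser')
print(type(mini))           # <class 'bs4.BeautifulSoup'>
print(type(mini.b))         # <class 'bs4.element.Tag'>
print(type(mini.b.string))  # <class 'bs4.element.NavigableString'>

The examples that follow all use the classic "three sisters" document from the official documentation: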
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup

# Pass in the HTML above
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
Execution result:
html_doc = """ <html> <head> <title> The Dormouse's story </title> </head> <body> <p class="title"> <b> The Dormouse's story </b> </p> <p class="story"> Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1"> Elsie </a> , <a class="sister" href="http://example.com/lacie" id="link2"> Lacie </a> and <a class="sister" href="http://example.com/tillie" id="link3"> Tillie </a> ; and they lived at the bottom of a well. </p> <p class="story"> ... </p> </body> </html> """
soup = BeautifulSoup(html_doc, 'html.parser')
soup.title
Execution result:
<title>The Dormouse's story</title>
soup.p
Execution result:
<p class="title"><b>The Dormouse's story</b></p>
soup.find_all('a')
Note that find_all looks up every occurrence of a given tag; here it finds all the a tags and returns a list.
soup.p['class']
Execution result:
['title']
A Tag object corresponds to a tag in the underlying XML or HTML document and can be accessed directly through the tag's name. What does that mean? Printing all of a tag's attributes makes it clear:
The source content is:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.attrs)
The output is:
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
As you can see, the tag t has href, class, and id attributes. Let's print each of them:
soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t['href'])
print(t['class'])
print(t['id'])
The output is:
http://example.com/elsie
['sister']
link1
class is printed as a list; this is because class is a multi-valued attribute.
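To see the multi-valued behaviour, here is a small sketch on a throwaway tag (the snippet is made up, but the behaviour matches the official documentation):

from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.p['class'])  # ['body', 'strikeout'] -- one list entry per class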
A Tag object also has two other important attributes, name and string. Let's look at what they do through the output:
soup = BeautifulSoup(html_doc, 'html.parser')
t = soup.a
print(t.name)
print(t.string)
The result is:
a
Elsie
As you can see, name is the tag's name and string is the string the tag contains.
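This is also where the NavigableString type mentioned earlier shows up: .string is not a plain str, and it comes back as None when a tag holds more than a single child. A quick sketch against the same html_doc:

soup = BeautifulSoup(html_doc, 'html.parser')
print(type(soup.a.string))  # <class 'bs4.element.NavigableString'>
print(soup.body.string)     # None -- <body> has several children, so .string is undefined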
Source content:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('a'))
This finds all the a tags and returns a list:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
Since the return value is a list, we can iterate over it:
soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all('a'):
    print(r.string)
This retrieves the string contained in each <a></a> tag; the result is:
Elsie
Lacie
Tillie
import re

soup = BeautifulSoup(html_doc, 'html.parser')
for r in soup.find_all(id=re.compile(r'link(\d+)')):
    print(r)
Here a regular expression specifies what to match: an id of link followed by a number. The three a tags satisfy that condition:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(['a', 'p']))
The result is:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, <p class="story">...</p>]
Passing a list like this produces a lot of output, so can we add some filters?
html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well. </p> <a href="http://example.com/tillie" class="sister" id="link4">Tillie</a> <p class="story">...</p> """
An extra a tag with id=link4 has been added to make the checks below easier. Now define a filter, following the official documentation:
soup = BeautifulSoup(html_doc, 'html.parser')

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

print(soup.find_all(has_class_but_no_id))
Execution result:
[<p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well.</p>, <p class="story">...</p>]
The result does not include the a tag with id=link4 that we just added, so the filter works.
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all(id='link4'))
The execution result is:
[<a class="sister" href="http://example.com/tillie" id="link4">Tillie</a>]
The result is exactly the a tag we just added.
If we want to search for every tag that has an id attribute, we can use find_all(id=True).
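For example, on the sample document above (including the extra id=link4 tag) this returns every a tag and nothing else; a quick check:

soup = BeautifulSoup(html_doc, 'html.parser')
for tag in soup.find_all(id=True):
    print(tag.name, tag['id'])
# a link1
# a link2
# a link3
# a link4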
However, some attributes cannot be searched this way, for example data-* attributes (data-foo is not a valid Python keyword argument). In that case you can pass a dict through the attrs parameter to search for tags with such special attributes, like this:

soup.find_all(attrs={"data-foo": "value to search for"})
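A minimal, self-contained version of that, in the style of the official docs (the div snippet is just a made-up example):

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
print(data_soup.find_all(attrs={"data-foo": "value"}))
# [<div data-foo="value">foo!</div>]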
soup.find_all("a", class_="story")
The class_ parameter accepts the same kinds of filters as elsewhere: a string, a regular expression, a function, or True:
# Using a regular expression
soup.find_all(class_=re.compile("itl"))

# Using a custom filter function
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
The result of the last one is:
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
That gives us a basic working knowledge of the bs4 module, which is enough to go and grab the images we want.
First, visit https://wallhaven.cc/ and search for some images. For example, after typing the keyword sexy girl, the browser's address bar changes to something like https://wallhaven.cc/search?q=sexy girl&page=2. Trying a few other searches shows that the site's result URLs follow a pattern:

https://wallhaven.cc/search?q=keyword&parameters
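Since the keyword ends up in the query string, it has to be URL-encoded: the space in sexy girl becomes %20, which is the form used in the requests call below. A small sketch, with keyword as an arbitrary example term:

from urllib.parse import quote

keyword = "sexy girl"
url = "https://wallhaven.cc/search?q=" + quote(keyword)
print(url)  # https://wallhaven.cc/search?q=sexy%20girl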
With that pattern in hand, we can use requests to fetch the page.
I pressed F12, looked at the headers sent with the request, picked a few of them, and used requests directly:
import requests

def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req

print(request_client("https://wallhaven.cc/search?q=sexy%20girl").text)
This returns the page's HTML. The snippet below is the part that holds the image address:
<li>
  <figure class="thumb thumb-4y9pv7 thumb-sfw thumb-general" data-wallpaper-id="4y9pv7" style="width:300px;height:200px">
    <img alt="loading" class="lazyload" data-src="https://th.wallhaven.cc/small/4y/4y9pv7.jpg" src="" />
    <a class="preview" href="https://wallhaven.cc/w/4y9pv7" target="_blank"></a>
    <div class="thumb-info">
      <span class="wall-res">1920 x 1200</span>
      <a class="jsAnchor overlay-anchor wall-favs" data-href="https://wallhaven.cc/wallpaper/fav/4y9pv7">9 <i class="fa fa-fw fa-star"></i></a>
      <a class="jsAnchor thumb-tags-toggle tagged" data-href="https://wallhaven.cc/wallpaper/tags/4y9pv7" title="Tags"><i class="fas fa-fw fa-tags"></i></a>
    </div>
  </figure>
</li>
The image address sits in the data-src attribute, and the <img> tag carries class=lazyload. With those two facts we can use a regular expression to pull out the image URL:
def get_img_url_list(soup):
    # Extract the URL and rewrite it into a link that can actually be downloaded
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url

    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list
This step returns a list of image URLs. The code also rewrites each URL, because the one we scrape is not the real image address; opening an image and inspecting it with F12 in the browser shows the address becomes:
# Real download address
https://w.wallhaven.cc/full/4o/wallhaven-4ozvv9.jpg
# Address found in the HTML
https://th.wallhaven.cc/small/4o/4ozvv9.jpg
So the code makes the following replacements: th.wallhaven.cc ---> w.wallhaven.cc, /small/ ---> /full/, 4ozvv9.jpg ---> wallhaven-4ozvv9.jpg.
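To make that concrete, here is a quick check of get_img_url_list against the <img> snippet shown earlier (thumb_html is just a stand-in for illustration; it assumes the function and imports defined above):

thumb_html = '<img alt="loading" class="lazyload" data-src="https://th.wallhaven.cc/small/4y/4y9pv7.jpg" src=""/>'
print(get_img_url_list(BeautifulSoup(thumb_html, 'html.parser')))
# ['https://w.wallhaven.cc/full/4y/wallhaven-4y9pv7.jpg']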
The next step is to keep analysing the HTML we just fetched; here is the key section:
<ul class="pagination" data-pagination='{"total":638,"current":1,"url":"https:\/\/wallhaven.cc\/search?q=animals&page=1"}' role="navigation"> <li> <span aria-hidden="true" original-tile="Previous Page"> <i class="far fa-angle-double-left"> </i> </span> </li> <li aria-current="page" class="current"> <span original-title="Page 1"> 1 </span> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=2" original-title="Page 2"> 2 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=3" original-title="Page 3"> 3 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=4" original-title="Page 4"> 4 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=5" original-title="Page 5"> 5 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=6" original-title="Page 6"> 6 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=7" original-title="Page 7"> 7 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=8" original-title="Page 8"> 8 </a> </li> <li aria-disabled="true"> <span> … </span> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=637" original-title="Page 637"> 637 </a> </li> <li> <a href="https://wallhaven.cc/search?q=animals&page=638" original-title="Page 638"> 638 </a> </li> <li> <a aria-label="next" class="next" href="https://wallhaven.cc/search?q=animals&page=2" rel="next"> <i class="far fa-angle-double-right"> </i> </a> </li> </ul>
Looking inside the <ul></ul> tag, the page count sits in the data-pagination attribute, so we only need to read that attribute's value:
def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1
A simple fallback on the return value keeps the rest of the code running even if the total is missing, since the exact page count does not affect the correctness of the result.
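A quick sanity check of get_max_page, using a trimmed-down stand-in for the pagination markup shown above (pagination_html is only for illustration):

pagination_html = '<ul class="pagination" data-pagination=\'{"total":638,"current":1}\'></ul>'
print(get_max_page(BeautifulSoup(pagination_html, 'html.parser')))  # 638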
def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)
    # Normalise the save path so it ends with the platform's separator
    end_swith = '\\' if platform.system().lower() == 'windows' else '/'
    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith
    # Download each image and save it into the target directory
    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return
Downloading itself is simple: once we have the image address, we can fetch and save it.
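As a side note, the manual separator handling in getImg could arguably be replaced with os.path.join, which picks the right separator for the platform. A small alternative sketch (build_save_path is a hypothetical helper, not part of the original script):

def build_save_path(save_path, img_url):
    # os.path.join inserts the platform-appropriate separator automatically
    _, save_name = os.path.split(img_url)
    return os.path.join(save_path, save_name)

print(build_save_path('/data/home/dogfei/Pictures/Wallpapers',
                      'https://w.wallhaven.cc/full/4y/wallhaven-4y9pv7.jpg'))
# /data/home/dogfei/Pictures/Wallpapers/wallhaven-4y9pv7.jpg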
To speed up downloading, multiprocessing is used. To avoid pegging the machine's CPU, not all cores are used:
def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)

if __name__ == '__main__':
    start_time = time.time()
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    # Use one core fewer than the machine has, so the CPU is not maxed out
    cpu = cpu_count() - 1
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    # Create a process pool
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))
Note that this does not download every page of results. A simple cap is applied: when the total number of pages does not exceed the number of CPU cores used, everything is downloaded; otherwise only as many pages as there are cores.
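multiprocessing.Pool queues any tasks beyond the number of worker processes, so the cap above is a deliberate way to limit how much gets downloaded rather than a technical requirement. A hedged sketch of submitting every page and letting the pool schedule them (download_all_pages is a hypothetical variant, not part of the original script):

def download_all_pages(base_url, save_path, total_pages):
    cpu = max(cpu_count() - 1, 1)
    with Pool(processes=cpu) as pool:
        # All pages are submitted; only `cpu` of them run at any one time
        for p in range(1, total_pages + 1):
            pool.apply_async(run, args=(base_url, save_path, p))
        pool.close()
        pool.join()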
Full source code:
import re
import os
import json
import time
import requests
import platform
from bs4 import BeautifulSoup
from bs4 import NavigableString
from multiprocessing import Pool, cpu_count


def request_client(url):
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36'
    headers = {
        'user-agent': user_agent,
        'accept-ranges': 'bytes',
        'accept-language': 'zh-CN,zh;q=0.9'
    }
    req = requests.get(url, headers=headers)
    return req


def get_max_page(soup):
    result = soup.find('ul', class_='pagination')['data-pagination']
    to_json = json.loads(result)
    return to_json['total'] if 'total' in to_json else 1


def get_img_url_list(soup):
    # Extract the URL and rewrite it into a link that can actually be downloaded
    def get_url(tag):
        re_img = re.compile(r'data-src="(.+?\.jpg)"')
        url = re_img.findall(str(tag))[0]
        _, img_name = os.path.split(url)
        replace_content = {
            'th.wallhaven.cc': 'w.wallhaven.cc',
            '/small/': '/full/',
            img_name: 'wallhaven-' + img_name
        }
        for k, v in replace_content.items():
            url = url.replace(k, v)
        return url

    img_url_list = []
    for tag in soup.find_all("img", class_="lazyload"):
        img_url_list.append(get_url(tag))
    return img_url_list


def getImg(img_url_list: list, save_path):
    if not os.path.isdir(save_path):
        os.makedirs(save_path)
    end_swith = '\\' if platform.system().lower() == 'windows' else '/'
    if not save_path.endswith(end_swith):
        save_path = save_path + end_swith
    for img in img_url_list:
        _, save_name = os.path.split(img)
        whole_save_path = save_path + save_name
        img_content = request_client(img).content
        with open(whole_save_path, 'wb') as fw:
            fw.write(img_content)
        print("ImageUrl: %s download successfully." % img)
    return


def run(base_url, save_path, page=1):
    url = base_url + '&page=%d' % page
    pageHtml = request_client(url).text
    img_url_list = get_img_url_list(BeautifulSoup(pageHtml, 'lxml'))
    getImg(img_url_list, save_path)


if __name__ == '__main__':
    # The search URL to download from
    baseUrl = "https://wallhaven.cc/search?q=sexy%20girls&atleast=2560x1080&sorting=favorites&order=desc"
    # The directory to save images into
    save_path = '/data/home/dogfei/Pictures/Wallpapers'
    ######## Nothing below needs to be changed
    start_time = time.time()
    baseHtml = request_client(baseUrl).text
    pages = get_max_page(BeautifulSoup(baseHtml, 'lxml'))
    cpu = cpu_count() - 1
    print("Cpu cores: %d" % cpu)
    pages = cpu if pages > cpu else pages
    pool = Pool(processes=cpu)
    for p in range(1, pages + 1):
        pool.apply_async(run, args=(baseUrl, save_path, p,))
    pool.close()
    pool.join()
    end_time = time.time()
    print("Total time: %.2f seconds" % (end_time - start_time))
You're welcome to follow my WeChat public account so we can learn and improve together.