Scraping Images from a Web Page with Python

Date: 2019-11-16


1. Overview

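Following http://www.cnblogs.com/abelsu/p/4540711.html, I wrote a Python script that grabs the images from a single web page. Python has since moved to version 3, however, so the code in that reference no longer works and little of it can be used as is. I modified it and reimplemented the image scraper.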

2. Code

# coding=utf-8

# The re module provides regular expression support
import re
# urllib.request provides the interface for reading data from web pages
import urllib.request
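# Define getHtml(): fetch a page and return its HTML as text
def getHtml(url):
    page = urllib.request.urlopen(url)  # urllib.request.urlopen() opens the URL
    html = page.read()  # read() returns the response body as bytes
    html = html.decode('utf-8')  # assumes the page is encoded as UTF-8
    #print(html)
    return html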

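def getImg(html):
    reg = r'img.*? src="(.+?\.jpg)"'  # regular expression that captures the image URL
    imgre = re.compile(reg)  # re.compile() compiles the pattern into a regular expression object

    imglist = imgre.findall(html)  # findall() returns every match of imgre in html
    # Loop over the matched image URLs and save each one locally.
    # urllib.request.urlretrieve() downloads the remote data straight to a local file;
    # the files are named with an increasing counter x.
    x = 0

    for imgurl in imglist:
        # Raw string so the backslashes in the Windows path are not treated as escapes;
        # the E:\Raumrot folder must already exist.
        urllib.request.urlretrieve(imgurl, r'E:\Raumrot\%s.jpg' % x)
        x += 1
        print(imgurl)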

html = getHtml("http://raumrot.com/photo-set-landing/")
getImg(html)
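
The script above passes each matched src value straight to urlretrieve(), which only works when the page uses absolute image URLs. Below is a minimal sketch of one way to handle relative paths as well, assuming the same regular expression. The names extract_image_urls, download_images, and sample_html, and the example.com addresses, are hypothetical and are not part of the original script.

```python
import re
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Same pattern as the script above: capture the src of <img> tags ending in .jpg.
IMG_PATTERN = re.compile(r'img.*? src="(.+?\.jpg)"')


def extract_image_urls(html, base_url):
    """Return the .jpg URLs found in html, resolved against base_url."""
    return [urljoin(base_url, src) for src in IMG_PATTERN.findall(html)]


def download_images(urls, out_dir):
    """Save each URL into out_dir as 0.jpg, 1.jpg, ... (needs network access)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    for index, url in enumerate(urls):
        urlretrieve(url, str(out_path / f"{index}.jpg"))


# Hypothetical sample markup, used only to show the URL resolution step.
sample_html = '<img class="a" src="/photos/cat.jpg"><img src="http://example.com/dog.jpg">'
print(extract_image_urls(sample_html, "http://example.com/gallery/"))
```

Splitting extraction from downloading lets the URL handling be checked without touching the network. To reuse it with the page from the script above, pass the result of getHtml(url) and the same url to extract_image_urls(), then hand the list to download_images().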