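Crawling really just means fetching the content behind a link and saving it locally. So before you crawl anything, you need to know which link you want to crawl.
The page to crawl is this one: http://findicons.com/pack/2787/beautiful_flat_icons
It has quite a few nice icons. The goal is to download these image files and save them as local images.
How do we do that with Python 3?
Step 1: Fetch the content of the parent page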
import urllib.request
import re

# The parent page that lists the icons
url = "http://findicons.com/pack/2787/beautiful_flat_icons"
webPage = urllib.request.urlopen(url)
# read() returns bytes, so decode them into a str
data = webPage.read()
data = data.decode('UTF-8')
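The fetch above assumes the page is always UTF-8 and that the server answers promptly. Here is a minimal sketch of a slightly more defensive version, assuming you want a timeout and want to use the charset the server declares. The helper names `decode_body` and `fetch_page` and the 10-second timeout are my own choices for illustration, not part of the original script.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 if the charset is missing or unknown."""
    try:
        # errors="replace" turns undecodable bytes into U+FFFD instead of raising
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # The declared charset name is not one Python knows about
        return raw.decode("utf-8", errors="replace")


def fetch_page(url, timeout=10):
    """Download a page and return its text (hypothetical helper, not called here)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # get_content_charset() reads the charset from the Content-Type header, if any
        charset = resp.headers.get_content_charset()
        return decode_body(resp.read(), charset)
```

With this in place, `data = fetch_page(url)` could stand in for the three lines that open, read, and decode the page.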
Step 2: Process the parent page's content and extract the image links from it
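# Split the page on whitespace so each attribute (src="...", href="...") becomes its own token
k = re.split(r'\s+', data)
s = []   # src attributes that point to .png or .ico files
sp = []  # the subset of s that points to .png files
for i in k:
    if re.match(r'src', i) or re.match(r'href', i):
        # Skip placeholder links
        if not re.match(r'href="#"', i):
            if re.match(r'.*?png"', i) or re.match(r'.*?ico"', i):
                if re.match(r'src', i):
                    s.append(i)
for it in s:
    if re.match(r'.*?png"', it):
        sp.append(it)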
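Splitting on whitespace works as long as every attribute is written exactly as `src="..."` with no spaces inside it. As an alternative, here is a rough sketch that uses the standard library's `html.parser` to collect the `src` of every `<img>` tag ending in `.png`. This is a different technique from the regex approach above. The names `PngSrcCollector` and `extract_png_links` and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class PngSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag that points to a .png file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; names are already lowercased
        for name, value in attrs:
            if name == "src" and value and value.lower().endswith(".png"):
                self.links.append(value)


def extract_png_links(html):
    collector = PngSrcCollector()
    collector.feed(html)
    return collector.links


# Made-up HTML for illustration only
sample = (
    '<a href="#"><img src="http://example.com/a.png"></a>'
    '<img src="/b.ico"><img alt="x" src="/c.PNG">'
)
print(extract_png_links(sample))
```

Note that this returns the bare URLs rather than whole `src="..."` tokens, so the `re.search` in the next step would not be needed if you went this way.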
Step 3: Fetch the content of these image links and save it as local images
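cnt = 0  # index of the current absolute link
cou = 1  # number used for the next saved file name
for it in sp:
    # Pull the URL out of the src="..." token
    m = re.search(r'src="(.*?)"', it)
    iturl = m.group(1)
    print(iturl)
    # Skip site-relative links (these do not advance cnt)
    if iturl[0] == '/':
        continue
    web = urllib.request.urlopen(iturl)
    itdata = web.read()
    # Save only every third link, starting from index 4, and at most 30 files
    if cnt % 3 == 1 and cnt >= 4 and cou <= 30:
        with open('d:/pythoncode/simplecodes/image/' + str(cou) + '.png', "wb") as f:
            f.write(itdata)
        cou = cou + 1
        print(it)
    cnt = cnt + 1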
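The loop above simply skips any link that starts with `/`. If you would rather download those too, one option is to resolve them against the parent page's URL with `urllib.parse.urljoin`. The sketch below assumes that is what you want. `absolute_url` is a hypothetical helper, and the example paths are invented.

```python
from urllib.parse import urljoin

# The parent page from step 1
BASE_URL = "http://findicons.com/pack/2787/beautiful_flat_icons"


def absolute_url(link, base=BASE_URL):
    """Turn a relative link into an absolute one; absolute links pass through unchanged."""
    return urljoin(base, link)


# Invented example paths, for illustration only
print(absolute_url("/static/a.png"))
print(absolute_url("http://cdn.example.com/b.png"))
```

Inside the loop you could then write `iturl = absolute_url(iturl)` instead of the `continue`. Keep in mind that this changes which links get counted by `cnt`, so the every-third-link filter would pick different files.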
You can set the save directory to whatever you like.
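Since the save directory is up to you, it can help to pass it in as a parameter and create it if it does not exist yet (`open` fails when the folder is missing). A small sketch with `pathlib` follows. The function name `save_image` and the default folder name `images` are assumptions for illustration.

```python
from pathlib import Path


def save_image(data, index, out_dir="images"):
    """Write image bytes to <out_dir>/<index>.png, creating the folder if needed."""
    folder = Path(out_dir)
    # parents=True also creates missing parent folders; exist_ok=True tolerates reruns
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{index}.png"
    target.write_bytes(data)
    return target
```

In the loop from step 3, `save_image(itdata, cou, 'd:/pythoncode/simplecodes/image')` could take the place of the `open` / `write` / `close` calls.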
Here is the complete code, all put together:
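import urllib.request
import re

# Step 1: fetch the parent page
url = "http://findicons.com/pack/2787/beautiful_flat_icons"
webPage = urllib.request.urlopen(url)
data = webPage.read()
data = data.decode('UTF-8')

# Step 2: extract the .png image links
k = re.split(r'\s+', data)
s = []   # src attributes that point to .png or .ico files
sp = []  # the subset of s that points to .png files
for i in k:
    if re.match(r'src', i) or re.match(r'href', i):
        if not re.match(r'href="#"', i):
            if re.match(r'.*?png"', i) or re.match(r'.*?ico"', i):
                if re.match(r'src', i):
                    s.append(i)
for it in s:
    if re.match(r'.*?png"', it):
        sp.append(it)

# Step 3: download the images and save them locally
cnt = 0  # index of the current absolute link
cou = 1  # number used for the next saved file name
for it in sp:
    m = re.search(r'src="(.*?)"', it)
    iturl = m.group(1)
    print(iturl)
    # Skip site-relative links (these do not advance cnt)
    if iturl[0] == '/':
        continue
    web = urllib.request.urlopen(iturl)
    itdata = web.read()
    # Save only every third link, starting from index 4, and at most 30 files
    if cnt % 3 == 1 and cnt >= 4 and cou <= 30:
        with open('d:/pythoncode/simplecodes/image/' + str(cou) + '.png', "wb") as f:
            f.write(itdata)
        cou = cou + 1
        print(it)
    cnt = cnt + 1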