Fetching images from a web page with a Python crawler

Date: 2019-12-09


import urllib.request,os
import re


# Fetch the raw content of the page (returned as bytes)
def getHtml(url):
    with urllib.request.urlopen(url) as page:
        return page.read()

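path='local storage location'  # placeholder: set this to the directory to save into

# Build the save path for image number x, creating the directory if needed
def saveFile(x):
    os.makedirs(path, exist_ok=True)
    t = os.path.join(path,'%s.jpg'%x)
    return t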

html=getHtml('https://。。。')

# Download the images found in the page
def getImg(html):
    # Regular expression for the image URLs
    reg=r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"'
    # Compile the regular expression
    imgre=re.compile(reg)
    imglist=imgre.findall(html.decode('utf-8'))
    x=0
    for imgurl in imglist:
        # Download the image
        urllib.request.urlretrieve(imgurl,saveFile(x))
        print(imgurl)
        x+=1
        # Stop after 23 images
        if x==23:
            break
    print(x)
    return imglist

getImg(html)
print('end')
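
The extraction and file-naming steps of the script can be tried without any network access. The following is a minimal sketch under assumptions: the HTML snippet, the host names under example.com and the helper name plan_downloads are hypothetical and do not come from the original script. Only the regular expression, the numbered .jpg file names and the limit of 23 are taken from it.

```python
import os
import re

# Same pattern as in the script above.
IMG_PATTERN = re.compile(r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"')


def plan_downloads(html_text, directory, limit=23):
    """Return (url, target_path) pairs for at most `limit` matched images.

    Hypothetical helper: it only pairs URLs with file names, it downloads nothing.
    """
    urls = IMG_PATTERN.findall(html_text)[:limit]
    return [(url, os.path.join(directory, '%s.jpg' % i))
            for i, url in enumerate(urls)]


# Made-up HTML used purely for illustration.
sample_html = (
    '<img src="https://imgsa.example.com/a.jpg">'
    '<img src="https://other.example.com/b.jpg">'
    '<img src="https://imgsa.example.com/c.jpeg">'
)

plan = plan_downloads(sample_html, 'images')
for url, target in plan:
    print(url, '->', target)
```

Each pair could then be passed to urllib.request.urlretrieve(url, target), as the original loop does.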

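Regular expressions:

^ : start of the string

$ : end of the string

. : matches any character except a newline

* : zero or more repetitions of the preceding item

+ : one or more repetitions of the preceding item

? : zero or one of the preceding item; home-?brew matches homebrew or home-brew

[] : defines a character class. Characters can be listed individually, or a range can be given with -. [abc] matches any one of a, b, c, and can also be written as [a-c]

[^] : with ^ as the first character of the class, the class is negated; [^5] matches any character except 5

\ : escape character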

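Putting a backslash in front of a metacharacter removes its special meaning. Take the text \section as an example: to match the literal backslash, the pattern has to be written \\section, but in an ordinary Python string literal each backslash must itself be escaped again, so the literal becomes "\\\\section" and the backslashes pile up. Writing the pattern as a raw string avoids this: with an r prefix, backslashes in the string are not treated specially, so r"\n" is the two characters \ and n rather than a newline.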
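
The metacharacters and the raw-string behaviour described above can be checked with a short script. This is only an illustrative sketch; all sample strings are made up and are not from the original post.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
print(bool(re.search(r'^abc$', 'abc')))
print(bool(re.search(r'^abc$', 'xabc')))

# . matches any character except a newline.
print(re.findall(r'a.c', 'abc a\nc a-c'))

# * allows zero or more repetitions, + requires at least one.
print(re.findall(r'ab*', 'a ab abb'))
print(re.findall(r'ab+', 'a ab abb'))

# ? makes the preceding item optional.
print(re.findall(r'home-?brew', 'homebrew home-brew home--brew'))

# [...] is a character class, [^...] is its negation.
print(re.findall(r'[a-c]', 'abcd5'))
print(re.findall(r'[^5]', 'a55b'))

# A raw string keeps the backslash: '\n' is one character, r'\n' is two.
print(len('\n'), len(r'\n'))

# The pattern \\section matches the literal text \section.
print(re.findall(r'\\section', r'\section{Intro}'))
```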

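For example, https://imgsa[^>]+\.(?:jpeg|jpg) matches https://imgsa, followed by one or more characters other than >, followed by a literal dot and either jpeg or jpg. The (?:...) part groups the two alternatives without capturing them.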
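
To see what this pattern accepts and rejects, it can be run against a few attribute strings. The sketch below uses hypothetical URLs under example.com. The stricter variant with [^">]+ is a suggestion added here, not part of the original script.

```python
import re

pattern = re.compile(r'src="(https://imgsa[^>]+\.(?:jpeg|jpg))"')

# Hypothetical attribute strings, made up for illustration.
candidates = [
    'src="https://imgsa.example.com/pic/1.jpg"',    # matches
    'src="https://imgsa.example.com/pic/2.jpeg"',   # matches
    'src="https://imgsa.example.com/pic/3.png"',    # wrong extension
    'src="http://imgsa.example.com/pic/4.jpg"',     # http, not https
]

results = []
for text in candidates:
    m = pattern.search(text)
    # group(1) is the URL; (?:...) does not create a second group.
    results.append(m.group(1) if m else None)
    print(text, '->', results[-1])

# [^>]+ is greedy and also accepts quotes, so a later attribute ending
# in .jpg" inside the same tag is swallowed into the match.
tag = '<img src="https://imgsa.example.com/a.jpg" data-thumb="t.jpg">'
loose = pattern.search(tag).group(1)
print(loose)

# Suggested stricter variant (an assumption, not from the original): exclude quotes too.
strict_pattern = re.compile(r'src="(https://imgsa[^">]+\.(?:jpeg|jpg))"')
strict = strict_pattern.search(tag).group(1)
print(strict)
```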

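Method/Attribute	Purpose
match()	Determine whether the RE matches at the beginning of the string
search()	Scan through the string, looking for a location where the RE matches
findall()	Find all substrings where the RE matches and return them as a list
finditer()	Find all substrings where the RE matches and return them as an iterator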
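
The four methods in the table above can be compared on one input. This is a small sketch with a made-up pattern and string, not code from the original post.

```python
import re

pattern = re.compile(r'\d+')
text = 'abc 12 def 345'

# match() only succeeds at the beginning of the string.
at_start = pattern.match(text)
print(at_start)
print(pattern.match('12 abc').group())

# search() finds the first match anywhere in the string.
first = pattern.search(text).group()
print(first)

# findall() returns every match as a list of strings.
all_matches = pattern.findall(text)
print(all_matches)

# finditer() yields match objects, which also carry positions.
spans = [(m.group(), m.span()) for m in pattern.finditer(text)]
print(spans)
```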

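Method/Attribute	Purpose
group()	Return the string matched by the RE
start()	Return the starting position of the match
end()	Return the ending position of the match
span()	Return a tuple containing the (start, end) positions of the match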
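
The methods in this second table belong to the match object returned by match(), search() or finditer(). A brief sketch with a made-up input string:

```python
import re

text = 'range: 10-25'
m = re.search(r'(\d+)-(\d+)', text)

print(m.group())    # the whole matched text
print(m.group(1))   # first capturing group
print(m.group(2))   # second capturing group
print(m.start())    # index where the match begins
print(m.end())      # index just past the end of the match
print(m.span())     # (start, end) as a tuple

# The span can be used to slice the original string.
print(text[m.start():m.end()])
```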

Task: in a document, find every system('***'); and append print('***') after it.

The document is:

aba
cdc
system('a');
cde;
system('d');

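Searching with system\([\s\S]*\) (\s matches whitespace such as space, \t and \n; \S matches non-whitespace; [] matches any one character from the set; * means zero or more) finds:

system('a');
cde;
system('d');

This happens because * is greedy and takes the longest possible match. To stop at the first closing parenthesis instead, use the non-greedy form: system\([\s\S]*?\).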
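
The difference between the greedy and the non-greedy quantifier can be reproduced in Python. This sketch assumes the pattern is written with escaped parentheses, system\([\s\S]*\), and uses the sample document from above as a string.

```python
import re

text = "aba\ncdc\nsystem('a');\ncde;\nsystem('d');\n"

# Greedy: * runs to the last ')' in the text, so there is one long match.
greedy = re.findall(r"system\([\s\S]*\)", text)
print(greedy)

# Non-greedy: *? stops at the first ')' after each 'system('.
lazy = re.findall(r"system\([\s\S]*?\)", text)
print(lazy)
```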

The result after replacement should be:

aba
cdc
system('a');'a'
cde;
system('d');'d'
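
One way to perform this replacement in Python is re.sub with a capturing group. This is a sketch under assumptions, not the author's method: the pattern is written here with escaped parentheses, the quoted argument is captured as group 1, and the second variant that wraps the argument in print(...) is added only to match the task statement.

```python
import re

text = "aba\ncdc\nsystem('a');\ncde;\nsystem('d');\n"

# Group 1 captures the quoted argument, e.g. 'a'.
pattern = re.compile(r"system\(('[\s\S]*?')\);")

# \g<0> is the whole match; \1 appends the captured argument after it.
result = pattern.sub(r"\g<0>\1", text)
print(result)

# Variant: append print('...') instead of the bare argument.
with_print = pattern.sub(r"\g<0>print(\1)", text)
print(with_print)
```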
