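Get familiar with the most commonly used regular expression syntax.

- `.` matches any single character except a newline.
- `[...]` matches any one character from the given set; for example, `[A-Za-z0-9]` matches any single letter or digit.
- `[^...]` matches any one character that is not in the given set; for example, `[^0-9]` matches any character that is not a digit.
- `\` is the escape character; it removes the special meaning of a special character so that it matches itself.
- `\d` matches a digit.
- `\D` matches a non-digit.
- `\s` matches a whitespace character.
- `\S` matches a non-whitespace character.
- `\w` matches a word character (a letter, a digit, or an underscore).
- `\W` matches a non-word character.
- `*` matches the preceding item zero or more times.
- `+` matches the preceding item one or more times.
- `?` matches the preceding item zero or one time.
- `^` matches the start of the string.
- `$` matches the end of the string.
- `(...)` defines a group.
- `(?P<NAME>...)` defines a group and names it NAME.
- `(?P=NAME)` refers back to the text matched by the group named NAME; it is used together with the previous item.

Request http://qwd.jd.com/fcgi-bin/qwd_searchitem_ex?skuid=26878432382%7C1658610413%7C26222795271%7C25168000024%7C11731514723%7C26348513019%7C20000220615%7C4813030%7C25965247088%7C5327182%7C19588651151%7C1780924%7C15495544751%7C10114188069%7C27036535156%7C10123099847%7C26016197600%7C10503200866%7C16675691362%7C15904713681 and use a regular expression on the JSON string it returns to find each product's skuid (the product's unique code) and skuimgurl (the product image).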
    import re
    import requests

    url = "http://qwd.jd.com/fcgi-bin/qwd_searchitem_ex?skuid=26878432382%7C1658610413%7C26222795271%7C25168000024%7C11731514723%7C26348513019%7C20000220615%7C4813030%7C25965247088%7C5327182%7C19588651151%7C1780924%7C15495544751%7C10114188069%7C27036535156%7C10123099847%7C26016197600%7C10503200866%7C16675691362%7C15904713681"
    session = requests.session()
    r = session.get(url)  # minimal crawler usage; covered in more detail later
    html = r.text
    # Group 1 captures the skuid digits, group 2 the image URL ending in ".jpg".
    # The dot before "jpg" is escaped so that it only matches a literal dot.
    reg = re.compile(r"\s*\"skuid\":\"(\d+)\",\s*\S*\s*\S*\s*\"skuimgurl\":\"(\S*\.jpg)\"")
    result = reg.findall(html)
    print(result)  # with two () groups, findall returns a list of 2-tuples
Output:
[('26878432382', 'https://img20.360buyimg.com/n7/jfs/t18226/169/1318243724/390477/5b0718ff/5ac44edcNa350dbd9.jpg'), ('5327182', 'https://img20.360buyimg.com/n7/jfs/t17461/138/1837663326/68820/5f8da5cd/5ad9b1e2N42bce837.jpg'), ('11731514723', 'https://img20.360buyimg.com/n7/jfs/t19231/337/2147939016/196162/4210a6ae/5aea6250N0235cd05.jpg'), ('19588651151', 'https://img20.360buyimg.com/n7/jfs/t11341/60/1553062810/120774/ab9534ff/5a02c3f4Naebe34b7.jpg'), ('15495544751', 'https://img20.360buyimg.com/n7/jfs/t18088/43/2048465630/167669/dd3c8b7b/5ae12c40N57c98ea8.jpg'), ('16675691362', 'https://img20.360buyimg.com/n7/jfs/t18490/21/2141098141/120513/b3ca521a/5ae90247N3b4909ae.jpg'), ('26222795271', 'https://img20.360buyimg.com/n7/jfs/t19441/291/1597121495/310550/9bc2e141/5ad05fc0N1510cae5.jpg'), ('1780924', 'https://img20.360buyimg.com/n7/jfs/t17167/97/1957869461/43204/d064647b/5adda3e0Ne1d3aa86.jpg'), ('4813030', 'https://img20.360buyimg.com/n7/jfs/t19198/83/1908967366/189260/7538e84b/5adda865N8f547981.jpg'), ('27036535156', 'https://img20.360buyimg.com/n7/jfs/t19399/140/2175516321/123017/41e6d6a8/5aea87d3N9736cc9d.jpg'), ('26348513019', 'https://img20.360buyimg.com/n7/jfs/t14857/240/2643838980/220943/c982fda1/5aaf2002Ndd25bc52.jpg'), ('26016197600', 'https://img20.360buyimg.com/n7/jfs/t19894/76/195725612/190103/23c60ca1/5aeabb94N3e0266bc.jpg'), ('25168000024', 'https://img20.360buyimg.com/n7/jfs/t17629/301/2062161127/434152/aa3560a5/5ae319f9N1ae1146c.jpg'), ('25965247088', 'https://img20.360buyimg.com/n7/jfs/t19270/67/2232771964/253207/25f41fd9/5aea61b0Nfd21a809.jpg'), ('10123099847', 'https://img20.360buyimg.com/n7/jfs/t15511/14/1469153129/729958/b0af0ca1/5a533063N15fea56c.jpg'), ('20000220615', 'https://img20.360buyimg.com/n7/jfs/t16426/172/2638358261/151693/87020840/5ab869ddN30621fec.jpg'), ('15904713681', 'https://img20.360buyimg.com/n7/jfs/t17287/197/2249621651/366556/d36ae213/5aeadb4cN97f413f3.jpg'), ('10114188069', 
'https://img20.360buyimg.com/n7/jfs/t19927/88/179058964/386205/afd08ef1/5ae9717fN07f116d9.jpg'), ('10503200866', 'https://img20.360buyimg.com/n7/jfs/t18139/246/1628563908/114414/9315ac7c/5ad0647eNa9f1e2af.jpg'), ('1658610413', 'https://img20.360buyimg.com/n7/jfs/t19411/79/1017814440/108641/1b185d6d/5ab8b479Nd2417e97.jpg')]
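The same kind of extraction can be tried offline, which also gives a chance to use the named-group `(?P<NAME>...)` and backreference `(?P=NAME)` syntax from the list at the top. The following is only a sketch: `SAMPLE_JSON` is a made-up string shaped loosely like the response (the `title` field and the `img.example.com` URLs are assumptions, not real data), and `extract_products` and `find_repeated_word` are hypothetical helper names.

```python
import re

# Hypothetical sample; the real response may have more fields and different spacing.
SAMPLE_JSON = (
    '{"sku": ['
    '{"skuid":"1001", "title":"demo-a", "skuimgurl":"https://img.example.com/a.jpg"}, '
    '{"skuid":"1002", "title":"demo-b", "skuimgurl":"https://img.example.com/b.jpg"}'
    ']}'
)

# Named groups: "skuid" captures the digits, "img" captures the URL up to ".jpg".
# The lazy ".*?" skips whatever sits between the two fields.
PRODUCT_RE = re.compile(
    r'"skuid":"(?P<skuid>\d+)".*?"skuimgurl":"(?P<img>[^"]+\.jpg)"'
)


def extract_products(text):
    """Return one dict per product, keyed by the group names."""
    return [m.groupdict() for m in PRODUCT_RE.finditer(text)]


# Backreference: (?P=word) must match the same text that (?P<word>...) matched.
REPEAT_RE = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")


def find_repeated_word(text):
    """Return the first word that is immediately repeated, or None."""
    m = REPEAT_RE.search(text)
    return m.group("word") if m else None


if __name__ == "__main__":
    for product in extract_products(SAMPLE_JSON):
        print(product["skuid"], product["img"])
    print(find_repeated_word("this is is a test"))
```

With named groups the result can be read by name (`product["skuid"]`) instead of by tuple position, which tends to be easier to follow once a pattern has more than two groups.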
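Using the contents of the file ga10.wms5.jd.com.txt, match the contents of the `upstream` {} and `location` {} blocks, and write them into two folders named `upstream` and `location`. Each folder holds one file per block, named after the configuration name and containing that block's configuration. The expected result is shown below.

[Figure: regular]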
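- For the `upstream` content, the groups should capture both the name and the full content: the name is used to name the file, and the full content is what gets written to it.
- The `os` module handles checking whether a folder exists, creating it, and switching into it.
- `location` is handled in essentially the same way.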
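To make the first note concrete, here is a small sketch of how nested groups line up with the tuples that `findall` returns. The configuration text in `sample_conf` is invented for illustration (the real contents of ga10.wms5.jd.com.txt are not shown here), and the names `backend_a` and `backend_b` are assumptions.

```python
import re

# Hypothetical nginx-style snippet standing in for the real file.
sample_conf = """
upstream backend_a {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

upstream backend_b {
    server 10.0.0.3:8080;
}
"""

# Groups are numbered by the position of their opening parenthesis:
# the outer group (tuple index 0) is the whole block,
# the inner group (tuple index 1) is just the name.
upstream_re = re.compile(r"(upstream\s+(\S+)\s+\{[^}]+\})")

blocks = upstream_re.findall(sample_conf)
names = [name for _full_text, name in blocks]

if __name__ == "__main__":
    for full_text, name in blocks:
        print(name, "->", len(full_text), "characters")
```

Because `[^}]+` stops at the first `}`, this pattern only suits blocks without nested braces; a block that contains an inner `{ ... }` would be cut off at the inner closing brace.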
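    import codecs
    import re
    import os

    # Group 1 is the whole upstream block, group 2 is its name.
    regupstream = re.compile(r"\s*(upstream\s+(\S+)\s+{[^}]+})")
    with codecs.open("ga10.wms5.jd.com.txt", encoding="utf-8") as fum:
        upstmlist = regupstream.findall(fum.read())
    # Create the output folder if needed, then switch into it.
    if not os.path.exists("upstream"):
        os.mkdir("upstream")
    os.chdir("upstream")
    for item in upstmlist:
        # item[1] is the name (file name), item[0] is the full block (file content).
        with codecs.open(item[1], "w", encoding="utf-8") as fumw:
            fumw.write(item[0])
    os.chdir("..")  # return to the parent folder before handling location

    # Group 1 is the whole location block, group 2 is the path between the slashes.
    reglocation = re.compile(r"\s*(location\s+\/(\S+)\/\s+{[^}]+})")
    with codecs.open("ga10.wms5.jd.com.txt", encoding="utf-8") as flc:
        lcalist = reglocation.findall(flc.read())
    if not os.path.exists("location"):
        os.mkdir("location")
    os.chdir("location")
    for ilocal in lcalist:
        filename1 = ilocal[1] + ".conf"
        with codecs.open(filename1, "w", encoding="utf-8") as flcw:
            flcw.write(ilocal[0])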
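As a variation on the script above, the two passes can share one helper that builds paths with `os.path.join` instead of changing the working directory. This is a sketch under assumptions, not the original method: `write_blocks`, `SAMPLE_CONF`, and `demo` are hypothetical names, the sample configuration is invented, and replacing `/` with `_` in names is an added choice so that a nested location path still yields a valid file name.

```python
import os
import re
import tempfile

UPSTREAM_RE = re.compile(r"(upstream\s+(\S+)\s+\{[^}]+\})")
LOCATION_RE = re.compile(r"(location\s+/(\S+)/\s+\{[^}]+\})")

# Hypothetical configuration text used only for the demo.
SAMPLE_CONF = """
upstream backend_a {
    server 10.0.0.1:8080;
}

server {
    location /api/ {
        proxy_pass http://backend_a;
    }
}
"""


def write_blocks(text, pattern, out_dir, suffix=""):
    """Write each matched block to out_dir/<name><suffix> and return the paths."""
    os.makedirs(out_dir, exist_ok=True)  # create the folder only if it is missing
    paths = []
    for full_text, name in pattern.findall(text):
        # Assumption: "/" inside a name is replaced so the file name stays valid.
        safe_name = name.replace("/", "_")
        path = os.path.join(out_dir, safe_name + suffix)
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(full_text)
        paths.append(path)
    return paths


def demo():
    """Run both passes inside a temporary folder and list what was written."""
    with tempfile.TemporaryDirectory() as root:
        write_blocks(SAMPLE_CONF, UPSTREAM_RE, os.path.join(root, "upstream"))
        write_blocks(SAMPLE_CONF, LOCATION_RE, os.path.join(root, "location"), ".conf")
        for folder in ("upstream", "location"):
            print(folder, sorted(os.listdir(os.path.join(root, folder))))


if __name__ == "__main__":
    demo()
```

Avoiding `os.chdir` means the script does not depend on which folder it happens to be in when an error occurs halfway through, at the cost of passing the output folder around explicitly.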