NSHipster: NSRegularExpression (Chinese edition)

Date: 2019-11-16


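The full version is at this link: https://www.jianshu.com/p/c86...

Original article: NSRegularExpression

Original author: Nate Cook

Run into a problem? Ah, time to use NSRegularExpression.

In practice, there are a few things to watch out for.

Regular expressions are a DSL, and a debated one. Detractors point out that a regex is a jumble of symbols. Supporters point out that regex is concise, powerful, and widely applicable.

What everyone agrees on is that Cocoa gave NSRegularExpression a verbose API. For comparison, start with Ruby. This Ruby code extracts the URLs from a snippet of HTML.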

htmlSource = "Questions? Corrections? <a href=\"https://twitter.com/NSHipster\"> @NSHipsteror</a> or <a href=\"https://github.com/NSHipster/articles\">on GitHub.</a>"
linkRegex = /<a\s+[^>]*href="([^"]*)"[^>]*>/i
links = htmlSource.scan(linkRegex)
puts(links)
# https://twitter.com/NSHipster
# https://github.com/NSHipster/articles

Ruby gets it done in three lines of code.

Now look at the same task in Swift with NSRegularExpression (extracting the URLs from a snippet of HTML).

import Foundation

let htmlSource = "Questions? Corrections? <a href=\"https://twitter.com/NSHipster\"> @NSHipsteror</a> or <a href=\"https://github.com/NSHipster/articles\">on GitHub.</a>"
let linkRegexPattern = "<a\\s+[^>]*href=\"([^\"]*)\"[^>]*>"
// Compared with the Ruby version, quotes and backslashes need an extra '\' escape
let linkRegex = try! NSRegularExpression(pattern: linkRegexPattern, options: .caseInsensitive)
let matches = linkRegex.matches(in: htmlSource, range: NSRange(location: 0, length: htmlSource.utf16.count))
let links = matches.map { result -> String in
    // Range 1 is the first capture group: the value of the href attribute
    let hrefRange = result.range(at: 1)
    let start = String.Index(utf16Offset: hrefRange.location, in: htmlSource)
    let end = String.Index(utf16Offset: hrefRange.location + hrefRange.length, in: htmlSource)
    return String(htmlSource[start..<end])
}
print(links)
// ["https://twitter.com/NSHipster", "https://github.com/NSHipster/articles"]

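{

Brief explanation:

Part 1, <a\\s+ : match the two literal characters <a , then an escape sequence (\s+) that matches one or more whitespace characters.

Part 2, [^>]* : any run of characters that follows, as long as it contains no > .

Part 3, href=\" : the literal text href=" that comes next.

Part 4, ([^\"]*) : a capture group holding any run of characters that contains no " .

Part 5, \"[^>]*> : an escaped closing quote, then any run of characters without > , up to the closing > .

}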

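That is enough about the downsides of NSRegularExpression.

The original (English) article does not teach regular expressions from the ground up. You will need to learn wildcards ('*', '+'), backreferences ('\1'), lookaheads ('(?=...)'), and so on for yourself.

To learn regex in Swift, focus on NSRegularExpression and NSTextCheckingResult, and pay attention to the tricky parts and special cases.

String Methods (NSString Methods)

The easiest way to get started with regex in Cocoa is, of course, to skip NSRegularExpression altogether.

The range(of:...) method on NSString does lightweight string searching. Pass the .regularExpression option to switch it into regular expression mode. (Objective-C's NSString corresponds to String in Swift.)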

import Foundation

let source = "For NSSet and NSDictionary, the breaking..."
// Matches anything that looks like a Cocoa type:
// UIButton, NSCharacterSet, NSURLSession, etc.
let typePattern = "[A-Z]{3,}[A-Za-z0-9]+"
if let typeRange = source.range(of: typePattern, options: .regularExpression) {
    print("First type: \(source[typeRange])")
    // First type: NSSet
}

{

link: https://regex101.com/r/U7TC8v/1

Part 1, [A-Z]{3,} , matches at least three characters in the range A-Z.

Part 2, [A-Za-z0-9]+ , matches at least one character from the set made up of A-Z, a-z, and 0-9.

}
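
The range(of:options:) call above returns only the first match. As a minimal sketch, assuming you want every match without reaching for NSRegularExpression, the search can be restarted just past each hit using the range parameter of the same method. The names sample, cocoaTypePattern, foundTypes, and searchStart are illustrative and not from the original article.

```swift
import Foundation

let sample = "For NSSet and NSDictionary, the breaking..."
let cocoaTypePattern = "[A-Z]{3,}[A-Za-z0-9]+"

// Collect every match by restarting the search just past the previous hit.
var foundTypes: [String] = []
var searchStart = sample.startIndex
while let hit = sample.range(of: cocoaTypePattern,
                             options: .regularExpression,
                             range: searchStart..<sample.endIndex) {
    foundTypes.append(String(sample[hit]))
    searchStart = hit.upperBound
}
print(foundTypes)
// ["NSSet", "NSDictionary"]
```

This works here because the pattern can never match an empty string, so searchStart always advances.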

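Replacement is another common operation. With the same option, use replacingOccurrences(of:with:...) .

Below, a slightly odd-looking piece of code wraps each Cocoa type name in the sentence above in backticks, which makes them easier to spot.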

let markedUpSource = source.replacingOccurrences(of: typePattern, with: "`$0`", options: .regularExpression)
print(markedUpSource)
// "For `NSSet` and `NSDictionary`, the breaking..."

{

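Note:

This relies on the regex concept of capture groups, which the template references by number ($0 is the entire match).

See this link: https://stackoverflow.com/que...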

}
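
To make the numbered references concrete, here is a small sketch with two capture groups, where $1 and $2 in the template refer to the first and second parenthesized parts of the pattern. The roster string and its "Last, First" layout are made up for illustration.

```swift
import Foundation

// Hypothetical input: names written as "Last, First".
let roster = "Cook, Nate; Appleseed, Johnny"

// Group 1 captures the last name, group 2 the first name.
let swapped = roster.replacingOccurrences(
    of: "(\\w+), (\\w+)",
    with: "$2 $1",
    options: .regularExpression)
print(swapped)
// Nate Cook; Johnny Appleseed
```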

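The same replacement template syntax can refer to individual capture groups, which allows bigger transformations. English speakers have a word game that shuffles letters around the first vowel, known as Pig Latin: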

let ourcesay = source.replacingOccurrences(of: "([bcdfghjklmnpqrstvwxyz]*)([a-z]+)", with: "$2$1ay", options: [.regularExpression,.caseInsensitive])
print(ourcesay)
// "orFay etNSSay anday ictionaryNSDay, ethay eakingbray..."
{

link : https://regex101.com/r/lZxWuY/2

Part 1, ([bcdfghjklmnpqrstvwxyz]*) , matches any number of English letters other than a, e, i, o, u (that is, consonants).

Part 2, ([a-z]+) , matches one or more English letters.

}

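In many situations that call for regex, the two methods above are all you need. More complex work calls for the NSRegularExpression class itself. First, let's clear up a mistake that regex newcomers often make in Swift.

NSRange and Swift

Compared with Foundation's NSString, Swift has a broader and more complex API for working with a string's characters and substrings. The Swift standard library exposes string data through four views: characters, Unicode scalars, UTF-8 code units, and UTF-16 code units.

This matters for NSRegularExpression because many of its methods take an NSRange, and the NSTextCheckingResult objects that hold the matches store NSRange values too. NSRange uses integers for its starting location and its length. A Swift String, however, is not indexed by integers: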

let range = NSRange(location: 4, length: 5)
// None of the following lines compile
source[range]
source.characters[range]
source.substring(with:range)
source.substring(with:range.toRange()!)

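Let's keep going.

NSRange offsets count UTF-16 code units, which is exactly what the utf16 view of a Swift String exposes, and is how the NSString API in Foundation works. That makes it possible to create string indices from integer UTF-16 offsets.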

let start = String.Index(utf16Offset: range.location, in: source)
let end = String.Index(utf16Offset: range.location + range.length, in: source)
let substring = String(source[start..<end])
// substring is now "NSSet"
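
The difference between character counts and UTF-16 counts only shows up with characters outside the Basic Multilingual Plane. The sketch below uses an emoji to show it, along with the Foundation initializers Range(_:in:) and NSRange(_:in:) that convert between the two kinds of range. The greeting string is invented for illustration.

```swift
import Foundation

let greeting = "👍 NSSet"

// The emoji is a single Character but two UTF-16 code units.
print(greeting.count)        // 7
print(greeting.utf16.count)  // 8

// NSRange offsets count UTF-16 code units, so "NSSet" starts at offset 3.
let nsRange = NSRange(location: 3, length: 5)
if let swiftRange = Range(nsRange, in: greeting) {
    print(greeting[swiftRange])  // NSSet
}

// Going the other way: build an NSRange from a Swift range.
let wholeRange = NSRange(greeting.startIndex..<greeting.endIndex, in: greeting)
print(wholeRange.length)     // 8
```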

Here are a few String utilities, syntactic sugar that gives the regex-related calls in Swift an Objective-C feel.

import Foundation

extension String {
    /// An NSRange that covers the entire string.
    var nsrange: NSRange {
        return NSRange(location: 0, length: utf16.count)
    }

    /// Returns the substring for the given NSRange,
    /// or nil if the range is not valid for this string.
    func substring(with nsrange: NSRange) -> String? {
        guard let range = Range(nsrange, in: self)
            else { return nil }
        return String(self[range])
    }

    /// Returns the Range<String.Index> equivalent to the given NSRange,
    /// or nil if the range is not valid for this string.
    func range(from nsrange: NSRange) -> Range<String.Index>? {
        guard let range = Range(nsrange, in: self)
            else { return nil }
        return range
    }
}

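The NSRegularExpression examples that follow use the utility methods above.

NSRegularExpression and NSTextCheckingResult

So far we have covered finding the first match in a string and replacing matches. More complex cases call for NSRegularExpression. Let's start with a simple text-formatting pattern, miniPattern, that finds bold and italic text.

Creating an NSRegularExpression object takes a pattern string plus a set of options. miniPattern looks for an asterisk * or an underscore _ that opens a formatted word. After finding one, it matches one or more characters of formatted text, and the match ends at the next occurrence of the same opening character. Both the opening character and the text are captured in the result.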

let miniPattern = "([*_])(.+?)\\1"
let miniFormatter = try! NSRegularExpression(pattern: miniPattern, options: .dotMatchesLineSeparators)
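// try! turns a thrown error into a runtime crash, so miniPattern must be valid.

If the pattern is invalid, the NSRegularExpression initializer throws an error. Once the NSRegularExpression object exists, it can be reused with any number of strings.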
{

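Note:

The regular expression "([*_])(.+?)\1" has three parts:

Part 1, ([*_]) , matches any one of the characters in the square brackets, that is, * or _ ;

Part 2, (.+?) , matches any string of one or more characters, as few as possible;

Part 3, \1 , is a backreference: it matches the same text that the first group captured.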

}
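
The effect of the \1 backreference is easiest to see on input where the opening and closing markers differ. This is a small sketch using the same pattern. The helper name matchCount(in:) is hypothetical, introduced only for this illustration.

```swift
import Foundation

let pairRegex = try! NSRegularExpression(pattern: "([*_])(.+?)\\1")

// Hypothetical helper that counts matches over the whole string.
func matchCount(in text: String) -> Int {
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return pairRegex.numberOfMatches(in: text, range: whole)
}

print(matchCount(in: "*bold*"))      // 1: closes with the same character
print(matchCount(in: "*mixed_"))     // 0: "_" does not repeat the opening "*"
print(matchCount(in: "_a_ and *b*")) // 2
```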

let text = "MiniFormatter handles *bold* and _italic_ text."
let matches = miniFormatter.matches(in: text, options: [], range: text.nsrange )
// matches.count == 2

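Calling matches(in:options:range:) returns an array of NSTextCheckingResult elements. Several text-processing classes use the NSTextCheckingResult type, for example NSDataDetector and NSSpellChecker. The returned array holds one NSTextCheckingResult per match.

Usually what you want is the matched range, which is in each result's range property. Often you also want the ranges of any capture groups in the regular expression. The numberOfRanges property and the range(at:) method give access to a specific one.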

Quoting Apple's documentation (https://developer.apple.com/d...):

range(at:)
Returns the result type that the range represents.

Discussion
A result must have at least one range, but may optionally have more (for
example, to represent regular expression capture groups). Passing
range(at:) the value 0 always returns the value of the range property.
Additional ranges, if any, will have indexes from 1 to numberOfRanges-1.

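Range 0 is the range of the entire match, and it is always present.

Ranges 1 through (numberOfRanges - 1) are the capture groups, one per parenthesized part of the pattern.

With the NSRange substring method defined earlier, these ranges can be used to pull out the matched text.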

for match in matches {
    let stringToFormat = text.substring(with: match.range(at: 2) )!
    switch text.substring(with: match.range(at: 1)  )! {
    case "*" :
            print("Make bold: '\(stringToFormat)'")
    case "_":
            print("Make italic: '\(stringToFormat)'")
    default: break
    }
}
// Prints:
// Make bold: 'bold'
// Make italic: 'italic'
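
To see how range 0 and the capture-group ranges line up, the following sketch walks every index from 0 to numberOfRanges - 1 for each match and collects the text. It avoids the String extension above and converts ranges with Range(_:in:) directly. The variable names are illustrative.

```swift
import Foundation

let sampleText = "MiniFormatter handles *bold* and _italic_ text."
let regex = try! NSRegularExpression(pattern: "([*_])(.+?)\\1")
let whole = NSRange(sampleText.startIndex..<sampleText.endIndex, in: sampleText)

var captured: [[String]] = []
for match in regex.matches(in: sampleText, range: whole) {
    var groups: [String] = []
    for index in 0..<match.numberOfRanges {
        let nsRange = match.range(at: index)
        // Ranges that cannot be converted (for example a group that did not
        // take part in the match) are skipped rather than force-unwrapped.
        if let range = Range(nsRange, in: sampleText) {
            groups.append(String(sampleText[range]))
        }
    }
    captured.append(groups)
}
print(captured)
// [["*bold*", "*", "bold"], ["_italic_", "_", "italic"]]
```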

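For basic replacement, go straight to stringByReplacingMatches(in:options:range:withTemplate:) , a more capable version of String.replacingOccurrences(of:with:options:) . In the example above, the different matches (bold, italic) need different replacement templates.

Iterate over the matches in reverse order, so that each replacement leaves the ranges of the matches still to be processed untouched.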

var formattedText = text
Format:
for match in matches.reversed() {
    let template: String
    switch text.substring(with: match.range(at: 1)) ?? "" {
    case "*":
        template = "<strong>$2</strong>"
    case "_":
        template = "<em>$2</em>"
    default:
        break Format
    }
    let matchRange = formattedText.range(from: match.range)!  // see above
    let replacement = miniFormatter.replacementString(for: match, in: formattedText, offset: 0, template: template)
    formattedText.replaceSubrange(matchRange, with: replacement)
}
// 'formattedText' is now:
// "MiniFormatter handles <strong>bold</strong> and <em>italic</em> text."

Calling miniFormatter.replacementString(for:in:...) with a custom template produces a separate replacement string for each NSTextCheckingResult instance.
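
When every match can share one template, the per-match loop is unnecessary. As a minimal sketch of that simpler case, stringByReplacingMatches(in:options:range:withTemplate:) applies a single template to all matches at once. Here it strips the markers and keeps only the captured text. The variable names are illustrative.

```swift
import Foundation

let note = "MiniFormatter handles *bold* and _italic_ text."
let markerRegex = try! NSRegularExpression(pattern: "([*_])(.+?)\\1")
let fullRange = NSRange(note.startIndex..<note.endIndex, in: note)

// One template for every match: keep only the text between the markers.
let plain = markerRegex.stringByReplacingMatches(
    in: note, options: [], range: fullRange, withTemplate: "$2")
print(plain)
// MiniFormatter handles bold and italic text.
```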

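Expression and Matching Options

NSRegularExpression is highly configurable. Different combinations of options can be passed both when creating an instance and when calling the methods that perform matching.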

NSRegularExpression.Options

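.caseInsensitive: Turns on case-insensitive matching. Equivalent to the i flag.
.allowCommentsAndWhitespace: Ignores whitespace, as well as any comment between a # and the end of the line, so a pattern can be formatted and documented to make it easier to read. Equivalent to the x flag.
.ignoreMetacharacters: The opposite of the .regularExpression option in String.range(of:options:) . It effectively turns the regex into a plain text search, ignoring all regex metacharacters and operators.
.dotMatchesLineSeparators: Allows the . metacharacter to match line breaks as well as other characters. Equivalent to the s flag.
.anchorsMatchLines: Allows the ^ (start) and $ (end) metacharacters to match at the beginning and end of each line, not only at the beginning and end of the entire input. Equivalent to the m flag.
.useUnixLineSeparators, .useUnicodeWordBoundaries: These last two options opt into more specific handling of line and word boundaries: Unix line separators and Unicode word boundaries.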
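
As a rough illustration of two of these options working together, the sketch below documents a pattern inline with .allowCommentsAndWhitespace and uses .anchorsMatchLines so that ^ and $ apply to each line. The key = value input format and all variable names are assumptions made up for this example.

```swift
import Foundation

// .allowCommentsAndWhitespace lets the pattern carry its own documentation.
let commentedPattern = """
    ^ (\\w+)   # group 1: key at the start of a line
    \\s* = \\s*  # equals sign with optional spaces
    (.+) $     # group 2: value up to the end of the line
    """
let configRegex = try! NSRegularExpression(
    pattern: commentedPattern,
    options: [.allowCommentsAndWhitespace, .anchorsMatchLines])

let config = "name = NSHipster\ntopic = regex"
let configRange = NSRange(config.startIndex..<config.endIndex, in: config)
let keys = configRegex.matches(in: config, range: configRange).compactMap { match -> String? in
    Range(match.range(at: 1), in: config).map { String(config[$0]) }
}
print(keys)
// ["name", "topic"]
```

Because whitespace in the pattern is ignored in this mode, literal spaces have to be written as \s (or escaped) when they are meant to match.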

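NSRegularExpression.MatchingOptions

When matching with an NSRegularExpression instance, options can be passed to adjust how the match is performed.

.anchored: Matches only at the start of the search range.
.withTransparentBounds: Allows the regex to look past the bounds of the search range for lookahead, lookbehind, and word boundaries (though not for the characters actually matched).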

static var withTransparentBounds: NSRegularExpression.MatchingOptions
Specifies that matching may examine parts of the string beyond the
bounds of the search range, for purposes such as word boundary
detection, lookahead, etc. This constant has no effect if the search
range contains the entire string.

See enumerateMatches(in:options:range:using:) for a description of the
constant in context.
Apple documentation link:
https://developer.apple.com/d...

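.withoutAnchoringBounds: Makes the ^ and $ metacharacters match only the start and end of the string, not the start and end of the search range.
.reportCompletion, .reportProgress: These options only have an effect with the partial matching method covered in the next section. When the search completes, or as progress is made on a long-running match, the respective option tells NSRegularExpression to call the enumeration block additional times.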
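
A short sketch of .anchored may help, since its effect depends on the search range rather than on the pattern. The same regex is run three times: unanchored, anchored over the whole string, and anchored over a narrower range. The input string and names are illustrative.

```swift
import Foundation

let digits = try! NSRegularExpression(pattern: "\\d+")
let line = "abc 123"
let lineRange = NSRange(line.startIndex..<line.endIndex, in: line)

// Without options, the search scans forward and finds "123".
let anywhere = digits.firstMatch(in: line, options: [], range: lineRange)
print(anywhere != nil)   // true

// With .anchored, a match must begin at the start of the search range.
let atStart = digits.firstMatch(in: line, options: .anchored, range: lineRange)
print(atStart != nil)    // false

// Narrowing the range to start at the digits lets the anchored match succeed.
let tailRange = NSRange(location: 4, length: 3)
let inTail = digits.firstMatch(in: line, options: .anchored, range: tailRange)
print(inTail?.range.location ?? NSNotFound)  // 4
```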

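Partial Matching

Finally, one of the most powerful features of NSRegularExpression is the ability to scan only as much of a string as is needed. This is useful for long texts, and for patterns that are expensive to run.

Instead of firstMatch(in:...) or matches(in:...) , call enumerateMatches(in:options:range:using:) and handle each match in a closure.

func enumerateMatches(in string: String, options: NSRegularExpression.MatchingOptions = [], range: NSRange, using block: (NSTextCheckingResult?, NSRegularExpression.MatchingFlags, UnsafeMutablePointer<ObjCBool>) -> Void)
Apple documentation link: https://developer.apple.com/d...
The closure receives three parameters: the match result, a set of flags, and a pointer to a Boolean. The Boolean pointer is an out-only parameter. Setting it stops the processing at whatever point you choose.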

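This method can be used to find the first several names in Dostoevsky's The Brothers Karamazov, where names follow the pattern of a first name followed by a patronymic (for example, "Ivan Fyodorovitch").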

let nameRegex = try! NSRegularExpression(pattern: "([A-Z]\\S+)\\s+([A-Z]\\S+(vitch|vna))")
let bookString = ...
var names: Set<String> = []
nameRegex.enumerateMatches(in: bookString, range: bookString.nsrange) {
    (result, _, stopPointer) in
    guard let result = result else { return }
    let name = nameRegex.replacementString(for: result, in: bookString, offset: 0, template: "$1 $2")
    names.insert(name)
    // Stop once we've found six unique names; the Set guarantees they are distinct
    stopPointer.pointee = ObjCBool(names.count == 6)
}
// names.sorted():
// ["Adelaïda Ivanovna", "Alexey Fyodorovitch", "Dmitri Fyodorovitch",
// "Fyodor Pavlovitch", "Pyotr Alexandrovitch", "Sofya Ivanovna"]

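This way, only the first 45 matches have to be found, instead of going through all of the nearly 1,300 names in the book. That is a significant performance gain.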
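
Since the book text is not included here, the following self-contained sketch shows the same early-stop technique on a short made-up string. It counts how often the block runs to show that enumeration really ends once the stop flag is set. The pattern, the passage, and the variable names are all assumptions for illustration.

```swift
import Foundation

let wordRegex = try! NSRegularExpression(pattern: "[A-Z][a-z]+")
let passage = "Alpha beta Gamma delta Epsilon zeta Eta theta Iota"
let passageRange = NSRange(passage.startIndex..<passage.endIndex, in: passage)

var firstThree: [String] = []
var blockCalls = 0
wordRegex.enumerateMatches(in: passage, options: [], range: passageRange) { result, _, stop in
    blockCalls += 1
    guard let result = result, let range = Range(result.range, in: passage) else { return }
    firstThree.append(String(passage[range]))
    // Setting the out-parameter ends the enumeration early.
    if firstThree.count == 3 { stop.pointee = true }
}
print(firstThree)  // ["Alpha", "Gamma", "Epsilon"]
print(blockCalls)  // 3
```

The passage contains five capitalized words, but the block runs only three times.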

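Once you get to know it, NSRegularExpression can be extremely useful. Beyond NSRegularExpression there is also NSDataDetector, a class for recognizing useful information: it can process user-facing text to find dates, addresses, and phone numbers. For text processing with the Foundation framework, NSRegularExpression is powerful and robust, with a concise interface and also depth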