Dissecting the split() Function (Part 1)

Date: 2019-11-15


Origin

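There is a reason I suddenly started digging into the split() function. Last night a very capable senior student posted the following question in our lab's group chat: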

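Suppose there is an array array = {"AB", "12"} and a string string = "abcAB0123". Find a function f(String s)
such that f(string) yields {"abc", "AB", "0", "12", "3"}. In other words, string is split at the elements of array, and those elements are kept in the result.
PS: string.split("AB|12") returns {"abc", "0", "3"}, which does not satisfy the requirement.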
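
The behaviour mentioned in the PS is easy to reproduce. A minimal sketch (the class name SplitDropsDelimiters is made up for this illustration):

```java
import java.util.Arrays;

class SplitDropsDelimiters {
    public static void main(String[] args) {
        String string = "abcAB0123";
        // "AB|12" is a regex alternation: split at every "AB" or "12".
        String[] parts = string.split("AB|12");
        // The matched delimiters themselves are discarded.
        System.out.println(Arrays.toString(parts)); // prints [abc, 0, 3]
    }
}
```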

Since string.split() cannot do this directly, let's look at how split() is implemented in the source code and then adapt it.
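
For comparison, one possible way to keep the delimiters without touching the split() source is to walk over the matches with java.util.regex.Matcher. This is only a sketch and not necessarily the adaptation this series arrives at. The class name KeepDelimiters, the method name splitKeepingDelimiters, and its regex parameter are assumptions made for the example, and the regex is assumed never to match the empty string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepDelimiters {
    // Hypothetical helper: splits s around matches of regex and keeps the matches.
    // Assumes regex never matches the empty string.
    static List<String> splitKeepingDelimiters(String s, String regex) {
        List<String> parts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(s);
        int off = 0; // start of the text not yet consumed
        while (m.find()) {
            if (m.start() > off) {
                parts.add(s.substring(off, m.start())); // text before the delimiter
            }
            parts.add(m.group()); // the delimiter itself
            off = m.end();
        }
        if (off < s.length()) {
            parts.add(s.substring(off)); // remaining tail
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingDelimiters("abcAB0123", "AB|12"));
        // prints [abc, AB, 0, 12, 3]
    }
}
```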

Tracing the source code

First comes String.split(String regex), which is implemented like this:

public String[] split(String regex) {
    return split(regex, 0);
}

Following the call, let's see how String.split(String regex, int limit) is implemented internally:

/**
 * @param regex
 *        the delimiting regular expression
 *        (regex may be any regular expression)
 * @param limit
 *        the result threshold, as described above
 *        (limits the number of pieces the string is split into)
 * @return the array of strings computed by splitting this string
 *         around matches of the given regular expression
 *
 * @throws PatternSyntaxException
 *         if the regular expression's syntax is invalid
 */
public String[] split(String regex, int limit) {
    /* fastpath if the regex is a
       (1)one-char String and this character is not one of the
          RegEx's meta characters ".$|()[{^?*+\\", or
       (2)two-char String and the first char is the backslash and
          the second is not the ascii digit or ascii letter.
     */
    char ch = 0;
    if (((regex.value.length == 1 &&
         ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
         (regex.length() == 2 &&
          regex.charAt(0) == '\\' &&
          (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
          ((ch-'a')|('z'-ch)) < 0 &&
          ((ch-'A')|('Z'-ch)) < 0)) &&
        (ch < Character.MIN_HIGH_SURROGATE ||
         ch > Character.MAX_LOW_SURROGATE))
    {
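        int off = 0;  // offset: start of the current segment
        int next = 0; // index of the next occurrence of the delimiter
        boolean limited = limit > 0;  // whether a limit applies; limit <= 0 means no limit
        ArrayList<String> list = new ArrayList<>(); // holds the pieces produced by the split
        while ((next = indexOf(ch, off)) != -1) { // the delimiter still occurs at or after off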
            if (!limited || list.size() < limit - 1) {
                list.add(substring(off, next));
                off = next + 1;
            } else {    // last one
                //assert (list.size() == limit - 1);
                list.add(substring(off, value.length));
                off = value.length;
                break;
            }
        }
        // If no match was found, return this
        if (off == 0)
            return new String[]{this};

        // Add remaining segment
        if (!limited || list.size() < limit)
            list.add(substring(off, value.length));

        // Construct result
        int resultSize = list.size();
        if (limit == 0) {
            while (resultSize > 0 && list.get(resultSize - 1).length() == 0) {
                resultSize--;
            }
        }
        String[] result = new String[resultSize];
        return list.subList(0, resultSize).toArray(result);
    }
    return Pattern.compile(regex).split(this, limit);
}

First, look at the comment at the top of the method body:

fastpath if the regex is a
(1)one-char String and this character is not one of the
RegEx's meta characters ".$|()[{^?*+\", or
(2)two-char String and the first char is the backslash and
the second is not the ascii digit or ascii letter.

In other words:

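When regex is a one-character string and that character is not one of the regular expression metacharacters, or
when regex is a two-character string whose first character is the backslash '\' and whose second character is neither a digit nor a letter (in effect this is also a single character: an escaped literal),

the method takes the fast path: it splits on that single character directly with indexOf and substring, instead of compiling the regex into a Pattern.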
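
A few concrete calls may make the two cases easier to picture. The sketch below only shows the observable results. Which branch runs is inferred from the condition in the comment, and the class name MetaVersusLiteral is made up for the example.

```java
import java.util.Arrays;

class MetaVersusLiteral {
    public static void main(String[] args) {
        // "," is one ordinary character, so it matches case (1).
        System.out.println(Arrays.toString("a,b".split(",")));   // prints [a, b]

        // "\\." is the two-char string \. and matches case (2): a literal dot.
        System.out.println(Arrays.toString("a.b".split("\\."))); // prints [a, b]

        // "." alone is a metacharacter that matches any character, so every
        // character acts as a delimiter. Only empty pieces remain, and
        // split(regex) drops trailing empty strings.
        System.out.println("a.b".split(".").length);             // prints 0
    }
}
```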

In the source, the corresponding condition is:

if (((regex.value.length == 1 &&
     ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
     (regex.length() == 2 &&
      regex.charAt(0) == '\\' &&
      (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
      ((ch-'a')|('z'-ch)) < 0 &&
      ((ch-'A')|('Z'-ch)) < 0)) &&
    (ch < Character.MIN_HIGH_SURROGATE ||
     ch > Character.MAX_LOW_SURROGATE))

The if statement above checks whether the fast-path conditions hold:

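(regex.value.length == 1 && ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1): the length is 1 and the character is not one of the regex metacharacters . $ | ( ) [ { ^ ? * + \
(regex.length() == 2 && regex.charAt(0) == '\\' && (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 && ((ch-'a')|('z'-ch)) < 0 && ((ch-'A')|('Z'-ch)) < 0): the length is 2, the first character is '\', and the second character is neither a letter nor a digit. The check ((ch-'A')|('Z'-ch)) < 0 is a neat trick: the bitwise OR is negative exactly when at least one of the two differences is negative, that is, when ch lies outside the range 'A' to 'Z'.
(ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE): this must hold in both cases. ch must not be a UTF-16 surrogate code unit, so that it stands for a complete character on its own.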
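
The three conditions can be restated in a more readable form. The following is a sketch that uses only the public String API (regex.length() rather than the internal regex.value.length). The names FastPathCheck, qualifiesForFastPath, isAsciiLetterOrDigit, and outsideRange are made up for the example and do not exist in the JDK.

```java
class FastPathCheck {
    // Readable restatement of the fast-path condition (hypothetical helper).
    static boolean qualifiesForFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            ch = regex.charAt(0);
            if (".$|()[{^?*+\\".indexOf(ch) != -1) {
                return false; // a bare metacharacter needs the regex engine
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            ch = regex.charAt(1);
            if (isAsciiLetterOrDigit(ch)) {
                return false; // e.g. \d or \w is not an escaped literal
            }
        } else {
            return false;
        }
        // ch must not be a surrogate code unit.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }

    static boolean isAsciiLetterOrDigit(char ch) {
        return (ch >= '0' && ch <= '9')
            || (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z');
    }

    // The OR trick from the source: true iff ch is outside [lo, hi].
    static boolean outsideRange(char ch, char lo, char hi) {
        return ((ch - lo) | (hi - ch)) < 0;
    }

    public static void main(String[] args) {
        System.out.println(qualifiesForFastPath(","));     // prints true
        System.out.println(qualifiesForFastPath("|"));     // prints false
        System.out.println(qualifiesForFastPath("\\|"));   // prints true
        System.out.println(qualifiesForFastPath("\\d"));   // prints false
        System.out.println(qualifiesForFastPath("AB|12")); // prints false
        System.out.println(outsideRange('Q', 'A', 'Z'));   // prints false
        System.out.println(outsideRange('a', 'A', 'Z'));   // prints true
    }
}
```

Note that the regex from the original question, "AB|12", does not qualify, so it goes through Pattern.compile(regex).split(this, limit).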

Reading on, a few variables are defined:

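int off = 0;  // offset: start of the current segment
int next = 0; // index of the next occurrence of the target character
boolean limited = limit > 0;  // whether a limit applies
ArrayList<String> list = new ArrayList<>(); // holds the pieces produced by the split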

Then comes a while loop: while ((next = indexOf(ch, off)) != -1)

It uses the method indexOf(int ch, int fromIndex), whose Javadoc says:

Returns the index within this string of the first occurrence of the specified character, starting the search at the specified index.

In other words, it returns the index of the first occurrence of the target character at or after position fromIndex, or -1 if there is none.
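
A small sketch of how indexOf(int ch, int fromIndex) behaves on a sample string (the class name IndexOfDemo and the sample value are made up for the example):

```java
class IndexOfDemo {
    public static void main(String[] args) {
        String s = "a,b,c"; // commas sit at index 1 and index 3
        System.out.println(s.indexOf(',', 0)); // prints 1
        System.out.println(s.indexOf(',', 1)); // prints 1, the search includes fromIndex itself
        System.out.println(s.indexOf(',', 2)); // prints 3
        System.out.println(s.indexOf(',', 4)); // prints -1, no comma left
    }
}
```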

So this while loop keeps running as long as the target character still occurs in the remainder of the string.

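The rest is straightforward. Using off and the next position of the target character, substring cuts out a piece, which is added to list. If limit is positive and list already holds limit - 1 pieces, the whole remainder of the string is added as the last piece and the loop ends. Note that off is adjusted on every iteration: unless it is the last iteration, off = next + 1.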
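
The effect of a positive limit can be seen directly through the public API. A minimal sketch (the class name LimitDemo and the sample string are made up for the example):

```java
import java.util.Arrays;

class LimitDemo {
    public static void main(String[] args) {
        String s = "a,b,c,d";
        // limit = 2: one cut, then the remainder becomes the last piece.
        System.out.println(Arrays.toString(s.split(",", 2))); // prints [a, b,c,d]
        // limit = 3: two cuts, then the remainder.
        System.out.println(Arrays.toString(s.split(",", 3))); // prints [a, b, c,d]
        // limit = 0: no limit on the number of pieces.
        System.out.println(Arrays.toString(s.split(",", 0))); // prints [a, b, c, d]
    }
}
```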

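After the loop, if off is still 0, nothing matched and the original string is returned as the only element. Otherwise the remaining segment is appended, trailing empty strings are dropped when limit == 0, and list is converted to a String array and returned.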
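
Putting the pieces together, the fast path can be re-created outside the JDK roughly as follows. This is a simplified sketch, not the JDK code itself: it takes the delimiter as a char directly instead of deriving it from a regex, and the names FastSplitSketch and fastSplit are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FastSplitSketch {
    // Simplified re-creation of the fast-path loop for a single delimiter char.
    static String[] fastSplit(String s, char ch, int limit) {
        int off = 0;                 // start of the current segment
        int next;                    // index of the next delimiter
        boolean limited = limit > 0; // limit <= 0 means no limit
        List<String> list = new ArrayList<>();
        while ((next = s.indexOf(ch, off)) != -1) {
            if (!limited || list.size() < limit - 1) {
                list.add(s.substring(off, next));
                off = next + 1;
            } else { // limit reached: the rest becomes the last piece
                list.add(s.substring(off));
                off = s.length();
                break;
            }
        }
        if (off == 0) {
            return new String[]{s}; // no delimiter found
        }
        if (!limited || list.size() < limit) {
            list.add(s.substring(off)); // remaining segment
        }
        int size = list.size();
        if (limit == 0) { // drop trailing empty strings
            while (size > 0 && list.get(size - 1).isEmpty()) {
                size--;
            }
        }
        return list.subList(0, size).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fastSplit("a,b,,", ',', 0)));  // prints [a, b]
        System.out.println(Arrays.toString(fastSplit("a,b,,", ',', -1))); // prints [a, b, , ]
        System.out.println(Arrays.toString(fastSplit("a,b,,", ',', 2)));  // prints [a, b,,]
    }
}
```

The three calls show the three regimes of limit: 0 removes trailing empty strings, a negative value keeps them, and a positive value caps the number of pieces.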