Finding a faster alternative to String.split in Java

Date: 2019-11-10

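String.split is one of the most commonly used string operations in Java. In ordinary business code it causes no problems, but when splitting has to be fast it is worth taking some time to find a better-performing approach.

The regex parameter of String.split is not a plain string but a regular expression, so the input can be split on any pattern. The feature looks attractive, but from a performance point of view it is a killer.

In the Java 6 implementation, every call to String.split creates a new Pattern object, compiling the argument as a regular expression before doing the split. Regex compilation is not cheap, and the implementation does not cache the Pattern, so performance is poor when the method is called frequently. If you really do need to split on a regular expression, cache the Pattern yourself.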

public String[] split(String regex, int limit) {
    return Pattern.compile(regex).split(this, limit);
}
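
As a minimal sketch of caching the Pattern yourself: the class name, method name and separator regex below are illustrative assumptions, not part of the original article. The idea is simply to compile once and reuse the compiled Pattern on every call.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the name and the regex are chosen only for illustration.
final class CachedPatternSplit {
    // Compiled once at class initialization. Pattern instances are immutable
    // and safe to share between threads.
    private static final Pattern COMMA_WITH_SPACES = Pattern.compile("\\s*,\\s*");

    static String[] splitFields(String line) {
        // Pattern.split(input) gives the same result as line.split(regex),
        // but without recompiling the expression on each call.
        return COMMA_WITH_SPACES.split(line);
    }
}
```

Calling CachedPatternSplit.splitFields("a , b,c") yields the three fields a, b and c.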

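Often, though, we do not actually want a regular expression. We just want to split on a simple character such as a space or an underscore, and paying the cost of regex support for that is not worth it.

The Java 7 implementation therefore optimizes the single-character case with a dedicated path: a single-character separator bypasses the regex engine and uses indexOf to locate the split positions directly.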

/* fastpath if the regex is a
    (1)one-char String and this character is not one of the
    RegEx's meta characters ".$|()[{^?*+\\", or
    (2)two-char String and the first char is the backslash and
    the second is not the ascii digit or ascii letter.
*/
char ch = 0;
if (((regex.value.length == 1 &&
        ".$|()[{^?*+\\".indexOf(ch = regex.charAt(0)) == -1) ||
        (regex.length() == 2 &&
        regex.charAt(0) == '\\' &&
        (((ch = regex.charAt(1))-'0')|('9'-ch)) < 0 &&
        ((ch-'a')|('z'-ch)) < 0 &&
        ((ch-'A')|('Z'-ch)) < 0)) &&
    (ch < Character.MIN_HIGH_SURROGATE ||
        ch > Character.MAX_LOW_SURROGATE))
{
    int off = 0;
    int next = 0;
    boolean limited = limit > 0;
    ArrayList<String> list = new ArrayList<>();
    while ((next = indexOf(ch, off)) != -1) {
        if (!limited || list.size() < limit - 1) {
            list.add(substring(off, next));
            off = next + 1;
        } else {    // last one
            //assert (list.size() == limit - 1);
            list.add(substring(off, value.length));
            off = value.length;
            break;
        }
    }
    // If no match was found, return this
    if (off == 0)
        return new String[]{this};

    // Add remaining segment
    if (!limited || list.size() < limit)
        list.add(substring(off, value.length));

    // Construct result
    int resultSize = list.size();
    if (limit == 0)
        while (resultSize > 0 && list.get(resultSize - 1).length() == 0)
            resultSize--;
    String[] result = new String[resultSize];
    return list.subList(0, resultSize).toArray(result);
}
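
To make the fast-path condition easier to read, here is a standalone restatement of it as a predicate. This is a sketch that mirrors only the Java 7 check quoted above; the class and method names are hypothetical, and other JDK versions may use a different condition.

```java
// Hypothetical helper that restates the Java 7 fast-path test in plain form.
final class SplitFastPathCheck {
    private static final String REGEX_META = ".$|()[{^?*+\\";

    static boolean usesFastPath(String regex) {
        char ch;
        if (regex.length() == 1) {
            // Case (1): one character that is not a regex metacharacter.
            ch = regex.charAt(0);
            if (REGEX_META.indexOf(ch) != -1) {
                return false;
            }
        } else if (regex.length() == 2 && regex.charAt(0) == '\\') {
            // Case (2): a backslash followed by something that is neither
            // an ASCII digit nor an ASCII letter (for example "\\|").
            ch = regex.charAt(1);
            boolean asciiAlnum = (ch >= '0' && ch <= '9')
                    || (ch >= 'a' && ch <= 'z')
                    || (ch >= 'A' && ch <= 'Z');
            if (asciiAlnum) {
                return false;
            }
        } else {
            return false;
        }
        // Surrogate characters are always left to the regex engine.
        return ch < Character.MIN_HIGH_SURROGATE || ch > Character.MAX_LOW_SURROGATE;
    }
}
```

Under this check "," and "\\|" qualify for the fast path, while "|", "\\d" and "::" fall back to Pattern.compile.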

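Is there anything faster? If the separator is not a single character and you do not need regex matching, split still goes through the regex engine, just as in Java 6. There are other options: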

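Use StringTokenizer. It has no regex support and simply returns the substrings between delimiters one at a time; by default it splits on whitespace (space, tab, newline, carriage return, form feed). It performs slightly better than String.split, but it is a legacy JDK class and its Javadoc discourages use in new code.
Use org.apache.commons.lang3.StringUtils.split, which offers a better implementation for cases that do not need regex splitting. Its separatorChars argument is a String, but it is treated as a set of separator characters; to split on a whole multi-character string, StringUtils.splitByWholeSeparator is the matching method.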
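
The two behaviours can be illustrated with JDK classes only. The sketch below is an assumption-laden illustration, not the Commons Lang implementation: tokenize wraps StringTokenizer, and splitByLiteral is a hypothetical indexOf-based helper for a multi-character literal separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helpers; names are illustrative only.
final class LiteralSplitSketch {
    // Splits on any single character contained in delims.
    // StringTokenizer skips empty tokens between adjacent delimiters.
    static List<String> tokenize(String text, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Splits on the whole separator string without any regex,
    // keeping empty tokens (including a trailing one).
    static List<String> splitByLiteral(String text, String separator) {
        if (separator.isEmpty()) {
            throw new IllegalArgumentException("separator must not be empty");
        }
        List<String> out = new ArrayList<>();
        int off = 0;
        int next;
        while ((next = text.indexOf(separator, off)) != -1) {
            out.add(text.substring(off, next));
            off = next + separator.length();
        }
        out.add(text.substring(off)); // remaining segment
        return out;
    }
}
```

Note the difference in empty-token handling: tokenize("a_b__c", "_") returns three tokens, while splitByLiteral("a::b::::c", "::") returns four, one of them empty.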

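Can it be faster still? Note that String.split and StringUtils.split both return String[]. An array has a fixed size, yet the number of substrings cannot be known before the split is done, so something must be hidden behind that array. Let's look at the implementation.

In the single-character path of String.split, the substrings are in fact collected in an ArrayList, with no clever tricks. Only on the last line of the path is the ArrayList converted to an array, and toArray copies the elements into that array.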

return list.subList(0, resultSize).toArray(result);

StringUtils.split does the same.

return list.toArray(new String[list.size()]);

So one optimization is to copy the implementation and change the method's return type to List, which saves the memory cost of the array copy.
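
A minimal sketch of that idea for a single-character separator follows. It is a simplified stand-in rather than a copy of the JDK or Commons Lang code: the class and method names are hypothetical, there is no limit parameter, and unlike String.split it keeps trailing empty strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; a simplified single-character split that returns the list directly.
final class ListSplitSketch {
    static List<String> splitToList(String text, char sep) {
        List<String> out = new ArrayList<>();
        int off = 0;
        int next;
        while ((next = text.indexOf(sep, off)) != -1) {
            out.add(text.substring(off, next));
            off = next + 1;
        }
        // Remaining segment; the list is returned as-is, with no toArray copy.
        out.add(text.substring(off));
        return out;
    }
}
```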

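Can it be faster still? Very often we only iterate over the pieces and do something with each one; we do not really need the array. It is the same idea as reading a file: there is no need to load the whole file into memory when it can be processed one line at a time. So a further optimization is to add a callback parameter that handles each substring and call it at the places where a substring used to be stored, which avoids creating the array altogether. In other words, treat string splitting as a stream.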

// Requires: import java.util.function.Consumer;
private static void splitWorker(final String str, final String separatorChars, final int max,
        final boolean preserveAllTokens, Consumer<String> onSplit) {
    if (str == null) {
        return;
    }
    final int len = str.length();
    if (len == 0) {
        return;
    }
    int sizePlus1 = 1;
    int i = 0, start = 0;
    boolean match = false;
    boolean lastMatch = false;
    if (separatorChars == null) {
        // Null separator means use whitespace
        while (i < len) {
            if (Character.isWhitespace(str.charAt(i))) {
                if (match || preserveAllTokens) {
                    lastMatch = true;
                    if (sizePlus1++ == max) {
                        i = len;
                        lastMatch = false;
                    }
                    onSplit.accept(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
    } else if (separatorChars.length() == 1) {
        // Optimise 1 character case
        final char sep = separatorChars.charAt(0);
        while (i < len) {
            if (str.charAt(i) == sep) {
                if (match || preserveAllTokens) {
                    lastMatch = true;
                    if (sizePlus1++ == max) {
                        i = len;
                        lastMatch = false;
                    }
                    onSplit.accept(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
    } else {
        // standard case
        while (i < len) {
            if (separatorChars.indexOf(str.charAt(i)) >= 0) {
                if (match || preserveAllTokens) {
                    lastMatch = true;
                    if (sizePlus1++ == max) {
                        i = len;
                        lastMatch = false;
                    }
                    onSplit.accept(str.substring(start, i));
                    match = false;
                }
                start = ++i;
                continue;
            }
            lastMatch = false;
            match = true;
            i++;
        }
    }
    if (match || preserveAllTokens && lastMatch) {
        onSplit.accept(str.substring(start, i));
    }
}

public static void split(final String str, final String separatorChars, Consumer<String> onSplit) {
    splitWorker(str, separatorChars, -1, false, onSplit);
}

// Usage
public void example() {
    split("Hello world", " ", System.out::println);
}

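Can it be faster still? There is a more extreme optimization. Taking a substring (the substring method) actually copies the characters, so the callback can instead receive the start and end offsets of the piece within the original string and read from that range directly. This is not very general, so the code is not shown here; try it if you are interested.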
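
For readers who want a starting point, one possible shape of such a range-based callback is sketched here. Everything in it is an assumption made for illustration: the RangeConsumer interface, the method names, the single-character separator, and the example consumer that sums fields made up only of ASCII digits.

```java
// Hypothetical sketch; none of these names come from the JDK or Commons Lang.
final class RangeSplitSketch {
    @FunctionalInterface
    interface RangeConsumer {
        // start is inclusive, end is exclusive, both relative to the original string.
        void accept(int start, int end);
    }

    // Reports each piece as a [start, end) range instead of creating a substring.
    static void splitRanges(String text, char sep, RangeConsumer onRange) {
        int off = 0;
        int next;
        while ((next = text.indexOf(sep, off)) != -1) {
            onRange.accept(off, next);
            off = next + 1;
        }
        onRange.accept(off, text.length()); // remaining segment
    }

    // Example consumer: sums comma-separated fields, reading the digits in place.
    // Assumes every field contains only ASCII digits; an empty field counts as 0.
    static long sumUnsignedInts(String csv) {
        long[] total = {0};
        splitRanges(csv, ',', (start, end) -> {
            long value = 0;
            for (int i = start; i < end; i++) {
                value = value * 10 + (csv.charAt(i) - '0');
            }
            total[0] += value;
        });
        return total[0];
    }
}
```

The trade-off is the one noted above: the consumer has to work with offsets into the original string, which is awkward whenever the piece must be passed to an API that expects a String.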

Can it be faster...