Thoughts on a Text-Processing Exercise

Date: 2019-11-14



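I came across this question from a user online:

Using Java to work with a txt file: I want to change the first number on every line of the file, adding 1 to it.

For example, the txt file contains: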

1 1 5

2 2 10

3 3 15

and it should become:

2 1 5

3 2 10

4 3 15

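My first reaction was that this probably called for a regular expression, and that in Java replaceAll("regex", "replacement") would more or less do the job. There was a catch, though. The match is easy to write: reg = "^\\d+" matches the number at the start of a line. But what should the replacement be? Each number has to be incremented by 1, so how is that done? A capturing group can pick out the data we need, but once captured, the value cannot be processed any further inside the replacement string. With that road closed, I reluctantly turned to the other option: processing the file line by line.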
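The limitation described above applies to the fixed replacement string of String.replaceAll. As an aside that the original post does not explore, java.util.regex.Matcher also provides appendReplacement and appendTail, which let the replacement be computed in code for each match. The following is only a minimal sketch under assumptions: the whole text is held in memory as one string, and the class and method names (FirstNumberIncrementer, incrementFirstNumbers) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstNumberIncrementer {
    // (?m) turns on MULTILINE mode so that ^ matches at the start of every line,
    // not only at the start of the whole input.
    private static final Pattern FIRST_NUMBER = Pattern.compile("(?m)^\\d+");

    /** Returns the text with the leading integer of each line increased by one. */
    static String incrementFirstNumbers(String text) {
        Matcher matcher = FIRST_NUMBER.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            long incremented = Long.parseLong(matcher.group()) + 1;
            // The replacement is computed per match; a string of digits needs no escaping.
            matcher.appendReplacement(result, Long.toString(incremented));
        }
        // Copy whatever follows the last match.
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(incrementFirstNumbers("1 1 5\n2 2 10\n3 3 15"));
    }
}
```

Lines that do not start with a digit are left untouched, because the pattern simply does not match there.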

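Scan the text to be processed; for each line read, match against it, process the line once the data is found, and then write the line to a new file. When the whole text has been scanned, the data has been processed. Is this approach clumsy? Perhaps, but for now I had no better one (Excel might handle it more neatly, but the requirement was to do it in code), so I started on the implementation: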

// code version 1.0

When I started writing the code, I noticed that the fields are separated by runs of spaces or tabs, so for convenience I used the split function.

// Java: parse the text and add 1 to the first number on each line
// (the method is meant to sit inside a class; these imports go at the top of the file)
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public static void writeFile() {
    BufferedReader reader = null;
    BufferedWriter writer = null;
    try {
        File file = new File("new.txt");
        if (!file.exists()) {
            file.createNewFile();
        }
        StringBuffer sb = new StringBuffer();
        reader = new BufferedReader(new FileReader("test.txt"));
        String line = null;
        // Read line by line
        while ((line = reader.readLine()) != null) {
            String[] arr = line.split("[ \t]++");
            if (arr.length < 3) {
                // Fewer than three fields: copy the line through unchanged
                sb.append(line).append("\r\n");
                continue;
            }
            // Take the first number and add 1
            int num = Integer.valueOf(arr[0]);
            num++;
            sb.append(num).append("\t").append(arr[1]).append("\t").append(arr[2]).append("\r\n");
        }
        // Write to the new file
        writer = new BufferedWriter(new FileWriter(file));
        writer.write(sb.toString());
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Close both streams, each in its own guarded block
        if (reader != null) {
            try {
                reader.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        if (writer != null) {
            try {
                writer.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Writing the code went smoothly, but one question came up once a line had been processed:

int num = Integer.valueOf(arr[0]);

num ++;

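How do I write the new content back into the current line, that is, into the file being read? The data is read line by line through a reader, and writing it back requires a writer; writing to a file with a writer while that same file is still being read is not easy. To keep things simple, I decided to create a new file, write the new data there, and delete the old file once processing is finished.

After the code was done, I discussed it with a friend, who said that a regular expression would do the job and that split is somewhat inefficient. OK, so I reimplemented the main processing step: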
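The plan above (write to a new file, then get rid of the old one) can be sketched with java.nio.file. This is only an illustration under assumptions, not the code from the original post: the class name InPlaceRewrite and the helper incrementFirst are hypothetical, the file is assumed to be UTF-8, and the temporary file is created next to the source so that the final move stays within one directory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

class InPlaceRewrite {
    /** Hypothetical helper: adds 1 to the run of ASCII digits at the start of a line, if any. */
    static String incrementFirst(String line) {
        int end = 0;
        while (end < line.length() && line.charAt(end) >= '0' && line.charAt(end) <= '9') {
            end++;
        }
        if (end == 0) {
            return line; // no leading number: keep the line as it is
        }
        return (Long.parseLong(line.substring(0, end)) + 1) + line.substring(end);
    }

    /** Streams source into a temporary file, then replaces source with it. */
    static void rewrite(Path source) throws IOException {
        Path temp = Files.createTempFile(source.toAbsolutePath().getParent(), "rewrite", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(incrementFirst(line));
                writer.newLine();
            }
        }
        // Both streams are closed here, so the old file can be replaced.
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path demo = Files.createTempFile("demo", ".txt");
        Files.write(demo, Arrays.asList("1 1 5", "2 2 10", "3 3 15"), StandardCharsets.UTF_8);
        rewrite(demo);
        Files.readAllLines(demo, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

Moving the finished file over the original, rather than deleting the original first, means a failure halfway through the rewrite leaves the old data in place.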

// code version 1.1

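// needs java.util.regex.Matcher and java.util.regex.Pattern
Pattern pattern = Pattern.compile("^\\d+");
while (...) {  // the read loop, elided in the original
    Matcher matcher = pattern.matcher(line);
    if (matcher.find()) {
        String str = matcher.group();
        int n = Integer.parseInt(str);
        n++;
        // str contains only digits, so it is safe to use as the regex argument
        line = line.replaceFirst(str, String.valueOf(n));
        sb.append(line).append("\r\n");
    }
}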

Besides replacing split with Pattern, this version also leaves the rest of the line exactly as it was in the input:

line = line.replaceFirst(str, String.valueOf(n));

But why would Pattern be more efficient than split?

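The implementation of split calls Pattern.compile("..."). So when split is used on every line of the text, each call compiles a new Pattern through Pattern.compile("..."). When the Pattern class is used directly, a single Pattern is compiled once at the start, which avoids the repeated compilation and reduces memory overhead.

A more senior developer said that no regular expression is needed at all, and that indexOf and substring would be more efficient. He also suggested using PrintStream instead of BufferedWriter.

Right, time to change the code: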
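To make the point above concrete, here is a small sketch that compiles the separator pattern once and reuses it for every line. The names SplitDemo, SEPARATOR and fields are made up for this example. The comment about String.split recompiling reflects the OpenJDK implementation for a multi-character pattern like this one; other patterns or runtimes may behave differently.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {
    // Compiled once and reused for every line.
    private static final Pattern SEPARATOR = Pattern.compile("[ \t]++");

    /** Splits a line on runs of spaces or tabs using the precompiled pattern. */
    static String[] fields(String line) {
        return SEPARATOR.split(line);
    }

    public static void main(String[] args) {
        String line = "1  1\t5";
        // String.split has to compile "[ \t]++" again on each call (OpenJDK behaviour).
        String[] viaString = line.split("[ \t]++");
        String[] viaPattern = fields(line);
        // Both approaches are expected to yield the same fields.
        System.out.println(Arrays.equals(viaString, viaPattern));
    }
}
```

Whether the difference matters in practice depends on the file size; for a handful of lines it is unlikely to be measurable.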

// code version 1.2

// needs java.io.BufferedReader, File, FileOutputStream, FileReader, IOException, PrintStream
public static void writeFile() throws IOException {
    BufferedReader reader = null;
    PrintStream writer = null;
    File file = new File("new.txt");
    if (!file.exists()) {
        file.createNewFile();
    }
    writer = new PrintStream(new FileOutputStream(file));
    reader = new BufferedReader(new FileReader("test.txt"));
    String line = null;
    // Read line by line
    while ((line = reader.readLine()) != null) {
        // Use the index of the first space to locate the data to process
        int index = line.indexOf(" ");
        if (index == -1) {
            // No space on this line: it is skipped and not written out
            continue;
        }
        int num = Integer.parseInt(line.substring(0, index)) + 1;
        line = num + line.substring(index);
        writer.println(line);
    }
    // ....
}

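With indexOf and substring in place of the regular expression, the logic also seems a good deal clearer, and since the regex machinery is gone and the string is handled directly, efficiency should improve somewhat. But is replacing BufferedWriter with PrintStream a good idea? Not necessarily. BufferedWriter is a wrapper class with buffering, and it usually performs much better than writing without a buffer. Besides, writing to the file every time a line is processed is unlikely to perform well. On balance: use indexOf and substring to speed up the data processing, and when writing the file, use BufferedWriter so that the data goes into a buffer first, which improves efficiency.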

// code version 1.3
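The listing for version 1.3 is not included here. What follows is only a sketch of what the preceding paragraph describes (indexOf and substring for the processing, BufferedWriter for the output), not the original author's code. The class name FirstNumberFileRewriter, the helper incrementFirstNumber and the source/target parameters are assumptions. Two further choices are mine: a tab is also accepted as the first separator, and lines without a parsable leading number are copied through unchanged rather than skipped.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class FirstNumberFileRewriter {
    /** Adds 1 to the integer before the first space or tab; other lines are returned unchanged. */
    static String incrementFirstNumber(String line) {
        int index = line.indexOf(' ');
        int tab = line.indexOf('\t');
        if (tab != -1 && (index == -1 || tab < index)) {
            index = tab; // the tab comes first, so it ends the first field
        }
        if (index <= 0) {
            return line; // no separator, or the line starts with one
        }
        try {
            int num = Integer.parseInt(line.substring(0, index)) + 1;
            return num + line.substring(index);
        } catch (NumberFormatException e) {
            return line; // first field is not an integer
        }
    }

    /** Reads source line by line and writes the processed lines to target through a buffer. */
    static void writeFile(String source, String target) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(source));
             BufferedWriter writer = new BufferedWriter(new FileWriter(target))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(incrementFirstNumber(line));
                writer.newLine();
            }
        } // try-with-resources flushes and closes both streams
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the earlier versions in this post.
        writeFile("test.txt", "new.txt");
    }
}
```

Writing each line straight to the BufferedWriter keeps memory use flat, unlike version 1.0, which collected the whole output in a StringBuffer before writing.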