#coding4fun# Word Frequency Counting: Optimization Ideas

Date: 2020-02-15


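For this round of coding4fun I went with a HashMap-based implementation. The overall approach and flow are probably much the same as everyone else's, and the C++ folks have already written good summaries that cover the logic-level optimizations, so here I'll describe a few optimizations specific to the Java implementation.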

Use ByteString instead of String

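At first I read the file, converted it into a String object, and did all the work through String operations, which made the code convenient to write.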

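But there is a problem: converting the byte[] read from the file into a String is very time-consuming. Just allocating the memory for a 1 GB String takes a long time. String (up to Java 8) stores its text in a char[], so constructing one from a byte[] means walking the entire byte[] and decoding it according to the charset. This step is expensive and can definitely be optimized.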

So I used a ByteString class in place of String:

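class ByteString {
    byte[] bs;   // backing buffer read from the file (shared, not copied)
    int start;   // offset of the word's first byte
    int end;     // end offset of the word
    ByteString(byte[] bs, int start, int end) {
        this.bs = bs;
        this.start = start;
        this.end = end;
    }
}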

The hashCode() and equals() methods are modeled on String's implementation.
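The post does not show those two methods, so here is a minimal sketch of what they might look like. It assumes `end` is exclusive and that the input is ASCII; the wrapper class `ByteStringSketch`, the cached `hash` field, and `toString()` are illustrative additions, not part of the original code.

```java
import java.nio.charset.StandardCharsets;

public class ByteStringSketch {
    /** A view over bs[start, end); end is assumed to be exclusive. */
    static final class ByteString {
        final byte[] bs;
        final int start;
        final int end;
        private int hash; // cached as String does; 0 means "not computed yet"

        ByteString(byte[] bs, int start, int end) {
            this.bs = bs;
            this.start = start;
            this.end = end;
        }

        @Override
        public int hashCode() {
            int h = hash;
            if (h == 0 && end > start) {
                // Same polynomial as String.hashCode(): h = 31 * h + next element
                for (int i = start; i < end; i++) {
                    h = 31 * h + bs[i];
                }
                hash = h;
            }
            return h;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof ByteString)) return false;
            ByteString other = (ByteString) o;
            int len = end - start;
            if (len != other.end - other.start) return false;
            // Compare the two ranges byte by byte; no copy is made
            for (int i = 0; i < len; i++) {
                if (bs[start + i] != other.bs[other.start + i]) return false;
            }
            return true;
        }

        @Override
        public String toString() {
            // Only decode to a String when the text is actually needed (e.g. output)
            return new String(bs, start, end - start, StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) {
        byte[] data = "the cat the".getBytes(StandardCharsets.US_ASCII);
        ByteString a = new ByteString(data, 0, 3);
        ByteString b = new ByteString(data, 8, 11);
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```

Because the object only stores a reference and two offsets, creating one costs the same regardless of how large the underlying buffer is.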

I tested the following code on the 16-core code4fun machine:

Code 1:

byte[] bs = new byte[1024*1024*1024];
long st = System.currentTimeMillis();
new String(bs);
System.out.println(System.currentTimeMillis() - st);  // 2619ms

Code 2:

byte[] bs = new byte[1024*1024*1024];
long st = System.currentTimeMillis();
int count = 100000;
for (int i = 0; i < count; i++)
    new ByteString(bs, 0, 100);
System.out.println(System.currentTimeMillis() - st);  //10ms

Keep the code inside loops lean

With a HashMap implementation, counting words inevitably involves code like this:

ByteString str = new ByteString(bs, start, end);
Count count = map.get(str);
if (count == null) {
    count = new Count(str, 1);
    map.put(str, count);
} else {
    count.add(1);
}

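Normally there is nothing wrong with this code, but once the number of words gets large enough (the final 1.1 GB file contains more than 200 million words), it is worth optimizing. The object created on the first line is only needed the first time a word appears; on every other occurrence it does not have to be created at all.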

So I created a Pmap class that extends HashMap and added a get(byte[] bs, int start, int end) method. The code above becomes:

Count count = map.get(bs, start, end);
if (count == null) {
    ByteString str = new ByteString(bs, start, end);
    count = new Count(str, 1);
    map.put(str, count);
} else {
    count.add(1);
}
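The post does not show how Pmap implements this lookup. A subclass outside java.util cannot normally reach HashMap's internal bucket array, so the sketch below swaps in a small standalone chained hash table to illustrate the idea of an allocation-free lookup by byte range. Everything in it is an assumption made for illustration: the names `PMapSketch`, `PMap`, and `countWords`, the merging of key and counter into a single `Count` entry, the fixed table size with no resizing, and the exclusive `end`.

```java
import java.nio.charset.StandardCharsets;

public class PMapSketch {
    /** Hypothetical entry: key (a byte range) and counter merged for brevity. */
    static final class Count {
        final byte[] bs;
        final int start;
        final int end;    // exclusive (assumption)
        final int hash;
        int n = 1;
        Count next;       // next entry in the same bucket

        Count(byte[] bs, int start, int end, int hash, Count next) {
            this.bs = bs;
            this.start = start;
            this.end = end;
            this.hash = hash;
            this.next = next;
        }
    }

    /** Fixed-size chained hash table; no resizing, to keep the sketch short. */
    static final class PMap {
        private final Count[] table = new Count[1 << 16];

        private static int hash(byte[] bs, int start, int end) {
            int h = 0;
            for (int i = start; i < end; i++) {
                h = 31 * h + bs[i];
            }
            return h ^ (h >>> 16); // fold high bits into the bucket index
        }

        private static boolean sameBytes(Count c, byte[] bs, int start, int end) {
            int len = end - start;
            if (c.end - c.start != len) return false;
            for (int i = 0; i < len; i++) {
                if (c.bs[c.start + i] != bs[start + i]) return false;
            }
            return true;
        }

        /** Looks up bs[start, end) without allocating a key object. */
        Count get(byte[] bs, int start, int end) {
            int h = hash(bs, start, end);
            for (Count c = table[h & (table.length - 1)]; c != null; c = c.next) {
                if (c.hash == h && sameBytes(c, bs, start, end)) return c;
            }
            return null;
        }

        /** Adds a word with count 1; the caller has already checked it is absent. */
        Count put(byte[] bs, int start, int end) {
            int h = hash(bs, start, end);
            int idx = h & (table.length - 1);
            Count c = new Count(bs, start, end, h, table[idx]);
            table[idx] = c;
            return c;
        }
    }

    /** Counts space-separated words; allocates only on a word's first occurrence. */
    static PMap countWords(byte[] data) {
        PMap map = new PMap();
        int start = 0;
        for (int i = 0; i <= data.length; i++) {
            if (i == data.length || data[i] == ' ') {
                if (i > start) {
                    Count c = map.get(data, start, i);
                    if (c == null) {
                        map.put(data, start, i);
                    } else {
                        c.n++;
                    }
                }
                start = i + 1;
            }
        }
        return map;
    }

    public static void main(String[] args) {
        byte[] data = "the cat the hat the".getBytes(StandardCharsets.US_ASCII);
        PMap map = countWords(data);
        System.out.println(map.get(data, 0, 3).n); // 3
        System.out.println(map.get(data, 4, 7).n); // 1
    }
}
```

With this shape, the number of objects allocated grows with the number of distinct words rather than with the total number of words in the file.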