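The most popular approach to word segmentation today is deep learning, for example a recurrent neural network combined with a conditional random field (CRF); a CRF or a hidden Markov model (HMM) can also be used on its own. I have implemented all of these before and they work quite well: the HMM-based segmenter is a little weaker, the CRF is better, and adding deep learning gives the best results.
The most traditional approach, by contrast, is usually dictionary-based: the text is matched against a dictionary using maximum matching. The results are mediocre, but it is still worth looking at how it is implemented.
A Trie is a search tree whose keys are all strings; a key leads to its value. Lookup and insertion are efficient, with time complexity O(k), where k is the length of the key; the drawback is memory consumption. The core idea is to avoid unnecessary character comparisons, trading space for time and exploiting shared prefixes to speed up lookups.
The root of a Trie holds no character. The characters along the path from the root to a node, concatenated, form the string that node represents. Every other node holds exactly one character, and the children of any node all hold different characters.
For example, adding the following five words to a Trie produces the structure shown in the figure.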
TrieTree tree = new TrieTree();
tree.put("美利堅");
tree.put("美麗");
tree.put("金幣");
tree.put("金子");
tree.put("帝王");
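To make the shared prefixes visible in text form, here is a minimal sketch that inserts the same five words and prints one character per line, indented by depth. The TriePrinter class, its LinkedHashMap-based nodes, and the render and demo methods are assumptions made for illustration only; they are not the TrieTree used in this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical minimal trie used only to print the structure; not the article's TrieTree.
class TriePrinter {
    // Each node maps a character to its child; insertion order is kept for stable output.
    static final class Node {
        final Map<Character, Node> children = new LinkedHashMap<>();
    }

    final Node root = new Node();

    // Walks down from the root, creating a node only where the prefix does not exist yet.
    void put(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
    }

    // Renders one character per line, indented two spaces per level.
    String render() {
        StringBuilder sb = new StringBuilder();
        render(root, 0, sb);
        return sb.toString();
    }

    private static void render(Node node, int depth, StringBuilder sb) {
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            for (int i = 0; i < depth; i++) {
                sb.append("  ");
            }
            sb.append(e.getKey()).append('\n');
            render(e.getValue(), depth + 1, sb);
        }
    }

    // Builds the five-word example and returns its text rendering.
    static String demo() {
        TriePrinter tree = new TriePrinter();
        for (String w : new String[] {"美利堅", "美麗", "金幣", "金子", "帝王"}) {
            tree.put(w);
        }
        return tree.render();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The five words contain eleven characters in total, but the rendering has only nine lines, because 美 is stored once for 美利堅 and 美麗, and 金 once for 金幣 and 金子.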
https://github.com/sea-boat/TextAnalyzer/blob/master/src/main/java/com/seaboat/text/analyzer/segment/
As the output below shows, dictionary-based segmentation has clear shortcomings and needs machine learning to improve it further.
DictSegment segment = new DictSegment();
System.out.println(segment.seg("我是中國人"));
System.out.println(segment.seg("人工智能是什麼"));
System.out.println(segment.seg("北京互聯網違法和不良信息舉報中心"));
[我, 是, 中國人]
[人工智能, 是, 什麼]
[北京, 互聯網, 違法, 和不, 良, 信息, 舉報中心]
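A node class represents a Trie node; it holds its child nodes, a value, and a deleted flag. The getChild method looks up the child of this node that holds the given character, allChildrenDeleted checks whether every child of the node has been deleted, and setChild attaches a child to the node.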
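public class TrieNode {
    private TrieNode[] children;      // child nodes; null until the first child is added
    private String value;             // the single character held by this node (null for the root)
    private boolean deleted = false;  // lazy-deletion flag

    public TrieNode(String value) {
        this.value = value == null ? null : value.intern();
    }

    public boolean isEmpty() {
        return this.value == null && this.children == null;
    }

    public String getValue() {
        return value;
    }

    public boolean isDeleted() {
        return deleted;
    }

    public void setDeleted(boolean deleted) {
        this.deleted = deleted;
    }

    public TrieNode[] getChildren() {
        return children;
    }

    // Returns the non-deleted child holding the given character, or null if there is none.
    public TrieNode getChild(String word) {
        if (children == null)
            return null;
        for (TrieNode c : children) {
            // equals() instead of == so the lookup does not depend on string interning
            if (!c.deleted && word.equals(c.getValue()))
                return c;
        }
        return null;
    }

    // True if the node has no children or every child is flagged as deleted.
    public boolean allChildrenDeleted() {
        if (children == null)
            return true;
        for (TrieNode c : children) {
            if (!c.deleted)
                return false;
        }
        return true;
    }

    // Appends a child by growing the array by one slot.
    public void setChild(TrieNode child) {
        if (children == null) {
            children = new TrieNode[1];
            children[0] = child;
        } else {
            TrieNode[] temp = children;
            children = new TrieNode[temp.length + 1];
            System.arraycopy(temp, 0, children, 0, temp.length);
            children[children.length - 1] = child;
        }
    }
}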
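A TrieTree class represents the tree and holds its root node. The put method inserts a string into the tree: it walks down from the root to see how much of the string already exists as a prefix, creates the missing nodes, and attaches each one to its parent. The get and remove operations likewise walk the tree to perform lookup and deletion; for simplicity, deletion only sets the deleted flag on the affected nodes.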
public class TrieTree {
    protected TrieNode root;

    public TrieTree() {
        this.root = new TrieNode(null);
    }

    // Inserts a word one character at a time, reusing nodes for any existing prefix.
    public void put(String word) throws IllegalArgumentException {
        if (word == null) {
            throw new IllegalArgumentException();
        }
        TrieNode current = this.root;
        for (String s : word.split("")) {
            TrieNode child = current.getChild(s);
            if (child == null) {
                child = new TrieNode(s);
                current.setChild(child);
            }
            current = child;
        }
    }

    // Returns the node at the end of the path for word, or null if the path does not exist.
    public TrieNode get(String word) throws IllegalArgumentException {
        if (word == null) {
            throw new IllegalArgumentException();
        }
        TrieNode current = this.root;
        for (String s : word.split("")) {
            TrieNode child = current.getChild(s);
            if (child == null)
                return null;
            current = child;
        }
        return current;
    }

    // Lazy removal: flags nodes from the end of the word backwards, as long as
    // a node has no remaining live children.
    public void remove(String word) {
        if (word == null || word.length() <= 0) {
            return;
        }
        for (int i = 0; i < word.length(); i++) {
            String subWord = word.substring(0, word.length() - i);
            TrieNode current = this.root;
            for (String s : subWord.split("")) {
                TrieNode child = current.getChild(s);
                if (child == null)
                    break; // path is missing or already deleted; avoids a NullPointerException
                if (child.getChildren() == null || child.allChildrenDeleted())
                    child.setDeleted(true);
                current = child;
            }
        }
    }
}
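Note that get in the TrieTree class returns a node for any prefix of a stored word, so after put("美利堅") a call to get("美利") is also non-null. When exact-word lookup matters, one common option is to store an end-of-word marker on each node. The following is a minimal sketch of that idea; the MarkedTrie class, its HashMap-based nodes, and the hasPrefix, contains, and demo names are hypothetical and are not part of the author's code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant, not the article's TrieTree: each node records whether a word ends there.
class MarkedTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd; // true only if a word passed to put() ends at this node
    }

    private final Node root = new Node();

    void put(String word) {
        Node current = root;
        for (char c : word.toCharArray()) {
            current = current.children.computeIfAbsent(c, k -> new Node());
        }
        current.wordEnd = true;
    }

    // Walks the path for s; returns null if the path does not exist.
    private Node find(String s) {
        Node current = root;
        for (char c : s.toCharArray()) {
            current = current.children.get(c);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    // True if s is a prefix of (or equal to) some inserted word.
    boolean hasPrefix(String s) {
        return find(s) != null;
    }

    // True only if s itself was inserted and has not been removed.
    boolean contains(String s) {
        Node n = find(s);
        return n != null && n.wordEnd;
    }

    // Lazy removal: clear the marker and leave the nodes in place.
    void remove(String word) {
        Node n = find(word);
        if (n != null) {
            n.wordEnd = false;
        }
    }

    public static void main(String[] args) {
        MarkedTrie trie = new MarkedTrie();
        trie.put("美利堅");
        trie.put("美麗");
        System.out.println(trie.hasPrefix("美利"));  // true: a path exists
        System.out.println(trie.contains("美利"));   // false: no word ends here
        System.out.println(trie.contains("美利堅")); // true
        trie.remove("美利堅");
        System.out.println(trie.contains("美利堅")); // false after removal
    }
}
```

With a marker like this, removal only needs to clear one flag, at the cost of keeping unused nodes in memory.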
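seg is the segmentation method. It attempts maximum string matching, trying to match the longest word in the dictionary, and uses the Trie to check whether a string exists.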
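// Requires java.util.ArrayList and java.util.List.
public List<String> seg(String text) {
    int flag = 0;   // start index of the current candidate
    int delta = 1;  // length of the current candidate
    List<String> words = new ArrayList<String>();
    while (flag + delta <= text.length()) {
        String temp = text.substring(flag, flag + delta);
        if (tree.get(temp) != null) {
            // temp is still a path in the Trie: emit it at the end of the text,
            // otherwise try to extend it by one more character
            if ((flag + delta) == text.length()) {
                words.add(temp);
                break;
            }
            delta++;
            continue;
        }
        if (delta == 1) {
            // The character is not in the Trie at all: emit it on its own so the
            // loop advances instead of adding empty strings forever
            words.add(temp);
            flag++;
            continue;
        }
        // The last character broke the match: emit the matched part and restart from that character
        words.add(temp.substring(0, temp.length() - 1));
        flag = flag + delta - 1;
        delta = 1;
    }
    return words;
}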
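For comparison, forward maximum matching is often written the other way round: start from the longest possible window and shrink it until a dictionary word is found. Here is a minimal sketch of that formulation, assuming a plain HashSet as the dictionary; the MaxMatch class, its segment method, and the small sample dictionary are hypothetical and are not taken from the repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of forward maximum matching over a plain word set.
class MaxMatch {
    static List<String> segment(String text, Set<String> dict) {
        // The longest dictionary word bounds the window size.
        int maxLen = 1;
        for (String w : dict) {
            maxLen = Math.max(maxLen, w.length());
        }

        List<String> words = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxLen);
            // Shrink the window until it is a dictionary word or a single character.
            while (end > start + 1 && !dict.contains(text.substring(start, end))) {
                end--;
            }
            words.add(text.substring(start, end));
            start = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Assumed sample dictionary, for illustration only.
        Set<String> dict = new HashSet<>(Arrays.asList("中國", "中國人", "人工智能", "什麼"));
        System.out.println(segment("我是中國人", dict));     // [我, 是, 中國人]
        System.out.println(segment("人工智能是什麼", dict)); // [人工智能, 是, 什麼]
    }
}
```

Because the window only accepts complete dictionary words, a character that starts no word is emitted on its own and the scan always moves forward.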