This article looks at how to do document classification with OpenNLP.
To classify documents you need a maximum entropy model (Maximum Entropy Model), which in OpenNLP corresponds to DoccatModel.
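As a rough illustration of the idea (not OpenNLP's actual implementation), a maximum entropy classifier scores each category by exponentiating a weighted sum of active features and normalizing over all categories. The weights below are hand-set assumptions; a trained DoccatModel learns them from the training samples:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MaxEntSketch {

    // exp of a weighted sum over "token present" features
    static double score(Map<String, Double> weights, String[] tokens) {
        double sum = 0.0;
        for (String t : tokens) {
            sum += weights.getOrDefault(t, 0.0);
        }
        return Math.exp(sum);
    }

    public static void main(String[] args) {
        // hypothetical per-category weights, chosen to mimic the samples below
        Map<String, Map<String, Double>> weights = new LinkedHashMap<>();
        weights.put("1", Map.of("a", 1.0, "b", 1.0, "c", 1.0));
        weights.put("0", Map.of("x", 1.0, "y", 1.0, "z", 1.0));

        String[] doc = {"a", "b"};
        double total = 0.0;
        double bestScore = -1.0;
        String best = null;
        for (Map.Entry<String, Map<String, Double>> e : weights.entrySet()) {
            double s = score(e.getValue(), doc);
            total += s;
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        // normalized score of the winning category
        System.out.printf(Locale.ROOT, "best=%s p=%.2f%n", best, bestScore / total);
    }
}
```

Category "1" wins here because two of its weighted tokens appear in the document, while category "0" matches none.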
@Test
public void testSimpleTraining() throws IOException {
    ObjectStream<DocumentSample> samples = ObjectStreamUtils.createObjectStream(
            new DocumentSample("1", new String[]{"a", "b", "c"}),
            new DocumentSample("1", new String[]{"a", "b", "c", "1", "2"}),
            new DocumentSample("1", new String[]{"a", "b", "c", "3", "4"}),
            new DocumentSample("0", new String[]{"x", "y", "z"}),
            new DocumentSample("0", new String[]{"x", "y", "z", "5", "6"}),
            new DocumentSample("0", new String[]{"x", "y", "z", "7", "8"}));

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, 100);
    params.put(TrainingParameters.CUTOFF_PARAM, 0);

    DoccatModel model = DocumentCategorizerME.train("x-unspecified", samples,
            params, new DoccatFactory());

    DocumentCategorizer doccat = new DocumentCategorizerME(model);

    double[] aProbs = doccat.categorize(new String[]{"a"});
    Assert.assertEquals("1", doccat.getBestCategory(aProbs));

    double[] bProbs = doccat.categorize(new String[]{"x"});
    Assert.assertEquals("0", doccat.getBestCategory(bProbs));

    // test to make sure sorted map's last key is cat 1 because it has the highest score.
    SortedMap<Double, Set<String>> sortedScoreMap = doccat.sortedScoreMap(new String[]{"a"});
    Set<String> cat = sortedScoreMap.get(sortedScoreMap.lastKey());
    Assert.assertEquals(1, cat.size());
}
For ease of testing, the training text here is built by hand as DocumentSample instances.
The categorize method returns an array of probabilities, and getBestCategory uses those probabilities to return the best-matching category.
The output is as follows:
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 6 events
	Indexing...  done.
Sorting and merging events... done. Reduced 6 events to 6.
Done indexing in 0.13 s.
Incorporating indexed data for training...
done.
	Number of Event Tokens: 6
	    Number of Outcomes: 2
	  Number of Predicates: 14
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-4.1588830833596715	0.5
  2:  ... loglikelihood=-2.6351991759048894	1.0
  3:  ... loglikelihood=-1.9518912133474995	1.0
  4:  ... loglikelihood=-1.5599038834410852	1.0
  5:  ... loglikelihood=-1.3039748361952568	1.0
  6:  ... loglikelihood=-1.1229511041438864	1.0
  7:  ... loglikelihood=-0.9877356230661396	1.0
  8:  ... loglikelihood=-0.8826624290652341	1.0
  9:  ... loglikelihood=-0.7985244514476817	1.0
 10:  ... loglikelihood=-0.729543972551105	1.0
//...
 95:  ... loglikelihood=-0.0933856684859806	1.0
 96:  ... loglikelihood=-0.09245907503183291	1.0
 97:  ... loglikelihood=-0.09155090064000486	1.0
 98:  ... loglikelihood=-0.09066059844628399	1.0
 99:  ... loglikelihood=-0.08978764309881068	1.0
100:  ... loglikelihood=-0.08893152970793908	1.0
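Conceptually, getBestCategory is just an argmax over the probability array that categorize returns, paired with the model's category list. A minimal stand-alone sketch (the category names and probabilities below are made up for illustration, not taken from the run above):

```java
public class BestCategory {

    // mirrors the shape of DocumentCategorizer.getBestCategory: pick the
    // category whose probability is highest
    static String getBestCategory(String[] categories, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"1", "0"};
        double[] probs = {0.83, 0.17}; // hypothetical output of categorize
        System.out.println(getBestCategory(categories, probs));
    }
}
```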
OpenNLP's categorize method requires you to tokenize the text yourself first, which makes it awkward to call in isolation. It does make sense in a pipeline design, though, where tokenization and other steps run earlier in the pipeline. This article only uses the official test source code as an introduction; readers can download a Chinese text classification training set, train on it, and then classify Chinese text.
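Since categorize expects pre-split tokens, a pipeline would run a tokenizer first and feed its output to the categorizer. A minimal sketch of that first step using plain whitespace splitting (a stand-in for a real tokenizer; the resulting array is what you would pass to doccat.categorize):

```java
import java.util.Arrays;

public class TokenizeFirst {
    public static void main(String[] args) {
        String text = "a b c";
        // naive whitespace tokenization; Chinese text would need a proper
        // word segmenter instead
        String[] tokens = text.trim().split("\\s+");
        System.out.println(Arrays.toString(tokens));
        // next step in the pipeline: doccat.categorize(tokens)
    }
}
```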