This article looks at how to do document classification with OpenNLP.
To classify documents you need a maximum entropy model (Maximum Entropy Model), which in OpenNLP corresponds to DoccatModel.
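As a rough illustration of the idea (not OpenNLP's actual implementation), a maximum entropy classifier scores each category by exponentiating a weighted sum of active features and normalizing over all categories. The weights below are hand-set assumptions; a trained DoccatModel learns them from the training samples:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class MaxEntSketch {

    // exp of a weighted sum over "token present" features
    static double score(Map<String, Double> weights, String[] tokens) {
        double sum = 0.0;
        for (String t : tokens) {
            sum += weights.getOrDefault(t, 0.0);
        }
        return Math.exp(sum);
    }

    public static void main(String[] args) {
        // hypothetical per-category weights, chosen to mimic the samples below
        Map<String, Map<String, Double>> weights = new LinkedHashMap<>();
        weights.put("1", Map.of("a", 1.0, "b", 1.0, "c", 1.0));
        weights.put("0", Map.of("x", 1.0, "y", 1.0, "z", 1.0));

        String[] doc = {"a", "b"};
        double total = 0.0;
        double bestScore = -1.0;
        String best = null;
        for (Map.Entry<String, Map<String, Double>> e : weights.entrySet()) {
            double s = score(e.getValue(), doc);
            total += s;
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        // normalized score of the winning category
        System.out.printf(Locale.ROOT, "best=%s p=%.2f%n", best, bestScore / total);
    }
}
```

Category "1" wins here because two of its weighted tokens appear in the document, while category "0" matches none.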
@Test
public void testSimpleTraining() throws IOException {
    ObjectStream<DocumentSample> samples = ObjectStreamUtils.createObjectStream(
            new DocumentSample("1", new String[]{"a", "b", "c"}),
            new DocumentSample("1", new String[]{"a", "b", "c", "1", "2"}),
            new DocumentSample("1", new String[]{"a", "b", "c", "3", "4"}),
            new DocumentSample("0", new String[]{"x", "y", "z"}),
            new DocumentSample("0", new String[]{"x", "y", "z", "5", "6"}),
            new DocumentSample("0", new String[]{"x", "y", "z", "7", "8"}));

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, 100);
    params.put(TrainingParameters.CUTOFF_PARAM, 0);

    DoccatModel model = DocumentCategorizerME.train("x-unspecified", samples,
            params, new DoccatFactory());

    DocumentCategorizer doccat = new DocumentCategorizerME(model);

    double[] aProbs = doccat.categorize(new String[]{"a"});
    Assert.assertEquals("1", doccat.getBestCategory(aProbs));

    double[] bProbs = doccat.categorize(new String[]{"x"});
    Assert.assertEquals("0", doccat.getBestCategory(bProbs));

    // test to make sure sorted map's last key is cat 1 because it has the highest score.
    SortedMap<Double, Set<String>> sortedScoreMap = doccat.sortedScoreMap(new String[]{"a"});
    Set<String> cat = sortedScoreMap.get(sortedScoreMap.lastKey());
    Assert.assertEquals(1, cat.size());
}
For ease of testing, the training text here is built by hand as DocumentSample instances.
The categorize method returns an array of probabilities, and getBestCategory uses those probabilities to return the best-matching category.
The output is as follows:
Indexing events with TwoPass using cutoff of 0

	Computing event counts...  done. 6 events
	Indexing...  done.
Sorting and merging events... done. Reduced 6 events to 6.
Done indexing in 0.13 s.
Incorporating indexed data for training...
done.
	Number of Event Tokens: 6
	    Number of Outcomes: 2
	  Number of Predicates: 14
...done.
Computing model parameters ...
Performing 100 iterations.
  1:  ... loglikelihood=-4.1588830833596715	0.5
  2:  ... loglikelihood=-2.6351991759048894	1.0
  3:  ... loglikelihood=-1.9518912133474995	1.0
  4:  ... loglikelihood=-1.5599038834410852	1.0
  5:  ... loglikelihood=-1.3039748361952568	1.0
  6:  ... loglikelihood=-1.1229511041438864	1.0
  7:  ... loglikelihood=-0.9877356230661396	1.0
  8:  ... loglikelihood=-0.8826624290652341	1.0
  9:  ... loglikelihood=-0.7985244514476817	1.0
 10:  ... loglikelihood=-0.729543972551105	1.0
//...
 95:  ... loglikelihood=-0.0933856684859806	1.0
 96:  ... loglikelihood=-0.09245907503183291	1.0
 97:  ... loglikelihood=-0.09155090064000486	1.0
 98:  ... loglikelihood=-0.09066059844628399	1.0
 99:  ... loglikelihood=-0.08978764309881068	1.0
100:  ... loglikelihood=-0.08893152970793908	1.0
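Conceptually, getBestCategory is just an argmax over the probability array that categorize returns, paired with the model's category list. A minimal stand-alone sketch (the category names and probabilities below are made up for illustration, not taken from the run above):

```java
public class BestCategory {

    // mirrors the shape of DocumentCategorizer.getBestCategory: pick the
    // category whose probability is highest
    static String getBestCategory(String[] categories, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"1", "0"};
        double[] probs = {0.83, 0.17}; // hypothetical output of categorize
        System.out.println(getBestCategory(categories, probs));
    }
}
```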
OpenNLP's categorize method requires you to tokenize the text yourself first, which makes it awkward to call in isolation. It does make sense in a pipeline design, though, where tokenization and other steps run earlier in the pipeline. This article only uses the official test source code as an introduction; readers can download a Chinese text classification training set, train on it, and then classify Chinese text.
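Since categorize expects pre-split tokens, a pipeline would run a tokenizer first and feed its output to the categorizer. A minimal sketch of that first step using plain whitespace splitting (a stand-in for a real tokenizer; the resulting array is what you would pass to doccat.categorize):

```java
import java.util.Arrays;

public class TokenizeFirst {
    public static void main(String[] args) {
        String text = "a b c";
        // naive whitespace tokenization; Chinese text would need a proper
        // word segmenter instead
        String[] tokens = text.trim().split("\\s+");
        System.out.println(Arrays.toString(tokens));
        // next step in the pipeline: doccat.categorize(tokens)
    }
}
```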