Defining Custom Named Entities with OpenNLP

Date: 2019-12-05


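Introduction

This article looks at how to define custom named entities with OpenNLP: annotating the training data, training a model, and applying it.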

Maven

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.4</version>
</dependency>

Practice

Training the model

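import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import opennlp.tools.namefind.BioCodec;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderFactory;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

// train the name finder: one sentence per line, entities wrapped in <START:type> ... <END>
String typedEntities = "<START:organization> NATO <END>\n" +
        "<START:location> United States <END>\n" +
        "<START:organization> NATO Parliamentary Assembly <END>\n" +
        "<START:location> Edinburgh <END>\n" +
        "<START:location> Britain <END>\n" +
        "<START:person> Anders Fogh Rasmussen <END>\n" +
        "<START:location> U . S . <END>\n" +
        "<START:person> Barack Obama <END>\n" +
        "<START:location> Afghanistan <END>\n" +
        "<START:person> Rasmussen <END>\n" +
        "<START:location> Afghanistan <END>\n" +
        "<START:date> 2010 <END>";
// MockInputStreamFactory exists only in OpenNLP's test sources, so read the text from memory instead
InputStreamFactory inputFactory = () -> new ByteArrayInputStream(typedEntities.getBytes(StandardCharsets.UTF_8));
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
        new PlainTextByLineStream(inputFactory, StandardCharsets.UTF_8));

TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ALGORITHM_PARAM, "MAXENT");
params.put(TrainingParameters.ITERATIONS_PARAM, 70);
params.put(TrainingParameters.CUTOFF_PARAM, 1);

// a null type keeps the entity types found in the training data
TokenNameFinderModel nameFinderModel = NameFinderME.train("eng", null, sampleStream,
        params, TokenNameFinderFactory.create(null, null, Collections.emptyMap(), new BioCodec()));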

OpenNLP uses <START> and <END> to mark entities in the training text. To give an entity a type, put the type after START, separated by a colon, for example <START:person>.
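
To make the annotation format concrete, here is a minimal sketch in plain Java of how one annotated line splits into tokens and typed spans. It is not OpenNLP's own parser (NameSampleDataStream does that job during training). The class AnnotatedLineSketch, its Entity type, and the "default" label used for an untyped <START> are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class AnnotatedLineSketch {

    /** A typed entity covering tokens [start, end) of the cleaned sentence. */
    static final class Entity {
        final String type;
        final int start; // index of the first token, inclusive
        final int end;   // index after the last token, exclusive

        Entity(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ")";
        }
    }

    /**
     * Splits a whitespace-tokenized training line into plain tokens (appended to
     * tokensOut) and returns the entities marked by the START/END tags.
     */
    static List<Entity> parse(String line, List<String> tokensOut) {
        List<Entity> entities = new ArrayList<>();
        String openType = null;
        int openStart = -1;
        for (String part : line.trim().split("\\s+")) {
            if (part.startsWith("<START") && part.endsWith(">")) {
                // "<START>" has no type; "<START:person>" carries one after the colon.
                int colon = part.indexOf(':');
                // Assumption: untyped entities are labeled "default" in this sketch.
                openType = colon < 0 ? "default" : part.substring(colon + 1, part.length() - 1);
                openStart = tokensOut.size();
            } else if (part.equals("<END>")) {
                if (openType != null) {
                    entities.add(new Entity(openType, openStart, tokensOut.size()));
                }
                openType = null;
            } else {
                tokensOut.add(part);
            }
        }
        return entities;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        List<Entity> entities = parse(
                "<START:person> Barack Obama <END> visited <START:location> Afghanistan <END>",
                tokens);
        System.out.println(tokens);   // [Barack, Obama, visited, Afghanistan]
        System.out.println(entities); // [person[0,2), location[3,4)]
    }
}
```

The tags must be separated from the surrounding tokens by whitespace, which is why the training data above writes "U . S ." as four tokens.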

Parameter notes

ALGORITHM_PARAM

The training algorithm; MAXENT selects maximum entropy. The maxent documentation says of it: "On the engineering level, using maxent is an excellent way of creating programs which perform very difficult classification tasks very well."

ITERATIONS_PARAM

The number of training iterations; ignored if -params is used.

CUTOFF_PARAM

The minimum number of times a feature must be seen.
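
As a rough illustration of what the cutoff means, the sketch below counts how often each feature is observed and keeps only those seen at least cutoff times. It does not reproduce OpenNLP's internals; CutoffSketch, keepFrequent, and the feature strings are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CutoffSketch {

    /** Returns the features observed at least {@code cutoff} times, in first-seen order. */
    static Set<String> keepFrequent(List<String> observed, int cutoff) {
        // Count occurrences of each feature string.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String feature : observed) {
            counts.merge(feature, 1, Integer::sum);
        }
        // Keep only the features that reach the cutoff.
        Set<String> kept = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical feature strings, not OpenNLP's real feature names.
        List<String> observed = Arrays.asList("w=nato", "w=obama", "w=nato", "w=2010");
        System.out.println(keepFrequent(observed, 1)); // [w=nato, w=obama, w=2010]
        System.out.println(keepFrequent(observed, 2)); // [w=nato]
    }
}
```

Under this reading, a cutoff of 1 keeps every observed feature, which suits a training set as small as the one in this article.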

Using the model

Once the model has been trained, it can be used to find entities in tokenized text:

NameFinderME nameFinder = new NameFinderME(nameFinderModel);

// now test if it can detect the sample sentences
String[] sentence = "NATO United States Barack Obama".split("\\s+");
Span[] names = nameFinder.find(sentence);

// each Span covers tokens [getStart(), getEnd()), so join them back into the entity text
Stream.of(names)
        .forEach(span -> {
            String named = IntStream.range(span.getStart(), span.getEnd())
                    .mapToObj(i -> sentence[i])
                    .collect(Collectors.joining(" "));
            System.out.println("find type: " + span.getType() + ",name: " + named);
        });

The output is:

find type: organization,name: NATO
find type: location,name: United States
find type: person,name: Barack Obama

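Summary

OpenNLP's annotation format for custom named entities leaves room for customization: developers can define the entity types specific to their own domain and so improve recognition accuracy for those entities.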

