This article looks at how to use OpenNLP for part-of-speech tagging.
Part-of-speech (POS) tagging is the process of attaching a description to a word or a span of text. That description is called a tag.
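Two Chinese POS tag sets are in common use today: the Peking University tag set and the Penn (University of Pennsylvania) tag set. Words in modern Chinese fall into two classes covering 12 parts of speech. Content words are nouns, verbs, adjectives, numerals, measure words and pronouns. Function words are adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia.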
Most approaches in this area use either a hidden Markov model (HMM) decoded with the Viterbi algorithm, or a maximum entropy (MaxEnt) model.
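To make the HMM plus Viterbi combination concrete, here is a minimal sketch of Viterbi decoding over a three-tag toy model. The class name `ViterbiSketch` and every probability in it are made-up illustration values, not taken from a trained model, and the code is independent of OpenNLP.

```java
/** Minimal Viterbi decoder for a toy HMM tagger (all numbers are illustrative assumptions). */
final class ViterbiSketch {

    static final String[] TAGS = { "DT", "NN", "VBD" };
    // START[t]: probability that a sentence starts with tag t.
    static final double[] START = { 0.8, 0.1, 0.1 };
    // TRANS[p][t]: probability of tag t given the previous tag p.
    static final double[][] TRANS = {
        { 0.05, 0.90, 0.05 },
        { 0.10, 0.30, 0.60 },
        { 0.50, 0.30, 0.20 }
    };
    // EMIT[t][w]: probability of word w given tag t; words are 0=the, 1=driver, 2=got.
    static final double[][] EMIT = {
        { 0.90, 0.05, 0.05 },
        { 0.05, 0.80, 0.15 },
        { 0.05, 0.15, 0.80 }
    };

    /** Returns the index of the most probable tag for each observed word. */
    static int[] decode(int[] obs, double[] start, double[][] trans, double[][] emit) {
        int n = obs.length;
        int k = start.length;
        if (n == 0) {
            return new int[0];
        }
        double[][] score = new double[n][k]; // best log-probability of a path ending in tag t at position i
        int[][] back = new int[n][k];        // previous tag on that best path
        for (int t = 0; t < k; t++) {
            score[0][t] = Math.log(start[t]) + Math.log(emit[t][obs[0]]);
        }
        for (int i = 1; i < n; i++) {
            for (int t = 0; t < k; t++) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = 0;
                for (int p = 0; p < k; p++) {
                    double s = score[i - 1][p] + Math.log(trans[p][t]);
                    if (s > best) {
                        best = s;
                        arg = p;
                    }
                }
                score[i][t] = best + Math.log(emit[t][obs[i]]);
                back[i][t] = arg;
            }
        }
        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int t = 1; t < k; t++) {
            if (score[n - 1][t] > score[n - 1][last]) {
                last = t;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int i = n - 1; i > 0; i--) {
            path[i - 1] = back[i][path[i]];
        }
        return path;
    }

    public static void main(String[] args) {
        String[] words = { "the", "driver", "got" };
        int[] path = decode(new int[] { 0, 1, 2 }, START, TRANS, EMIT);
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i] + "/" + TAGS[path[i]]);
        }
    }
}
```

Working in log space avoids numeric underflow on longer sentences, and the back-pointer table is what lets the decoder recover the whole tag sequence rather than only its score.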
In OpenNLP, the POSTaggerME class performs basic tagging and the ChunkerME class performs chunking.
// Train a POS model of the given type from the annotated sample sentences.
public static POSModel trainPOSModel(ModelType type) throws IOException {
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ALGORITHM_PARAM, type.toString());
    params.put(TrainingParameters.ITERATIONS_PARAM, 100);
    params.put(TrainingParameters.CUTOFF_PARAM, 5);
    return POSTaggerME.train("eng", createSampleStream(), params, new POSTaggerFactory());
}

// Read the word_TAG training sentences from the test resources.
private static ObjectStream<POSSample> createSampleStream() throws IOException {
    InputStreamFactory in = new ResourceAsStreamFactory(POSTaggerMETest.class,
        "postag/AnnotatedSentences.txt");
    return new WordTagSampleStream(new PlainTextByLineStream(in, StandardCharsets.UTF_8));
}

@Test
public void testPOSTagger() throws IOException {
    POSModel posModel = trainPOSModel(ModelType.MAXENT);
    POSTagger tagger = new POSTaggerME(posModel);

    // Tag an already tokenized sentence; one tag comes back per token.
    String[] tags = tagger.tag(new String[] { "The", "driver", "got", "badly", "injured", "." });

    Assert.assertEquals(6, tags.length);
    Assert.assertEquals("DT", tags[0]);
    Assert.assertEquals("NN", tags[1]);
    Assert.assertEquals("VBD", tags[2]);
    Assert.assertEquals("RB", tags[3]);
    Assert.assertEquals("VBN", tags[4]);
    Assert.assertEquals(".", tags[5]);
}
The model is trained first. The training text looks like this:
Last_JJ September_NNP ,_, I_PRP tried_VBD to_TO find_VB out_RP the_DT address_NN of_IN an_DT old_JJ school_NN friend_NN whom_WP I_PRP had_VBD not_RB seen_VBN for_IN 15_CD years_NNS ._.
I_PRP just_RB knew_VBD his_PRP$ name_NN ,_, Alan_NNP McKennedy_NNP ,_, and_CC I_PRP 'd_MD heard_VBD the_DT rumour_NN that_IN he_PRP 'd_MD moved_VBD to_TO Scotland_NNP ,_, the_DT country_NN of_IN his_PRP$ ancestors_NNS ._.
So_IN I_PRP called_VBD Julie_NNP ,_, a_DT friend_NN who's_WDT still_RB in_IN contact_NN with_IN him_PRP ._.
She_PRP told_VBD me_PRP that_IN he_PRP lived_VBD in_IN 23213_CD Edinburgh_NNP ,_, Worcesterstreet_NNP 12_CD ._.
I_PRP wrote_VBD him_PRP a_DT letter_NN right_RB away_RB and_CC he_PRP answered_VBD soon_RB ,_, sounding_VBG very_RB happy_JJ and_CC delighted_JJ ._.
Tag descriptions:
DT: Determiner
NN: Noun, singular or mass
VBD: Verb, past tense
RB: Adverb
VBN: Verb, past participle
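The word_TAG training format shown above is simple enough to read by hand. The following is a minimal sketch of splitting one such line into word and tag pairs. `WordTagLineSketch` is a hypothetical helper written for illustration; in the OpenNLP code above, the real parsing is done by WordTagSampleStream.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a "word_TAG word_TAG ..." line into {word, tag} pairs. */
final class WordTagLineSketch {

    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        for (String token : trimmed.split("\\s+")) {
            // Split at the LAST underscore so that a word containing '_' stays intact.
            int sep = token.lastIndexOf('_');
            if (sep <= 0 || sep == token.length() - 1) {
                throw new IllegalArgumentException("Expected word_TAG but got: " + token);
            }
            pairs.add(new String[] { token.substring(0, sep), token.substring(sep + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("I_PRP just_RB knew_VBD his_PRP$ name_NN ,_,")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Punctuation is tagged with itself in this data (`,_,` and `._.`), which is why the sketch only rejects tokens where the underscore is missing or sits at either end.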
Chunking with ChunkerME follows the same pattern:

private Chunker chunker;

// Tokens, their POS tags, and the chunk tags the trained model is expected to predict.
private static String[] toks1 = { "Rockwell", "said", "the", "agreement", "calls", "for", "it",
    "to", "supply", "200", "additional", "so-called", "shipsets", "for", "the", "planes", "." };

private static String[] tags1 = { "NNP", "VBD", "DT", "NN", "VBZ", "IN", "PRP", "TO", "VB",
    "CD", "JJ", "JJ", "NNS", "IN", "DT", "NNS", "." };

private static String[] expect1 = { "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-SBAR", "B-NP",
    "B-VP", "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "I-NP", "O" };

@Before
public void startup() throws IOException {
    // Train a chunker model from the annotated test resource.
    ResourceAsStreamFactory in = new ResourceAsStreamFactory(getClass(), "chunker/test.txt");
    ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(
        new PlainTextByLineStream(in, StandardCharsets.UTF_8));

    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, 70);
    params.put(TrainingParameters.CUTOFF_PARAM, 1);

    ChunkerModel chunkerModel = ChunkerME.train("eng", sampleStream, params, new ChunkerFactory());
    this.chunker = new ChunkerME(chunkerModel);
}

@Test
public void testChunkAsArray() throws Exception {
    // The chunker takes both the tokens and their POS tags as input.
    String[] preds = chunker.chunk(toks1, tags1);
    Assert.assertArrayEquals(expect1, preds);
}
Here, too, a model is trained first. The training text looks like this:
Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-VP
its PRP$ B-NP
contract NN I-NP
with IN B-PP
Boeing NNP B-NP
Co. NNP I-NP
to TO B-VP
provide VB I-VP
structural JJ B-NP
parts NNS I-NP
for IN B-PP
Boeing NNP B-NP
's POS B-NP
747 CD I-NP
jetliners NNS I-NP
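Tag descriptions: the B- prefix marks the first token of a chunk, I- marks a token that continues the current chunk, and O marks a token outside any chunk. The suffix gives the chunk type, for example NP (noun phrase), VP (verb phrase), PP (prepositional phrase) and SBAR (subordinate clause).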
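The per-token chunk tags returned by the chunker become easier to read once they are grouped back into phrases. The sketch below does that grouping for the tokens and expected tags from the test above. `ChunkSpanSketch` is a hypothetical helper that uses only the Java standard library; it assumes a B- tag opens a chunk, an I- tag of the same type continues it, and anything else closes it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that groups B-/I-/O chunk tags into bracketed phrases. */
final class ChunkSpanSketch {

    static List<String> toPhrases(String[] tokens, String[] chunkTags) {
        if (tokens.length != chunkTags.length) {
            throw new IllegalArgumentException("tokens and chunk tags must have the same length");
        }
        List<String> phrases = new ArrayList<>();
        StringBuilder current = null; // phrase being built, or null when outside a chunk
        String type = null;           // type of the current chunk, e.g. "NP"
        for (int i = 0; i < tokens.length; i++) {
            String tag = chunkTags[i];
            boolean continues = current != null
                && tag.startsWith("I-") && tag.substring(2).equals(type);
            if (continues) {
                current.append(' ').append(tokens[i]);
                continue;
            }
            if (current != null) { // any other tag closes the open chunk
                phrases.add("[" + type + " " + current + "]");
                current = null;
                type = null;
            }
            if (tag.startsWith("B-") || tag.startsWith("I-")) { // a stray I- tag opens a new chunk
                type = tag.substring(2);
                current = new StringBuilder(tokens[i]);
            }
        }
        if (current != null) {
            phrases.add("[" + type + " " + current + "]");
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] toks = { "Rockwell", "said", "the", "agreement", "calls", "for", "it", "to",
            "supply", "200", "additional", "so-called", "shipsets", "for", "the", "planes", "." };
        String[] chunks = { "B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-SBAR", "B-NP", "B-VP",
            "I-VP", "B-NP", "I-NP", "I-NP", "I-NP", "B-PP", "B-NP", "I-NP", "O" };
        for (String phrase : toPhrases(toks, chunks)) {
            System.out.println(phrase);
        }
    }
}
```

For the sample sentence this yields ten phrases, such as `[NP the agreement]` and `[VP to supply]`; the final period is tagged O and therefore belongs to no phrase.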
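This article gave a first look at how to use OpenNLP for POS tagging. Model training is an important part of the process: training on domain-specific data can improve tagging accuracy on text from that domain.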