Information Extraction from MIT

Date: 2019-11-24

MITIE

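MITIE is an information extraction library and toolset released by MIT's NLP team. It is free and state-of-the-art. It currently provides named entity extraction and binary relation detection, along with tools for training custom extractors and relation detectors.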

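MITIE's core is written in C++ and built on dlib, a high-performance machine learning library. The MIT team provides several pre-trained models covering English, Spanish, and German, all trained on large corpora. There is no Chinese model among them, so we have to train that one ourselves.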

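Although MITIE is written in C++, it also provides APIs for calling it from other languages. My own projects often mix Java and Python, so I just build MITIE as a shared library and call it from each of them, which is very convenient.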

Why MITIE exists

Here is what the people at the MIT lab say about it.

I work at a lab and there are a lot of cool things about my job. In fact, I could go on all day about it, but in this post I want to talk about one thing in particular, which is that we recently got funded by the program to make an open source natural language processing library focused on information extraction.

Why make such a thing when there are already open source libraries out there for this (e.g. OpenNLP, NLTK, Stanford IE, etc.)? Well, if you look around you quickly find out that everything which exists is either expensive, not state-of-the-art, or GPL licensed. If you wanted to use this kind of NLP tool in a non-GPL project then you are either out of luck, have to pay a lot of money, or settle for something of low quality. Well, not anymore! We just released the first version of our MIT Information Extraction library which is built using state-of-the-art statistical machine learning tools.

How to use it

Take entity extraction as an example. For convenience you can use the models MITIE provides; otherwise you need to train your own. Download them from github.com/mit-nlp/MIT… .

Then create a test.txt file containing the text to test:

I met with john becker at HBU.
The other day at work I saw Brian Smith from CMU.

Finally, write the following code:

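#include <mitie/named_entity_extractor.h>
#include <mitie/conll_tokenizer.h>
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cstdio>   // printf
#include <cstdlib>  // exit, EXIT_SUCCESS, EXIT_FAILURE
#include <string>
#include <vector>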

using namespace std;
using namespace mitie;

std::vector<string> tokenize_file (
    const string& filename
)
{
    ifstream fin(filename.c_str());
    if (!fin)
    {
        cout << "Unable to load input text file" << endl;
        exit(EXIT_FAILURE);
    }
    conll_tokenizer tok(fin);
    std::vector<string> tokens;
    string token;
    while(tok(token))
        tokens.push_back(token);

    return tokens;
}


int main(int argc, char** argv)
{
    try
    {
        if (argc != 3)
        {
            printf("You must give a MITIE ner model file as the first command line argument\n");
            printf("followed by a text file to process.\n");
            return EXIT_FAILURE;
        }
        string classname;
        named_entity_extractor ner;
        dlib::deserialize(argv[1]) >> classname >> ner;

        const std::vector<string> tagstr = ner.get_tag_name_strings();
        cout << "The tagger supports "<< tagstr.size() <<" tags:" << endl;
        for (unsigned int i = 0; i < tagstr.size(); ++i)
            cout << " " << tagstr[i] << endl;

        std::vector<string> tokens = tokenize_file(argv[2]);

        std::vector<pair<unsigned long, unsigned long> > chunks;
        std::vector<unsigned long> chunk_tags;
        std::vector<double> chunk_scores;

        ner.predict(tokens, chunks, chunk_tags, chunk_scores);

        cout << "\nNumber of named entities detected: " << chunks.size() << endl;
        for (unsigned int i = 0; i < chunks.size(); ++i)
        {
            cout << " Tag " << chunk_tags[i] << ": ";
            cout << "Score: " << fixed << setprecision(3) << chunk_scores[i] << ": ";
            cout << tagstr[chunk_tags[i]] << ": ";
            for (unsigned long j = chunks[i].first; j < chunks[i].second; ++j)
                cout << tokens[j] << " ";
            cout << endl;
        }

        return EXIT_SUCCESS;
    }
    catch (std::exception& e)
    {
        cout << e.what() << endl;
        return EXIT_FAILURE;
    }
}

The output is:

The tagger supports 4 tags:
   PERSON
   LOCATION
   ORGANIZATION
   MISC

Number of named entities detected: 4
   Tag 0: Score: 1.532: PERSON: john becker
   Tag 2: Score: 0.340: ORGANIZATION: HBU
   Tag 0: Score: 1.652: PERSON: Brian Smith
   Tag 2: Score: 0.471: ORGANIZATION: CMU
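
Each chunk in the program above is a half-open token range: the inner loop runs from chunks[i].first up to, but not including, chunks[i].second. As a small illustration of that convention, here is a sketch of two helpers that turn such ranges back into entity strings. The names chunk_range, chunk_text, and all_chunk_texts are made up for this post and are not part of MITIE; the code only assumes the pair-of-indices layout used above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// A chunk is a half-open token range [first, second), the same layout the
// example program iterates over. (Alias name is hypothetical, not MITIE's.)
using chunk_range = std::pair<unsigned long, unsigned long>;

// Hypothetical helper: join the tokens covered by one chunk into a single
// space-separated string, without a trailing space.
std::string chunk_text(const std::vector<std::string>& tokens,
                       const chunk_range& chunk)
{
    // The range must lie inside the token list.
    assert(chunk.first <= chunk.second && chunk.second <= tokens.size());

    std::string text;
    for (unsigned long j = chunk.first; j < chunk.second; ++j)
    {
        if (!text.empty())
            text += ' ';
        text += tokens[j];
    }
    return text;
}

// Hypothetical helper: collect the text of every chunk, in order.
std::vector<std::string> all_chunk_texts(const std::vector<std::string>& tokens,
                                         const std::vector<chunk_range>& chunks)
{
    std::vector<std::string> out;
    out.reserve(chunks.size());
    for (const chunk_range& c : chunks)
        out.push_back(chunk_text(tokens, c));
    return out;
}
```

In the program above, all_chunk_texts(tokens, chunks) could be called right after ner.predict(...) to get one string per detected entity instead of printing token by token.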

Training a Chinese model

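The main job is to train the full set of word-vector features, because the named entity models and relation models are both built on top of them. MITIE provides a tool for this. We could use cmake to generate a Visual Studio project, but there is usually no need to touch the code, so building directly with cmake is enough. The main steps are: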

D:\MITIE\tools\wordrep>mkdir build
D:\MITIE\tools\wordrep>cd build
D:\MITIE\tools\wordrep\build>cmake ..
D:\MITIE\tools\wordrep\build>cmake --build . --config Release

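We also need to collect a large amount of text, which can be gathered from Wikipedia and Baidu Baike. For how to process it, see the earlier article 《如何使用中文維基百科語料》 (How to Use the Chinese Wikipedia Corpus).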

Then training can begin. The e option generates all the models we need, and data is the corpus directory.

wordrep -e data

In the wordrep source, the e option runs these four steps in sequence:

if (parser.option("e"))
{
    count_words(parser);
    word_vects(parser);
    basic_morph(parser);
    cca_morph(parser);
    return 0;
}

Calling from Java & Python

In both cases the main step is to build the shared library, which is again easy with cmake. Go to the mitielib directory:

D:\MITIE\mitielib>mkdir build
D:\MITIE\mitielib>cd build
D:\MITIE\mitielib\build>cmake ..
D:\MITIE\mitielib\build>cmake --build . --config Release --target install

This produces the required libraries:

-- Install configuration: "Release"
-- Installing: D:/MITIE/mitielib/msvcp140.dll
-- Installing: D:/MITIE/mitielib/vcruntime140.dll
-- Installing: D:/MITIE/mitielib/concrt140.dll
-- Installing: D:/MITIE/mitielib/mitie.lib
-- Installing: D:/MITIE/mitielib/mitie.dll

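After that, Python can call it easily. Java needs a similar step, but its build also requires SWIG. It produces the following shared library and jar, after which Java can call it just as easily.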

-- Install configuration: "Release"
-- Installing: D:/MITIE/mitielib/java/../javamitie.dll
-- Installing: D:/MITIE/mitielib/java/../javamitie.jar
-- Up-to-date: D:/MITIE/mitielib/java/../msvcp140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../vcruntime140.dll
-- Up-to-date: D:/MITIE/mitielib/java/../concrt140.dll