Relation extraction with BiLSTM-Attention in PyTorch

Overview

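Although TensorFlow 2.0 has picked up a fair number of users since its release, PyTorch seems to be more common in natural language processing. Relation extraction is one of the mainstream NLP tasks today, yet I could not find recent, usable open-source code for it. On the one hand, the task is complex, datasets are scarce, and annotation is very expensive, especially for Chinese, so only a handful of datasets exist for it, which limits research in this area. On the other hand, because the task is complex, most programs do not carry over to other settings. There is an open-source PyTorch implementation of BiLSTM-Attention on GitHub, but it targets Python 2 and an old PyTorch version. As far as I can tell, there is currently no open-source BiLSTM-Attention relation extraction code for Python 3 or TF2. In this post I walk through the main code (skipping the optional parts) for implementing BiLSTM-Attention relation extraction with Python 3 and PyTorch. (Students, do yourselves a favor and drop TensorFlow 1.x; it nearly killed me...)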

Relation extraction

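Relation extraction can be regarded as part of information extraction, which is one of the current hot topics in NLP. Information extraction is a core step for tasks such as knowledge graph construction and text summarization, but judging from current research the technology is still immature: it consumes a lot of resources and the results are not yet satisfactory. For building a knowledge graph, entity recognition, relation extraction, and entity fusion are all indispensable. There are many classic models for joint relation extraction, but their results are mediocre. If entity recognition accuracy can be kept high, I still recommend the pipeline approach: recognize the entities first, then extract the relations between them. This post covers a classic model for relation extraction when the entities are already known, BiLSTM-Attention. The model shows up in many places in NLP, especially in text classification. Relation extraction is likewise treated here as sentence classification, in which case it is also called relation classification. The theoretical basis comes from this paper: (paper link).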

I won't go over LSTM and attention in detail; there is plenty of material online. Let's go straight to the architecture:

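This is the architecture diagram from the paper. It is quite simple: characters are embedded and passed to the LSTM layer, the encoded output goes through the attention layer, and then the target is predicted. At first glance this is just the simplest text classification structure. So how does it handle relation extraction? At embedding time, entity features are added and fused with the sentence features, and the result is handed to the neural network for relation classification.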
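One common way to encode entity features, and the one that the pos1 and pos2 inputs of the model later in this post suggest, is a relative position index per character. The preprocessing itself is not shown in this post, so the following is only a minimal sketch under assumptions: the function name `relative_positions`, the clipping distance `max_dist`, and the example sentence are all hypothetical.

```python
def relative_positions(tokens, entity, max_dist=40):
    """Return one position index per token, relative to the first occurrence of `entity`.

    Assumes character-level tokens, so a character offset equals a token index.
    """
    start = "".join(tokens).find(entity)
    if start < 0:
        raise ValueError(f"entity {entity!r} not found in sentence")
    positions = []
    for i in range(len(tokens)):
        # Signed distance to the entity start, clipped to [-max_dist, max_dist].
        d = max(-max_dist, min(max_dist, i - start))
        # Shift to a non-negative value so it can index an embedding table.
        positions.append(d + max_dist)
    return positions


# Hypothetical sentence with two person entities.
sentence = list("張三是李四的父親")
pos1 = relative_positions(sentence, "張三")
pos2 = relative_positions(sentence, "李四")
```

With this scheme the position embedding table needs at least 2 * max_dist + 1 rows, and the index max_dist marks the entity's first character.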

Dataset

Dataset preview

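This is the most commonly seen dataset in relation extraction, a person-relation dataset. The first and second columns are the entities, the third column is the relation between them, and the rest of the line is the sentence containing both entities. There are 11 classes plus unknown, listed in the file relation2id.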
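To make the column layout concrete, here is a small sketch that parses one line into its four fields. It assumes tab-separated columns, as the preprocessing code in this post does; the `Sample` type, the `parse_line` helper, and the sample line are made up for illustration and are not taken from the dataset.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    entity1: str
    entity2: str
    relation: str
    sentence: str


def parse_line(line, sep="\t"):
    # Split into at most 4 fields so separators inside the sentence survive.
    parts = line.rstrip("\n").split(sep, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
    return Sample(*parts)


# Hypothetical line: entity 1, entity 2, relation label, sentence.
sample = parse_line("張三\t李四\t父母\t張三是李四的父親\n")
print(sample.relation, sample.sentence)
```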

Data preprocessing

    import pandas as pd

    def get_label_distribution(relation_file_path, data_file_path):
        # Map each relation name to its integer id (one "name id" pair per line).
        relation2id = {}
        with open(relation_file_path, "r", encoding="utf-8") as fr:
            for line in fr:
                fields = line.strip().split(" ")
                relation2id[fields[0]] = int(fields[1])
        # Collect the label id of every sample; the relation is the third tab-separated column.
        label = []
        with open(data_file_path, encoding="utf-8") as fr:
            for line in fr:
                fields = line.strip().split("\t")
                label.append(relation2id[fields[2]])
        # Count how many samples each label id has.
        df = pd.Series(label).value_counts()
        return df

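The code above computes the label distribution. It reads the relation file into the relation2id dictionary, and label collects the label ids found in the dataset.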
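For a quick check without pandas, the same counting can be sketched with the standard library. The relation names, ids, and lines below are invented for the example, and `label_counts` is a hypothetical helper rather than part of the original code.

```python
from collections import Counter


def label_counts(lines, relation2id, sep="\t"):
    """Count samples per label id; the relation name is the third column."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        counts[relation2id[fields[2]]] += 1
    return counts


# Hypothetical relation table and data lines.
relation2id = {"unknown": 0, "parent": 1, "spouse": 2}
lines = [
    "A\tB\tparent\tA is the father of B\n",
    "C\tD\tspouse\tC married D\n",
    "E\tF\tparent\tE is the mother of F\n",
    "G\tH\tunknown\tG met H\n",
]
counts = label_counts(lines, relation2id)
print(counts.most_common())
```

A strongly skewed distribution here is usually the first hint that accuracy alone will be a misleading metric.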

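    from collections.abc import Iterable

    def flatten_lists(lists):
        # Flatten one level: extend with the items of inner lists, append everything else.
        flatten_list = []
        for l in lists:
            if isinstance(l, list):
                flatten_list += l
            else:
                flatten_list.append(l)
        return flatten_list

    def flat_gen(x):
        # Recursively flatten nested iterables; strings are treated as single elements.
        def is_element(el):
            return not (isinstance(el, Iterable) and not isinstance(el, str))
        for el in x:
            if is_element(el):
                yield el
            else:
                yield from flat_gen(el)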

The code above flattens the whole data file into a single flat list.
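The two helpers differ in how deep they go, which a small example makes visible. The block below restates both in compact form so it runs on its own (using `collections.abc.Iterable`, which is where `Iterable` lives in current Python 3); the nested input is made up.

```python
from collections.abc import Iterable


def flatten_lists(lists):
    # One level only: inner lists are spliced in, other items are appended.
    out = []
    for item in lists:
        if isinstance(item, list):
            out += item
        else:
            out.append(item)
    return out


def flat_gen(x):
    # Fully recursive; strings are kept whole instead of being split into characters.
    for el in x:
        if isinstance(el, Iterable) and not isinstance(el, str):
            yield from flat_gen(el)
        else:
            yield el


nested = [[1, 2], [3, [4, 5]], "ab", 6]
one_level = flatten_lists(nested)
flat = list(flat_gen(nested))
```

Here `one_level` still contains the inner list `[4, 5]`, while `flat` does not contain any nested lists.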

The model evaluation code is omitted; it follows the usual pattern.

Building the model

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(1)

    class BiLSTM_ATT(nn.Module):
        def __init__(self, input_size, output_size, config, pre_embedding):
            super(BiLSTM_ATT, self).__init__()
            self.batch = config['BATCH']
            self.input_size = input_size
            self.embedding_dim = config['EMBEDDING_DIM']
            self.hidden_dim = config['HIDDEN_DIM']
            self.tag_size = output_size
            self.pos_size = config['POS_SIZE']
            self.pos_dim = config['POS_DIM']
            self.pretrained = config['pretrained']
            if self.pretrained:
                # Start from pretrained vectors but keep fine-tuning them.
                self.word_embeds = nn.Embedding.from_pretrained(torch.FloatTensor(pre_embedding), freeze=False)
            else:
                self.word_embeds = nn.Embedding(self.input_size, self.embedding_dim)
            # One position embedding table per entity.
            self.pos1_embeds = nn.Embedding(self.pos_size, self.pos_dim)
            self.pos2_embeds = nn.Embedding(self.pos_size, self.pos_dim)
            self.dense = nn.Linear(self.hidden_dim, self.tag_size, bias=True)
            self.relation_embeds = nn.Embedding(self.tag_size, self.hidden_dim)
            # Each direction gets hidden_dim // 2 units, so the concatenated output has hidden_dim.
            self.lstm = nn.LSTM(input_size=self.embedding_dim + self.pos_dim * 2, hidden_size=self.hidden_dim // 2, num_layers=1, bidirectional=True)
            self.hidden2tag = nn.Linear(self.hidden_dim, self.tag_size)
            self.dropout_emb = nn.Dropout(p=0.5)
            self.dropout_lstm = nn.Dropout(p=0.5)
            self.dropout_att = nn.Dropout(p=0.5)
            self.hidden = self.init_hidden()
            # Both parameters carry a batch dimension, so every batch must hold exactly self.batch samples.
            self.att_weight = nn.Parameter(torch.randn(self.batch, 1, self.hidden_dim))
            self.relation_bias = nn.Parameter(torch.randn(self.batch, self.tag_size, 1))

        def init_hidden(self):
            return torch.randn(2, self.batch, self.hidden_dim // 2)

        def init_hidden_lstm(self):
            # (h_0, c_0) for a single-layer bidirectional LSTM.
            return (torch.randn(2, self.batch, self.hidden_dim // 2),
                    torch.randn(2, self.batch, self.hidden_dim // 2))

        def attention(self, H):
            M = torch.tanh(H)  # nonlinearity; size: (batch_size, hidden_dim, seq_len)
            a = F.softmax(torch.bmm(self.att_weight, M), dim=2)  # size: (batch_size, 1, seq_len)
            a = torch.transpose(a, 1, 2)  # (batch_size, seq_len, 1)
            return torch.bmm(H, a)  # (batch_size, hidden_dim, 1)
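Read off the attention method above, the computation for a single sentence can be written as follows. The symbols are my own notation, not the post's: H is the LSTM output of size d × T (d = hidden_dim, T = seq_len), w is that sample's row of att_weight, α holds the attention weights, and r is the pooled sentence vector.

```latex
\begin{aligned}
M &= \tanh(H), & M &\in \mathbb{R}^{d \times T} \\
\alpha &= \operatorname{softmax}\left(w^{\top} M\right), & \alpha &\in \mathbb{R}^{1 \times T} \\
r &= H \alpha^{\top}, & r &\in \mathbb{R}^{d}
\end{aligned}
```

The softmax runs over the T time steps, so r is a weighted average of the columns of H. The forward method then applies one more tanh to r before scoring the relations.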

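The LSTM input is the concatenation of the entity 1 position embedding, the entity 2 position embedding, and the word embedding. The LSTM's output holds the hidden states h of the last layer at every time step.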

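        def forward(self, sentence, pos1, pos2):
            self.hidden = self.init_hidden_lstm()
            # Concatenate word and position embeddings: (batch, seq_len, embedding_dim + 2 * pos_dim)
            embeds = torch.cat((self.word_embeds(sentence), self.pos1_embeds(pos1), self.pos2_embeds(pos2)), dim=2)
            # nn.LSTM expects (seq_len, batch, features) by default.
            embeds = torch.transpose(embeds, 0, 1)
            lstm_out, self.hidden = self.lstm(embeds, self.hidden)
            # (seq_len, batch, hidden_dim) -> (batch, hidden_dim, seq_len)
            lstm_out = lstm_out.permute(1, 2, 0)
            lstm_out = self.dropout_lstm(lstm_out)
            att_out = torch.tanh(self.attention(lstm_out))  # (batch, hidden_dim, 1)
            # Embed every relation id once per sample: (batch, tag_size, hidden_dim)
            relation = torch.arange(self.tag_size, dtype=torch.long).repeat(self.batch, 1)
            relation = self.relation_embeds(relation)
            # Score each relation against the sentence vector: (batch, tag_size, 1)
            out = torch.add(torch.bmm(relation, att_out), self.relation_bias)
            out = F.softmax(out, dim=1)
            return out.view(self.batch, -1)  # (batch, tag_size)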

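This part is mostly dimension juggling, reshaping the data into the dimensions the model needs. Some of the configuration used above is defined centrally in a separate folder. That is the basic model.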
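The configuration file itself is not shown in this post. As a rough sketch, the dictionary passed to BiLSTM_ATT needs at least the keys the constructor reads. The values below are placeholders I picked for illustration, not the author's settings, and `check_config` is a hypothetical helper.

```python
# Keys read by BiLSTM_ATT.__init__; the values are illustrative assumptions.
config = {
    "BATCH": 128,          # fixed batch size (att_weight and relation_bias depend on it)
    "EMBEDDING_DIM": 100,  # word/character embedding size
    "HIDDEN_DIM": 200,     # total BiLSTM output size (both directions)
    "POS_SIZE": 81,        # number of distinct position indices
    "POS_DIM": 25,         # position embedding size
    "pretrained": False,   # whether to load pretrained word vectors
}

REQUIRED_KEYS = {"BATCH", "EMBEDDING_DIM", "HIDDEN_DIM", "POS_SIZE", "POS_DIM", "pretrained"}


def check_config(cfg):
    """Fail early on missing keys or a hidden size the BiLSTM cannot split evenly."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise KeyError(f"missing config keys: {sorted(missing)}")
    if cfg["HIDDEN_DIM"] % 2 != 0:
        # The model gives each LSTM direction HIDDEN_DIM // 2 units.
        raise ValueError("HIDDEN_DIM must be even")
    return cfg


check_config(config)
```

Because att_weight and relation_bias are created with a batch dimension, every batch fed to the model must contain exactly BATCH samples; with a PyTorch DataLoader, passing drop_last=True is one way to guarantee that.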