Sharing a Simple Yet Technically Advanced NLP Library

Date: 2019-11-13

Source: 商業新知網; original title: "Flair: A Simple Yet Technically Advanced NLP Library"

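Over the past few years, the field of NLP (natural language processing) has seen several remarkable breakthroughs, such as ULMFiT, ELMo, Facebook's PyText, and Google's BERT.

These techniques have pushed forward state-of-the-art NLP research, language modeling in particular. Given a sequence of preceding words, we can now predict the next sentence.

More importantly, machines have begun to capture the key element that long kept them from making sense of language.

That element is context!

Understanding context has removed a barrier that had held back progress in NLP. Today we discuss one library built around it: Flair.

Until now, words have been represented either as sparse matrices or as word embeddings such as GloVe, BERT, and ELMo. There is always room for improvement, though, and Flair sets out to address the shortcomings.

In this article, we first look at what Flair is and the concept behind it, and then dive into using Flair for NLP tasks.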

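I. What Is the Flair Library?

Flair is a simple natural language processing (NLP) library developed by Zalando Research. Its framework is built directly on PyTorch, one of the most widely used deep learning frameworks. The Zalando Research team has also released several pre-trained models for the following NLP tasks:

1. Named entity recognition (NER): recognizes whether a word in the text refers to a person, a location, or another name.

2. Part-of-speech tagging (PoS): tags every word in the given text with the part of speech it belongs to.

3. Text classification: classifies (labels) text according to given criteria.

4. Training custom models: building our own custom models.

All of these models look promising. What really caught my attention, though, was seeing Flair surpass several state-of-the-art results in NLP. [Results table from the original article not preserved in this copy.]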

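Note: the F1 score is an evaluation metric used mainly for classification tasks. In machine learning projects it is often preferred over accuracy when evaluating a model, because it takes the distribution of the classes into account.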
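To make the note above concrete, here is a minimal sketch of how F1 is computed from prediction counts and why it behaves differently from accuracy on imbalanced data. The counts are made up for illustration (a hypothetical set of 1,000 tweets with 50 positives), and `f1_from_counts` is our own helper name, not part of any library.

```python
def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model A: labels every tweet as negative.
tp, fp, fn, tn = 0, 0, 50, 950
accuracy_all_negative = (tp + tn) / (tp + fp + fn + tn)
f1_all_negative = f1_from_counts(tp, fp, fn)

# Hypothetical model B: finds 30 of the 50 positives, with 20 false alarms.
f1_model_b = f1_from_counts(tp=30, fp=20, fn=20)

print(accuracy_all_negative, f1_all_negative, f1_model_b)
```

Model A reaches high accuracy while detecting nothing, and its F1 is zero. That is the sense in which F1 accounts for the class distribution.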

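II. What Are the Advantages of the Flair Library?

Flair has many powerful features. Some of the most prominent are:

· It includes popular and state-of-the-art word embeddings such as GloVe, BERT, ELMo, and character embeddings. The Flair API makes them very easy to use.

· Flair's interface lets us combine different word embeddings and use them to embed documents, which significantly improves results.

· 'Flair Embedding' is the signature embedding provided by the library. It is powered by contextual string embeddings, which we look at in detail in the next section.

· Flair supports several languages and is expected to add more.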

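III. Introduction to Contextual String Embeddings for Sequence Labeling

Context is very important when working on NLP tasks. Learning to predict the next character from the previous characters forms the basis of sequence modeling.

Contextual string embeddings are a new type of embedding produced by tapping the internal states of a trained character language model. Put simply, they use the internals of the character model so that a word can have different meanings in different sentences.

Note: language models and character models are probability distributions over words or characters, so each new word or character depends on the words or characters that came before it.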
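As a rough illustration of the note above, the sketch below estimates the simplest possible character model: a bigram model that predicts the next character from only the one before it. This is a toy built by counting, not the neural character language model Flair uses; the function name and the sample text are our own.

```python
from collections import Counter, defaultdict


def train_char_bigram(text):
    """Estimate P(next_char | previous_char) by counting character pairs."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    model = {}
    for prev, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalise the counts into a probability distribution.
        model[prev] = {ch: n / total for ch, n in next_counts.items()}
    return model


toy_text = "a book about books"
model = train_char_bigram(toy_text)

# Distribution over the character that follows 'o' in the toy text.
print(model["o"])
```

A real character language model conditions on the whole preceding sequence rather than a single character, but the object it learns is the same kind of distribution.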

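Two main factors drive contextual string embeddings:

1. Words are treated as sequences of characters (with no notion of a word). In this respect they work much like character embeddings.

2. Embeddings are contextualized by their surrounding text. This means the same word can have different embeddings depending on its context. Much like natural human language, isn't it? The same word can mean different things in different situations.

Let's look at an example to see what this means:

· Case 1: Reading a book

· Case 2: Please book a train ticket

Explanation:

· In case 1, "book" is a noun

· In case 2, "book" is a verb

Language is such a wonderful and complex thing!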
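The "book" example can be mimicked with a toy sketch. We run a small recurrent update over the characters of a sentence and take the state reached after the last character of a word as that word's vector. Everything here is an assumption made for illustration: the weights are hand-picked constants, nothing is trained, and the vectors carry no linguistic meaning. The only point is that the state depends on the text that precedes the word.

```python
import math


def forward_states(text, size=4):
    """Return the toy hidden state after each character of `text`.

    Hypothetical fixed-weight recurrent update, NOT a trained language model.
    """
    state = [0.0] * size
    states = []
    for ch in text:
        code = ord(ch) / 128.0  # crude numeric encoding of the character
        state = [
            math.tanh(0.5 * state[k] + 0.3 * state[(k + 1) % size] + 0.1 * (k + 1) * code)
            for k in range(size)
        ]
        states.append(state)
    return states


def word_vector(sentence, word):
    """Use the state after the word's last character as its toy 'embedding'."""
    end = sentence.index(word) + len(word) - 1
    return forward_states(sentence)[end]


noun_vec = word_vector("reading a book", "book")
verb_vec = word_vector("please book a train ticket", "book")

# Same four characters, different preceding text, different vectors.
print(noun_vec)
print(verb_vec)
```

Flair's actual embeddings come from trained forward and backward character language models, so their states encode far more than this toy does, but the mechanism of reading off an internal state at the word boundary is the same idea.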

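IV. Performing NLP Tasks in Python Using Flair

It's time to put Flair to the test! We have covered what this remarkable library is all about. Now let's see first-hand how it runs on our machines.

We will use Flair to perform the following NLP tasks in Python:

1. Text classification using Flair embeddings

2. Part-of-speech (PoS) tagging, compared with the NLTK library

Setting Up the Environment

We will run our code in Google Colaboratory. One of the best things about Colab is that it provides GPU support for free, which is very handy for training deep learning models.

Why use Colab?

· Completely free

· Fairly good hardware configuration

· Runs in your web browser, so even an old machine with outdated hardware can use it

· Connects to your Google Drive

· Integrates well with GitHub

All you need is a stable internet connection.

About the Dataset

We will work on the Twitter Sentiment Analysis practice problem.

The problem statement for this challenge is:

The objective of this task is to detect hate speech in tweets. For simplicity, we say a tweet contains hate speech if it carries racist or sexist sentiment. The task is therefore to classify racist or sexist tweets apart from other tweets.

1. Text Classification Using Flair Embeddings

Step 1: Import the data into Colab's local environment: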

# Install the PyDrive wrapper & import libraries.

# This only needs to be done once per notebook.

!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth

from pydrive.drive import GoogleDrive

from google.colab import auth

from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.

# This only needs to be done once per notebook.

auth.authenticate_user()

gauth = GoogleAuth()

gauth.credentials = GoogleCredentials.get_application_default()

drive = GoogleDrive(gauth)

# Download a file based on its file ID.

# A file ID looks like: laggVyWshwcyP6kEI-y_W3P8D26sz

file_id = '1GhyH4k9C4uPRnMAMKhJYOqa-V9Tqt4q8' ### File ID ###

data = drive.CreateFile({'id': file_id})

#print('Downloaded content "{}"'.format(downloaded.GetContentString()))

You can find the file ID in the shareable link of the dataset file in your Drive.

Import the dataset into the Colab notebook:

import io

import pandas as pd

data = pd.read_csv(io.StringIO(data.GetContentString()))

data.head()

All emojis and symbols have been removed from the data, and the characters have been converted to lowercase.

Step 2: Install Flair

# download flair library #

import torch

!pip install flair

import flair

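A Brief Look at Flair's Data Types

The library has two types of objects: Sentence and Token. A Sentence holds a textual sentence and is essentially a list of Tokens: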

from flair.data import Sentence

# create a sentence #

sentence = Sentence('Blogs of Analytics Vidhya are Awesome.')

# print the sentence to see what’s in it. #

print(sentence)

Step 3: Prepare the text for use with Flair

#extracting the tweet part#

text = data['tweet']

## txt is a list of tweets ##

txt = text.tolist()

print(txt[:10])

Step 4: Word embeddings with Flair

## Importing the embeddings ##
from flair.embeddings import WordEmbeddings
from flair.embeddings import CharacterEmbeddings
from flair.embeddings import StackedEmbeddings
from flair.embeddings import FlairEmbeddings
# BertEmbeddings and ELMoEmbeddings belong to the Flair releases of this
# article's era; newer releases may no longer provide them. Un-comment
# these imports only if you enable the matching embeddings below.
#from flair.embeddings import BertEmbeddings
#from flair.embeddings import ELMoEmbeddings

### Initialising embeddings (un-comment to use others) ###
#glove_embedding = WordEmbeddings('glove')
#character_embeddings = CharacterEmbeddings()
flair_forward = FlairEmbeddings('news-forward-fast')
flair_backward = FlairEmbeddings('news-backward-fast')
#bert_embedding = BertEmbeddings()
#elmo_embedding = ELMoEmbeddings()

# Stack the forward and backward Flair embeddings #
stacked_embeddings = StackedEmbeddings(embeddings=[
    flair_forward,
    flair_backward
])

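You will notice that we have just used some of the most popular word embeddings above. Remove the comment marker '#' to use the other embeddings.

Now you might ask: what exactly are "stacked embeddings"? Here we combine multiple embeddings to build a powerful word representation model without much complexity. Quite like ensembling, isn't it?

We stack only the Flair embeddings here to keep computation time down in this article. Feel free to play with this and other embeddings in any combination you like.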
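Conceptually, stacking concatenates the vectors that each embedding produces for the same token. The sketch below shows this with plain Python lists; the vectors and their sizes are made up for illustration, and `stack_embeddings` is our own helper, not a Flair function.

```python
def stack_embeddings(*vectors):
    """Concatenate several per-token vectors into one longer vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Hypothetical vectors for one token from three different embeddings.
glove_like = [0.1, 0.2, 0.3]
flair_forward_like = [0.5, -0.5]
flair_backward_like = [0.0, 1.0]

token_vector = stack_embeddings(glove_like, flair_forward_like, flair_backward_like)
print(len(token_vector), token_vector)
```

The length of the stacked vector is the sum of the individual lengths, which is why adding more embeddings increases both the representation size and the computation time.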

Testing the stacked embeddings:

# create a sentence #
sentence = Sentence('Analytics Vidhya blogs are Awesome .')

# embed words in sentence #
stacked_embeddings.embed(sentence)

for token in sentence:
    print(token.embedding)

# data type and size of embedding #
print(type(token.embedding))

# storing size (length) #
z = token.embedding.size()[0]

Step 5: Vectorize the text

We will demonstrate this with two approaches.

· Mean of the word embeddings within a tweet

In this approach we compute the following:

For each sentence:

1. Generate a word embedding for each word

2. Take the mean of the word embeddings to get the sentence embedding

from tqdm import tqdm ## tracks progress of loop ##

# creating a tensor for storing sentence embeddings #
s = torch.zeros(0, z)

# iterating Sentence (tqdm tracks progress) #
for tweet in tqdm(txt):
    # empty tensor for words #
    w = torch.zeros(0, z)
    sentence = Sentence(tweet)
    stacked_embeddings.embed(sentence)
    # for every word #
    for token in sentence:
        # storing word embeddings of each word in a sentence #
        # (detached and moved to CPU so they match w and convert to NumPy later) #
        w = torch.cat((w, token.embedding.detach().cpu().view(-1, z)), 0)
    # storing sentence embeddings (mean of embeddings of all words) #
    s = torch.cat((s, w.mean(dim=0).view(-1, z)), 0)
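The averaging step in the loop above can be checked by hand on tiny numbers. Here is a minimal sketch with plain lists instead of tensors; the three two-dimensional "word vectors" are made up, and `mean_pool` is our own helper name.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    count = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / count for d in range(dim)]


# Three hypothetical 2-dimensional word vectors for one tweet.
word_vectors = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
sentence_vector = mean_pool(word_vectors)
print(sentence_vector)
```

Whatever the number of words, the result has the same dimension as a single word vector, which is what lets tweets of different lengths feed the same classifier.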

· Document embedding: vectorizing the entire tweet

from flair.embeddings import DocumentPoolEmbeddings

### initialize the document embeddings, mode = mean ###
document_embeddings = DocumentPoolEmbeddings([
    flair_backward,
    flair_forward
])

# Storing size of embedding #
z = document_embeddings.embedding_length

### Vectorising text ###
# creating a tensor for storing sentence embeddings #
s = torch.zeros(0, z)

# iterating Sentences #
for tweet in tqdm(txt):
    sentence = Sentence(tweet)
    document_embeddings.embed(sentence)
    # Adding document embeddings to the tensor #
    s = torch.cat((s, sentence.embedding.detach().cpu().view(-1, z)), 0)

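You can choose either approach for your model. Now that our text is vectorized, we can feed it to our machine learning model!

Step 6: Partition the data into training and test sets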

## tensor to numpy array ##

X = s.numpy()

## Test set ##

test = X[31962:,:]

train = X[:31962,:]

# extracting labels of the training set #

target = data['label'][data['label'].isnull()==False].values

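Step 7: Build the model and define a custom evaluator (for F1 score)

· Defining a custom F1 evaluator for XGBoost

import numpy as np
from sklearn.metrics import f1_score

def custom_eval(preds, dtrain):
    # true labels as integers #
    labels = dtrain.get_label().astype(int)
    # a predicted probability of at least 0.3 counts as the positive class #
    preds = (preds >= 0.3).astype(int)
    return [('f1_score', f1_score(labels, preds))]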

· Building the XGBoost model

import xgboost as xgb

from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score

### Splitting training set ###

x_train, x_valid, y_train, y_valid = train_test_split(train, target,

random_state=42,

test_size=0.3)

### XGBoost compatible data ###

dtrain = xgb.DMatrix(x_train,y_train)

dvalid = xgb.DMatrix(x_valid, label = y_valid)

### defining parameters ###

params = {

'colsample': 0.9, # not a documented XGBoost parameter (compare colsample_bytree below); likely ignored

'colsample_bytree': 0.5,

'eta': 0.1,

'max_depth': 8,

'min_child_weight': 6,

'objective': 'binary:logistic',

'subsample': 0.9

}

### Training the model ###

xgb_model = xgb.train(

params,

dtrain,

feval= custom_eval,

num_boost_round= 1000,

maximize=True,

evals=[(dvalid, "Validation")],

early_stopping_rounds=30

)

Our model is now trained and ready for evaluation!

Step 8: Time to predict!

### Reformatting test set for XGB ###

dtest = xgb.DMatrix(test)

### Predicting ###

predict = xgb_model.predict(dtest) # predicting

We can upload the predictions to the practice problem page, with 0.2 as the probability threshold.
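The model outputs probabilities, so they have to be turned into 0/1 labels before submission. The sketch below shows one way to apply a threshold; the probabilities are made up, and `to_labels` is our own helper name. The article mentions 0.2 for the submission and uses 0.3 inside the custom evaluator, so both are shown.

```python
def to_labels(probabilities, threshold=0.2):
    """Label a tweet 1 (hate speech) when its probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical predicted probabilities for five tweets.
probs = [0.05, 0.19, 0.2, 0.31, 0.9]

labels_at_02 = to_labels(probs, threshold=0.2)
labels_at_03 = to_labels(probs, threshold=0.3)
print(labels_at_02)
print(labels_at_03)
```

A lower threshold flags more tweets as positive, which usually raises recall at the cost of precision. Which value gives the best F1 depends on the data and is worth checking on the validation split.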

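Note: according to Flair's official documentation, an embedding works better when stacked with other embeddings. But there is a catch.

Computing them on a CPU can take a very long time. Using a GPU is strongly recommended for faster results, and you can use the free one in Colab.

2. Part-of-Speech (PoS) Tagging

We will use a subset of the CoNLL-2003 dataset, a pre-tagged dataset in English.

Step 1: Import the dataset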

### file was uploaded manually to local environment of Colab ###

data = open('pos-tagged_corpus.txt','r')

txt = data.read()

#print(txt)

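Each line of the data file contains one word together with its tags, and blank lines mark sentence boundaries.

Step 2: Extract the sentences and PoS tags from the dataset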

from tqdm import tqdm ## tracks progress of loops ##

### converting text into a list of lines (each word with its tags) ###
txt = txt.split('\n')

### removing DOCSTART (document header) ###
txt = [x for x in txt if x != '-DOCSTART- -X- -X- O']

### check ###
for i in range(10):
    print(txt[i])
print('-'*10)

### Extracting Sentences ###
# initialize empty list for storing words #
words = []
# initialize empty list for storing sentences #
corpus = []

for i in tqdm(txt):
    ## if blank line encountered ##
    if i == '':
        ## previous words form a sentence ##
        corpus.append(' '.join(words))
        ## refresh word list ##
        words = []
    else:
        ## word at index 0 ##
        words.append(i.split()[0])

# did it work? #
for i in range(10):
    print(corpus[i])
print('-'*10)

### Extracting POS ###
# initialize empty list for storing word pos #
w_pos = []
# initialize empty list for storing sentence pos #
POS = []

for i in tqdm(txt):
    ## blank line = sentence boundary ##
    if i == '':
        ## previous tags form a sentence POS ##
        POS.append(' '.join(w_pos))
        ## refresh tag list ##
        w_pos = []
    else:
        ## pos tag from index 1 ##
        w_pos.append(i.split()[1])

# did it work? #
for i in range(10):
    print(corpus[i])
    print(POS[i])

### Removing blanks from sentence and pos ###
corpus = [x for x in corpus if x != '']
POS = [x for x in POS if x != '']

### Check ###
for i in range(10):
    print(corpus[i])
    print(POS[i])

We have extracted the essentials we need from the dataset. Let's move on to step 3.
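For readers who prefer a single pass, the extraction in step 2 can also be sketched as one function that returns sentences and tag sequences together. This is an alternative written for illustration, not the article's code; `parse_conll` and the three-token sample are our own, and we assume the word sits in column 0 and the PoS tag in column 1, as in the file described above.

```python
def parse_conll(text):
    """Split CoNLL-style text into parallel lists of sentences and PoS tag strings."""
    sentences, tag_sequences = [], []
    words, tags = [], []
    for line in text.split("\n"):
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            continue  # document header, carries no tokens
        if not line:
            # Blank line: close the current sentence, if there is one.
            if words:
                sentences.append(" ".join(words))
                tag_sequences.append(" ".join(tags))
                words, tags = [], []
            continue
        columns = line.split()
        words.append(columns[0])
        tags.append(columns[1])
    if words:
        # Flush the last sentence when the text does not end with a blank line.
        sentences.append(" ".join(words))
        tag_sequences.append(" ".join(tags))
    return sentences, tag_sequences


sample = (
    "-DOCSTART- -X- -X- O\n"
    "\n"
    "EU NNP B-NP B-ORG\n"
    "rejects VBZ B-VP O\n"
    "\n"
    "Peter NNP B-NP B-PER\n"
)
sentences, tag_sequences = parse_conll(sample)
print(sentences)
print(tag_sequences)
```

Building both lists in the same loop keeps each sentence aligned with its tags and avoids producing empty entries that need filtering afterwards.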

Step 3: Tag the text with NLTK and Flair

· Tagging with NLTK

First, import the required libraries:

import nltk

nltk.download('tagsets')

nltk.download('punkt')

nltk.download('averaged_perceptron_tagger')

from nltk import word_tokenize

This downloads all the files needed for tagging with NLTK.

### Tagging the corpus with NLTK ###
# for storing results #
nltk_pos = []

## for every sentence ##
for i in tqdm(corpus):
    # tokenize sentence #
    text = word_tokenize(i)
    # tag words #
    tagged = nltk.pos_tag(text)
    # store #
    nltk_pos.append(tagged)

The POS tags come in the following format:

[('token_1', 'tag_1'), ..., ('token_n', 'tag_n')]

Let's extract the POS tags from it:

### Extracting final pos by nltk in a list ###
tmp = []
nltk_result = []

## every tagged sentence ##
for i in tqdm(nltk_pos):
    tmp = []
    ## every word ##
    for j in i:
        ## append tag (from index 1) ##
        tmp.append(j[1])
    # join the tags of every sentence #
    nltk_result.append(' '.join(tmp))

### check ###
for i in range(10):
    print(nltk_result[i])
    print(corpus[i])

The NLTK tags are ready.

· Now, tagging with Flair

First, import the libraries:

!pip install flair

from flair.data import Sentence

from flair.models import SequenceTagger

Tagging with Flair:

# initiating object #
pos = SequenceTagger.load('pos-fast')

# for storing pos tagged string #
f_pos = []

## for every sentence ##
for i in tqdm(corpus):
    sentence = Sentence(i)
    pos.predict(sentence)
    ## append tagged sentence ##
    f_pos.append(sentence.to_tagged_string())

### check ###
for i in range(10):
    print(f_pos[i])
    print(corpus[i])

The result has the following format:

token_1 <tag_1> token_2 <tag_2> ... token_n <tag_n>

Note: the Flair library offers different taggers; feel free to tinker and experiment.

Extracting the sentence-wise tags, as we did for NLTK

import re

### Extracting POS tags ###
## in every sentence by index ##
for i in tqdm(range(len(f_pos))):
    ## for every word of the ith sentence ##
    for j in corpus[i].split():
        ## remove the first occurrence of that word from the ith sentence in f_pos ##
        f_pos[i] = str(f_pos[i]).replace(j, "", 1)
    ## removing < > symbols ##
    for j in ['<', '>']:
        f_pos[i] = str(f_pos[i]).replace(j, "")
    ## removing redundant spaces ##
    f_pos[i] = re.sub(' +', ' ', str(f_pos[i]))
    f_pos[i] = str(f_pos[i]).lstrip()

### check ###
for i in range(10):
    print(f_pos[i])
    print(corpus[i])

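Aha! We have finally tagged the corpus and extracted the tags sentence by sentence. We are now free to remove all punctuation and special symbols.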
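Pulling the tags out of a tagged string can also be sketched with a single regular expression instead of word-by-word replacement. This assumes the `token <tag>` format described above and that tokens themselves never contain angle brackets; the sample string is made up, and `tags_from_tagged_string` is our own helper name.

```python
import re


def tags_from_tagged_string(tagged):
    """Return the tags from a 'token <tag> token <tag>' string, in order."""
    return re.findall(r"<([^<>\s]+)>", tagged)


tagged = "EU <NNP> rejects <VBZ> German <JJ> call <NN> . <.>"
tags = tags_from_tagged_string(tagged)
print(" ".join(tags))
```

Matching the brackets directly avoids a pitfall of removing words by text replacement, where a word that also appears inside a tag or another token could be removed in the wrong place.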

### Removing symbols and redundant space ###
## in every sentence by index ##
for i in tqdm(range(len(corpus))):
    # removing symbols #
    corpus[i] = re.sub('[^a-zA-Z]', ' ', str(corpus[i]))
    POS[i] = re.sub('[^a-zA-Z]', ' ', str(POS[i]))
    f_pos[i] = re.sub('[^a-zA-Z]', ' ', str(f_pos[i]))
    nltk_result[i] = re.sub('[^a-zA-Z]', ' ', str(nltk_result[i]))

    ## removing HYPH and SYM (they are tags for symbols) ##
    f_pos[i] = str(f_pos[i]).replace('HYPH', "")
    f_pos[i] = str(f_pos[i]).replace('SYM', "")
    POS[i] = str(POS[i]).replace('SYM', "")
    POS[i] = str(POS[i]).replace('HYPH', "")
    nltk_result[i] = str(nltk_result[i].replace('HYPH', ''))
    nltk_result[i] = str(nltk_result[i].replace('SYM', ''))

    ## removing redundant space ##
    POS[i] = re.sub(' +', ' ', str(POS[i]))
    f_pos[i] = re.sub(' +', ' ', str(f_pos[i]))
    corpus[i] = re.sub(' +', ' ', str(corpus[i]))
    nltk_result[i] = re.sub(' +', ' ', str(nltk_result[i]))

We have tagged the corpus with NLTK and Flair, extracted the tags, and removed all unnecessary elements. Let's take a look:

for i in range(1000):
    print('corpus ' + corpus[i])
    print('actual ' + POS[i])
    print('nltk ' + nltk_result[i])
    print('flair ' + f_pos[i])
    print('-'*50)

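Output: [sample output from the original article not preserved in this copy]

The results look quite convincing!

Step 4: Evaluate the PoS tags from NLTK and Flair against the tagged dataset

Here we evaluate the tags word by word with the help of a custom evaluator.

Note that in the example above, the actual POS tags contain redundancies compared with the NLTK and Flair tags (shown in bold in the original output). We therefore skip sentences whose POS tag sequences differ in length.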

### EVALUATION FUNCTION ###
def eval2(x, y):
    # correct matches #
    count = 0
    # total comparisons made #
    comp = 0
    ## for every sentence index in dataset ##
    for i in range(len(x)):
        x_tags = x[i].split()
        y_tags = y[i].split()
        ## if the sentence lengths match ##
        if len(x_tags) == len(y_tags):
            ## compare tag by tag (indexing the string itself would compare characters) ##
            for j in range(len(x_tags)):
                if x_tags[j] == y_tags[j]:
                    ## Match! ##
                    count = count + 1
                comp = comp + 1
    return (count/comp)*100
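A small worked example helps confirm what a word-by-word evaluator should report. The sketch below is our own standalone version with made-up tag sequences, small enough to check by hand; `token_accuracy` is a hypothetical helper name. It splits each tag string into tokens before comparing and skips sentence pairs whose lengths differ.

```python
def token_accuracy(gold, predicted):
    """Percentage of matching tags over sentence pairs of equal length."""
    correct = 0
    compared = 0
    for gold_sentence, predicted_sentence in zip(gold, predicted):
        gold_tags = gold_sentence.split()
        predicted_tags = predicted_sentence.split()
        if len(gold_tags) != len(predicted_tags):
            continue  # lengths differ, so the tags cannot be aligned
        compared += len(gold_tags)
        correct += sum(1 for g, p in zip(gold_tags, predicted_tags) if g == p)
    return 100.0 * correct / compared if compared else 0.0


# Hypothetical gold and predicted tag sequences for three sentences.
gold = ["NNP VBZ JJ NN", "DT NN VBD", "PRP VBP"]
predicted = ["NNP VBZ NN NN", "DT NN VBD", "PRP"]

score = token_accuracy(gold, predicted)
print(score)
```

Here the first pair contributes 3 matches out of 4, the second 3 out of 3, and the third is skipped, so the score is 6 out of 7 comparisons.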

Finally, we evaluate the POS tags from NLTK and Flair against the POS tags provided with the dataset.

print("nltk Score ", eval2(POS,nltk_result))

print("Flair Score ", eval2(POS,f_pos))

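The reported results:

NLTK score: 85.38654023442645

Flair score: 90.96172124773179

Flair clearly has the edge here, thanks to its word embeddings and stacked word embeddings. With its high-level API, these embeddings can be put to use with very little effort.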
