Python for NLP (Natural Language Processing): Working with Facebook's FastText Library

Date: 2019-12-19

Original article: http://tecdat.cn/?p=8572

In this article, we will look at FastText, another extremely useful module for word embeddings and text classification.

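We will explore the FastText library briefly. The article is divided into two parts. In the first part, we will see how FastText creates vector representations that can be used to find semantic similarity between words. In the second part, we will see the library applied to text classification.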

FastText for Semantic Similarity

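FastText supports both the continuous bag-of-words and skip-gram models. In this article, we will implement the skip-gram model to learn word representations from the four Wikipedia articles fetched below: Artificial Intelligence, Machine Learning, Deep Learning, and Neural Network. Because these topics are closely related, combining them gives us a substantial amount of data for the corpus. You can add more topics of a similar nature if you like.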
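To make the difference between the two model types more concrete, here is a minimal sketch of how skip-gram training pairs are formed: each word is paired with the words that fall within a fixed window around it. The function name skipgram_pairs and the toy sentence are purely illustrative and are not part of FastText or Gensim.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token and its neighbours
    within `window` positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["deep", "neural", "network"]
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

A skip-gram model learns to predict the context word from the center word, while the bag-of-words variant works in the opposite direction and predicts the center word from its context.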

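As a first step, we need to install the required library. The wikipedia package for Python can be installed with the following command: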

$ pip install wikipedia

Importing Libraries

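The following script imports the required libraries into our application. For word representations and semantic similarity, we can use the Gensim implementation of FastText.

from keras.preprocessing.text import Tokenizer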
from gensim.models.fasttext import FastText
import numpy as np
import matplotlib.pyplot as plt
import nltk
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer

import wikipedia
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

%matplotlib inline

Wikipedia Articles

In this step, we will scrape the required Wikipedia articles. Look at the script below:

artificial_intelligence = wikipedia.page("Artificial Intelligence").content
machine_learning = wikipedia.page("Machine Learning").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content

artificial_intelligence = sent_tokenize(artificial_intelligence)
machine_learning = sent_tokenize(machine_learning)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)

artificial_intelligence.extend(machine_learning)
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)

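To scrape a Wikipedia page, we can use the page method from the wikipedia module. The name of the page you want to scrape is passed as a parameter to the page method. The method returns a WikipediaPage object, which you can then use to retrieve the page contents via the content attribute, as shown in the script above.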

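The scraped content from the four Wikipedia pages is then tokenized into sentences with the sent_tokenize method, which returns a list of sentences. The sentences of the four pages are tokenized separately. Finally, the sentences from the four articles are joined together with the extend method.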
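The tokenize-then-extend pattern can be tried without downloading anything. The sketch below uses a rough regular-expression splitter as a stand-in for sent_tokenize (it is far less careful than the NLTK tokenizer), and the two short strings stand in for page contents; all names here are illustrative.

```python
import re


def naive_sent_tokenize(text):
    """Rough stand-in for nltk's sent_tokenize: split after '.', '!' or '?'
    when the mark is followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


page_a = "AI is a field of study. It has many branches."
page_b = "Deep learning uses neural networks!"

corpus = naive_sent_tokenize(page_a)
# extend appends the individual sentences rather than nesting a list
corpus.extend(naive_sent_tokenize(page_b))
print(corpus)
```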

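Data Preprocessing

The next step is to clean the text data by removing punctuation and numbers.

The preprocess_text function defined below performs the preprocessing tasks.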

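import re
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def preprocess_text(document):
    # NOTE: the function body is missing in this copy of the article; the steps
    # below are a reconstruction that matches the sample output further down.
    # Replace punctuation and digits with spaces, then lowercase
    document = re.sub(r'[^a-zA-Z]', ' ', str(document)).lower()
    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document).strip()

    # Lemmatize, then drop stop words and words of three characters or fewer
    tokens = [stemmer.lemmatize(word) for word in document.split()]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 3]

    preprocessed_text = ' '.join(tokens)

    return preprocessed_text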

Let's check that our function performs the desired task by preprocessing a dummy sentence:

sent = preprocess_text("Artificial intelligence, is the most advanced technology of the present era")
print(sent)

The preprocessed sentence looks like this:

artificial intelligence advanced technology present

You can see that the punctuation and stop words have been removed.
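The training call later in this article uses a variable named word_tokenized_corpus whose definition does not appear in the text. One plausible way to build it, sketched here under that assumption, is to preprocess every sentence and then split each result into tokens. The regular expression is the pattern that NLTK's WordPunctTokenizer is built on, and the sample strings in final_corpus stand in for the output of preprocess_text.

```python
import re

# Stand-in for the result of applying preprocess_text to every sentence.
final_corpus = [
    "artificial intelligence advanced technology present",
    "",
    "machine learning study algorithm",
]

# Split each non-empty sentence into word tokens.
word_tokenized_corpus = [
    re.findall(r"\w+|[^\w\s]+", sentence)
    for sentence in final_corpus
    if sentence.strip()
]
print(word_tokenized_corpus)
```

The result is a list of token lists, which is the input shape that Gensim's FastText expects for its sentences argument.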

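Creating Word Representations

We have preprocessed our corpus. Now it is time to create word representations using FastText. Let's first define the hyperparameters for our FastText model: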

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

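Here embedding_size is the size of the embedding vector, and window_size is the number of words before and after a target word that are treated as its context.

The next hyperparameter is min_word, which specifies the minimum frequency a word must have in the corpus for a representation to be generated for it. Finally, the most frequently occurring words are down-sampled by the number specified by the down_sampling attribute.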
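As a rough illustration of what the minimum-frequency setting does, the sketch below counts the words in a toy corpus and keeps only those that occur often enough. It mimics the idea only; filter_by_min_count is a hypothetical helper and does not reflect how Gensim implements the filter internally.

```python
from collections import Counter


def filter_by_min_count(tokenized_corpus, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(word for sentence in tokenized_corpus for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


toy_corpus = [["neural", "network"], ["neural", "model"], ["neural", "network"]]
vocab = filter_by_min_count(toy_corpus, min_count=2)
print(sorted(vocab))
```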

Let's now create our FastText model for word representations.

%%time
ft_model = FastText(word_tokenized_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)

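The sg parameter defines the type of model that we want to create. A value of 1 specifies a skip-gram model. A value of 0 specifies the bag-of-words model, which is also the default.

Execute the above script. It may take some time to run. On my machine, the timing statistics for the code above are as follows: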

CPU times: user 1min 45s, sys: 434 ms, total: 1min 45s
Wall time: 57.2 s
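Note that the size and iter keywords in the call above are the Gensim 3.x names; Gensim 4.0 renamed them to vector_size and epochs. The helper below is a hypothetical convenience (fasttext_kwargs is not a Gensim function) that picks the right names from a version string, so the same script can run against either release.

```python
def fasttext_kwargs(gensim_version, embedding_size, window_size,
                    min_word, down_sampling, epochs):
    """Build keyword arguments for gensim's FastText constructor,
    choosing parameter names by major version (hypothetical helper)."""
    major = int(gensim_version.split(".")[0])
    kwargs = {
        "window": window_size,
        "min_count": min_word,
        "sample": down_sampling,
        "sg": 1,  # 1 = skip-gram, 0 = continuous bag of words
    }
    if major >= 4:
        kwargs["vector_size"] = embedding_size
        kwargs["epochs"] = epochs
    else:
        kwargs["size"] = embedding_size
        kwargs["iter"] = epochs
    return kwargs


print(fasttext_kwargs("4.3.2", 60, 40, 5, 1e-2, 100))
```

With Gensim installed, the model could then be created as FastText(word_tokenized_corpus, **fasttext_kwargs(gensim.__version__, embedding_size, window_size, min_word, down_sampling, 100)).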

Let's now look at the vector for the word "artificial":

print(ft_model.wv['artificial'])

Here is the output:

[-3.7653010e-02 -4.5558015e-01  3.2035065e-01 -1.5289043e-01
  4.0645871e-02 -1.8946664e-01  7.0426887e-01  2.8806925e-01
 -1.8166199e-01  1.7566417e-01  1.1522485e-01 -3.6525184e-01
 -6.4378887e-01 -1.6650060e-01  7.4625671e-01 -4.8166099e-01
  2.0884991e-01  1.8067230e-01 -6.2647951e-01  2.7614883e-01
 -3.6478557e-02  1.4782918e-02 -3.3124462e-01  1.9372456e-01
  4.3028224e-02 -8.2326338e-02  1.0356739e-01  4.0792203e-01
 -2.0596240e-02 -3.5974573e-02  9.9928051e-02  1.7191900e-01
 -2.1196717e-01  6.4424530e-02 -4.4705093e-02  9.7391091e-02
 -2.8846195e-01  8.8607501e-03  1.6520244e-01 -3.6626378e-01
 -6.2017748e-04 -1.5083785e-01 -1.7499258e-01  7.1994811e-02
 -1.9868813e-01 -3.1733567e-01  1.9832127e-01  1.2799081e-01
 -7.6522082e-01  5.2335665e-02 -4.5766738e-01 -2.7947658e-01
  3.7890410e-03 -3.8761377e-01 -9.3001537e-02 -1.7128626e-01
 -1.2923178e-01  3.9627206e-01 -3.6673656e-01  2.2755004e-01]

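Let's now find the five most similar words for each of "artificial", "intelligence", "machine", "network", "recurrent", and "deep". You can choose any number of words. The following script prints the specified words along with their five most similar words.

words = ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']
semantically_similar_words = {w: [item[0] for item in ft_model.wv.most_similar([w], topn=5)] for w in words}
for k,v in semantically_similar_words.items():
    print(k+":"+str(v))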

The output is as follows:

artificial:['intelligence', 'inspired', 'book', 'academic', 'biological']
intelligence:['artificial', 'human', 'people', 'intelligent', 'general']
machine:['ethic', 'learning', 'concerned', 'argument', 'intelligence']
network:['neural', 'forward', 'deep', 'backpropagation', 'hidden']
recurrent:['rnns', 'short', 'schmidhuber', 'shown', 'feedforward']
deep:['convolutional', 'speech', 'network', 'generative', 'neural']

We can also find the cosine similarity between the vectors of any two words, as shown below:

print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))

The output shows a value of "0.7481". Cosine similarity can range from -1 to 1, and a higher value indicates greater similarity.
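For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The following minimal sketch computes it in plain Python for three toy vector pairs; it illustrates the formula and is not the Gensim implementation.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```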

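Visualizing Word Similarities

Although each word in our model is represented as a 60-dimensional vector, we can use principal component analysis (PCA) to find two principal components. The two principal components can then be used to plot the words in a two-dimensional space.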

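# Assumed definition (missing from this copy): flatten the six lists of similar words
all_similar_words = [w for similar in semantically_similar_words.values() for w in similar]

print(all_similar_words)
print(type(all_similar_words))
print(len(all_similar_words))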

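Each key in the dictionary is a word, and the corresponding value is a list of all its semantically similar words. Since we found the top 5 most similar words for each of the 6 words "artificial", "intelligence", "machine", "network", "recurrent", and "deep", you will find 30 words in the all_similar_words list.

Next, we have to find the word vectors for all these 30 words and then use PCA to reduce their dimensionality from 60 to 2. We can then use plt, which is an alias for matplotlib.pyplot, to plot the words in a two-dimensional vector space.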

Execute the following script to visualize the words:

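from sklearn.decomposition import PCA

word_vectors = ft_model.wv[all_similar_words]

# The PCA and scatter-plot steps are missing from this copy; reconstructed here
pca = PCA(n_components=2)
p_comps = pca.fit_transform(word_vectors)
word_names = all_similar_words

plt.scatter(p_comps[:, 0], p_comps[:, 1], c='red')
for word, x, y in zip(word_names, p_comps[:, 0], p_comps[:, 1]):
    plt.annotate(word, xy=(x+0.06, y+0.03), xytext=(0, 0), textcoords='offset points')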

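The output of the script above is a scatter plot of the 30 words in the two-dimensional PCA space (figure not reproduced here).

You can see that words that frequently occur together in the text are also close to each other in the two-dimensional plane.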

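FastText for Text Classification

Text classification refers to classifying textual data into predefined categories based on the contents of the text. Sentiment analysis, spam detection, and tag detection are some of the most common use cases for text classification.

The Dataset

The dataset contains multiple files, but we are only interested in the yelp_review.csv file. The file contains 5.2 million reviews of different businesses, including restaurants, bars, dentists, doctors, and beauty salons. However, due to memory constraints we will use only the first 50,000 records to train our model. You can try more records if you want.

Let's import the required libraries and load the dataset: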

import pandas as pd
import numpy as np

yelp_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/yelp_review_short.csv")

In the script above, we load the yelp_review_short.csv file, which contains 50,000 reviews, with the pd.read_csv function.
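The article does not show how the 50,000-record file was produced from the full yelp_review.csv. One way to do it without loading the whole file into memory, sketched here as an assumption, is to stream the CSV and copy only the header and the first records. The helper name write_first_rows is hypothetical.

```python
import csv
from itertools import islice


def write_first_rows(src_path, dst_path, n_rows):
    """Copy the header plus the first `n_rows` records of a CSV file."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        writer.writerow(next(reader))              # header row
        writer.writerows(islice(reader, n_rows))   # first n_rows records
```

A call such as write_first_rows("yelp_review.csv", "yelp_review_short.csv", 50000) would then create the short file. Going through the csv module rather than counting raw lines matters here, because review texts can contain line breaks inside quoted fields.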

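We can simplify our problem by converting the numerical review ratings into categorical ones. This is done by adding a new column, reviews_score, to the dataset.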
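The code for this conversion is not included in the text. A minimal sketch of the idea follows; the cut-off used here (ratings of 1 or 2 count as negative, everything above as positive) is an assumption made for illustration, as are the name stars_to_label and the rating column name mentioned afterwards.

```python
def stars_to_label(stars, negative_max=2):
    """Map a numeric star rating to 'negative' or 'positive'.
    The cut-off (ratings up to `negative_max` are negative) is an assumption."""
    return "negative" if stars <= negative_max else "positive"


ratings = [1, 2, 3, 4, 5]
labels = [stars_to_label(s) for s in ratings]
print(labels)
```

With pandas, and assuming the rating column is named stars, the new column could be added as yelp_reviews['reviews_score'] = yelp_reviews['stars'].apply(stars_to_label).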

Finally, the head of the data frame looks like this (table not reproduced here).

Installing FastText

The next step is to download FastText, which can be fetched from its GitHub repository with the wget command, as shown in the following script:

!wget https://github.com/facebookresearch/fastText/archive/v0.1.0.zip

If you run the above script and see the following results, FastText has been downloaded successfully:

--2019-08-16 15:05:05--  https://github.com/facebookresearch/fastText/archive/v0.1.0.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0 [following]
--2019-08-16 15:05:05--  https://codeload.github.com/facebookresearch/fastText/zip/v0.1.0
Resolving codeload.github.com (codeload.github.com)... 192.30.255.121
Connecting to codeload.github.com (codeload.github.com)|192.30.255.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘v0.1.0.zip’

v0.1.0.zip              [ <=>                ]  92.06K  --.-KB/s    in 0.03s

2019-08-16 15:05:05 (3.26 MB/s) - ‘v0.1.0.zip’ saved [94267]

The next step is to unzip the FastText archive. Simply type the following command:

!unzip v0.1.0.zip

Next, navigate to the directory where FastText was extracted and execute the !make command to build the C++ binary. Execute the following steps:

cd fastText-0.1.0
!make

If you see the following output, FastText has been built successfully on your machine.

c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/dictionary.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/productquantizer.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/matrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/qmatrix.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/vector.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/model.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/utils.cc
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/fasttext.cc
c++ -pthread -std=c++0x -O3 -funroll-loops args.o dictionary.o productquantizer.o matrix.o qmatrix.o vector.o model.o utils.o fasttext.o src/main.cc -o fasttext

To verify the installation, execute the following command:

!./fasttext

You should see that the following commands are supported by FastText:

usage: fasttext <command> <args>

The commands supported by FastText are:

  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies

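Text Classification

Before we train a FastText model for text classification, it is worth mentioning that FastText accepts data in a special format, which is as follows:

__label__tag This is sentence 1
__label__tag2 This is sentence 2.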

If we look at our dataset, it is not in the desired format. A text with positive sentiment should look like this:

__label__positive burgers are very big portions here.

Similarly, a negative review should look like this:

__label__negative They do not use organic ingredients, but I thi...
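The format above can be produced with a few lines of Python. The sketch below builds one line per example by prefixing the label with __label__ and flattening the text onto a single line; the function name to_fasttext_line is illustrative.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    flat = " ".join(text.split())   # collapse newlines, tabs and repeated spaces
    return f"__label__{label} {flat}"


line = to_fasttext_line("positive", "burgers are very\nbig portions here.")
print(line)
```

Keeping each example on one line matters because FastText reads its input file line by line, so a stray newline inside a review would split it into two examples.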

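The following script selects the reviews_score and text columns from the dataset and then prefixes all the values in the reviews_score column with __label__. Similarly, \n and \t in the text column are replaced by a space. Finally, the updated data frame is written to yelp_reviews_updated.txt.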

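import pandas as pd

col = ['reviews_score', 'text']
yelp_reviews = yelp_reviews[col].copy()
# Remaining steps are missing from this copy; reconstructed from the description above
yelp_reviews['reviews_score'] = '__label__' + yelp_reviews['reviews_score'].astype(str)
yelp_reviews['text'] = yelp_reviews['text'].replace(r'[\n\t]', ' ', regex=True)
with open('/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt', 'w') as f:
    for label, text in zip(yelp_reviews['reviews_score'], yelp_reviews['text']):
        f.write(f'{label} {text}\n')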

Let's now print the head of the updated yelp_reviews data frame.

yelp_reviews.head()

You should see the following results:

reviews_score   text
0   __label__positive   Super simple place but amazing nonetheless. It...
1   __label__positive   Small unassuming place that changes their menu...
2   __label__positive   Lester's is located in a beautiful neighborhoo...
3   __label__positive   Love coming here. Yes the place always needs t...
4   __label__positive   Had their chocolate almond croissant and it wa...

Similarly, the tail of the data frame looks like this:

reviews_score   text
49995   __label__positive   This is an awesome consignment store! They hav...
49996   __label__positive   Awesome laid back atmosphere with made-to-orde...
49997   __label__positive   Today was my first appointment and I can hones...
49998   __label__positive   I love this chic salon. They use the best prod...
49999   __label__positive   This place is delicious. All their meats and s...

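We have converted our dataset into the required shape. The next step is to divide our data into training and test sets. 80% of the data, i.e. the first 40,000 of the 50,000 records, will be used for training, while the remaining 20% (the last 10,000 records) will be used to evaluate the performance of the algorithm.

The following script divides the data into training and test sets: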

!head -n 40000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt"
!tail -n 10000 "/content/drive/My Drive/Colab Datasets/yelp_reviews_updated.txt" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

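Once the script above has run, the yelp_reviews_train.txt file containing the training data is generated. Similarly, the newly generated yelp_reviews_test.txt file contains the test data.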
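The same head/tail split can be expressed in Python, which is handy when the shell commands are not available. This is a sketch of the slicing logic on a toy list of lines; head_tail_split is a hypothetical helper.

```python
def head_tail_split(lines, n_head, n_tail):
    """Mimic `head -n n_head` and `tail -n n_tail` on a list of lines."""
    head = lines[:n_head]
    tail = lines[-n_tail:] if n_tail else []
    return head, tail


lines = [f"__label__positive review {i}" for i in range(10)]
train, test = head_tail_split(lines, n_head=8, n_tail=2)
print(len(train), len(test))
```

Like the shell version, this split keeps the original order of the records and does no shuffling.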

Now it is time to train our FastText text classification algorithm.

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" -output model_yelp_reviews

To train the algorithm, we use the supervised command and pass it the input file. Here is the output of the script above:

Read 4M words
Number of words:  177864
Number of labels: 2
Progress: 100.0%  words/sec/thread: 2548017  lr: 0.000000  loss: 0.246120  eta: 0h0m
CPU times: user 212 ms, sys: 48.6 ms, total: 261 ms
Wall time: 15.6 s

You can view the model with the !ls command, as shown below:

!ls

Here is the output:

args.o             Makefile         quantization-results.sh
classification-example.sh  matrix.o         README.md
classification-results.sh  model.o          src
CONTRIBUTING.md        model_yelp_reviews.bin   tutorials
dictionary.o           model_yelp_reviews.vec   utils.o
eval.py            PATENTS          vector.o
fasttext           pretrained-vectors.md    wikifil.pl
fasttext.o         productquantizer.o       word-vector-example.sh
get-wikimedia.sh       qmatrix.o            yelp_reviews_train.txt
LICENSE            quantization-example.sh

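You can see model_yelp_reviews.bin in the list of files above.

Finally, to test the model you can use the test command. You have to specify the model name and the test file after the test command, as shown below: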

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt"

The output of the script above looks like this:

N   10000
P@1 0.909
R@1 0.909
Number of examples: 10000

Here P@1 refers to precision and R@1 refers to recall. You can see that our model achieves a precision and recall of 0.909, which is pretty good.
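The two numbers are identical for a reason: when every review has exactly one true label and the model returns exactly one prediction, precision at 1 and recall at 1 both reduce to the fraction of correct predictions. The toy sketch below illustrates this; it is not the FastText evaluation code.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """P@1 and R@1 when every example has exactly one true label
    and the model returns exactly one prediction per example."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)   # correct / predictions made
    recall = correct / len(true_labels)           # correct / true labels
    return precision, recall


truth = ["positive", "negative", "positive", "positive"]
preds = ["positive", "positive", "positive", "negative"]
print(precision_recall_at_1(truth, preds))
```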

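Let's now try to clean our text by separating punctuation and special characters from the words and converting everything to lowercase, to improve the uniformity of the text. The following script cleans the training set: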

!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_train.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt"
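For readers who prefer to stay in Python, the sed and tr pipeline above can be approximated with a regular expression and str.lower. This is a sketch, not an exact port: clean_line is a hypothetical name, and the character class mirrors the punctuation listed in the sed expression.

```python
import re


def clean_line(line):
    """Pad selected punctuation marks with spaces, then lowercase the line."""
    padded = re.sub(r"([.!?,’/()])", r" \1 ", line)
    return padded.lower()


print(clean_line("__label__positive Great food, really!"))
```

Padding the punctuation with spaces turns each mark into its own token, so "food," and "food" are no longer treated as different words.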

And the following script cleans the test set:

!cat "/content/drive/My Drive/Colab Datasets/yelp_reviews_test.txt" | sed -e "s/\([.\!?,’/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

Now, we will train the model on the cleaned training set:

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews

Finally, we will use the model trained on the cleaned training set to make predictions on the cleaned test set:

!./fasttext test model_yelp_reviews.bin "/content/drive/My Drive/Colab Datasets/yelp_reviews_test_clean.txt"

The output of the script above is as follows:

N   10000
P@1 0.915
R@1 0.915
Number of examples: 10000

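You can see a slight increase in both precision and recall. To improve the model further, you can increase the number of epochs and the learning rate. The following script sets the number of epochs to 30 and the learning rate to 0.5.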

%%time
!./fasttext supervised -input "/content/drive/My Drive/Colab Datasets/yelp_reviews_train_clean.txt" -output model_yelp_reviews -epoch 30 -lr 0.5