Simplifying Text Classification with Facebook's FastText

Date: 2019-11-17


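A step-by-step tutorial on analyzing the sentiment of Amazon product reviews with the FastText API

This blog post provides a detailed, step-by-step tutorial on using FastText for text classification. As the example task, we perform sentiment analysis on customer reviews from Amazon.com, and we explain how to scrape the reviews of a specific product so that their sentiment can be analyzed.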

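What is FastText?

Text classification has become an essential component of the commercial world. Spam filtering and sentiment analysis of tweets and of customer reviews on e-commerce sites are perhaps the most ubiquitous examples.

FastText is an open-source library developed by Facebook AI Research (FAIR) that aims to make text classification simple and fast. FastText can be trained on millions of example texts within tens of minutes on a multi-core CPU, and the trained model can predict labels for previously unseen text across more than 300,000 categories in under five minutes.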

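Pre-labeled training dataset:

We obtained a manually annotated dataset of several million Amazon reviews from Kaggle.com and used it to train the model after converting it to the FastText format.

The FastText data format is as follows:

__label__<X> __label__<Y> ... <Text>

where X and Y denote class labels.

In the dataset we use, the review title is prepended to the review text, separated by ":" and a space.

An example from the training data file is given below. The datasets used to train and test the model can be found on Kaggle.com.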

__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"

Here we have only two classes, 1 and 2: __label__1 means the reviewer gave the product 1 or 2 stars, and __label__2 means a 4- or 5-star rating.
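To make the format concrete, here is a minimal Python sketch that turns one review into a FastText training line using the labeling rule described above. The function name `to_fasttext_line` is hypothetical, and rejecting star ratings outside the two classes is our assumption, since the text above only defines labels for 1, 2, 4 and 5 stars.

```python
def to_fasttext_line(stars: int, title: str, body: str) -> str:
    """Format one review as a FastText training line.

    1-2 stars -> __label__1, 4-5 stars -> __label__2 (as described above).
    Any other rating is rejected; that choice is an assumption of this sketch.
    """
    if stars in (1, 2):
        label = "__label__1"
    elif stars in (4, 5):
        label = "__label__2"
    else:
        raise ValueError("no label defined for this star rating")
    # FastText reads one example per line, so collapse embedded newlines.
    text = " ".join(f"{title}: {body}".split())
    return f"{label} {text}"


print(to_fasttext_line(5, "Great CD", "This CD just oozes LIFE."))
```

Writing one such line per review gives a file in the format shown above.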

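Training FastText for text classification:

Pre-processing and cleaning the data:

The following command normalizes the text by padding punctuation characters with spaces and converting everything to lowercase, and writes the pre-processed, cleaned training data to a new file.

cat <path to training file> | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > <path to pre-processed output file>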
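For readers who prefer to do this step in Python, the same normalization can be sketched as follows. The function name `preprocess` is hypothetical, and `str.lower()` is Unicode-aware while `tr` depends on the locale, so results may differ slightly on non-ASCII text.

```python
import re

# Same character class as the sed expression: . \ ! ? , ' / ( )
# (inside a sed bracket expression the backslash is a literal character).
_PUNCT = re.compile(r"([.\\!?,'/()])")


def preprocess(line: str) -> str:
    """Pad punctuation with spaces, then lowercase (rough sed + tr equivalent)."""
    return _PUNCT.sub(r" \1 ", line).lower()


print(preprocess("__label__2 Great CD: I LOVE it!"))
```

The `__label__` prefix is already lowercase and contains none of the padded characters, so labels pass through unchanged.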

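Setting up FastText:

Let's start by downloading the latest version:

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ make

Running the binary without any arguments prints the high-level documentation, showing the different use cases supported by fastText: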

>> ./fasttext
usage: fasttext <command> <args>
The commands supported by fasttext are:
  supervised              train a supervised classifier
  quantize                quantize a model to reduce the memory usage
  test                    evaluate a supervised classifier
  predict                 predict most likely labels
  predict-prob            predict most likely labels with probabilities
  skipgram                train a skipgram model
  cbow                    train a cbow model
  print-word-vectors      print word vectors given a trained model
  print-sentence-vectors  print sentence vectors given a trained model
  nn                      query for nearest neighbors
  analogies               query for analogies

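In this tutorial we mainly use the supervised, test and predict subcommands, which correspond to learning (and using) a text classifier.

Training the model:

The following command trains the text classification model:

./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__

The -input option specifies the training file, and the -output option specifies where the model is saved. When training completes, a .bin file containing the trained classifier is created at the given location (for example, model.bin if the output path is model).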

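Optional parameters for improving the model:

Increasing the number of training epochs:

By default the model sees each training example 5 times. To train for longer, raise this number with the -epoch option.

Example:

./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -epoch 50

Specifying the learning rate:

Changing the learning rate changes how fast the model learns. The learning rate controls how much the model changes after processing each example. A learning rate of 0 means the model does not change at all and therefore learns nothing. Good values lie in the range 0.1 - 1.0.

The default value of lr is 0.1. This is how to specify the parameter:

./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -lr 0.5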
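The effect of the learning rate can be illustrated with a toy example. The sketch below runs plain gradient descent on a one-dimensional function; it is not fastText's internal optimizer, and the function `sgd_steps` and its target value are made up for illustration.

```python
def sgd_steps(lr: float, steps: int = 5, w: float = 0.0) -> float:
    """Minimize (w - 3)^2 with plain gradient descent and return the final w."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= lr * grad          # the update is scaled by the learning rate
    return w


for lr in (0.0, 0.1, 0.5):
    print(lr, sgd_steps(lr))
```

With a learning rate of 0 the parameter never moves from its starting value, which is the "learns nothing" case described above. Larger values move it toward the minimum at 3 in fewer steps.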

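Using n-grams as features:

This is a useful step for problems where word order matters, sentiment analysis in particular. It tells the model to use sequences of up to n consecutive tokens (word n-grams) as training features, in addition to the individual words.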
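As an illustration of what word n-grams are, the sketch below enumerates them for a short phrase. It only shows the idea: the helper `word_ngrams` is hypothetical, and fastText builds its n-gram features internally rather than through a function like this.

```python
def word_ngrams(tokens: list[str], max_n: int) -> list[str]:
    """Return all contiguous word n-grams for n = 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("not good at all".split(), 2))
```

The bigram "not good" carries a sentiment that the separate words "not" and "good" do not, which is why n-grams tend to help on this kind of task.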

We specify the -wordNgrams option (ideally with a value between 2 and 5):

./fasttext supervised -input <path to pre-processed training file> -output <path to save model> -label __label__ -wordNgrams 3
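The three options can also be combined in a single run. The sketch below assembles such a command line in Python; the helper `build_train_command` and the file names `reviews.train` and `model` are hypothetical, and the values are simply the ones used in the examples above, not tuned settings.

```python
import shlex


def build_train_command(train_file, model_prefix, epoch=None, lr=None, word_ngrams=None):
    """Assemble the argument list for `./fasttext supervised`."""
    cmd = ["./fasttext", "supervised",
           "-input", train_file, "-output", model_prefix,
           "-label", "__label__"]
    # Only add the optional flags that were explicitly requested.
    for flag, value in (("-epoch", epoch), ("-lr", lr), ("-wordNgrams", word_ngrams)):
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_train_command("reviews.train", "model", epoch=50, lr=0.5, word_ngrams=3)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the arguments as a list keeps paths containing spaces intact when the command is later passed to `subprocess.run`.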

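Testing and evaluating the model:

The following command tests the model on a pre-annotated test dataset. It compares the original label of each review with the predicted label and reports evaluation scores in the form of precision and recall.

Precision is the fraction of the labels predicted by fastText that are correct. Recall is the fraction of the true labels that were successfully predicted.

./fasttext test <path to model> <path to test file> k

where the parameter k means that the model's top k predicted labels for each review are used.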
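To make the two scores concrete, here is a small Python sketch that computes precision and recall at k from gold labels and ranked predictions, following the definitions above. The function name and the toy data are hypothetical, and this is our reading of the metrics rather than fastText's own code.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Compute P@k and R@k over a dataset.

    gold:      list of sets of true labels, one set per example
    predicted: list of ranked label lists, one list per example
    """
    correct = n_predicted = n_gold = 0
    for true_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += sum(1 for label in top_k if label in true_labels)
        n_predicted += len(top_k)
        n_gold += len(true_labels)
    return correct / n_predicted, correct / n_gold


gold = [{"__label__2"}, {"__label__1"}, {"__label__2"}, {"__label__2"}]
pred = [["__label__2"], ["__label__2"], ["__label__2"], ["__label__1"]]
print(precision_recall_at_k(gold, pred, k=1))
```

When every review has exactly one true label and k is 1, the number of predicted labels equals the number of true labels, so precision and recall are the same number.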

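The results of evaluating our trained model on a test set of 400,000 reviews are shown below. Precision and recall are both about 91%, and the model was trained in a very short time.

N 400000
P@1 0.913
R@1 0.913
Number of examples: 400000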

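Analyzing the sentiment of live customer reviews of products on Amazon.com:

Scraping Amazon customer reviews:

We use an existing Python library to scrape the reviews from the page.

To install it, type the following at the command prompt/terminal:

pip install amazon-review-scraper

Below is sample code that scrapes the reviews of a specific product, given the URL of its review page: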

from amazon_review_scraper import amazon_review_scraper

# URL of the review page and the range of pages to scrape.
url = input("Enter URL: ")
start_page = input("Enter Start Page: ")
end_page = input("Enter End Page: ")

# Upper bound, in seconds, of the wait before each page is scraped.
time_upper_limit = input("Enter upper limit of time range (Example: Entering the value 5 would mean the program will wait anywhere from 0 to 5 seconds before scraping a page. If you don't want the program to wait, enter 0): ")

# Name used for the output CSV file.
file_name = "amazon_product_review"

scraper = amazon_review_scraper.amazon_review_scraper(url, start_page, end_page, time_upper_limit)
scraper.scrape()
scraper.write_csv(file_name)

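Note: when entering the URL of a product's customer review page, make sure to append &pageNumber=1 (if it is not already present) so that the scraper works correctly.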
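This check is easy to automate. The sketch below adds the parameter only when it is missing, using the standard library's URL helpers. The function name `ensure_page_number` and the example URL are hypothetical and serve only as an illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_page_number(url: str) -> str:
    """Append pageNumber=1 to the query string unless pageNumber is already set."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    if not any(key == "pageNumber" for key, _ in query):
        query.append(("pageNumber", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical review-page URL, used only for illustration.
print(ensure_page_number("https://www.amazon.com/product-reviews/EXAMPLE?ie=UTF8"))
```

A URL that already carries a pageNumber value keeps that value.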

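The code above scrapes the reviews from the given URL and writes them to an output csv file.

From this csv file we extract the title and body of each review, join them with ":" and a space as in the training file, and store them in a separate txt file for sentiment prediction.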
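A minimal sketch of this extraction step is shown below. The column names `title` and `review_text` are assumptions, since the layout of the scraper's CSV output is not reproduced here, and the function `reviews_to_lines` is hypothetical. Adjust the names to match the header row of the actual file.

```python
import csv
import io


def reviews_to_lines(csv_file, title_col="title", body_col="review_text"):
    """Yield one 'title: body' line per CSV row (column names are assumed)."""
    for row in csv.DictReader(csv_file):
        text = f"{row[title_col]}: {row[body_col]}"
        # One review per line: collapse embedded newlines and runs of spaces.
        yield " ".join(text.split())


# In-memory stand-in for the scraped CSV file, for illustration only.
sample = io.StringIO('title,review_text\nGreat CD,"Love it.\nPlay it daily."\n')
for line in reviews_to_lines(sample):
    print(line)
```

Each resulting line should then go through the same pre-processing as the training data before being written to the txt file, so that the model sees text in the form it was trained on.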

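Sentiment prediction on the data:

./fasttext predict <path to model> <path to test file> k > <path to prediction file>

where k means that the model predicts the top k labels for each review.

The labels predicted for the scraped reviews are as follows: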

__label__2
__label__1
__label__2
__label__2
__label__2
__label__2
__label__2
__label__2
__label__1
__label__2
__label__2

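These predictions are fairly accurate, as can be verified manually. The prediction file can then be used for further detailed analysis and visualization.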
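As a first step toward such an analysis, the predictions can be tallied per label. The sketch below is one possible way to do this, assuming k = 1 so that each line of the prediction file holds a single label. The function name `summarize` is hypothetical.

```python
from collections import Counter


def summarize(prediction_lines):
    """Count predicted labels and return the share of positive (__label__2) reviews."""
    # With k = 1, each non-empty line holds a single label.
    labels = [line.strip() for line in prediction_lines if line.strip()]
    counts = Counter(labels)
    positive_share = counts["__label__2"] / len(labels) if labels else 0.0
    return counts, positive_share


# The eleven predictions listed above, in order.
predictions = (["__label__2", "__label__1"] + ["__label__2"] * 6
               + ["__label__1"] + ["__label__2"] * 2)
counts, share = summarize(predictions)
print(counts, round(share, 2))
```

For the eleven reviews above, nine are predicted positive and two negative. In practice the lines would be read from the prediction file instead of a hard-coded list.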

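In this blog post, then, we learned how to use the FastText API for text classification, how to scrape the Amazon customer reviews of a given product, and how to predict their sentiment with the trained model.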

