Minesweeper Squad: Why Shouldn't You Remove Stop Words?

Date: 2020-12-18


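We often assume that removing stop words is a sensible step when preprocessing text.

I do agree with this practice, but we should be careful about which stop words we remove.

For example, the most common way to remove stop words is to use the NLTK stop-word list.

Let's take a look at the stop-word list in nltk.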

import nltk
nltk.download('stopwords')  # one-time download of the stop-word corpus
from nltk.corpus import stopwords
print(stopwords.words('english'))


['i', 'me', 'my', 'myself', 'we', 'our', 'ours','ourselves', 'you', "you're", "you've", "you'll","you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him','his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it',"it's", 'its', 'itself', 'they', 'them', 'their', 'theirs','themselves', 'what', 'which', 'who', 'whom', 'this', 'that',"that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be','been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing','a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while','of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into','through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up','down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then','once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both','each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not','only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will','just', 'don', "don't", 'should', "should've", 'now', 'd','ll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't",'didn', "didn't", 'doesn', "doesn't", 'hadn',"hadn't", 'hasn', "hasn't", 'haven', "haven't",'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',"mustn't", 'needn', "needn't", 'shan', "shan't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',"weren't", 'won', "won't", 'wouldn', "wouldn't"]

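Now, note the negation words in this list, such as 'no', 'not', "don't", and "didn't".

What is wrong with them?

Here is an example:

Suppose we want to build a sentiment analysis model for product reviews. Because the dataset is small, we can label the sentiment by hand. Let's look at a few reviews from the dataset.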

The product is really very good. (POSITIVE)
The products seems to be good. (POSITIVE)
Good product. I really liked it. (POSITIVE)
I didn't like the product. (NEGATIVE)
The product is not good. (NEGATIVE)

Next, we preprocess the data and remove all the stop words.

Now, let's see what happens to the samples above.

product really good. (POSITIVE)
products seems good. (POSITIVE)
Good product. really liked. (POSITIVE)
like product. (NEGATIVE)
product good. (NEGATIVE)

Look at what the negative reviews say now.

Scary, isn't it?
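The transformation above can be reproduced with a short sketch. It is only an illustration: `STOP_WORDS` is a small hand-picked subset of the NLTK list shown earlier (just enough for these five reviews), and `remove_stop_words` is a hypothetical helper, not part of NLTK.

```python
import re

# Assumption: a small subset of the NLTK English stop-word list shown earlier,
# just large enough to cover the five sample reviews.
STOP_WORDS = {"i", "the", "is", "very", "to", "be", "it", "not", "didn't"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every token found in stop_words (case-insensitive); punctuation is discarded."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return " ".join(t for t in tokens if t.lower() not in stop_words)

reviews = [
    ("The product is really very good.", "POSITIVE"),
    ("The products seems to be good.", "POSITIVE"),
    ("Good product. I really liked it.", "POSITIVE"),
    ("I didn't like the product.", "NEGATIVE"),
    ("The product is not good.", "NEGATIVE"),
]

for text, label in reviews:
    print(remove_stop_words(text), "|", label)
```

The last two lines of output read "like product" and "product good" while still carrying the NEGATIVE label, which is exactly the mismatch described above.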


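The positive reviews seem unaffected, but the overall meaning of the negative reviews has changed. If we build a model on this data, the results will almost certainly be poor.

This happens often: once the stop words are removed, the whole meaning of a sentence can change.

If you are using basic NLP techniques such as BOW, Count Vectorizer, or TF-IDF (term frequency and inverse document frequency), removing stop words is a sensible choice, because stop words act as noise in these models. But if you are using an LSTM or another model that captures the semantics of words, where a word's meaning depends on the preceding context, then keeping the stop words becomes essential.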
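To make the distinction concrete, here is a minimal sketch of what a count-based representation keeps and what it throws away. The two sentences and the `bag_of_words` helper are made up for illustration; a real pipeline would use a proper vectorizer.

```python
from collections import Counter

def bag_of_words(text):
    """Count tokens and ignore their order, as a count-based model does."""
    return Counter(text.lower().split())

a = "the product is good not bad"
b = "the product is bad not good"

# Same words, same counts: a bag-of-words model cannot tell these apart.
print(bag_of_words(a) == bag_of_words(b))

# The token sequences differ, so a model that reads words in order
# (an LSTM, for example) still receives different inputs.
print(a.split() == b.split())
```

The first comparison is true and the second is false. A sequence model relies on small words like "not" appearing in the right position, which is why stripping them hurts it more than it hurts a count-based model.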

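Now, back to the original question: does removing stop words really improve model performance?

As I said earlier, it depends on which stop words are removed. If stop words such as I, my, and me are left in, they add more noise to the dataset.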


So, what is the solution? We can create a suitable stop-word list of our own, but the problem is how to reuse that list across different projects.
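A minimal sketch of such a custom list, assuming we simply subtract negation words from a base set. `BASE_STOP_WORDS` and `NEGATIONS` are hand-picked subsets for illustration (in practice the base set would be the full NLTK list), and `clean` is a hypothetical helper.

```python
# Assumption: a small subset of the NLTK list shown earlier.
BASE_STOP_WORDS = {
    "i", "me", "my", "the", "is", "to", "be", "it", "very",
    "no", "nor", "not", "don't", "didn't", "isn't",
}

# Words we choose to keep because they flip the meaning of a sentence.
NEGATIONS = {"no", "nor", "not", "don't", "didn't", "isn't"}

CUSTOM_STOP_WORDS = BASE_STOP_WORDS - NEGATIONS

def clean(text, stop_words=CUSTOM_STOP_WORDS):
    """Lowercase, strip full stops, and drop tokens found in stop_words."""
    tokens = text.lower().replace(".", " ").split()
    return " ".join(t for t in tokens if t not in stop_words)

print(clean("The product is not good."))
print(clean("I didn't like the product."))
```

With the negations kept, the two negative reviews become "product not good" and "didn't like product". A set like this still has to be copied or packaged for every project that needs it.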

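That is why the Python package nlppreprocess was created. It removes only the stop words that are not needed, and it also makes cleaning text more efficient.

The best way to use the nlppreprocess package is to combine it with pandas: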

from nlppreprocess import NLP
import pandas as pd

nlp = NLP()
df = pd.read_csv('some_file.csv')
# Apply the preprocessing to every row of the 'text' column
df['text'] = df['text'].apply(nlp.process)

Now, if we preprocess the earlier samples with the nlppreprocess package, we get the following results:

product really very good. (POSITIVE)
products seems good. (POSITIVE)
Good product. really liked. (POSITIVE)
not like product. (NEGATIVE)
product not good. (NEGATIVE)

So it seems that using the nlppreprocess package to remove stop words and perform other preprocessing works well.

What do you think?


Compiled by: 虞雙雙, 李韻帷
Related link:
https://towardsdatascience.com/why-you-should-avoid-removing-stopwords-aa7a353d2a52
