Text Data Processing (Natural Language Processing Basics)

Date: 2019-11-11

Feature extraction for text data, Chinese word segmentation, and the bag-of-words model

1. Extracting features from text with CountVectorizer

# Import the CountVectorizer vectorization tool
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# Fit CountVectorizer to the text data
en = ['The quick brown fox jumps over a lazy dog']
vect.fit(en)
# Print the results (the one-letter word 'a' is dropped by the default tokenizer)
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))

Number of words: 8
Vocabulary: {'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumps': 3, 'over': 5, 'lazy': 4, 'dog': 1}

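# Try again with Chinese text
cn = ['那隻敏捷的棕色狐狸跳過了一隻懶惰的狗']
# Fit the Chinese text data
vect.fit(cn)
# Print the results (with no spaces between words, the whole sentence becomes one token)
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))

Number of words: 1
Vocabulary: {'那隻敏捷的棕色狐狸跳過了一隻懶惰的狗': 0}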

2. Segmenting Chinese text with a word segmentation tool

# Import the jieba word segmenter
import jieba
# Segment the Chinese text with jieba
cn = jieba.cut('那隻敏捷的棕色狐狸跳過了一隻懶惰的狗')
# Use spaces as the boundary between words
cn = [' '.join(cn)]
# Print the result
print(cn)

['那 隻 敏捷 的 棕色 狐狸 跳過 了 一隻 懶惰 的 狗']

# Vectorize the segmented Chinese text with CountVectorizer
vect.fit(cn)
# Print the results
print('Number of words: {}'.format(len(vect.vocabulary_)))
print('Vocabulary: {}'.format(vect.vocabulary_))

Number of words: 6
Vocabulary: {'敏捷': 2, '棕色': 3, '狐狸': 4, '跳過': 5, '一隻': 0, '懶惰': 1}
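
Only 6 of the 12 space-separated tokens made it into the vocabulary. CountVectorizer's default `token_pattern` is `(?u)\b\w\w+\b`, which keeps only tokens of two or more word characters, so single-character words such as 的, 了 and 狗 are dropped. The sketch below applies that same regular expression with Python's `re` module to illustrate the effect. `tokenize` and `build_vocabulary` are hypothetical helper names, and this is an illustration of the idea rather than scikit-learn's actual implementation.

```python
import re

# Default token pattern documented for CountVectorizer: two or more word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern=DEFAULT_PATTERN):
    """Return the tokens that the given pattern finds in text."""
    return re.findall(pattern, text)


def build_vocabulary(tokens):
    """Map each distinct token to its index in sorted order."""
    return {token: index for index, token in enumerate(sorted(set(tokens)))}


unsegmented = '那隻敏捷的棕色狐狸跳過了一隻懶惰的狗'
segmented = '那 隻 敏捷 的 棕色 狐狸 跳過 了 一隻 懶惰 的 狗'

# Without spaces, the whole sentence comes back as a single token
print(tokenize(unsegmented))

# With spaces, only the multi-character words survive the default pattern
vocab = build_vocabulary(tokenize(segmented))
print(len(vocab), vocab)

# The alternative pattern keeps the single-character words as well
all_tokens = tokenize(segmented, SINGLE_CHAR_PATTERN)
print(len(all_tokens), all_tokens)
```

If the single-character words matter for a task, passing `token_pattern=r"(?u)\b\w+\b"` to CountVectorizer should keep them.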

3. Converting text data to arrays with the bag-of-words model

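# Build the bag-of-words representation
bag_of_words = vect.transform(cn)
# Print the bag-of-words features
print('Bag-of-words features:\n{}'.format(repr(bag_of_words)))

Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 6 stored elements in Compressed Sparse Row format>

# Print the dense representation of the bag of words
print('Dense representation of the bag of words:\n{}'.format(bag_of_words.toarray()))

Dense representation of the bag of words:
[[1 1 1 1 1 1]]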

# Enter a new Chinese text
cn_1 = jieba.cut('懶惰的狐狸不如敏捷的狐狸敏捷,敏捷的狐狸不如懶惰的狐狸懶惰')
# Separate the words with spaces
cn2 = [' '.join(cn_1)]
# Print the result
print(cn2)

['懶惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 , 敏捷 的 狐狸 不如 懶惰 的 狐狸 懶惰']

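# Build the bag-of-words representation of the new text
new_bag = vect.transform(cn2)
# Print the results
print('Bag-of-words features:\n{}'.format(repr(new_bag)))
print('Dense representation of the bag of words:\n{}'.format(new_bag.toarray()))

Bag-of-words features:
<1x6 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>
Dense representation of the bag of words:
[[0 3 3 0 4 0]]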
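
The array can be read against the vocabulary fitted earlier: position 1 is 懶惰, position 2 is 敏捷, and position 4 is 狐狸, while 不如 is ignored because it was not in the fitted vocabulary. A minimal pure-Python sketch of that counting step follows, assuming the vocabulary printed above. The `transform` helper is hypothetical and only mimics the idea behind `vect.transform`.

```python
import re
from collections import Counter

# Vocabulary as printed earlier by vect.vocabulary_
vocabulary = {'敏捷': 2, '棕色': 3, '狐狸': 4, '跳過': 5, '一隻': 0, '懶惰': 1}


def transform(text, vocabulary):
    """Count how often each vocabulary word occurs in a space-separated text."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text)
    # Words outside the fitted vocabulary are silently ignored
    counts = Counter(token for token in tokens if token in vocabulary)
    vector = [0] * len(vocabulary)
    for token, count in counts.items():
        vector[vocabulary[token]] = count
    return vector


new_text = '懶惰 的 狐狸 不如 敏捷 的 狐狸 敏捷 , 敏捷 的 狐狸 不如 懶惰 的 狐狸 懶惰'
vector = transform(new_text, vocabulary)
print(vector)

# Number of non-zero entries, comparable to the "stored elements" of the sparse matrix
stored = sum(1 for value in vector if value)
print(stored)
```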

Further optimizing text data processing

1. Improving the bag-of-words model with n-grams

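# Improve the bag-of-words model with n-grams
# Start with a sample sentence
joke = jieba.cut('小明看見了小李騎了夏麗的腳踏車')
# Insert spaces
joke = [' '.join(joke)]
# Convert to a vector
vect.fit(joke)
joke_feature = vect.transform(joke)
# Print the text features
print('Feature representation of this sentence:\n{}'.format(joke_feature.toarray()))

Feature representation of this sentence:
[[1 1 1 1 1]]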

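# Shuffle the word order of the previous sentence
joke2 = jieba.cut('小李看見夏麗騎了小明的腳踏車')
# Insert spaces
joke2 = [' '.join(joke2)]
# Extract the features
joke2_feature = vect.transform(joke2)
# Print the text features
print('Feature representation of this sentence:\n{}'.format(joke2_feature.toarray()))

Feature representation of this sentence:
[[1 1 0 1 1]]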

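# Change the ngram_range parameter of CountVectorizer
vect = CountVectorizer(ngram_range=(2,2))
# Redo the feature extraction
cv = vect.fit(joke)
joke_feature = cv.transform(joke)
# Print the new results
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print('Vocabulary after adjusting the n-gram parameter: {}'.format(cv.get_feature_names_out().tolist()))
print('New feature representation: {}'.format(joke_feature.toarray()))

Vocabulary after adjusting the n-gram parameter: ['夏麗 腳踏車', '小明 看見', '李騎 夏麗', '看見 李騎']
New feature representation: [[1 1 1 1]]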

# Reorder the words in the sentence
joke2 = jieba.cut('小李看見夏麗騎了小明的腳踏車')
# Insert spaces
joke2 = [' '.join(joke2)]
# Extract the text features
joke2_feature = vect.transform(joke2)
print('New feature representation: {}'.format(joke2_feature.toarray()))

New feature representation: [[0 0 0 0]]

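With single-word features the two sentences looked almost identical ([[1 1 1 1 1]] versus [[1 1 0 1 1]]). After setting CountVectorizer's ngram_range to (2,2), the features are pairs of adjacent words, so word order is captured and the reordered sentence shares no features with the original. The model no longer treats the two sentences as meaning the same thing, so in this example the n-gram model improves the text feature extraction.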
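
The bigram effect can be reproduced without scikit-learn. The sketch below assumes the two token lists: the first is reconstructed from the bigram vocabulary printed above, and the second is a guess at how jieba segments the reordered sentence after single-character words are dropped, so the actual segmentation may differ. The `ngrams` helper is a hypothetical name.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed tokens (two or more characters) for the two sentences
tokens_a = ['小明', '看見', '李騎', '夏麗', '腳踏車']
tokens_b = ['小李', '看見', '夏麗', '小明', '腳踏車']

# Single words: the two sentences share most of their vocabulary
shared_words = set(tokens_a) & set(tokens_b)
print(len(shared_words), sorted(shared_words))

# Pairs of adjacent words: nothing is shared once order matters
bigrams_a = ngrams(tokens_a, 2)
bigrams_b = ngrams(tokens_b, 2)
shared_bigrams = set(bigrams_a) & set(bigrams_b)
print(sorted(bigrams_a))
print(sorted(bigrams_b))
print(len(shared_bigrams))
```

Larger values of n capture more context but enlarge the vocabulary quickly, which is why ranges such as (1,2) are a common compromise.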

2. Processing text data with the tf-idf model

# Show the directory tree of the ACLIMDB dataset
!tree ACLIMDB

Folder PATH listing for volume Data
Volume serial number is 06B1-81F6
D:\JUPYTERNOTEBOOK\ACLIMDB
├─test
│  ├─neg
│  └─pos
└─train
    ├─neg
    ├─pos
    └─unsup

# Import the CountVectorizer vectorization tool
from sklearn.feature_extraction.text import CountVectorizer
# Import the file-loading tool
from sklearn.datasets import load_files
# Define the training set
# Note: load_files treats every subfolder as a class, so the unlabeled
# reviews in train/unsup are loaded as a third class alongside neg and pos
train_set = load_files('aclImdb/train')
X_train, y_train = train_set.data, train_set.target
# Print the number of files in the training set
print('Number of training files: {}'.format(len(X_train)))
# Print one review picked arbitrarily
print('A sample review:',X_train[22])

Number of training files: 75000
A sample review: b"Okay, once you get past the fact that Mitchell and Petrillo are Dean and Jerry knockoffs, you could do worse than this film. Charlita as Princess Nona is great eye candy, Lugosi does his best with the material he's given, and the production values, music especially (except for the vocals) are better than you'd think for the $50k cost of production. The final glimpses of the characters are a hoot. Written by Tim Ryan, a minor actor in late Charlie Chan films, and husband of Grannie on the Beverly Hillbillies. All in all, WAY better than many late Lugosi cheapies."
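
`load_files` treats every subfolder of the given directory as a class, so loading `aclImdb/train` as above also brings in the unlabeled reviews under `unsup` as a third class, while `aclImdb/test` has only `neg` and `pos`. This mismatch may be one reason for the low test scores reported later. `load_files` accepts a `categories` argument (for example `categories=['neg', 'pos']`) to restrict which subfolders are loaded. The sketch below uses only the standard library to check which subfolders would become classes. It builds a small mock directory tree in a temporary folder, and `class_folders` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def class_folders(root):
    """Return the sorted names of the subfolders directly under root."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


with tempfile.TemporaryDirectory() as tmp:
    # Mock layout mirroring the tree shown above (no real reviews inside)
    layout = {'train': ['neg', 'pos', 'unsup'], 'test': ['neg', 'pos']}
    for split, labels in layout.items():
        for label in labels:
            (Path(tmp) / split / label).mkdir(parents=True)

    train_classes = class_folders(Path(tmp) / 'train')
    test_classes = class_folders(Path(tmp) / 'test')

print('train classes:', train_classes)
print('test classes:', test_classes)

# Classes present in the training folder but absent from the test folder
extra = sorted(set(train_classes) - set(test_classes))
print('only in train:', extra)
```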

# Load the test set
test = load_files('aclImdb/test/')
X_test,y_test = test.data,test.target
# Print the number of files in the test set
print(len(X_test))

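# Fit CountVectorizer to the training set
vect = CountVectorizer().fit(X_train)
# Convert the training text to vectors
X_train_vect = vect.transform(X_train)
# Convert the test set to vectors
X_test_vect = vect.transform(X_test)
# Print the number of features in the training set
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
print('Number of training-set features: {}'.format(len(vect.get_feature_names_out())))
# Print the last 10 training-set features
print('Last 10 training-set features: {}'.format(vect.get_feature_names_out()[-10:].tolist()))

Number of training-set features: 124255
Last 10 training-set features: ['üvegtigris', 'üwe', 'ÿou', 'ıslam', 'ōtomo', 'şey', 'дом', 'книги', '色戒', 'ｒｏｃｋ']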

# Import the tf-idf transformer
from sklearn.feature_extraction.text import TfidfTransformer
# Transform the training and test sets with tf-idf
tfidf = TfidfTransformer(smooth_idf = False)
tfidf.fit(X_train_vect)
X_train_tfidf = tfidf.transform(X_train_vect)
X_test_tfidf = tfidf.transform(X_test_vect)
# Print the features before and after processing for comparison
print('Features before tf-idf:\n',X_train_vect[:5,:5].toarray())
print('Features after tf-idf:\n',X_train_tfidf[:5,:5].toarray())

Features before tf-idf:
 [[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
Features after tf-idf:
 [[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
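
The printed slices are all zeros because the first five vocabulary terms do not occur in the first five reviews, so they do not show what the transformation does. With `smooth_idf=False`, scikit-learn documents the weight as idf(t) = ln(n / df(t)) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Each count is multiplied by its idf, and each row is then scaled to unit Euclidean length by default. Below is a pure-Python sketch of that calculation on a small made-up count matrix. The `tfidf_rows` helper and the numbers are illustrative assumptions, not the IMDB data.

```python
import math


def tfidf_rows(counts):
    """tf-idf with idf = ln(n / df) + 1, followed by L2 normalization of each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term appears
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Terms that never occur get idf 0 here to avoid dividing by zero
    idf = [math.log(n_docs / d) + 1 if d else 0.0 for d in df]
    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm if norm else 0.0 for v in weighted])
    return result


# Six tiny "documents" over a three-term vocabulary (made-up counts)
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
]

weights = tfidf_rows(counts)
for row in weights:
    print([round(v, 3) for v in row])
```

The first term appears in every document, so its idf is ln(1) + 1 = 1 and it is down-weighted relative to the rarer second and third terms.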

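# Import the linear SVC classifier
from sklearn.svm import LinearSVC
# Import the cross-validation tool
from sklearn.model_selection import cross_val_score
# Score the model on the raw counts with cross-validation (baseline)
scores = cross_val_score(LinearSVC(),X_train_vect,y_train,cv=3)
# Retrain the linear SVC model on the tf-idf features
clf = LinearSVC().fit(X_train_tfidf,y_train)
# Cross-validate again on the tf-idf features
scores2 = cross_val_score(LinearSVC(),X_train_tfidf,y_train,cv=3)
# Print the new scores for comparison
# (scores holds the raw-count baseline; scores2 is the tf-idf result reported here)
print('Training-set cross-validation score after tf-idf: {:.3f}'.format(scores2.mean()))
print('Test-set score after tf-idf: {:.3f}'.format(clf.score(X_test_tfidf,y_test)))

Training-set cross-validation score after tf-idf: 0.660
Test-set score after tf-idf: 0.144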

3. Removing stop words from text

# Import the built-in stop-word list
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
# Print the number of stop words
print('Number of stop words:',len(ENGLISH_STOP_WORDS))
# Print the first 20 and the last 20 stop words
# (ENGLISH_STOP_WORDS is a frozenset, so the order shown is arbitrary)
print('First 20 and last 20:\n',list(ENGLISH_STOP_WORDS)[:20],list(ENGLISH_STOP_WORDS)[-20:])

Number of stop words: 318
First 20 and last 20:
 ['interest', 'meanwhile', 'do', 'thereupon', 'can', 'cry', 'upon', 'then', 'first', 'six', 'except', 'our', 'noone', 'being', 'done', 'afterwards', 'any', 'even', 'after', 'otherwise'] ['seemed', 'top', 'as', 'all', 'found', 'very', 'nor', 'seem', 'via', 'these', 'been', 'beforehand', 'behind', 'becomes', 'un', 'ten', 'onto', 'ourselves', 'an', 'keep']
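
Stop-word removal itself is a simple filter applied after tokenization. The sketch below assumes a tiny stop-word set made of six words taken from the list printed above. `STOP_WORDS` and `remove_stop_words` are hypothetical names, and the sample sentence is made up for illustration.

```python
import re

# A small hypothetical subset of stop words, taken from the list printed above
STOP_WORDS = frozenset(['these', 'can', 'seem', 'very', 'even', 'after'])


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, tokenize it, and drop tokens that are stop words."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in stop_words]


sentence = 'These reviews can seem very long even after editing'
kept = remove_stop_words(sentence)
print(kept)
```

Only the content-bearing words remain, which shrinks the vocabulary and can reduce noise for the classifier.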

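# Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Enable the English stop-word list
tfidf = TfidfVectorizer(smooth_idf = False,stop_words = 'english')
# Fit to the training set
tfidf.fit(X_train)
# Convert the training text to vectors
X_train_tfidf = tfidf.transform(X_train)
# Score with cross-validation
scores3 = cross_val_score(LinearSVC(),X_train_tfidf,y_train,cv=3)
clf.fit(X_train_tfidf,y_train)
# Convert the test set to vectors
X_test_tfidf = tfidf.transform(X_test)
# Print the cross-validation score and the test-set score
print('Training-set cross-validation mean score without stop words: {:.6f}'.format(scores3.mean()))
print('Test-set score without stop words: {:.6f}'.format(clf.score(X_test_tfidf,y_test)))

Training-set cross-validation mean score without stop words: 0.723933
Test-set score without stop words: 0.150920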

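Summary

scikit-learn has two classes that implement tf-idf. TfidfTransformer converts the feature matrix that CountVectorizer extracts from text. TfidfVectorizer is used in the same way as CountVectorizer and is equivalent to combining the work of CountVectorizer and TfidfTransformer in a single step.

The most commonly used Python toolkit in natural language processing is NLTK. It also supports word tokenization and text tagging, as well as stemming and lemmatization.

To go further, look into topic modeling and document clustering.

In deep learning, word2vec is the tool most commonly used for natural language processing, and it is worth exploring if you are interested.

Adapted from the book 《深入淺出Python機器學習》.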