spaCy is a popular, easy-to-use natural language processing library for Python. It offers high accuracy and is extremely fast. However, because spaCy is still a relatively new NLP library, it has not yet been adopted as widely as NLTK, and there are not many tutorials for it. In this article, we show how to implement text classification with spaCy, with a link to the complete implementation at the end.
For young researchers, finding and short-listing suitable academic conferences to submit to can be quite time-consuming. We first download the conference proceedings dataset, and then classify the papers by conference.
Let's take a quick look at the data:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import base64
import string
import re
from collections import Counter
from nltk.corpus import stopwords
stopwords = stopwords.words('english')

df = pd.read_csv('research_paper.csv')
df.head()
The result looks like this:
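(The head() output itself is not reproduced here; the frame has two columns, Title with the paper title and Conference with the conference label.)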
We can confirm that the dataset has no missing values with the following code:
df.isnull().sum()
The result:
Title         0
Conference    0
dtype: int64
Now we split the data into a training set and a test set:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.33, random_state=42)

print('Research title sample:', train['Title'].iloc[0])
print('Conference of this paper:', train['Conference'].iloc[0])
print('Training Data Shape:', train.shape)
print('Testing Data Shape:', test.shape)
The output:
Research title sample: Cooperating with Smartness: Using Heterogeneous Smart Antennas in Ad-Hoc Networks.
Conference of this paper: INFOCOM
Training Data Shape: (1679, 2)
Testing Data Shape: (828, 2)
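If the conference classes are unevenly represented, a stratified split keeps their proportions the same in both sets. This is a possible variant of the split above, not what the original tutorial does:

train, test = train_test_split(df, test_size=0.33, random_state=42,
                               stratify=df['Conference'])  # preserve per-conference ratios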
The dataset contains 2,507 paper titles, each already labeled with one of 5 conferences. The chart below gives an overview of how the papers are distributed across the conferences:
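The chart itself is not reproduced here; a minimal sketch of how such a distribution plot could be produced with the seaborn and matplotlib imports above:

fig = plt.figure(figsize=(8, 4))
sns.countplot(x='Conference', data=df)  # number of paper titles per conference
plt.title('Distribution of papers across conferences')
plt.show()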
The following code shows one way to preprocess text with spaCy. After that, we look for the words used most often in papers from the first two conferences (INFOCOM & ISCAS):
import spacy

nlp = spacy.load('en_core_web_sm')
punctuations = string.punctuation

def cleanup_text(docs, logging=False):
    # Lemmatize, lowercase, and strip stop words/punctuation from each document.
    texts = []
    counter = 1
    for doc in docs:
        if counter % 1000 == 0 and logging:
            print("Processed %d out of %d documents." % (counter, len(docs)))
        counter += 1
        doc = nlp(doc, disable=['parser', 'ner'])
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
        tokens = ' '.join(tokens)
        texts.append(tokens)
    return pd.Series(texts)

INFO_text = [text for text in train[train['Conference'] == 'INFOCOM']['Title']]
IS_text = [text for text in train[train['Conference'] == 'ISCAS']['Title']]

INFO_clean = cleanup_text(INFO_text)
INFO_clean = ' '.join(INFO_clean).split()
IS_clean = cleanup_text(IS_text)
IS_clean = ' '.join(IS_clean).split()

INFO_counts = Counter(INFO_clean)
IS_counts = Counter(IS_clean)

INFO_common_words = [word[0] for word in INFO_counts.most_common(20)]
INFO_common_counts = [word[1] for word in INFO_counts.most_common(20)]

fig = plt.figure(figsize=(18, 6))
sns.barplot(x=INFO_common_words, y=INFO_common_counts)
plt.title('Most Common Words used in the research papers for conference INFOCOM')
plt.show()
The result for INFOCOM:
Next we do the same for ISCAS:
IS_common_words = [word[0] for word in IS_counts.most_common(20)]
IS_common_counts = [word[1] for word in IS_counts.most_common(20)]

fig = plt.figure(figsize=(18, 6))
sns.barplot(x=IS_common_words, y=IS_common_counts)
plt.title('Most Common Words used in the research papers for conference ISCAS')
plt.show()
The result:
The top words for INFOCOM are "networks" and "network"; this is unsurprising, since INFOCOM is a conference in the networking field. The top words for ISCAS are "base" and "design", which suggests that ISCAS is a conference about topics such as database and system design.
First, we load the spaCy model and create the language processing object:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
import string
import re
import spacy

spacy.load('en')
from spacy.lang.en import English
parser = English()
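A side note for readers on newer spaCy versions: the 'en' shortcut used by spacy.load('en') was removed in spaCy 3.x (and the loaded model is never used in this snippet anyway). A sketch of an equivalent setup on a recent release, assuming en_core_web_sm is installed:

import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')  # full pipeline; replaces the old 'en' shortcut
parser = English()  # blank tokenizer-only pipeline, as used by tokenizeText below
# Caveat: a blank English() pipeline has no lemmatizer, so on spaCy 3.x
# tok.lemma_ may come back empty; running titles through nlp is safer there.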
Here is another way to clean text with spaCy:
STOPLIST = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS))
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "...", "”", "”"]

class CleanTextTransformer(TransformerMixin):
    # A scikit-learn transformer that applies cleanText to every document.
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        return {}

def cleanText(text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text

def tokenizeText(sample):
    # Tokenize with spaCy, lemmatize, and drop stop words and symbols.
    tokens = parser(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in STOPLIST]
    tokens = [tok for tok in tokens if tok not in SYMBOLS]
    return tokens
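To sanity-check the tokenizer, it can help to run it over a sample title. The exact tokens shown in the comment are an assumption and will vary with the spaCy model and version:

sample = "Cooperating with Smartness: Using Heterogeneous Smart Antennas in Ad-Hoc Networks."
print(tokenizeText(sample))
# Expect lowercase lemmas with stop words and punctuation removed,
# e.g. something like: ['cooperate', 'smartness', 'heterogeneous',
#                       'smart', 'antenna', 'ad', 'hoc', 'network']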
Next we define a function to print the most informative features, i.e. the features with the largest coefficients:
def printNMostInformative(vectorizer, clf, N):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)

vectorizer = CountVectorizer(tokenizer=tokenizeText, ngram_range=(1, 1))
clf = LinearSVC()
pipe = Pipeline([('cleanText', CleanTextTransformer()),
                 ('vectorizer', vectorizer),
                 ('clf', clf)])

# data
train1 = train['Title'].tolist()
labelsTrain1 = train['Conference'].tolist()
test1 = test['Title'].tolist()
labelsTest1 = test['Conference'].tolist()

# train
pipe.fit(train1, labelsTrain1)

# test
preds = pipe.predict(test1)
print("accuracy:", accuracy_score(labelsTest1, preds))
print("Top 10 features used to predict: ")
printNMostInformative(vectorizer, clf, 10)

pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])
transform = pipe.fit_transform(train1, labelsTrain1)

vocab = vectorizer.get_feature_names()
for i in range(len(train1)):
    s = ""
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx, num in zip(indexIntoVocab, numOccurences):
        s += str((vocab[idx], num))
The output:
accuracy: 0.7463768115942029
Top 10 features used to predict:
Class 1 best:
(-0.9286024231429632, 'database')
(-0.8479561292796286, 'chip')
(-0.7675978546440636, 'wimax')
(-0.6933516302055982, 'object')
(-0.6728543084136545, 'functional')
(-0.6625144315722268, 'multihop')
(-0.6410217867606485, 'amplifier')
(-0.6396374843938725, 'chaotic')
(-0.6175855765947755, 'receiver')
(-0.6016682542232492, 'web')
Class 2 best:
(1.1835964521070819, 'speccast')
(1.0752051052570133, 'manets')
(0.9490176624004726, 'gossip')
(0.8468395015456092, 'node')
(0.8433107444740003, 'packet')
(0.8370516260734557, 'schedule')
(0.8344139814680707, 'multicast')
(0.8332232077559836, 'queue')
(0.8255429594734555, 'qos')
(0.8182435133796081, 'location')
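One caveat: printNMostInformative only inspects clf.coef_[0], the one-vs-rest weight vector of the first class in clf.classes_, even though this is a 5-class problem. A hedged sketch for listing the top features of every conference (top_features_per_class is a helper introduced here, not part of the original tutorial):

import numpy as np

def top_features_per_class(vectorizer, clf, n=10):
    # A fitted LinearSVC has one one-vs-rest weight vector per class;
    # clf.coef_ has shape (n_classes, n_features), aligned with clf.classes_.
    feature_names = vectorizer.get_feature_names()
    for label, weights in zip(clf.classes_, clf.coef_):
        top = np.argsort(weights)[::-1][:n]
        print(label, [feature_names[i] for i in top])

top_features_per_class(vectorizer, clf, 10)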
Next we compute precision, recall, and F1 score:
from sklearn import metrics

print(metrics.classification_report(labelsTest1, preds,
                                    target_names=df['Conference'].unique()))
The output:
             precision    recall  f1-score   support

       VLDB       0.75      0.77      0.76       159
      ISCAS       0.90      0.84      0.87       299
   SIGGRAPH       0.67      0.66      0.66       106
    INFOCOM       0.62      0.69      0.65       139
        WWW       0.62      0.62      0.62       125

avg / total       0.75      0.75      0.75       828
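One thing worth double-checking: classification_report lists classes in sorted label order when no labels argument is given, while target_names are applied positionally, so passing df['Conference'].unique() (appearance order) can attach the wrong name to a row. A safer variant, which pins the order explicitly, is a suggested fix rather than the original code:

conference_labels = sorted(df['Conference'].unique())
print(metrics.classification_report(labelsTest1, preds,
                                    labels=conference_labels,
                                    target_names=conference_labels))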
That's it: we have classified the research papers with spaCy. The complete source code can be downloaded from GITHUB.
Original article (in Chinese): Spacy實現文本分類 - 匯智網