Practice link: https://www.kaggle.com/c/ds100fa19
1. Import packages
```python
import pandas as pd
import spacy

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
```
2. Data preview
```python
train.head(10)

train = train.fillna(" ")
test = test.fillna(" ")
```
Note: handle the NaN values first, otherwise spaCy raises an error later; see the link:
spaCy error "gold.pyx in spacy.gold.GoldParse.init()" solution: https://michael.blog.csdn.net/article/details/109106806
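As a quick guard, one can assert that no NaN survives the fillna step before the text reaches spaCy; a minimal sketch (the column names subject and email match the ones used below):

```python
# sanity check: fillna(" ") should have removed every NaN from the text columns
assert not train[['subject', 'email']].isna().any().any()
assert not test[['subject', 'email']].isna().any().any()
```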
3. Feature combination
- Combine the email subject and body, and build the labels
```python
train['all'] = train['subject'] + train['email']
train['label'] = [{"spam": bool(y), "ham": not bool(y)} for y in train.spam.values]
train.head(10)
```
The label format may look odd at first; spaCy's textcat component expects each example's gold labels as a dict mapping every category name to a boolean (wrapped under a 'cats' key, as done below).
- Split into training and validation sets, using stratified sampling
```python
from sklearn.model_selection import StratifiedShuffleSplit
# help(StratifiedShuffleSplit)

splt = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
for train_idx, valid_idx in splt.split(train, train['spam']):  # stratified by the second argument
    train_set = train.iloc[train_idx]
    valid_set = train.iloc[valid_idx]

# check the distributions
print(train_set['spam'].value_counts()/len(train_set))
print(valid_set['spam'].value_counts()/len(valid_set))
```
Output: the label distributions of the two sets are almost identical.
```
0    0.743636
1    0.256364
Name: spam, dtype: float64
0    0.743713
1    0.256287
Name: spam, dtype: float64
```
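As an aside, the same stratified 80/20 split can be written more compactly with scikit-learn's train_test_split; a minimal sketch (the names train_set2 and valid_set2 are only for illustration):

```python
from sklearn.model_selection import train_test_split

# stratify=train['spam'] preserves the class proportions in both splits,
# just like StratifiedShuffleSplit above
train_set2, valid_set2 = train_test_split(
    train, test_size=0.2, stratify=train['spam'], random_state=1)
print(valid_set2['spam'].value_counts(normalize=True))
```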
- Separate the text and the labels
```python
train_text = train_set['all'].values
train_label = train_set['label']
valid_text = valid_set['all'].values
valid_label = valid_set['label']

# the labels need one more step: wrap them under a 'cats' key,
# which is a name the spaCy API expects
train_label = [{"cats": label} for label in train_label]
valid_label = [{"cats": label} for label in valid_label]

# zip texts and labels together for training, then convert to a list
train_data = list(zip(train_text, train_label))
test_text = (test['subject'] + test['email']).values

print(train_label[0])
```
Output:
```
{'cats': {'spam': False, 'ham': True}}
```
4. Modeling
- Create the model and the pipeline
```python
nlp = spacy.blank('en')  # create a blank English model

email_cat = nlp.create_pipe('textcat',
                            # config={
                            #     "exclusive_classes": True,  # mutually exclusive, i.e. binary classification
                            #     "architecture": "bow"
                            # }
                            )
# the argument 'textcat' cannot be arbitrary; it is a built-in pipe name
# the config above is optional; I did not find documentation on how to configure it
help(nlp.create_pipe)
```
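For reference, in spaCy v2 the commented-out config can be passed as a plain dict; a sketch using the same values as the comment above (whether those values are required here is an assumption, not something I verified):

```python
# alternative to the call above, spaCy v2 style: "bow" is the fast
# bag-of-words architecture, and exclusive_classes=True declares the
# two labels as mutually exclusive
email_cat = nlp.create_pipe(
    'textcat',
    config={"exclusive_classes": True, "architecture": "bow"})
```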
- Add the pipe
```python
nlp.add_pipe(email_cat)
```
- Add the labels
```python
# note the order: ham is 0, spam is 1
email_cat.add_label('ham')
email_cat.add_label('spam')
```
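The order matters because the score matrix returned later by predict() has one column per label, in insertion order; a small check, assuming spaCy v2's TextCategorizer.labels property:

```python
# column 0 of the score matrix corresponds to 'ham', column 1 to 'spam'
print(email_cat.labels)  # expected: ('ham', 'spam')
```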
- Training
```python
from spacy.util import minibatch
import random

def train(model, train, optimizer, batch_size=8):
    loss = {}
    random.seed(1)
    random.shuffle(train)  # shuffle the data randomly
    batches = minibatch(train, size=batch_size)  # split the data into batches
    for batch in batches:
        text, label = zip(*batch)
        model.update(text, label, sgd=optimizer, losses=loss)
    return loss
```
- Prediction
```python
def predict(model, text):
    docs = [model.tokenizer(txt) for txt in text]  # tokenize the texts first
    emailpred = model.get_pipe('textcat')
    score, _ = emailpred.predict(docs)
    pred_label = score.argmax(axis=1)
    return pred_label
```
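If the label strings are needed instead of the 0/1 indices, the argmax can be mapped back through the pipe's label order; a minimal sketch (pred_to_names is a hypothetical helper, not part of the original code):

```python
def pred_to_names(model, pred_label):
    # hypothetical helper: translate argmax indices into 'ham' / 'spam'
    labels = model.get_pipe('textcat').labels
    return [labels[i] for i in pred_label]
```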
- Evaluation
```python
def evaluate(model, text, label):
    pred = predict(model, text)
    true_class = [int(lab['cats']['spam']) for lab in label]
    correct = (pred == true_class)
    acc = sum(correct) / len(correct)  # accuracy
    return acc
```
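Since the classes are imbalanced (about 74% ham vs. 26% spam), per-class precision and recall can be more informative than accuracy alone; a sketch built on the predict() function above, using scikit-learn's classification_report (the helper name report is my own):

```python
from sklearn.metrics import classification_report

def report(model, text, label):
    # same inputs as evaluate(); prints precision/recall/F1 per class
    pred = predict(model, text)
    true_class = [int(lab['cats']['spam']) for lab in label]
    print(classification_report(true_class, pred, target_names=['ham', 'spam']))
```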
5. Training
```python
n = 20
opt = nlp.begin_training()  # define the optimizer
for i in range(n):
    loss = train(nlp, train_data, opt)
    acc = evaluate(nlp, valid_text, valid_label)
    print(f"Loss: {loss['textcat']:.3f} \t Accuracy: {acc:.3f}")
```
Output:
```
Loss: 1.132 	 Accuracy: 0.941
Loss: 0.283 	 Accuracy: 0.988
Loss: 0.121 	 Accuracy: 0.993
Loss: 0.137 	 Accuracy: 0.993
Loss: 0.094 	 Accuracy: 0.982
Loss: 0.069 	 Accuracy: 0.995
Loss: 0.060 	 Accuracy: 0.990
Loss: 0.010 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.992
Loss: 0.004 	 Accuracy: 0.991
Loss: 0.004 	 Accuracy: 0.991
Loss: 0.308 	 Accuracy: 0.981
Loss: 0.158 	 Accuracy: 0.987
Loss: 0.014 	 Accuracy: 0.990
Loss: 0.007 	 Accuracy: 0.990
Loss: 0.043 	 Accuracy: 0.990
```
6. Prediction
```python
pred = predict(nlp, test_text)
```
- Write the submission file
```python
id = test['id']
output = pd.DataFrame({'id': id, 'Class': pred})
output.to_csv("submission.csv", index=False)
```
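Before uploading, a quick look at the predicted class distribution can catch obvious mistakes, since it should roughly match the ~26% spam rate seen in training; a minimal sketch:

```python
# the predicted spam share should be roughly in line with the training data
print(pd.Series(pred).value_counts(normalize=True))
```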
The model reaches over 99% accuracy on the test set!
My CSDN blog: https://michael.blog.csdn.net/
Long-press or scan the QR code to follow my WeChat official account (Michael阿明). Let's keep working hard and learning together!