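train.tsv and test.tsv contain sentiment text collected from the web; each text is simply word-segmented and the tokens are joined with spaces. The training set and the test set each contain 10,000 examples.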
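As a minimal sketch of that preprocessing step, the snippet below segments each text and writes one tab-separated line per example. The label-then-text column order, the label names, and the helper names `segment` and `write_tsv` are assumptions for illustration and are not taken from the original dataset. jieba is the default segmenter only because the prediction script later in this post uses it.

```python
import io


def segment(text, cut=None):
    """Return `text` as space-joined tokens.

    `cut` is any callable that yields tokens; it defaults to jieba.cut
    (assumed to be installed), the segmenter used at prediction time.
    """
    if cut is None:
        import jieba  # imported lazily so a custom tokenizer needs no jieba
        cut = jieba.cut
    # Drop whitespace-only tokens so the output uses single-space separators.
    return " ".join(tok for tok in cut(text) if tok.strip())


def write_tsv(rows, fileobj, cut=None):
    """Write (label, text) pairs as `label<TAB>segmented text` lines."""
    for label, text in rows:
        # Tabs and newlines inside the text would break the TSV layout.
        clean = text.replace("\t", " ").replace("\n", " ")
        fileobj.write(f"{label}\t{segment(clean, cut)}\n")


if __name__ == "__main__":
    # Hypothetical examples; `list` splits per character as a stand-in tokenizer.
    buf = io.StringIO()
    write_tsv([("positive", "很好喝"), ("negative", "太差了")], buf, cut=list)
    print(buf.getvalue(), end="")
```

Passing the tokenizer as a parameter keeps training-time and prediction-time segmentation easy to align: whatever callable is used here should also be used before calling the predictor.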
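The PyText framework is built from the components Task, Trainer, Model, DataHandler, and Exporter, which handle task switching, model training, model architecture, data processing, and model export respectively. All of them inherit from a class named Component.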
(Image source: pytext-pytext.readthedocs-hosted.com/en/latest/o…)
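A Component can read a JSON configuration file, which sets parameters used during training such as the inputs and the learning rate. Following the official text classification tutorial, we need to implement almost none of the model, input, or output code; we only need to prepare the dataset.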
The contents of docnn.json are as follows:
{
  "task": {
    "DocClassificationTask": {
      "data_handler": {
        "train_path": "train.tsv",
        "eval_path": "test.tsv",
        "test_path": "test.tsv"
      }
    }
  }
}
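The same configuration can also be generated from Python, which helps when the data paths change between experiments. The sketch below rebuilds exactly the structure shown above and reports data files that are missing before training starts. The helper names `build_config` and `missing_files` are hypothetical and are not part of PyText.

```python
import json
from pathlib import Path


def build_config(train_path, eval_path, test_path):
    """Return the same nested structure as the docnn.json shown above."""
    return {
        "task": {
            "DocClassificationTask": {
                "data_handler": {
                    "train_path": str(train_path),
                    "eval_path": str(eval_path),
                    "test_path": str(test_path),
                }
            }
        }
    }


def missing_files(config):
    """List the data paths in the config that do not exist on disk."""
    handler = config["task"]["DocClassificationTask"]["data_handler"]
    return [path for path in handler.values() if not Path(path).is_file()]


if __name__ == "__main__":
    config = build_config("train.tsv", "test.tsv", "test.tsv")
    # Write the file that the pytext commands read from standard input.
    Path("docnn.json").write_text(json.dumps(config, indent=2), encoding="utf-8")
    print("missing data files:", missing_files(config))
```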
pytext train < docnn.json
CONFIG=docnn.json
pytext export --output-path model.c2 < "$CONFIG"
On the desktop we can now see the exported model, model.c2.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
@Author: yanqiang
@File: demo.py
@Time: 2018/12/21 19:06
@Software: PyCharm
@Description:
"""
import sys

import jieba
import pytext

config_file = sys.argv[1]
model_file = sys.argv[2]
text = sys.argv[3]

# Segment the input the same way as the training data: tokens joined by spaces.
text = " ".join(jieba.cut(text))

config = pytext.load_config(config_file)
predictor = pytext.create_predictor(config, model_file)

# Pass the inputs to PyText's prediction API
result = predictor({"raw_text": text})

# Results is a list of output blob names and their scores.
# The blob names are different for joint models vs doc models
# Since this tutorial is for both, let's check which one we should look at.
doc_label_scores_prefix = (
    'scores:' if any(r.startswith('scores:') for r in result)
    else 'doc_scores:'
)

# For now let's just output the top document label!
best_doc_label = max(
    (label for label in result if label.startswith(doc_label_scores_prefix)),
    key=lambda label: result[label][0],
# Strip the doc label prefix here
)[len(doc_label_scores_prefix):]

print("Sentiment of the input sentence: %s" % best_doc_label)
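The label-selection logic at the end of the script can be tried without a trained model. The sketch below wraps it in a function and feeds it a hand-written dictionary; the function name `best_label`, the label names, and the scores are hypothetical stand-ins, assuming the predictor returns a mapping from blob names such as `scores:<label>` to score arrays as the script implies.

```python
def best_label(result):
    """Return the label whose score blob has the highest first value."""
    # Doc models are assumed to expose 'scores:<label>' blobs and joint
    # models 'doc_scores:<label>' blobs, as in the script's comments.
    if any(name.startswith("scores:") for name in result):
        prefix = "scores:"
    else:
        prefix = "doc_scores:"
    candidates = [name for name in result if name.startswith(prefix)]
    if not candidates:
        raise ValueError("no score blobs found in predictor output")
    best = max(candidates, key=lambda name: result[name][0])
    # Strip the prefix so only the label itself is returned.
    return best[len(prefix):]


if __name__ == "__main__":
    # Hypothetical predictor output; real values come from predictor({...}).
    fake_result = {"scores:positive": [-0.2], "scores:negative": [-1.7]}
    print(best_label(fake_result))
```

Raising an error when no score blob matches makes a wrong prefix or an unexpected model type visible immediately, instead of failing inside `max` on an empty sequence.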
Let's look at the results:
# Input sentence, roughly: "I really love this Mengniu flavor"
python main.py "$CONFIG" model.c2 "超級喜歡蒙牛這個味 道"
# Input sentence, roughly: "What kind of product is this! It's terrible, isn't it?"
python main.py "$CONFIG" model.c2 "這是什麼商品啊!太 差了吧?"
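As the steps above show, PyText shortens the path from training a model to deploying it and removes a lot of tedious engineering work. The model in this example still has room for improvement, though; custom models and the use of word embeddings are worth investigating to improve classification quality.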