1. The problem to solve
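Problem: organizations often need to record minutes when they hold meetings. We want to use machine learning to automatically judge the text a user enters as qualified or unqualified. (Similar problems include spam SMS detection and work-log quality analysis.)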
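Approach: we manually review existing meeting records and label each one as qualified or unqualified, then train a model on these records. The learning algorithm is still the fast decision tree (FastTree) algorithm for binary classification. Unlike in the previous article, the input features are no longer floating-point numbers but Chinese text, which brings in text feature extraction.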
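Why do we need text feature extraction? Text is human language, and a sequence of symbols and characters cannot be passed to the algorithm directly. Learning algorithms accept only fixed-length numeric feature vectors (a float or a float array) and cannot work with variable-length text documents.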
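As a rough illustration of what "fixed-length numeric feature vector" means, here is a minimal bag-of-words sketch. It is not how the platform's featurizer works internally; the class and method names (`BagOfWordsSketch`, `BuildVocabulary`, `Vectorize`) are made up for this example, and it assumes the input words are already separated by spaces.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWordsSketch
{
    // Assigns each distinct word a column index, in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> documents)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var document in documents)
        {
            foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!vocabulary.ContainsKey(word))
                {
                    vocabulary.Add(word, vocabulary.Count);
                }
            }
        }
        return vocabulary;
    }

    // Counts word occurrences; the result always has vocabulary.Count entries,
    // no matter how long the document is. Unknown words are ignored.
    public static float[] Vectorize(string document, IReadOnlyDictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (vocabulary.TryGetValue(word, out int index))
            {
                vector[index] += 1f;
            }
        }
        return vector;
    }

    public static void Main()
    {
        var vocabulary = BuildVocabulary(new[] { "the meeting was held", "the meeting notes" });
        float[] vector = Vectorize("the meeting notes the", vocabulary);
        Console.WriteLine(string.Join(",", vector)); // prints 2,1,0,0,1
    }
}
```

Documents of any length map to a vector with one entry per vocabulary word, which is the shape a trainer such as FastTree can consume.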
Commonly used text feature extraction methods include the following:
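You only need a rough idea of what these methods do. We do not need to implement a text feature extraction algorithm ourselves; the platform's built-in method is enough.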
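The built-in text featurization method takes a string as input and expects the words in a sentence to be separated by spaces. Words in an English sentence are naturally separated by spaces, but words in a Chinese sentence are not, so we must segment the text into words first. The flow is: segment the text, join the resulting words with spaces, pass the result to the text featurizer, and then train the classifier.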
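To make the need for segmentation concrete, the following minimal sketch compares whitespace tokenization on English text, raw Chinese text, and Chinese text whose words have been joined with spaces. The segmented word list is hard-coded and hypothetical; it stands in for the output of a real segmenter, and `WhitespaceTokenSketch` is a name invented for this example.

```csharp
using System;

public static class WhitespaceTokenSketch
{
    // Splits a sentence on spaces, which is what a whitespace-based tokenizer relies on.
    public static string[] Tokenize(string sentence)
    {
        return sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string english = "the meeting was held";
        string chineseRaw = "今天召開會議";

        // Hypothetical segmenter output, hard-coded for illustration only.
        string[] segmentedWords = { "今天", "召開", "會議" };
        string chineseJoined = string.Join(" ", segmentedWords);

        Console.WriteLine(Tokenize(english).Length);       // prints 4
        Console.WriteLine(Tokenize(chineseRaw).Length);    // prints 1
        Console.WriteLine(Tokenize(chineseJoined).Length); // prints 3
    }
}
```

Without segmentation the whole Chinese sentence is treated as a single token, so the featurizer has no words to count.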
2. Code
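The overall flow of the code is basically the same as described in the previous article. For brevity, we have omitted saving and loading the model.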
First, take a look at the dataset:
The code is as follows:

    using System;
    using System.IO;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    namespace BinaryClassification_TextFeaturize
    {
        class Program
        {
            static readonly string DataPath = Path.Combine(Environment.CurrentDirectory, "Data", "meeting_data_full.csv");

            static void Main(string[] args)
            {
                MLContext mlContext = new MLContext();

                // Load the data and hold out 15% of it for testing.
                var fulldata = mlContext.Data.LoadFromTextFile<MeetingInfo>(DataPath, separatorChar: ',', hasHeader: false);
                var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.15);
                var trainData = trainTestData.TrainSet;
                var testData = trainTestData.TestSet;

                // Pipeline: segment the text, featurize it, then train FastTree.
                var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
                    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
                    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));
                ITransformer trainedModel = trainingPipeline.Fit(trainData);

                // Evaluate
                var predictions = trainedModel.Transform(testData);
                var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label");
                Console.WriteLine($"Evaluation Accuracy: {metrics.Accuracy:P2}");

                // Create the prediction engine
                var predEngine = mlContext.Model.CreatePredictionEngine<MeetingInfo, PredictionResult>(trainedModel);

                // Prediction 1
                MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委會。" };
                var predictionresult1 = predEngine.Predict(sampleStatement1);
                Console.WriteLine($"{sampleStatement1.Text}:{predictionresult1.PredictedLabel}");

                // Prediction 2
                MeetingInfo sampleStatement2 = new MeetingInfo { Text = "開展新時代中國特點社會主義思想三十講黨員答題活動。" };
                var predictionresult2 = predEngine.Predict(sampleStatement2);
                Console.WriteLine($"{sampleStatement2.Text}:{predictionresult2.PredictedLabel}");

                Console.WriteLine("Press any key to exit!");
                Console.ReadKey();
            }
        }

        public class MeetingInfo
        {
            [LoadColumn(0)]
            public bool Label { get; set; }

            [LoadColumn(1)]
            public string Text { get; set; }
        }

        public class PredictionResult : MeetingInfo
        {
            public string JiebaText { get; set; }
            public float[] Features { get; set; }
            public bool PredictedLabel;
            public float Score;
            public float Probability;
        }
    }
3. Code analysis
I will not repeat what was already covered in the previous article; the focus here is on how the learning pipeline is built.
    var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
        .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
        .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));
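First, before converting text to features, we need to segment it into words. You could preprocess the sample data, store the segmented result, and train on that. We did not take that approach; instead we defined a custom data transformation step in the pipeline that performs the segmentation. It is defined as follows: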
    namespace BinaryClassification_TextFeaturize
    {
        public class JiebaLambdaInput
        {
            public string Text { get; set; }
        }

        public class JiebaLambdaOutput
        {
            public string JiebaText { get; set; }
        }

        public class JiebaLambda
        {
            // Segments the input text and joins the words with spaces.
            public static void MyAction(JiebaLambdaInput input, JiebaLambdaOutput output)
            {
                JiebaNet.Segmenter.JiebaSegmenter jiebaSegmenter = new JiebaNet.Segmenter.JiebaSegmenter();
                output.JiebaText = string.Join(" ", jiebaSegmenter.Cut(input.Text));
            }
        }
    }
Finally, we created two objects to make actual predictions:
    // Prediction 1
    MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委會。" };
    var predictionresult1 = predEngine.Predict(sampleStatement1);
    Console.WriteLine($"{sampleStatement1.Text}:{predictionresult1.PredictedLabel}");

    // Prediction 2
    MeetingInfo sampleStatement2 = new MeetingInfo { Text = "開展新時代中國特點社會主義思想三十講黨員答題活動。" };
    var predictionresult2 = predEngine.Predict(sampleStatement2);
    Console.WriteLine($"{sampleStatement2.Text}:{predictionresult2.PredictedLabel}");
The prediction results are as follows:
4. Debugging
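As the previous article mentioned, calling the Transform method transforms every record. What does the transformed dataset look like? We can write a small debugging routine to take a look.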
    // In Main, after the model has been trained:
    var predictions = trainedModel.Transform(testData);
    DebugData(mlContext, predictions);

    // Helper method in the Program class (needs using System.Collections.Generic).
    private static void DebugData(MLContext mlContext, IDataView predictions)
    {
        var trainDataShow = new List<PredictionResult>(
            mlContext.Data.CreateEnumerable<PredictionResult>(predictions, reuseRowObject: false, ignoreMissingColumns: true));

        foreach (var dataline in trainDataShow)
        {
            dataline.PrintToConsole();
        }
    }

    public class PredictionResult
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;

        public void PrintToConsole()
        {
            Console.WriteLine($"JiebaText={JiebaText}");
            Console.WriteLine($"PredictedLabel:{PredictedLabel},Score:{Score},Probability:{Probability}");

            // Check for null before reading Length to avoid a NullReferenceException.
            if (Features != null)
            {
                Console.WriteLine($"TextFeatures Length:{Features.Length}");
                foreach (var f in Features)
                {
                    Console.Write($"{f},");
                }
                Console.WriteLine();
            }
            Console.WriteLine();
        }
    }
By analyzing the debug output, we can see how the whole data processing pipeline works.
5. Resources
Source code download: https://github.com/seabluescn/Study_ML.NET
Project name: BinaryClassification_TextFeaturize