ML.NET Machine Learning Framework Study Notes [3]: Text Feature Analysis

Date: 2019-11-12

1. The problem to solve

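Problem: organizations that hold meetings often need the minutes to be entered into a system, and we want machine learning to automatically judge the text a user enters as acceptable or unacceptable. (Similar problems include spam SMS detection and quality analysis of work logs.)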

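Approach: we manually review existing meeting minutes and label each one as acceptable or unacceptable, then train a model on these labeled records. The learning algorithm is still the fast decision tree algorithm for binary classification. Unlike the previous article, the input features this time are not floating-point numbers but Chinese text, which brings in text feature extraction.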

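Why is text feature extraction needed? Text is human language, and a raw sequence of characters cannot be passed directly to an algorithm. Learning algorithms accept only fixed-length numeric feature vectors (float or float arrays) and cannot work with variable-length text documents.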
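
To make the idea of a fixed-length numeric vector concrete, here is a minimal bag-of-words sketch in plain C#. It is an illustration only and is not how ML.NET implements featurization; the BagOfWordsSketch class and its method names are hypothetical, and the input is assumed to be already separated by spaces.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWordsSketch
{
    // Assign every distinct word a fixed index, in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> documents)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (var document in documents)
        {
            foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (!vocabulary.ContainsKey(word))
                {
                    vocabulary.Add(word, vocabulary.Count);
                }
            }
        }
        return vocabulary;
    }

    // Turn one document into a vector of word counts.
    // The vector length equals the vocabulary size, regardless of document length.
    public static float[] Vectorize(string document, Dictionary<string, int> vocabulary)
    {
        var vector = new float[vocabulary.Count];
        foreach (var word in document.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Words that are not in the vocabulary are ignored.
            if (vocabulary.TryGetValue(word, out int index))
            {
                vector[index] += 1f;
            }
        }
        return vector;
    }
}
```

With the two documents "the meeting was short" and "the meeting covered the budget", the vocabulary has six entries, and the second document becomes the vector 2, 1, 0, 0, 1, 1. Documents of any length map to a vector of the same size, which is what the learning algorithm requires.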

Commonly used text feature extraction methods include the following:

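A general understanding of these is enough. We do not need to implement a text feature extraction algorithm ourselves; we only need to use the methods that the platform provides.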

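The built-in text featurization method takes a string as input and requires the words in a sentence to be separated by spaces. Words in an English sentence are naturally separated by spaces, but words in a Chinese sentence are not, so we first need to perform word segmentation. The overall flow is: segment the Chinese text into space-separated words, convert the segmented text into a feature vector, and then train the binary classifier on those features.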
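
The need for segmentation can be seen with a plain split on spaces. The sketch below is illustrative only; the WhitespaceTokenSketch class is hypothetical, and the segmented Chinese string in the usage note is segmented by hand rather than produced by a real segmenter.

```csharp
using System;

public static class WhitespaceTokenSketch
{
    // Split a sentence into tokens on spaces, the way a space-based
    // tokenizer sees it. Empty entries from repeated spaces are dropped.
    public static string[] SplitOnSpaces(string sentence)
    {
        return sentence.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

SplitOnSpaces("the meeting was short") yields four tokens, while the unsegmented SplitOnSpaces("今天召開會議") yields a single token covering the whole sentence. Once the sentence is segmented and joined with spaces, as in "今天 召開 會議", the same call yields three tokens.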

2. Code

The overall flow of the code is basically the same as described in the previous article. For brevity, saving and loading the model are omitted.

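First, a look at the dataset. Each row of the comma-separated file holds the label in the first column and the meeting text in the second, as the MeetingInfo class in the listing shows.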

The code is as follows:

using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

namespace BinaryClassification_TextFeaturize
{
    class Program
    {
        static readonly string DataPath = Path.Combine(Environment.CurrentDirectory, "Data", "meeting_data_full.csv");

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            var fulldata = mlContext.Data.LoadFromTextFile<MeetingInfo>(DataPath, separatorChar: ',', hasHeader: false);
            var trainTestData = mlContext.Data.TrainTestSplit(fulldata, testFraction: 0.15);
            var trainData = trainTestData.TrainSet;
            var testData = trainTestData.TestSet;

            var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
                .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
                .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));
            ITransformer trainedModel = trainingPipeline.Fit(trainData);

            
            // Evaluate the model on the test set
            var predictions = trainedModel.Transform(testData);           
            var metrics = mlContext.BinaryClassification.Evaluate(data: predictions, labelColumnName: "Label");
            Console.WriteLine($"Evaluation Accuracy: {metrics.Accuracy:P2}");
           

            // Create the prediction engine
            var predEngine = mlContext.Model.CreatePredictionEngine<MeetingInfo, PredictionResult>(trainedModel);

            // Prediction 1
            MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委會。" };
            var predictionresult1 = predEngine.Predict(sampleStatement1);
            Console.WriteLine($"{sampleStatement1.Text}:{predictionresult1.PredictedLabel}");         

            // Prediction 2
            MeetingInfo sampleStatement2 = new MeetingInfo { Text = "開展新時代中國特點社會主義思想三十講黨員答題活動。" };
            var predictionresult2 = predEngine.Predict(sampleStatement2);
            Console.WriteLine($"{sampleStatement2.Text}:{predictionresult2.PredictedLabel}");        

            Console.WriteLine("Press any key to exit!");
            Console.ReadKey();
        }
        
    }

    public class MeetingInfo
    {
        [LoadColumn(0)]
        public bool Label { get; set; }
        [LoadColumn(1)]
        public string Text { get; set; }
    }

    public class PredictionResult : MeetingInfo
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;        
    }
}
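
The Accuracy value printed by the program is the fraction of test rows whose predicted label matches the actual label. A minimal sketch of that calculation, independent of ML.NET (the AccuracySketch class is hypothetical and only illustrates the definition):

```csharp
using System;
using System.Collections.Generic;

public static class AccuracySketch
{
    // Accuracy = number of correct predictions / total number of predictions.
    public static double Accuracy(IReadOnlyList<bool> actual, IReadOnlyList<bool> predicted)
    {
        if (actual.Count != predicted.Count)
        {
            throw new ArgumentException("Both lists must have the same length.");
        }
        if (actual.Count == 0)
        {
            throw new ArgumentException("At least one row is required.");
        }

        int correct = 0;
        for (int i = 0; i < actual.Count; i++)
        {
            if (actual[i] == predicted[i])
            {
                correct++;
            }
        }
        return (double)correct / actual.Count;
    }
}
```

For example, with actual labels true, false, true, true and predicted labels true, false, false, true, three of the four rows match, so the accuracy is 0.75.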

3. Code analysis

I will not repeat the explanations that overlap with the previous article; the focus here is on how the learning pipeline is built.

var trainingPipeline = mlContext.Transforms.CustomMapping<JiebaLambdaInput, JiebaLambdaOutput>(mapAction: JiebaLambda.MyAction, contractName: "JiebaLambda")
    .Append(mlContext.Transforms.Text.FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText"))
    .Append(mlContext.BinaryClassification.Trainers.FastTree(labelColumnName: "Label", featureColumnName: "Features"));

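First, before the text features can be extracted, the text has to be segmented into words. You could preprocess the sample data and train on the segmented result, but we did not take that approach. Instead, we defined a custom data-processing step that performs the segmentation inside the pipeline. It is defined as follows: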

namespace BinaryClassification_TextFeaturize
{
    public class JiebaLambdaInput
    {
        public string Text { get; set; }
    }

    public class JiebaLambdaOutput
    {
        public string JiebaText { get; set; }
    }

    public class JiebaLambda
    {       
        public static void MyAction(JiebaLambdaInput input, JiebaLambdaOutput output)
        {
            JiebaNet.Segmenter.JiebaSegmenter jiebaSegmenter = new JiebaNet.Segmenter.JiebaSegmenter();
            output.JiebaText = string.Join(" ", jiebaSegmenter.Cut(input.Text));          
        }        
    }
}
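
For comparison, the preprocessing alternative mentioned above (segmenting the samples before training instead of inside the pipeline) could look roughly like the sketch below. This is not the approach the article uses. The SegmentedMeetingInfo and PreSegmentedTraining types are hypothetical, the segmenter is passed in as a delegate so that any segmenter can be plugged in, and the Microsoft.ML and Microsoft.ML.FastTree packages are assumed to be referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type: the text is stored already segmented.
public class SegmentedMeetingInfo
{
    public bool Label { get; set; }
    public string JiebaText { get; set; }
}

public static class PreSegmentedTraining
{
    // Segment every row up front. 'segment' is supplied by the caller,
    // for example a wrapper around JiebaSegmenter.Cut.
    public static List<SegmentedMeetingInfo> Segment(
        IEnumerable<(bool Label, string Text)> rows,
        Func<string, IEnumerable<string>> segment)
    {
        return rows.Select(r => new SegmentedMeetingInfo
        {
            Label = r.Label,
            JiebaText = string.Join(" ", segment(r.Text))
        }).ToList();
    }

    // Train on the pre-segmented rows; the pipeline no longer needs CustomMapping.
    public static ITransformer Train(MLContext mlContext, List<SegmentedMeetingInfo> segmentedRows)
    {
        IDataView data = mlContext.Data.LoadFromEnumerable(segmentedRows);
        var pipeline = mlContext.Transforms.Text
            .FeaturizeText(outputColumnName: "Features", inputColumnName: "JiebaText")
            .Append(mlContext.BinaryClassification.Trainers.FastTree(
                labelColumnName: "Label", featureColumnName: "Features"));
        return pipeline.Fit(data);
    }
}
```

The trade-off is that any text passed to the trained model later must be segmented in the same way by the caller, whereas the custom mapping step keeps segmentation inside the model's own pipeline.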

Finally, we create two objects to run actual predictions:

            // Prediction 1
            MeetingInfo sampleStatement1 = new MeetingInfo { Text = "支委會。" };
            var predictionresult1 = predEngine.Predict(sampleStatement1);
            Console.WriteLine($"{sampleStatement1.Text}:{predictionresult1.PredictedLabel}");         

            // Prediction 2
            MeetingInfo sampleStatement2 = new MeetingInfo { Text = "開展新時代中國特點社會主義思想三十講黨員答題活動。" };
            var predictionresult2 = predEngine.Predict(sampleStatement2);
            Console.WriteLine($"{sampleStatement2.Text}:{predictionresult2.PredictedLabel}");

The prediction results are as follows:

4. Debugging

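As mentioned in the previous article, calling the Transform method transforms every record. What does the transformed dataset look like? We can write a small debugging routine to find out.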

        var predictions = trainedModel.Transform(testData);
        DebugData(mlContext, predictions);

        private static void DebugData(MLContext mlContext, IDataView predictions)
        {
            var trainDataShow = new List<PredictionResult>(mlContext.Data.CreateEnumerable<PredictionResult>(predictions, false, true));

            foreach (var dataline in trainDataShow)
            {
                dataline.PrintToConsole();
            }
        }

    public class PredictionResult 
    {
        public string JiebaText { get; set; }
        public float[] Features { get; set; }
        public bool PredictedLabel;
        public float Score;
        public float Probability;
        public void PrintToConsole()
        {
            Console.WriteLine($"JiebaText={JiebaText}");
            Console.WriteLine($"PredictedLabel:{PredictedLabel},Score:{Score},Probability:{Probability}");
            // Check for null before touching Features, including its Length
            if (Features != null)
            {
                Console.WriteLine($"TextFeatures Length:{Features.Length}");
                foreach (var f in Features)
                {
                    Console.Write($"{f},");
                }
                Console.WriteLine();
            }
            Console.WriteLine();
        }
    }