Weka Manual 3.6 Translation: 16.6 Classification

Date: 2019-11-19


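16.6 Classification

In WEKA, classification and regression algorithms are both called "classifiers", and all of them are located in the weka.classifiers package. This section covers the following topics:

• Building a classifier - batch and incremental learning.

• Evaluating a classifier - various evaluation techniques and how to obtain the generated statistics.

• Classifying instances - obtaining classifications for unknown data.

The WEKA Examples collection [3] contains example classes for classification in the wekaexamples.classifiers package.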

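16.6.1 Building a classifier

By design, all classifiers in WEKA can be trained in batch mode, i.e., they are trained on the whole dataset at once. This is fine as long as the training data fits into memory. There are also algorithms that can update their internal model on the fly; these classifiers are called incremental. The following two sections cover batch and incremental classifiers.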

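Batch classifiers

Building a batch classifier is very simple:

• Set options - either with the setOptions(String[]) method or with the actual set-methods.

• Train it - call buildClassifier(Instances) with the training set. By definition, the buildClassifier(Instances) method resets the internal model completely, so that subsequent calls with the same data produce the same model ("repeatable experiments").

The following code snippet builds an unpruned J48 on a dataset: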

import weka.core.Instances;

import weka.classifiers.trees.J48;

...

Instances data = ... // from somewhere

String[] options = new String[1];

options[0] = "-U"; // unpruned tree

J48 tree = new J48(); // new instance of tree

tree.setOptions(options); // set the options

tree.buildClassifier(data); // build classifier

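Incremental classifiers

In WEKA, all incremental classifiers implement the interface UpdateableClassifier (located in package weka.classifiers). The Javadoc of this interface lists the classifiers that implement it. These classifiers can process large amounts of data with a small memory footprint, because the training data does not have to be loaded into memory. ARFF files, for instance, can be read incrementally (see chapter 16.2).

Training an incremental classifier happens in two stages:

1. Initialize the model by calling the buildClassifier(Instances) method. The weka.core.Instances object passed in may contain no actual data, or it may hold an initial set of data.

2. Update the model row by row by calling the updateClassifier(Instance) method.

The following example shows how to load an ARFF file incrementally with the ArffLoader class and train the NaiveBayesUpdateable classifier one row at a time: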

import weka.core.Instance;

import weka.core.Instances;

import weka.core.converters.ArffLoader;

import weka.classifiers.bayes.NaiveBayesUpdateable;

import java.io.File;

...

// load data

ArffLoader loader = new ArffLoader();

loader.setFile(new File("/some/where/data.arff"));

Instances structure = loader.getStructure();

structure.setClassIndex(structure.numAttributes() - 1);

// train NaiveBayes

NaiveBayesUpdateable nb = new NaiveBayesUpdateable();

nb.buildClassifier(structure);

Instance current;

while ((current = loader.getNextInstance(structure)) != null)

nb.updateClassifier(current);
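The two training stages can also be pictured without WEKA. The following is a minimal sketch, not WEKA code: MajorityModel, build, update and trainAndPredict are hypothetical names, and the "model" is just a table of class counts. It only illustrates that the model is initialized once from the structure and then updated one row at a time, so the rows never have to be held in memory together.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an incremental learner; not part of WEKA.
public class IncrementalSketch {

    // Counts how often each class label has been seen so far.
    static class MajorityModel {
        private int[] counts;

        // Stage 1: initialize (and fully reset) the model from the structure alone.
        void build(int numClasses) {
            counts = new int[numClasses];
        }

        // Stage 2: fold a single row into the model.
        void update(int classIndex) {
            counts[classIndex]++;
        }

        // Predict the most frequent class seen so far (lowest index wins ties).
        int predict() {
            int best = 0;
            for (int i = 1; i < counts.length; i++) {
                if (counts[i] > counts[best]) {
                    best = i;
                }
            }
            return best;
        }
    }

    // Trains on a stream of class indices without keeping the rows around.
    public static int trainAndPredict(int numClasses, Iterable<Integer> rows) {
        MajorityModel model = new MajorityModel();
        model.build(numClasses);
        for (int row : rows) {
            model.update(row);
        }
        return model.predict();
    }

    public static void main(String[] args) {
        List<Integer> stream = Arrays.asList(0, 1, 1, 2, 1);
        System.out.println(trainAndPredict(3, stream)); // prints 1
    }
}
```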

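16.6.2 Evaluating a classifier

Building a classifier is only one part of the job; evaluating how well it performs is another important part. WEKA supports two types of evaluation:

• Cross-validation - for when you only have a single dataset and want a reasonably realistic evaluation. Setting the number of folds to the number of rows in the dataset gives leave-one-out cross-validation (LOOCV).

• Dedicated test set - the test set is used solely to evaluate the built classifier. It is important that the test set follows the same (or a similar) concept as the training set; otherwise performance will always be poor.

The evaluation step, including the collection of statistics, is performed by the Evaluation class (package weka.classifiers).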
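To see why a fold count equal to the number of rows yields leave-one-out, it helps to look at the fold sizes. The sketch below is plain Java and does not claim to reproduce WEKA's internal fold assignment; testFoldSizes is a hypothetical helper that spreads n rows over k test folds as evenly as possible.

```java
import java.util.Arrays;

// Hypothetical helper, not WEKA code: sizes of the k test folds for n rows.
public class FoldSizeSketch {

    public static int[] testFoldSizes(int n, int k) {
        if (k < 2 || k > n) {
            throw new IllegalArgumentException("need 2 <= k <= n");
        }
        int[] sizes = new int[k];
        for (int fold = 0; fold < k; fold++) {
            // The first (n % k) folds receive one extra row.
            sizes[fold] = n / k + (fold < n % k ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 10-fold on 14 rows: test folds of one or two rows each.
        System.out.println(Arrays.toString(testFoldSizes(14, 10)));
        // prints [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]

        // 14-fold on 14 rows: every test fold holds exactly one row (LOOCV).
        System.out.println(Arrays.toString(testFoldSizes(14, 14)));
    }
}
```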

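Cross-validation

The crossValidateModel method of the Evaluation class performs cross-validation with an untrained classifier and a single dataset. Supplying an untrained classifier ensures that no information leaks into the actual evaluation. Although resetting the classifier in buildClassifier is an implementation requirement, there is no guarantee that every implementation actually does so ("leaky" implementations). Using an untrained classifier avoids unwanted side effects, because a copy of the originally supplied classifier is used for each train/test pair.

Before cross-validation is performed, the data is randomized with the supplied random number generator (java.util.Random). It is recommended to give this generator a fixed "seed". Otherwise, subsequent cross-validation runs on the same dataset will not yield the same results, because the data is randomized differently each time (see Section 16.4 for more information on randomization).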
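The effect of the seed can be checked in isolation. The following sketch uses only java.util classes and is not WEKA's own shuffling code; shuffledRowOrder is a hypothetical helper that shuffles row indices. Two generators created with the same seed produce the same row order, which is what makes repeated runs comparable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not WEKA code: shuffle row indices with a given generator.
public class SeededShuffleSketch {

    public static List<Integer> shuffledRowOrder(int n, Random rnd) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            order.add(i);
        }
        Collections.shuffle(order, rnd);
        return order;
    }

    public static void main(String[] args) {
        List<Integer> first = shuffledRowOrder(10, new Random(1));
        List<Integer> second = shuffledRowOrder(10, new Random(1));
        // Same seed, same order: the run is repeatable.
        System.out.println(first.equals(second)); // prints true
    }
}
```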

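The following code snippet performs 10-fold cross-validation of a J48 decision tree on the dataset newData, using a random number generator seeded with "1". A summary of the collected statistics is written to standard output.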

import weka.classifiers.Evaluation;

import weka.classifiers.trees.J48;

import weka.core.Instances;

import java.util.Random;

...

Instances newData = ... // from somewhere

Evaluation eval = new Evaluation(newData);

J48 tree = new J48();

eval.crossValidateModel(tree, newData, 10, new Random(1));

System.out.println(eval.toSummaryString("\nResults\n\n", false));

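The Evaluation object in this example is initialized with the dataset that is used in the evaluation process. This tells the evaluation what type of data is being evaluated and ensures that all internal data structures are set up correctly.

Train/test set

Using a dedicated test set to evaluate a classifier is just as easy as cross-validation. However, a trained classifier has to be supplied now, rather than an untrained one. Once again, the weka.classifiers.Evaluation class performs the evaluation, this time through the evaluateModel method.

The following code snippet trains a J48 with default options on a training set, evaluates it on a test set, and then outputs a summary of the collected statistics.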

import weka.core.Instances;

import weka.classifiers.Classifier;

import weka.classifiers.Evaluation;

import weka.classifiers.trees.J48;

...

Instances train = ... // from somewhere

Instances test = ... // from somewhere

// train classifier

Classifier cls = new J48();

cls.buildClassifier(train);

// evaluate classifier and print some statistics

Evaluation eval = new Evaluation(train);

eval.evaluateModel(cls, test);

System.out.println(eval.toSummaryString("\nResults\n\n", false));

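Statistics

In the previous sections, the code examples used the toSummaryString method of the Evaluation class. There are further summary methods for nominal class attributes:

• toMatrixString - outputs the confusion matrix.

• toClassDetailsString - outputs TP/FP rates, precision, recall, F-measure and AUC (per class).

• toCumulativeMarginDistributionString - outputs the cumulative margins distribution.

If you do not want to use these summary methods, you can access the individual statistical measures directly. Some common measures are listed below:

• nominal class attribute

- correct() - the number of correctly classified instances. The number of incorrectly classified instances is available through incorrect().

- pctCorrect() - the percentage of correctly classified instances (accuracy). pctIncorrect() returns the percentage of misclassified instances.

- areaUnderROC(int) - the area under the ROC curve (AUC) for the specified class label index (0-based index).

• numeric class attribute

- correlationCoefficient() - the correlation coefficient.

• general

- meanAbsoluteError() - the mean absolute error.

- rootMeanSquaredError() - the root mean squared error.

- numInstances() - the number of instances with a class value.

- unclassified() - the number of unclassified instances.

- pctUnclassified() - the percentage of unclassified instances.

For a complete overview, see the Javadoc page of the Evaluation class. Looking up the source code of the summary methods mentioned above makes it easy to determine which methods are used for a particular output.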
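As a rough illustration of what the nominal-class measures mean, the sketch below derives a few of them from a small, made-up confusion matrix. It is plain Java rather than the Evaluation class: the method names merely mirror the ones listed above, the matrix values are invented for the example, and the layout (rows are actual classes, columns are predicted classes) is an assumption of this sketch.

```java
// Hypothetical sketch, not WEKA code: measures derived from a confusion matrix
// with rows = actual class and columns = predicted class.
public class ConfusionStatsSketch {

    // Number of correctly classified instances (the diagonal).
    public static int correct(int[][] m) {
        int sum = 0;
        for (int i = 0; i < m.length; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Total number of classified instances.
    public static int total(int[][] m) {
        int sum = 0;
        for (int[] row : m) {
            for (int v : row) {
                sum += v;
            }
        }
        return sum;
    }

    // Percentage of correctly classified instances (accuracy).
    public static double pctCorrect(int[][] m) {
        return 100.0 * correct(m) / total(m);
    }

    // Precision for class c: correct predictions of c over all predictions of c.
    public static double precision(int[][] m, int c) {
        int predicted = 0;
        for (int[] row : m) {
            predicted += row[c];
        }
        return (double) m[c][c] / predicted;
    }

    // Recall for class c: correct predictions of c over all actual instances of c.
    public static double recall(int[][] m, int c) {
        int actual = 0;
        for (int v : m[c]) {
            actual += v;
        }
        return (double) m[c][c] / actual;
    }

    public static void main(String[] args) {
        int[][] matrix = { {7, 2}, {1, 4} }; // invented example values
        System.out.println(correct(matrix));      // prints 11
        System.out.println(pctCorrect(matrix));
        System.out.println(precision(matrix, 0)); // prints 0.875
        System.out.println(recall(matrix, 0));
    }
}
```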

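16.6.3 Classifying instances

Once a classifier has been built, evaluated and shown to be useful, it can be used to make predictions and to label previously unlabeled data. Section 16.5.2 already gave a brief introduction to a classifier's classifyInstance method. This section elaborates on it a little more.

The following example uses a trained classifier, tree, to label all instances of an unlabeled dataset loaded from disk. After all instances have been labeled, the newly labeled dataset is written to a new file on disk.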

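import weka.core.Instances;

import weka.core.converters.ConverterUtils.DataSink;

import weka.core.converters.ConverterUtils.DataSource;

...

// load unlabeled data and set class attribute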

Instances unlabeled = DataSource.read("/some/where/unlabeled.arff");

unlabeled.setClassIndex(unlabeled.numAttributes() - 1);

// create copy

Instances labeled = new Instances(unlabeled);

// label instances

for (int i = 0; i < unlabeled.numInstances(); i++) {

double clsLabel = tree.classifyInstance(unlabeled.instance(i));

labeled.instance(i).setClassValue(clsLabel);

}

// save newly labeled data

DataSink.write("/some/where/labeled.arff", labeled);

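Of course, the example above works for classification and regression problems alike, as long as the classifier can handle numeric classes. Why is that? For a numeric class, the classifyInstance(Instance) method returns the regression value; for a nominal class, it returns the 0-based index into the list of available class labels.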
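The two meanings of the returned double can be sketched in plain Java. This is not WEKA code: describe is a hypothetical helper, and passing null for the label list is simply this sketch's way of signalling a numeric class.

```java
// Hypothetical helper, not WEKA code: interpret the double returned by a prediction.
public class PredictionDecodeSketch {

    public static String describe(double prediction, String[] classLabels) {
        if (classLabels == null) {
            // Numeric class: the value itself is the prediction.
            return String.valueOf(prediction);
        }
        // Nominal class: the value is a 0-based index into the label list.
        return classLabels[(int) prediction];
    }

    public static void main(String[] args) {
        String[] labels = {"yes", "no"};
        System.out.println(describe(1.0, labels)); // prints no
        System.out.println(describe(23.5, null));  // prints 23.5
    }
}
```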

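If you are interested in the class distribution instead, you can use the distributionForInstance(Instance) method (the values in the returned array sum to 1). Of course, using this method only makes sense for classification problems. The following code snippet prints the class distribution, the actual label and the predicted label side by side on the console: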

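import weka.classifiers.trees.J48;

import weka.core.Instances;

import weka.core.Utils;

import weka.core.converters.ConverterUtils.DataSource;

...

// load data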

Instances train = DataSource.read(args[0]);

train.setClassIndex(train.numAttributes() - 1);

Instances test = DataSource.read(args[1]);

test.setClassIndex(test.numAttributes() - 1);

// train classifier

J48 cls = new J48();

cls.buildClassifier(train);

// output predictions

System.out.println("# - actual - predicted - distribution");

for (int i = 0; i < test.numInstances(); i++) {

double pred = cls.classifyInstance(test.instance(i));

double[] dist = cls.distributionForInstance(test.instance(i));

System.out.print((i+1) + " - ");

System.out.print(test.instance(i).toString(test.classIndex()) + " - ");

System.out.print(test.classAttribute().value((int) pred) + " - ");

System.out.println(Utils.arrayToString(dist));

}
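How a distribution relates to a single predicted label can be shown with a small stand-alone sketch. It is plain Java with hypothetical helper names (mostLikelyClass, sum) and an invented distribution; it assumes the usual convention that the predicted class is the one with the highest probability, which is typically, but not necessarily for every classifier, what classifyInstance returns.

```java
// Hypothetical helpers, not WEKA code: work with a class distribution array.
public class DistributionSketch {

    // Index of the class with the highest probability (lowest index wins ties).
    public static int mostLikelyClass(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return best;
    }

    // Sum of all class probabilities; expected to be 1 for a valid distribution.
    public static double sum(double[] dist) {
        double total = 0.0;
        for (double p : dist) {
            total += p;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] dist = {0.25, 0.5, 0.25}; // invented example values
        System.out.println(mostLikelyClass(dist)); // prints 1
        System.out.println(sum(dist));             // prints 1.0
    }
}
```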
