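AdaBoost.MH was proposed by Schapire (one of the authors of AdaBoost) and Singer. Its basic idea resembles AdaBoost: adaptively adjust a weight distribution over (example, label) pairs. Given training examples \(\langle (x_1, Y_1), \cdots, (x_m, Y_m) \rangle\), where each instance \(x_i \in \mathcal{X}\) has a label set \(Y_i \subseteq \mathcal{Y}\), the algorithm proceeds as follows: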
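Here \(D_t(i, \ell)\) is the weight of label \(\ell\) for instance \(x_i\) at iteration \(t\), and \(Y[\ell]\) indicates whether label \(\ell\) belongs to the label set of example \((x, Y)\): it is +1 if it does and -1 otherwise, so that (example, label) pairs the weak hypothesis gets wrong gain weight in the next round. That is,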
\[ Y[\ell] = \begin{cases} +1 & \ell \in Y \\ -1 & \ell \notin Y \end{cases} \]
\(Z_t\) is the normalization factor:
\[ Z_t = \sum_{i=1}^{m} \sum_{\ell \in \mathcal{Y}} D_{t}(i, \ell) \exp \bigl( -\alpha_{t} Y_i[\ell] h_t(x_i, \ell) \bigr) \]
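To make the role of \(Y[\ell]\) and \(Z_t\) concrete, here is a minimal NumPy sketch of a single reweighting step. It assumes the standard AdaBoost.MH update, in which each weight \(D_t(i, \ell)\) is multiplied by \(\exp(-\alpha_t Y_i[\ell] h_t(x_i, \ell))\) and then divided by \(Z_t\). The function names `to_signed_labels` and `reweight`, the toy label sets, and the value of `alpha` are illustrative assumptions, not part of any library.

```python
import numpy as np

def to_signed_labels(label_sets, num_labels):
    """Encode label sets as a matrix with Y[i, l] = +1 if l is in Y_i, else -1."""
    Y = -np.ones((len(label_sets), num_labels))
    for i, labels in enumerate(label_sets):
        for l in labels:
            Y[i, l] = 1.0
    return Y

def reweight(D, Y, H, alpha):
    """One reweighting step (hypothetical helper).

    D: (m, k) current weights D_t(i, l)
    Y: (m, k) signed labels Y_i[l]
    H: (m, k) weak hypothesis outputs h_t(x_i, l)
    Returns the normalized weights for the next round and Z_t.
    """
    unnormalized = D * np.exp(-alpha * Y * H)
    Z = unnormalized.sum()          # the double sum over i and l
    return unnormalized / Z, Z

# Toy data: 3 instances, 3 labels (assumed for illustration).
label_sets = [{0, 2}, {1}, set()]
Y = to_signed_labels(label_sets, num_labels=3)
D = np.full(Y.shape, 1.0 / Y.size)  # uniform initial distribution
# A weak hypothesis that is wrong on the pairs (0, 2) and (2, 2).
H = np.array([[1.0, -1.0, -1.0],
              [-1.0, 1.0, -1.0],
              [-1.0, -1.0, 1.0]])
D_next, Z = reweight(D, Y, H, alpha=0.5)
```

After this step the two misclassified pairs carry more weight than the correctly classified ones, and `D_next` still sums to 1.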
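ML-KNN (multi-label K nearest neighbor) builds on the KNN algorithm: given the label information of the K nearest neighbors, it uses maximum a posteriori (MAP) estimation to decide whether instance \(t\) should receive label \(\ell\):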
\[ y_t(\ell) = \mathop{ \arg \max}_{b \in \{0,1\}} P(H_b^{\ell} | E_{C_t(\ell)}^{\ell} ) \]
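Here \(H_0^{\ell}\) denotes the event that instance \(t\) should not receive label \(\ell\), and \(H_1^{\ell}\) the event that it should; \(E_{C_t(\ell)}^{\ell}\) denotes the event that exactly \(C_t(\ell)\) of the K nearest neighbors of \(t\) carry label \(\ell\). By Bayes' theorem, and because the evidence term \(P(E_{C_t(\ell)}^{\ell})\) does not depend on \(b\), this is equivalent to: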
\[ y_t(\ell) = \mathop{ \arg \max}_{b \in \{0,1\}} P(H_b^{\ell}) P(E_{C_t(\ell)}^{\ell} | H_b^{\ell} ) \]
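The MAP rule above can be sketched in a few lines of NumPy. This is a brute-force illustration, not the implementation of any library: it assumes the prior and the likelihood are estimated by frequency counting on the training set with a smoothing constant `s`, as described in [2], and it uses squared Euclidean distance for the neighbor search. The names `knn_label_counts`, `mlknn_fit` and `mlknn_predict` and the toy data are hypothetical.

```python
import numpy as np

def knn_label_counts(X_query, X_train, Y_train, k, exclude_self=False):
    """For each query row, count how many of its k nearest training
    neighbors carry each label, i.e. C(l)."""
    d = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    if exclude_self:
        np.fill_diagonal(d, np.inf)   # a training point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]
    return Y_train[nn].sum(axis=1)    # shape (n_query, n_labels)

def mlknn_fit(X, Y, k, s=1.0):
    """Estimate P(H_1^l) and P(E_j^l | H_b^l) by smoothed counting."""
    m, q = Y.shape
    prior1 = (s + Y.sum(axis=0)) / (2 * s + m)
    C = knn_label_counts(X, X, Y, k, exclude_self=True)
    like = np.zeros((2, q, k + 1))    # like[b, l, j] = P(E_j^l | H_b^l)
    for l in range(q):
        for b in (0, 1):
            counts = np.bincount(C[Y[:, l] == b, l], minlength=k + 1)
            like[b, l] = (s + counts) / (s * (k + 1) + counts.sum())
    return prior1, like

def mlknn_predict(X_query, X, Y, k, prior1, like):
    """Apply y_t(l) = argmax_b P(H_b^l) P(E_{C_t(l)}^l | H_b^l)."""
    C = knn_label_counts(X_query, X, Y, k)
    labels = np.arange(Y.shape[1])
    p1 = prior1 * like[1, labels, C]
    p0 = (1 - prior1) * like[0, labels, C]
    return (p1 > p0).astype(int)

# Toy data (assumed): two well-separated clusters, one label each.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
Y = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
prior1, like = mlknn_fit(X, Y, k=2)
pred = mlknn_predict(np.array([[0.05], [5.05]]), X, Y, 2, prior1, like)
```

On this toy data each query inherits the label of the cluster it falls in. The pairwise distance matrix makes the sketch quadratic in memory, so it is only meant for small examples.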
Algorithm | Hamming loss | Precision | Recall | F1 Measure |
LR+OvR | 0.0569 | 0.6252 | 0.5586 | 0.5563 |
AdaBoost.MH | 0.0587 | 0.6280 | 0.6082 | 0.5837 |
ML-KNN | 0.0652 | 0.6204 | 0.6535 | 0.5977 |
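The metrics in the table can be made concrete with a small worked example. The sketch below assumes the sample-averaged definitions that the evaluation code in this post requests via `average='samples'`: precision, recall and F1 are computed per instance and then averaged, and a 0/0 ratio is counted as 0. The helper names and the toy matrices are illustrative assumptions.

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of (instance, label) pairs that are predicted wrongly."""
    return float(np.mean(y_true != y_pred))

def sample_averaged_prf(y_true, y_pred):
    """Per-instance precision, recall and F1, averaged over instances."""
    tp = (y_true & y_pred).sum(axis=1)   # correctly predicted labels per instance
    n_pred = y_pred.sum(axis=1)
    n_true = y_true.sum(axis=1)
    zeros = lambda: np.zeros(len(tp))    # assumed convention: 0/0 counts as 0
    precision = np.divide(tp, n_pred, out=zeros(), where=n_pred > 0)
    recall = np.divide(tp, n_true, out=zeros(), where=n_true > 0)
    f1 = np.divide(2 * tp, n_pred + n_true, out=zeros(),
                   where=(n_pred + n_true) > 0)
    return precision.mean(), recall.mean(), f1.mean()

# Toy data (assumed): 2 instances, 3 labels.
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
loss = hamming_loss(y_true, y_pred)
precision, recall, f1 = sample_averaged_prf(y_true, y_pred)
```

Because F1 is averaged per instance, it need not equal the harmonic mean of the averaged precision and recall: here both are 0.75 while F1 is about 0.667.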
In addition, Mulan provides many datasets, and Kaggle has hosted a multi-label classification competition, WISE 2014.
```python
import numpy as np
from sklearn import metrics
from sklearn.datasets import load_svmlight_files
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Load both LIBSVM-format files in one call so that the train and test
# matrices share the same number of feature columns.
X_train, y_train, X_test, y_test = load_svmlight_files(
    ['tmc2007_train.svm', 'tmc2007_test.svm'],
    dtype=np.float64, multilabel=True)

# Convert the label tuples to a binary indicator matrix.
# Fit the binarizer on the training labels only and reuse it for the test
# labels, so that both matrices have the same label columns.
mb = MultiLabelBinarizer()
y_train = mb.fit_transform(y_train)
y_test = mb.transform(y_test)

# LR + OvR: one logistic regression per label
clf = OneVsRestClassifier(LogisticRegression(), n_jobs=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Multi-label classification metrics
loss = metrics.hamming_loss(y_test, y_pred)
prf = metrics.precision_recall_fscore_support(y_test, y_pred, average='samples')

# ML-KNN for multi-label classification
from skmultilearn.adapt import MLkNN

clf = MLkNN(k=15)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
```scala
import org.apache.spark.mllib.evaluation.MultilabelMetrics

// AdaBoost.MH for multilabel classification.
// `sc` (the SparkContext) and `params` (the job parameters) are defined elsewhere.
val labels0Based = true
val binaryProblem = false
val learner = new AdaBoostMHLearner(sc)
learner.setNumIterations(params.numIterations) // 500 iterations
learner.setNumDocumentsPartitions(params.numDocumentsPartitions)
learner.setNumFeaturesPartitions(params.numFeaturesPartitions)
learner.setNumLabelsPartitions(params.numLabelsPartitions)
val classifier = learner.buildModel(params.input, labels0Based, binaryProblem)

// Classify the test set
val testPath = "./tmc2007_test.svm"
val numRows = DataUtils.getNumRowsFromLibSvmFile(sc, testPath)
val testRdd = DataUtils.loadLibSvmFileFormatDataAsList(
  sc, testPath, labels0Based, binaryProblem, 0, numRows, -1)
val results = classifier.classifyWithResults(sc, testRdd, 20)

// NOTE: the original snippet does not show how predLabels and goldLabels are
// obtained; they are assumed to be the predicted and gold label IDs taken from `results`.
val predAndLabels = sc.parallelize(predLabels.zip(goldLabels)
  .map(t => (t._1.map(e => e.toDouble), t._2.map(e => e.toDouble))))
val metrics = new MultilabelMetrics(predAndLabels)
```
[1] Schapire, Robert E., and Yoram Singer. "BoosTexter: A boosting-based system for text categorization." Machine learning 39.2-3 (2000): 135-168.
[2] Zhang, Min-Ling, and Zhi-Hua Zhou. "ML-KNN: A lazy learning approach to multi-label learning." Pattern recognition 40.7 (2007): 2038-2048.
[3] Building a recommendation engine on PredictionIO, and exploring large-scale multi-label classification.