The metrics in sklearn live in the sklearn.metrics package, and the clustering-related metrics live in sklearn.metrics.cluster. Clustering metrics fall into two groups, supervised and unsupervised, located in the sklearn.metrics.cluster.supervised and sklearn.metrics.cluster.unsupervised modules respectively. Most clustering metrics are supervised; there are comparatively few unsupervised ones.
The two kinds of metrics should be used together. If the unsupervised metrics look good but the supervised ones do not, the problem probably cannot be solved by clustering alone. If the unsupervised metrics look bad but the supervised ones look good, the supervised scores are very likely unreliable and the data labels may be wrong.
sklearn.metrics.cluster.__init__.py
imports all of the clustering metrics:
In fact, the sklearn.metrics package re-imports everything under cluster, so you can use sklearn.metrics directly without caring about sklearn.metrics.cluster.
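This re-export can be checked directly. A minimal sketch (assuming a recent sklearn version where the cluster metrics are re-exported at the package level):

```python
from sklearn import metrics

# Both spellings resolve to the very same function object,
# because sklearn.metrics/__init__.py imports it from the cluster subpackage.
print(metrics.adjusted_rand_score is metrics.cluster.adjusted_rand_score)
```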
from .supervised import adjusted_mutual_info_score
from .supervised import normalized_mutual_info_score
from .supervised import adjusted_rand_score
from .supervised import completeness_score
from .supervised import contingency_matrix
from .supervised import expected_mutual_information
from .supervised import homogeneity_completeness_v_measure
from .supervised import homogeneity_score
from .supervised import mutual_info_score
from .supervised import v_measure_score
from .supervised import fowlkes_mallows_score
from .supervised import entropy
from .unsupervised import silhouette_samples
from .unsupervised import silhouette_score
from .unsupervised import calinski_harabaz_score
from .bicluster import consensus_score
Before studying these clustering metrics, some background knowledge is needed to read the code.
COO is a sparse-matrix format that stores three items per nonzero entry: the row index, the column index, and the value.
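The row/column/value representation can be seen directly with scipy's coo_matrix. A small sketch (the row, col, and data arrays here are chosen to reproduce the contingency matrix used in the example below):

```python
import numpy as np
from scipy.sparse import coo_matrix

# COO ("coordinate") format stores three parallel arrays:
# row indices, column indices, and the values at those positions.
row  = np.array([0, 1, 2, 2, 3])
col  = np.array([0, 2, 1, 2, 2])
data = np.array([1, 1, 1, 2, 1])
m = coo_matrix((data, (row, col)), shape=(4, 3))
print(m.toarray())
# [[1 0 0]
#  [0 0 1]
#  [0 1 2]
#  [0 0 1]]
```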
import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster.supervised import contingency_matrix

labels_true = np.array([0, 2, 2, 3, 2, 1])
labels_pred = np.array([0, 2, 2, 2, 1, 2])
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
print(contingency.todense())
The output is:
[[1 0 0]
 [0 0 1]
 [0 1 2]
 [0 0 1]]
The contingency matrix has as many rows as there are true classes and as many columns as there are clusters; the entry in row i, column j counts how many samples with true class i were assigned to cluster j.
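That counting rule is easy to reproduce with plain numpy. A minimal sketch (the helper name dense_contingency is hypothetical, not part of sklearn):

```python
import numpy as np

def dense_contingency(labels_true, labels_pred):
    # Map each label value to a 0-based index, then count co-occurrences:
    # cell (i, j) += 1 for every sample with true class i and cluster j.
    classes, class_idx = np.unique(labels_true, return_inverse=True)
    clusters, cluster_idx = np.unique(labels_pred, return_inverse=True)
    c = np.zeros((classes.size, clusters.size), dtype=int)
    np.add.at(c, (class_idx, cluster_idx), 1)
    return c

labels_true = np.array([0, 2, 2, 3, 2, 1])
labels_pred = np.array([0, 2, 2, 2, 1, 2])
print(dense_contingency(labels_true, labels_pred))
# [[1 0 0]
#  [0 0 1]
#  [0 1 2]
#  [0 0 1]]
```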
"Adjusted" means: $score=\frac{x-E(x)}{\max(x)-E(x)}$
The Rand index is one metric and mutual information is another; applying this adjustment yields two further metrics, the adjusted Rand index and adjusted mutual information.
The point of the adjustment is that a random clustering should score as low as possible (close to zero).
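Plugging the numbers from the contingency matrix above into the adjustment formula shows how a near-random clustering lands near (here slightly below) zero. Row sums [1, 1, 3, 1] give sum_comb_c = 3, column sums [1, 1, 4] give sum_comb_k = 6, the cells give x = 1, and C(6, 2) = 15:

```python
# score = (x - E(x)) / (max(x) - E(x)) with the example's pair counts
x = 1.0                  # sum of C(n_ij, 2) over the contingency cells
e_x = 3 * 6 / 15         # expected index: sum_comb_c * sum_comb_k / C(n, 2)
max_x = (3 + 6) / 2      # (sum_comb_c + sum_comb_k) / 2
score = (x - e_x) / (max_x - e_x)
print(round(score, 4))   # -0.0606
```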
import numpy as np
from scipy.special import comb  # scipy.misc.comb was removed in newer SciPy
from sklearn import metrics
from sklearn.metrics.cluster.supervised import contingency_matrix

labels_true = np.array([0, 2, 2, 3, 2, 1])
labels_pred = np.array([0, 2, 2, 2, 1, 2])

# sklearn's built-in adjusted Rand index
score = metrics.cluster.adjusted_rand_score(labels_true, labels_pred)
print(score)

n_samples = labels_true.shape[0]
n_classes = np.unique(labels_true).shape[0]
n_clusters = np.unique(labels_pred).shape[0]
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
print(contingency.todense())

# Recompute the adjusted Rand index by hand from the contingency matrix
sum_comb_c = sum(comb(n_c, 2) for n_c in np.ravel(contingency.sum(axis=1)))
sum_comb_k = sum(comb(n_k, 2) for n_k in np.ravel(contingency.sum(axis=0)))
sum_comb = sum(comb(n_ij, 2) for n_ij in contingency.data)
prod_comb = (sum_comb_c * sum_comb_k) / comb(n_samples, 2)
mean_comb = (sum_comb_k + sum_comb_c) / 2.
score = (sum_comb - prod_comb) / (mean_comb - prod_comb)
print(score)
silhouette_score is an unsupervised clustering metric.
$$s=\frac{b-a}{\max(a,b)}$$
For each sample, a is its mean distance to the other points in its own cluster (mean intra-cluster distance), and b is its mean distance to the points in the nearest other cluster (mean nearest-cluster distance).
The silhouette_samples function computes this value s for every sample; silhouette_score is simply the mean of the per-sample values.
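The per-sample formula can be worked through by hand on a tiny 1-D dataset. A minimal sketch of the computation (the helper silhouette_sample is hypothetical; sklearn's silhouette_samples/silhouette_score should agree on this input):

```python
import numpy as np

# Two well-separated clusters on a line: {0, 1} and {10, 11}
X = np.array([0.0, 1.0, 10.0, 11.0])
labels = np.array([0, 0, 1, 1])

def silhouette_sample(i):
    d = np.abs(X - X[i])                              # distances from sample i
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[same].mean()                                # mean intra-cluster distance
    b = min(d[labels == k].mean()                     # mean distance to nearest
            for k in np.unique(labels) if k != labels[i])  # other cluster
    return (b - a) / max(a, b)

scores = [silhouette_sample(i) for i in range(len(X))]
print(np.mean(scores))  # close to 1, since the clusters are well separated
```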
https://blog.csdn.net/howhigh/article/details/73928635
sklearn official documentation