sklearn clustering evaluation metrics

All metrics in sklearn live under the sklearn.metrics package; the clustering-related ones live under sklearn.metrics.cluster and fall into two groups, supervised and unsupervised, implemented in the sklearn.metrics.cluster.supervised and sklearn.metrics.cluster.unsupervised modules respectively. Most clustering metrics are supervised; there are only a few unsupervised ones.
Unsupervised and supervised metrics should be used together: if the unsupervised metrics look good but the supervised ones do not, the problem probably cannot be solved by clustering alone; if the unsupervised metrics look bad but the supervised ones look good, the supervised scores are most likely unreliable and the labels themselves may be flawed.

sklearn.metrics.cluster.__init__.py imports all of the clustering metrics.
In fact, the sklearn.metrics package re-exports everything under cluster, so you can use sklearn.metrics directly without worrying about sklearn.metrics.cluster.

from .supervised import adjusted_mutual_info_score
from .supervised import normalized_mutual_info_score
from .supervised import adjusted_rand_score
from .supervised import completeness_score
from .supervised import contingency_matrix
from .supervised import expected_mutual_information
from .supervised import homogeneity_completeness_v_measure
from .supervised import homogeneity_score
from .supervised import mutual_info_score
from .supervised import v_measure_score
from .supervised import fowlkes_mallows_score
from .supervised import entropy
from .unsupervised import silhouette_samples
from .unsupervised import silhouette_score
from .unsupervised import calinski_harabaz_score
from .bicluster import consensus_score
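
As a quick sanity check (a minimal sketch; the choice of adjusted_rand_score here is arbitrary), the same function can be reached from either namespace:

import numpy as np
from sklearn import metrics

labels_true = np.array([0, 0, 1, 1])
labels_pred = np.array([0, 0, 1, 2])

# Both paths resolve to the same re-exported function object.
print(metrics.adjusted_rand_score(labels_true, labels_pred))
print(metrics.cluster.adjusted_rand_score(labels_true, labels_pred))
print(metrics.adjusted_rand_score is metrics.cluster.adjusted_rand_score)  # True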

Prerequisites

Before looking at these clustering metrics, a little background is needed to follow the code.

COO

A storage format for sparse matrices that keeps three arrays: row indices, column indices, and values.
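
A minimal sketch with scipy.sparse.coo_matrix; the three arrays below happen to reproduce the contingency matrix used in the next section:

import numpy as np
from scipy.sparse import coo_matrix

# Three parallel arrays: row indices, column indices and values.
# Any position not listed is implicitly zero.
row  = np.array([0, 1, 2, 2, 3])
col  = np.array([0, 2, 1, 2, 2])
data = np.array([1, 1, 1, 2, 1])
m = coo_matrix((data, (row, col)), shape=(4, 3))
print(m.todense())
# [[1 0 0]
#  [0 0 1]
#  [0 1 2]
#  [0 0 1]]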

contingency_matrix (contingency matrix)

import numpy as np
from sklearn import metrics
from sklearn.metrics.cluster.supervised import contingency_matrix

labels_true = np.array([0, 2, 2, 3, 2, 1])
labels_pred = np.array([0, 2, 2, 2, 1, 2])
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
print(contingency.todense())

The output is:
[[1 0 0]
 [0 0 1]
 [0 1 2]
 [0 0 1]]

The contingency matrix has as many rows as there are true classes and as many columns as there are predicted clusters; the entry at row i, column j is the number of samples whose true class is i and whose assigned cluster is j.
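
To make that definition concrete, here is a hand-rolled version (a sketch, not the library implementation) that reproduces the matrix printed above:

import numpy as np

labels_true = np.array([0, 2, 2, 3, 2, 1])
labels_pred = np.array([0, 2, 2, 2, 1, 2])

classes = np.unique(labels_true)   # rows: true classes
clusters = np.unique(labels_pred)  # columns: predicted clusters

# manual[i, j] counts the samples whose true class is classes[i]
# and whose predicted cluster is clusters[j].
manual = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                    for k in clusters]
                   for c in classes])
print(manual)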

Adjusted Rand Index

"Adjusted" means: $score = \frac{x - E(x)}{\max(x) - E(x)}$
The Rand index is one metric and mutual information is another; applying this adjustment to them yields two further metrics, the adjusted Rand index and the adjusted mutual information.
The point of the adjustment is that a random clustering should score as low as possible (close to 0).

import numpy as np
from scipy.special import comb
from sklearn import metrics
from sklearn.metrics.cluster.supervised import contingency_matrix

labels_true = np.array([0, 2, 2, 3, 2, 1])
labels_pred = np.array([0, 2, 2, 2, 1, 2])

# Library result
score = metrics.cluster.adjusted_rand_score(labels_true, labels_pred)
print(score)

# Recompute it by hand from the contingency matrix
n_samples = labels_true.shape[0]
n_classes = np.unique(labels_true).shape[0]
n_clusters = np.unique(labels_pred).shape[0]
contingency = contingency_matrix(labels_true, labels_pred, sparse=True)
print(contingency.todense())
sum_comb_c = sum(comb(n_c, 2) for n_c in np.ravel(contingency.sum(axis=1)))  # pairs within each true class
sum_comb_k = sum(comb(n_k, 2) for n_k in np.ravel(contingency.sum(axis=0)))  # pairs within each cluster
sum_comb = sum(comb(n_ij, 2) for n_ij in contingency.data)                   # pairs that agree in both
prod_comb = (sum_comb_c * sum_comb_k) / comb(n_samples, 2)  # expected value of sum_comb under random labelling
mean_comb = (sum_comb_k + sum_comb_c) / 2.                  # stand-in for the maximum in the adjustment formula
score = (sum_comb - prod_comb) / (mean_comb - prod_comb)
print(score)
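
To see the effect of the adjustment, compare the raw and adjusted scores on two labelings that have nothing to do with each other (a sketch; the sizes and the random seed are arbitrary):

import numpy as np
from sklearn import metrics

rng = np.random.RandomState(0)
labels_true = rng.randint(0, 10, size=100)
labels_random = rng.randint(0, 10, size=100)  # a "clustering" unrelated to the labels

# The unadjusted mutual information is clearly positive even for random labels,
# while the adjusted scores stay around 0.
print(metrics.mutual_info_score(labels_true, labels_random))
print(metrics.adjusted_mutual_info_score(labels_true, labels_random))
print(metrics.adjusted_rand_score(labels_true, labels_random))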

silhouette_score

silhouette_score is an unsupervised clustering metric.
$$s = \frac{b - a}{\max(a, b)}$$
For each sample, a is the mean distance to the other points in its own cluster, and b is the mean distance to the points of the nearest other cluster (the smallest such mean over all other clusters).
The silhouette_samples function computes this value s for every sample; silhouette_score is simply the mean of the per-sample values.
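
A short usage sketch (the make_blobs/KMeans setup is only for illustration) confirming that silhouette_score equals the mean of silhouette_samples:

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, random_state=0).fit_predict(X)

per_sample = metrics.silhouette_samples(X, labels)  # one silhouette value per sample
overall = metrics.silhouette_score(X, labels)       # mean of the per-sample values

print(np.allclose(overall, per_sample.mean()))  # True
print(overall)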

References

https://blog.csdn.net/howhigh/article/details/73928635
sklearn official documentation
