A Detailed Look at sklearn's One-Hot Encoding Implementation

Date: 2019-12-08


One-hot encoding is a staple of feature processing. In our project we use it like this:

# sklearn usage
from sklearn.preprocessing import OneHotEncoder

# Note: scikit-learn 1.2+ renames this parameter to sparse_output
enc = OneHotEncoder(sparse=False)
ans = enc.fit_transform([[0, 0, 3],
                         [1, 1, 0],
                         [0, 2, 1],
                         [1, 0, 2]])
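To see what the call above returns, here is a minimal sketch. It assumes scikit-learn is installed, and it leaves the encoder at its defaults and densifies the result with .toarray(), so it does not depend on whether the installed version spells the parameter sparse or sparse_output.

```python
from sklearn.preprocessing import OneHotEncoder

X = [[0, 0, 3],
     [1, 1, 0],
     [0, 2, 1],
     [1, 0, 2]]

# Default output is a SciPy sparse matrix; .toarray() turns it into a dense array.
enc = OneHotEncoder()
ans = enc.fit_transform(X).toarray()

# 2 + 3 + 4 distinct values across the three columns give 9 output columns.
print(ans.shape)
print(ans)
```

Each input row becomes one row of nine 0/1 entries, with exactly three 1s per row (one per original feature).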

For the principle behind it, see: link
In sklearn, the core logic of one-hot encoding lives in the _fit_transform method:

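import numpy as np
from scipy import sparse

def _fit_transform(self, X):
    # Number of rows (samples) and columns (features) of the input
    n_samples, n_features = X.shape
    # Per-column maximum plus 1, i.e. how many values each feature can take
    n_values = np.max(X, axis=0) + 1
    # Prepend 0 so the cumulative sum yields each feature's starting column
    n_values = np.hstack([[0], n_values])
    # Cumulative sum, used below to build the sparse matrix
    indices = np.cumsum(n_values)
    # Column index of every 1 in the sparse matrix
    column_indices = (X + indices[:-1]).ravel()
    # Row index of every 1: [0 0 0 1 1 1 2 2 2 3 3 3] for the example
    row_indices = np.repeat(np.arange(n_samples, dtype=np.int32),
                            n_features)
    # A one-hot entry is either 0 or 1, so start from an all-ones data vector
    data = np.ones(n_samples * n_features)
    # Store in Coordinate (COO) sparse format, then convert to CSR
    out = sparse.coo_matrix((data, (row_indices, column_indices)),
                            shape=(n_samples, indices[-1]),
                            dtype=self.dtype).tocsr()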

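To illustrate, take the example from the beginning, whose first row [0, 0, 3] is encoded as [1,0,1,0,0,0,0,0,1].
Look at the first feature, i.e. the first column [0,1,0,1]. It takes two values, 0 or 1, so one-hot uses two bits for it: [1,0] means 0 and [0,1] means 1. The first two bits of the encoded first row, [1,0...], therefore say that this feature is 0.
The second feature, the second column [0,1,2,0], takes three values, so one-hot uses three bits for it: [1,0,0] means 0, [0,1,0] means 1 and [0,0,1] means 2. Bits three to five of the encoded first row, [...1,0,0...], say that this feature is 0.
The third feature, the third column [3,0,1,2], takes four values, so one-hot uses four bits for it: [1,0,0,0] means 0, [0,1,0,0] means 1, [0,0,1,0] means 2 and [0,0,0,1] means 3. The last four bits of the encoded first row, [...0,0,0,1], say that this feature is 3.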
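The per-feature bit blocks described above can be checked by hand. The sketch below uses NumPy only, not sklearn, and assumes every column holds integers from 0 up to its maximum. The names blocks and encoded are made up for this illustration.

```python
import numpy as np

X = np.array([[0, 0, 3],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]])

# Number of bits per feature: column maximum plus 1 (assumes values 0..max).
n_values = X.max(axis=0) + 1

# np.eye(k)[v] selects row v of the k x k identity matrix,
# which is exactly the k-bit one-hot code of the value v.
blocks = [np.eye(k, dtype=int)[X[:, j]] for j, k in enumerate(n_values)]

# Placing the blocks side by side gives the full encoding.
encoded = np.hstack(blocks)

print([b[0].tolist() for b in blocks])  # [[1, 0], [1, 0, 0], [0, 0, 0, 1]]
print(encoded[0].tolist())              # [1, 0, 1, 0, 0, 0, 0, 0, 1]
```

The first row [0, 0, 3] yields a two-bit, a three-bit and a four-bit block, which together form the nine-bit row.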

How does the _fit_transform method achieve this? See the figure below.
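The intermediate values can also be traced in code. This is a rough sketch of the same steps run on the example data, not the sklearn source itself. For simplicity it fills a dense NumPy array in place of the COO sparse matrix, and offsets is a name chosen here for the cumulative sum with a leading 0.

```python
import numpy as np

X = np.array([[0, 0, 3],
              [1, 1, 0],
              [0, 2, 1],
              [1, 0, 2]])

n_samples, n_features = X.shape                  # 4, 3
n_values = np.max(X, axis=0) + 1                 # [2 3 4]

# Starting column of each feature's block, plus the total width at the end.
offsets = np.cumsum(np.hstack([[0], n_values]))  # [0 2 5 9]

# Shift each value by its feature's starting column, then flatten row by row.
column_indices = (X + offsets[:-1]).ravel()
# [0 2 8 1 3 5 0 4 6 1 2 7]

# Each sample contributes n_features ones, all in its own row.
row_indices = np.repeat(np.arange(n_samples), n_features)
# [0 0 0 1 1 1 2 2 2 3 3 3]

# Dense stand-in for the COO matrix: put a 1 at every (row, column) pair.
out = np.zeros((n_samples, offsets[-1]), dtype=int)
out[row_indices, column_indices] = 1
print(out)
```

For the first sample [0, 0, 3] the column indices are 0, 2 and 8, which is where the three 1s of its encoded row sit.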