I forget which experts' posts I originally collected these notes from.
One-Hot Encoding, also called one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and at any moment exactly one bit is active.
For example:
Natural state codes: 000, 001, 010, 100, 101
One-Hot codes: 000001, 000010, 000100, 010000, 100000
For each feature with m possible values, One-Hot encoding turns it into m binary features (e.g., a grade feature with possible values good / medium / poor becomes 100, 010, 001). These features are mutually exclusive and only one is active at a time, so the encoded data becomes very sparse.
The benefits of doing this are discussed in the Q&A further below.
Example (Scikit-learn):
from sklearn.preprocessing import OneHotEncoder

def main():
    print('program start')
    enc = OneHotEncoder()
    # Fit on 4 samples with 3 features each; the encoder learns each column's value set.
    enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
    array = enc.transform([[0, 1, 3]]).toarray()
    print(array)  # [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]
    print('program end')

if __name__ == '__main__':
    main()
Note: fit saw 4 samples with 3 features each, and transform was given 1 sample with 3 features. The first feature has two values (0: 10, 1: 01), the second has three values (0: 100, 1: 010, 2: 001), and the third has four values (0: 1000, 1: 0100, 2: 0010, 3: 0001). So [0, 1, 3] is transformed into [1,0, 0,1,0, 0,0,0,1].
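As a quick way to verify that note, OneHotEncoder exposes the value set learned for each column via categories_ (available in scikit-learn 0.20+; this snippet is an added illustration, not part of the original post):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
# categories_ lists the values seen in each column during fit:
# [array([0, 1]), array([0, 1, 2]), array([0, 1, 2, 3])] -> 2 + 3 + 4 = 9 output columns
print(enc.categories_)
print(enc.transform([[0, 1, 3]]).toarray())  # [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]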
In real machine-learning tasks, many feature values are not continuous; gender, for example, takes the values 'male' and 'female'. Such features usually need to be digitized,
as in the following example:
Gender: ['male', 'female']
Region: ['Europe', 'US', 'Asia']
Browser: ['Firefox', 'Chrome', 'Safari', 'Internet Explorer']
For a sample such as ['male', 'US', 'Internet Explorer'], we need to digitize these categorical values. The most direct way is ordinal encoding: [0, 1, 3]; likewise ['female', 'Asia', 'Chrome'] becomes [1, 2, 1]. But such features cannot be fed directly into a machine-learning algorithm.
This is because a classifier by default assumes its input data is continuous and ordered, whereas in the representation above the numbers carry no order; they are assigned arbitrarily.
Why must the data be continuous and ordered? (This is addressed in the Q&A below.)
How One-Hot Encoding handles this
For the problem above, gender has two dimensions, region three, and browser four. Applying One-Hot encoding to the sample ['male', 'US', 'Internet Explorer']: 'male' maps to [1,0], 'US' to [0,1,0], and 'Internet Explorer' to [0,0,0,1]; the complete digitized feature vector is [1,0, 0,1,0, 0,0,0,1]. Note: ⚠️ a consequence of this encoding is that the data becomes very sparse.
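A minimal sketch of the same encoding applied directly to the string categories (scikit-learn 0.20+ accepts strings; the explicit categories order below is chosen to match the lists above and is otherwise an assumption):

from sklearn.preprocessing import OneHotEncoder

# Fix the category order explicitly so the output column layout is predictable.
enc = OneHotEncoder(categories=[
    ['male', 'female'],
    ['Europe', 'US', 'Asia'],
    ['Firefox', 'Chrome', 'Safari', 'Internet Explorer'],
])
enc.fit([['male', 'US', 'Internet Explorer']])
print(enc.transform([['male', 'US', 'Internet Explorer']]).toarray())
# [[1. 0. 0. 1. 0. 0. 0. 0. 1.]]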
1. Why do we binarize categorical features?
We binarize the categorical input so that it can be thought of as a vector in Euclidean space (we call this embedding the vector in the Euclidean space). In other words, one-hot encoding extends the values of a discrete feature into Euclidean space: each value of the feature corresponds to a point in that space.
2. Why do we embed the feature vectors in the Euclidean space?
Because many algorithms for classification/regression/clustering etc. require computing distances between features or similarities between features, and many definitions of distances and similarities are defined over features in Euclidean space. So we would like our features to lie in Euclidean space as well. In other words, discrete features are mapped into Euclidean space via one-hot encoding because distance and similarity computations are central to regression, classification and clustering, and the measures we commonly use, such as cosine similarity, are defined on Euclidean space.
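As a small illustration of that point (this NumPy snippet is an added sketch, not from the original post): one-hot vectors of different category values are orthogonal unit vectors, so their cosine similarity is 0 for every pair.

import numpy as np

x1, x2, x3 = np.eye(3)  # one-hot vectors for the three values of one categorical feature
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(x1, x2), cos(x2, x3), cos(x1, x3))  # 0.0 0.0 0.0 -- every pair equally dissimilar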
3. Why does embedding the feature vector in Euclidean space require us to binarize categorical features?
Let us take an example of a dataset with just one feature (say job_type) that takes three values 1, 2, 3.
Now, take three feature vectors x_1 = (1), x_2 = (2), x_3 = (3). What are the Euclidean distances between x_1 and x_2, x_2 and x_3, and x_1 and x_3? d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. This says the distance between job type 1 and job type 2 is smaller than the distance between job type 1 and job type 3. Does this make sense? Can we even rationally define a proper distance between different job types? In many cases of categorical features, we cannot properly define a distance between the different values the feature takes. In such cases, isn't it fair to assume that all values of the categorical feature are equally far away from each other?
Now, let us see what happens when we binarize the same feature vectors: x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1). What are the distances between them now? They are all sqrt(2). So, essentially, when we binarize the input, we implicitly state that all values of the categorical feature are equally far away from each other.
Using one-hot encoding for discrete features indeed makes distance computations between features more reasonable. For example, take a discrete feature for job type with three possible values. Without one-hot encoding the representation is x_1 = (1), x_2 = (2), x_3 = (3), and the distances between jobs are d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean job x_1 is less similar to x_3 than to x_2? Clearly the distances computed from this representation are unreasonable. With one-hot encoding we get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2); every pair of jobs is equally far apart, which is more reasonable.
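A small numeric check of the job_type argument (an added sketch using NumPy; the encodings are the ones given above):

import numpy as np
from itertools import combinations

ordinal = {1: np.array([1.0]), 2: np.array([2.0]), 3: np.array([3.0])}                    # x_i = (i)
onehot = {1: np.array([1.0, 0, 0]), 2: np.array([0, 1.0, 0]), 3: np.array([0, 0, 1.0])}  # one-hot

for a, b in combinations([1, 2, 3], 2):
    d_ord = np.linalg.norm(ordinal[a] - ordinal[b])  # distances 1, 2, 1: unequal
    d_hot = np.linalg.norm(onehot[a] - onehot[b])    # always sqrt(2) ~= 1.414: equal
    print(f"d({a},{b}): ordinal={d_ord:.3f}  one-hot={d_hot:.3f}")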
4. About the original question
Note that our reason for binarizing categorical features is independent of the number of values the categorical feature takes, so yes, even if the categorical feature takes 1000 values, we would still prefer to binarize.
5. Are there cases when we can avoid binarization? (Cases where one-hot encoding is unnecessary)
Yes. As we figured out earlier, the reason we binarize is that we want some meaningful distance relationship between the different values. As long as there is such a relationship, we can avoid binarizing the categorical feature. For example, suppose you are building a classifier to decide whether a webpage is an important entity page (a page important to a particular entity), and you have the page's rank in the search results for that entity as a feature. Then 1] the rank feature is categorical, and 2] rank 1 and rank 2 are clearly closer to each other than rank 1 and rank 3, so the rank feature already defines a meaningful distance relationship, and in this case we don't have to binarize it.
More generally, if you can cluster the categorical values into disjoint subsets such that the subsets have a meaningful distance relationship among them, then you don't have to binarize fully; instead you can split the feature over these clusters. For example, if a categorical feature has 1000 values but you can split them into two groups of, say, 400 and 600, and within each group the values have a meaningful distance relationship, then instead of fully binarizing you can just add 2 features, one per cluster, and that should be fine.
The purpose of one-hot encoding a discrete feature is to make distance computation reasonable; if the feature is discrete but distances can already be computed sensibly without it, then one-hot encoding is unnecessary. For instance, if a discrete feature has 1000 values and we can split them into two groups of 400 and 600 such that the distance between the two groups and the distances within each group are well defined, then there is no need for one-hot encoding.
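A hedged sketch of the grouping idea (the value_to_group mapping and the two-column layout are hypothetical, purely to illustrate "one feature per cluster" instead of 1000 one-hot columns):

# value_to_group maps each raw category value to (group, within-group value),
# where the within-group value already has a meaningful distance (e.g. a rank).
def encode(raw_value, value_to_group):
    group, rank = value_to_group[raw_value]
    # Two columns, one per cluster; the meaningful value goes into its own cluster's column.
    return [rank, 0.0] if group == 'A' else [0.0, rank]

value_to_group = {'cat_007': ('A', 3), 'cat_512': ('B', 17)}  # hypothetical entries
print(encode('cat_007', value_to_group))  # [3, 0.0]
print(encode('cat_512', value_to_group))  # [0.0, 17]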
After one-hot encoding a discrete feature, each dimension of the encoded feature can be treated as a continuous feature, so it can be normalized the same way continuous features are, e.g. scaled to [-1, 1] or standardized to zero mean and unit variance.
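A minimal sketch of that, treating each encoded column as continuous and standardizing it with scikit-learn's StandardScaler (the toy matrix is the one from the earlier example):

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
X_hot = OneHotEncoder().fit_transform(X).toarray()  # densify so StandardScaler can center it
X_std = StandardScaler().fit_transform(X_hot)       # each encoded column: mean 0, variance 1
print(X_std.mean(axis=0).round(3))
print(X_std.std(axis=0).round(3))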
Some situations do not require feature normalization:
It depends on your ML algorithm. Some methods require almost no effort to normalize features or can handle both continuous and discrete features, like tree-based methods: C4.5, CART, random forest, bagging or boosting. But most parametric models (generalized linear models, neural networks, SVM, etc.) and methods using distance metrics (KNN, kernels, etc.) require careful work to achieve good results. Standard approaches include binarizing all categorical features, transforming all continuous features to zero mean and unit variance, etc.
Tree-based methods (random forest, bagging, boosting, etc.) do not need feature normalization; parametric models and distance-based models do.
First of all, one-hot encoding is a way of using an N-bit state register to encode N states.
e.g. high / medium / low are not separable as a single value; encoded into three bits they become separable and correspond to mutually independent events
→ much as in an SVM, features that are not linearly separable become separable after projection into a higher-dimensional space. GBDT does not handle high-dimensional sparse matrices well, and even on low-dimensional sparse matrices it is not necessarily better than SVM.
For a decision tree, the essence of one-hot is to increase the depth of the tree.
A tree model dynamically generates a mechanism similar to One-Hot + Feature Crossing during training:
1. One or more features are ultimately encoded into a leaf node; one-hot can be understood as independent events (three, in the high / medium / low example)
2. A decision tree has no notion of feature magnitude, only of which part of a feature's distribution a value falls in
One-hot can make the problem linearly separable, but it is not as good as label encoding.
Drawbacks of one-hot after dimensionality reduction:
features that could be crossed before the reduction may no longer be crossable afterwards
Training process of a tree model:
the number of nodes on the path from the root to a leaf equals the number of times features are crossed, so a tree model performs the crossing by itself
e.g. Is it long? { no → (is it yellow? yes → pomelo, no → apple), yes → banana }; the 'round' x 'yellow' cross: with shape ∈ {round, long} and color ∈ {yellow, red}, the one-hot sample has degree 4
Using the tree model's leaf nodes as the feature-cross result avoids unnecessary crossing operations and shrinks the dimension and the candidate set of crossing degrees
e.g. degree-2 crossing → a feature vector of dimension 8; the tree → only 3 leaf nodes
A tree model consumes less computation and fewer resources than One-Hot + a high-degree Cartesian product + lasso.
This is why a linear model can be stacked on top of a tree model:
an n*m input sample → after training, the decision tree tells which leaf each sample lands in → output the leaf index → an n*1 matrix → one-hot encode it → an n*o matrix (o is the number of leaf nodes) → train a linear model on it (see the code sketch below)
Typical usage: GBDT + RF
Advantage: saves the time and space needed for feature crossing.
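A hedged sketch of the n*m → leaf index → one-hot → linear model pipeline described above, using scikit-learn's GradientBoostingClassifier.apply() to obtain leaf indices and LogisticRegression as the stacked linear model (the synthetic dataset and hyperparameters are placeholders):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # n*m input

gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0]    # leaf index per sample and per tree -> n * n_estimators
enc = OneHotEncoder().fit(leaves)  # one-hot the leaf indices -> n * o (o = total leaf count)
lr = LogisticRegression(max_iter=1000).fit(enc.transform(leaves), y)  # stacked linear model
print(lr.score(enc.transform(leaves), y))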
If a model is trained only on one-hot features, the features are independent of one another.
One way to understand existing models is as G(l(tensor)),
where l(·) is the model at each node
and G(·) is the topology connecting the nodes.
Neural network: l(·) is a logistic-regression model,
G(·) is full connection.
Decision tree: l(·) is LR,
G(·) is a tree-shaped connection.
Room for innovation: take l(·) to be NB, SVM, a single-layer NN, etc.,
and G(·) to be some other way of passing information.