This model was first applied to medical diagnosis, where the different values of the class variable represent the different diseases a patient might have, and the evidence variables represent symptoms, lab results, and so on. For simple disease diagnosis the naive Bayes model indeed worked very well, sometimes even outperforming human experts. In deeper applications, however, physicians found that for more complex diseases (those with multiple interacting causes and symptoms) the model did not perform well.

After some analysis, data scientists concluded that the cause of this phenomenon is that the model makes several strong assumptions that are usually not true, for example that all evidence variables (symptoms, test results) are conditionally independent of one another given the disease.

This kind of model could be used for medical diagnosis because its small number of interpretable parameters is easy to elicit from experts, and early machine-assisted medical diagnosis systems were built on exactly this technique.

However, later and deeper practice showed that the strong assumptions behind such a model lower its diagnostic accuracy. In particular, the model tends to "over-count" certain pieces of evidence and thus easily overestimates the influence of particular features.

For example, "hypertension" and "obesity" are two strong indicators of heart disease, but the two symptoms are highly correlated with each other: hypertension usually goes hand in hand with obesity. When the naive Bayes formula is applied, the evidence from this aspect gets counted twice because of the multiplied likelihood terms, as in the following formula:
P(heart disease | hypertension, obesity) = P(hypertension | heart disease) * P(obesity | heart disease) * P(heart disease) / P(hypertension, obesity)
Because "hypertension" and "obesity" are strongly correlated, it is easy to see that the product in the numerator grows faster than the joint probability in the denominator. As more such factors are added to the numerator, the posterior probability therefore keeps growing; but since the added features carry no new information, this inflated posterior actually degrades the model's predictive performance.
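To make the double-counting effect concrete, here is a minimal sketch (my own illustration on the iris data used in the experiments below, not part of any real diagnosis system): duplicating one feature several times is the extreme case of adding perfectly correlated "new symptoms", and it pushes the GaussianNB posteriors toward 0/1 even though no new information has been added.

# Duplicating a feature multiplies its likelihood into the posterior several times.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# model trained on the original 4 features
nb_orig = GaussianNB().fit(X, y)

# duplicate the first feature 5 extra times (perfectly correlated "new evidence")
X_dup = np.hstack([X, np.tile(X[:, :1], (1, 5))])
nb_dup = GaussianNB().fit(X_dup, y)

# the posteriors of the duplicated model are typically pushed even closer to 0/1
print("mean max posterior, original features : {0:.3f}".format(
    nb_orig.predict_proba(X).max(axis=1).mean()))
print("mean max posterior, duplicated feature: {0:.3f}".format(
    nb_dup.predict_proba(X_dup).max(axis=1).mean()))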
In fact, practitioners have found that the diagnostic performance of a naive Bayes model can degrade as features are added, and this degradation is usually attributed to violations of the strong conditional independence assumption.

The author calls this phenomenon "over-featuring". It is common in engineering practice, and if it is not effectively avoided it can significantly reduce a model's generalization and predictive performance. In this article we use experiments and analysis to examine this claim.

The iris dataset describes each flower with 4 features (sepal length, sepal width, petal length, petal width), and from these 4 features we can predict which of the species (iris-setosa, iris-versicolour, iris-virginica) a flower belongs to.

Let us first discuss under-featuring. Our dataset has 4 feature dimensions, and all 4 are highly correlated with the target; in other words, all 4 features are informative:
# -*- coding: utf-8 -*-
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils import shuffle

if __name__ == '__main__':
    # naive Bayes
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vec: first 80% as train, last 20% as test
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # label
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # load origin feature (all 4 columns)
    X_train_vec = X_train[:, :4]
    X_test_vec = X_test[:, :4]

    # Pearson correlation of each feature column with the target
    for i in range(4):
        print("Pearson Relevance X[{0}]: ".format(i),
              np.corrcoef(X_train_vec[:, i], Y_train)[0, 1])
All 4 features have a Pearson correlation with the target above 0.5.

Now let us train naive Bayes models using only 1, 2, 3 and then all 4 features, and compare their generalization and predictive performance:
# -*- coding: utf-8 -*-
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def model_train_and_test(feature_cn):
    # use only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vec
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # label
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # train and test with 1, 2, 3, then 4 features
    model_train_and_test(1)
    model_train_and_test(2)
    model_train_and_test(3)
    model_train_and_test(4)
As we can see, with only 1 feature the prediction accuracy on the test set is just 33.3%; as more features are added, the accuracy on the test set gradually increases.
Viewed from the Bayesian-network perspective, the naive Bayes model has a simple star-shaped structure: a single Class node with a directed edge to every feature node Xi.

Each Xi node corresponds to a feature. Every Xi node added to the network contributes to the probabilistic inference about Class, and the more informative Xi nodes there are, in principle, the more accurate that inference becomes.

This is also easy to understand from an information-theoretic point of view: we can regard P(Class | Xi) as a process of passing information that lowers the conditional entropy of Class; the more information we provide, in principle, the lower the remaining uncertainty about Class.
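As a rough illustration (my own sketch, not from the original experiments), we can estimate how much each iris feature reduces the uncertainty about the class via the mutual information I(Class; Xi) = H(Class) - H(Class | Xi):

# Estimate the uncertainty reduction contributed by each feature.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

iris = load_iris()
X, y = iris.data, iris.target

# entropy of the class prior, H(Class), in nats
p = np.bincount(y) / float(len(y))
h_class = -np.sum(p * np.log(p))
print("H(Class) = {0:.3f}".format(h_class))

# estimated mutual information of each feature with the class
mi = mutual_info_classif(X, y, random_state=0)
for name, m in zip(iris.feature_names, mi):
    print("I(Class; {0}) ~ {1:.3f}  ->  H(Class | {0}) ~ {2:.3f}".format(name, m, h_class - m))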
So far we can draw the following conclusion:

During feature engineering we need to pay special attention to the description integrity problem: if the feature dimensions are incomplete, no amount of extra data will substantially help the model. The probabilistic completeness of a sample set has to be guaranteed from two sides, "feature completeness" and "data completeness", and both ultimately come down to the same issue of information completeness.

Now, on top of the original 4 feature dimensions, we keep adding new useless features, that is, features with very low correlation to the target.
# -*- coding: utf-8 -*-
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle
import random


def feature_expand(feature_vec):
    # colum_1 * colum_2: a derived feature
    feature_vec = np.hstack((feature_vec,
                             np.multiply(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # 4 random columns drawn from [0, colum_1): useless noise
    for _ in range(4):
        feature_vec = np.hstack((feature_vec,
                                 np.array([[random.uniform(.0, i)] for i in feature_vec[:, 0]])))
    # 4 random columns drawn from [0, colum_2): useless noise
    for _ in range(4):
        feature_vec = np.hstack((feature_vec,
                                 np.array([[random.uniform(.0, i)] for i in feature_vec[:, 1]])))
    return feature_vec


def model_train_and_test(X_train, X_test, Y_train, Y_test, feature_cn):
    # use only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vec
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # label
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # expand feature
    X_train = feature_expand(X_train)
    X_test = feature_expand(X_test)

    # show Pearson correlation of every column with the target
    for i in range(len(X_train[0])):
        print("Pearson Relevance X[{0}]: ".format(i),
              np.corrcoef(X_train[:, i], Y_train)[0, 1])

    model_train_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]))
We used the random function to simulate useless new features. As we can see, the useless features not only fail to help the model, they actually lower its performance.

So far we can draw the following conclusion:

More features are not always better. Machine learning is not a washing machine into which you can dump every feature you have, fold your hands, and hope the model will work magic and automatically pick out the good ones. Of course, techniques such as dropout and regularization do help improve model performance, and in essence they too work by discarding or down-weighting some features, thereby easing the damage that junk features do to the model.
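As a side note, here is a minimal sketch of explicit feature selection (my own addition, not one of the experiments in this post): scoring the columns with a univariate criterion such as mutual information and keeping only the top ones is one simple way to stop junk features from ever reaching the model.

# Univariate feature selection in front of GaussianNB, on iris plus 8 pure-noise columns.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = load_iris()
rng = np.random.default_rng(0)
X = np.hstack([iris.data, rng.uniform(size=(len(iris.data), 8))])   # 4 real + 8 junk features
X_tr, X_te, y_tr, y_te = train_test_split(X, iris.target, test_size=0.2, random_state=0)

raw = GaussianNB().fit(X_tr, y_tr)
print("all 12 features    :", accuracy_score(y_te, raw.predict(X_te)))

selector = SelectKBest(mutual_info_classif, k=4).fit(X_tr, y_tr)    # keep the 4 most informative columns
sel = GaussianNB().fit(selector.transform(X_tr), y_tr)
print("4 selected features:", accuracy_score(y_te, sel.predict(selector.transform(X_te))))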
Of course, automated feature engineering ("autoFeature" style techniques) may well mature in the future, but as data science practitioners we still have to understand the meaning of feature engineering ourselves.

So-called "feature processing" concretely means applying transformations, in the simplest case linear ones (stretches and rotations), to the original features in order to obtain new features.
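For instance (a sketch of my own, not one of the experiments in this post), a stretch simply rescales individual features, while a rotation mixes several original features into new linear combinations:

# Stretch and rotation as feature transformations (illustrative only).
import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data[:, :2]            # take 2 of the original features for simplicity

stretch = np.diag([2.0, 0.5])          # scale feature 1 up and feature 2 down
theta = np.pi / 6
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

X_new = X @ stretch @ rotation.T       # each row is a pair of new, derived features
print(X_new[:3])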
本質上來講,咱們能夠將深度神經網絡的隱層看作是一種特徵加工操做,稍有不一樣的是,深度神經網絡中激活函數充當了非線性扭曲的做用,不過其本質思想仍是不變的。
The next question, then, is whether feature processing affects model performance.

The more precise answer is: the effect of feature processing on the model depends on how correlated the newly derived features are with the target, and on what proportion of all features the bad ones make up.

Let us explain this statement with a few experiments. Below, the author simulates several typical scenarios and then gives an overall conclusion:
# -*- coding: utf-8 -*-
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def feature_expand(feature_vec):
    # colum_1 * colum_2: the only derived feature in this scenario
    feature_vec = np.hstack((feature_vec,
                             np.multiply(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # other candidate derived features, disabled in this scenario:
    # colum_1 / colum_2, colum_3 * colum_4, colum_4 * colum_1,
    # colum_1 ^ 2, colum_2 ^ 2, colum_3 ^ 2, colum_4 ^ 2
    return feature_vec


def model_train_and_test(X_train, X_test, Y_train, Y_test, feature_cn):
    # use only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vec
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # label
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # expand feature
    X_train = feature_expand(X_train)
    X_test = feature_expand(X_test)

    # show Pearson correlation of every column with the target
    for i in range(len(X_train[0])):
        print("Pearson Relevance X[{0}]: ".format(i),
              np.corrcoef(X_train[:, i], Y_train)[0, 1])

    # without the derived feature vs. with it
    model_train_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]) - 1)
    model_train_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]))
In the code above we added one new feature dimension, "colum_1 * colum_2", and printed its Pearson correlation with the target: only about 0.15, so it is a poor feature. At the same time, this bad feature makes up 1/5 of all the features, which is not a small share.

Under these conditions the model's accuracy is affected and drops. The reason was explained earlier: because of the multiplied likelihood terms, a bad feature distorts the final probability.
# -*- coding: utf-8 -*-
from sklearn.naive_bayes import GaussianNB
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.utils import shuffle


def feature_expand(feature_vec):
    # colum_1 * colum_2 (the bad feature)
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # colum_1 / colum_2
    feature_vec = np.hstack((feature_vec, np.divide(feature_vec[:, 0], feature_vec[:, 1]).reshape(-1, 1)))
    # colum_3 * colum_4
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 2], feature_vec[:, 3]).reshape(-1, 1)))
    # colum_4 * colum_1
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 3], feature_vec[:, 0]).reshape(-1, 1)))
    # colum_1 ^ 2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 0], feature_vec[:, 0]).reshape(-1, 1)))
    # colum_2 ^ 2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 1], feature_vec[:, 1]).reshape(-1, 1)))
    # colum_3 ^ 2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 2], feature_vec[:, 2]).reshape(-1, 1)))
    # colum_4 ^ 2
    feature_vec = np.hstack((feature_vec, np.multiply(feature_vec[:, 3], feature_vec[:, 3]).reshape(-1, 1)))
    return feature_vec


def model_train_and_test(X_train, X_test, Y_train, Y_test, feature_cn):
    # use only the first feature_cn feature columns
    X_train_vec = X_train[:, :feature_cn]
    X_test_vec = X_test[:, :feature_cn]

    # train model
    muNB.fit(X_train_vec, Y_train)

    # predict the test data
    y_predict = muNB.predict(X_test_vec)

    print("feature_cn: ", feature_cn)
    print('accuracy is: {0}'.format(accuracy_score(Y_test, y_predict)))
    print('confusion matrix is: {0}'.format(confusion_matrix(Y_test, y_predict)))
    print(' ')


if __name__ == '__main__':
    # naive Bayes
    muNB = GaussianNB()

    # load data
    iris = load_iris()
    print("np.shape(iris.data): ", np.shape(iris.data))

    # feature vec
    X_train = iris.data[:int(len(iris.data) * 0.8)]
    X_test = iris.data[int(len(iris.data) * 0.8):]
    # label
    Y_train = iris.target[:int(len(iris.data) * 0.8)]
    Y_test = iris.target[int(len(iris.data) * 0.8):]

    # shuffle
    X_train, Y_train = shuffle(X_train, Y_train)
    X_test, Y_test = shuffle(X_test, Y_test)

    # expand feature
    X_train = feature_expand(X_train)
    X_test = feature_expand(X_test)

    # show Pearson correlation of every column with the target
    for i in range(len(X_train[0])):
        print("Pearson Relevance X[{0}]: ".format(i),
              np.corrcoef(X_train[:, i], Y_train)[0, 1])

    model_train_and_test(X_train, X_test, Y_train, Y_test, len(X_train[0]))
In this scenario the bad feature "colum_1 * colum_2" is still present, but unlike in the previous scenario, all of the other newly added features are good features (each highly correlated with the target).

By the same product-of-factors reasoning it is easy to see that the influence of this single bad feature on the final probability gets "diluted", which reduces its impact on model performance.

So far we can draw the following conclusion:

The hidden layers of a deep neural network increase the number of features on a massive scale. In essence, a deep network uses linear (matrix) transformations and nonlinear activation functions to produce an enormous number of combined feature dimensions. We can easily imagine that among them there must be good features (highly correlated) as well as bad features (weakly correlated).

One thing we can be fairly sure of, though, is that good features appear far more often than bad ones, because all of these derived features originate from the good features at the input layer (much like inheritance in genetic evolution). When enough new features are added, probabilistically the influence of the good features will far outweigh that of the bad ones and cancel out the harm the bad features do to model performance. This is one of the reasons deep neural networks adapt so well.

To put it colloquially: if you have an ox cleaver, why not use it to kill a chicken? The advantage of killing a chicken with an ox cleaver is that whatever shows up, chicken or ox, you can be sure of killing it.

Redundant features and over-featuring are not rare in machine learning models, and they show up in different forms in different models, for example:

Here are some guiding principles the author has summarized from engineering practice: