The central problem in feature extraction is how to extract important latent features from a dataset when the data do not explicitly encode an outcome. Like unsupervised clustering, feature-extraction algorithms are not built for prediction; instead they try to characterize the data and uncover the deeper structure hidden behind it.
Recall the basic idea of clustering: a clustering algorithm assigns each row of the dataset to some group (or point); every item corresponds to exactly one group, and that group represents the average profile of its members.
Feature extraction is a more general form of this clustering idea. It tries to find new rows of data which, when combined, can reconstruct the original dataset. Unlike the original data, each row of this new dataset does not belong to a single cluster; instead every original row is built from a combination of several features.
Seen this way, feature extraction can also be understood as a generalized notion of dimensionality reduction.
In this article I will start from the underlying mathematics and explain the connections and differences between these concepts. Feature extraction, dimensionality reduction and clustering are really just different angles on the same class of problem, and the principles behind them are shared.
Consider the problem of telling voices apart when several people are talking at once. A striking property of the human auditory system is that in a noisy, crowded room, even though many different sounds reach our ears mixed together, we can still pick out an individual voice; the brain is remarkably good at separating a single voice from all the noise it hears.
With the independent-feature-extraction techniques discussed in this article, a computer can potentially accomplish the same task.
An important application of independent feature extraction is word-usage pattern recognition over words that recur across a collection of documents. This helps us identify themes that appear in different combinations across documents. The themes extracted from the documents can then serve as generalized features summarizing a whole class of documents. By identifying independent features we also find that an article can contain more than one theme, and the same theme can appear in more than one article.
For example, a major news story is often covered by several outlets from different angles; although each outlet has its own emphasis, some common themes inevitably overlap. When Trump won the US presidential election, both CNN and the New York Times covered it; their emphases differed, but both, for instance, discussed Trump's past business career.
Note in particular that the documents being searched must contain overlapping (crossover) topics. If the documents have nothing in common, the algorithm will struggle to extract meaningful shared features, and any features extracted by force will end up meaningless (which again reflects the dimensionality-reduction and clustering nature of independent feature extraction).
The stock market is a classic setting where collective intelligence is at work. In market activity, every individual follows the principle of maximizing their own return (Adam Smith's "invisible hand"): put simply, every investor makes each investment to maximize their own profit, and the millions of investors together make up the stock market.
On this premise, we assume that many underlying causes lie hidden behind stock-market data, and that it is their combined effect that produces what we observe in the securities market. We can apply independent feature extraction to these data to look for the causes behind them and for the influence each cause has on the outcome.
Independent feature extraction can be used to analyze trading volume, i.e. the number of shares traded within a given time window. The chart below shows Yahoo! stock:
the line at the top is the closing price, and the bar chart below it shows the trading volume.
We find that when the price moves sharply, volume on those days tends to spike. This usually happens when the company makes an important announcement or releases financial results; news about the company or its industry can also cause sudden price "jumps". In the absence of external events, volume for a given stock usually (but not always) stays flat.
The independent features extracted from trading volume essentially correspond to visible or hidden "good" or "bad" events.
Relevant Link:
Programming Collective Intelligence, Toby Segaran - Chapter 10
NMF is a fairly mathematical concept. We will go into its underlying mathematics later in this chapter, but first I want to build an intuitive picture with a simple, easy-to-follow example, as groundwork for the discussion of the theory.
Suppose that, using a term-frequency language model, we have already converted a document collection (many articles) into an [M, N] matrix, where M is the number of articles, N is the number of words, and each entry is the count of that word in the corresponding article.
Our goal is to factorize this matrix: to find two smaller matrices whose product reproduces the original. These two matrices are the feature matrix and the weights matrix.
【Feature matrix】
In this matrix, each feature corresponds to a row and each word to a column. The numbers indicate how important a word is to a given feature.
Each feature should correspond to a theme that appears across a set of articles, so if an article covers a new television show, we would expect the feature for that theme to give a high weight to the word "television".
An example feature matrix
Since each row of the matrix is a feature made up of a combination of words, it is clear that by combining the rows of the feature matrix in varying amounts we can reconstruct the article matrix.
How the features are combined, and in what proportions, is controlled by the weights matrix.
【Weights matrix】
This matrix maps the features onto the article matrix: each row corresponds to an article and each column to a feature. The numbers indicate how strongly each feature applies to each article.
An example weights matrix
The figure below shows how the article matrix is reconstructed: multiplying the weights matrix by the feature matrix recovers the article matrix.
By the rules of matrix multiplication, the number of rows of the feature matrix must equal the number of columns of the weights matrix (both are the number of features). If the number of features happened to equal the number of articles, the best possible outcome would be a feature perfectly matched to each article.
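As an illustration, here is a minimal made-up sketch (the numbers are invented purely for illustration) of how multiplying a weights matrix by a feature matrix rebuilds an article-by-word matrix:

import numpy as np

# made-up toy example: 3 articles, 2 features, 4 words
weights = np.array([[1.0, 0.0],     # article 1 is purely feature 1
                    [0.0, 2.0],     # article 2 is feature 2, strongly
                    [1.0, 1.0]])    # article 3 mixes both features

features = np.array([[3.0, 1.0, 0.0, 0.0],   # feature 1: high weight on words 1-2
                     [0.0, 0.0, 2.0, 1.0]])  # feature 2: high weight on words 3-4

# weights (3x2) times features (2x4) reconstructs a 3x4 article matrix
articles = np.dot(weights, features)
print(articles)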
In independent feature extraction, the purpose of matrix factorization is to reduce a large collection of observations (e.g. articles) to a smaller set that still captures the shared features. Ideally, this smaller feature set could be combined with different weights to reconstruct the original dataset perfectly; in practice that is very unlikely, so the goal of the algorithm is to reconstruct the original data as closely as possible.
Author's note:
Here is an intuitive way to picture matrix factorization. Imagine eating a mooncake and breaking it in two: one half is the feature matrix, the other the weights matrix. Neither half existed beforehand; the moment we break the cake, two smaller cakes appear out of nowhere. Can we break it any way we like? No. The cake has its own rules and only permits a limited number of ways to split it, because the two halves must still fit back together into the complete cake.
Back to matrices: factorizing a matrix is a bit like factoring a large number into primes, in that only certain decompositions are admissible. Finding them requires some rather clever algorithms, such as the multiplicative update rules (MUR) introduced in this article.
Our goal is clear: find two matrices (only the shape constraints implied by M and N need to hold) whose product is as close as possible to the original target matrix. We define a loss function, difcost; by searching for the best feature and weights matrices the algorithm tries to reconstruct the article matrix as faithfully as possible, and difcost measures how close the result is to the ideal.
One option is to treat this as a generic optimization problem and look for a satisfactory solution with simulated annealing or a genetic algorithm, but the search cost can become enormous as the original matrix grows. A far more effective approach is the multiplicative update rules.
These rules construct four update matrices; here we call the original article matrix the data matrix.
To update the feature matrix and the weights matrix, the algorithm performs the following operations,
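Written out explicitly, these are the standard Lee-Seung multiplicative updates for the squared-error objective, matching the hn, hd, wn and wd matrices computed in the code below (the product ⊙ and the division are element-wise):

H \leftarrow H \odot \frac{W^{T}V}{W^{T}WH}, \qquad W \leftarrow W \odot \frac{VH^{T}}{WHH^{T}}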
from numpy import *
import numpy as np

def difcost(a,b):
    dif=0
    for i in range(shape(a)[0]):
        for j in range(shape(a)[1]):
            # Euclidean Distance
            dif+=pow(a[i,j]-b[i,j],2)
    return dif

def factorize(v,pc=10,iter=50):
    ic=shape(v)[0]
    fc=shape(v)[1]

    # Initialize the weight and feature matrices with random values
    w=matrix([[random.random() for j in range(pc)] for i in range(ic)])
    h=matrix([[random.random() for i in range(fc)] for i in range(pc)])

    # Perform operation a maximum of iter times
    for i in range(iter):
        wh=w*h

        # Calculate the current difference
        cost=difcost(v,wh)
        if i%10==0:
            print cost

        # Terminate if the matrix has been fully factorized
        if cost==0:
            break

        # Update feature matrix
        hn=(transpose(w)*v)
        hd=(transpose(w)*w*h)
        h=matrix(array(h)*array(hn)/array(hd))

        # Update weights matrix
        wn=(v*transpose(h))
        wd=(w*h*transpose(h))
        w=matrix(array(w)*array(wn)/array(wd))
    return w,h

if __name__ == '__main__':
    l1 = [[1,2,3], [4,5,6]]
    m1 = matrix(l1)
    m2 = matrix([[1,2], [3,4], [5,6]])
    print "np.shape(m1*m2): ", np.shape(m1*m2)

    w, h = factorize(m1*m2, pc=3, iter=100)
    print "w: ", w
    print "h: ", h
    print "w*h: ", w*h
    print "m1*m2: ", m1*m2
As we can see, the algorithm successfully finds a weights matrix and a feature matrix whose product matches the original matrix almost perfectly.
Note that the multiplicative-update approach requires us to state explicitly how many features we want to find: for example, the two voices in a recording, or the day's five biggest news topics.
With the previous two sections as groundwork, let's now turn to some of the theory behind NMF.
NMF (non-negative matrix factorization): given any non-negative matrix V, find a non-negative matrix W and a non-negative matrix H such that V ≈ W*H, thereby decomposing a non-negative matrix into the product of two non-negative matrices (W and H correspond to the weights matrix and the feature matrix of the previous sections), where
the decomposition process is illustrated in the figure below.
There are three common loss functions for measuring how good the factorization is; a simplified version of the first one was already used in the example above.
【squared frobenius norm】
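In sklearn's formulation, the squared-Frobenius objective with combined L1/L2 regularization is roughly of the following form (a sketch; the exact constants may differ between versions):

\min_{W,H \ge 0}\; \frac{1}{2}\lVert V - WH \rVert_{F}^{2} + \alpha\rho\left(\lVert W \rVert_{1} + \lVert H \rVert_{1}\right) + \frac{\alpha(1-\rho)}{2}\left(\lVert W \rVert_{F}^{2} + \lVert H \rVert_{F}^{2}\right)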
where:
α is the L1/L2 regularization strength, ρ is the fraction of the total regularization assigned to the L1 term, and ||·||_1 denotes the L1 norm.
The difcost function from the earlier section is a special case of this loss with α and ρ both equal to 0.
【Kullback-Leibler (KL)】
【Itakura-Saito (IS)】
In fact, the three losses above are special cases of the beta-divergence family (β = 2, 1, 0 respectively), whose general form is:
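For reference, the element-wise divergences are as follows (standard forms; the total loss sums d(V_{ij} | (WH)_{ij}) over all entries, and the KL and IS cases are obtained as the limits β→1 and β→0):

d_{Fro}(x \mid y) = \tfrac{1}{2}(x-y)^{2} \qquad (\beta = 2)

d_{KL}(x \mid y) = x\log\frac{x}{y} - x + y \qquad (\beta = 1)

d_{IS}(x \mid y) = \frac{x}{y} - \log\frac{x}{y} - 1 \qquad (\beta = 0)

d_{\beta}(x \mid y) = \frac{x^{\beta} + (\beta-1)\,y^{\beta} - \beta x y^{\beta-1}}{\beta(\beta-1)}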
Here we won't hand-roll the function; we use sklearn's built-in NMF implementation directly.
# -*- coding: utf-8 -*-

from sklearn.decomposition import NMF
from sklearn.datasets import load_iris

X, _ = load_iris(True)

# can be used for example for dimensionality reduction, source separation or topic extraction
nmf = NMF(n_components=4,          # k value; by default all features are kept
          init=None,               # initialization of W and H: 'random' | 'nndsvd' (default) | 'nndsvda' | 'nndsvdar' | 'custom'
          solver='cd',             # 'cd' | 'mu'
          beta_loss='frobenius',   # {'frobenius', 'kullback-leibler', 'itakura-saito'}; the default is usually fine
          tol=1e-4,                # tolerance of the stopping condition
          max_iter=200,            # maximum number of iterations
          random_state=None,
          alpha=0.,                # regularization parameter
          l1_ratio=0.,             # regularization parameter
          verbose=0,               # verbosity mode
          shuffle=False            # only used by the 'cd' solver
          )

# ----------------- methods ------------------------
print('params:', nmf.get_params())   # constructor parameters; also accessible as attributes, so they are omitted below

nmf.fit(X)
W = nmf.fit_transform(X)
W = nmf.transform(X)
nmf.inverse_transform(W)

# ----------------- attributes ------------------------
H = nmf.components_                                     # the H matrix
print('reconstruction_err_', nmf.reconstruction_err_)   # value of the loss function
print('n_iter_', nmf.n_iter_)                           # actual number of iterations

# ----------------- output ------------------------
# W is the transformed (reduced) data and H holds the 4 components we asked for via n_components
print "H: ", H
print "W: ", W
Let's look at a more concrete image-processing example: using NMF to reduce the dimensionality of images and extract their key elements.
The Olivetti faces dataset contains 400 images, each 64*64 pixels, so the original dimensionality is 4096.
The W matrix produced by the NMF decomposition plays the role of features extracted from the original matrix, so we can use NMF to extract features from the 400 face images.
The number of extracted features is controlled by k; in this experiment we set k=6 and then display the extracted features as images.
# -*- coding: utf-8 -*- from time import time from numpy.random import RandomState import matplotlib.pyplot as plt from sklearn.datasets import fetch_olivetti_faces from sklearn import decomposition n_row, n_col = 2, 3 n_components = n_row * n_col image_shape = (64, 64) rng = RandomState(0) # ############################################################################# # Load faces data dataset = fetch_olivetti_faces('./', True, random_state=rng) faces = dataset.data n_samples, n_features = faces.shape print("Dataset consists of %d faces, features is %s" % (n_samples, n_features)) def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray): plt.figure(figsize=(2. * n_col, 2.26 * n_row)) plt.suptitle(title, size=16) for i, comp in enumerate(images): plt.subplot(n_row, n_col, i + 1) vmax = max(comp.max(), -comp.min()) plt.imshow(comp.reshape(image_shape), cmap=cmap, interpolation='nearest', vmin=-vmax, vmax=vmax) plt.xticks(()) plt.yticks(()) plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.) # ############################################################################# estimators = [ ('Non-negative components - NMF', decomposition.NMF(n_components=n_components, init='nndsvda', tol=5e-3)) ] # ############################################################################# # Plot a sample of the input data plot_gallery("First centered Olivetti faces", faces[:n_components]) # ############################################################################# # Do the estimation and plot it for name, estimator in estimators: print("Extracting the top %d %s..." % (n_components, name)) t0 = time() data = faces estimator.fit(data) train_time = (time() - t0) print("done in %0.3fs" % train_time) components_ = estimator.components_ print('components_:', components_.shape, '\n**\n', components_) plot_gallery('%s - Train time %.1fs' % (name, train_time), components_) plt.show()
The first 6 of the 400 original images (ordinary faces)
The 6 independent, generic features that NMF extracted from the 400 original images
This example gives a very intuitive picture of what the feature matrix produced by NMF means. For face images, the input to NMF is the pixel matrix representing the original images, and the weights matrix W obtained from the factorization represents a scheme for reallocating weight across pixels.
Concretely, each of the 6 feature vectors above has the same dimensionality as the original images (4096), but the pixel values differ from feature to feature, so each represents a different facial trait: one feature focuses on the nose region, so pixels around the nose receive most of the weight, and so on for the others.
In this section let's look at an interesting idea: can stochastic optimization be used directly to obtain the factorization of a matrix? Let's write some experimental code and see:
# -*- coding: utf-8 -*- from numpy import * import numpy as np def difcost(a, b): dif = 0 for i in range(shape(a)[0]): for j in range(shape(a)[1]): # Euclidean Distance dif += pow(a[i, j] - b[i, j], 2) return dif def factorize(v, pc=10, iter=50): ic = shape(v)[0] fc = shape(v)[1] # Initialize the weight and feature matrices with random values w = matrix([[random.random() for j in range(pc)] for i in range(ic)]) h = matrix([[random.random() for i in range(fc)] for i in range(pc)]) # Perform operation a maximum of iter times for i in range(iter): wh = w * h # Calculate the current difference cost = difcost(v, wh) #if i % 10 == 0: # print cost # Terminate if the matrix has been fully factorized if cost == 0: break # Update feature matrix hn = (transpose(w) * v) hd = (transpose(w) * w * h) h = matrix(array(h) * array(hn) / array(hd)) # Update weights matrix wn = (v * transpose(h)) wd = (w * h * transpose(h)) w = matrix(array(w) * array(wn) / array(wd)) return w, h def difcost_2d(a, b): dif = 0 for i in range(shape(a)[0]): for j in range(shape(a)[1]): # Euclidean Distance dif += pow(a[i, j]-b[i, j], 2) return dif def hillclimb_2d(v, pc=2, costf=difcost_2d): ic = shape(v)[0] fc = shape(v)[1] # Initialize the weight and feature matrices with random values w = np.array([[random.random() for j in range(pc)] for i in range(ic)]) h = np.array([[random.random() for i in range(fc)] for i in range(pc)]) # Create a random solution sol_w = [] sol_h = [] for x_axi in v: sol_w.append([random.randint(0, y_axi) for y_axi in x_axi]) sol_h.append([random.randint(0, y_axi) for y_axi in x_axi]) # Main loop for w min_cost, current_cost, last_round_cost = 9999999999, 0, 0 patient_count = 0 while 1: best_tmp_w = [] best_tmp_h = [] for j in range(len(w)): # an tmp copy of x for w tmp_w = w # One away in each direction if sol_w[j][0] > w[j][0]: tmp_w[j][0] = sol_w[j][0] else: tmp_w[j][0] = w[j][0] if sol_w[j][1] > w[j][1]: tmp_w[j][1] = sol_w[j][1] else: tmp_w[j][1] = w[j][1] # record the candidate best_tmp_w.append(tmp_w) # now update h matrix for i in range(len(h)): # an tmp copy of x for h tmp_h = h # One away in each direction if sol_h[i][0] > h[i][0]: tmp_h[i][0] = sol_h[i][0] else: tmp_h[i][0] = h[i][0] if sol_h[i][1] > h[i][1]: tmp_h[i][1] = sol_h[i][1] else: tmp_h[i][1] = h[i][1] # record the candidate best_tmp_h.append(tmp_h) # hill climb one dimension one time # See what the best solution amongst the neighbors is i_w, j_h = 0, 0 for i in range(len(best_tmp_w)): for j in range(len(best_tmp_h)): current_cost = costf(v, best_tmp_w[i] * best_tmp_h[j]) if current_cost <= min_cost: i_w = i j_h = j min_cost = current_cost # update w, h w = best_tmp_w[i_w] h = best_tmp_h[j_h] # If there's no improvement, then we've reached the top if min_cost == last_round_cost and patient_count >= 99: print "min_cost == last_round_cost" break else: last_round_cost = min_cost patient_count += 1 print "patient_count: ", patient_count print "min_cost: ", min_cost print "last_round_cost: ", last_round_cost #print "w: ", w #print "h: ", h return w, h if __name__ == '__main__': m1 = np.array([[1, 2, 3], [4, 5, 6]]) m2 = np.array([[1, 2], [3, 4], [5, 6]]) v = np.dot(m1, m2) print "np.shape(v): ", np.shape(v) print "v: ", v print " " w_optimization, h_optimization = hillclimb_2d(v=v, pc=2, costf=difcost_2d) print "w_optimization: ", w_optimization print "h_optimization: ", h_optimization print " " print "w_optimization * h_optimization: ", w_optimization * h_optimization
The results show that stochastic optimization does not perform well here. The reason the so-called "universal optimizer" fails in this setting is mainly the following:
Conversely, which scenarios are stochastic optimization algorithms actually suited to?
Relevant Link:
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
https://blog.csdn.net/jeffery0207/article/details/84348117
http://latex.91maths.com/
Learning the parts of objects by non-negative matrix factorization, D. D. Lee, H. S. Seung
廣義上來講,PCA是一種基於線性降維的特徵提取方法。
咱們仍是用人臉圖像來闡述這個概念,咱們的目的是利用PCA算法從一堆人臉圖像數據中,找到一組共同特徵,即所謂的大衆臉特徵,
# -*- coding: utf-8 -*- import logging from time import time from numpy.random import RandomState import matplotlib.pyplot as plt from sklearn.datasets import fetch_olivetti_faces from sklearn.cluster import MiniBatchKMeans from sklearn import decomposition # Display progress logs on stdout logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s') n_row, n_col = 2, 3 n_components = n_row * n_col image_shape = (64, 64) rng = RandomState(0) # ############################################################################# # Load faces data dataset = fetch_olivetti_faces(shuffle=True, random_state=rng) faces = dataset.data n_samples, n_features = faces.shape # global centering faces_centered = faces - faces.mean(axis=0) # local centering faces_centered -= faces_centered.mean(axis=1).reshape(n_samples, -1) print("Dataset consists of %d faces" % n_samples) def plot_gallery(title, images, n_col=n_col, n_row=n_row, cmap=plt.cm.gray): plt.figure(figsize=(2. * n_col, 2.26 * n_row)) plt.suptitle(title, size=16) for i, comp in enumerate(images): plt.subplot(n_row, n_col, i + 1) vmax = max(comp.max(), -comp.min()) plt.imshow(comp.reshape(image_shape), cmap=cmap, interpolation='nearest', vmin=-vmax, vmax=vmax) plt.xticks(()) plt.yticks(()) plt.subplots_adjust(0.01, 0.05, 0.99, 0.93, 0.04, 0.) # ############################################################################# # PCA estimators = [ ('Eigenfaces - PCA using randomized SVD', decomposition.PCA(n_components=n_components, svd_solver='randomized', whiten=True), True) ] # ############################################################################# # Plot a sample of the input data plot_gallery("First centered Olivetti faces", faces_centered[:n_components]) # ############################################################################# # Do the estimation and plot it for name, estimator, center in estimators: print("Extracting the top %d %s..." % (n_components, name)) t0 = time() data = faces if center: data = faces_centered estimator.fit(data) train_time = (time() - t0) print("done in %0.3fs" % train_time) # print the components_ components_ = estimator.components_ print('components_:', components_.shape, '\n**\n', components_) if hasattr(estimator, 'cluster_centers_'): components_ = estimator.cluster_centers_ else: components_ = estimator.components_ # Plot an image representing the pixelwise variance provided by the # estimator e.g its noise_variance_ attribute. The Eigenfaces estimator, # via the PCA decomposition, also provides a scalar noise_variance_ # (the mean of pixelwise variance) that cannot be displayed as an image # so we skip it. if (hasattr(estimator, 'noise_variance_') and estimator.noise_variance_.ndim > 0): # Skip the Eigenfaces case plot_gallery("Pixelwise variance", estimator.noise_variance_.reshape(1, -1), n_col=1, n_row=1) plot_gallery('%s - Train time %.1fs' % (name, train_time), components_[:n_components]) plt.show()
The first 6 of the original face images (ordinary faces)
The 6 independent, generic features (average-face features) that PCA extracted from the original images
The principal-component matrix
The results show that although PCA has no prior knowledge about faces whatsoever, the principal-component matrix it produces still roughly traces the outline of a face. In other words, the principal components obtained after PCA dimensionality reduction represent generic features of the original face dataset.
Each row of the principal-component matrix has the same dimensionality as the original images; what differs is how weight is allocated to different pixels, exactly as with NMF.
Author's note:
Strictly speaking, PCA is a dimensionality-reduction algorithm rather than an independent-feature-extraction algorithm, but under certain assumptions it can serve as one.
Relevant Link:
https://www.cnblogs.com/LittleHann/p/6558575.html#_label3_1_4_1
https://scikit-learn.org/stable/auto_examples/decomposition/plot_faces_decomposition.html#sphx-glr-auto-examples-decomposition-plot-faces-decomposition-py
Earlier we discussed the principles and applications of SVD and NMF separately. On the surface both appear to be doing matrix decomposition, so in this chapter let's summarize what they have in common and how they differ.
The columns of the U matrix produced by SVD or NMF are generally called feature vectors, and formally the two kinds of feature vectors differ in an obvious way: SVD's are orthogonal and may contain negative entries, whereas NMF's are non-negative (and generally not orthogonal), which makes them easier to interpret as additive parts.
Overall, in the Frobenius-norm sense SVD gives a globally optimal decomposition, but one whose factors are hard to interpret; NMF requires iterative computation and runs more slowly, but its factors are far more interpretable, and variants with auxiliary constraints can make the factor matrices sparser. NMF's performance also varies noticeably across different types of data.
Relevant Link:
http://xueshu.baidu.com/usercenter/paper/show?paperid=2e99407850fe26f810cd1ddfa556caff&site=xueshu_se
The cocktail-party problem is a classic blind source separation problem.
Suppose two people are talking at the same time in a room where a party is going on, and microphones at two different positions each record a sound signal over time, x1(t) and x2(t). Both recordings are mixtures of what the two people said; how do we recover what each of them said individually?
Writing the source signals as s1(t) and s2(t), the problem can be described by the following equations:
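In the standard formulation, with unknown mixing coefficients a_{ij}, the observations are linear mixtures of the sources (in matrix form, x = As):

x_1(t) = a_{11}\,s_1(t) + a_{12}\,s_2(t)

x_2(t) = a_{21}\,s_1(t) + a_{22}\,s_2(t)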
If s1 and s2 are statistically independent random signals, they can be recovered from x1 and x2. ICA exploits exactly this statistical independence to separate the source signals out of the mixture.
Before discussing concepts like the mixing matrix and the weights in detail, we can intuitively think of a_ij as reflecting the distance between speaker j and microphone i, as in the figure below.
Now suppose the source signals are a sinusoidal signal and a square signal; we mix them into a combined signal, add some noise, and compare how PCA and ICA decompose the result:
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn.decomposition import FastICA, PCA

# #############################################################################
# Generate sample data
np.random.seed(0)
n_samples = 2000
time = np.linspace(0, 8, n_samples)

s1 = np.sin(2 * time)           # Signal 1 : sinusoidal signal
s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal

S = np.c_[s1, s2]
S += 0.2 * np.random.normal(size=S.shape)  # Add noise

S /= S.std(axis=0)  # Standardize data
# Mix data
A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
X = np.dot(S, A.T)  # Generate observations

# print X
print "X: ", X

# Compute ICA
ica = FastICA(n_components=2)
S_ = ica.fit_transform(X)  # Reconstruct signals
A_ = ica.mixing_           # Get estimated mixing matrix

# We can `prove` that the ICA model applies by reverting the unmixing.
assert np.allclose(X, np.dot(S_, A_.T) + ica.mean_)

# For comparison, compute PCA
pca = PCA(n_components=2)
H = pca.fit_transform(X)  # Reconstruct signals based on orthogonal components

# #############################################################################
# Plot results
plt.figure()

models = [X, S, S_, H]
names = ['Observations (mixed signal)',
         'True Sources',
         'ICA recovered signals',
         'PCA recovered signals']
colors = ['red', 'steelblue', 'orange']

for ii, (model, name) in enumerate(zip(models, names), 1):
    plt.subplot(4, 1, ii)
    plt.title(name)
    for sig, color in zip(model.T, colors):
        plt.plot(sig, color=color)

plt.subplots_adjust(0.09, 0.04, 0.94, 0.94, 0.26, 0.46)
plt.show()
As we can see, both ICA and PCA decompose the original mixed signal and recover component source signals.
In the spirit of the EM algorithm, we describe ICA with a latent variables model: the observations x are generated from the latent sources s through the mixing matrix A, i.e. x = As.
The basic assumptions required by the ICA model are, roughly, that the components si are statistically independent of one another and, as argued below, at most one of them may be Gaussian.
The core goal of the ICA algorithm is to estimate the mixing matrix A and the latent sequence s from the observed sequence x.
A precondition for ICA is that the latent components si are mutually independent; for Gaussian variables, uncorrelatedness and independence are equivalent, so let's look at what happens in the Gaussian case.
Assume the mixing matrix is square. If s1 and s2 are both Gaussian, then x1 and x2 are also Gaussian, uncorrelated, and of unit variance. Their joint probability density is:
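This is the standard bivariate unit-variance Gaussian density:

p(s_1, s_2) = \frac{1}{2\pi}\exp\!\left(-\frac{s_1^{2} + s_2^{2}}{2}\right)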
The probability distribution looks like this:
As the figure shows, the distribution is rotationally symmetric, so the directional information is lost, that is, the information about the column vectors of A, and A can no longer be estimated. Matrix decomposition is, at its core, about extracting the directional information carried by a matrix's characteristic vectors.
It can be shown that a standard Gaussian distribution remains the same Gaussian after an orthogonal transformation, with the variables still independent; so if all the sources are Gaussian, we can only determine A up to an orthogonal transformation, never uniquely.
So the key precondition for applying ICA to a dataset is that the underlying source signals are mutually independent and, by the argument above, essentially non-Gaussian (at most one may be Gaussian).
Assume each si has probability density ps; then the joint distribution of the (unknown) source signals at a given instant is:
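Following the usual ICA derivation, with W = A^{-1} the unmixing matrix whose rows are w_i^T, the source and observation densities are:

p(s) = \prod_{i=1}^{n} p_s(s_i), \qquad p(x) = \left(\prod_{i=1}^{n} p_s\!\left(w_i^{T}x\right)\right)\cdot |W|

where |W| denotes the absolute value of the determinant of W.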
Now, to keep solving for W and s (the analogue of the E-step in EM), we must assume a cumulative distribution function F(x) for the si. F(x) cannot be arbitrary; it has to satisfy two properties: it must be a valid CDF, increasing monotonically from 0 to 1, and, per the argument above, it must not be the Gaussian CDF.
It turns out that the sigmoid function is a good fit: its domain is the whole real line, its range is 0 to 1, and it is symmetric (so the corresponding density has zero mean).
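Concretely, taking the sigmoid as the CDF, the density of each source is its derivative:

F(s) = g(s) = \frac{1}{1 + e^{-s}}, \qquad p_s(s) = g'(s) = g(s)\,\big(1 - g(s)\big)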
Now we solve for W (the analogue of the M-step in EM). Given the training samples {x^(i) = (x^(i)_1, x^(i)_2, ..., x^(i)_n); i = 1, ..., m}, the log-likelihood of W is:
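In the standard derivation, this log-likelihood is:

\ell(W) = \sum_{i=1}^{m}\left(\sum_{j=1}^{n}\log g'\!\left(w_j^{T}x^{(i)}\right) + \log|W|\right)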
Differentiating and applying stochastic gradient ascent gives the following update rule:
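A sketch of the standard result, for a single sample x^{(i)} and learning rate α:

W := W + \alpha\left(\begin{bmatrix} 1 - 2g\!\left(w_1^{T}x^{(i)}\right) \\ \vdots \\ 1 - 2g\!\left(w_n^{T}x^{(i)}\right) \end{bmatrix}{x^{(i)}}^{T} + \left(W^{T}\right)^{-1}\right)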
Recall that the core purpose of ICA is to obtain, from the observations, a set of sequences si that are as independent of one another as possible. The quality of the extracted si is therefore judged by how independent they are.
By the central limit theorem, if a random variable X is formed by mixing (summing) many mutually independent random variables Si (i = 1, 2, ..., N), then as long as the Si have finite mean and variance, X is closer to Gaussian than the individual Si, whatever their distributions. During separation we can therefore use the non-Gaussianity of the separated outputs as a measure of their mutual independence: when the non-Gaussianity measure reaches its maximum, the independent components have been fully separated.
Common ways of measuring non-Gaussianity include kurtosis and negentropy:
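Two standard measures, sketched here for reference:

\mathrm{kurt}(y) = E\!\left[y^{4}\right] - 3\left(E\!\left[y^{2}\right]\right)^{2}

J(y) = H(y_{gauss}) - H(y)

Here H denotes differential entropy and y_gauss is a Gaussian variable with the same covariance as y. Kurtosis is zero for a Gaussian, and negentropy J(y) is non-negative and equals zero only when y is Gaussian, so larger values of either indicate stronger non-Gaussianity.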
Relevant Link:
https://blog.csdn.net/u014485485/article/details/78452820
https://danieljyc.github.io/2014/06/13/%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A015-3--%E7%8B%AC%E7%AB%8B%E6%88%90%E5%88%86%E5%88%86%E6%9E%90ica%EF%BC%88independent-component-analysis%EF%BC%89/
https://www.dataivy.cn/blog/%E7%8B%AC%E7%AB%8B%E6%88%90%E5%88%86%E5%88%86%E6%9E%90independent-component-analysis_ica/
https://www.cnblogs.com/Determined22/p/6357291.html
https://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html#sphx-glr-auto-examples-decomposition-plot-ica-blind-source-separation-py
https://www.jianshu.com/p/de396e8cce15
In a nutshell, an autoencoder (AE) is an unsupervised neural-network model that can be used for purposes such as dimensionality reduction and feature extraction; at the same time, the AE can reconstruct the original input data from the new features it has learned, which is called decoding.
The most basic AE architecture, shown in the figure below, consists of an encoding stage and a decoding stage.
Encoding and decoding in an autoencoder
An autoencoder encodes the input x into a new feature representation y, and requires that the original input x can be reconstructed from the new features y. The encoding process is:
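One common form, with W and b the encoder weights and bias and f an element-wise nonlinear activation such as the sigmoid:

y = f(Wx + b)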
As we can see, just as in an ordinary neural network, the encoding is essentially a linear combination followed by a nonlinear activation function.
At the same time, using the new features y we can reconstruct the input x, i.e. the decoding process:
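Again one common form, with W' and b' the decoder parameters and g a nonlinearity:

\hat{x} = g(W'y + b')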
We want the reconstruction x̂ to be as close to x as possible, and we can train the model by minimizing a negative log-likelihood loss:
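For binary (or [0,1]-valued) inputs with a sigmoid output layer, this negative log-likelihood is the per-pixel cross-entropy used in the Keras examples below:

L(x, \hat{x}) = -\sum_{k}\left[x_k\log\hat{x}_k + (1 - x_k)\log(1 - \hat{x}_k)\right]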
In practice we usually impose some constraints on the autoencoder. A common one is to make the decoder weights the transpose of the encoder weights (W' = Wᵀ), known as tied weights; all autoencoders in this article use this constraint. Sometimes we add further constraints, as in denoising and sparse autoencoders, because simply reconstructing the input is rarely interesting in itself: we want the autoencoder to capture more valuable structure in the input while approximately reconstructing it. Word-embedding methods share a similar idea.
In this chapter we introduce several types of autoencoder and look at the ideas behind AEs from different angles.
# -*- coding: utf-8 -*-

from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist
import numpy as np
import matplotlib.pyplot as plt

######################## Build a fully connected encoder and decoder ########################
# this is the size of our encoded representations
encoding_dim = 32  # 32 floats -> compression of factor 24.5, assuming the input is 784 floats

# this is our input placeholder
input_img = Input(shape=(784,))
# "encoded" is the encoded representation of the input
encoded = Dense(encoding_dim, activation='relu')(input_img)
# "decoded" is the lossy reconstruction of the input
decoded = Dense(784, activation='sigmoid')(encoded)
# this model maps an input to its reconstruction
autoencoder = Model(input=input_img, output=decoded)

######################## The encoder and decoder can also be used on their own
# this model maps an input to its encoded representation
encoder = Model(input=input_img, output=encoded)
# create a placeholder for an encoded (32-dimensional) input
encoded_input = Input(shape=(encoding_dim,))
# retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# create the decoder model
decoder = Model(input=encoded_input, output=decoder_layer(encoded_input))

######################## Train the autoencoder to reconstruct MNIST digits, using per-pixel cross-entropy as the loss and adadelta as the optimizer
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

######################## Load MNIST, normalize and flatten it, then train
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print x_train.shape
print x_test.shape

autoencoder.fit(x_train, x_train,
                nb_epoch=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))

######################## Visualize the reconstructed outputs
# encode and decode some digits
# note that we take them from the *test* set
encoded_imgs = encoder.predict(x_test)
decoded_imgs = decoder.predict(encoded_imgs)

n = 10  # how many digits we will display
plt.figure(figsize=(20, 4))
for i in range(1, n):
    # display original
    ax = plt.subplot(2, n, i)
    plt.imshow(x_test[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)

    # display reconstruction
    ax = plt.subplot(2, n, i + n)
    plt.imshow(decoded_imgs[i].reshape(28, 28))
    plt.gray()
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
The AE's decoded outputs look like this:
As we can see, the fully connected network successfully extracts effective 32-dimensional features from the 784-dimensional image matrix and, from those 32 features, reconstructs the original input images almost losslessly. This example also gives a feel for what interpretability can mean in deep learning.
In the previous section the hidden layer had 32 units; in that setting the autoencoder generally learns an approximation of PCA. If instead we impose a sparsity (regularization) constraint on the hidden units, we obtain a more compact representation in which only a small fraction of the neurons are activated. In Keras, we can constrain a layer's activations by adding an activity_regularizer (the example below applies an L1 regularizer to the encoding layer):
# -*- coding: utf-8 -*- from keras.layers import Input, Dense from keras.models import Model from keras.datasets import mnist from keras import regularizers import numpy as np import matplotlib.pyplot as plt ######################## 先創建一個全鏈接的編碼器和解碼器 ######################## # this is the size of our encoded representations encoding_dim = 64 # 64 floats -> compression of factor 24.5, assuming the input is 784 floats input_img = Input(shape=(784,)) # add a Dense layer with a L1 activity regularizer encoded = Dense(encoding_dim, activation='relu', kernel_regularizer=regularizers.l1(10e-5))(input_img) decoded = Dense(784, activation='sigmoid')(encoded) autoencoder = Model(input=input_img, output=decoded) ######################## 能夠單獨的使用編碼器和解碼器 # this model maps an input to its encoded representation encoder = Model(input=input_img, output=encoded) # create a placeholder for an encoded (32-dimensional) input encoded_input = Input(shape=(encoding_dim,)) # retrieve the last layer of the autoencoder model decoder_layer = autoencoder.layers[-1] # create the decoder model decoder = Model(input=encoded_input, output=decoder_layer(encoded_input)) ######################## 訓練自編碼器,來重構MNIST中的數字,這裏使用逐像素的交叉熵做爲損失函數,優化器爲adam autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') ######################## 準備MNIST數據,將其歸一化和向量化,而後訓練 (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) print x_train.shape print x_test.shape autoencoder.fit(x_train, x_train, epochs=100, batch_size=256, shuffle=True, validation_data=(x_test, x_test)) ######################## 可視化一下重構出來的輸出 # encode and decode some digits # note that we take them from the *test* set encoded_imgs = encoder.predict(x_test) decoded_imgs = decoder.predict(encoded_imgs) n = 10 # how many digits we will display plt.figure(figsize=(20, 4)) for i in range(1, n): # display original ax = plt.subplot(2, n, i) plt.imshow(x_test[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) # display reconstruction ax = plt.subplot(2, n, i + n) plt.imshow(decoded_imgs[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
Hidden layer set to 32 dimensions
Hidden layer set to 64 dimensions
As we can see, the higher the dimensionality of the hidden layer, the more recognizable the images decoded by the AE are. Overall, though, once the sparsity constraint is added, the learned codes (the hidden-layer feature vectors) become noticeably sparser.
With the hidden layer kept at 32 dimensions, the sparse autoencoder's mean code activation over the 10,000 test images is 3.33, compared with 7.30 before.
We can also stack several autoencoders on top of one another; essentially this raises the capacity of the deep network and improves the efficiency of feature extraction.
# -*- coding: utf-8 -*- from keras.layers import Input, Dense from keras.models import Model from keras.datasets import mnist from keras import regularizers import numpy as np import matplotlib.pyplot as plt ######################## 先創建一個全鏈接的編碼器和解碼器 ######################## # this is the size of our encoded representations encoding_dim = 32 # 64 floats -> compression of factor 24.5, assuming the input is 784 floats input_img = Input(shape=(784,)) encoded = Dense(128, activation='relu')(input_img) encoded = Dense(64, activation='relu')(encoded) encoded = Dense(32, activation='relu')(encoded) decoded = Dense(64, activation='relu')(encoded) decoded = Dense(128, activation='relu')(decoded) decoded = Dense(784, activation='sigmoid')(decoded) autoencoder = Model(input=input_img, output=decoded) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') ######################## 訓練自編碼器,來重構MNIST中的數字,這裏使用逐像素的交叉熵做爲損失函數,優化器爲adam autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') ######################## 準備MNIST數據,將其歸一化和向量化,而後訓練 (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) print x_train.shape print x_test.shape autoencoder.fit(x_train, x_train, epochs=100, batch_size=128, shuffle=True, validation_data=(x_test, x_test)) ######################## 可視化一下重構出來的輸出 # encode and decode some digits # note that we take them from the *test* set decoded_imgs = autoencoder.predict(x_test) n = 10 # how many digits we will display plt.figure(figsize=(20, 4)) for i in range(1, n): # display original ax = plt.subplot(2, n, i) plt.imshow(x_test[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) # display reconstruction ax = plt.subplot(2, n, i + n) plt.imshow(decoded_imgs[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
The results are somewhat better than the plain AE; of course, training time increases as well.
When the inputs are images, a convolutional network is generally the better choice; in engineering practice, autoencoders used for image processing are almost always convolutional autoencoders.
# -*- coding: utf-8 -*- from keras.layers import Input, Dense from keras.models import Model from keras.datasets import mnist import numpy as np from keras.layers import Input, Dense, Convolution2D, MaxPooling2D, UpSampling2D from keras.models import Model import matplotlib.pyplot as plt ######################## 先創建一個編碼器和解碼器 ######################## input_img = Input(shape=(1, 28, 28)) x = Convolution2D(16, 3, 3, activation='relu', border_mode='same')(input_img) x = MaxPooling2D((2, 2), border_mode='same')(x) x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(x) x = MaxPooling2D((2, 2), border_mode='same')(x) x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(x) encoded = MaxPooling2D((2, 2), border_mode='same')(x) # at this point the representation is (8, 4, 4) i.e. 128-dimensional x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(encoded) x = UpSampling2D((2, 2))(x) x = Convolution2D(8, 3, 3, activation='relu', border_mode='same')(x) x = UpSampling2D((2, 2))(x) x = Convolution2D(16, 3, 3, activation='relu')(x) x = UpSampling2D((2, 2))(x) decoded = Convolution2D(1, 3, 3, activation='sigmoid', border_mode='same')(x) autoencoder = Model(input_img, decoded) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') ######################## 訓練自編碼器,來重構MNIST中的數字,這裏使用逐像素的交叉熵做爲損失函數,優化器爲adam autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') ######################## 準備MNIST數據,將其歸一化和向量化,而後訓練 (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = np.reshape(x_train, (len(x_train), 1, 28, 28)) x_test = np.reshape(x_test, (len(x_test), 1, 28, 28)) print x_train.shape print x_test.shape #print "x_test: ", x_test autoencoder.fit(x_train, x_train, nb_epoch=100, batch_size=128, shuffle=True, validation_data=(x_test, x_test) ) ######################## 可視化一下重構出來的輸出 # encode and decode some digits # note that we take them from the *test* set decoded_imgs = autoencoder.predict(x_test) n = 10 # how many digits we will display plt.figure(figsize=(20, 4)) for i in range(1, n): # display original ax = plt.subplot(2, n, i) plt.imshow(x_test[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) # display reconstruction ax = plt.subplot(2, n, i + n) plt.imshow(decoded_imgs[i].reshape(28, 28)) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
In this section we don't introduce a new kind of AE; instead we look at what the intermediate codes actually look like, reshaping each code to 4*32 for display:
# -*- coding: utf-8 -*- from keras.layers import Input, Dense from keras.models import Model from keras.datasets import mnist from keras import regularizers import numpy as np import matplotlib.pyplot as plt ######################## 先創建一個編碼器和解碼器 ######################## # this is the size of our encoded representations encoding_dim = 128 # 64 floats -> compression of factor 24.5, assuming the input is 784 floats input_img = Input(shape=(784,)) # add a Dense layer with a L1 activity regularizer encoded = Dense(encoding_dim, activation='relu', kernel_regularizer=regularizers.l1(10e-5))(input_img) decoded = Dense(784, activation='sigmoid')(encoded) autoencoder = Model(input=input_img, output=decoded) ######################## 能夠單獨的使用編碼器和解碼器 # this model maps an input to its encoded representation encoder = Model(input=input_img, output=encoded) # create a placeholder for an encoded (32-dimensional) input encoded_input = Input(shape=(encoding_dim,)) # retrieve the last layer of the autoencoder model decoder_layer = autoencoder.layers[-1] # create the decoder model decoder = Model(input=encoded_input, output=decoder_layer(encoded_input)) ######################## 訓練自編碼器,來重構MNIST中的數字,這裏使用逐像素的交叉熵做爲損失函數,優化器爲adam autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') ######################## 準備MNIST數據,將其歸一化和向量化,而後訓練 (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) print x_train.shape print x_test.shape autoencoder.fit(x_train, x_train, epochs=100, batch_size=256, shuffle=True, validation_data=(x_test, x_test)) ######################## 可視化一下中間碼字 # encode and decode some digits # note that we take them from the *test* set encoded_imgs = encoder.predict(x_test) decoded_imgs = decoder.predict(encoded_imgs) n = 10 # how many digits we will display plt.figure(figsize=(20, 8)) for i in range(1,n): ax = plt.subplot(1, n, i) plt.imshow(encoded_imgs[i].reshape(4, 4 * 8).T) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
Viewed in terms of network structure, an AE is essentially a "matrix-to-matrix" filter, a structure we also see in many neural translation models. Using this filter-like structure, an AE can be applied to image denoising.
We corrupt the training samples with noise and train the decoder to produce clean images, which gives a denoising autoencoder. First we add Gaussian noise to the original images and then clip the pixel values to the 0~1 range.
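A minimal sketch of the corruption step (noise_factor is an assumed value; x_train and x_test are the flattened MNIST arrays from the earlier snippets, and the denoising autoencoder itself is the same model trained on noisy inputs against clean targets):

import numpy as np

noise_factor = 0.5  # assumed noise strength
x_train_noisy = x_train + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_train.shape)
x_test_noisy = x_test + noise_factor * np.random.normal(loc=0.0, scale=1.0, size=x_test.shape)

# clip pixel values back into the valid [0, 1] range
x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)

# the denoising autoencoder then maps noisy inputs to clean targets, e.g.:
# autoencoder.fit(x_train_noisy, x_train, ..., validation_data=(x_test_noisy, x_test))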
As we can see, the denoising works quite well; the AE acts as a filter.
Author's note:
In real engineering projects, noise in the dataset is a very common problem, for example:
Of course, the algorithms themselves offer some mitigations against noise (for example the soft margin in SVMs), but they still cannot fully avoid the overfitting and underfitting that noisy data causes. A practical approach is to first denoise the raw data with an AE and then feed it into the machine-learning model (for example a decision tree); this usually gives better model performance.
We have seen that an AE is a neural network for independent feature extraction; in the previous sections the code (feature vector) was read directly off the hidden layer, in a single step. In this section we make a small modification so that the encoder instead learns a latent variable model, i.e. a probability distribution, of the input data. A latent variable model is a statistical model linking observed variables to latent variables; it assumes that the observed variables are governed by the states of the latent variables and are conditionally independent of one another.
Put simply, a variational autoencoder learns the parameters of a probability distribution over the data (the encoding stage); by sampling from that distribution, you can generate new input data (the decoding stage). In other words, the variational autoencoder is a generative model.
Briefly, a VAE works as follows: the encoder maps each input to the parameters (a mean and a log-variance) of a latent Gaussian; a latent point z is sampled from that distribution via the reparameterization trick; and the decoder maps z back to the input space to reconstruct the sample.
The parameters are trained with two loss terms: a reconstruction loss that forces the decoded samples to match the inputs, and a KL divergence between the learned latent distribution and the prior, which acts as a regularizer:
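In formula form (matching the reconstruction_loss and kl_loss terms computed in the code below, with μ and σ² the encoder's outputs):

\mathcal{L} = \underbrace{-\,E_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{reconstruction loss}} + \underbrace{D_{KL}\big(q(z \mid x)\,\Vert\,\mathcal{N}(0, I)\big)}_{\text{regularizer}}, \qquad D_{KL} = -\frac{1}{2}\sum_{j}\left(1 + \log\sigma_j^{2} - \mu_j^{2} - \sigma_j^{2}\right)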
# -*- coding: utf-8 -*- from __future__ import absolute_import from __future__ import division from __future__ import print_function from keras.layers import Lambda, Input, Dense from keras.models import Model from keras.datasets import mnist from keras.losses import mse, binary_crossentropy from keras.utils import plot_model from keras import backend as K import numpy as np import matplotlib.pyplot as plt import argparse import os # reparameterization trick # instead of sampling from Q(z|X), sample epsilon = N(0,I) # z = z_mean + sqrt(var) * epsilon def sampling(args): """Reparameterization trick by sampling from an isotropic unit Gaussian. # Arguments args (tensor): mean and log of variance of Q(z|X) # Returns z (tensor): sampled latent vector """ z_mean, z_log_var = args batch = K.shape(z_mean)[0] dim = K.int_shape(z_mean)[1] # by default, random_normal has mean = 0 and std = 1.0 epsilon = K.random_normal(shape=(batch, dim)) return z_mean + K.exp(0.5 * z_log_var) * epsilon def plot_results(models, data, batch_size=128, model_name="vae_mnist"): """Plots labels and MNIST digits as a function of the 2D latent vector # Arguments models (tuple): encoder and decoder models data (tuple): test data and label batch_size (int): prediction batch size model_name (string): which model is using this function """ encoder, decoder = models x_test, y_test = data #os.makedirs(model_name, exist_ok=True) filename = os.path.join(model_name, "vae_mean.png") # display a 2D plot of the digit classes in the latent space z_mean, _, _ = encoder.predict(x_test, batch_size=batch_size) plt.figure(figsize=(12, 10)) plt.scatter(z_mean[:, 0], z_mean[:, 1], c=y_test) plt.colorbar() plt.xlabel("z[0]") plt.ylabel("z[1]") plt.savefig(filename) plt.show() filename = os.path.join(model_name, "digits_over_latent.png") # display a 30x30 2D manifold of digits n = 30 digit_size = 28 figure = np.zeros((digit_size * n, digit_size * n)) # linearly spaced coordinates corresponding to the 2D plot # of digit classes in the latent space grid_x = np.linspace(-4, 4, n) grid_y = np.linspace(-4, 4, n)[::-1] for i, yi in enumerate(grid_y): for j, xi in enumerate(grid_x): z_sample = np.array([[xi, yi]]) x_decoded = decoder.predict(z_sample) digit = x_decoded[0].reshape(digit_size, digit_size) figure[i * digit_size: (i + 1) * digit_size, j * digit_size: (j + 1) * digit_size] = digit plt.figure(figsize=(10, 10)) start_range = digit_size // 2 end_range = (n - 1) * digit_size + start_range + 1 pixel_range = np.arange(start_range, end_range, digit_size) sample_range_x = np.round(grid_x, 1) sample_range_y = np.round(grid_y, 1) plt.xticks(pixel_range, sample_range_x) plt.yticks(pixel_range, sample_range_y) plt.xlabel("z[0]") plt.ylabel("z[1]") plt.imshow(figure, cmap='Greys_r') plt.savefig(filename) plt.show() # MNIST dataset # 使用MNIST庫來訓練變分編碼器 (x_train, y_train), (x_test, y_test) = mnist.load_data() image_size = x_train.shape[1] original_dim = image_size * image_size x_train = np.reshape(x_train, [-1, original_dim]) x_test = np.reshape(x_test, [-1, original_dim]) x_train = x_train.astype('float32') / 255 x_test = x_test.astype('float32') / 255 # network parameters input_shape = (original_dim, ) intermediate_dim = 512 batch_size = 128 latent_dim = 2 epochs = 50 # VAE model = encoder + decoder # build encoder model # 創建編碼網絡,將輸入影射爲隱分佈的參數: inputs = Input(shape=input_shape, name='encoder_input') x = Dense(intermediate_dim, activation='relu')(inputs) z_mean = Dense(latent_dim, name='z_mean')(x) z_log_var = Dense(latent_dim, name='z_log_var')(x) # use 
reparameterization trick to push the sampling out as input # note that "output_shape" isn't necessary with the TensorFlow backend # 從這些參數肯定的分佈中採樣,這個樣本至關於以前的隱層值 z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var]) # instantiate encoder model encoder = Model(inputs, [z_mean, z_log_var, z], name='encoder') encoder.summary() plot_model(encoder, to_file='vae_mlp_encoder.png', show_shapes=True) # build decoder model # 將採樣獲得的點映射回去重構原輸入 latent_inputs = Input(shape=(latent_dim,), name='z_sampling') x = Dense(intermediate_dim, activation='relu')(latent_inputs) outputs = Dense(original_dim, activation='sigmoid')(x) # instantiate decoder model decoder = Model(latent_inputs, outputs, name='decoder') decoder.summary() plot_model(decoder, to_file='vae_mlp_decoder.png', show_shapes=True) # instantiate VAE model # 到目前爲止咱們作的工做須要實例化三個模型: # 1. 一個端到端的自動編碼器,用於完成輸入信號的重構 # 2. 一個用於將輸入空間映射爲隱空間的編碼器 # 3. 一個利用隱空間的分佈產生的樣本點生成對應的重構樣本的生成器 outputs = decoder(encoder(inputs)[2]) vae = Model(inputs, outputs, name='vae_mlp') if __name__ == '__main__': parser = argparse.ArgumentParser() help_ = "Load h5 model trained weights" parser.add_argument("-w", "--weights", help=help_) help_ = "Use mse loss instead of binary cross entropy (default)" parser.add_argument("-m", "--mse", help=help_, action='store_true') args = parser.parse_args() models = (encoder, decoder) data = (x_test, y_test) # VAE loss = mse_loss or xent_loss + kl_loss if args.mse: reconstruction_loss = mse(inputs, outputs) else: reconstruction_loss = binary_crossentropy(inputs, outputs) reconstruction_loss *= original_dim kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var) kl_loss = K.sum(kl_loss, axis=-1) kl_loss *= -0.5 vae_loss = K.mean(reconstruction_loss + kl_loss) vae.add_loss(vae_loss) vae.compile(optimizer='adam') vae.summary() plot_model(vae, to_file='vae_mlp.png', show_shapes=True) if args.weights: vae.load_weights(args.weights) else: # train the autoencoder vae.fit(x_train, epochs=epochs, batch_size=batch_size, validation_data=(x_test, None)) vae.save_weights('vae_mlp_mnist.h5') plot_results(models, data, batch_size=batch_size, model_name="vae_mlp") x_test_encoded = encoder.predict(x_test, batch_size=batch_size) plt.figure(figsize=(6, 6)) plt.scatter(x_test_encoded[:, 0], x_test_encoded[:, 1], c=y_test) plt.colorbar() plt.show() # display a 2D manifold of the digits n = 15 # figure with 15x15 digits digit_size = 28 figure = np.zeros((digit_size * n, digit_size * n)) # we will sample n points within [-15, 15] standard deviations grid_x = np.linspace(-15, 15, n) grid_y = np.linspace(-15, 15, n) for i, yi in enumerate(grid_x): for j, xi in enumerate(grid_y): z_sample = np.array([[xi, yi]]) * [1.0, 1.0] x_decoded = decoder.predict(z_sample) digit = x_decoded[0].reshape(digit_size, digit_size) figure[i * digit_size: (i + 1) * digit_size, j * digit_size: (j + 1) * digit_size] = digit plt.figure(figsize=(10, 10)) plt.imshow(figure) plt.show()
Encoder network architecture
Decoder network architecture
Overall VAE network architecture
Because our latent space has only two dimensions, we can visualize it. Let's look at how the different classes are distributed as neighborhoods in the 2D plane:
In the figure above each color represents one digit; digits that cluster near one another are structurally similar.
Because the variational autoencoder is a generative model, we can use it to generate new digits: we sample points from the latent plane and generate the corresponding observed variables, i.e. MNIST digits:
As we can see, this matches the 2D distribution above: different digits occupy different regions of the probability distribution.
Author's note:
This experiment gives a deeper sense of what interpretability means for deep neural networks. Roughly speaking, a deep network's powerful fitting ability comes from its high-dimensional nonlinearity, while regularization pushes the weight carried by the hidden codes to concentrate on a few units. This lets the network fit both complex and simple problems; its overall strategy is "match as precisely as possible, but just enough is fine". So on MNIST an AE needs only a small number of hidden code dimensions to capture the main features of the handwritten digits, whereas a more complex problem would require correspondingly more code dimensions for an effective encoding.
Related ideas are discussed in detail by Dr. Yuandong Tian in the paper "Luck Matters: Understanding Training Dynamics of Deep ReLU Networks"; it is an excellent paper and I recommend it to readers.
Relevant Link:
https://blog.csdn.net/a819825294/article/details/53516980
https://zhuanlan.zhihu.com/p/31742653
https://zhuanlan.zhihu.com/p/67782029
https://github.com/keras-team/keras/blob/master/examples/variational_autoencoder.py
https://keras-cn.readthedocs.io/en/latest/legacy/blog/autoencoder/
In this chapter we run a fun experiment: using image processing as a concrete problem, we compare how AE and PCA perform at dimensionality reduction and independent feature extraction.
From a linear-algebra point of view, the two techniques seem to arrive at similar places by different roads; let's test that intuition experimentally.
We call the PCA and AE APIs to reduce the MNIST handwritten-digit test set to 128 dimensions, and look at the first reduced representations as images:
# -*- coding: utf-8 -*- import numpy as np import pandas as pd from pandas import Series,DataFrame import matplotlib.pyplot as plt from sklearn.decomposition import PCA from keras.datasets import mnist from keras.layers import Input, Dense from keras.models import Model from keras import regularizers if __name__ == '__main__': # load mnist data (x_train, _), (x_test, _) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) print x_train.shape print x_test.shape # 1. PCA: dimensionality reduction into 128 dim pca = PCA(n_components=128, whiten=True) pca.fit(x_train, x_train) X_train_pca = pca.transform(x_train) x_test_pca = pca.transform(x_test) print "x_test_pca: ", x_test_pca print "np.shape(x_test_pca): ", np.shape(x_test_pca) # show top100 sample which is been pca reduction samples = x_test_pca[:8] plt.figure(figsize=(12, 18)) for i in range(len(samples)): plt.subplot(10, 10, i + 1) plt.imshow(samples[i].reshape(32, 4), cmap='gray') plt.axis('off') plt.show() # 2. AE: dimensionality reduction into 128 dim encoding_dim = 128 input_img = Input(shape=(784,)) encoded = Dense(encoding_dim, activation='relu')(input_img) decoded = Dense(784, activation='sigmoid')(encoded) autoencoder = Model(input=input_img, output=decoded) encoder = Model(input=input_img, output=encoded) encoded_input = Input(shape=(encoding_dim,)) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') autoencoder.fit(x_train, x_train, epochs=100, batch_size=256, shuffle=True, validation_data=(x_test, x_test)) encoded_imgs = encoder.predict(x_test) n = 10 # how many digits we will display plt.figure(figsize=(20, 8)) for i in range(1, n): ax = plt.subplot(1, n, i) plt.imshow(encoded_imgs[i].reshape(4, 32).T) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show()
PCA dimensionality-reduction results
AE dimensionality-reduction results
To the naked eye, the two techniques give similar results. How, then, do we evaluate the two schemes quantitatively (in terms of fidelity and recoverability)? We can feed the reduced datasets into a supervised learning algorithm (for example an SVC) and use the supervised model's predictive performance to judge the upstream dimensionality-reduction algorithm.
# -*- coding: utf-8 -*- import numpy as np import pandas as pd from pandas import Series,DataFrame import matplotlib.pyplot as plt from sklearn.decomposition import PCA from keras.datasets import mnist from keras.layers import Input, Dense from keras.models import Model from keras import regularizers from sklearn.svm import SVC from sklearn.metrics import accuracy_score if __name__ == '__main__': # load mnist data (x_train, y_train), (x_test, y_test) = mnist.load_data() x_train = x_train.astype('float32') / 255. x_test = x_test.astype('float32') / 255. x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:]))) x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:]))) print x_train.shape print "y_train: ", y_train print x_test.shape print "y_test: ", y_test # 1. PCA: dimensionality reduction into 128 dim pca = PCA(n_components=128, whiten=True) pca.fit(x_train, y_train) X_train_pca = pca.transform(x_train) x_test_pca = pca.transform(x_test) print "x_test_pca: ", x_test_pca print "np.shape(x_test_pca): ", np.shape(x_test_pca) # show top100 sample which is been pca reduction samples = x_test_pca[:8] plt.figure(figsize=(12, 18)) for i in range(len(samples)): plt.subplot(10, 10, i + 1) plt.imshow(samples[i].reshape(32, 4), cmap='gray') plt.axis('off') plt.show() # crate supervised learning model svc = SVC(kernel='rbf') svc.fit(X_train_pca, y_train) print 'accuracy for PCA: {0}'.format(accuracy_score(y_test, svc.predict(x_test_pca))) # 2. AE: dimensionality reduction into 128 dim encoding_dim = 128 input_img = Input(shape=(784,)) encoded = Dense(encoding_dim, activation='relu')(input_img) decoded = Dense(784, activation='sigmoid')(encoded) autoencoder = Model(input=input_img, output=decoded) encoder = Model(input=input_img, output=encoded) encoded_input = Input(shape=(encoding_dim,)) autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy') print "x_train: ", x_train autoencoder.fit(x_train, x_train, epochs=100, batch_size=256, shuffle=True, validation_data=(x_test, x_test)) X_train_ae = encoder.predict(x_train) x_test_ae = encoder.predict(x_test) n = 10 # how many digits we will display plt.figure(figsize=(20, 8)) for i in range(1, n): ax = plt.subplot(1, n, i) plt.imshow(X_train_ae[i].reshape(4, 32).T) plt.gray() ax.get_xaxis().set_visible(False) ax.get_yaxis().set_visible(False) plt.show() # crate supervised learning model svc = SVC(kernel='rbf') svc.fit(X_train_ae, y_train) print 'accuracy for AE: {0}'.format(accuracy_score(y_test, svc.predict(x_test_ae)))
We train an SVC on the PCA-reduced and the AE-reduced training sets respectively and predict on the correspondingly reduced test sets (the dimensionality-reduction algorithm here acts as a preprocessing step); the results are as follows:
The experiment shows that AE and PCA are roughly on par, with neither enjoying a qualitative advantage. Readers can also rerun the experiment with a DAE; the results are much the same, with a slight improvement from the DAE but again no qualitative leap.
The takeaway is that we need not blindly worship deep learning. It may be the hottest research frontier right now, but classical machine-learning algorithms with solid theoretical foundations remain very much alive and are no less effective. Focus on learning the underlying principles rather than chasing the newest thing.
The independent-feature-extraction algorithms introduced in this article are general-purpose methods, largely independent of any particular domain. In areas such as text dimensionality reduction, semantic understanding of text, topic extraction, image processing and audio signal processing there are many derived independent-feature-extraction algorithms, for example:
Because this chapter touches on work-related content and results, I only sketch the ideas here; if you are interested in the detailed experiments and results, feel free to email or message me and we can discuss offline.
The blacksamples directory in the code is a set of webshell (malicious) samples I prepared myself; readers will need to prepare their own.
# -*- coding: utf-8 -*- from time import time from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer from sklearn.decomposition import NMF, LatentDirichletAllocation from sklearn.datasets import fetch_20newsgroups from sklearn.externals import joblib import os import pickle import pandas import numpy as np n_samples = 2000 n_features = 1000 n_components = 10 n_top_words = 20 def print_top_words(model, feature_names, n_top_words): for topic_idx, topic in enumerate(model.components_): message = "Topic #%d: " % topic_idx message += " ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]) print(message) print() def load_webshell_apidata(): op_dict = { 'apicalls': [], 'path': [] } rootDir = "./blacksamples" for lists in os.listdir(rootDir): if lists == '.DS_Store': continue pickle_path = os.path.join(rootDir, lists, "vector.pickle") if os.path.exists(pickle_path): op = pickle.load(open(pickle_path, 'rb')) op_dict['apicalls'].append(' '.join(op['apicalls'])) # add as a apicall string "api api" op_dict['path'].append(op['path']) df = pandas.DataFrame(data=op_dict) return df print("Loading dataset...") t0 = time() dataset_apiifo = load_webshell_apidata() dataset = dataset_apiifo['apicalls'].values.tolist() dataset_path = dataset_apiifo['path'].values.tolist() #print "dataset: ", dataset data_samples = dataset[:n_samples] print "data_samples[:10]: ", data_samples[:10] print("done in %0.3fs." % (time() - t0)) print(" ") # Use tf-idf features for NMF. print("Extracting tf-idf features for NMF...") tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english') t0 = time() tfidf = tfidf_vectorizer.fit_transform(data_samples) print("done in %0.3fs." % (time() - t0)) print(" ") # Use tf (raw term count) features for LDA. print("Extracting tf features for LDA...") tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english') t0 = time() tf = tf_vectorizer.fit_transform(data_samples) print("done in %0.3fs." % (time() - t0)) print(" ") # Fit the NMF model print("Fitting the NMF model (Frobenius norm) with tf-idf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features)) t0 = time() nmf = NMF(n_components=n_components, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf) print("done in %0.3fs." % (time() - t0)) print(" ") print("\nTopics in NMF model (Frobenius norm):") tfidf_feature_names = tfidf_vectorizer.get_feature_names() print_top_words(nmf, tfidf_feature_names, n_top_words) # Fit the NMF model print("Fitting the NMF model (generalized Kullback-Leibler divergence) with " "tf-idf features, n_samples=%d and n_features=%d..." % (n_samples, n_features)) t0 = time() nmf = NMF(n_components=n_components, random_state=1, beta_loss='kullback-leibler', solver='mu', max_iter=1000, alpha=.1, l1_ratio=.5).fit(tfidf) print("done in %0.3fs." % (time() - t0)) print(" ") print("\nTopics in NMF model (generalized Kullback-Leibler divergence):") tfidf_feature_names = tfidf_vectorizer.get_feature_names() print_top_words(nmf, tfidf_feature_names, n_top_words) print("Fitting LDA models with tf features, " "n_samples=%d and n_features=%d..." % (n_samples, n_features)) lda = LatentDirichletAllocation(n_components=n_components, max_iter=5, learning_method='online', learning_offset=50., random_state=0) t0 = time() lda.fit(tf) print("done in %0.3fs." 
% (time() - t0)) print(" ") print("\nTopics in LDA model:") tf_feature_names = tf_vectorizer.get_feature_names() print_top_words(lda, tf_feature_names, n_top_words)
The experimental results are as follows:
Topics in NMF model (Frobenius norm): Topic #0: stristr eval preg_match headers_sent error_reporting base64_decode gzinflate mail isref explode include defined require trim getcwd getenv getdir getfakedownloadpath get_info getdoccomment Topic #1: stripos eval gzuncompress setcookie time base64_decode header require_once dirname defined gzinflate require mail explode trim str_ireplace file_exists ini_get rtrim ltrim Topic #2: function_exists eval create_function preg_match gzdecode ob_start header defined error_reporting base64_decode gzinflate mail __lambda_func explode require_once block_bot chr set_time_limit strrev strtr Topic #3: base64_encode eval ini_set error_reporting stream_context_create implode strpos md5 base64_decode file_get_contents gzinflate mail preg_match explode fwrite getfilename fileperms filesize getfakedownloadpath getextension Topic #4: strlen __lambda_func str_repeat ceil str_replace create_function base64_decode xor_data__mut rtrim ltrim str_split trim mt_rand assert getcwd getextension getenv getdir getfakedownloadpath getfilename Topic #5: file_get_contents base64_decode eval mail gzinflate explode date_default_timezone_set strrev set_time_limit trim assert strtr getcwd getdir ymsumrlfso get_magic_quotes_gpc getenv getextension getfakedownloadpath getfilename Topic #6: ord chr strlen array_map eval preg_replace base64_decode range krzbnlqhvt getcwd getextension getenv getdoccomment getfakedownloadpath getfilename gethostbyname getdir get_current_user get_magic_quotes_gpc get_info Topic #7: define eval gzinflate __lambda_func set_time_limit get_magic_quotes_gpc islogin md5 header base64_decode mb_ereg_replace loginpage ob_start error_reporting dirname file_exists isref gethostbyname set_magic_quotes_runtime version_compare Topic #8: substr ltrim trim rtrim sizeof qfyuecrkyr ymsumrlfso str_replace str_split assert md5 stripos strlen wzgygjawpn __lambda_func error_reporting array_flip preg_replace strrpos get_info Topic #9: ini_set strlen set_time_limit unserialize xor_data__mut set_magic_quotes_runtime session_start printlogin md5 base64_decode wsologin implode get_magic_quotes_gpc assert clearstatcache defined ignore_user_abort error_reporting eval getdir () Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000... done in 0.111s. 
Topics in NMF model (generalized Kullback-Leibler divergence): Topic #0: eval error_reporting stristr preg_match defined gzuncompress base64_decode define require require_once headers_sent ini_get readdir opendir basename function_exists include header time create_function Topic #1: stripos gzuncompress setcookie time require_once set_error_handler dirname file_exists str_ireplace array_map version_compare unlink chdir gmdate ymsumrlfso get_magic_quotes_gpc getcwd getdir getdoccomment getenv Topic #2: create_function function_exists header gzdecode defined ob_start preg_match base64_decode __lambda_func chdir error_reporting pack xor_data__mut getfilename getfakedownloadpath gethostbyname getextension getenv getdoccomment ymsumrlfso Topic #3: stream_context_create base64_encode file_get_contents ini_set strpos md5 implode error_reporting function_exists eval gethostbyname getfilename getcwd getfakedownloadpath getinfo getextension getenv getdoccomment getdir ymsumrlfso Topic #4: str_replace strlen ceil str_repeat base64_decode __lambda_func substr string_n_random stream_context_create strrpos explode file_get_contents reads file_read mt_rand session_start get_magic_quotes_gpc version_compare md5 extension_loaded Topic #5: base64_decode file_get_contents strrev gzinflate mail strtr explode str_rot13 header set_time_limit date_default_timezone_set php_uname eval get_current_user __construct in_array htmlspecialchars html_n file_exists extension_loaded Topic #6: strlen unserialize ord filesize chr fclose is_file gzinflate base64_decode fopen ini_set range set_time_limit xor_data__mut fread mt_rand strstr function_exists explode readdir Topic #7: define strtolower mb_ereg_replace set_time_limit require urldecode get_magic_quotes_gpc md5 dirname strpos file_exists explode require_once ob_start islogin include strdir rand header ini_get Topic #8: substr trim assert str_split rtrim preg_replace sizeof array_flip wzgygjawpn ltrim md5 getwritingscriptpath gzcompress removeoldlogs isbot ymsumrlfso gmdate strrpos var_dump getfakedownloadpath Topic #9: ini_set set_time_limit strpos implode set_magic_quotes_runtime block_bot is_array class_exists unserialize ignore_user_abort wsologin xor_data__mut dirname microtime preg_match str_ireplace printlogin file_put_contents session_start clearstatcache () Fitting LDA models with tf features, n_samples=2000 and n_features=1000... done in 0.553s. 
Topics in LDA model: Topic #0: ymsumrlfso substr sizeof mt_rand string_n_random preg_replace wzgygjawpn chr eval strstr function_exists str_replace explode strtolower dirname reads base64_encode require_once file_read file_get_contents Topic #1: strlen ord stripos base64_decode eval gzuncompress xor_data__mut preg_replace __lambda_func str_replace chr create_function ltrim trim rtrim str_split ceil str_repeat substr ini_set Topic #2: mb_ereg_replace urldecode assert explode array_map loginpage array_flip print_r chr ini_get fopen ord eval is_callable in_array str_replace strrpos set_magic_quotes_runtime var_dump substr Topic #3: time is_dir filemtime fileowner fileperms filegroup posix_getpwuid is_link posix_getgrgid filesize basename str_replace urlencode realpath function_exists round define htmlspecialchars readdir is_file Topic #4: eval base64_encode ini_set error_reporting file_get_contents md5 base64_decode implode preg_match strpos stream_context_create strdir set_time_limit chop define str_replace getinfo set_magic_quotes_runtime session_start unserialize Topic #5: array_map ord chr krzbnlqhvt substr sizeof strlen xor_data__mut preg_replace base64_decode eval function_exists error_reporting str_replace str_split __lambda_func explode ini_set create_function unserialize Topic #6: makename trydir getfilename exists file_exists getdir call_user_func removeoldlogs remove md5 substr file_put_contents write gmdate strtotime read dds_debug round str_replace __construct Topic #7: is_file readdir filesize round preg_match strrev str_rot13 strtr basename is_dir posix_getgrgid posix_getpwuid is_link fileperms filegroup fileowner filemtime base64_decode count str_replace Topic #8: qfyuecrkyr strpos substr sizeof header ord define isref checkrequest strlen trim _runmagicquotes preg_replace chr base64_decode run clientip is_array function_exists isallowdip Topic #9: eval wzgygjawpn base64_decode stristr gzinflate function_exists preg_match create_function stripos error_reporting define header defined substr ob_start strtolower headers_sent gzdecode __lambda_func gzuncompress ()
We print the top-weighted words for each of the 10 independent topics extracted by NMF. Words such as eval and base64_decode appear very frequently, which matches our prior expectation: if these calls occur often in a file, there is a good chance the file is malicious.
We know that a server's login records (ssh/rdp) over a given time window may contain both a malicious attacker's and a legitimate administrator's logins. To reduce noise, we can run an ICA/NMF-style decomposition on the login logs before they reach the detection module, separating attack activity from normal activity (much like the MNIST denoising earlier in the article); detecting each of the separated streams then becomes much easier.
Relevant Link:
https://scikit-learn.org/stable/modules/decomposition.html#nmf
https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html#sphx-glr-auto-examples-applications-plot-topics-extraction-with-nmf-lda-py