The previous posts covered dimensionality reduction with PCA and SVD. Today I'll share another approach: autoencoder-based dimensionality reduction (auto-encoding).
An autoencoder is a data-compression algorithm built from two key components: an encoder and a decoder. The network structure looks like this:
First, a caveat: an autoencoder is not an image-compression tool. It only compresses the dimensionality of the data, so that the reduced dimensions still represent the original data. Image compression, by contrast, aims to drop pixels while keeping the result looking as close to the original as possible. "Looking similar" and "being representative of the data" are two different things. For example, in the figure below, are the upper and lower colour blocks the same colour? To a computer, they are.
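The point about the colour blocks can be made concrete with a tiny sketch (the pixel values here are made up for illustration): two patches with identical RGB values are indistinguishable to the machine, even if surrounding context makes them look different to a human eye.

```python
import numpy as np

# Two hypothetical 4-pixel patches from different parts of an image,
# both mid-grey (128, 128, 128). Values are assumptions for illustration.
block_a = np.array([[128, 128, 128]] * 4)
block_b = np.array([[128, 128, 128]] * 4)

# To the computer they are exactly the same data
print(np.array_equal(block_a, block_b))  # True
```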
So what is an autoencoder good for? According to Baidu Baike, it is used for dimensionality reduction and anomaly detection. Autoencoders built with convolutional layers can also be applied to computer-vision problems such as image denoising and neural style transfer. Here, though, I only care about dimensionality reduction.
The data here is again the KTV app's user data: 3,748 users and 16,692 song records.
```python
song_hot_matrix.shape # (3748, 16692)
```
The training goal is to predict whether a user is male or female.
```python
decades_hot_matrix.shape # (3748, 2)
```
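The article does not show how `song_hot_matrix` is built, but a minimal sketch of constructing such a user-by-song one-hot matrix might look like this (the play records and variable names below are assumptions, not the article's actual pipeline):

```python
import numpy as np

# Hypothetical (user_id, song) play records
records = [(0, "一路向北"), (0, "菊花臺"), (1, "小蘋果"), (2, "平凡之路")]

users = sorted({u for u, _ in records})
songs = sorted({s for _, s in records})
song_index = {s: j for j, s in enumerate(songs)}

# One row per user, one column per song; 1.0 marks "user sang this song"
hot = np.zeros((len(users), len(songs)), dtype=np.float32)
for u, s in records:
    hot[u, song_index[s]] = 1.0

print(hot.shape)  # (3, 4) here; (3748, 16692) for the article's data
```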
The encoding dimension here is 500; I also tried 300 and the accuracy was about the same.
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # disable the GPU

from keras.layers import Input, Dense
from keras.models import Model
from sklearn.model_selection import train_test_split

train_X, test_X, train_Y, test_Y = train_test_split(
    song_hot_matrix, decades_hot_matrix, test_size=0.2, random_state=0)

encoding_dim = 500
input_matrix = Input(shape=(song_hot_matrix.shape[1],))
encoded = Dense(encoding_dim, activation='relu')(input_matrix)
decoded = Dense(song_hot_matrix.shape[1], activation='sigmoid')(encoded)

autoencoder = Model(input_matrix, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy',
                    metrics=['accuracy'])

# Train the autoencoder to reconstruct its own input
autoencoder.fit(train_X, train_X,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(test_X, test_X))

# Extract the encoder half as its own model
encoder = Model(input_matrix, encoded)
```
Result:
```
2998/2998 [==============================] - 2s 679us/step - loss: 0.0092 - accuracy: 0.9984 - val_loss: 0.0316 - val_accuracy: 0.9982
Epoch 49/50
2998/2998 [==============================] - 2s 675us/step - loss: 0.0091 - accuracy: 0.9984 - val_loss: 0.0313 - val_accuracy: 0.9982
Epoch 50/50
2998/2998 [==============================] - 2s 694us/step - loss: 0.0090 - accuracy: 0.9984 - val_loss: 0.0312 - val_accuracy: 0.9982
```
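A quick bit of arithmetic on what the bottleneck buys us: each user vector shrinks from 16,692 song columns to 500 encoded features.

```python
original_dim = 16692   # song columns in song_hot_matrix
encoding_dim = 500     # bottleneck width used above

# Roughly a 33x reduction in feature count
ratio = original_dim / encoding_dim
print(round(ratio, 1))  # 33.4
```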
Judging by the reconstruction loss, this looks acceptable.
```python
from keras.models import Sequential
from keras.layers import Dense, Activation

# Encode the training and test data
train_X_em = encoder.predict(train_X)
test_X_em = encoder.predict(test_X)

# Train a gender classifier on the encoded features: a single Dense
# layer with softmax, i.e. multinomial logistic regression
model = Sequential()
model.add(Dense(input_dim=train_X_em.shape[1], units=train_Y.shape[1]))
model.add(Activation('softmax'))

# Choose the loss function and optimizer
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(train_X_em, train_Y,
          epochs=250,
          batch_size=256,
          shuffle=True,
          validation_data=(test_X_em, test_Y))
```
Result:
```
2998/2998 [==============================] - 0s 13us/step - loss: 0.4151 - accuracy: 0.8266 - val_loss: 0.4127 - val_accuracy: 0.8253
Epoch 248/250
2998/2998 [==============================] - 0s 13us/step - loss: 0.4149 - accuracy: 0.8195 - val_loss: 0.4180 - val_accuracy: 0.8413
Epoch 249/250
2998/2998 [==============================] - 0s 13us/step - loss: 0.4163 - accuracy: 0.8225 - val_loss: 0.4131 - val_accuracy: 0.8293
Epoch 250/250
2998/2998 [==============================] - 0s 13us/step - loss: 0.4152 - accuracy: 0.8299 - val_loss: 0.4142 - val_accuracy: 0.8293
```
```python
def pred(song_list=[]):
    # One-hot encode the song list, compress it with the encoder,
    # then classify and decode the predicted label
    blong_hot_matrix = song_label_encoder.encode_hot_dict({"bblong": song_list}, True)
    blong_hot_matrix = encoder.predict(blong_hot_matrix)
    y_pred = model.predict_classes(blong_hot_matrix)
    return user_decades_encoder.decode_list(y_pred)

print(pred(["一路向北", "暗香", "菊花臺"]))
print(pred(["不要說話", "平凡之路", "李白"]))
print(pred(["滿足", "被風吹過的夏天", "龍捲風"]))
print(pred(["情人", "再見", "無賴", "離人", "你的樣子"]))
print(pred(["小情歌", "我好想你", "無與倫比的美麗"]))
print(pred(["忐忑", "最炫民族風", "小蘋果"]))
print(pred(["青春修煉手冊", "愛出發", "寵愛", "魔法城堡", "樣"]))
```
Result:
```
['男']
['男']
['男']
['男']
['女']
['男']
['女']
```
Above, we used another compression technique, the autoencoder, to compress the data. Judging by the results, 500-dimensional features are enough to represent the overall characteristics of the data.
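Since the earlier posts in this series used PCA, the same encode-then-classify pipeline can also be sanity-checked against a linear PCA baseline with scikit-learn. The data below is a synthetic stand-in for `song_hot_matrix` and the gender labels, just to make the sketch runnable; the shapes and component count are assumptions, not the article's real figures.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic sparse 0/1 matrix standing in for song_hot_matrix
rng = np.random.default_rng(0)
X = (rng.random((200, 50)) > 0.8).astype(np.float32)
y = rng.integers(0, 2, size=200)  # stand-in gender labels

train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Project onto a low-dimensional linear subspace, then classify,
# mirroring the encoder -> softmax pipeline above
pca = PCA(n_components=10).fit(train_X)
clf = LogisticRegression(max_iter=1000).fit(pca.transform(train_X), train_y)
score = clf.score(pca.transform(test_X), test_y)
print(score)
```

Comparing this linear baseline's accuracy with the autoencoder pipeline is one way to judge whether the non-linear encoding is actually earning its keep.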