Official example: https://www.tensorflow.org/tutorials/keras/basic_text_classification
Main steps: load the IMDB dataset, explore the data format, prepare (pad) the data, build and compile the model, split off a validation set, train, evaluate on the test set, and visualize the training history.
The dataset contains 50,000 movie review texts from the Internet Movie Database (IMDB).
Reference: https://developers.google.com/machine-learning/guides/text-classification/
# coding=utf-8
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pathlib
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
print("TensorFlow version: {} - tf.keras version: {}".format(tf.VERSION, tf.keras.__version__))  # check the versions
ds_path = str(pathlib.Path.cwd()) + "\\datasets\\imdb\\"  # dataset path

# ### Inspect the numpy-format data
np_data = np.load(ds_path + "imdb.npz")
print("np_data keys: ", list(np_data.keys()))  # list all keys
# print("np_data values: ", list(np_data.values()))  # list all values
# print("np_data items: ", list(np_data.items()))  # list all items

# ### Load the IMDB dataset
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    path=ds_path + "imdb.npz",
    num_words=10000  # keep only the 10,000 most frequent words in the training data
)

# ### Explore the data: understand the data format
# The dataset is already preprocessed: each sample is an array of integers representing the words of a review
# Each label is an integer 0 or 1, where 0 is a negative review and 1 is a positive review
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
print("First record: {}".format(train_data[0]))  # first review (the text has been converted to integers; each integer stands for a specific word in a dictionary)
print("Before len:{} len:{}".format(len(train_data[0]), len(train_data[1])))  # reviews have different lengths
# Convert the integers back to words
word_index = imdb.get_word_index(ds_path + "imdb_word_index.json")  # dictionary mapping words to integer indices
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])


def decode_review(text):
    """Look up the integer-to-string mapping and rebuild the review text"""
    return ' '.join([reverse_word_index.get(i, '?') for i in text])


print("The content of first record: ", decode_review(train_data[0]))  # show the text of the first review

# ### Prepare the data
# The reviews (integer arrays) must be converted to tensors before being fed into the network, and all reviews must have the same length
# Approach: pad the arrays so they all have the same length, then build an integer tensor of shape max_length * num_reviews
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)  # standardize the lengths with pad_sequences
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
print("After - len: {} len: {}".format(len(train_data[0]), len(train_data[1])))  # all reviews now have the same length
print("First record: \n", train_data[0])  # the first review after padding

# ### Build the model
# In this example the input data is an array of word indices; the label to predict is 0 or 1
# Stack the layers sequentially to build the classifier (how many layers, how many hidden units per layer)
vocab_size = 10000  # input shape (vocabulary size used for the reviews)
model = keras.Sequential()  # create a Sequential model, then add layers simply by calling .add()

# Embedding layer: looks up the embedding vector of each word index in the integer-encoded vocabulary
# The vectors are learned during training; this adds a dimension to the output array: (batch, sequence, embedding)
model.add(keras.layers.Embedding(vocab_size, 16))
# GlobalAveragePooling1D averages over the sequence dimension and returns a fixed-length output vector per sample
model.add(keras.layers.GlobalAveragePooling1D())
# The fixed-length vector is fed into a fully connected (Dense) layer with 16 hidden units
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
# The last layer is densely connected to a single output node; with the sigmoid activation the result is a float between 0 and 1, i.e. a probability or confidence level
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()  # print a short description of the model

# ### Loss function and optimizer
# The model needs a loss function and an optimizer for training
# Among the many available loss functions, binary_crossentropy is generally better suited to probabilities: it measures the "distance" between probability distributions
model.compile(optimizer=tf.train.AdamOptimizer(),  # optimizer
              loss='binary_crossentropy',  # loss function
              metrics=['accuracy'])  # metrics evaluated during training and testing

# ### Create a validation set
# Develop and tune the model using only the training data, then use the test data only once to evaluate accuracy
# Splitting a validation set off the original training data lets us check how the model handles data it has never seen
x_val = train_data[:10000]  # take 10,000 samples from the original training data as a validation set
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]  # the corresponding 10,000 validation labels
partial_y_train = train_labels[10000:]

# ### Train the model
# Iterate over all samples in the partial_x_train and partial_y_train tensors
# During training, monitor the loss and accuracy on the 10,000 validation samples (x_val, y_val)
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,  # number of training epochs (iterations over the data)
                    batch_size=512,  # batch size (samples per gradient update)
                    validation_data=(x_val, y_val),  # validation data
                    verbose=2  # logging mode: 0 = silent, 1 = progress bar (default), 2 = one line per epoch
                    )  # returns a History object whose dictionary records everything that happened during training

# ### Evaluate the model
# In test mode, return the model's loss and metric values
results = model.evaluate(test_data, test_labels)  # returns two values: loss (an error figure, lower is better) and accuracy
print("Result: {}".format(results))

# ### Visualization
history_dict = history.history  # model.fit returns a History callback whose history attribute holds the per-epoch losses and other metrics
print("Keys: {}".format(history_dict.keys()))  # 4 entries, one per metric monitored during training and validation
loss = history.history['loss']
validation_loss = history.history['val_loss']
accuracy = history.history['acc']
validation_accuracy = history.history['val_acc']
epochs = range(1, len(accuracy) + 1)

plt.subplot(121)  # loss over time, first subplot in a 1-row, 2-column grid
plt.plot(epochs, loss, 'bo', label='Training loss')  # 'bo' = blue dots
plt.plot(epochs, validation_loss, 'b', label='Validation loss')  # 'b' = solid blue line
plt.title('Training and validation loss')  # title
plt.xlabel('Epochs')  # x-axis label
plt.ylabel('Loss')  # y-axis label
plt.legend()  # draw the legend

plt.subplot(122)  # accuracy over time
plt.plot(epochs, accuracy, color='red', marker='o', label='Training accuracy')
plt.plot(epochs, validation_accuracy, 'r', linewidth=1, label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.savefig("./outputs/sample-2-figure.png", dpi=200, format='png')
plt.show()  # display the figure
TensorFlow version: 1.12.0
np_data keys: ['x_test', 'x_train', 'y_train', 'y_test']
Training entries: 25000, labels: 25000
First record: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Before len:218 len:189
The content of first record: <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
After - len: 256 len: 256
First record:
[ 1 14 22 16 43 530 973 1622 1385 65 458 4468 66 3941
4 173 36 256 5 25 100 43 838 112 50 670 2 9
35 480 284 5 150 4 172 112 167 2 336 385 39 4
172 4536 1111 17 546 38 13 447 4 192 50 16 6 147
2025 19 14 22 4 1920 4613 469 4 22 71 87 12 16
43 530 38 76 15 13 1247 4 22 17 515 17 12 16
626 18 2 5 62 386 12 8 316 8 106 5 4 2223
5244 16 480 66 3785 33 4 130 12 16 38 619 5 25
124 51 36 135 48 25 1415 33 6 22 12 215 28 77
52 5 14 407 16 82 2 8 4 107 117 5952 15 256
4 2 7 3766 5 723 36 71 43 530 476 26 400 317
46 7 4 2 1029 13 104 88 4 381 15 297 98 32
2071 56 26 141 6 194 7486 18 4 226 22 21 134 476
26 480 5 144 30 5535 18 51 36 28 224 92 25 104
4 226 65 16 38 1334 88 12 16 283 5 16 4472 113
103 32 15 16 5345 19 178 32 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0]
Model summary:
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 16) 160000
_________________________________________________________________
global_average_pooling1d (Gl (None, 16) 0
_________________________________________________________________
dense (Dense) (None, 16) 272
_________________________________________________________________
dense_1 (Dense) (None, 1) 17
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
- 1s - loss: 0.6914 - acc: 0.5768 - val_loss: 0.6887 - val_acc: 0.6371
Epoch 2/40
- 1s - loss: 0.6835 - acc: 0.7170 - val_loss: 0.6784 - val_acc: 0.7431
Epoch 3/40
- 1s - loss: 0.6680 - acc: 0.7661 - val_loss: 0.6591 - val_acc: 0.7566
Epoch 4/40
- 1s - loss: 0.6407 - acc: 0.7714 - val_loss: 0.6290 - val_acc: 0.7724
Epoch 5/40
- 1s - loss: 0.6016 - acc: 0.8012 - val_loss: 0.5880 - val_acc: 0.7914
Epoch 6/40
- 1s - loss: 0.5543 - acc: 0.8191 - val_loss: 0.5435 - val_acc: 0.8059
Epoch 7/40
- 1s - loss: 0.5040 - acc: 0.8387 - val_loss: 0.4984 - val_acc: 0.8256
Epoch 8/40
- 1s - loss: 0.4557 - acc: 0.8551 - val_loss: 0.4574 - val_acc: 0.8390
Epoch 9/40
- 1s - loss: 0.4132 - acc: 0.8659 - val_loss: 0.4227 - val_acc: 0.8483
Epoch 10/40
- 1s - loss: 0.3763 - acc: 0.8795 - val_loss: 0.3946 - val_acc: 0.8558
Epoch 11/40
- 1s - loss: 0.3460 - acc: 0.8874 - val_loss: 0.3740 - val_acc: 0.8601
Epoch 12/40
- 1s - loss: 0.3212 - acc: 0.8929 - val_loss: 0.3540 - val_acc: 0.8689
Epoch 13/40
- 1s - loss: 0.2984 - acc: 0.8999 - val_loss: 0.3402 - val_acc: 0.8713
Epoch 14/40
- 1s - loss: 0.2796 - acc: 0.9057 - val_loss: 0.3280 - val_acc: 0.8737
Epoch 15/40
- 1s - loss: 0.2633 - acc: 0.9101 - val_loss: 0.3187 - val_acc: 0.8762
Epoch 16/40
- 1s - loss: 0.2493 - acc: 0.9141 - val_loss: 0.3110 - val_acc: 0.8786
Epoch 17/40
- 1s - loss: 0.2356 - acc: 0.9200 - val_loss: 0.3046 - val_acc: 0.8791
Epoch 18/40
- 1s - loss: 0.2237 - acc: 0.9237 - val_loss: 0.2994 - val_acc: 0.8810
Epoch 19/40
- 1s - loss: 0.2126 - acc: 0.9278 - val_loss: 0.2955 - val_acc: 0.8829
Epoch 20/40
- 1s - loss: 0.2028 - acc: 0.9316 - val_loss: 0.2920 - val_acc: 0.8832
Epoch 21/40
- 1s - loss: 0.1932 - acc: 0.9347 - val_loss: 0.2893 - val_acc: 0.8836
Epoch 22/40
- 1s - loss: 0.1844 - acc: 0.9389 - val_loss: 0.2877 - val_acc: 0.8843
Epoch 23/40
- 1s - loss: 0.1765 - acc: 0.9421 - val_loss: 0.2867 - val_acc: 0.8853
Epoch 24/40
- 1s - loss: 0.1685 - acc: 0.9469 - val_loss: 0.2852 - val_acc: 0.8844
Epoch 25/40
- 1s - loss: 0.1615 - acc: 0.9494 - val_loss: 0.2848 - val_acc: 0.8858
Epoch 26/40
- 1s - loss: 0.1544 - acc: 0.9522 - val_loss: 0.2850 - val_acc: 0.8859
Epoch 27/40
- 1s - loss: 0.1486 - acc: 0.9543 - val_loss: 0.2860 - val_acc: 0.8847
Epoch 28/40
- 1s - loss: 0.1424 - acc: 0.9573 - val_loss: 0.2857 - val_acc: 0.8867
Epoch 29/40
- 1s - loss: 0.1367 - acc: 0.9587 - val_loss: 0.2867 - val_acc: 0.8867
Epoch 30/40
- 1s - loss: 0.1318 - acc: 0.9607 - val_loss: 0.2882 - val_acc: 0.8863
Epoch 31/40
- 1s - loss: 0.1258 - acc: 0.9634 - val_loss: 0.2899 - val_acc: 0.8868
Epoch 32/40
- 1s - loss: 0.1212 - acc: 0.9652 - val_loss: 0.2919 - val_acc: 0.8854
Epoch 33/40
- 1s - loss: 0.1159 - acc: 0.9679 - val_loss: 0.2941 - val_acc: 0.8853
Epoch 34/40
- 1s - loss: 0.1115 - acc: 0.9690 - val_loss: 0.2972 - val_acc: 0.8852
Epoch 35/40
- 1s - loss: 0.1077 - acc: 0.9705 - val_loss: 0.2988 - val_acc: 0.8845
Epoch 36/40
- 1s - loss: 0.1028 - acc: 0.9727 - val_loss: 0.3020 - val_acc: 0.8841
Epoch 37/40
- 1s - loss: 0.0990 - acc: 0.9737 - val_loss: 0.3050 - val_acc: 0.8830
Epoch 38/40
- 1s - loss: 0.0956 - acc: 0.9745 - val_loss: 0.3087 - val_acc: 0.8824
Epoch 39/40
- 1s - loss: 0.0914 - acc: 0.9765 - val_loss: 0.3109 - val_acc: 0.8832
Epoch 40/40
- 1s - loss: 0.0878 - acc: 0.9780 - val_loss: 0.3148 - val_acc: 0.8822
32/25000 [..............................] - ETA: 0s
3328/25000 [==>...........................] - ETA: 0s
7296/25000 [=======>......................] - ETA: 0s
11072/25000 [============>.................] - ETA: 0s
14304/25000 [================>.............] - ETA: 0s
17888/25000 [====================>.........] - ETA: 0s
21760/25000 [=========================>....] - ETA: 0s
25000/25000 [==============================] - 0s 16us/step
Result: [0.33562567461490633, 0.87216]
Keys: dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])
Error message:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
.......
Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz: None -- [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Solution: download imdb.npz manually and load it from a local path, as the code above does by passing path=ds_path + "imdb.npz" to imdb.load_data.
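A minimal sketch of that workaround, assuming imdb.npz has already been downloaded by hand into a local datasets/imdb/ directory (the directory name is an assumption, not part of the original tutorial):

# Load a manually downloaded copy of imdb.npz instead of letting Keras
# fetch it from storage.googleapis.com (which may be unreachable).
import pathlib
from tensorflow import keras

local_npz = str(pathlib.Path.cwd() / "datasets" / "imdb" / "imdb.npz")  # assumed location
(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(
    path=local_npz,   # an absolute path to an existing file skips the download
    num_words=10000)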
Error message:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
......
Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json: None -- [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Solution: download imdb_word_index.json manually and load it from a local path, as the code above does by passing the local file to imdb.get_word_index.
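The same idea works for the word-index file; a minimal sketch, assuming imdb_word_index.json sits in the same local directory as above:

# Point get_word_index at a manually downloaded imdb_word_index.json
# so Keras does not try to download it.
import pathlib
from tensorflow import keras

local_json = str(pathlib.Path.cwd() / "datasets" / "imdb" / "imdb_word_index.json")  # assumed location
word_index = keras.datasets.imdb.get_word_index(path=local_json)
print(len(word_index))  # size of the full vocabulary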