AI - TensorFlow - Example 02: Movie Review Text Classification

Movie Review Text Classification

Official tutorial: https://www.tensorflow.org/tutorials/keras/basic_text_classification
Main steps:

  • 1. Load the IMDB dataset
  • 2. Explore the data: understand the data format and convert the integers back to words
  • 3. Prepare the data
  • 4. Build the model: hidden units, loss function, and optimizer
  • 5. Create a validation set
  • 6. Train the model
  • 7. Evaluate the model
  • 8. Visualize: plot accuracy and loss over time

The IMDB Dataset

Contains 50,000 movie review texts from the Internet Movie Database, split into 25,000 reviews for training and 25,000 for testing.

MLCC Text Classification Guide

https://developers.google.com/machine-learning/guides/text-classification/

Example

Script

GitHub: https://github.com/anliven/Hello-AI/blob/master/Google-Learn-and-use-ML/2_basic_text_classification.py

# coding=utf-8
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import pathlib
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
print("TensorFlow version: {}  - tf.keras version: {}".format(tf.VERSION, tf.keras.__version__))  # show versions
ds_path = str(pathlib.Path.cwd()) + "\\datasets\\imdb\\"  # dataset path

# ### Inspect the raw numpy data
np_data = np.load(ds_path + "imdb.npz")
print("np_data keys: ", list(np_data.keys()))  # list all keys
# print("np_data values: ", list(np_data.values()))  # list all values
# print("np_data items: ", list(np_data.items()))  # list all items

# ### Load the IMDB dataset
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(
    path=ds_path + "imdb.npz",
    num_words=10000  # keep only the 10,000 most frequent words in the training data
)

# ### Explore the data: understand the data format
# The dataset comes preprocessed: each example is an array of integers representing the words of a review
# Each label is the integer 0 or 1, where 0 is a negative review and 1 is a positive review
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))
print("First record: {}".format(train_data[0]))  # first review (the text has been converted to integers, each standing for a specific word in a dictionary)
print("Before len:{} len:{}".format(len(train_data[0]), len(train_data[1])))  # reviews differ in length
# Convert the integers back to words
word_index = imdb.get_word_index(ds_path + "imdb_word_index.json")  # dictionary mapping words to integer indices
word_index = {k: (v + 3) for k, v in word_index.items()}  # shift indices by 3 to make room for the special tokens
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])


def decode_review(text):
    """Look up each integer in the reverse dictionary and join the decoded words."""
    return ' '.join([reverse_word_index.get(i, '?') for i in text])


print("The content of first record: ", decode_review(train_data[0]))  # show the text of the first review

# ### Prepare the data
# The reviews (integer arrays) must be converted to tensors before being fed into the network, and they must all have the same length
# Approach: pad the arrays so they share one length, then build an integer tensor of shape max_length * num_reviews
train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)  # standardize lengths with pad_sequences
test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)
print("After - len: {} len: {}".format(len(train_data[0]), len(train_data[1])))  # all reviews now have the same length
print("First record: \n", train_data[0])  # the first review after padding

# ### Build the model
# In this example the input data is an array of word indices, and the label to predict is 0 or 1
# Stack layers sequentially to build the classifier (decide how many layers, and how many hidden units per layer)
vocab_size = 10000  # input size (vocabulary used for the reviews)
model = keras.Sequential()  # create a Sequential model, then add layers with the .add() method

# Embedding layer: looks up the embedding vector for each word index in the integer-encoded vocabulary
# The vectors are learned during training; this adds a dimension to the output array: (batch, sequence, embedding)
model.add(keras.layers.Embedding(vocab_size, 16))
# The GlobalAveragePooling1D layer averages over the sequence dimension, returning a fixed-length output vector per example
model.add(keras.layers.GlobalAveragePooling1D())
# The fixed-length vector is piped through a fully connected (Dense) layer with 16 hidden units
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
# The last layer is densely connected to a single output node; with the sigmoid activation, the result is a float between 0 and 1, representing a probability or confidence level
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()  # print a short description of the model

# ### Loss function and optimizer
# A model needs a loss function and an optimizer for training
# Among the available loss functions, binary_crossentropy is generally better suited to probabilities: it measures the "distance" between probability distributions
model.compile(optimizer=tf.train.AdamOptimizer(),  # optimizer
              loss='binary_crossentropy',  # loss function
              metrics=['accuracy'])  # metrics used to evaluate the model during training and testing

# ### Create a validation set
# Develop and tune the model using only the training data, then use the test data exactly once to evaluate accuracy
# Splitting a validation set off the original training data lets us check how the model handles data it has never seen
x_val = train_data[:10000]  # hold out 10,000 examples from the original training data as a validation set
partial_x_train = train_data[10000:]
y_val = train_labels[:10000]  # hold out the matching 10,000 labels
partial_y_train = train_labels[10000:]

# ### Train the model
# Iterate over all examples in the partial_x_train and partial_y_train tensors
# During training, monitor the model's loss and accuracy on the 10,000 validation examples (x_val, y_val)
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,  # number of training epochs (iterations over the data)
                    batch_size=512,  # batch size (number of samples per gradient update)
                    validation_data=(x_val, y_val),  # validation data
                    verbose=2  # logging mode: 0 = silent, 1 = progress bar (default), 2 = one line per epoch
                    )  # returns a History object holding a dictionary of everything that happened during training

# ### Evaluate the model
# Returns the model's loss and metric values in test mode
results = model.evaluate(test_data, test_labels)  # returns two values: loss (a number representing error; lower is better) and accuracy
print("Result: {}".format(results))

# ### Visualize
history_dict = history.history  # model.fit returns a History callback whose history attribute holds lists of successive losses and other metrics
print("Keys: {}".format(history_dict.keys()))  # 4 entries, one for each metric monitored during training and validation
loss = history.history['loss']
validation_loss = history.history['val_loss']
accuracy = history.history['acc']
validation_accuracy = history.history['val_acc']
epochs = range(1, len(accuracy) + 1)

plt.subplot(121)  # plot the loss over time as the first subplot in a 1x2 grid
plt.plot(epochs, loss, 'bo', label='Training loss')  # "bo" means blue dots
plt.plot(epochs, validation_loss, 'b', label='Validation loss')  # "b" means a solid blue line
plt.title('Training and validation loss')  # title
plt.xlabel('Epochs')  # x-axis label
plt.ylabel('Loss')  # y-axis label
plt.legend()  # draw the legend

plt.subplot(122)  # plot the accuracy over time
plt.plot(epochs, accuracy, color='red', marker='o', label='Training accuracy')
plt.plot(epochs, validation_accuracy, 'r', linewidth=1, label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()

plt.savefig("./outputs/sample-2-figure.png", dpi=200, format='png')
plt.show()  # display the figure


Run Results

TensorFlow version: 1.12.0
np_data keys:  ['x_test', 'x_train', 'y_train', 'y_test']
Training entries: 25000, labels: 25000
First record: [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]
Before len:218 len:189
The content of first record:  <START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
After - len: 256 len: 256
First record: 
 [   1   14   22   16   43  530  973 1622 1385   65  458 4468   66 3941
    4  173   36  256    5   25  100   43  838  112   50  670    2    9
   35  480  284    5  150    4  172  112  167    2  336  385   39    4
  172 4536 1111   17  546   38   13  447    4  192   50   16    6  147
 2025   19   14   22    4 1920 4613  469    4   22   71   87   12   16
   43  530   38   76   15   13 1247    4   22   17  515   17   12   16
  626   18    2    5   62  386   12    8  316    8  106    5    4 2223
 5244   16  480   66 3785   33    4  130   12   16   38  619    5   25
  124   51   36  135   48   25 1415   33    6   22   12  215   28   77
   52    5   14  407   16   82    2    8    4  107  117 5952   15  256
    4    2    7 3766    5  723   36   71   43  530  476   26  400  317
   46    7    4    2 1029   13  104   88    4  381   15  297   98   32
 2071   56   26  141    6  194 7486   18    4  226   22   21  134  476
   26  480    5  144   30 5535   18   51   36   28  224   92   25  104
    4  226   65   16   38 1334   88   12   16  283    5   16 4472  113
  103   32   15   16 5345   19  178   32    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 16)          160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 16)                272       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
=================================================================
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
Model summary: 
Train on 15000 samples, validate on 10000 samples
Epoch 1/40
 - 1s - loss: 0.6914 - acc: 0.5768 - val_loss: 0.6887 - val_acc: 0.6371
Epoch 2/40
 - 1s - loss: 0.6835 - acc: 0.7170 - val_loss: 0.6784 - val_acc: 0.7431
Epoch 3/40
 - 1s - loss: 0.6680 - acc: 0.7661 - val_loss: 0.6591 - val_acc: 0.7566
Epoch 4/40
 - 1s - loss: 0.6407 - acc: 0.7714 - val_loss: 0.6290 - val_acc: 0.7724
Epoch 5/40
 - 1s - loss: 0.6016 - acc: 0.8012 - val_loss: 0.5880 - val_acc: 0.7914
Epoch 6/40
 - 1s - loss: 0.5543 - acc: 0.8191 - val_loss: 0.5435 - val_acc: 0.8059
Epoch 7/40
 - 1s - loss: 0.5040 - acc: 0.8387 - val_loss: 0.4984 - val_acc: 0.8256
Epoch 8/40
 - 1s - loss: 0.4557 - acc: 0.8551 - val_loss: 0.4574 - val_acc: 0.8390
Epoch 9/40
 - 1s - loss: 0.4132 - acc: 0.8659 - val_loss: 0.4227 - val_acc: 0.8483
Epoch 10/40
 - 1s - loss: 0.3763 - acc: 0.8795 - val_loss: 0.3946 - val_acc: 0.8558
Epoch 11/40
 - 1s - loss: 0.3460 - acc: 0.8874 - val_loss: 0.3740 - val_acc: 0.8601
Epoch 12/40
 - 1s - loss: 0.3212 - acc: 0.8929 - val_loss: 0.3540 - val_acc: 0.8689
Epoch 13/40
 - 1s - loss: 0.2984 - acc: 0.8999 - val_loss: 0.3402 - val_acc: 0.8713
Epoch 14/40
 - 1s - loss: 0.2796 - acc: 0.9057 - val_loss: 0.3280 - val_acc: 0.8737
Epoch 15/40
 - 1s - loss: 0.2633 - acc: 0.9101 - val_loss: 0.3187 - val_acc: 0.8762
Epoch 16/40
 - 1s - loss: 0.2493 - acc: 0.9141 - val_loss: 0.3110 - val_acc: 0.8786
Epoch 17/40
 - 1s - loss: 0.2356 - acc: 0.9200 - val_loss: 0.3046 - val_acc: 0.8791
Epoch 18/40
 - 1s - loss: 0.2237 - acc: 0.9237 - val_loss: 0.2994 - val_acc: 0.8810
Epoch 19/40
 - 1s - loss: 0.2126 - acc: 0.9278 - val_loss: 0.2955 - val_acc: 0.8829
Epoch 20/40
 - 1s - loss: 0.2028 - acc: 0.9316 - val_loss: 0.2920 - val_acc: 0.8832
Epoch 21/40
 - 1s - loss: 0.1932 - acc: 0.9347 - val_loss: 0.2893 - val_acc: 0.8836
Epoch 22/40
 - 1s - loss: 0.1844 - acc: 0.9389 - val_loss: 0.2877 - val_acc: 0.8843
Epoch 23/40
 - 1s - loss: 0.1765 - acc: 0.9421 - val_loss: 0.2867 - val_acc: 0.8853
Epoch 24/40
 - 1s - loss: 0.1685 - acc: 0.9469 - val_loss: 0.2852 - val_acc: 0.8844
Epoch 25/40
 - 1s - loss: 0.1615 - acc: 0.9494 - val_loss: 0.2848 - val_acc: 0.8858
Epoch 26/40
 - 1s - loss: 0.1544 - acc: 0.9522 - val_loss: 0.2850 - val_acc: 0.8859
Epoch 27/40
 - 1s - loss: 0.1486 - acc: 0.9543 - val_loss: 0.2860 - val_acc: 0.8847
Epoch 28/40
 - 1s - loss: 0.1424 - acc: 0.9573 - val_loss: 0.2857 - val_acc: 0.8867
Epoch 29/40
 - 1s - loss: 0.1367 - acc: 0.9587 - val_loss: 0.2867 - val_acc: 0.8867
Epoch 30/40
 - 1s - loss: 0.1318 - acc: 0.9607 - val_loss: 0.2882 - val_acc: 0.8863
Epoch 31/40
 - 1s - loss: 0.1258 - acc: 0.9634 - val_loss: 0.2899 - val_acc: 0.8868
Epoch 32/40
 - 1s - loss: 0.1212 - acc: 0.9652 - val_loss: 0.2919 - val_acc: 0.8854
Epoch 33/40
 - 1s - loss: 0.1159 - acc: 0.9679 - val_loss: 0.2941 - val_acc: 0.8853
Epoch 34/40
 - 1s - loss: 0.1115 - acc: 0.9690 - val_loss: 0.2972 - val_acc: 0.8852
Epoch 35/40
 - 1s - loss: 0.1077 - acc: 0.9705 - val_loss: 0.2988 - val_acc: 0.8845
Epoch 36/40
 - 1s - loss: 0.1028 - acc: 0.9727 - val_loss: 0.3020 - val_acc: 0.8841
Epoch 37/40
 - 1s - loss: 0.0990 - acc: 0.9737 - val_loss: 0.3050 - val_acc: 0.8830
Epoch 38/40
 - 1s - loss: 0.0956 - acc: 0.9745 - val_loss: 0.3087 - val_acc: 0.8824
Epoch 39/40
 - 1s - loss: 0.0914 - acc: 0.9765 - val_loss: 0.3109 - val_acc: 0.8832
Epoch 40/40
 - 1s - loss: 0.0878 - acc: 0.9780 - val_loss: 0.3148 - val_acc: 0.8822

   32/25000 [..............................] - ETA: 0s
 3328/25000 [==>...........................] - ETA: 0s
 7296/25000 [=======>......................] - ETA: 0s
11072/25000 [============>.................] - ETA: 0s
14304/25000 [================>.............] - ETA: 0s
17888/25000 [====================>.........] - ETA: 0s
21760/25000 [=========================>....] - ETA: 0s
25000/25000 [==============================] - 0s 16us/step
Result: [0.33562567461490633, 0.87216]
Keys: dict_keys(['val_loss', 'val_acc', 'loss', 'acc'])


Troubleshooting

Problem 1: imdb.load_data() fails

Error message:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
.......
Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz: None -- [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Solutions:

  • Method 1: download the file manually, then pass its absolute path as the path argument when loading the data (see the sketch after this list);
  • Method 2: download the file manually and place it in the matching folder under "~/.keras/datasets";
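
A minimal sketch of Method 1, assuming the file has been downloaded by hand; the "datasets/imdb" directory here is an assumption, so point it at wherever your copy of imdb.npz actually lives:

# coding=utf-8
# Sketch of Method 1: load a manually downloaded imdb.npz via an absolute path,
# so Keras reads the local file instead of attempting a download.
import os
from tensorflow import keras

local_npz = os.path.abspath(os.path.join("datasets", "imdb", "imdb.npz"))  # assumed location
(train_data, train_labels), (test_data, test_labels) = keras.datasets.imdb.load_data(
    path=local_npz,
    num_words=10000)
print("Loaded {} training and {} test reviews".format(len(train_data), len(test_data)))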

Problem 2: imdb.get_word_index() fails

Error message:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json
......
Exception: URL fetch failure on https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json: None -- [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Solution:

  • Download the file manually, then pass its absolute path as the path argument when loading it (see the sketch below);
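
The same workaround applies to the word index, as a minimal sketch (again, the local path is an assumption):

# coding=utf-8
# Sketch: read a manually downloaded imdb_word_index.json via an absolute path.
import os
from tensorflow import keras

local_json = os.path.abspath(os.path.join("datasets", "imdb", "imdb_word_index.json"))  # assumed location
word_index = keras.datasets.imdb.get_word_index(local_json)
print("word_index entries: {}".format(len(word_index)))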