AI => TensorFlow 2.0 Syntax - Dataset Wrapping + Train/Test/Validation Splitting (Part 2)

Date: 2019-11-16

Tags: tensorflow2.0, tensorflow, syntax, dataset, data wrapping, train/test/validation split


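Splitting into training, test, and validation sets

Method 1: (borrowing the third-party sklearn library)

sklearn's train_test_split can only split data into two parts, so we need to split twice: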

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x, y,                # x, y are the original data
    test_size=0.2        # test_size defaults to 0.25 if omitted
)  # returns the remaining training set + the test set

x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train,    # split the remaining x_train, y_train again
    test_size=0.2        # 0.2 of the remaining 80%, i.e. 16% of the original data
)  # returns the final training set + the validation set
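
Note that the two calls above do not give a 60/20/20 split: the second test_size applies to the 80% that is left, so the validation set ends up with 16% of the original data. A minimal sketch that checks the resulting sizes, assuming 1000 dummy samples built with NumPy (the array contents, the name x_rest, and the random_state values are illustrative and not from the original post):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data for illustration: 1000 samples, 3 features each.
x = np.arange(3000).reshape(1000, 3)
y = np.arange(1000)

# First cut: 20% of everything goes to the test set.
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Second cut with test_size=0.2: 20% of the remaining 800 samples.
x_train, x_valid, y_train, y_valid = train_test_split(
    x_rest, y_rest, test_size=0.2, random_state=0
)
sizes_as_written = (len(x_train), len(x_valid), len(x_test))

# For an overall 60/20/20 split, the second fraction is 0.2 / (1 - 0.2) = 0.25.
x_train2, x_valid2, y_train2, y_valid2 = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)
sizes_60_20_20 = (len(x_train2), len(x_valid2), len(x_test))

print(sizes_as_written)   # (640, 160, 200)
print(sizes_60_20_20)     # (600, 200, 200)
```

Either variant is fine as long as the proportions are the ones you intend; the second fraction simply has to be expressed relative to what remains after the first cut.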

Once the data is split, it usually still needs batching, shuffling, and so on. The fit() method of a tf.keras model takes care of these in a single call.
For example:

from tensorflow import keras    # model is assumed to be an already-built keras model

model.compile(
    loss=keras.losses.mean_squared_error,
    optimizer=keras.optimizers.SGD(),
    metrics=['acc']    # note this metrics argument; it comes up again below
)

history = model.fit(
    x_train,
    y_train,
    validation_data=(x_valid, y_valid),     # the validation set is used here!
    epochs=100,
    batch_size=32,       # optional: batch_size defaults to 32
    shuffle=True,        # optional: shuffle defaults to True
    # callbacks=callbacks,
)
metrics = model.evaluate(x_test, y_test)    # returns the metrics (may include loss, acc)
# Why "may include"?
# Because the return value depends on the arguments passed to model.compile():
    # with metrics=['acc'], evaluate() returns a list [loss, acc]
    # without metrics, evaluate() returns a single loss value

y_predict = model.predict(x_test)            # returns the predictions
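
To make the point about the return value of evaluate() concrete, here is a minimal sketch on random data. The one-layer model, the build_model helper, and the data are hypothetical; I also use 'mae' instead of 'acc' as the metric, on the assumption that accuracy says little for a regression loss such as mean squared error.

```python
import numpy as np
from tensorflow import keras

# Tiny random regression problem, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 4)).astype("float32")
y = rng.normal(size=(64, 1)).astype("float32")

def build_model():
    # Hypothetical one-layer model; it is built on first use.
    return keras.Sequential([keras.layers.Dense(1)])

# Compiled with a metric: evaluate() returns a list, [loss, mae].
with_metrics = build_model()
with_metrics.compile(loss="mse", optimizer="sgd", metrics=["mae"])
result_with = with_metrics.evaluate(x, y, verbose=0)

# Compiled without metrics: evaluate() returns a single loss value.
without_metrics = build_model()
without_metrics.compile(loss="mse", optimizer="sgd")
result_without = without_metrics.evaluate(x, y, verbose=0)

print(type(result_with), len(result_with))
print(type(result_without))
```

Code that unpacks the result as `loss, acc = model.evaluate(...)` therefore only works when exactly one metric was passed to compile().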

Method 2: (tf.split)

This is code I wrapped up myself. It combines three-way splitting, shuffling, and batching in one place (it may have flaws).
Uploaded to GitHub: https://github.com/hacker-lin...
Definition:

import tensorflow as tf

class HandlerData:
    def __init__(self, x, y):
        """Wrapper class; the data is passed in at instantiation and stored."""
        self.x = x
        self.y = y

    def shuffle_and_batch(self, x, y, batch_size=None):
        """Always shuffles; batch_size is optional. Shuffling should really be optional too."""
        data = tf.data.Dataset.from_tensor_slices((x, y))    # wrap as a dataset

        data_ = data.shuffle(        # shuffle
            buffer_size=x.shape[0],  # per the docs, a full shuffle needs buffer_size >= number of samples
        )
        if batch_size:
            data_ = data_.batch(batch_size)
        return data_

    def train_test_valid_split(self,
        test_size=0.2,                 # fraction for the test set
        valid_size=0.2,                # fraction for the validation set
        batch_size=32,                 # batch_size defaults to 32
        is_batch_and_shuffle=True      # whether to shuffle and batch; on by default
    ):

        sample_num = self.x.shape[0]    # total number of samples
        test_sample = int(sample_num * test_size)                      # test set size
        valid_sample = int(sample_num * valid_size)                    # validation set size
        train_sample = sample_num - test_sample - valid_sample         # training set size
        # int() is needed because the float products are not exact integers.
        # The training size is taken as the remainder so that the three sizes
        # always sum to sample_num, which tf.split requires.

        # tf.split() was covered in the previous post: n parts, each possibly of a different size
        x_train, x_test, x_valid = tf.split(
            self.x,
            num_or_size_splits=[train_sample, test_sample, valid_sample],
            axis=0
        )
        y_train, y_test, y_valid = tf.split(
            self.y,
            [train_sample, test_sample, valid_sample],
            axis=0
        )
        # The sizes are shared variables computed before splitting x and y,
        # so x and y cannot get out of step with each other.

        if is_batch_and_shuffle:   # shuffle and batch; this is the default path
            return (
                self.shuffle_and_batch(x_train, y_train, batch_size=batch_size),
                self.shuffle_and_batch(x_test, y_test, batch_size=batch_size),
                self.shuffle_and_batch(x_valid, y_valid, batch_size=batch_size),
            )
        else:    # pass is_batch_and_shuffle=False to get only the raw split data
            return (
                (x_train, y_train),
                (x_test, y_test),
                (x_valid, y_valid)
            )
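
One detail worth spelling out: the sizes passed to tf.split must add up to the length of the axis being split. A minimal sketch of computing three sizes that always satisfy this, in plain Python; the helper name split_sizes is hypothetical and is not part of the class above:

```python
def split_sizes(sample_num, test_size=0.2, valid_size=0.2):
    """Return (train, test, valid) sizes that always sum to sample_num."""
    test_sample = int(sample_num * test_size)
    valid_sample = int(sample_num * valid_size)
    # Give the training set whatever is left, so truncation never loses a sample.
    train_sample = sample_num - test_sample - valid_sample
    return train_sample, test_sample, valid_sample

print(split_sizes(1000))  # (600, 200, 200)
print(split_sizes(1001))  # (601, 200, 200)

# Truncating all three products independently can come up short:
naive = int(1001 * 0.6) + int(1001 * 0.2) + int(1001 * 0.2)
print(naive)  # 1000, one sample fewer than 1001
```

With 1000 samples both approaches agree, which is why the problem does not show up in the example that follows.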

Usage example:

x = tf.ones([1000, 5000])
y = tf.ones([1000, 1])

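data_obj = HandlerData(x,y)   # x is the raw sample data, y is the raw label data

# Option 1: shuffle and batch; no arguments needed, everything uses the defaults
train, test, valid = data_obj.train_test_valid_split(
    # test_size=0.2, 
    # valid_size=0.2, 
    # batch_size=32, 
    # is_batch_and_shuffle=True
) # none of these arguments has to be passed; the values shown are the defaults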

print(train)
print(test)
print(valid)

# Result
>>> <BatchDataset shapes: ((None, 5000), (None, 1)), types: (tf.float32, tf.float32)>
>>> <BatchDataset shapes: ((None, 5000), (None, 1)), types: (tf.float32, tf.float32)>
>>> <BatchDataset shapes: ((None, 5000), (None, 1)), types: (tf.float32, tf.float32)>

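# The sample dimension shows as None, but that is fine: the batch size only becomes concrete once you iterate
for x_train,y_train in train:
    print(x_train.shape,y_train.shape)

# Result: 600 // 32 == 18 (count them: exactly 18 full batches)
# Result: 600 % 32 == 24 (the last batch holds 24 samples)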
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(32, 5000) (32, 1)
(24, 5000) (24, 1)   # batches of 32; the last one holds the remaining 24


# Option 2: no shuffling, no batching, raw split data only
(x_train, y_train), (x_test, y_test), (x_valid, y_valid) = data_obj.train_test_valid_split(
    # test_size=0.2,
    # valid_size=0.2,
    # batch_size=32,
    is_batch_and_shuffle=False    # just set this to False; the other arguments are optional
)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)
print(x_valid.shape, y_valid.shape)

# Result
>>> (600, 5000) (600, 1)
>>> (200, 5000) (200, 1)
>>> (200, 5000) (200, 1)
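
The batch counts seen above follow directly from integer division. A small sketch of that arithmetic in plain Python; the helper name batch_sizes is hypothetical, and it only models the counting, not tf.data itself:

```python
def batch_sizes(sample_num, batch_size, drop_remainder=False):
    """Sizes of the batches produced when sample_num items are batched."""
    full, rest = divmod(sample_num, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_remainder:
        sizes.append(rest)   # the final, smaller batch
    return sizes

train_batches = batch_sizes(600, 32)
print(len(train_batches), train_batches[-1])            # 19 24
print(len(batch_sizes(600, 32, drop_remainder=True)))   # 18

test_batches = batch_sizes(200, 32)
print(len(test_batches), test_batches[-1])              # 7 8
```

So the 600-sample training set yields 18 full batches plus one batch of 24, and each 200-sample set yields 6 full batches plus one batch of 8.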

Method 3 (train/validation split)

history = model.fit(
    .....
    validation_split=0.2      # hold out the last 20% of the training data (taken before shuffling) as the validation set
)
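
According to the Keras documentation, validation_split takes the validation data from the last samples of x and y, before any shuffling. The sketch below imitates that behaviour with NumPy slicing to show the consequence; tail_split is a hypothetical stand-in, not Keras code:

```python
import math
import numpy as np

def tail_split(x, y, validation_split):
    """Illustrative stand-in: hold out the last fraction of the samples."""
    split_at = math.floor(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(10)
y = x * 10
(x_train, y_train), (x_valid, y_valid) = tail_split(x, y, 0.2)
print(x_train.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7]
print(x_valid.tolist())   # [8, 9]

# If the data is ordered (for example sorted by label), permute x and y
# together first so that the tail is representative.
perm = np.random.default_rng(0).permutation(len(x))
x_shuffled, y_shuffled = x[perm], y[perm]
```

In other words, with ordered data it is worth shuffling once before calling fit() with validation_split.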

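Data processing (dataset)

The job of this module is to wrap our data, or TF tensors, into a dataset.
The dataset comes with a ready-made API that handles batching, shuffling, iteration, and a range of other operations.

Basic understanding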

import numpy as np

dataset = tf.data.Dataset.from_tensor_slices(np.arange(16).reshape(4,4))
Before reading anything out, the data should look like this (one big list containing 4 small lists):
[
    [0, 1, 2 ,3 ],
    [4, 5, 6 ,7 ],
    [8, 9, 10,11],
    [12,13,14,15],
]

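for data in dataset:   # a dataset must be iterated (or turned into an iterator with iter()) to yield values
    print(data)        # each item is one of the small lists, e.g. the first one is [0, 1, 2, 3]
                       # But remember that this is TensorFlow, so every item is wrapped as a Tensor.
                       # So each item we iterate out is a Tensor
Output (the dtype shows as int64 on platforms where NumPy's default integer is 64-bit):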
>>    tf.Tensor([0 1 2 3], shape=(4,), dtype=int32)
      tf.Tensor([4 5 6 7], shape=(4,), dtype=int32)
      tf.Tensor([ 8  9 10 11], shape=(4,), dtype=int32)
      tf.Tensor([12 13 14 15], shape=(4,), dtype=int32)

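As said above, the data layout is one big list containing 4 small lists.
Correspondingly, think of it as one big Tensor containing 4 small Tensors. Keep this idea in mind.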
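
The "big Tensor containing small Tensors" picture can be checked directly. A minimal sketch, assuming TensorFlow 2.x in eager mode; the dtype is pinned to int32 here only so the result does not depend on the platform, and the contrast with from_tensors is my addition:

```python
import numpy as np
import tensorflow as tf

data = np.arange(16, dtype=np.int32).reshape(4, 4)

# from_tensor_slices cuts along the first axis: 4 elements of shape (4,).
dataset = tf.data.Dataset.from_tensor_slices(data)
rows = [row.numpy().tolist() for row in dataset]
print(len(rows))    # 4
print(rows[0])      # [0, 1, 2, 3]

# from_tensors, by contrast, keeps the whole array as one single element.
whole = tf.data.Dataset.from_tensors(data)
shapes = [item.shape.as_list() for item in whole]
print(shapes)       # [[4, 4]]
```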

Input argument types

Passing a tuple:

question = [[1, 0], [1, 1]]
answer = ['encode', 'decoder']
dataset = tf.data.Dataset.from_tensor_slices( (question, answer) ) # wrapped in a tuple
for data in dataset:
    print(data[0],'=>' ,data[1])
Output:
>> tf.Tensor([1 0], shape=(2,), dtype=int32) => tf.Tensor(b'encode', shape=(), dtype=string)
   tf.Tensor([1 1], shape=(2,), dtype=int32) => tf.Tensor(b'decoder', shape=(), dtype=string)
   
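As you can see, it automatically pairs up the two big lists we passed in, question and answer, which is roughly equivalent to a zip() operation.

# From my own experiments: when training an Encoder-Decoder model, the encoded question-answer pairs can be passed as a tuple in exactly this way.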
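
The zip() comparison can be verified by unpacking each element and converting it back to plain Python values. A minimal sketch, assuming TensorFlow 2.x in eager mode (the pairs list is just a helper for the comparison):

```python
import tensorflow as tf

question = [[1, 0], [1, 1]]
answer = ['encode', 'decoder']
dataset = tf.data.Dataset.from_tensor_slices((question, answer))

pairs = []
for q, a in dataset:   # each element is a (question, answer) tuple
    # String tensors hold bytes, so decode them to get str back.
    pairs.append((q.numpy().tolist(), a.numpy().decode('utf-8')))

print(pairs)                                  # [([1, 0], 'encode'), ([1, 1], 'decoder')]
print(pairs == list(zip(question, answer)))   # True
```

Unpacking directly in the for statement is usually more readable than indexing with data[0] and data[1].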

Passing a dict:

data_dict = {
    'encoder': [1, 0],
    'decoder': [1, 1]
}

dataset = tf.data.Dataset.from_tensor_slices(data_dict)
for data in dataset:    # each element is itself a dict
    print(data)

# In effect the values are sliced and converted to Tensors; the overall structure is unchanged
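
Since no output is shown for the dict case, here is a minimal sketch of what each element contains, assuming TensorFlow 2.x in eager mode. The values are converted back to Python ints only to make the structure easy to read:

```python
import tensorflow as tf

data_dict = {'encoder': [1, 0], 'decoder': [1, 1]}
dataset = tf.data.Dataset.from_tensor_slices(data_dict)

# Each element is a dict with the same keys, holding one slice of every value.
elements = [
    {key: int(value.numpy()) for key, value in element.items()}
    for element in dataset
]

print(len(elements))                                   # 2
print(elements[0] == {'encoder': 1, 'decoder': 1})     # True
print(elements[1] == {'encoder': 0, 'decoder': 1})     # True
```

The slicing runs across the values, not across the keys: the first element takes index 0 of every list, the second takes index 1.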

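Method chaining

Most Dataset API operations can be chained (much like the replace method of Python strings).
Using the 4x4 dataset from "Basic understanding" as example data, here are a few of the APIs:

batch (batching)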

for data in dataset.batch(2):    # with drop_remainder=True, a final batch smaller than the batch size is dropped
    print(data) 
Output: 
>>    tf.Tensor([[0 1 2 3] [4 5 6 7]], shape=(2, 4), dtype=int32)
      tf.Tensor([[ 8  9 10 11] [12 13 14 15]], shape=(2, 4), dtype=int32)
                     
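As said above, by default each item iterated out is one Tensor; for the data above, that means 4 Tensors.
After calling batch(2), every 2 items are grouped into one batch, which is then wrapped as a Tensor.
So 4/2 = 2 batches, wrapped as 2 Tensors.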
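
The 4x4 example divides evenly, so drop_remainder makes no difference there. A minimal sketch with 5 elements shows its effect, assuming TensorFlow 2.x in eager mode (tf.data.Dataset.range is used only to get a small dataset quickly):

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)   # elements 0, 1, 2, 3, 4

keep = [batch.numpy().tolist() for batch in dataset.batch(2)]
drop = [batch.numpy().tolist() for batch in dataset.batch(2, drop_remainder=True)]

print(keep)   # [[0, 1], [2, 3], [4]]
print(drop)   # [[0, 1], [2, 3]]
```

Dropping the remainder is useful when later code needs every batch to have exactly the same shape.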

repeat (reuse the data: the epoch idea, training for n passes)

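Note (the argument is the total number of passes, counting the original): 
    1. If repeat() gets no argument, the data repeats forever.
    2. If the argument is 0, no data is produced.
    3. If the argument is 1, there is exactly one copy of the data.
    4. If the argument is 2, there are 2 copies in total (the original counts as one of them; that is what "repeat" means here).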
    
for data in dataset.repeat(2).batch(3):   # repeat twice, then group in 3s (this is method chaining)
    print(data)

Result
>>  tf.Tensor([[ 0  1  2  3] [ 4  5  6  7] [ 8  9 10 11]], shape=(3, 4), dtype=int32)  
    tf.Tensor([[12 13 14 15] [ 0  1  2  3] [ 4  5  6  7]], shape=(3, 4), dtype=int32)
    tf.Tensor([[ 8  9 10 11] [12 13 14 15]], shape=(2, 4), dtype=int32)  
    
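    The original data has 4 items; repeated 2 times: 4*2=8 
    The chained call then batches them in 3s: 8/3 = 2 remainder 2    (full batches of 3, plus a final batch holding the remainder)
    # Also note that the repeats are concatenated in order, so a batch can span the end of one pass and the start of the next 
    (e.g. like a string of lollipops from childhood: pull carelessly and you tear off the previous one's wrapper too)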
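
Whether batches cross the boundary between passes depends on the order of the chained calls. A minimal sketch comparing both orders on a 4-element dataset, assuming TensorFlow 2.x in eager mode:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(4)   # elements 0, 1, 2, 3

# repeat first: the two passes are joined, so a batch may straddle them.
repeat_then_batch = [b.numpy().tolist() for b in dataset.repeat(2).batch(3)]

# batch first: each pass is batched on its own, then the batches are repeated.
batch_then_repeat = [b.numpy().tolist() for b in dataset.batch(3).repeat(2)]

print(repeat_then_batch)   # [[0, 1, 2], [3, 0, 1], [2, 3]]
print(batch_then_repeat)   # [[0, 1, 2], [3], [0, 1, 2], [3]]
```

If each epoch should end on a clean boundary, call batch() before repeat(); if only full batches matter, repeat() before batch() wastes fewer samples.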