Google's GPU Cloud Platform: Free and Easy to Use

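Training deep learning models has been a headache for me lately. My NVIDIA GTX 960 has only 2 GB of video memory, so even transfer-learning models train painfully slowly, and the computer is all but unusable while training runs. I considered upgrading the graphics card, but the machine is a brand-name prebuilt with poor expandability, so I would have to buy a whole new PC. A look on JD.com showed that a gaming PC with an RTX 2080 Ti costs close to 20,000 yuan, to say nothing of deep learning workstations that support multiple GPUs. With no better option, I decided to give Google's GPU cloud platform a try. The platform is nothing new; I wrote two articles about it last year.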

In that earlier article, I described the shortcomings of Google Colab:

Google Colab's biggest shortcoming is that it runs on a virtual machine. What does that mean?

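It means that libraries we install ourselves, such as Keras, are wiped when the virtual machine restarts. To persist data we can use Google Drive. Remember the steps we went through earlier to mount Google Drive? Those steps also have to be repeated after every restart before Google Drive can be used. Worse, this happens not only when the virtual machine restarts but also when the Google Colab session is dropped, and a session lasts 12 hours at most.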

After this round of trials, however, I feel these drawbacks can be worked around:

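  1. Installing software on Google Colab is very fast. The hosts run outside China, so both pip installs of Python packages and apt installs of Ubuntu packages fly. As long as we save the install commands, re-running them next time takes only a few clicks.
  2. Google provides a very simple script for mounting Google Drive. Authorization is still required, but the process is much simpler. In my use over the past few days, Drive did not need to be remounted even after restarting the runtime.
  3. Google Colab's hardware has been upgraded substantially, and most of what we do is transfer learning, which needs far less training than starting from scratch. In other words, we usually do not need to train on Google Colab for more than 12 hours.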
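The first point above, saving the install commands so they can be replayed after the VM is reset, could be sketched roughly as follows. This is only an illustration and not part of the original notebook: the helper names `build_setup_commands` and `run_setup` and the example apt package are hypothetical. In a Colab cell the equivalent is simply a saved cell of `!pip install ...` and `!apt-get install ...` lines.

```python
import subprocess
import sys


def build_setup_commands(pip_packages, apt_packages=()):
    """Return the install commands as argument lists (nothing is executed)."""
    commands = []
    if apt_packages:
        commands.append(["apt-get", "install", "-y", *apt_packages])
    if pip_packages:
        commands.append([sys.executable, "-m", "pip", "install", *pip_packages])
    return commands


def run_setup(commands, dry_run=True):
    """Run each command in order; with dry_run=True only print them."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Hypothetical package list; the pip entry is the one installed later in this article.
commands = build_setup_commands(["tensorflow-gpu==2.0.0-alpha0"], ["graphviz"])
run_setup(commands, dry_run=True)
```

Keeping the list in one place means a fresh runtime only needs `run_setup(commands, dry_run=False)` to get back to a working state.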

The latest Google Colab also has tighter GitHub integration, so the Jupyter notebooks we write can be synced to GitHub very easily.

Below I use the small dog-recognition app 識狗君 (see github.com/mogoweb/AID…) to show how to train a deep learning model on Google Colab. In this article you will learn how to:

  • Use TensorFlow 2.0 alpha in Google Colab
  • Mount Google Drive and download the Stanford Dogs dataset to it
  • Load the image set with the TensorFlow Dataset API
  • Use TensorFlow Keras for transfer learning to train a dog classification model

If you have not used Google Colab before, please read the two earlier articles first.

Installing TensorFlow 2.0 alpha

!pip install tensorflow-gpu==2.0.0-alpha0

The runtime originally ships with TensorFlow 1.13. After the upgrade, follow the prompt and click the RESTART RUNTIME button to restart the runtime.

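From the top menu choose Runtime > Change runtime type, then in Notebook settings set Hardware accelerator to GPU. There is also a TPU option, but when I tried it I saw no benefit; I will look into it later.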

You can check that the version is correct and that the GPU is enabled:

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.__version__)
tf.test.is_gpu_available()
tf.test.gpu_device_name()
device_lib.list_local_devices()

The output is as follows:

2.0.0-alpha0
True
'/device:GPU:0'
[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 5989980768649980945, name: "/device:XLA_GPU:0"
 device_type: "XLA_GPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 17264748630327588387
 physical_device_desc: "device: XLA_GPU device", name: "/device:XLA_CPU:0"
 device_type: "XLA_CPU"
 memory_limit: 17179869184
 locality {
 }
 incarnation: 6451752053688610824
 physical_device_desc: "device: XLA_CPU device", name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 14892338381
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 17826891841425573733
 physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"]

As you can see, the GPU is a Tesla T4, a very capable card: 2,560 CUDA cores, 320 Tensor Cores, 8.1 TFLOPS of FP32 performance, and up to 260 TOPS of INT4 (integer) performance.
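If you want the GPU model in a variable rather than reading it off the printout, the `physical_device_desc` string can be split into fields. The sketch below is an illustration only: `parse_device_desc` is a hypothetical helper, and it assumes the description keeps the `key: value, key: value` layout seen in the output above.

```python
def parse_device_desc(desc):
    """Split a 'key: value, key: value' device description into a dict."""
    fields = {}
    for part in desc.split(", "):
        key, _, value = part.partition(": ")
        fields[key.strip()] = value.strip()
    return fields


# Description string copied from the device listing above.
desc = "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"
gpu = parse_device_desc(desc)
print(gpu["name"], gpu["compute capability"])
```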

You can also view CPU and memory information:

!cat /proc/meminfo
!cat /proc/cpuinfo
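The raw `/proc/meminfo` dump is long. If only a figure or two is needed, it can be parsed in Python. The following is a minimal sketch with a hypothetical helper `parse_meminfo`; the sample text uses made-up numbers, not values from a real Colab VM.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


# Made-up sample values for illustration only.
sample = "MemTotal:       13335276 kB\nMemFree:         9000000 kB\nHugePages_Total:       0\n"
info = parse_meminfo(sample)
print(round(info["MemTotal"] / 1024 ** 2, 1), "GiB total")
```

On the Colab VM itself, the input would be `open("/proc/meminfo").read()` instead of the sample string.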

Mounting Google Drive and downloading the dataset

from google.colab import drive

drive.mount('/content/gdrive')

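Running this prints a link. Click it, authorize access, copy the authorization code, paste it into the input box and press Enter, and the mount is done. Convenient, isn't it?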

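Once mounted, the contents of Google Drive are under /content/gdrive/My Drive/. I created an AIDog directory under Colab Notebooks as the project folder, and switch the working directory to it to make later steps easier.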

import os

project_path = '/content/gdrive/My Drive/Colab Notebooks/AIDog'  #change dir to your project folder
os.chdir(project_path)  #change dir
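`os.chdir` raises an error if the folder does not exist yet, so on a first run it can help to create the folder before switching to it. A minimal sketch, with a hypothetical helper name `enter_project_dir`, demonstrated on a temporary folder so it runs anywhere:

```python
import os
import tempfile


def enter_project_dir(path):
    """Create the project folder if needed, then make it the working directory."""
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
    return os.getcwd()


# Demonstrated on a temporary folder; on Colab, pass the Drive project path instead.
demo_root = tempfile.mkdtemp()
current = enter_project_dir(os.path.join(demo_root, "Colab Notebooks", "AIDog"))
print(current)
```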

The current directory is now /content/gdrive/My Drive/Colab Notebooks/AIDog. Next, download and extract the dog dataset:

!wget http://vision.stanford.edu/aditya86/ImageNetDogs/images.tar
!tar xvf images.tar
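Because the project folder lives on Google Drive, the extracted images survive a runtime reset, so the extraction does not need to be repeated in every session. One way to skip it when the data is already there is sketched below. The helper `extract_if_needed` is hypothetical, and the demo builds a tiny stand-in archive rather than touching the real images.tar.

```python
import os
import tarfile
import tempfile


def extract_if_needed(archive_path, dest_dir, marker):
    """Extract archive_path into dest_dir unless dest_dir/marker already exists.

    Returns True if extraction ran, False if it was skipped.
    """
    if os.path.isdir(os.path.join(dest_dir, marker)):
        return False
    with tarfile.open(archive_path) as archive:
        archive.extractall(dest_dir)
    return True


# Self-contained demo with a tiny stand-in archive (not the real dataset).
work = tempfile.mkdtemp()
src = os.path.join(work, "src", "Images")
os.makedirs(src)
with open(os.path.join(src, "dog.txt"), "w") as handle:
    handle.write("placeholder")
archive_path = os.path.join(work, "images.tar")
with tarfile.open(archive_path, "w") as archive:
    archive.add(src, arcname="Images")

dest = os.path.join(work, "project")
os.makedirs(dest)
first = extract_if_needed(archive_path, dest, "Images")   # extracts
second = extract_if_needed(archive_path, dest, "Images")  # skipped, folder exists
print(first, second)
```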

The images are extracted into the Images folder. We can use ls to check that it worked:

!ls ./Images

Using the TensorFlow Dataset API

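The image files are organized into one directory per class. We split the dataset into training and validation sets ourselves; since there are plenty of images, we can set aside 10% for validation: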

import pathlib
import random

import tensorflow as tf

# FLAGS, BATCH_SIZE, AUTOTUNE and build_model are assumed to be defined elsewhere in the notebook.
images_root = pathlib.Path(FLAGS.image_dir)

# Collect every image path and shuffle once before splitting
all_image_paths = list(images_root.glob("*/*"))
all_image_paths = [str(path) for path in all_image_paths]
random.shuffle(all_image_paths)

images_count = len(all_image_paths)
print(images_count)

# 90% for training, 10% for validation
split = int(0.9 * images_count)
train_images = all_image_paths[:split]
validate_images = all_image_paths[split:]

# Each sub-directory name is a class label; map it to an integer index
label_names = sorted(item.name for item in images_root.glob("*/") if item.is_dir())
label_to_index = dict((name, index) for index, name in enumerate(label_names))

train_image_labels = [label_to_index[pathlib.Path(path).parent.name] for path in train_images]
validate_image_labels = [label_to_index[pathlib.Path(path).parent.name] for path in validate_images]

# Training dataset of (image, label) pairs
train_path_ds = tf.data.Dataset.from_tensor_slices(train_images)
train_image_ds = train_path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
train_label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(train_image_labels, tf.int64))
train_image_label_ds = tf.data.Dataset.zip((train_image_ds, train_label_ds))

# Validation dataset of (image, label) pairs
validate_path_ds = tf.data.Dataset.from_tensor_slices(validate_images)
validate_image_ds = validate_path_ds.map(load_and_preprocess_image, num_parallel_calls=AUTOTUNE)
validate_label_ds = tf.data.Dataset.from_tensor_slices(tf.cast(validate_image_labels, tf.int64))
validate_image_label_ds = tf.data.Dataset.zip((validate_image_ds, validate_label_ds))

model = build_model(len(label_names))
num_train = len(train_images)
num_val = len(validate_images)
steps_per_epoch = num_train // BATCH_SIZE
validation_steps = num_val // BATCH_SIZE
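The split, label-index and steps-per-epoch logic above can be tried in isolation, without TensorFlow or the real dataset. The sketch below uses made-up file paths and a hypothetical `BATCH_SIZE`; the helper names `split_paths` and `build_label_index` are illustrative and do not come from the original notebook.

```python
import pathlib
import random


def split_paths(paths, train_fraction=0.9, seed=None):
    """Shuffle a copy of paths and split it into (train, validation) lists."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    split = int(train_fraction * len(shuffled))
    return shuffled[:split], shuffled[split:]


def build_label_index(paths):
    """Map each parent-directory name (the class) to an integer index."""
    names = sorted({pathlib.PurePosixPath(p).parent.name for p in paths})
    return {name: index for index, name in enumerate(names)}


BATCH_SIZE = 8  # hypothetical value for the demo
# Made-up paths: 3 classes with 10 images each.
paths = [f"Images/breed_{b}/img_{i}.jpg" for b in "abc" for i in range(10)]
train, validate = split_paths(paths, seed=0)
label_to_index = build_label_index(paths)
steps_per_epoch = len(train) // BATCH_SIZE  # floor division drops the partial batch
print(len(train), len(validate), label_to_index, steps_per_epoch)
```

Because the floor division ignores the final partial batch, a few images are left out of each epoch's step count; with a repeated dataset they simply roll over into the next epoch.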

The code that loads and preprocesses the images is as follows:

def preprocess_image(image):
  image = tf.image.decode_jpeg(image, channels=3)
  image = tf.image.resize(image, [299, 299])
  image /= 255.0  # normalize to the [0, 1] range
  # then rescale to the [-1, 1] range
  image = 2 * image - 1

  return image

def load_and_preprocess_image(path):
  image = tf.io.read_file(path)
  return preprocess_image(image)
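The two-step normalization above (divide by 255, then `2 * x - 1`) maps 8-bit pixel values onto [-1, 1], which is the same as computing `x / 127.5 - 1`. A small pure-Python check of that arithmetic, using a hypothetical helper `scale_pixel`:

```python
def scale_pixel(value):
    """Map an 8-bit pixel value in [0, 255] to [-1.0, 1.0]."""
    return 2 * (value / 255.0) - 1


for value in (0, 128, 255):
    print(value, scale_pixel(value))
```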

Transfer learning with TensorFlow Keras

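We can do transfer learning on top of Inception V3: keep every layer except the top classification layer and freeze them, then add a global average pooling layer and a softmax output layer: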

def build_model(num_classes):
  # Create the base model from the pre-trained model Inception V3
  base_model = keras.applications.InceptionV3(input_shape=IMG_SHAPE,
                                              # We cannot use the top classification layer of the pre-trained model as it contains 1000 classes.
                                              # It also restricts our input dimensions to that which this model is trained on (default: 299x299)
                                              include_top=False,
                                              weights='imagenet')
  base_model.trainable = False
  # Using Sequential API to stack up the layers
  model = keras.Sequential([
    base_model,
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(num_classes,
                       activation='softmax')
  ])

  # Compile the model to configure training parameters
  model.compile(optimizer='adam',
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  return model
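To make the last layer and the loss less of a black box, here is what `softmax` and `sparse_categorical_crossentropy` compute for a single example, written in plain Python. This is an illustrative sketch with made-up scores for three classes, not the Keras implementation.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(label, probabilities):
    """Loss for one example whose true class is the integer `label`."""
    return -math.log(probabilities[label])


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
loss_if_class_0 = sparse_categorical_crossentropy(0, probs)
loss_if_class_2 = sparse_categorical_crossentropy(2, probs)
print(probs, loss_if_class_0, loss_if_class_2)
```

The "sparse" variant takes the label as an integer index, which is why the label dataset earlier is built from integer class indices rather than one-hot vectors.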

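Next, feed the dataset to the model and train. Note that the example in Google's official documentation shuffles ds, but in practice that operation uses a great deal of memory. Since the earlier code already shuffles the image paths when loading the Images folder, no extra shuffle is needed here: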

model = build_model(len(label_names))
ds = train_image_label_ds
ds = ds.repeat()
ds = ds.batch(BATCH_SIZE)
# `prefetch` lets the dataset fetch batches in the background while the model is training.
ds = ds.prefetch(buffer_size=AUTOTUNE)

model.fit(ds,
          epochs=FLAGS.epochs,
          steps_per_epoch=steps_per_epoch,
          validation_data=validate_image_label_ds.repeat().batch(BATCH_SIZE),
          validation_steps=validation_steps,
          callbacks=[tensorboard_callback, model_checkpoint_callback])
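A rough estimate shows why shuffling the decoded images is so memory-hungry. The sketch below assumes the shuffle buffer holds fully decoded 299x299 RGB images as float32 and that the buffer is as large as the training set (18,522 images, i.e. 90% of the 20,580 images in Stanford Dogs). Both are assumptions for illustration, and `shuffle_buffer_bytes` is a hypothetical helper.

```python
def shuffle_buffer_bytes(buffer_size, height=299, width=299, channels=3, bytes_per_value=4):
    """Approximate memory needed to hold `buffer_size` decoded float32 images."""
    return buffer_size * height * width * channels * bytes_per_value


per_image = shuffle_buffer_bytes(1)
# Assumed buffer size: the whole training split (18,522 images).
full_buffer = shuffle_buffer_bytes(18522)
print(per_image, round(full_buffer / 1e9, 1), "GB")
```

Under these assumptions each image takes about 1 MB and a full-size buffer runs to roughly 20 GB, which is more than the memory of the T4 listed earlier and likely more than the VM has. Shuffling the file paths before decoding sidesteps that cost.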

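On Google Colab one epoch takes a little over a minute. In my experiments, validation accuracy levels off within about 10 epochs, so the whole run takes ten-odd minutes, far below the 12-hour limit.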

W0513 11:13:28.577609 139977708910464 training_utils.py:1353] Expected a shuffled dataset but input dataset `x` is not shuffled. Please invoke `shuffle()` on input dataset.
Epoch 1/10
358/358 [==============================] - 80s 224ms/step - loss: 1.2201 - accuracy: 0.7302 - val_loss: 0.4699 - val_accuracy: 0.8678
Epoch 2/10
358/358 [==============================] - 74s 206ms/step - loss: 0.4624 - accuracy: 0.8692 - val_loss: 0.4347 - val_accuracy: 0.8686
Epoch 3/10
358/358 [==============================] - 75s 209ms/step - loss: 0.3582 - accuracy: 0.8939 - val_loss: 0.4164 - val_accuracy: 0.8726
Epoch 4/10
358/358 [==============================] - 75s 210ms/step - loss: 0.2795 - accuracy: 0.9175 - val_loss: 0.4175 - val_accuracy: 0.8750
Epoch 5/10
358/358 [==============================] - 76s 213ms/step - loss: 0.2275 - accuracy: 0.9357 - val_loss: 0.4174 - val_accuracy: 0.8726
Epoch 6/10
358/358 [==============================] - 75s 210ms/step - loss: 0.1914 - accuracy: 0.9478 - val_loss: 0.4227 - val_accuracy: 0.8702
Epoch 7/10
358/358 [==============================] - 75s 209ms/step - loss: 0.1576 - accuracy: 0.9584 - val_loss: 0.4415 - val_accuracy: 0.8654
Epoch 8/10
358/358 [==============================] - 75s 209ms/step - loss: 0.1368 - accuracy: 0.9639 - val_loss: 0.4364 - val_accuracy: 0.8686
Epoch 9/10
358/358 [==============================] - 76s 211ms/step - loss: 0.1162 - accuracy: 0.9729 - val_loss: 0.4401 - val_accuracy: 0.8686
Epoch 10/10
358/358 [==============================] - 75s 211ms/step - loss: 0.1001 - accuracy: 0.9768 - val_loss: 0.4574 - val_accuracy: 0.8678
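In the log above, training accuracy keeps climbing while validation accuracy hovers around 0.87, so it is worth knowing which epoch actually did best on the validation set. A minimal sketch, with a hypothetical helper `best_epoch` and the `val_accuracy` values copied from the log:

```python
def best_epoch(val_accuracies):
    """Return (1-based epoch, accuracy) of the highest validation accuracy."""
    best_index = max(range(len(val_accuracies)), key=val_accuracies.__getitem__)
    return best_index + 1, val_accuracies[best_index]


# val_accuracy values copied from the ten epochs logged above.
val_accuracy = [0.8678, 0.8686, 0.8726, 0.8750, 0.8726,
                0.8702, 0.8654, 0.8686, 0.8686, 0.8678]
print(best_epoch(val_accuracy))
```

Here the peak is at epoch 4, which suggests that most of the later epochs mainly fit the training set more closely.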

Once training is done, the model can be exported in SavedModel format, ready for deployment with TensorFlow Serving:

keras.experimental.export_saved_model(model, model_path)

That completes the process. The trained model is saved at model_path, and we can download it and use it for dog recognition.
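A SavedModel is a directory rather than a single file, so zipping it first makes the download simpler. The sketch below shows one way to do that with the standard library. `zip_model_dir` is a hypothetical helper, and the demo uses a stand-in directory with placeholder files rather than a real exported model.

```python
import os
import shutil
import tempfile
import zipfile


def zip_model_dir(model_dir, archive_base):
    """Zip model_dir into archive_base + '.zip' and return the archive path."""
    return shutil.make_archive(archive_base, "zip", root_dir=model_dir)


# Demo on a stand-in directory; a real exported model directory would be used instead.
work = tempfile.mkdtemp()
model_dir = os.path.join(work, "saved_model")
os.makedirs(os.path.join(model_dir, "variables"))
with open(os.path.join(model_dir, "saved_model.pb"), "wb") as handle:
    handle.write(b"placeholder")
archive = zip_model_dir(model_dir, os.path.join(work, "aidog_model"))
names = zipfile.ZipFile(archive).namelist()
print(archive, names)
```

On Colab, the resulting archive could then be fetched with `google.colab.files.download(archive)`, or simply left in the Drive project folder.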

The complete source code is available on my Google Drive:

colab.research.google.com/drive/1KSEk…

