[Translation] TensorFlow Wide & Deep Learning Tutorial

Date: 2019-11-24

Original article (link)

Translator: charsdavy

Proofreader: MRNIU

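In the previous TensorFlow Linear Model Tutorial, we trained a logistic regression model on the Census Income dataset to predict the probability that an individual's annual income exceeds 50,000 dollars. TensorFlow is also well suited to training deep neural networks, so you may be wondering which of the two to choose. Why not both? Is it possible to combine the strengths of both in a single model?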

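In this tutorial, we introduce how to use the TF.Learn API to jointly train a wide linear model and a deep feed-forward neural network. This approach combines the strengths of memorization and generalization. It is useful for generic large-scale regression and classification problems with sparse input features (for example, categorical features with a large number of possible values). If you are interested in learning more about how Wide & Deep Learning works, please see the research paper.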

Now, let's look at a simple example.

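The figure above compares a wide model (a logistic regression model with sparse features and transformations), a deep model (a feed-forward neural network with an embedding layer and several hidden layers), and a Wide & Deep model (joint training of both). At a high level, only three steps are needed to configure a wide, deep, or Wide & Deep model with the TF.Learn API: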

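1. Select features for the wide part: choose the sparse base columns and crossed columns you want to use.
2. Select features for the deep part: choose the continuous columns, the embedding dimension for each categorical column, and the hidden layer sizes.
3. Put them all together in a Wide & Deep model (DNNLinearCombinedClassifier).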

Setup

To try the code for this tutorial:

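1. Install TensorFlow by following its installation instructions.
2. Download the tutorial code.
3. Install the pandas data analysis library. tf.learn does not require pandas, but it does support it, and this tutorial uses pandas. To install pandas: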

a. Get pip:

# Ubuntu/Linux 64-bit
$ sudo apt-get install python-pip python-dev

# Mac OS X
$ sudo easy_install pip
$ sudo easy_install --upgrade six

b. Use pip to install pandas:

$ sudo pip install pandas

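If you have trouble installing pandas, consult the instructions on the pandas site.

Run the tutorial code with the following command to train the Wide & Deep model described in this tutorial:

$ python wide_n_deep_tutorial.py --model_type=wide_n_deep

Read on to find out how this code builds its model.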

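Define Base Feature Columns

First, let's define the base categorical and continuous feature columns that we will use. These base columns are the building blocks for both the wide part and the deep part of the model.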

import tensorflow as tf

gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
education = tf.feature_column.categorical_column_with_vocabulary_list(
    "education", [
        "Bachelors", "HS-grad", "11th", "Masters", "9th",
        "Some-college", "Assoc-acdm", "Assoc-voc", "7th-8th",
        "Doctorate", "Prof-school", "5th-6th", "10th", "1st-4th",
        "Preschool", "12th"
    ])
marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    "marital_status", [
        "Married-civ-spouse", "Divorced", "Married-spouse-absent",
        "Never-married", "Separated", "Married-AF-spouse", "Widowed"
    ])
relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    "relationship", [
        "Husband", "Not-in-family", "Wife", "Own-child", "Unmarried",
        "Other-relative"
    ])
workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    "workclass", [
        "Self-emp-not-inc", "Private", "State-gov", "Federal-gov",
        "Local-gov", "?", "Self-emp-inc", "Without-pay", "Never-worked"
    ])

# To show an example of hashing:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)
native_country = tf.feature_column.categorical_column_with_hash_bucket(
    "native_country", hash_bucket_size=1000)

# Continuous base columns.
age = tf.feature_column.numeric_column("age")
education_num = tf.feature_column.numeric_column("education_num")
capital_gain = tf.feature_column.numeric_column("capital_gain")
capital_loss = tf.feature_column.numeric_column("capital_loss")
hours_per_week = tf.feature_column.numeric_column("hours_per_week")

# Transformations.
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
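
To make the bucketization step concrete, here is a minimal pure-Python sketch of how the age boundaries map a raw age to a bucket index. It assumes the left-inclusive, right-exclusive bucket convention that bucketized_column documents. The helper name bucket_index is hypothetical and is not part of TensorFlow.

```python
import bisect

# Same boundaries as age_buckets above.
AGE_BOUNDARIES = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]


def bucket_index(value, boundaries):
    """Return the index of the bucket that contains `value`.

    Buckets are (-inf, b0), [b0, b1), ..., [b_last, +inf), so there are
    len(boundaries) + 1 buckets in total.
    """
    return bisect.bisect_right(boundaries, value)


ages = [17, 18, 24, 25, 64, 65, 90]
indices = [bucket_index(a, AGE_BOUNDARIES) for a in ages]
print(indices)  # [0, 1, 1, 2, 9, 10, 10]
```

Ten boundaries give eleven buckets, so the wide model learns one weight per age range instead of a single weight for the raw age.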

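The Wide Model: Linear Model with Crossed Feature Columns

The wide model is a linear model with a wide set of sparse and crossed feature columns: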

base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]

crossed_columns = [
    tf.feature_column.crossed_column(
        ["education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        [age_buckets, "education", "occupation"], hash_bucket_size=1000),
    tf.feature_column.crossed_column(
        ["native_country", "occupation"], hash_bucket_size=1000)
]

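Wide models with crossed feature columns can memorize sparse interactions between features effectively. That said, one limitation of crossed feature columns is that they do not generalize to feature combinations that never appeared in the training data. Let's add a deep model with embeddings to fix that.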
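
The memorization behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions: cross_bucket and wide_score are hypothetical helpers, the MD5-based hash stands in for whatever fingerprint function TensorFlow uses internally, and the weight value is invented.

```python
import hashlib


def cross_bucket(values, hash_bucket_size):
    """Hash a combination of categorical values into one bucket id."""
    key = "_X_".join(values).encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    return int(digest, 16) % hash_bucket_size


# Pretend training only ever saw this one (education, occupation) pair
# and learned a weight for its bucket. The 0.8 is an invented value.
seen_bucket = cross_bucket(["Bachelors", "Exec-managerial"], 1000)
weights = {seen_bucket: 0.8}


def wide_score(education, occupation):
    """Return the crossed-feature contribution to the linear model."""
    bucket = cross_bucket([education, occupation], 1000)
    # A bucket that was never updated contributes nothing, unless the
    # pair happens to collide with a bucket that was seen.
    return weights.get(bucket, 0.0)


print(wide_score("Bachelors", "Exec-managerial"))
print(wide_score("Doctorate", "Farming-fishing"))
```

The first pair recovers its memorized weight exactly. The second pair has no weight of its own, which is the generalization gap the deep part is meant to close.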

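The Deep Model: Neural Network with Embeddings

The deep model is a feed-forward neural network, as shown in the earlier figure. Each sparse, high-dimensional categorical feature is first converted into a low-dimensional, dense, real-valued vector, usually called an embedding vector. These embedding vectors are concatenated with the continuous features and then fed into the hidden layers of the network in the forward pass. The embedding values are initialized randomly and trained together with all other model parameters to minimize the training loss. If you would like to learn more about embeddings, see the Vector Representations of Words tutorial or the Word Embedding article on Wikipedia.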
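
As a rough illustration of the lookup-and-concatenate step, here is a minimal pure-Python sketch. The table size, the embedding dimension, the initialization range, and the deep_input helper are all assumptions chosen for readability. In the real model the rows would be updated by training.

```python
import random

random.seed(0)

VOCAB_SIZE = 5      # number of distinct category ids (illustrative)
EMBEDDING_DIM = 3   # length of each dense vector (illustrative)

# One randomly initialized row per category id.
embedding_table = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def deep_input(category_id, continuous_features):
    """Look up the embedding row and append the continuous features."""
    return embedding_table[category_id] + list(continuous_features)


# Category id 2 plus two continuous values, e.g. age and education_num.
x = deep_input(2, [39.0, 13.0])
print(len(x))  # 3 embedding values + 2 continuous values = 5
```

The resulting vector is what the first hidden layer would receive for this example.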

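We configure the embeddings for the categorical columns with embedding_column and concatenate them with the continuous columns. The lower-cardinality categorical columns are wrapped in indicator_column, which gives them a multi-hot representation: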

deep_columns = [
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(gender),
    tf.feature_column.indicator_column(relationship),
    # To show an example of embedding
    tf.feature_column.embedding_column(native_country, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
]

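The higher the dimension of the embedding, the more degrees of freedom the model has to learn the representations of the features. For simplicity, we set the dimension to 8 for all feature columns here. Empirically, a more informed choice is to start with a value on the order of $\log_2(n)$ or $k\sqrt[4]{n}$, where n is the number of unique features in a feature column and k is a small constant (usually smaller than 10).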
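
The two rules of thumb are easy to evaluate directly. The sketch below uses n = 1000, which matches the hash_bucket_size used for occupation and native_country rather than a true count of unique values, and k = 2, which is an arbitrary illustrative choice. suggested_dimensions is a hypothetical helper.

```python
import math


def suggested_dimensions(n, k=1.0):
    """Return (log2(n), k * n ** 0.25) as starting points for n unique values."""
    return math.log2(n), k * n ** 0.25


log_rule, root_rule = suggested_dimensions(1000, k=2.0)
print(round(log_rule, 2), round(root_rule, 2))
```

Under these assumptions the rules suggest roughly 10 and 11 dimensions, so the value of 8 used in this tutorial is in the same range.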

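Through dense embeddings, deep models can generalize better and make predictions on feature pairs that were not seen in the training data. However, it is difficult to learn effective low-dimensional representations for feature columns when the underlying interaction matrix between two feature columns is sparse and high-rank. In such cases, the interaction between most feature pairs should be zero except for a few, but dense embeddings lead to nonzero predictions for all feature pairs and can therefore over-generalize. Linear models with crossed features, on the other hand, can memorize these "exception rules" effectively with fewer model parameters.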

Now, let's see how to jointly train the wide and deep models so that each compensates for the other's weaknesses.

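Combining Wide and Deep Models into One

The wide model and the deep model are combined by summing their final output log odds as the prediction, then feeding the prediction to a logistic loss function. All of the graph definition and variable allocation has already been handled for you, so you only need to create a DNNLinearCombinedClassifier: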

import tempfile
model_dir = tempfile.mkdtemp()
m = tf.estimator.DNNLinearCombinedClassifier(
    model_dir=model_dir,
    # The wide part uses both the base columns and the crossed columns.
    linear_feature_columns=base_columns + crossed_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[100, 50])
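
The way the two parts are combined (sum the log odds, then apply a logistic loss) can be written out in a few lines. This is a minimal sketch with hand-picked logits and no bias term, not the estimator's actual implementation. The function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def combined_probability(wide_logit, deep_logit):
    """Sum the two log odds and squash the result into a probability."""
    return sigmoid(wide_logit + deep_logit)


def logistic_loss(label, wide_logit, deep_logit):
    """Binary cross-entropy for a label in {0, 1}."""
    p = combined_probability(wide_logit, deep_logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


p = combined_probability(0.4, -0.4)
loss = logistic_loss(1, 0.4, -0.4)
print(p)     # the two logits cancel, so the probability is 0.5
print(loss)  # equals -log(0.5)
```

Because the loss is computed on the summed logit, its gradient flows into both parts at once, which is what makes the training joint rather than an ensemble of separately trained models.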

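Training and Evaluating the Model

Before we train the model, let's read in the Census dataset as we did in the TensorFlow Linear Model tutorial. The input data processing code is provided here again for your convenience: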

import pandas as pd
import urllib.request

# Define the column names for the dataset.
CSV_COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "gender",
    "capital_gain", "capital_loss", "hours_per_week", "native_country",
    "income_bracket"
]

def maybe_download(train_data, test_data):
  """Maybe downloads training data and returns train and test file names."""
  if train_data:
    train_file_name = train_data
  else:
    train_file = tempfile.NamedTemporaryFile(delete=False)
    urllib.request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        train_file.name)  # pylint: disable=line-too-long
    train_file_name = train_file.name
    train_file.close()
    print("Training data is downloaded to %s" % train_file_name)

  if test_data:
    test_file_name = test_data
  else:
    test_file = tempfile.NamedTemporaryFile(delete=False)
    urllib.request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
        test_file.name)  # pylint: disable=line-too-long
    test_file_name = test_file.name
    test_file.close()
    print("Test data is downloaded to %s" % test_file_name)

  return train_file_name, test_file_name

def input_fn(data_file, num_epochs, shuffle):
  """Input builder function."""
  df_data = pd.read_csv(
      tf.gfile.Open(data_file),
      names=CSV_COLUMNS,
      skipinitialspace=True,
      engine="python",
      skiprows=1)
  # Remove rows that contain NaN values.
  df_data = df_data.dropna(how="any", axis=0)
  labels = df_data["income_bracket"].apply(lambda x: ">50K" in x).astype(int)
  return tf.estimator.inputs.pandas_input_fn(
      x=df_data,
      y=labels,
      batch_size=100,
      num_epochs=num_epochs,
      shuffle=shuffle,
      num_threads=5)
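
One detail in input_fn is worth spelling out: the label is derived with a substring test rather than an equality test. A likely reason, stated here as an assumption about the data files, is that the evaluation file writes its labels with a trailing period (for example ">50K."), so an exact comparison against ">50K" would miss them. The sketch below uses a hypothetical to_label helper and made-up sample strings.

```python
def to_label(income_bracket):
    """Return 1 for the over-50K bracket, 0 otherwise."""
    return int(">50K" in income_bracket)


# Sample strings with and without a trailing period (illustrative).
raw = [">50K", "<=50K", ">50K.", "<=50K."]
labels = [to_label(s) for s in raw]
print(labels)  # [1, 0, 1, 0]
```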

After reading in the data, you can train and evaluate the model:

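# Illustrative setup: empty strings make maybe_download fetch both files,
# and 2000 training steps is an example value, not a tuned setting.
train_file_name, test_file_name = maybe_download("", "")
train_steps = 2000

# Set num_epochs to None to get an infinite stream of data.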
m.train(
    input_fn=input_fn(train_file_name, num_epochs=None, shuffle=True),
    steps=train_steps)
# Set steps to None to run evaluation until all data is consumed.
results = m.evaluate(
    input_fn=input_fn(test_file_name, num_epochs=1, shuffle=False),
    steps=None)
print("model directory = %s" % model_dir)
for key in sorted(results):
  print("%s: %s" % (key, results[key]))