Which API(s) should you use? You should use the highest level of abstraction that solves the problem. The higher levels of abstraction are easier to use, but are also (by design) less flexible. We recommend you start with the highest-level API first and get everything working. If you need additional flexibility for some special modeling concerns, move one level lower. Note that each level is built using the APIs in lower levels, so dropping down the hierarchy should be reasonably straightforward.
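More concise, easier to follow, and more practical than the official documentation!
It first introduces the basic concepts: importing data from a csv file and using a DataFrame to hold and abstract the data (a DataFrame can be thought of as a two-dimensional table). It then covers how to access data; the ways you access a dict or list in Python largely carry over to a DataFrame. After that it explains how to manipulate data: besides applying NumPy functions directly, there is the especially useful Series.apply. Finally, there are several exercises. The first covers inserting columns and operating on whole columns; the second covers reindex, which can reorder the entire data set by reordering its index.
Through a very simple example (using the LinearRegressor class in TensorFlow to predict the median house value, at the granularity of city blocks, from a single input feature), it shows how to use TensorFlow, covering the whole workflow from importing data to training the model and tuning parameters. First, you need to set up a machine learning environment.
Before running the code, download the csv file and place it in the same directory as the .py file.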
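The pandas ideas above can be sketched as follows. This is a minimal illustration on a made-up table (the column names and numbers are placeholders, not the tutorial's data), assuming pandas and NumPy are installed:

```python
import numpy as np
import pandas as pd

# Made-up toy data for illustration only.
cities = pd.DataFrame({
    "city": ["A", "B", "C"],
    "population": [100, 2000, 30],
    "area": [10.0, 50.0, 3.0],
})

# dict-style access returns a column (a Series); list-style indexing works on it.
first_city = cities["city"][0]

# Insert a new column computed from whole columns at once.
cities["density"] = cities["population"] / cities["area"]

# Series.apply runs an arbitrary Python function on every value.
cities["is_large"] = cities["population"].apply(lambda n: n > 1000)

# reindex with a permuted index reorders the rows of the whole table.
shuffled = cities.reindex(np.random.permutation(cities.index))
```

The same reindex-with-permutation pattern is what the tutorial code uses to randomize the housing data after loading it.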
# In this first cell, we'll load the necessary libraries.
import math

from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

tf.logging.set_verbosity(tf.logging.ERROR)
pd.options.display.max_rows = 10
pd.options.display.float_format = '{:.1f}'.format

# Next, we'll load our data set.
california_housing_dataframe = pd.read_csv("california_housing_train.csv", sep=",")

# Randomize the row order and express house values in thousands.
california_housing_dataframe = california_housing_dataframe.reindex(
    np.random.permutation(california_housing_dataframe.index))
california_housing_dataframe["median_house_value"] /= 1000.0
california_housing_dataframe
print(california_housing_dataframe.describe())
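longitude: longitude of the block
latitude: latitude of the block
housing_median_age: median age of the housing in the block
total_rooms: total number of rooms in the block
total_bedrooms: total number of bedrooms in the block
population: population of the block
households: number of households in the block
median_income: median income of the block
median_house_value: median house value of the block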
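Our goal is to predict median_house_value (the median house value of a city block), using total_rooms (the total number of rooms in that block) as the input.
To train our model, we will use the LinearRegressor interface provided by the TensorFlow Estimator API. This API takes care of much of the low-level model plumbing and provides convenient methods for model training, evaluation, and inference (that is, prediction).
To import the training data into TensorFlow, we need to specify the type of data each feature contains. Two main types of data are used in this and future exercises: categorical data (textual values) and numerical data (integers or floating-point values that should be treated as numbers).
In TensorFlow, we indicate a feature's data type with a construct called a feature column. Feature columns store only a description of the feature data; they do not contain the feature data itself.
To start, we will use a single numerical feature as input: total_rooms. The following code extracts the total_rooms data from the DataFrame and defines the feature column with numeric_column, which marks the column's data as numerical: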
# Define the input feature: total_rooms.
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms.
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
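PS. Note the difference between the feature my_feature (the data itself) and the feature columns feature_columns (a description of the data).
Next, we define our target, the median house value median_house_value. Again, we can extract it from the DataFrame: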
# Define the label.
targets = california_housing_dataframe["median_house_value"]
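We will configure a linear regression model with LinearRegressor. We train this model with GradientDescentOptimizer, which implements mini-batch SGD (mini-batch stochastic gradient descent, where each iteration typically uses 10 to 1000 randomly chosen examples). The learning_rate argument controls the size of the gradient step (gradient * learning rate = the distance from one point to the next).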
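To make the update rule concrete, here is a minimal sketch in plain Python, independent of TensorFlow. It fits y = w * x + b on a squared-error loss and, for simplicity, uses the full toy data set in every step rather than a random mini-batch. The function names and the toy data are assumptions for illustration and are not part of the tutorial.

```python
def gradient_step(w, b, xs, ys, learning_rate):
    """One gradient descent update for y = w * x + b under mean squared error."""
    n = len(xs)
    # Gradients of mean((w*x + b - y)^2) with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move against the gradient; the step is learning_rate * gradient.
    return w - learning_rate * grad_w, b - learning_rate * grad_b


def mean_squared_error(w, b, xs, ys):
    """Average squared difference between predictions and targets."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]   # toy inputs
ys = [2.0, 4.0, 6.0, 8.0]   # toy targets, generated by y = 2 * x

w, b = 0.0, 0.0
loss_before = mean_squared_error(w, b, xs, ys)
for _ in range(200):
    w, b = gradient_step(w, b, xs, ys, learning_rate=0.01)
loss_after = mean_squared_error(w, b, xs, ys)
```

A larger learning rate takes bigger steps and may overshoot; a smaller one converges more slowly, which is why the learning rate is a parameter worth tuning.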
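Note: to be safe, we also apply gradient clipping through clip_gradients_by_norm, which ensures that the magnitude of the gradients does not become too large during training, since that can cause gradient descent to fail.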
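The idea behind clipping by norm can be sketched in a few lines of plain Python. This illustrates the general technique (rescale the gradient vector when its L2 norm exceeds a threshold); it is not the implementation behind clip_gradients_by_norm, and the helper name clip_by_norm is hypothetical.

```python
import math


def clip_by_norm(gradients, clip_norm):
    """Rescale a gradient vector so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= clip_norm:
        return list(gradients)  # Small enough: leave unchanged.
    scale = clip_norm / norm
    return [g * scale for g in gradients]  # Same direction, shorter length.


clipped = clip_by_norm([30.0, 40.0], clip_norm=5.0)   # norm 50, gets rescaled
unchanged = clip_by_norm([3.0, 4.0], clip_norm=5.0)   # norm 5, kept as is
```

Because only the length changes and not the direction, a clipped update still moves downhill, just by a bounded amount.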
# Use gradient descent as the optimizer for training the model.
# Set a learning rate of 0.0000001 for Gradient Descent.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)

# Configure the linear regression model with our feature columns and optimizer.
linear_regressor = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    optimizer=my_optimizer
)
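To feed the California housing data into our LinearRegressor, we need to define an input function, which tells TensorFlow how to preprocess the data and how to batch, shuffle, and repeat it during model training.
First, we convert the pandas feature data into a dict of NumPy arrays. We can then use the TensorFlow Dataset API to construct a dataset object from this data, split it into batches of size batch_size, and repeat it for the specified number of epochs (num_epochs).
Note: when num_epochs is left at its default value of None, the input data is repeated indefinitely.
Next, if shuffle is set to True, we shuffle the data so that it is passed to the model in random order during training. The buffer_size argument specifies the size of the buffer from which shuffle randomly samples.
Finally, our input function constructs an iterator for the dataset and returns the next batch of data to the LinearRegressor.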
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    """Feeds batches of data to a linear regression model of one feature.

    Args:
      features: pandas DataFrame of features
      targets: pandas DataFrame of targets
      batch_size: Size of batches to be passed to the model
      shuffle: True or False. Whether to shuffle the data.
      num_epochs: Number of epochs for which data should be repeated. None = repeat indefinitely
    Returns:
      Tuple of (features, labels) for next data batch
    """

    # Convert pandas data into a dict of np arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a dataset, and configure batching/repeating.
    ds = Dataset.from_tensor_slices((features, targets))  # warning: 2GB limit
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels
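For intuition about what batch, repeat, and shuffle(buffer_size=...) do, the following plain-Python sketch mimics them with generators. It is a simplified model under stated assumptions, not the Dataset API: the helper names are hypothetical, and the shuffle assumes the usual buffer-based scheme of sampling uniformly from a fixed-size buffer that is refilled from the stream.

```python
import itertools
import random


def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    it = iter(items)
    while True:
        chunk = list(itertools.islice(it, batch_size))
        if not chunk:
            return
        yield chunk


def repeat(make_iterable, num_epochs=None):
    """Replay the data num_epochs times, or forever when num_epochs is None."""
    epochs = itertools.count() if num_epochs is None else range(num_epochs)
    for _ in epochs:
        yield from make_iterable()


def shuffle(items, buffer_size, rng):
    """Yield items in random order using a bounded buffer."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Buffer is full: emit one randomly chosen buffered element.
            yield buffer.pop(rng.randrange(len(buffer)))
    while buffer:
        # Stream exhausted: drain what is left, still in random order.
        yield buffer.pop(rng.randrange(len(buffer)))


data = list(range(6))
one_epoch = list(batch(data, batch_size=4))
two_epochs = list(repeat(lambda: batch(data, 2), num_epochs=2))
endless = repeat(lambda: batch(data, 2))        # never stops on its own
first_five = list(itertools.islice(endless, 5))
mixed = list(shuffle(data, buffer_size=3, rng=random.Random(0)))
```

In this model, a buffer at least as large as the data set gives a uniform shuffle, while a smaller buffer only mixes elements that are close together in the stream. With num_epochs=None the stream never ends, so the consumer decides when to stop, which is the role of the steps argument during training.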
_ = linear_regressor.train(
    input_fn=lambda: my_input_fn(my_feature, targets),
    steps=100
)
# Create an input function for predictions.
# Note: Since we're making just one prediction for each example, we don't
# need to repeat or shuffle the data here.
prediction_input_fn = lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)

# Call predict() on the linear_regressor to make predictions.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)

# Format predictions as a NumPy array, so we can calculate error metrics.
predictions = np.array([item['predictions'][0] for item in predictions])

# Print Mean Squared Error and Root Mean Squared Error.
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)
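Mean squared error (MSE) can be hard to interpret, so we usually look at its square root, the root mean squared error (RMSE). A nice property of RMSE is that it is on the same scale as the original targets, so the two can be compared directly.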
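As a small worked example of that comparison, the sketch below computes MSE and RMSE by hand for made-up predictions and targets (illustrative numbers only, read as thousands of dollars like the scaled targets) and relates the RMSE to the spread of the targets.

```python
import math

# Made-up values for illustration only.
targets = [100.0, 200.0, 300.0, 400.0]
predictions = [110.0, 190.0, 320.0, 380.0]

squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
mse = sum(squared_errors) / len(squared_errors)   # units: (thousand dollars)^2
rmse = math.sqrt(mse)                             # units: thousand dollars

# RMSE shares the targets' units, so it can be compared with their range.
target_range = max(targets) - min(targets)
rmse_fraction_of_range = rmse / target_range
```

An RMSE that is a large fraction of the range between the smallest and largest target suggests the model is not yet predicting well.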
min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value

print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)