A First Look at Algorithms: Using TensorFlow and the PAI Platform

Date: 2019-12-13

Preface

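I have known the name TensorFlow for a long time, but my understanding of it never went beyond having "heard of it". I once worked on a project for intelligent recognition of wireless image-adaptation problems and implemented the GoogLeNet - Inception V3 network on top of TensorFlow (a deep convolutional neural network; the original GoogLeNet has 22 layers), but that was largely a case of "blind men feeling the elephant" and copying an existing template. Since TensorFlow is one of the most frequently mentioned names in machine learning and deep learning today, it is worth finding out what it actually is.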

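PAI, as a one-stop machine learning and algorithm service platform, greatly simplifies the work of model building, model training, parameter tuning, performance evaluation, and serving, and helps us run algorithm experiments and build applications faster.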

1、A First Look at TensorFlow

1. Installation and startup

Because my MacBook Pro already has Docker installed, setting up a TensorFlow environment is very simple: pulling the official TensorFlow image is all it takes.

# Pull the TensorFlow image
docker pull tensorflow/tensorflow
# Create a TensorFlow working directory to mount into the container
mkdir -p /Users/znifeng/docker-data/tensorflow/notebooks
# Start the container
docker run -it --rm --name myts -v /Users/znifeng/docker-data/tensorflow/notebooks:/notebooks -p 8888:8888 tensorflow/tensorflow

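After the container starts successfully, it prints startup information that includes the Jupyter URL with its access token.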

Copy the link http://127.0.0.1:8888/?token=487c52e0aa0cd2a7b231bf909c1d6666482f8ed03353e510 into a browser to open the Jupyter page (an interactive notebook that supports writing and debugging Python online).

From here you can write and debug TensorFlow code either in Jupyter or inside the Docker container; the container already includes all the TensorFlow libraries.

2. Basic usage

2.1 Core concepts

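A graph represents the computation.
A tensor represents data. Tensors versus vectors: a vector is a rank-1 tensor, while a tensor can have any rank from 0 upward (multi-dimensional).
Each node in the graph is called an op (operation). An op takes zero or more tensors as input, performs a computation, and produces zero or more tensors as output.
A graph is executed within a context called a Session.
Variables maintain state.
Feed and fetch let you assign values to arbitrary operations or read data back from them.
A placeholder defines a slot whose value is passed in at run time.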

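A TensorFlow program is usually organized into a construction phase and an execution phase. In the construction phase, the steps of the ops are described as a graph; in the execution phase, a session executes the ops in the graph. In Python, the returned tensor values are numpy.ndarray objects; in C/C++, they are tensorflow::Tensor instances.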
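The split between construction and execution can be illustrated with a toy graph in plain Python. This is only a sketch of the idea, not how TensorFlow is implemented: the names Node, constant, placeholder, matmul, and run are hypothetical and exist only for this illustration.

```python
class Node(object):
    """One op in a toy computation graph (hypothetical, not a TensorFlow class)."""

    def __init__(self, kind, inputs=(), value=None):
        self.kind = kind            # "constant", "placeholder" or "matmul"
        self.inputs = list(inputs)  # upstream nodes this op reads from
        self.value = value          # only used by constant nodes


def constant(value):
    return Node("constant", value=value)


def placeholder():
    return Node("placeholder")


def matmul(a, b):
    return Node("matmul", inputs=[a, b])


def run(fetch, feed_dict=None):
    """Evaluate the node `fetch`, playing the role of a session run."""
    feed_dict = feed_dict or {}
    if fetch in feed_dict:
        # A fed value overrides whatever the node would compute.
        return feed_dict[fetch]
    if fetch.kind == "constant":
        return fetch.value
    if fetch.kind == "placeholder":
        raise ValueError("placeholder must be fed a value")
    # matmul: evaluate both inputs, then multiply rows by columns.
    left, right = [run(node, feed_dict) for node in fetch.inputs]
    return [[sum(l * r for l, r in zip(row, col)) for col in zip(*right)]
            for row in left]


# Construction phase: nothing is computed here, only nodes are linked.
matrix1 = constant([[3.0, 3.0]])
matrix2 = placeholder()
product = matmul(matrix1, matrix2)

# Execution phase: fetch `product` while feeding a value for the placeholder.
result = run(product, feed_dict={matrix2: [[2.0], [2.0]]})
print(result)
```

Building `product` performs no arithmetic; the multiplication only happens inside `run`, which is the same division of labour that a Session provides.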

2.2 Examples

2.2.1 A first hello-world program:

import tensorflow as tf  # TensorFlow 1.x API (tf.Session)

# Phase 1: build the graph
# Define a 1x2 matrix with elements [3 3]
matrix1 = tf.constant([[3., 3.]])
# Define a 2x1 matrix with elements [2 2]
matrix2 = tf.constant([[2.], [2.]])
# Create a matmul op that takes 'matrix1' and 'matrix2' as input
product = tf.matmul(matrix1, matrix2)

# Phase 2: execute the graph
with tf.Session() as sess:
    print("matrix1: %s" % sess.run(matrix1))
    print("matrix2: %s" % sess.run(matrix2))
    print("result type: %s" % type(sess.run(product)))
    print("result: %s" % sess.run(product))

Output:

matrix1: [[3. 3.]]
matrix2: [[2.]
 [2.]]
result type: <type 'numpy.ndarray'>
result: [[12.]]

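As the code above shows, each statement in phase 1 (building the graph) is an operation: two constant ops and one matrix-multiplication op. The output of every operation is a tensor, and in Python its value comes back from sess.run as a numpy.ndarray. The construction phase only defines the ops and does not execute anything. Only in phase 2, once a Session has been created, are the previously defined operations actually run inside the session and their results returned.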

2.2.2 Recognizing handwritten digits (0-9) with TensorFlow: Softmax Regression

import input_data  # helper script from the TensorFlow MNIST tutorial
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

## Define input data format
# Input images: each image is represented by a 784-dimensional tensor
x = tf.placeholder("float", [None, 784])
# Input labels: each label is a digit in [0, 9], represented by a 10-dimensional one-hot tensor
y_ = tf.placeholder("float", [None, 10])

## Define model and algorithm
# Weights mapping each pixel feature to each predicted digit
W = tf.Variable(tf.zeros([784, 10]))
# Bias of each digit
b = tf.Variable(tf.zeros([10]))
# Predicted probability vector of an image (10 dimensions)
# tf.matmul: matrix multiplication
y = tf.nn.softmax(tf.matmul(x, W) + b)
# Cross-entropy, used as the loss function
# tf.reduce_sum sums over axes: tf.reduce_sum(x, 0) sums along axis 0 (one total per column),
# tf.reduce_sum(x, 1) sums along axis 1 (one total per row), tf.reduce_sum(x, [0, 1]) sums over both.
# With no axis argument, as here, it sums every element.
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
# Gradient descent algorithm
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)

## Train the model
# Initialize all variables
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for i in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

    ## Evaluation
    # tf.argmax(t, 1) returns the index of the largest value along axis 1 (one index per row)
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
    print(sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))

Output:

0.904

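The data used by the model comes from tensorflow/g3doc/tutorials/mnist/. It contains 70,000 images of the digits 0-9, each 28x28 (784) pixels: 55,000 form the training set, 5,000 the validation set, and the remaining 10,000 the test set.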

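Each image can therefore be represented as a 784-dimensional vector whose elements are pixel intensities between 0 and 1. x = tf.placeholder("float", [None,784]) declares the input set x; placeholder marks a slot, and the None dimension is replaced by the actual number of rows fed in at run time. y_ holds the corresponding true labels. Because the label set is 0-9, each label can be written as a ten-dimensional vector [y0,y1,...,y9]; the digit "1", for example, is [0,1,0,0,0,0,0,0,0,0]. The model in this example is linear: it first computes a preliminary result z = xW + b, then maps z through the softmax function to obtain a probability for each digit 0-9.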
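The one-hot label encoding and the softmax mapping described above can be sketched in plain Python. The helper names one_hot and softmax and the sample scores are assumptions made for illustration; they are not part of the original program, which relies on tf.nn.softmax.

```python
import math


def one_hot(digit, num_classes=10):
    """Encode a digit as a vector with a single 1.0 at index `digit`."""
    return [1.0 if i == digit else 0.0 for i in range(num_classes)]


def softmax(z):
    """Map raw scores z to probabilities that are positive and sum to 1."""
    largest = max(z)
    # Subtracting the maximum avoids overflow and does not change the result.
    exps = [math.exp(v - largest) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


label = one_hot(1)

# Hypothetical raw scores z for one image, one score per digit 0-9.
scores = [0.5, 3.0, 0.1, 0.0, -1.0, 0.2, 0.3, 0.0, 1.0, -0.5]
probs = softmax(scores)

print(label)
print(probs.index(max(probs)))  # index of the most probable digit
```

Softmax preserves the ordering of the scores, so the digit with the largest raw score is also the digit with the highest probability.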

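To solve for the model we need a metric that tells us how good it is. In machine learning the convention is to define a metric for how bad the model is and then minimize it to find the best solution; this metric is called the cost or loss function. Here we use cross-entropy as the loss, cross_entropy = -tf.reduce_sum(y_*tf.log(y)), and minimize it with gradient descent, train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy), where 0.01 is the learning rate (the step size of the descent).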
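To make the loss and the descent step concrete, here is a minimal pure-Python sketch for a single three-class sample. It is a simplification and an assumption of this illustration: the descent updates the raw scores z directly rather than the weights W and b, using the standard gradient of softmax cross-entropy with respect to z, which is (softmax(z) - y_true). The function names are hypothetical.

```python
import math


def softmax(z):
    largest = max(z)
    exps = [math.exp(v - largest) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred):
    """-sum(y_true * log(y_pred)); terms where y_true is 0 contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def gradient_step(z, y_true, learning_rate):
    """One gradient descent step on the scores z."""
    p = softmax(z)
    return [zi - learning_rate * (pi - ti) for zi, pi, ti in zip(z, p, y_true)]


y_true = [0.0, 1.0, 0.0]   # one-hot label: the correct class is index 1
z = [0.0, 0.0, 0.0]        # start with no preference between classes

losses = []
for _ in range(100):
    losses.append(cross_entropy(y_true, softmax(z)))
    z = gradient_step(z, y_true, learning_rate=0.01)

print("first loss: %.4f, last loss: %.4f" % (losses[0], losses[-1]))
```

With equal starting scores every class has probability 1/3, so the first loss is log(3); each step then nudges the score of the correct class up and the others down, and the loss shrinks.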

As shown, after 1000 iterations the model reaches 90.4% accuracy on the test set.
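The accuracy figure is the fraction of images whose most probable predicted digit matches the true label. A small plain-Python sketch of that computation follows; the helper names and the four sample predictions are made up for illustration and mirror what tf.argmax, tf.equal, and tf.reduce_mean do in the program.

```python
def argmax(values):
    """Index of the largest element, like tf.argmax along one row."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, labels):
    """Share of rows where the predicted class equals the labelled class."""
    correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
    return sum(correct) / float(len(correct))


# Hypothetical predicted probabilities for four samples over three classes.
predictions = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],  # predicts class 0, but the label below is class 1
]
labels = [
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
]

print(accuracy(predictions, labels))
```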

2.2.3 Recognizing handwritten digits (0-9) with TensorFlow: Deep CNN

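The softmax model in the previous section reached roughly 90% accuracy. Next, let's use TensorFlow to build a convolutional neural network for the same handwritten-digit task.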

import input_data  # helper script from the TensorFlow MNIST tutorial
import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

def weight_variable(shape):
    # Small random initial weights, drawn from a truncated normal distribution
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    # Slightly positive initial bias
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    # Stride 1 with SAME padding keeps the width and height unchanged
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding="SAME")

def max_pool_2x2(x):
    # 2x2 max pooling with stride 2 halves the width and height
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

x = tf.placeholder("float", shape=[None, 784])
y_ = tf.placeholder("float", shape=(None, 10))

# First convolutional layer
W_conv1 = weight_variable([5, 5, 1, 32])
b_conv1 = bias_variable([32])
x_image = tf.reshape(x, [-1, 28, 28, 1])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
h_pool1 = max_pool_2x2(h_conv1)

# Second convolutional layer
W_conv2 = weight_variable([5, 5, 32, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2) + b_conv2)
h_pool2 = max_pool_2x2(h_conv2)

# Densely connected layer
W_fc1 = weight_variable([7 * 7 * 64, 1024])
b_fc1 = bias_variable([1024])
h_pool2_flat = tf.reshape(h_pool2, [-1, 7 * 7 * 64])
h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

# Dropout to reduce overfitting
keep_prob = tf.placeholder("float")
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)

# Output layer: softmax
W_fc2 = weight_variable([1024, 10])
b_fc2 = bias_variable([10])
y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

# Train and evaluate the model
cross_entropy = -tf.reduce_sum(y_ * tf.log(y_conv))
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2000):
        batch = mnist.train.next_batch(50)
        if i % 200 == 0:
            # Dropout is disabled (keep_prob 1.0) when measuring accuracy
            train_accuracy = sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0})
            print("step %d, training accuracy %g" % (i, train_accuracy))
        train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})
    print("final training accuracy %g" % sess.run(accuracy, feed_dict={x: batch[0], y_: batch[1], keep_prob: 1.0}))

Output:

step 0, training accuracy 0.14
step 200, training accuracy 0.92
step 400, training accuracy 0.96
step 600, training accuracy 1
step 800, training accuracy 0.9
step 1000, training accuracy 0.96
step 1200, training accuracy 0.94
step 1400, training accuracy 0.94
step 1600, training accuracy 0.98
step 1800, training accuracy 1
final training accuracy 0.99

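As shown, after 2000 iterations the deep CNN reaches 99% accuracy. Note that this figure is the training accuracy measured on the final batch of 50 images, not accuracy on the test set.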
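The 7 * 7 * 64 input size of the densely connected layer follows from the shapes in the network. The sketch below traces them in plain Python, assuming the SAME-padding rule that the output size is ceil(input size / stride); the helper name same_padding_output is hypothetical.

```python
import math


def same_padding_output(size, stride):
    """Output width or height of a SAME-padded conv or pool layer."""
    return int(math.ceil(size / float(stride)))


size = 28  # MNIST images are 28x28 with 1 channel
shapes = []

# (layer name, stride, output channels) in the order used by the network
layers = [
    ("conv1", 1, 32),
    ("pool1", 2, 32),
    ("conv2", 1, 64),
    ("pool2", 2, 64),
]
for name, stride, channels in layers:
    size = same_padding_output(size, stride)
    shapes.append((name, size, size, channels))
    print("%s -> %dx%dx%d" % (name, size, size, channels))

# Flattened feature count fed into the first fully connected layer
flat_features = size * size * 64
print(flat_features)
```

Each stride-1 convolution keeps the 28x28 (then 14x14) grid, and each 2x2 pooling step halves it, which is why two rounds of pooling leave a 7x7 grid with 64 channels.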

2、Using the PAI Platform

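The previous part introduced the basic concepts and usage of TensorFlow. This part briefly walks through training an LR model with PAI. The overall flow consists of the following steps: offline data development -> building the experiment on the PAI platform -> serving the model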

2.1 Offline data development

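Algorithms are built on data. We start by analyzing the business to identify the features that affect the target outcome, then collect those features to obtain the raw data for each of them. This part can be done in ODPS. In "Intelligent Requirement Risk Identification", for example, it involves joining several tables and extracting and computing metrics; missing values can be filled according to different strategies (such as 0, or the mean of that field). The result is the following training data: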