This article mainly covers:

1. What is an LSTM
Long Short-Term Memory networks (LSTMs) are a kind of recurrent neural network (RNN) capable of learning long-term dependencies. All RNNs have the form of a chain of repeating neural-network modules. In a standard RNN, this repeating module has a very simple structure, for example a single tanh layer.
The figure above shows the standard RNN structure. The LSTM differs; its network structure is shown in the next figure.
The icons for the elements of the network diagram are:
The LSTM removes information from, or adds information to, the cell state through carefully designed structures called "gates". A gate is a way of letting information through selectively: it consists of a sigmoid neural-network layer and a pointwise multiplication. The LSTM has three gates that protect and control the cell state.
First, the forget gate:
As above, note two things about the forget gate: what is trained is a weight matrix W_f, and the previous moment's output is concatenated with the current input. The forget gate decides what information to discard from the cell state: since the sigmoid outputs values below 1, it effectively attenuates each dimension.
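Written out in the standard notation (a restatement of the figure, with $\sigma$ the sigmoid and $[h_{t-1}, x_t]$ the concatenation just mentioned):

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$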
Next is the input gate, which decides what new information enters the cell state:
Here the sigmoid decides which values need updating, and the tanh creates a candidate vector C̃_t for the new cell state; this step trains two weight matrices, W_i and W_C. After the first two gates, the deletions and additions of transmitted information are determined, and the cell state can be updated.
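In the same notation, the two trained transforms and the resulting state update are:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$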
The third gate is the output gate:
A sigmoid determines which part of the cell state will be output; a tanh squashes the cell state to values between -1 and 1, and multiplying this by the sigmoid gate's output yields the part that is actually emitted.
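That is:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t * \tanh(C_t)$$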
2. Curve fitting with an LSTM

2.1 Stock price prediction
Below is a common online example of using an LSTM for stock-price regression. The data:
As shown, each record contains the fields index_code, date, open, close, low, high, volume, money and change. The features from open through change are extracted as the network inputs; the output is the label. The full code:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# constants
rnn_unit = 10      # hidden layer units
input_size = 7
output_size = 1
lr = 0.0006        # learning rate

# —————————————————— load data ——————————————————
f = open('dataset_2.csv')
df = pd.read_csv(f)              # read the stock data
data = df.iloc[:, 2:10].values   # take columns 3-10

# build the training set
def get_train_data(batch_size=60, time_step=20, train_begin=0, train_end=5800):
    batch_index = []
    data_train = data[train_begin:train_end]
    # standardize
    normalized_train_data = (data_train - np.mean(data_train, axis=0)) / np.std(data_train, axis=0)
    train_x, train_y = [], []
    for i in range(len(normalized_train_data) - time_step):
        if i % batch_size == 0:
            batch_index.append(i)
        x = normalized_train_data[i:i + time_step, :7]
        y = normalized_train_data[i:i + time_step, 7, np.newaxis]
        train_x.append(x.tolist())
        train_y.append(y.tolist())
    batch_index.append((len(normalized_train_data) - time_step))
    return batch_index, train_x, train_y

# build the test set
def get_test_data(time_step=20, test_begin=5800):
    data_test = data[test_begin:]
    mean = np.mean(data_test, axis=0)
    std = np.std(data_test, axis=0)
    normalized_test_data = (data_test - mean) / std                   # standardize
    size = (len(normalized_test_data) + time_step - 1) // time_step   # number of samples
    test_x, test_y = [], []
    for i in range(size - 1):
        x = normalized_test_data[i * time_step:(i + 1) * time_step, :7]
        y = normalized_test_data[i * time_step:(i + 1) * time_step, 7]
        test_x.append(x.tolist())
        test_y.extend(y)
    test_x.append((normalized_test_data[(i + 1) * time_step:, :7]).tolist())
    test_y.extend((normalized_test_data[(i + 1) * time_step:, 7]).tolist())
    return mean, std, test_x, test_y

# —————————————————— network variables ——————————————————
# input/output layer weights and biases
weights = {
    'in': tf.Variable(tf.random_normal([input_size, rnn_unit])),
    'out': tf.Variable(tf.random_normal([rnn_unit, 1]))
}
biases = {
    'in': tf.Variable(tf.constant(0.1, shape=[rnn_unit, ])),
    'out': tf.Variable(tf.constant(0.1, shape=[1, ]))
}

# —————————————————— network definition ——————————————————
def lstm(X):
    batch_size = tf.shape(X)[0]
    time_step = tf.shape(X)[1]
    w_in = weights['in']
    b_in = biases['in']
    # flatten to 2-D for the matmul; the result feeds the hidden layer
    input = tf.reshape(X, [-1, input_size])
    input_rnn = tf.matmul(input, w_in) + b_in
    # back to 3-D as the input of the LSTM cell
    input_rnn = tf.reshape(input_rnn, [-1, time_step, rnn_unit])
    cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
    init_state = cell.zero_state(batch_size, dtype=tf.float32)
    # output_rnn holds the output of every step; final_states is the state of the last cell
    output_rnn, final_states = tf.nn.dynamic_rnn(cell, input_rnn, initial_state=init_state, dtype=tf.float32)
    output = tf.reshape(output_rnn, [-1, rnn_unit])   # input of the output layer
    w_out = weights['out']
    b_out = biases['out']
    pred = tf.matmul(output, w_out) + b_out
    return pred, final_states

# —————————————————— training ——————————————————
def train_lstm(batch_size=80, time_step=15, train_begin=2000, train_end=5800):
    X = tf.placeholder(tf.float32, shape=[None, time_step, input_size])
    Y = tf.placeholder(tf.float32, shape=[None, time_step, output_size])
    # samples 2001-5785 of the data, 15 steps at a time
    batch_index, train_x, train_y = get_train_data(batch_size, time_step, train_begin, train_end)
    print(np.array(train_x).shape)   # (3785, 15, 7)
    print(batch_index)
    # i.e. 3785 "sentences" of 15 "words" each, with 7 features (embedding) per word;
    # each step trains on 80 sentences
    pred, _ = lstm(X)
    # loss
    loss = tf.reduce_mean(tf.square(tf.reshape(pred, [-1]) - tf.reshape(Y, [-1])))
    train_op = tf.train.AdamOptimizer(lr).minimize(loss)
    saver = tf.train.Saver(tf.global_variables(), max_to_keep=15)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # 200 training epochs
        for i in range(200):
            # each epoch walks through the batches, batch_size samples at a time
            for step in range(len(batch_index) - 1):
                _, loss_ = sess.run([train_op, loss],
                                    feed_dict={X: train_x[batch_index[step]:batch_index[step + 1]],
                                               Y: train_y[batch_index[step]:batch_index[step + 1]]})
            print(i, loss_)
            if i % 200 == 0:
                print("model saved:", saver.save(sess, 'model/stock2.model', global_step=i))

train_lstm()

# —————————————————— prediction ——————————————————
def prediction(time_step=20):
    X = tf.placeholder(tf.float32, shape=[None, time_step, input_size])
    mean, std, test_x, test_y = get_test_data(time_step)
    # reuse the LSTM variables already created by train_lstm() in this same graph
    # (tf.AUTO_REUSE needs TF >= 1.4)
    with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
        pred, _ = lstm(X)
    saver = tf.train.Saver(tf.global_variables())
    with tf.Session() as sess:
        # restore the parameters
        module_file = tf.train.latest_checkpoint('model')
        saver.restore(sess, module_file)
        test_predict = []
        for step in range(len(test_x) - 1):
            prob = sess.run(pred, feed_dict={X: [test_x[step]]})
            predict = prob.reshape((-1))
            test_predict.extend(predict)
        test_y = np.array(test_y) * std[7] + mean[7]
        test_predict = np.array(test_predict) * std[7] + mean[7]
        # mean relative deviation
        acc = np.average(np.abs(test_predict - test_y[:len(test_predict)]) / test_y[:len(test_predict)])
        # plot the result
        plt.figure()
        plt.plot(list(range(len(test_predict))), test_predict, color='b')
        plt.plot(list(range(len(test_y))), test_y, color='r')
        plt.show()

prediction()
```
The flow is not hard to follow; below we analyze the dimension transformations involved, which deepens the understanding of the LSTM.
The construction of an RNN is best understood through the dimensions of its input tensors. Here we use dynamic_rnn (note the differences in usage from tf.contrib.rnn.static_rnn):
```python
dynamic_rnn(
    cell,
    inputs,
    sequence_length=None,
    initial_state=None,
    dtype=None,
    parallel_iterations=None,
    swap_memory=False,
    time_major=False,
    scope=None
)
```
where:
cell: an RNNCell instance.
inputs: the input of the RNN. If time_major == False (default), the shape is [batch_size, max_time, embedding_size]; if time_major == True, it is [max_time, batch_size, embedding_size].
initial_state: the initial state of the RNN. The network needs one; for a plain RNN its shape is [batch_size, cell.state_size].
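A minimal sketch (with assumed toy sizes) to make this shape contract concrete; it only builds the graph and inspects the static shapes:

```python
import tensorflow as tf

batch_size, max_time, embedding_size, rnn_unit = 4, 20, 7, 10
inputs = tf.placeholder(tf.float32, [batch_size, max_time, embedding_size])
cell = tf.nn.rnn_cell.BasicLSTMCell(rnn_unit)
# for an LSTM the state is an LSTMStateTuple (c, h), each of shape (4, 10)
init_state = cell.zero_state(batch_size, dtype=tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs,
                                         initial_state=init_state,
                                         time_major=False)
print(outputs.get_shape())        # (4, 20, 10): one hidden vector per time step
print(final_state.h.get_shape())  # (4, 10): hidden state after the last step
```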
2.2 Sine curve fitting

For curve fitting with an LSTM, following https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/5-09-RNN3/, we get this code:
```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

BATCH_START = 0     # index used when building the batch data
TIME_STEPS = 20     # time_steps for backpropagation through time
BATCH_SIZE = 50
INPUT_SIZE = 1      # size of the x input
OUTPUT_SIZE = 1     # size of the cos output
CELL_SIZE = 10      # hidden unit size of the RNN
LR = 0.006          # learning rate

# data-generating function
def get_batch():
    # xs shape (50batch, 20steps)
    xs = np.arange(BATCH_START, BATCH_START + TIME_STEPS * BATCH_SIZE).reshape(
        (BATCH_SIZE, TIME_STEPS)) / (10 * np.pi)
    res = np.cos(xs)
    # returned xs and res: shape (batch, step, input)
    return [xs[:, :, np.newaxis], res[:, :, np.newaxis]]

# main body of the LSTM-RNN
class LSTMRNN(object):
    def __init__(self, n_steps, input_size, output_size, cell_size, batch_size):
        self.n_steps = n_steps
        self.input_size = input_size
        self.output_size = output_size
        self.cell_size = cell_size
        self.batch_size = batch_size
        with tf.name_scope('inputs'):
            self.xs = tf.placeholder(tf.float32, [None, n_steps, input_size], name='xs')
            self.ys = tf.placeholder(tf.float32, [None, n_steps, output_size], name='ys')
        with tf.variable_scope('in_hidden'):
            self.add_input_layer()
        with tf.variable_scope('LSTM_cell'):
            self.add_cell()
        with tf.variable_scope('out_hidden'):
            self.add_output_layer()
        with tf.name_scope('cost'):
            self.compute_cost()
        with tf.name_scope('train'):
            self.train_op = tf.train.AdamOptimizer(LR).minimize(self.cost)

    # input layer
    def add_input_layer(self):
        l_in_x = tf.reshape(self.xs, [-1, self.input_size], name='2_2D')   # (batch*n_steps, in_size)
        Ws_in = self._weight_variable([self.input_size, self.cell_size])   # (in_size, cell_size)
        bs_in = self._bias_variable([self.cell_size, ])                    # (cell_size, )
        with tf.name_scope('Wx_plus_b'):
            l_in_y = tf.matmul(l_in_x, Ws_in) + bs_in                      # (batch*n_steps, cell_size)
        # reshape l_in_y ==> (batch, n_steps, cell_size)
        self.l_in_y = tf.reshape(l_in_y, [-1, self.n_steps, self.cell_size], name='2_3D')

    # the cell; note self.cell_init_state, which matters during training
    def add_cell(self):
        lstm_cell = tf.contrib.rnn.BasicLSTMCell(self.cell_size, forget_bias=1.0, state_is_tuple=True)
        with tf.name_scope('initial_state'):
            self.cell_init_state = lstm_cell.zero_state(self.batch_size, dtype=tf.float32)
        self.cell_outputs, self.cell_final_state = tf.nn.dynamic_rnn(
            lstm_cell, self.l_in_y, initial_state=self.cell_init_state, time_major=False)

    # output layer
    def add_output_layer(self):
        l_out_x = tf.reshape(self.cell_outputs, [-1, self.cell_size], name='2_2D')   # (batch*steps, cell_size)
        Ws_out = self._weight_variable([self.cell_size, self.output_size])
        bs_out = self._bias_variable([self.output_size, ])
        with tf.name_scope('Wx_plus_b'):
            self.pred = tf.matmul(l_out_x, Ws_out) + bs_out   # (batch*steps, output_size)

    # the remaining pieces of the RNN
    def compute_cost(self):
        losses = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
            [tf.reshape(self.pred, [-1], name='reshape_pred')],
            [tf.reshape(self.ys, [-1], name='reshape_target')],
            [tf.ones([self.batch_size * self.n_steps], dtype=tf.float32)],
            average_across_timesteps=True,
            softmax_loss_function=self.ms_error,
            name='losses'
        )
        with tf.name_scope('average_cost'):
            self.cost = tf.div(
                tf.reduce_sum(losses, name='losses_sum'),
                self.batch_size,
                name='average_cost')
            tf.summary.scalar('cost', self.cost)

    def ms_error(self, labels, logits):
        return tf.square(tf.subtract(labels, logits))

    def _weight_variable(self, shape, name='weights'):
        initializer = tf.random_normal_initializer(mean=0., stddev=1., )
        return tf.get_variable(shape=shape, initializer=initializer, name=name)

    def _bias_variable(self, shape, name='biases'):
        initializer = tf.constant_initializer(0.1)
        return tf.get_variable(name=name, shape=shape, initializer=initializer)


# train the LSTMRNN
if __name__ == '__main__':
    # build the model
    model = LSTMRNN(TIME_STEPS, INPUT_SIZE, OUTPUT_SIZE, CELL_SIZE, BATCH_SIZE)
    sess = tf.Session()
    saver = tf.train.Saver(max_to_keep=3)
    sess.run(tf.global_variables_initializer())
    t = 0
    if t == 1:
        # restore a saved model and only predict
        model_file = tf.train.latest_checkpoint('model/')
        saver.restore(sess, model_file)
        xs, res = get_batch()               # fetch batch data
        feed_dict = {model.xs: xs}
        pred = sess.run(model.pred, feed_dict=feed_dict)
        xs.shape = (-1, 1)
        res.shape = (-1, 1)
        pred.shape = (-1, 1)
        print(xs.shape, res.shape, pred.shape)
        plt.figure()
        plt.plot(xs, res, '-r')
        plt.plot(xs, pred, '--g')
        plt.show()
    else:
        # matplotlib visualization
        plt.ion()    # continuous plotting
        plt.show()
        # train repeatedly
        for i in range(2500):
            xs, res = get_batch()           # fetch batch data
            feed_dict = {
                model.xs: xs,
                model.ys: res,
            }
            # train
            _, cost, state, pred = sess.run(
                [model.train_op, model.cost, model.cell_final_state, model.pred],
                feed_dict=feed_dict)
            # plotting
            x = xs.reshape(-1, 1)
            r = res.reshape(-1, 1)
            p = pred.reshape(-1, 1)
            plt.clf()
            plt.plot(x, r, 'r', x, p, 'b--')
            plt.ylim((-1.2, 1.2))
            plt.draw()
            plt.pause(0.3)                  # refresh every 0.3 s
            # save the model / print the cost periodically
            if i % 20 == 0:
                saver.save(sess, "model/lstem_text.ckpt", global_step=i)
                print('cost: ', round(cost, 4))
```
An interesting phenomenon can be observed. Below are the plots at two successive moments of training:
Points with small x converge first, while points with large x converge very slowly. The main cause lies in the differentiation done by BPTT, which descends quickly for the earlier time steps; see section 1.2 of http://www.javashuo.com/article/p-tapebxsi-gr.html. The BPTT gradient product behind this effect is sketched below.
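In BPTT, the gradient that the loss at step $t$ sends back to step $k$ contains one Jacobian factor per intervening step, so the longer the chain back in time, the more the gradient magnitude drifts:

$$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$$

Now change the cell construction, wrapping it with tf.contrib.rnn.MultiRNNCell (a stacked-RNN container):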
```python
def add_cell(self):
    lstm_cell = tf.contrib.rnn.BasicLSTMCell(self.cell_size, forget_bias=1.0, state_is_tuple=True)
    lstm_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell], 1)
    with tf.name_scope('initial_state'):
        self.cell_init_state = lstm_cell.zero_state(self.batch_size, dtype=tf.float32)
    self.cell_outputs, self.cell_final_state = tf.nn.dynamic_rnn(
        lstm_cell, self.l_in_y, initial_state=self.cell_init_state, time_major=False)
```
This converges somewhat faster. Still, the problem is mainly caused by overly large x values. Modify the code so that the raw values are fetched in segments:
```python
BATCH_START = 3000   # index used when building the batch data
TIME_STEPS = 20      # time_steps for backpropagation through time
BATCH_SIZE_r = 50
BATCH_SIZE = 10
INPUT_SIZE = 1       # size of the x input
OUTPUT_SIZE = 1      # size of the cos output
CELL_SIZE = 10       # hidden unit size of the RNN
LR = 0.006           # learning rate
ii = 0

# data-generating function: walk the 50-batch range in 5 segments of 10 batches
def get_batch():
    global ii
    xs_r = np.arange(BATCH_START, BATCH_START + TIME_STEPS * BATCH_SIZE_r)
    xs = xs_r[ii * BATCH_SIZE * TIME_STEPS:(ii + 1) * BATCH_SIZE * TIME_STEPS].reshape(
        (BATCH_SIZE, TIME_STEPS)) / (10 * np.pi)
    res = np.cos(xs)
    ii += 1
    if ii == 5:
        ii = 0
    # returned xs and res: shape (batch, step, input)
    return [xs[:, :, np.newaxis], res[:, :, np.newaxis]]
```
Then the convergence of one specific segment can be observed:
```python
# matplotlib visualization
plt.ion()    # continuous plotting
plt.show()
# train repeatedly
for i in range(200):
    xs, res, pred = [], [], []
    for j in range(5):
        xsj, resj = get_batch()   # fetch batch data
        if j != 0:
            continue              # only train and plot the first segment
        feed_dict = {
            model.xs: xsj,
            model.ys: resj,
        }
        # train
        _, cost, state, predj = sess.run(
            [model.train_op, model.cost, model.cell_final_state, model.pred],
            feed_dict=feed_dict)
        # collect values for plotting
        x = list(xsj.reshape(-1, 1))
        r = list(resj.reshape(-1, 1))
        p = list(predj.reshape(-1, 1))
        xs += x
        res += r
        pred += p
    plt.clf()
    plt.plot(xs, res, 'r', x, p, 'b--')
    plt.ylim((-1.2, 1.2))
    plt.draw()
    plt.pause(0.3)   # refresh every 0.3 s
    # save the model / print the cost periodically
    if i % 20 == 0:
        saver.save(sess, "model/lstem_text.ckpt", global_step=i)
        print('cost: ', round(cost, 4))
```
As can be seen, when the interval is set large, e.g. BATCH_START = 3000, it becomes very hard to converge.
Therefore, take care: when doing regression with an LSTM, the observations and the independent variable should not differ too much in scale. When we make the x values smaller, the effect looks as shown:
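One way to keep the input bounded without changing the target is to wrap the raw phase by the period of the cosine; a minimal sketch (an assumed variant of get_batch, not in the original code):

```python
import numpy as np

def get_batch_bounded():
    # same layout as get_batch(), but the value fed to the network is wrapped
    # into [0, 2*pi), so its magnitude stays small however large BATCH_START is
    xs_raw = np.arange(BATCH_START, BATCH_START + TIME_STEPS * BATCH_SIZE).reshape(
        (BATCH_SIZE, TIME_STEPS)) / (10 * np.pi)
    xs = np.mod(xs_raw, 2 * np.pi)   # bounded input; cos(xs) == cos(xs_raw)
    res = np.cos(xs)
    return [xs[:, :, np.newaxis], res[:, :, np.newaxis]]
```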
3. Classification with an LSTM

Classification works essentially the same as regression. Suppose, on top of the sinusoid above, y is labeled 1 where it is greater than 0 and 0 where it is less than 0. The output then becomes an n_class vector (n categories); in this example its two dimensions hold the probability of label 0 and of label 1. The parts that need modifying are:
First, the data-generating function: add a labeling step:
```python
# data-generating function, now with a labeling step
def get_batch():
    # xs shape (50batch, 20steps)
    xs = np.arange(BATCH_START, BATCH_START + TIME_STEPS * BATCH_SIZE).reshape(
        (BATCH_SIZE, TIME_STEPS)) / (200 * np.pi)
    res = np.where(np.cos(4 * xs) >= 0, 0, 1).tolist()
    # one-hot encode: label 1 -> [0, 1], label 0 -> [1, 0]
    for i in range(BATCH_SIZE):
        for j in range(TIME_STEPS):
            res[i][j] = [0, 1] if res[i][j] == 1 else [1, 0]
    # returned xs and res: shape (batch, step, input/output)
    return [xs[:, :, np.newaxis], np.array(res)]
```
Then modify the loss function: the least-squares loss used for regression no longer applies here, so a cross-entropy loss can be used:
```python
def compute_cost(self):
    # flatten the labels to (batch * steps, n_class) so they line up with
    # the shape of self.pred produced by the output layer
    self.cost = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.reshape(self.ys, [-1, self.output_size]),
            logits=self.pred))
```
Of course, just take care of the dimensions (the labels must be flattened to match the logits, as above) and it works. The result is as shown:
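For completeness, a hedged usage sketch (assumed context: the training script from section 2.2, with the output dimension set to the number of classes; this change is implied rather than shown in the original):

```python
# assumed change alongside the two above: one output unit per class
OUTPUT_SIZE = 2  # n_class; model.pred then has shape (batch * steps, 2)

# recover hard labels from the logits after training (sess, model, xs as in the main script)
pred_labels = sess.run(tf.argmax(model.pred, axis=1), feed_dict={model.xs: xs})
print(pred_labels.shape)  # (BATCH_SIZE * TIME_STEPS,)
```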
4. Why the LSTM helps eliminate vanishing gradients

To address the RNN gradient problem, leaky units were proposed first, adding skip connections along the time axis; this idea was later generalized into the LSTM. The LSTM's gate structure provides a way of selecting gradients.
If a gate stays closed, the earlier information is kept intact, which in effect shortens the chain of derivatives.
For example, if for certain input tensors the trained f_t stays at 1, the information in C_{t-1} is carried forward until some input x drives f_t to 0, after which the earlier information no longer matters. This solves the long-term dependency problem. Thanks to the gating mechanism, by controlling how gates open and close we create paths along which the product of gradients stays close to 1.
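Concretely, from the update $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ in section 1, the direct path through the cell state contributes

$$\frac{\partial C_t}{\partial C_{t-1}} = f_t$$

(setting aside the indirect dependence of $f_t$, $i_t$ and $\tilde{C}_t$ on $C_{t-1}$ through $h_{t-1}$); as long as $f_t \approx 1$, the gradient passes through this step unattenuated.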
As above, under gate control, along the paths marked by the red and blue arrows the gradient of y_{t+1} stays equal to the gradient at the previous moment.
As for the "+" operation that combines the input-gate and forget-gate branches, its differentiation is additive rather than multiplicative: that link has gradient 1 and adds no extra factor to the chain. In the derivation that follows, the green path and the blue path are summed, preserving the earlier gradients.
However, while vanishing gradients are thus mitigated, exploding gradients can still occur. For example, along the green path:
the weight w can still cause the gradient to explode.