Paper notes: Hybrid computing using a neural network with dynamic external memory

Date: 2019-11-25

Nature 2016

Updated on 2018-07-21 15:30:31

Paper: http://www.nature.com/nature/journal/vaop/ncurrent/pdf/nature20101.pdf

Code: https://github.com/deepmind/dnc

Slides: http://people.idsia.ch/~rupesh/rnnsymposium2016/slides/graves.pdf

Blog:

1. Official blog: https://deepmind.com/research/dnc/

2. Others:

Applications on CV tasks (for example, 3 CVPR-2018 papers):

1. VQA: http://openaccess.thecvf.com/content_cvpr_2018/papers/Ma_Visual_Question_Answering_CVPR_2018_paper.pdf

2. One-Shot Image Recognition: http://openaccess.thecvf.com/content_cvpr_2018/papers/Cai_Memory_Matching_Networks_CVPR_2018_paper.pdf

3. Video Caption: http://openaccess.thecvf.com/content_cvpr_2018/papers/Wang_M3_Multimodal_Memory_CVPR_2018_paper.pdf

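Abstract: Artificial neural networks have been remarkably successful at sensory processing, sequence learning and reinforcement learning, but they are limited in their ability to represent variables and data structures and to store knowledge over long timescales, because they lack an external memory. Here we introduce a machine learning model called a differentiable neural computer (DNC), which consists of a neural network that can read from and write to an external memory matrix, analogous to the random-access memory in a conventional computer. Like a conventional computer, it can use its memory to represent and manipulate complex data structures, but, like a neural network, it can learn to do so from data. When trained with supervised learning, we show that a DNC can successfully answer synthetic questions designed to emulate reasoning and inference problems in natural language. We show that it can learn tasks such as finding the shortest path between specified points and inferring the missing links in randomly generated graphs, and then generalize these tasks to specific graphs such as transport networks and family trees. When trained with reinforcement learning, a DNC can complete a moving-blocks puzzle. Taken together, our results show that DNCs can solve complex, structured tasks that are out of reach for neural networks without external read-write memory.

Introduction:

Although recent breakthroughs show that neural networks are highly adaptable in signal processing, sequence learning and reinforcement learning, cognitive scientists and neuroscientists have argued that neural networks are limited in their ability to represent variables and data structures, and to store data over long timescales without interference. We aim to combine the advantages of neural and computational processing by providing a neural network with read-write access to external memory. The whole system is differentiable and can therefore be trained end-to-end with gradient descent, allowing the network to learn how to operate and organize the memory in a goal-directed manner.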

System Overview：

A DNC is a neural network coupled to an external memory matrix.

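If the memory can be thought of as the DNC's RAM, then the network, referred to as the "controller", is a differentiable CPU whose operations are learned with gradient descent. The DNC architecture differs from recent neural memory frameworks in that the memory can be selectively written to as well as read, allowing iterative modification of memory content.

Whereas a conventional computer uses unique addresses to access memory contents, a DNC uses differentiable attention mechanisms to define distributions over the N rows, or "locations", in the N*W memory matrix M. These distributions, which we call weightings, represent the degree to which each location is involved in a read or write operation.

The functional units that determine and apply the weightings are called read and write heads. The operation of the heads is illustrated in Fig. 1.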

Interaction between the heads and the memory

The heads use three distinct forms of differentiable attention.

The first is content lookup, in which a key vector emitted by the controller is compared to the content of each location in memory according to a similarity measure (here, cosine similarity).

The second attention mechanism records transitions between consecutively written locations in an N*N temporal link matrix L.

The third form of attention allocates memory for writing.

The design of the attention mechanisms was motivated largely by computational considerations:

Content lookup enables the formation of associative data structures;

temporal links enable sequential retrieval of input sequences;

allocation provides the write head with unused locations.

METHODS

Controller Networks.

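At every time-step t the controller network N receives an input vector x_t from the dataset or environment and emits an output vector y_t that parameterizes either a predictive distribution for a target vector z_t (in supervised learning) or an action distribution (in reinforcement learning). In addition, the controller receives a set of R read vectors from the memory matrix M_{t-1} at the previous time-step, via the read heads. It then emits an interface vector that defines its interactions with the memory at the current time-step. For notational convenience, we concatenate the input and the read vectors into a single controller input vector $X_t = [x_t; r_{t-1}^1; \ldots; r_{t-1}^R]$. Any neural network can be used as the controller, but here we use the following variant of a deep LSTM architecture:

$$i_t^l = \sigma(W_i^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_i^l)$$

$$f_t^l = \sigma(W_f^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_f^l)$$

$$s_t^l = f_t^l s_{t-1}^l + i_t^l \tanh(W_s^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_s^l)$$

$$o_t^l = \sigma(W_o^l [X_t; h_{t-1}^l; h_t^{l-1}] + b_o^l)$$

$$h_t^l = o_t^l \tanh(s_t^l)$$

where l is the layer index, $\sigma$ is the logistic sigmoid function, and i, f, s, o and h denote the input gate, forget gate, state (i.e. the regular cell), output gate and hidden state, respectively. The W and b terms are learnable weights and biases, with $h_t^0 = 0$ for all t and $h_0^l = s_0^l = 0$ for all l.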

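At every time-step the controller emits an output vector $\upsilon_t$ and an interface vector $\xi_t$, both computed from the hidden states of all L layers:

$$\upsilon_t = W_y [h_t^1; \ldots; h_t^L], \qquad \xi_t = W_\xi [h_t^1; \ldots; h_t^L]$$

Assuming the controller network is recurrent, its outputs are functions of the complete history $(X_1, X_2, \ldots, X_t)$, so the operation of the controller can be encapsulated as:

$$(\upsilon_t, \xi_t) = \mathcal{N}([X_1; \ldots; X_t]; \theta)$$

where $\theta$ is the set of trainable network weights. It is possible to use a feedforward controller, in which case N is a function of X_t only; however, we use only recurrent controllers in this paper.

Finally, the output vector y_t is defined by adding $\upsilon_t$ to a vector obtained by passing the concatenation of the current read vectors through the RW*Y weight matrix W_r:

$$y_t = \upsilon_t + W_r [r_t^1; \ldots; r_t^R]$$

This arrangement allows the DNC to condition its output decisions on memory that has just been read. It would not be possible to pass this information back to the controller, and use it to determine $\upsilon_t$, without creating a cycle in the computation graph.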
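
The output step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `dnc_output` and the choice to store W_r as a list of R*W rows of length Y (so that the read vectors multiply it from the left) are mine, not taken from the paper or from the DeepMind code.

```python
def dnc_output(controller_out, W_r, read_vectors):
    """Add a linear projection of the current read vectors to the controller output.

    controller_out: list of length Y (the controller's output vector)
    W_r:            list of R*W rows, each of length Y (assumed storage layout)
    read_vectors:   list of R read vectors, each of length W
    """
    # Concatenate the R read vectors into one vector of length R*W.
    r = [x for vec in read_vectors for x in vec]
    # y[k] = controller_out[k] + sum_j r[j] * W_r[j][k]
    return [controller_out[k] + sum(r[j] * W_r[j][k] for j in range(len(r)))
            for k in range(len(controller_out))]
```

Because the read vectors enter only at this last step, the final output can depend on what was just read without feeding that information back into the controller within the same time-step.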

Interface parameters:

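Before being used to parameterize the memory interactions, the interface vector $\xi_t$ is subdivided as follows:

$$\xi_t = [k_t^{r,1}; \ldots; k_t^{r,R}; \hat{\beta}_t^{r,1}; \ldots; \hat{\beta}_t^{r,R}; k_t^w; \hat{\beta}_t^w; \hat{e}_t; v_t; \hat{f}_t^1; \ldots; \hat{f}_t^R; \hat{g}_t^a; \hat{g}_t^w; \hat{\pi}_t^1; \ldots; \hat{\pi}_t^R]$$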

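Each component is then processed by a function chosen to ensure that it lies in the correct domain:

1. the logistic sigmoid function is used to constrain values to [0, 1];

2. the "oneplus" function is used to constrain values to [1, ∞), where

　　　　oneplus(x) = 1 + log(1 + e^x)

3. the softmax function is used to constrain vectors to S_N, the (N-1)-dimensional unit simplex:

$$S_N = \left\{ \alpha \in \mathbb{R}^N \;\middle|\; \alpha_i \in [0, 1], \; \sum_{i=1}^{N} \alpha_i = 1 \right\}$$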
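
The three squashing functions listed above are small enough to write out directly. The following is a minimal sketch using only the standard library; the function names are my own, and the sigmoid here is not guarded against overflow for very large negative inputs.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps a real number into the interval [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))

def oneplus(x):
    """oneplus(x) = 1 + log(1 + e^x): maps a real number into [1, inf)."""
    return 1.0 + math.log1p(math.exp(x))

def softmax(xs):
    """Maps a list of reals to non-negative values that sum to 1."""
    m = max(xs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `sigmoid` suits gate values, `oneplus` suits strengths that must be at least 1, and `softmax` suits vectors that must lie on the unit simplex.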

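After processing we have the following set of scalars and vectors:

1. R read keys $k_t^{r,i} \in \mathbb{R}^W$;

2. R read strengths $\beta_t^{r,i} = \mathrm{oneplus}(\hat{\beta}_t^{r,i}) \in [1, \infty)$;

3. the write key $k_t^w \in \mathbb{R}^W$;

4. the write strength $\beta_t^w = \mathrm{oneplus}(\hat{\beta}_t^w) \in [1, \infty)$;

5. the erase vector $e_t = \sigma(\hat{e}_t) \in [0, 1]^W$;

6. the write vector $v_t \in \mathbb{R}^W$;

7. R free gates $f_t^i = \sigma(\hat{f}_t^i) \in [0, 1]$;

8. the allocation gate $g_t^a = \sigma(\hat{g}_t^a) \in [0, 1]$;

9. the write gate $g_t^w = \sigma(\hat{g}_t^w) \in [0, 1]$;

10. R read modes $\pi_t^i = \mathrm{softmax}(\hat{\pi}_t^i) \in S_3$.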

Reading and Writing to memory

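Locations are selected for reading and writing via weightings: vectors whose elements are non-negative and sum to at most 1. The complete set of allowed weightings over N locations is the non-negative orthant of R^N with the unit simplex as a boundary (known as the "corner of the cube"):

$$\Delta_N = \left\{ \alpha \in \mathbb{R}^N \;\middle|\; \alpha_i \in [0, 1], \; \sum_{i=1}^{N} \alpha_i \le 1 \right\}$$

For the read operation, R read weightings $w_t^{r,1}, \ldots, w_t^{r,R} \in \Delta_N$ are used to compute weighted averages of the memory contents, which defines the read vectors as:

$$r_t^i = M_t^\top w_t^{r,i}$$

The read vectors are appended to the controller input at the next time-step, giving it access to the memory contents.

The write operation is mediated by a single write weighting $w_t^w \in \Delta_N$, which is used together with an erase vector $e_t \in [0, 1]^W$ and a write vector $v_t \in \mathbb{R}^W$ to modify the memory:

$$M_t = M_{t-1} \circ (E - w_t^w e_t^\top) + w_t^w v_t^\top$$

where $\circ$ denotes element-wise multiplication and E is an N*W matrix of ones.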
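
To make the read and write operations concrete, here is a minimal sketch in plain Python, with the memory stored as a list of N rows of length W. The function names are illustrative, and the formulas are the standard DNC read (a weighted sum of memory rows) and erase-then-add write; treat it as a sketch rather than the reference implementation.

```python
def read_vectors(M, read_weightings):
    """One read vector per head: r[j] = sum_n w[n] * M[n][j]."""
    width = len(M[0])
    return [[sum(w[n] * M[n][j] for n in range(len(M))) for j in range(width)]
            for w in read_weightings]

def write_memory(M, w, erase, write):
    """Erase then add: M[n][j] * (1 - w[n] * erase[j]) + w[n] * write[j]."""
    return [[M[n][j] * (1.0 - w[n] * erase[j]) + w[n] * write[j]
             for j in range(len(M[0]))]
            for n in range(len(M))]
```

With a one-hot write weighting and an erase vector of ones, the addressed row is fully overwritten by the write vector and all other rows are left untouched; softer weightings blend old and new content.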

Memory Addressing：

The system uses content-based addressing and dynamic memory allocation to determine where to write in memory, and content-based addressing and temporal memory linkage to determine where to read.

These mechanisms are described in turn below:

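Content-based addressing. All content lookup operations on the memory M use the following function:

$$C(M, k, \beta)[i] = \frac{\exp\{D(k, M[i, \cdot])\,\beta\}}{\sum_j \exp\{D(k, M[j, \cdot])\,\beta\}}$$

where k is a lookup key, $\beta \in [1, \infty)$ is a scalar key strength, and D is the cosine similarity:

$$D(u, v) = \frac{u \cdot v}{|u|\,|v|}$$

The weighting $C(M, k, \beta) \in S_N$ defines a normalized probability distribution over the memory locations.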
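
A content lookup of this kind can be sketched as a softmax over cosine similarities scaled by the key strength. The names `cosine_similarity` and `content_weighting`, and the small `eps` added to the denominator to avoid division by zero for all-zero rows, are assumptions of this sketch.

```python
import math

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between u and v; eps guards against zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)

def content_weighting(M, key, beta):
    """Softmax over memory rows of beta * cosine_similarity(key, row)."""
    scores = [beta * cosine_similarity(key, row) for row in M]
    m = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A larger key strength sharpens the distribution around the best-matching rows, while a strength close to 1 spreads the weighting more evenly.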

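Dynamic memory allocation. To allow the controller to free and allocate memory as needed, we developed a differentiable analogue of the "free list" memory allocation scheme, in which a list of available memory locations is maintained by adding addresses to and removing addresses from a linked list. Let $u_t \in [0, 1]^N$ be the memory usage vector at time t, with $u_0 = 0$. Before writing to memory, the controller emits a set of free gates $f_t^i$, one per read head, that determine whether the most recently read locations can be freed. The memory retention vector $\psi_t \in [0, 1]^N$ represents how much each location will not be freed by the free gates, and is defined as:

$$\psi_t = \prod_{i=1}^{R} \left( \mathbf{1} - f_t^i \, w_{t-1}^{r,i} \right)$$

The usage vector can then be defined as:

$$u_t = \left( u_{t-1} + w_{t-1}^w - u_{t-1} \circ w_{t-1}^w \right) \circ \psi_t$$

Intuitively, locations are used if they have been retained by the free gates, and were either already in use or have just been written to. Every write to a location increases its usage, up to a maximum of 1, and usage can only be decreased afterwards by the free gates; the elements of u_t are therefore bounded in [0, 1]. Once u_t has been determined, the free list $\phi_t$ is defined by sorting the indices of the memory locations in ascending order of usage; $\phi_t[1]$ is therefore the index of the least used location. The allocation weighting a_t, which provides new locations for writing, is defined as:

$$a_t[\phi_t[j]] = \left( 1 - u_t[\phi_t[j]] \right) \prod_{i=1}^{j-1} u_t[\phi_t[i]] \qquad (1)$$

If all usages are 1, then a_t = 0 and the controller can no longer allocate memory unless it first frees used locations.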
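
The allocation scheme described above can be sketched as three small functions: retention from the free gates, the usage update, and the allocation weighting computed by walking the locations from least to most used. The function names are illustrative, and the formulas are the standard DNC retention, usage and allocation equations.

```python
def retention(free_gates, prev_read_weightings):
    """psi[n] = product over read heads of (1 - free_gate * read_weighting[n])."""
    psi = [1.0] * len(prev_read_weightings[0])
    for gate, w in zip(free_gates, prev_read_weightings):
        psi = [p * (1.0 - gate * wn) for p, wn in zip(psi, w)]
    return psi

def update_usage(prev_usage, prev_write_weighting, psi):
    """u[n] = (u_prev + w_prev - u_prev * w_prev) * psi, element-wise."""
    return [(u + w - u * w) * p
            for u, w, p in zip(prev_usage, prev_write_weighting, psi)]

def allocation_weighting(usage):
    """Favour the least used locations; all zeros when every usage is 1."""
    free_list = sorted(range(len(usage)), key=lambda n: usage[n])
    alloc = [0.0] * len(usage)
    running_product = 1.0  # product of the usages of all less-used locations
    for n in free_list:
        alloc[n] = (1.0 - usage[n]) * running_product
        running_product *= usage[n]
    return alloc
```

For instance, with usages `[1.0, 0.0, 0.5]` the completely free location receives the whole allocation weighting, and the half-used location only receives weight once every less-used location is itself partly occupied.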

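Write weightings: The controller can write to newly allocated locations, or to locations addressed by content, or it can choose not to write at all. First, a write content weighting c_t^w is constructed from the write key k_t^w and the write strength $\beta_t^w$:

$$c_t^w = C(M_{t-1}, k_t^w, \beta_t^w)$$

c_t^w is interpolated with the allocation weighting a_t defined in equation (1) to determine a write weighting w_t^w:

$$w_t^w = g_t^w \left[ g_t^a \, a_t + (1 - g_t^a) \, c_t^w \right]$$

where g_t^a is the allocation gate governing the interpolation and g_t^w is the write gate. If the write gate is 0, nothing is written, regardless of the other write parameters; it can therefore be used to protect the memory from unnecessary modifications.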
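
The gating logic above amounts to a convex interpolation followed by a scaling. A minimal sketch, with an illustrative function name and both gates assumed to lie in [0, 1]:

```python
def write_weighting(alloc, content, alloc_gate, write_gate):
    """Interpolate between allocation and content weightings, then scale by the write gate."""
    return [write_gate * (alloc_gate * a + (1.0 - alloc_gate) * c)
            for a, c in zip(alloc, content)]
```

Setting `write_gate` to 0 yields an all-zero weighting, so the memory is left unchanged; setting `alloc_gate` to 1 writes purely to newly allocated locations, and setting it to 0 writes purely by content.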

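Temporal memory linkage: The memory allocation system stores no information about the order in which the memory locations are written to. However, this ordering information is often useful: for example, when a sequence of instructions must be recorded and retrieved in order. We therefore use a temporal link matrix L_t to keep track of consecutively modified memory locations.

L_t[i, j] represents the degree to which location i was the location written to after location j, and each row and column of L_t defines a weighting over locations:

$$L_t[i, \cdot] \in \Delta_N, \qquad L_t[\cdot, j] \in \Delta_N$$

To define L_t, we need a precedence weighting p_t, where element p_t[i] represents the degree to which location i was the last one written to. p_t is defined by the recurrence relation:

$$p_0 = \mathbf{0}, \qquad p_t = \left( 1 - \sum_i w_t^w[i] \right) p_{t-1} + w_t^w$$

Every time a location is modified, the link matrix is updated to remove old links to and from that location, and new links from the last-written location are added. We use the following recurrence relation to implement this logic:

$$L_0[i, j] = 0, \qquad L_t[i, i] = 0, \qquad L_t[i, j] = \left( 1 - w_t^w[i] - w_t^w[j] \right) L_{t-1}[i, j] + w_t^w[i] \, p_{t-1}[j]$$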
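
The bookkeeping for temporal links can be sketched with two update functions. The names are mine, and the recurrences follow the standard DNC definitions: the link matrix is updated with the precedence weighting from the previous step, after which the precedence weighting itself is refreshed.

```python
def update_links(prev_links, prev_precedence, w):
    """Decay links touching written locations, then link them from the last-written ones."""
    n_loc = len(w)
    return [[0.0 if i == j else
             (1.0 - w[i] - w[j]) * prev_links[i][j] + w[i] * prev_precedence[j]
             for j in range(n_loc)]
            for i in range(n_loc)]

def update_precedence(prev_precedence, w):
    """p = (1 - sum(w)) * p_prev + w: a full write replaces p, no write keeps it."""
    keep = 1.0 - sum(w)
    return [keep * p + wn for p, wn in zip(prev_precedence, w)]
```

Writing fully to location 0 and then fully to location 1 leaves a single link of strength 1 at entry `[1][0]`, recording that location 1 was written after location 0.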

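Self-links are excluded (the diagonal of the link matrix is always 0), because it is unclear how to follow a transition from a location to itself. The rows and columns of L_t represent the weights of the temporal links going into and out from particular memory slots, respectively. Given L_t, the backward weighting b_t^i and the forward weighting f_t^i for read head i are defined as:

$$b_t^i = L_t^\top w_{t-1}^{r,i}, \qquad f_t^i = L_t \, w_{t-1}^{r,i}$$

where $w_{t-1}^{r,i}$ is the i-th read weighting from the previous time-step.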
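
Following the links forwards or backwards is a matrix-vector product with the link matrix or its transpose. A minimal sketch for a single read head, assuming the link matrix is a list of rows in which entry `[i][j]` is the degree to which location i was written after location j; the function name is illustrative.

```python
def forward_backward(links, prev_read_weighting):
    """Return (forward, backward) weightings for one read head."""
    n_loc = len(links)
    # forward[i] = sum_j links[i][j] * w[j]: shift attention to what was written next
    forward = [sum(links[i][j] * prev_read_weighting[j] for j in range(n_loc))
               for i in range(n_loc)]
    # backward[i] = sum_j links[j][i] * w[j]: shift attention to what was written before
    backward = [sum(links[j][i] * prev_read_weighting[j] for j in range(n_loc))
                for i in range(n_loc)]
    return forward, backward
```

If location 1 was written right after location 0, a head that last read location 0 gets a forward weighting focused on location 1, and a head that last read location 1 gets a backward weighting focused on location 0.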

Sparse link matrix. The link matrix is N*N and therefore requires O(N²) resources in both memory and computation to calculate exactly.

Read weighting.