Translation: SEMI-SUPERVISED CLASSIFICATION WITH GRAPH CONVOLUTIONAL NETWORKS

Link to the original paper: Semi-Supervised Classification with Graph Convolutional Networks

0. ABSTRACT

We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized first-order approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.

1. INTRODUCTION

We consider the problem of classifying nodes (such as documents) in a graph (such as a citation network), where labels are only available for a small subset of nodes. This problem can be framed as graph-based semi-supervised learning, where label information is smoothed over the graph via some form of explicit graph-based regularization (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2006; Weston et al., 2012), e.g. by using a graph Laplacian regularization term in the loss function:

$$\mathcal{L} = \mathcal{L}_0 + \lambda \mathcal{L}_{\mathrm{reg}}, \quad \text{with} \quad \mathcal{L}_{\mathrm{reg}} = \sum_{i,j} A_{ij} \left\| f(X_i) - f(X_j) \right\|^2 = f(X)^\top \Delta f(X) \qquad (1)$$

Here, $\mathcal{L}_0$ denotes the supervised loss w.r.t. the labeled part of the graph, $f(\cdot)$ can be a neural network-like differentiable function, $\lambda$ is a weighting factor and $X$ is a matrix of node feature vectors $X_i$. $\Delta = D - A$ denotes the unnormalized graph Laplacian of an undirected graph $G = (\mathcal{V}, \mathcal{E})$ with $N$ nodes $v_i \in \mathcal{V}$, edges $(v_i, v_j) \in \mathcal{E}$, an adjacency matrix $A \in \mathbb{R}^{N \times N}$ (binary or weighted) and a degree matrix $D_{ii} = \sum_j A_{ij}$. The formulation of Eq. 1 relies on the assumption that connected nodes in the graph are likely to share the same label. This assumption, however, might restrict modeling capacity, as graph edges need not necessarily encode node similarity, but could contain additional information.
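
To make Eq. (1) concrete, the following NumPy sketch evaluates the graph Laplacian regularization term on a toy graph; the adjacency matrix, the random features, and the tanh mapping standing in for $f(\cdot)$ are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Toy undirected graph on 4 nodes (binary adjacency matrix); values are made up.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))       # degree matrix D_ii = sum_j A_ij
Delta = D - A                    # unnormalized graph Laplacian

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))  # node feature vectors X_i (4 nodes, 3 features)
W = rng.standard_normal((3, 1))  # stand-in parameters for a differentiable model f

f_X = np.tanh(X @ W)             # f(X): one real-valued prediction per node

# Graph Laplacian regularization term f(X)^T Delta f(X) from Eq. (1):
# it is large when connected nodes receive very different predictions.
L_reg = (f_X.T @ Delta @ f_X).item()
print("Laplacian regularizer:", L_reg)
```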

In this work, we encode the graph structure directly using a neural network model $f(X, A)$ and train on a supervised target $\mathcal{L}_0$ for all nodes with labels, thereby avoiding explicit graph-based regularization in the loss function. Conditioning $f(\cdot)$ on the adjacency matrix of the graph will allow the model to distribute gradient information from the supervised loss $\mathcal{L}_0$ and will enable it to learn representations of nodes both with and without labels.

Our contributions are two-fold. Firstly, we introduce a simple and well-behaved layer-wise propagation rule for neural network models which operate directly on graphs and show how it can be motivated from a first-order approximation of spectral graph convolutions (Hammond et al., 2011). Secondly, we demonstrate how this form of a graph-based neural network model can be used for fast and scalable semi-supervised classification of nodes in a graph. Experiments on a number of datasets demonstrate that our model compares favorably both in classification accuracy and efficiency (measured in wall-clock time) against state-of-the-art methods for semi-supervised learning.

2. FAST APPROXIMATE CONVOLUTIONS ON GRAPHS

In this section, we provide theoretical motivation for a specific graph-based neural network model $f(X, A)$ that we will use in the rest of this paper. We consider a multi-layer Graph Convolutional Network (GCN) with the following layer-wise propagation rule:

$$H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \right) \qquad (2)$$

Here, $\tilde{A} = A + I_N$ is the adjacency matrix of the undirected graph $G$ with added self-connections. $I_N$ is the identity matrix, $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ and $W^{(l)}$ is a layer-specific trainable weight matrix. $\sigma(\cdot)$ denotes an activation function, such as $\mathrm{ReLU}(\cdot) = \max(0, \cdot)$. $H^{(l)} \in \mathbb{R}^{N \times D}$ is the matrix of activations in the $l$-th layer; $H^{(0)} = X$. In the following, we show that the form of this propagation rule can be motivated via a first-order approximation of localized spectral filters on graphs (Hammond et al., 2011; Defferrard et al., 2016).
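
As an illustration of Eq. (2), here is a minimal NumPy sketch of a single propagation step; the toy adjacency matrix, the feature dimensions, and the random weight matrix are assumptions made only for this example:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One propagation step: ReLU(D~^{-1/2} A~ D~^{-1/2} H W), as in Eq. (2)."""
    A_tilde = A + np.eye(A.shape[0])              # add self-connections
    d_tilde = A_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))  # D~^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)         # sigma = ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
H0 = rng.standard_normal((3, 4))                  # H^(0) = X: 3 nodes, 4 features
W0 = rng.standard_normal((4, 2))                  # layer-specific trainable weights
print(gcn_layer(A, H0, W0))                       # H^(1), shape (3, 2)
```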

2.1 SPECTRAL GRAPH CONVOLUTIONS

We consider spectral convolutions on graphs defined as the multiplication of a signal $x \in \mathbb{R}^N$ (a scalar for every node) with a filter $g_\theta = \mathrm{diag}(\theta)$ parameterized by $\theta \in \mathbb{R}^N$ in the Fourier domain, i.e.:

$$g_\theta \star x = U g_\theta U^\top x \qquad (3)$$

where $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^\top$, with a diagonal matrix of its eigenvalues $\Lambda$ and $U^\top x$ being the graph Fourier transform of $x$. We can understand $g_\theta$ as a function of the eigenvalues of $L$, i.e. $g_\theta(\Lambda)$. Evaluating Eq. 3 is computationally expensive, as multiplication with the eigenvector matrix $U$ is $\mathcal{O}(N^2)$. Furthermore, computing the eigendecomposition of $L$ in the first place might be prohibitively expensive for large graphs. To circumvent this problem, it was suggested in Hammond et al. (2011) that $g_\theta(\Lambda)$ can be well-approximated by a truncated expansion in terms of Chebyshev polynomials $T_k(x)$ up to $K$-th order:

$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \qquad (4)$$

with a rescaled $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I_N$. $\lambda_{\max}$ denotes the largest eigenvalue of $L$. $\theta \in \mathbb{R}^{K}$ is now a vector of Chebyshev coefficients. The Chebyshev polynomials are recursively defined as $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. The reader is referred to Hammond et al. (2011) for an in-depth discussion of this approximation.
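
Before moving on, here is a small NumPy sketch of the exact spectral convolution of Eq. (3) that the Chebyshev expansion above is designed to avoid; the toy graph, the signal $x$, and the filter parameters $\theta$ are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
N = A.shape[0]

d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized graph Laplacian

# L = U diag(lam) U^T: this full eigendecomposition (and the dense products
# with U below) is what becomes prohibitively expensive for large graphs.
lam, U = np.linalg.eigh(L)

x = rng.standard_normal(N)                    # a scalar signal per node
theta = rng.standard_normal(N)                # free spectral filter parameters

x_hat = U.T @ x                               # graph Fourier transform of x
y = U @ (theta * x_hat)                       # filter in the spectral domain, transform back
print(y)
```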

Going back to our definition of a convolution of a signal $x$ with a filter $g_\theta$, we now have:

$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x \qquad (5)$$

with $\tilde{L} = 2L/\lambda_{\max} - I_N$; as can easily be verified by noticing that $(U \Lambda U^\top)^k = U \Lambda^k U^\top$. Note that this expression is now $K$-localized since it is a $K$-th-order polynomial in the Laplacian, i.e. it depends only on nodes that are at maximum $K$ steps away from the central node ($K$-th-order neighborhood). The complexity of evaluating Eq. 5 is $\mathcal{O}(|\mathcal{E}|)$, i.e. linear in the number of edges. Defferrard et al. (2016) use this $K$-localized convolution to define a convolutional neural network on graphs.
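
In contrast to the eigendecomposition route, Eq. (5) can be evaluated with nothing but sparse matrix-vector products, which is why its cost is linear in the number of edges. Below is a sketch under toy assumptions (a small cycle graph, random coefficients $\theta_k$, and $\lambda_{\max}$ computed exactly, whereas on large graphs it would only be approximated):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)

# Toy graph (a 4-cycle) stored as a sparse adjacency matrix.
A = sp.csr_matrix(np.array([[0, 1, 0, 1],
                            [1, 0, 1, 0],
                            [0, 1, 0, 1],
                            [1, 0, 1, 0]], dtype=float))
N = A.shape[0]
d = np.asarray(A.sum(axis=1)).ravel()
D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
L = sp.identity(N) - D_inv_sqrt @ A @ D_inv_sqrt    # normalized Laplacian

lam_max = np.linalg.eigvalsh(L.toarray()).max()     # exact here; approximated in practice
L_tilde = (2.0 / lam_max) * L - sp.identity(N)      # rescaled Laplacian

def chebyshev_conv(L_tilde, x, thetas):
    """g_theta * x ~= sum_k theta_k T_k(L~) x, via T_k = 2 L~ T_{k-1} - T_{k-2}."""
    t_prev, t_curr = x, L_tilde @ x                 # T_0(L~) x and T_1(L~) x
    out = thetas[0] * t_prev + thetas[1] * t_curr
    for theta_k in thetas[2:]:
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta_k * t_curr
    return out

x = rng.standard_normal(N)
print(chebyshev_conv(L_tilde, x, thetas=rng.standard_normal(3)))   # K = 2
```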

2.2 LAYER-WISE LINEAR MODEL

A neural network model based on graph convolutions can therefore be built by stacking multiple convolutional layers of the form of Eq. 5, each layer followed by a point-wise non-linearity. Now, imagine we limited the layer-wise convolution operation to $K = 1$ (see Eq. 5), i.e. a function that is linear w.r.t. $L$ and therefore a linear function on the graph Laplacian spectrum.

In this way, we can still recover a rich class of convolutional filter functions by stacking multiple such layers, but we are not limited to the explicit parameterization given by, e.g., the Chebyshev polynomials. We intuitively expect that such a model can alleviate the problem of overfitting on local neighborhood structures for graphs with very wide node degree distributions, such as social networks, citation networks, knowledge graphs and many other real-world graph datasets. Additionally, for a fixed computational budget, this layer-wise linear formulation allows us to build deeper models, a practice that is known to improve modeling capacity on a number of domains (He et al., 2016).

In this linear formulation of a GCN we further approximate $\lambda_{\max} \approx 2$, as we can expect that neural network parameters will adapt to this change in scale during training. Under these approximations Eq. 5 simplifies to:

$$g_\theta \star x \approx \theta_0 x + \theta_1 (L - I_N)\, x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x \qquad (6)$$

with two free parameters $\theta_0$ and $\theta_1$. The filter parameters can be shared over the whole graph. Successive application of filters of this form then effectively convolves the $k$-th-order neighborhood of a node, where $k$ is the number of successive filtering operations or convolutional layers in the neural network model.
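
A quick numerical check, on an invented three-node graph with arbitrary values for $\theta_0$ and $\theta_1$, that the two expressions in Eq. (6) coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt
L = np.eye(3) - A_norm                  # normalized Laplacian (lambda_max ~= 2 assumed)

x = rng.standard_normal(3)
theta0, theta1 = 0.7, -0.3              # the two free filter parameters

lhs = theta0 * x + theta1 * (L - np.eye(3)) @ x
rhs = theta0 * x - theta1 * A_norm @ x
print(np.allclose(lhs, rhs))            # True: both forms of Eq. (6) agree
```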

In practice, it can be beneficial to constrain the number of parameters further to address overfitting and to minimize the number of operations (such as matrix multiplications) per layer. This leaves us with the following expression:

$$g_\theta \star x \approx \theta \left( I_N + D^{-1/2} A D^{-1/2} \right) x \qquad (7)$$

with a single parameter $\theta = \theta_0 = -\theta_1$. Note that $I_N + D^{-1/2} A D^{-1/2}$ now has eigenvalues in the range $[0, 2]$. Repeated application of this operator can therefore lead to numerical instabilities and exploding/vanishing gradients when used in a deep neural network model. To alleviate this problem, we introduce the following renormalization trick: $I_N + D^{-1/2} A D^{-1/2} \rightarrow \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, with $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
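
The following sketch (toy graph, purely illustrative) compares the spectrum of $I_N + D^{-1/2} A D^{-1/2}$ with that of the renormalized operator $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, which is the motivation for the renormalization trick:

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
N = A.shape[0]

# Operator from Eq. (7): I_N + D^{-1/2} A D^{-1/2}, eigenvalues in [0, 2].
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
op_raw = np.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt

# Renormalization trick: D~^{-1/2} A~ D~^{-1/2} with A~ = A + I_N.
A_tilde = A + np.eye(N)
d_tilde = A_tilde.sum(axis=1)
Dt_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
op_renorm = Dt_inv_sqrt @ A_tilde @ Dt_inv_sqrt

print("I + D^-1/2 A D^-1/2 spectrum:", np.round(np.linalg.eigvalsh(op_raw), 3))
print("renormalized spectrum:       ", np.round(np.linalg.eigvalsh(op_renorm), 3))
# Repeatedly multiplying by an operator whose largest eigenvalue is 2 amplifies
# the signal; the renormalized operator keeps its spectrum within [-1, 1].
```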

We can generalize this definition to a signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e. a $C$-dimensional feature vector for every node) and $F$ filters or feature maps as follows:

$$Z = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X \Theta \qquad (8)$$

where $\Theta \in \mathbb{R}^{C \times F}$ is now a matrix of filter parameters and $Z \in \mathbb{R}^{N \times F}$ is the convolved signal matrix. This filtering operation has complexity $\mathcal{O}(|\mathcal{E}| F C)$, as $\tilde{A} X$ can be efficiently implemented as a product of a sparse matrix with a dense matrix.
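
A sketch of Eq. (8) with scipy.sparse; the random graph, the dimensions $C$ and $F$, and the helper name `normalized_adjacency` are illustrative assumptions. Keeping the normalized adjacency sparse is what makes the propagation cost scale with the number of edges rather than $N^2$:

```python
import numpy as np
import scipy.sparse as sp

def normalized_adjacency(A):
    """Sparse D~^{-1/2} A~ D~^{-1/2} with A~ = A + I_N."""
    A_tilde = A + sp.identity(A.shape[0], format="csr")
    d_tilde = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d_tilde))
    return (D_inv_sqrt @ A_tilde @ D_inv_sqrt).tocsr()

rng = np.random.default_rng(0)
N, C, F = 6, 8, 4                                # nodes, input channels, filters

# Random symmetric binary adjacency matrix, for illustration only.
upper = sp.random(N, N, density=0.3, format="csr", random_state=0)
A = ((upper + upper.T) > 0).astype(float)

X = rng.standard_normal((N, C))                  # C-dimensional features per node
Theta = rng.standard_normal((C, F))              # filter parameter matrix

Z = normalized_adjacency(A) @ (X @ Theta)        # sparse-dense products only
print(Z.shape)                                   # (N, F)
```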

3. SEMI-SUPERVISED NODE CLASSIFICATION

Having introduced a simple, yet flexible model $f(X, A)$ for efficient information propagation on graphs, we can return to the problem of semi-supervised node classification. As outlined in the introduction, we can relax certain assumptions typically made in graph-based semi-supervised learning by conditioning our model $f(X, A)$ both on the data $X$ and on the adjacency matrix $A$ of the underlying graph structure. We expect this setting to be especially powerful in scenarios where the adjacency matrix contains information not present in the data $X$, such as citation links between documents in a citation network or relations in a knowledge graph. The overall model, a multi-layer GCN for semi-supervised learning, is schematically depicted in Figure 1.

3.1 EXAMPLE

In the following, we consider a two-layer GCN for semi-supervised node classification on a graph with a symmetric adjacency matrix $A$ (binary or weighted). We first calculate $\hat{A} = \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ in a pre-processing step. Our forward model then takes the simple form:

$$Z = f(X, A) = \mathrm{softmax}\!\left( \hat{A}\, \mathrm{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) \qquad (9)$$

Here, $W^{(0)} \in \mathbb{R}^{C \times H}$ is an input-to-hidden weight matrix for a hidden layer with $H$ feature maps, $W^{(1)} \in \mathbb{R}^{H \times F}$ is a hidden-to-output weight matrix, and the softmax activation function is applied row-wise.
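
Putting the pieces together, a minimal NumPy sketch of this two-layer forward pass; the toy graph, the feature dimension, the hidden size, and the number of classes are invented, and no training procedure is included, so treat this only as an illustration of the forward model:

```python
import numpy as np

def normalize(A):
    """Pre-processing step: A_hat = D~^{-1/2} (A + I_N) D~^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tilde))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt

def row_softmax(Z):
    e = np.exp(Z - Z.max(axis=1, keepdims=True))   # numerically stable row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

def two_layer_gcn(A, X, W0, W1):
    """Forward model Z = softmax(A_hat ReLU(A_hat X W^(0)) W^(1))."""
    A_hat = normalize(A)
    H = np.maximum(A_hat @ X @ W0, 0.0)            # hidden layer with ReLU
    return row_softmax(A_hat @ H @ W1)             # per-node class probabilities

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.standard_normal((4, 5))                    # 4 nodes, 5 input features
W0 = rng.standard_normal((5, 8))                   # input-to-hidden weights
W1 = rng.standard_normal((8, 3))                   # hidden-to-output weights (3 classes)
print(two_layer_gcn(A, X, W0, W1))                 # rows sum to 1
```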
