The paper *node2vec: Scalable Feature Learning for Networks* proposes the node2vec algorithm, which learns a low-dimensional vector representation for each vertex by capturing two kinds of relationships between vertices: homophily and structural equivalence.

Homophily means that vertices that are directly connected, or that belong to the same community, should have embeddings that are close to each other. As shown in the figure, vertices $u$, $s_1$, $s_2$, $s_3$ and $s_4$ are directly connected and belong to the same community, so their embeddings lie close together in feature space. Structural equivalence means that vertices with similar structural roles in the graph (they need not be directly connected and may be far apart) should also have embeddings that are close to each other; for example, vertices $u$ and $s_6$ are each the hub of their respective community, so they share similar structural features and their embeddings lie close together in feature space. To capture both notions, node2vec designs a flexible neighborhood sampling strategy which allows us to smoothly interpolate between BFS and DFS.
Given a source vertex, node2vec seeks the feature mapping $f$ that maximizes the probability of observing that vertex's neighborhood, i.e., it optimizes the objective function in Eq. (1):
$$\begin{equation}\max_f\sum_{u \in V}\log Pr(N_S(u)|f(u)) \tag{1}\end{equation}$$
For every source vertex $u \in V$, $N_S(u)$ is the neighborhood obtained under the sampling strategy $S$.
To make this objective tractable, the paper makes two assumptions:
Assumption 1 (conditional independence) states that, given the feature representation $f(u)$ of the source vertex $u$, $Pr(n_i|f(u))$ and $Pr(n_j|f(u))$ are independent of each other ($n_i\in N_S(u),n_j \in N_S(u),i\neq j$). Therefore $Pr(N_S(u)|f(u))$ can be written as Eq. (2):
$$\begin{equation}Pr(N_S(u)|f(u))=\prod_{n_i \in N_S(u)}Pr(n_i|f(u))\tag{2}\end{equation}$$
Assumption 2 (symmetry in feature space) states that a source vertex and any vertex in its neighborhood influence each other symmetrically. The natural choice is then to model $Pr(n_i|f(u))$ as the softmax in Eq. (3):
$$\begin{equation}Pr(n_i|f(u))=\frac{\exp(f(n_i)^\top f(u))}{\sum_{v \in V}\exp(f(v)^\top f(u))}\tag{3}\end{equation}$$
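As a small numerical sketch of Eq. (3) (the embedding matrix `emb` and the vertex count are made-up values, not from the paper), the neighbor probability is just a softmax over inner products:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 8-dimensional embeddings f(v) for 5 vertices.
emb = rng.normal(size=(5, 8))

def neighbor_prob(u, n):
    """Pr(n | f(u)) per Eq. (3): softmax of f(v)^T f(u) over all v in V."""
    scores = emb @ emb[u]        # inner product with every vertex embedding
    scores -= scores.max()       # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[n]

total = sum(neighbor_prob(0, v) for v in range(5))
print(total)  # the distribution over all vertices sums to 1
```

Note the denominator sums over every vertex in $V$, which is exactly why a direct evaluation is expensive and negative sampling is used in practice.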
The node2vec algorithm therefore has to solve two problems: how to define the sampling strategy $S$, and how to optimize the resulting objective efficiently.
The second problem can be solved along the lines of the skip-gram model with negative sampling; the key remaining issue is determining the sampling strategy $S$.
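To make the skip-gram-with-negative-sampling step concrete, here is a minimal numpy sketch of a single SGNS gradient update (the dimensions, learning rate, and vertex ids are illustrative choices of mine, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)
num_nodes, dim, lr = 5, 8, 0.1
emb_in = rng.normal(scale=0.1, size=(num_nodes, dim))   # f(u), being learned
emb_out = rng.normal(scale=0.1, size=(num_nodes, dim))  # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(center, context, negatives):
    """One update: pull the (center, context) pair together, push negatives apart."""
    # Positive pair: gradient of -log(sigmoid(f(context) . f(center)))
    score = sigmoid(emb_out[context] @ emb_in[center])
    grad_center = (score - 1.0) * emb_out[context]
    emb_out[context] -= lr * (score - 1.0) * emb_in[center]
    # Negative samples: gradient of -log(sigmoid(-f(neg) . f(center)))
    for neg in negatives:
        score = sigmoid(emb_out[neg] @ emb_in[center])
        grad_center += score * emb_out[neg]
        emb_out[neg] -= lr * score * emb_in[center]
    emb_in[center] -= lr * grad_center

before = sigmoid(emb_out[1] @ emb_in[0])
for _ in range(50):
    sgns_step(0, 1, negatives=[3, 4])
after = sigmoid(emb_out[1] @ emb_in[0])
print(before, after)  # repeated updates raise the positive pair's score
```

In the full pipeline, each random walk plays the role of a sentence and each vertex in a window around the center plays the role of a context word.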
node2vec proposes a biased random walk that balances BFS and DFS by introducing two hyperparameters $p$ and $q$. The probability of walking from vertex $v$ to vertex $x$ is given by Eq. (4):
$$\begin{equation}P(c_i=x \mid c_{i-1}=v) = \begin{cases} \frac{\pi_{vx}}{Z}, & \text{if } (v,x)\in E \\ 0, & \text{otherwise} \end{cases}\tag{4}\end{equation}$$
Here $v$ is the current vertex of the walk, $t$ is the vertex visited immediately before $v$, $x$ is a candidate next vertex, and $Z$ is the normalizing constant. The unnormalized transition probability is $\pi_{vx}=\alpha_{pq}(t,x)\cdot w_{vx}$, where $w_{vx}$ is the weight of edge $(v,x)$ and $\alpha_{pq}(t,x)$ is defined as follows:
$$\begin{equation}\alpha_{pq}(t,x)=\begin{cases}\frac{1}{p}, & \text{if } d_{tx}=0\\ 1, & \text{if } d_{tx}=1\\ \frac{1}{q}, & \text{if } d_{tx}=2\end{cases}\tag{5}\end{equation}$$
Here $d_{tx}=0$ means that $t$ and $x$ are the same vertex, $d_{tx}=1$ means that $t$ and $x$ are directly connected by an edge, and $d_{tx}=2$ means that $t$ and $x$ are not directly connected.
As shown in the figure, in an unweighted graph (all edge weights can be taken as 1), suppose a walk has just moved from vertex $t$ to vertex $v$. At the next step it can move to four different vertices, $t$, $x_1$, $x_2$ and $x_3$, with unnormalized transition probabilities $\frac{1}{p}$, $1$, $\frac{1}{q}$ and $\frac{1}{q}$, respectively.
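This can be checked numerically. The graph below is a hypothetical stand-in mirroring the figure (vertex names `t`, `v`, `x1`, `x2`, `x3` and the values $p=2$, $q=0.5$ are my own example choices):

```python
import networkx as nx

p, q = 2.0, 0.5  # arbitrary example values for the two hyperparameters

g = nx.Graph()
g.add_edges_from([('t', 'v'),
                  ('t', 'x1'),   # x1 is also a neighbor of t, so d_tx = 1
                  ('v', 'x1'), ('v', 'x2'), ('v', 'x3')])  # x2, x3: d_tx = 2

def alpha(t, x):
    if x == t:
        return 1.0 / p       # d_tx = 0: returning to the previous vertex
    if g.has_edge(t, x):
        return 1.0           # d_tx = 1: x is also a neighbor of t
    return 1.0 / q           # d_tx = 2: moving away from t

# Unnormalized transition probabilities out of v (unit edge weights).
pi = {x: alpha('t', x) for x in g.neighbors('v')}
print(pi)  # {'t': 0.5, 'x1': 1.0, 'x2': 2.0, 'x3': 2.0}
```

With $p>1$ and $q<1$ the walk is discouraged from backtracking to $t$ and encouraged to move outward, a DFS-like behavior.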
The hyperparameters $p$ and $q$ control how fast the walk explores and leaves the neighborhood of the starting node $u$: the smaller $p$ is, the more likely the sampled vertices stay close to the starting vertex; the smaller $q$ is, the more likely the sampled vertices move away from it.
The biased random walk can be implemented with alias sampling for $O(1)$ draws per step:

```python
import networkx as nx
import numpy as np
import random

p = 1
q = 2


def gen_graph():
    g = nx.DiGraph()
    g.add_weighted_edges_from([(1, 2, 0.5), (2, 3, 1.5), (4, 1, 1.0),
                               (2, 4, 0.5), (4, 5, 1.0)])
    g.add_weighted_edges_from([(2, 1, 0.5), (3, 2, 1.5), (1, 4, 1.0),
                               (4, 2, 0.5), (5, 4, 1.0)])
    return g


def get_alias_edge(g, prev, cur):
    # Biased transition probabilities for a walk that traversed
    # edge (prev, cur) and is now at cur.
    unnormalized_probs = []
    for cur_nbr in g.neighbors(cur):
        if cur_nbr == prev:                  # d_tx = 0: return to prev
            unnormalized_probs.append(g[cur][cur_nbr]['weight'] / p)
        elif g.has_edge(cur_nbr, prev):      # d_tx = 1: neighbor of prev
            unnormalized_probs.append(g[cur][cur_nbr]['weight'])
        else:                                # d_tx = 2: move away from prev
            unnormalized_probs.append(g[cur][cur_nbr]['weight'] / q)
    norm = sum(unnormalized_probs)
    normalized_probs = [float(prob) / norm for prob in unnormalized_probs]
    return alias_setup(normalized_probs)


def alias_setup(ws):
    '''
    Compute utility lists for non-uniform sampling from discrete distributions.
    Refer to
    https://hips.seas.harvard.edu/blog/2013/03/03/the-alias-method-efficient-sampling-with-many-discrete-outcomes/
    for details.
    '''
    K = len(ws)
    probs = np.zeros(K, dtype=np.float32)
    alias = np.zeros(K, dtype=np.int32)

    smaller = []
    larger = []
    for kk, prob in enumerate(ws):
        probs[kk] = K * prob
        if probs[kk] < 1.0:
            smaller.append(kk)
        else:
            larger.append(kk)

    while len(smaller) > 0 and len(larger) > 0:
        small = smaller.pop()
        large = larger.pop()
        alias[small] = large
        probs[large] = probs[large] + probs[small] - 1.0
        if probs[large] < 1.0:
            smaller.append(large)
        else:
            larger.append(large)

    return alias, probs


def alias_draw(alias, probs):
    '''
    Draw a sample from a non-uniform discrete distribution using alias sampling.
    '''
    K = len(alias)
    kk = int(np.floor(np.random.rand() * K))
    if np.random.rand() < probs[kk]:
        return kk
    else:
        return alias[kk]


def preprocess_transition_probs(g):
    '''
    Preprocessing of transition probabilities for guiding the random walks.
    '''
    alias_nodes = {}
    for node in g.nodes():
        unnormalized_probs = [g[node][nbr]['weight'] for nbr in g.neighbors(node)]
        norm = sum(unnormalized_probs)
        normalized_probs = [float(u_prob) / norm for u_prob in unnormalized_probs]
        alias_nodes[node] = alias_setup(normalized_probs)

    alias_edges = {}
    for edge in g.edges():
        alias_edges[edge] = get_alias_edge(g, edge[0], edge[1])

    return alias_nodes, alias_edges


def node2vec_walk(g, walk_length, start_node, alias_nodes, alias_edges):
    '''
    Simulate a random walk starting from start_node.
    '''
    walk = [start_node]
    while len(walk) < walk_length:
        cur = walk[-1]
        cur_nbrs = list(g.neighbors(cur))
        if len(cur_nbrs) > 0:
            if len(walk) == 1:
                # First step: no previous vertex yet, sample by edge weight.
                walk.append(cur_nbrs[alias_draw(alias_nodes[cur][0],
                                                alias_nodes[cur][1])])
            else:
                prev = walk[-2]
                pos = (prev, cur)
                next_node = cur_nbrs[alias_draw(alias_edges[pos][0],
                                                alias_edges[pos][1])]
                walk.append(next_node)
        else:
            break
    return walk


def simulate_walks(g, num_walks, walk_length, alias_nodes, alias_edges):
    '''
    Repeatedly simulate random walks from each node.
    '''
    walks = []
    nodes = list(g.nodes())
    print('Walk iteration:')
    for walk_iter in range(num_walks):
        print("iteration: {} / {}".format(walk_iter + 1, num_walks))
        random.shuffle(nodes)
        for node in nodes:
            walks.append(node2vec_walk(g, walk_length=walk_length,
                                       start_node=node,
                                       alias_nodes=alias_nodes,
                                       alias_edges=alias_edges))
    return walks


if __name__ == '__main__':
    g = gen_graph()
    alias_nodes, alias_edges = preprocess_transition_probs(g)
    walks = simulate_walks(g, 2, 3, alias_nodes, alias_edges)
    print(walks)
    # Example output (walks are random):
    # Walk iteration:
    # iteration: 1 / 2
    # iteration: 2 / 2
    # [[5, 4, 1], [2, 3, 2], [4, 1, 2], [3, 2, 3], [1, 2, 3],
    #  [4, 1, 2], [3, 2, 3], [1, 2, 3], [2, 3, 2], [5, 4, 1]]
```