Graph Networks

Jenny Ching
10 min read · Mar 9, 2021


I have written about graph networks before, but I realized I never fully understood them (and even now I probably only scratch the surface). Recently I had to dig deeper for work, so this article is my attempt to organize and compare different graph networks, discuss their characteristics and how they are classified, so that we can choose a suitable training network for a given graph structure and use case.

Properties of the Graph Data Structure

So far, most mature ML models target Euclidean data such as images, text, and audio, which are sequence (1D) or grid (2D) problems. Graph structures are more complex than 1D or 2D data: besides having no fixed structure, they lack the ordering of 1D data or the up/down/left/right layout of 2D data. Graphs are more abstract to represent and involve more variables, such as node degree, node edges, and each node's neighborhood structure. Because a graph can be represented in many different ways, there are also many more possible ways to train a model on it; overall the degree of freedom is much higher, which makes training harder.
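As a small illustration (a hypothetical 4-node toy graph, my own example rather than anything from a specific paper), the same graph can be stored as an edge list or as an adjacency matrix, and quantities like node degree and a node's neighborhood fall out of either representation:

```python
import numpy as np

# A small undirected toy graph with 4 nodes and 4 edges.
edge_list = [(0, 1), (0, 2), (1, 2), (2, 3)]

# Dense adjacency matrix built from the edge list.
num_nodes = 4
W = np.zeros((num_nodes, num_nodes))
for u, v in edge_list:
    W[u, v] = W[v, u] = 1.0  # undirected: symmetric

degrees = W.sum(axis=1)               # node degree of each node
neighbors_of_2 = np.nonzero(W[2])[0]  # neighborhood structure of node 2

print(degrees)         # [2. 2. 3. 1.]
print(neighbors_of_2)  # [0 1 3]
```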

Graph Networks Are Hard to Classify

Graph networks are still in a state where many schools of thought compete. New research keeps appearing, there is no systematic way to distinguish the graph networks proposed so far, and naming is not consistent either (the same term can mean different things in different papers, which is confusing). A useful reference is the taxonomy in this survey paper from Google. The paper first splits graph networks by whether they are unsupervised or supervised, whether they use node features, and which encoder they use, which yields four main categories: (1) shallow embedding methods, (2) auto-encoding methods, (3) graph regularization methods, and (4) graph neural networks (GNNs). Graph neural networks can be further split into message passing and graph convolution approaches, and graph convolutions into Spectral Graph Convolutions (including spectrum-based and spectrum-free methods) and Spatial Graph Convolutions, as summarized in the figure below:

Of course, these categories are not absolute: message passing can also be viewed as a form of spatial convolution, since messages are computed over a local neighborhood; conversely, spatial convolutions can be described within a message passing framework.

The Main Goal of Graph Networks

The main goal of training a graph model is to learn graph embeddings, i.e. to learn a mapping function that turns a discrete graph into a relatively low-dimensional continuous space. In this learned space, two nodes with similar connections (an edge between them, or shared neighbors) end up closer to each other in the learned vector space.

Split 1: Unsupervised vs. Supervised Network Embedding

Unsupervised, as the name suggests, means there are no labels; learning relies purely on the graph structure (possibly plus node features). The most intuitive approach is to learn a function that approximates the input graph by minimizing a reconstruction loss: the function is used to rebuild the graph, and the rebuilt graph should be as similar as possible to the original.

Supervised network embedding, by contrast, has an explicit learning target, such as predicting node labels, so the final embedding space is optimized for that target.

Split 2: With or Without Node Features

The second split is whether node features are used. If they are, the model can exploit semantic graph information in addition to the graph structure when learning embeddings. Most later graph networks include node features, e.g. GCN, GraphSAGE, and message passing models.

If node features are not used, learning relies purely on the graph structure. Earlier models such as DeepWalk and Node2Vec use random walks to learn how near or far nodes are within the graph structure, and use that as the basis for the embeddings.

Split 3: Transductive vs. Inductive Network Embedding

In a transductive model, the goal is to embed all nodes of a static graph; the learned embeddings can then be used to make predictions about nodes that already exist in the graph. The downside of transductive methods is that when a new node arrives, the whole model must be retrained to obtain that node's embedding. Inductive methods, by contrast, can generalize to previously unseen nodes, generating a new node's embedding directly from its neighbors.

Split 4: By Loss Function

The loss function (or objective function) determines the direction in which the model's optimization learns:

Supervised loss: the most straightforward loss function; it minimizes the difference between the predicted labels ŷ and the ground-truth labels y, e.g. with cross-entropy.

Graph regularization loss: learned by shrinking the distance between the decoded similarity matrix Ŵ and a target similarity matrix s(W), so that the graph structure imposes regularization constraints on the model parameters. This way of learning captures higher-order proximities, i.e. the more global graph structure, more easily than the local graph structure.

Weight regularization loss: unlike the graph regularization loss above, weight regularization is applied to the model parameters to prevent overfitting; the most common choice is L2 regularization (which assumes a standard Gaussian prior). A short sketch of all three loss terms follows below.
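A minimal sketch of what these three loss terms might look like in code (NumPy, with hypothetical inputs; the exact forms differ from paper to paper):

```python
import numpy as np

def supervised_loss(y_true_onehot, y_pred_probs):
    # Cross-entropy between ground-truth labels y and predicted labels y_hat.
    eps = 1e-12
    return -np.mean(np.sum(y_true_onehot * np.log(y_pred_probs + eps), axis=1))

def graph_regularization_loss(W_decoded, s_W):
    # Distance between the decoded similarity matrix W_hat and the
    # target similarity matrix s(W); here a simple squared Frobenius norm.
    return np.sum((W_decoded - s_W) ** 2)

def weight_regularization_loss(params, lam=1e-4):
    # L2 penalty on model parameters (standard Gaussian prior).
    return lam * sum(np.sum(p ** 2) for p in params)
```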

Split 5: By Encoder

The next distinction is by encoder. The encoder is the soul of the whole model: the choice of encoder and loss function largely determines what parameters the model ends up learning.

Shallow embedding methods: the encoder function is just a simple embedding lookup, i.e. the model parameters are used directly as the node embeddings.

Graph regularization methods: the encoder network only takes node features as input and regularizes the node embeddings through a graph regularization loss (corresponding to the graph regularization loss under the loss-function split above); the graph structure is only used inside the loss function.

Graph auto-encoding methods: the opposite of graph regularization methods; here the encoder function only uses the graph structure (no node features).

Neighborhood aggregation methods: the newer and now mainstream approach, which broadly includes graph convolutional methods. By using node features together with the graph structure, neighborhood aggregation methods use the graph structure to propagate node features across nodes, and thereby learn embeddings that capture both local and global graph structure (a small code contrast with shallow embeddings follows below).
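As a rough contrast of the two extremes (a toy NumPy sketch, not any particular model's implementation): a shallow embedding encoder is only a table lookup, while a neighborhood aggregation encoder combines node features with the graph structure:

```python
import numpy as np

num_nodes, dim = 4, 8
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # toy adjacency matrix
X = np.random.randn(num_nodes, dim)          # toy node features

# Shallow embedding: the encoder is just a lookup into a parameter table,
# so the parameters themselves are the node embeddings.
Z_shallow = np.random.randn(num_nodes, dim)  # trained directly by the loss
z_node_2 = Z_shallow[2]                      # "encoding" = table lookup

# Neighborhood aggregation: propagate node features over the graph structure,
# here a single mean-over-neighbors step followed by a linear map.
Theta = np.random.randn(dim, dim)
D_inv = np.diag(1.0 / W.sum(axis=1))
Z_aggregated = np.tanh(D_inv @ W @ X @ Theta)
```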

Among these, shallow embedding methods and graph auto-encoders do not use node features; they learn purely from the graph structure and are transductive. Neither encoder is therefore used for inductive problems, where the graph keeps changing (imagine Facebook friendship links changing constantly: retraining the whole model every time someone is unfriended would be extremely inefficient).

The figure below summarizes the graph networks we have distinguished so far, split by unsupervised vs. supervised network embedding (first level), use of node features (second level), and encoder type (third level):

Next, we go through the better-known models under each of these categories in more detail:

(1) Shallow Embedding Methods

DeepWalk

The most common shallow embedding method is DeepWalk. DeepWalk was the first to apply skip-gram models, originally used for language modeling, to graph-structured data. It exploits the analogy between graphs and language: performing random walks on a graph produces ordered sequences of nodes, just like sentences in a language. From there the principle is exactly the same as skip-gram: DeepWalk trains the model by maximizing the probability of predicting the target node given its context.
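A minimal sketch of the random-walk half of DeepWalk (NumPy; in practice the generated walks would then be fed into a skip-gram model, e.g. gensim's Word2Vec, to learn the embeddings):

```python
import numpy as np

def random_walk(W, start, walk_length, rng):
    # Uniform random walk over an adjacency matrix W, returning a
    # "sentence" of node ids that skip-gram can treat like words.
    walk = [start]
    for _ in range(walk_length - 1):
        neighbors = np.nonzero(W[walk[-1]])[0]
        if len(neighbors) == 0:
            break
        walk.append(rng.choice(neighbors))
    return walk

rng = np.random.default_rng(0)
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
corpus = [random_walk(W, node, walk_length=5, rng=rng) for node in range(4)]
```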

node2vec

node2vec shares DeepWalk's basic principle: it is also random-walk based and also an unsupervised network embedding. The difference is the sampling strategy: learned parameters control whether the random walk behaves more like breadth-first search (BFS) or depth-first search (DFS). The benefit is that node2vec embeddings can capture both local structures and global community structures.
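A simplified sketch of node2vec's biased transition step, where the return parameter p and in-out parameter q control how BFS-like or DFS-like the walk behaves (unnormalized weights as in the paper; this is an illustrative snippet, not the reference implementation):

```python
import numpy as np

def node2vec_step(W, prev, curr, p, q, rng):
    # Choose the next node given the previous and current node of the walk.
    # Weight 1/p to return to prev, 1 for nodes adjacent to prev (BFS-like),
    # and 1/q for nodes farther from prev (DFS-like).
    neighbors = np.nonzero(W[curr])[0]
    weights = []
    for nxt in neighbors:
        if nxt == prev:
            weights.append(1.0 / p)
        elif W[prev, nxt] > 0:          # nxt is also a neighbor of prev
            weights.append(1.0)
        else:
            weights.append(1.0 / q)
    weights = np.array(weights) / np.sum(weights)
    return rng.choice(neighbors, p=weights)
```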

Large scale information network embedding (LINE)

LINE (Jian Tang et al.) explicitly defines two functions: one for first-order proximity and another for second-order proximity. In the experiments conducted in the original paper, second-order proximity performed significantly better than first-order, and the authors suggested that including even higher orders may yield diminishing improvements in accuracy.

The goal of LINE is to minimize the difference between the input and embedding distributions. This is achieved using KL divergence:

LINE defines two joint probability distributions for each pair of nodes and then minimizes the KL divergence between them. The two distributions come from the adjacency matrix and from the dot product of the node embeddings. KL divergence is an important similarity metric in information theory; it is also used in probabilistic generative models like variational autoencoders, which embed the inputs of an autoencoder into a latent space that is treated as a distribution.
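As a rough illustration (my own simplification, not the official implementation), LINE's first-order objective can be written as a weighted negative log-likelihood over edges, which is what minimizing the KL divergence between the empirical edge distribution and the embedding-based distribution reduces to, up to a constant:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def line_first_order_loss(W, Z):
    # Model probability of an edge (i, j) is sigmoid(z_i . z_j); the loss is
    # the edge-weighted negative log-likelihood over all observed edges.
    rows, cols = np.nonzero(W)
    edge_probs = sigmoid(np.sum(Z[rows] * Z[cols], axis=1))
    return -np.sum(W[rows, cols] * np.log(edge_probs + 1e-12))
```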

(Source: https://towardsdatascience.com/overview-of-deep-learning-on-graph-embeddings-4305c10ad4a4)

(2) Auto-encoder Methods

Shallow embedding methods struggle to learn nonlinear, complex information about the graph structure. Graph auto-encoders were introduced to overcome this: the multiple nonlinear layers in their deep neural network encoder and decoder functions can capture nonlinear information. The encoder's only input is the adjacency matrix W, and the model learns by minimizing the reconstruction error.
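A minimal forward-pass sketch of the idea (NumPy, random untrained weights, just to show the shapes): the encoder maps rows of W through stacked nonlinearities to embeddings Z, the decoder maps them back, and training would minimize the reconstruction error:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

num_nodes, hidden, emb = 4, 16, 8
W = np.random.rand(num_nodes, num_nodes) < 0.5   # toy adjacency matrix
W = (W | W.T).astype(float)                      # make it symmetric

# Encoder: rows of the adjacency matrix -> nonlinear layers -> embeddings Z.
W1, W2 = np.random.randn(num_nodes, hidden), np.random.randn(hidden, emb)
Z = relu(relu(W @ W1) @ W2)

# Decoder: embeddings -> reconstructed adjacency matrix W_hat.
W3, W4 = np.random.randn(emb, hidden), np.random.randn(hidden, num_nodes)
W_hat = relu(relu(Z @ W3) @ W4)

reconstruction_loss = np.sum((W_hat - W) ** 2)   # minimized during training
```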

Structural deep network embedding (SDNE)

SDNE learns auto-encoders that preserve first- and second-order node proximity. The SDNE encoder takes as input a node vector, i.e. a row of the adjacency matrix (SDNE explicitly sets s(W) = W), and produces node embeddings Z. The SDNE decoder returns a reconstruction Ŵ, which is trained to recover the original graph adjacency matrix. SDNE preserves second-order node proximity by minimizing the graph regularization term.

SDNE does not use random walks. Instead, it tries to learn from two distinct metrics:

  • First-order proximity: two nodes are considered similar if they share an edge (pairwise similarity)
  • Second-order proximity: two nodes are considered similar if they share many neighboring/adjacent nodes

The Laplacian eigenmaps embedding algorithm applies a penalty when similar nodes are mapped far from each other in the embedded space, thus allowing for optimization by minimizing the distance between similar nodes.

The second-order proximity is preserved by passing the adjacency matrix of the graph through an unsupervised autoencoder with a built-in reconstruction loss that it must minimize.

(Source: https://towardsdatascience.com/overview-of-deep-learning-on-graph-embeddings-4305c10ad4a4)

(3) Graph Neural Networks

After all that, we finally reach the main event of this post. What makes graph neural networks (GNNs) special is that the encoder function uses both the graph structure and the node features. GNNs can be further divided into:

Spectral Methods: require working with the graph Laplacian's eigenvalues and eigenvectors ("eigen-stuff")

Spatial Methods: do not require any eigendecomposition

Spectral methods can be further subdivided into spectrum-based and spectrum-free methods; here is the classification figure from earlier again as a reminder:

Spectral Methods

Spectral methods perform convolution filter operations in the spectral domain of the graph Laplacian matrix (note that "convolution" here is not the same as in CNNs; the concept is only borrowed: on a graph it means multiplying a signal, i.e. the node features, by a kernel, analogous to how a CNN multiplies pixel values by kernel values). Spectral methods split further into spectrum-based and spectrum-free methods. The drawback of spectrum-based methods is that they rely on the spectrum of the graph Laplacian and are therefore strongly domain-dependent (they cannot be applied to a new graph), and the spectral decomposition itself is very expensive to compute. Spectrum-free methods improve on this by replacing the explicit spectral decomposition with approximations of the spectral filters.

Spectrum-based methods

Spectrum-based methods grew out of Graph Signal Processing (GSP) research (e.g. the work of Michaël Defferrard), which combined graph theory with signal analytics and has flourished since. Put simply:

GSP is the key to generalizing convolutions, allowing us to build functions that can take into account both the overall structure of the graph and the individual properties of the graph’s components.

GSP applies signal processing tools such as the Fourier transform, normally used on signals/frequencies, to graphs, giving the graph Fourier transform. The transformed values reflect the "smoothness" of the graph, where smoothness means how close similar nodes are to each other compared with dissimilar nodes.

Roughly speaking, a GCN uses the graph Fourier transform to aggregate neighboring node features, with those features expressed as signals derived from the eigendecomposition of the graph Laplacian. The borrowed analogy is how a sound wave is composed of component frequencies: here the component frequencies play the role of the node features (as signals) that compose the latent graph.

In spectrum-based methods, node features are represented as "signals", but a signal is usually not just a node or edge feature itself; it is a function applied to the features.

With this transform in place, spectral graph convolutions apply convolutions to graphs by learning convolution filters in the spectral domain of the normalized Laplacian matrix. The shape of the convolution filters (which the paper calls patch functions) can be expressed in terms of the eigenvectors of the graph's normalized Laplacian.

The overall pipeline is: transform the graph signals into the spectral domain with the graph Fourier transform (so there is no longer a spatial notion of neighborhood), process the spectral-domain signals with the learned convolution filters, and finally apply the inverse Fourier transform to map the signals back to the graph domain, completing the processing/aggregation/analysis.
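A toy NumPy sketch of that pipeline (my own illustration): eigendecompose the normalized Laplacian, project the node-feature signal into the spectral domain, scale it with a (here random, normally learned) filter, and project it back:

```python
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.random.randn(4)                      # a graph signal: one value per node

# Normalized graph Laplacian L = I - D^(-1/2) W D^(-1/2).
D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
L = np.eye(4) - D_inv_sqrt @ W @ D_inv_sqrt

# Graph Fourier transform: project the signal onto the Laplacian eigenvectors.
eigvals, U = np.linalg.eigh(L)
x_spectral = U.T @ x

# Filter in the spectral domain with a (normally learned) kernel g_theta(Lambda),
# then apply the inverse graph Fourier transform.
g_theta = np.random.randn(4)                # one filter value per eigenvalue
x_filtered = U @ (g_theta * x_spectral)
```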

Spectrum-free methods

As mentioned above, the advantage of spectrum-free methods is that they do not compute the Laplacian's eigendecomposition directly; instead they use approximations of the spectral filters, more precisely, polynomial expansions that approximate the spectral filters.

Spectral Methods — Chebyshev networks

ChebNets is one of the first and most important papers on spectral graph learning. Spectral convolutions are defined as the multiplication of a signal (node features/attributes) by a kernel. The kernel used in a spectral convolution is made of Chebyshev polynomials of the diagonal matrix of Laplacian eigenvalues. Chebyshev polynomials are a type of orthogonal polynomial with properties that make them very good at tasks like approximating functions. The kernel is represented by the equation:

gθ(Λ) = Σ_{k=0}^{K−1} θk Tk(Λ̃)

Where gθ is a kernel (θ represents the vector of Chebyshev coefficients) applied to Λ, the diagonal matrix of Laplacian eigenvalues (Λ̃ represents the diagonal matrix of scaled Laplacian eigenvalues). k represents the smallest-order neighborhood and K the largest-order neighborhood. Finally, Tk stands for the Chebyshev polynomial of the kth order.

In plain English:

The kernel equals the sum of all Chebyshev polynomial kernels applied to the diagonal matrix of scaled Laplacian eigenvalues for each order of k up to K-1.

The original ChebNet paper also introduces pooling methods, another key component of the complete vanilla convolution, by using graph coarsening. This additional step increases efficiency and brings graph convolutions a step closer to their vanilla cousins.
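A sketch of the Chebyshev filtering step (NumPy; it uses the standard recurrence T0(x) = 1, T1(x) = x, Tk(x) = 2x Tk−1(x) − Tk−2(x) on the scaled Laplacian, so no explicit eigendecomposition is needed; illustrative only):

```python
import numpy as np

def chebyshev_filter(L, x, theta, lambda_max=2.0):
    # Approximate spectral filtering g_theta(L) x with Chebyshev polynomials
    # of the scaled Laplacian L_tilde = 2 L / lambda_max - I.
    # Assumes at least two coefficients in theta.
    n = L.shape[0]
    L_tilde = 2.0 * L / lambda_max - np.eye(n)
    T_prev, T_curr = x, L_tilde @ x          # T_0(L_tilde) x and T_1(L_tilde) x
    out = theta[0] * T_prev + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_next = 2.0 * L_tilde @ T_curr - T_prev   # Chebyshev recurrence
        out = out + theta[k] * T_next
        T_prev, T_curr = T_curr, T_next
    return out
```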

(Source: https://towardsdatascience.com/graph-convolutional-networks-for-geometric-deep-learning-1faf17dee008)

Spectral Methods — Graph Convolution Networks

Among the most cited works in graph learning is a paper by Kipf and Welling. The paper introduced spectral convolutions to graph learning, and was dubbed simply as “graph convolutional networks”, which is a bit misleading since it is classified as a spectral method and is by no means the origin of all subsequent works in graph learning.

In Kipf and Welling's GCN, a convolution is defined by:

gθ ★ x = Σ_{k=0}^{K} θk Tk(L̃) x   (eqn. 1)

Where gθ is a kernel (θ represents the parameters) which is applied (represented by the star ★) to x, a graph signal. K stands for the number of nodes away from the target node to consider (the Kth-order neighborhood, with k being the closest-order neighbor). T denotes the Chebyshev polynomials as applied to L̃, which is given by:

L̃ = (2 / λmax) L − I_N   (eqn. 2)

(In the original paper, part of the simplification included assuming λmax = 2.)

Where λmax denotes the largest eigenvalue of L, the normalized graph Laplacian. Multiple convolutions can be performed on a graph, and the output is aggregated into Z:

Z = D̃^(−1/2) Ã D̃^(−1/2) X Θ   (eqn. 3)

Where Z is a matrix of convolved signals (from neighboring nodes), Ã is the adjacency matrix of the graph plus the identity matrix, D̃ is the diagonal node degree matrix of Ã, Θ is a matrix of kernel/filter parameters (which can be shared over the whole graph), and X is a matrix of node feature vectors. The D̃^(−1/2) factors are part of a renormalization trick to avoid both exploding and vanishing gradients.

Equations 1 and 3 are the components that get wrapped in an activation function (ReLU, sigmoid, etc.). The combined result is the layer-wise propagation rule that makes up a single layer of a GCN. The key innovation is the use of the Fourier transform, a popular operation in quantum mechanics (in the study of qubits) and signal processing (in the study of waves).
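A minimal NumPy sketch of that layer-wise propagation rule, i.e. σ(D̃^(−1/2) Ã D̃^(−1/2) H Θ) with Ã = A + I (forward pass only, random weights):

```python
import numpy as np

def gcn_layer(A, H, Theta):
    # One GCN layer: renormalized adjacency times features times weights,
    # wrapped in a nonlinearity (ReLU here).
    n = A.shape[0]
    A_tilde = A + np.eye(n)                          # add self-loops
    D_tilde_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))
    A_hat = D_tilde_inv_sqrt @ A_tilde @ D_tilde_inv_sqrt
    return np.maximum(A_hat @ H @ Theta, 0.0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.randn(4, 8)                            # node feature matrix
Z = gcn_layer(A, X, np.random.randn(8, 16))          # convolved signals Z
```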

ChebNets and GCNs are very similar; their largest difference is the choice of the value K in eqn. 1. In a GCN, the layer-wise convolution is limited to K = 1. This is intended to alleviate the risk of overfitting on a local neighborhood of the graph.

GCNs performed well in node classification tasks and other graph applications, but their main drawback is how eigenvalues tend to cluster together in a very small range, with large gaps between the clusters. The same holds, to a lesser degree, for ChebNets. This problem was later addressed by CayleyNets.

(Source: https://towardsdatascience.com/graph-convolutional-networks-for-geometric-deep-learning-1faf17dee008)

Spectral Methods — Summary

The drawbacks of spectrum-based methods are clear: computing the Laplacian's eigendecomposition over the whole graph is very expensive, and they cannot be applied in inductive settings. Spectrum-free methods such as GCNs replace the expensive eigendecomposition with polynomial expansions, but they still need access to the whole graph's adjacency matrix. This led researchers to borrow further from convolutional neural networks and create spatial methods.

Spatial Methods

Compared with spectrum-based methods, spatial methods do not need access to the whole adjacency matrix, which greatly improves scalability. In addition, spatial graph convolutions can use neighborhood sampling and attention mechanisms to mitigate the training challenges caused by the irregular structure of graphs.

Spatial Methods — GraphSAGE (sampling-based method)

All of the models introduced so far are transductive; in other words, when the graph changes they must be retrained. GraphSAGE is a representative inductive method: instead of learning a representation for each node directly, it learns aggregation functions. Once the aggregation functions are learned, the embedding of a newly added node can be generated directly, without retraining from scratch.

GraphSAGE's computation has two steps:

  • Neighbor sampling: because node degrees vary, a fixed number of neighbors is sampled for each node at each of the K layers for efficiency (GraphSAGE samples 25 nodes at the first layer and 10 at the second, so at most 250 nodes).
  • Neighbor feature aggregation (aggregator): the sampled neighbors' features are aggregated to update the current node's features; the neighbors aggregated at layer k are the layer-k neighbors of the BFS sampling process.

The aggregator function must be insensitive to ordering, because the randomly sampled nodes have no inherent order. Several choices are possible (and you can design your own); a minimal mean-aggregator sketch follows the list below:

  • Mean aggregator: the most intuitive choice; average all neighbor node features
  • LSTM aggregator: an LSTM's input is ordered, so the sampled nodes need to be shuffled each time
  • Pooling aggregator: also intuitive, e.g. max pooling (take the maximum)
  • GCN aggregator: a graph convolution aggregator
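A sketch of one GraphSAGE layer with neighbor sampling and a mean aggregator (NumPy; the 25/10 sample sizes mentioned above are the paper's defaults, shrunk here for a toy graph; this is an illustration, not the DGL implementation linked below):

```python
import numpy as np

def sample_neighbors(W, node, num_samples, rng):
    # Sample a fixed number of neighbors (with replacement if the degree
    # is smaller than num_samples) to keep the computation cost bounded.
    neighbors = np.nonzero(W[node])[0]
    return rng.choice(neighbors, size=num_samples, replace=True)

def sage_mean_layer(W, X, node, Theta_self, Theta_neigh, num_samples, rng):
    # Mean aggregator: average the sampled neighbors' features, then combine
    # them with the node's own features through learned weights.
    sampled = sample_neighbors(W, node, num_samples, rng)
    h_neigh = X[sampled].mean(axis=0)
    h = X[node] @ Theta_self + h_neigh @ Theta_neigh
    return np.maximum(h, 0.0)
```

Because a new node's embedding depends only on the learned weights and its neighbors' features, the same function can be applied inductively to unseen nodes.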

Spatial Methods — Graph Attention Networks (attention-based method)

If GraphSAGE simplifies things by sampling a fixed number of neighbors, Graph Attention Networks (GAT) instead add an attention mechanism that learns how important each neighbor is. Because the attention mechanism learns a weighted sum over all neighbors, its computational cost is naturally higher than that of GraphSAGE or Graph Convolution Networks (GCN). In fact, GCN can be seen as a special case of GAT in which all 1-hop neighbors receive equal attention weight.
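A single-head sketch of how GAT scores a node's 1-hop neighbors (NumPy, simplified from the original paper's formulation: a shared linear map, a LeakyReLU-scored concatenation, and a softmax over the neighborhood):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_attention(A, X, W_lin, a, node):
    # Attention coefficients of `node` over its neighbors (including itself),
    # then the weighted sum of their transformed features.
    H = X @ W_lin                                    # shared linear transform
    neighbors = np.concatenate(([node], np.nonzero(A[node])[0]))
    scores = np.array([
        leaky_relu(a @ np.concatenate((H[node], H[j]))) for j in neighbors
    ])
    alpha = softmax(scores)                          # attention weights
    return alpha @ H[neighbors]                      # weighted sum of neighbors
```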

Addendum: comparing spectral vs. message passing networks

https://towardsdatascience.com/a-tale-of-two-convolutions-differing-design-paradigms-for-graph-neural-networks-8dadffa5b4b0

Afterword

I actually started writing this post in early February, but various things, including my manager leaving and quarantine, drained my motivation for a while, and I only just managed to finish it. It was only when my current manager also left that I truly realized how lucky I had been: I got to learn and experiment a lot at work because someone above me kept carving out space for me to study the technologies I wanted to (without wasting too much time on pointless cross-team, cross-timezone communication). With my former manager leaving in the middle of last year, the closely collaborating teammates in the UK with similar roles leaving one after another, and the former teammate who was promoted to be my manager also leaving, there is probably no one left at the company to create that kind of space for us anymore (sigh).

Links

Videos

https://www.youtube.com/watch?v=iWq5XKtJodU
https://www.youtube.com/watch?v=LvNu_xRlXBE
https://www.youtube.com/channel/UCj8shE7aIn4Yawwbo2FceCQ

Lectures/medium notes

https://csustan.csustan.edu/~tom/Clustering/GraphLaplacian-tutorial.pdf
http://snap.stanford.edu/proj/embeddings-www/files/nrltutorial-part1-embeddings.pdf
https://cs.stanford.edu/~mpkim/notes/lec6.pdf
https://towardsdatascience.com/overview-of-deep-learning-on-graph-embeddings-4305c10ad4a4
https://towardsdatascience.com/graph-convolutional-networks-for-geometric-deep-learning-1faf17dee008
https://blog.usejournal.com/attention-and-gates-for-geometric-deep-learning-435deb9df9b
http://www.norbertwiener.umd.edu/Research/lectures/2014/MBegue_Prelim.pdf
https://machinelearningmastery.com/introduction-to-eigendecomposition-eigenvalues-and-eigenvectors/
https://gordicaleksa.medium.com/how-to-get-started-with-graph-machine-learning-afa53f6f963a

Code

https://github.com/dmlc/dgl/blob/master/examples/pytorch/gat/gat.py
https://docs.dgl.ai/en/latest/tutorials/models/4_old_wines/7_transformer.html#sphx-glr-tutorials-models-4-old-wines-7-transformer-py
https://docs.dgl.ai/en/0.4.x/tutorials/models/5_giant_graph/1_sampling_mx.html?highlight=graphsage

Papers

Wavelets on Graphs via Spectral Graph Theory
Neural Message Passing for Quantum Chemistry

