NeurIPS Day 1: Graph Mining at Scale Workshop

Jenny Ching
10 min read · Dec 23, 2020

--

The Sunday workshop, hosted by Google Research, went over a series of talks on graphs and how to scale them. I will pick the talks I learned the most from and include some of their slides to illustrate the ideas.

Talk 1: Grale: Learning Graphs

space: semi-supervised graph learning
focus: scalability, multi-modal graph
techniques: LSH, neural network/gradient boosted decision tree

How can we find the right graph for semi-supervised learning? In real world applications, the choice of which edges to use for computation is the first step in any graph learning process. There are often many types of similarity available to choose as the edges between nodes, and the choice of edges can drastically affect the performance of downstream semi-supervised learning systems.

Real-world applications tend to be multi-modal, and so this lack of guidance is especially problematic. Rather than there being just one simple feature space to choose a similarity on, we have a wealth of different modes to compare similarities in. For example, video data comes with visual signals, audio signals, text signals, and other kinds of metadata — each with their own ways of measuring similarity. The right answer is to choose a combination of these.

Grale operates by fusing together different measures of (potentially weak) similarity to create a graph which exhibits high task-specific homophily between its nodes.

The Graph Design Problem

Grale is proposed to solve the graph design problem in multi-modal feature spaces; in particular, how do we select the right graph? For the graph in the slide below, the goal is to find an algorithm that separates the green and red nodes into two clusters by calculating a similarity score for each node pair.

The goal here is to train on a log loss over node pairs to determine whether two nodes are the same or different.

Grale does this in a scalable fashion with two steps:

Step 1: Candidate pair generation via LSH

The first step is to bucket the nodes via locality sensitive hashing (LSH) for fast nearest-neighbor search. The LSH function works by projecting nodes onto a line and slicing the line into buckets with associated ids. Nearby nodes are likely to land in the same bucket, while far-apart nodes are not.
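To make the bucketing concrete, here is a minimal sketch of random-projection LSH in Python; the projection count, bucket width, and hashing scheme are illustrative choices of mine, not Grale's actual implementation.

```python
import numpy as np

def lsh_buckets(X, n_projections=8, bucket_width=1.0, seed=0):
    """Assign each row of X to an LSH bucket id via random projections.

    Nearby points tend to land in the same bucket, so candidate pairs
    only need to be generated within buckets instead of over all O(n^2) pairs.
    """
    rng = np.random.default_rng(seed)
    # Random directions to project onto ("the line" in the talk).
    projections = rng.normal(size=(X.shape[1], n_projections))
    offsets = rng.uniform(0, bucket_width, size=n_projections)
    # Slice each projected line into buckets of fixed width.
    codes = np.floor((X @ projections + offsets) / bucket_width).astype(int)
    # The tuple of bucket indices across projections is the bucket id.
    return [tuple(row) for row in codes]

# Usage: nodes that share a bucket id become candidate pairs for the scorer.
X = np.random.rand(1000, 64)
buckets = lsh_buckets(X)
```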

Step 2: Train the similarity model by scoring the candidate pairs

The second step is to train a pairwise model to predict whether a pair of nodes belongs to the same class or is otherwise similar. The idea is to back-propagate the similarity loss into the embedding layers, obtaining an embedding tailored to the task.

The slide below shows the model structure: the gray nodes are inputs, the blue nodes are hidden layers, and the red node is the output. The architecture combines a standard two-tower model with natural distances in the input feature spaces. Weights are shared between the two towers. The Hadamard product (point-wise multiplication) of the two tower outputs gives a pairwise embedding, which we treat as an additional set of distance features to augment the input distance features κ1(xi, xj), …, κd(xi, xj). This combined set is the input to the second part of the model, which computes a single similarity score.
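Here is a rough PyTorch sketch of that architecture as I understand it; the layer sizes and names are my own assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class PairwiseSimilarity(nn.Module):
    """Two-tower model with shared weights, combined with hand-crafted
    distance features kappa(x_i, x_j), producing one similarity score."""

    def __init__(self, in_dim, emb_dim, n_distance_features):
        super().__init__()
        # Shared tower: the same weights embed both x_i and x_j.
        self.tower = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim), nn.ReLU(),
        )
        # Head consumes the pairwise embedding plus the distance features.
        self.head = nn.Sequential(
            nn.Linear(emb_dim + n_distance_features, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x_i, x_j, distance_features):
        h_i, h_j = self.tower(x_i), self.tower(x_j)
        pair_emb = h_i * h_j                      # Hadamard product
        z = torch.cat([pair_emb, distance_features], dim=-1)
        return torch.sigmoid(self.head(z))        # similarity in (0, 1)

# Trained with binary cross-entropy (log loss) on same-class / different-class pairs.
```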

We can now label nodes by computing a score for each class as a weighted nearest-neighbor score, weighting by the log of the similarity scores. We choose the class with the highest score as the predicted label.
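A rough sketch of that labeling rule, assuming each node already has a list of scored, labeled candidate neighbors; the exact log-based weighting used in the paper may differ from the placeholder below.

```python
import math
from collections import defaultdict

def predict_label(scored_neighbors, weight=lambda s: -math.log(1.0 - s)):
    """scored_neighbors: list of (similarity, label) pairs for one node,
    with similarity strictly in (0, 1). The default weight is one plausible
    log-scale transform so near-certain matches dominate; the paper's exact
    weighting may differ."""
    class_scores = defaultdict(float)
    for similarity, label in scored_neighbors:
        class_scores[label] += weight(similarity)
    # Predicted label is the class with the highest weighted-neighbor score.
    return max(class_scores, key=class_scores.get)
```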

In summary, Grale is:
(1) An algorithm and model for learning a task specific graph for semi-supervised applications.
(2) An efficient algorithm for building a nearest neighbor graph with such a model.
(3) A demonstration of the effectiveness of this approach for fighting abuse on YouTube, with hundreds of millions of items.

If you want to read more, you can refer to the paper.

Talk 2: Label Propagation by Semi-supervised Learning

space: semi-supervised graph learning
focus: label propagation
techniques: neural network

Semi-supervised Learning

is to label all nodes when only a small fraction (<10%) have labels, by running label propagation. It starts from seed nodes and propagates labels outward until they overlap, as shown in the slide below (where labels ‘4’ and ‘9’ overlap). After propagating the labels, a set of reserved labels is used to validate the predictions and to set a threshold, which then serves as a precursor for downstream supervised training:

Compute precision and recall by separating nodes into training and validation sets (the validation labels serve as ground truth):

The semi-supervised learning method is very general and fits many applications; it only needs label weights and the edges of the graph:

There is also a rich label representation to support different classification problems. For example, for binary classification you can use positive weights for True labels and negative weights for False labels.

To scale semi-supervised learning to a large number of potential labels, keeping only the top-k labels per node has been shown to help.

The goal of iterative propagation is to minimize a loss function with three parts: a label loss against the ground truth on seed nodes, a loss between neighboring nodes, and a prior loss that compares each node's predicted label distribution to a prior distribution.
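Written out, the objective over the propagated label distributions Ŷ has roughly this shape (notation is mine; the exact weights and norms used in the talk may differ):

```latex
\min_{\hat{Y}} \;
\underbrace{\sum_{v \in \text{seeds}} \big\| \hat{Y}_v - Y_v \big\|^2}_{\text{label loss vs. ground truth}}
\; + \; \mu_1 \underbrace{\sum_{(u,v) \in E} w_{uv} \, \big\| \hat{Y}_u - \hat{Y}_v \big\|^2}_{\text{smoothness between neighbors}}
\; + \; \mu_2 \underbrace{\sum_{v \in V} \big\| \hat{Y}_v - U_v \big\|^2}_{\text{prior loss}}
```

Here Y are the seed labels, w_uv the edge weights, U a prior (e.g. uniform) label distribution, and μ1, μ2 trade off the three terms.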

For more information, go to the Google AI blog and the Google Cloud AI experiments.

Talk 3: GNN and Graph Embeddings

space: graph embeddings learning
focus: GCN, taxonomy
techniques: GCN

Graph embedding maps discrete graph data into a lower-dimensional continuous space, in the hope that the embedding faithfully represents the graph data:

Unsupervised graph embedding encodes the graph as it is. Some examples from the slide below: turning an undirected graph into a directed one; zooming out over graph structures to learn better embeddings; and applying graph attention over distances to reconstruct the graph as well as possible.

Once the graph is encoded as embeddings, you can apply models the same way you would for image or text embeddings.

GCN step by step

A GCN is represented by the adjacency matrix: it keeps aggregating neighbors with a scoring function until the whole graph reaches a new representation that describes the entire graph. Downstream tasks can then use the learned node embeddings to classify nodes or do link prediction.

The following is a step-by-step guide to how GCN works:

  1. The feature vectors could be embeddings of each node, such as the text of a website, and the links would be the references between websites:
Node embeddings to get feature vectors

2. You would take the feature vector of the seed node and its neighbors, and project them into a new latent space to learn a new vector.

take the feature vector of the seed node and its neighbors

3. Do this for every node in the graph, until the whole graph is projected into the new space. Repeat several times; every node then has a new latent representation learned by combining features from its neighbors, and the labels are used to actually train the network.

4. You now have node features that you can use for predictions, e.g. to label nodes whose labels are missing, and you can learn the embeddings jointly with the classification.

You can also train similar nodes' embeddings to be closer together for unsupervised link prediction, or pool over and over until you get a vector that represents the whole graph and use it for classification or as a graph embedding.

GCN in formality

  1. average with neighbors: multiply the adjacency matrix by the node feature matrix and pass the result through a ReLU non-linearity, i.e. H(l+1) = ReLU(A · H(l) · W(l)), starting from the input features H(0) (see the NumPy sketch below)
  2. repeat several times until you get graph embeddings that can be used for tasks such as node labeling
  3. perform semi-supervised classification
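As a minimal NumPy sketch of one such layer (the symmetric normalization with self-loops below is the common convention; the talk does not spell out these details):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(A_hat H W).

    A : (n, n) adjacency matrix
    H : (n, d_in) node features / previous-layer embeddings
    W : (d_in, d_out) learnable weight matrix
    """
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # symmetric normalization
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)    # aggregate neighbors, then ReLU

# Stacking layers repeats the neighbor averaging; the final H can feed a
# node classifier for semi-supervised labeling.
```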

More general version of GCN

In a more general sense, the message passing and aggregation functions can be more flexible than the simple average over all neighbors that GCN uses. You can design a message passing mechanism tailored to your use case, as long as it learns how a message should be generated from a node's own features and its neighbors' features (a generic sketch follows the captions below):

message computed for each node
message passing to each node from its neighbors
updated final node embedding states, ready for training
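A generic sketch of that message / aggregate / update loop; the function names here are placeholders for whatever mechanism you design for your use case.

```python
def message_passing_step(graph, h, make_message, aggregate, update):
    """One round of general message passing over node states h.

    graph        : dict mapping each node to its list of neighbors
    h            : dict mapping each node to its current embedding
    make_message : builds a message from (sender state, receiver state)
    aggregate    : combines a list of incoming messages (sum, mean, attention, ...)
    update       : produces the node's new state from (old state, aggregated message)
    """
    new_h = {}
    for node, neighbors in graph.items():
        messages = [make_message(h[nbr], h[node]) for nbr in neighbors]
        new_h[node] = update(h[node], aggregate(messages))
    return new_h

# GCN is the special case where make_message is the identity, aggregate is a
# (normalized) mean, and update is a linear layer followed by a ReLU.
```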

There’s so much more!

Common elements: encoder/decoder paradigm

If you want to read more, head on to the survey paper here.

Challenges for GCN

GCNs fail when the edges don't match the labels on the nodes you care about, as the slide below shows: a white node surrounded by blue nodes, and vice versa.

Despite GNNs being a major new focus of research, challenges remain: how do you make a GNN train faster? How do you debias a GNN? And how do you disentangle graphs with complex clusters? The following talks in the workshop help address these three challenges.

I'm going to cover two of the talks: PPRGo and Learning Multiple Embeddings.

Talk 4: PPRGo

PPRGo is designed to address two reasons GCNs don't scale to large real-world graphs:

  • it separates the aggregation calculation into an offline precomputation step
  • it weights nodes by importance instead of blindly averaging over all neighbors

It does this by first performing random walks and using the visit-frequency distribution as a proxy for node importance, which then serves as the aggregation weights.

Scaling (paper: https://arxiv.org/pdf/2007.01570.pdf)

The offline calculation uses ACL's algorithm.

PPRGo precomputes and sorts nodes, with nodes that appear more often in the random walks scored higher in importance. At runtime it aggregates the most important nodes in the N-hop neighborhood while paying only the cost of a 1-hop computation.
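Here is a rough sketch of that idea using Monte Carlo random walks as the importance estimate; the real system precomputes the weights with ACL's push-based algorithm, but visit frequencies convey the same intuition, and the function names below are my own.

```python
import random
from collections import Counter

def approx_ppr(graph, seed, n_walks=1000, alpha=0.15, top_k=32):
    """Approximate personalized PageRank for `seed` by counting where
    alpha-restart random walks spend their time; keep only the top_k nodes."""
    visits = Counter()
    for _ in range(n_walks):
        node = seed
        while random.random() > alpha:          # restart with probability alpha
            neighbors = graph[node]
            if not neighbors:
                break
            node = random.choice(neighbors)
            visits[node] += 1
    total = sum(visits.values()) or 1
    return {n: c / total for n, c in visits.most_common(top_k)}

def pprgo_aggregate(features, ppr_weights):
    """Runtime aggregation: a weighted sum over the precomputed top-k
    important nodes instead of recursive N-hop neighbor expansion."""
    return sum(w * features[n] for n, w in ppr_weights.items())
```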

The slide below shows how PPRGo runs everything end-to-end at scale:

Compute PPR weight vectors offline at scale with ACL's algorithm

Talk 5: Learning Multiple Embeddings

The common way to compute node embeddings is via random walks, which treats node paths like sentences and embeds them into vectors just as words are embedded.

But a word can appear in different contexts that call for different representations, and the same is true for nodes in graphs: a node can mean different things in different parts of the graph.

So we want to learn multiple embeddings per node, based on context.

The solution to this problem is based on ego-nets: partition the graph into non-overlapping clusters and train embeddings on the result. The intuition is shown in the following slides, where a node representing two roles/concepts is duplicated, so that each node can be represented by multiple positions in the embedding:

Intuition for Ego-net

In this way, the Splitter embeddings are more disentangled and the clusters are more tightly connected:

Each node can be represented by multiple positions in the Splitter embedding
Looks better than the original graph, with clearer clusters!
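To make the ego-net splitting concrete, here is a minimal networkx sketch of building a persona graph; I use connected components of each node's ego-net as the clusters, which is a simplification of what the actual method does.

```python
import networkx as nx

def build_persona_graph(G):
    """Split each node into one 'persona' per connected component of its
    ego-net, so a node that plays two roles in the graph is represented by
    two separate nodes (and, later, two embedding positions)."""
    # For every node, record which persona each of its neighbors belongs to.
    persona_of_neighbor = {}          # node -> {neighbor: persona_name}
    personas = []
    for node in G:
        ego = G.subgraph(G.neighbors(node))   # ego-net without the node itself
        persona_of_neighbor[node] = {}
        for i, component in enumerate(nx.connected_components(ego)):
            persona = f"{node}::{i}"
            personas.append(persona)
            for nbr in component:
                persona_of_neighbor[node][nbr] = persona
    # Connect the persona of u that "sees" v to the persona of v that "sees" u.
    P = nx.Graph()
    P.add_nodes_from(personas)
    for u, v in G.edges():
        P.add_edge(persona_of_neighbor[u][v], persona_of_neighbor[v][u])
    return P

# Embeddings trained on P give each original node one vector per role.
```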

There are also some other talks that I did not cover; the workshop link is here, where you can find more content.
