Node Classification
Descriptionβ
Node classification is the problem of finding out the right label for a node based on its neighborsβ labels and structure similarities.
More formally, let G=(V, E)
be a graph with vertices (nodes) V
and edges (relationships) E
, S
the set of all labels, and f : V β S
a vertex labeling function that returns the correct label for each node.
From a given partial function g : Vβ β V β S
(Vβ
being training examples),
a node classification model aims to minimize the loss of the label prediction function gβ : V β S
.
The motivation for node classification models is the sociological concept of homophily. In social networks, individuals who associate and bond with each other tend to have similar properties. In graph structure, homophily means that neighboring nodes tend to have the same labels and similar connections.
In real-life datasets, the set V
can be a lot of things, often scientific publications, proteins, Reddit posts, etc. In the same respective order,
set E
can be a citation of publications, protein connections, or connection of posts commented by the same user. Therefore, node classification helps the user to classify publications
relevant to its field, check whether protein helps cure new diseases, or simply find out which subreddit a post naturally belongs to.
On incompletely labeled graphs, node classification seeks to find the right labels for unlabeled nodes.
Solutionβ
Solving methods range from algorithmic to machine learning-based. Both models work on top of the feature which describes each node. Together features and relationships between nodes help the algorithm decide what is the correct label for each node.
The key difference is when using traditional methods, an engineer must create distinguishing features for each node himself. On the other hand, the machine learning model learns and adapts almost any set of randomly initialized sets of numbers that represent node features.
For a traditional approach, extracting such a feature is not an easy problem, since there are many options to choose from β node properties, node adjacency, or the structure of the neighborhood. Traditional methods of extracting knowledge from nodes include measures of centrality, importance, or feature structure such as graphlets.
Graph machine learning models can work based on almost arbitrary features. Finding similar features for nodes with the same label, then learning and mapping such features as distinguishing ones is their biggest advantage. ML models are constantly developed. Here are some which are usually used in node classification tasks:
Previously labeled nodes can be used to determine the class of unclassified ones
Materialsβ
Implementationβ
Machine learning-powered node classification is provided within Memgraph MAGE. We encourage you to test it yourself and take a look at the implementation.
Use casesβ
Classifying nodes based on previously labeled data can help investigators to find out fraudsters in finance systems. Since fraudsters can often act very quickly with their tricks, applying a streaming-based node classification can help up building a real-time fraud detection system.
Fraudsters usually behave differently than common users. This way a system can be observed and once an unusual behavior is spotted it can immediately be revealed. This can help build a bulletproof computer security system.