kmeans

The k-means algorithm clusters the given data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the within-cluster sum-of-squares. To learn more about it, jump to the algorithm page.
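Since the module's implementation is Python, the behavior described above can be sketched with scikit-learn's `KMeans` (a minimal illustration, not the module's actual code; the parameter values mirror the defaults documented on this page where they are known):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D "embeddings": two well-separated blobs of 10 points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 2)),
    rng.normal(5.0, 0.1, size=(10, 2)),
])

km = KMeans(n_clusters=2, init="k-means++", n_init=10,
            max_iter=10, tol=1e-4, random_state=1998).fit(points)

labels = km.labels_   # cluster ID per point, analogous to cluster_id below
wcss = km.inertia_    # the within-cluster sum-of-squares being minimized
```

With two clearly separated blobs, each blob ends up in its own cluster and the inertia stays small.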

| Trait | Value |
| --- | --- |
| Module type | module |
| Implementation | Python |
| Graph direction | directed/undirected |
| Edge weights | weighted/unweighted |
| Parallelism | sequential |
Too slow?

If this algorithm implementation is too slow for your use case, contact us and request a rewrite to C++!

Procedures​

info

If you want to execute this algorithm on graph projections, subgraphs or portions of the graph, be sure to check out the guide on How to run a MAGE module on subgraphs.

get_clusters(n_clusters, embedding_property, init, n_init, max_iter, tol, algorithm, random_state)​

For each node, this procedure returns the cluster it belongs to.

Input:​

  • n_clusters : int ➑ Number of clusters to be formed.
  • embedding_property : str (default: "embedding") ➑ Node property where embeddings are stored.
  • init : str (default: "k-means++") ➑ Initialization method. If k-means++ is selected, initial cluster centroids are sampled per an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to be O(log k)-optimal. If random, n_clusters observations (rows) are randomly chosen for the initial centroids.
  • n_init : int (default: 10) ➑ Number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter : int (default: 10) ➑ Maximum number of iterations of the k-means algorithm in a single run.
  • tol : float (default: 1e-4) ➑ Relative tolerance of the Frobenius norm of the difference of cluster centers across consecutive iterations. Used in determining convergence.
  • algorithm : str (default: "auto") ➑ K-means variant to use. Options are lloyd, elkan, auto, full.
  • random_state : int (default: 1998) ➑ Random seed for the algorithm.
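The k-means++ seeding described for the init parameter can be sketched as follows (a hypothetical minimal implementation for illustration, not the module's actual code): the first centroid is drawn uniformly at random, and each subsequent one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far.

```python
import numpy as np

def kmeans_pp_init(points, n_clusters, rng):
    """Sample initial centroids per the k-means++ scheme."""
    # First centroid: uniformly random point.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(n_clusters - 1):
        # Squared distance of every point to its nearest chosen centroid.
        d2 = np.min(
            [((points - c) ** 2).sum(axis=1) for c in centroids], axis=0
        )
        # Sample the next centroid proportionally to that distance.
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

# Two coincident blobs of points: the seeding always picks one from each.
rng = np.random.default_rng(1998)
pts = np.vstack([np.zeros((5, 2)), np.full((5, 2), 5.0)])
centers = kmeans_pp_init(pts, 2, rng)
```

Because points in the already-seeded blob have zero distance to their centroid, the second draw necessarily lands in the other blob, which is why this seeding converges faster than uniform random initialization.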

Output:​

  • node: mgp.Vertex ➑ Graph node.
  • cluster_id: mgp.Number ➑ Cluster ID of the above node.

Usage:​

CALL kmeans.get_clusters(2, "embedding", "k-means++", 10, 10, 0.0001, "auto", 1)
YIELD node, cluster_id
RETURN node.id AS node_id, cluster_id
ORDER BY node_id ASC;

set_clusters( n_clusters, embedding_property, cluster_property, init, n_init, max_iter, tol, algorithm, random_state)​

For each node, this procedure determines the cluster it belongs to and writes the cluster ID to the property named by cluster_property.

Input:​

  • n_clusters : int ➑ Number of clusters to be formed.
  • embedding_property : str (default: "embedding") ➑ Node property where embeddings are stored.
  • cluster_property: str (default: "cluster_id") ➑ Node property where cluster_id will be stored.
  • init : str (default: "k-means++") ➑ Initialization method. If k-means++ is selected, initial cluster centroids are sampled per an empirical probability distribution of the points’ contribution to the overall inertia. This technique speeds up convergence and is theoretically proven to be O(log k)-optimal. If random, n_clusters observations (nodes) are randomly chosen for the initial centroids.
  • n_init : int (default: 10) ➑ Number of times the k-means algorithm will be run with different centroid seeds.
  • max_iter : int (default: 10) ➑ Maximum number of iterations of the k-means algorithm in a single run.
  • tol : float (default: 1e-4) ➑ Relative tolerance of the Frobenius norm of the difference of cluster centers across consecutive iterations. Used in determining convergence.
  • algorithm : str (default: "auto") ➑ K-means variant to use. Options are lloyd, elkan, auto, full.
  • random_state : int (default: 1998) ➑ Random seed for the algorithm.

Output:​

  • node: mgp.Vertex ➑ Graph node.
  • cluster_id: mgp.Number ➑ Cluster ID of the above node.

Usage:​

CALL kmeans.set_clusters(2, "embedding", "cluster_id", "k-means++", 10, 10, 0.0001, "auto", 1)
YIELD node, cluster_id
RETURN node.id AS node_id, cluster_id
ORDER BY node_id ASC;

Example​