Extracting ground-truth clusters from sample_sbm result

Hello!

Firstly, really, really, really wonderful package; thank you so much for creating/maintaining it!

I’ve run into an issue using igraph on R that I can’t quite figure out.

I generate a graph using sample_sbm with 100 nodes, 2 clusters, and intra-cluster edge probability greater than inter-cluster edge probability. That is, I tell sample_sbm to generate a graph with two clusters by passing a 2-by-2 matrix to the pref.matrix parameter. My resulting graph looks like this: [plot of the sampled graph omitted]
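
For reference, a call along these lines matches the setup described; the specific probabilities and block sizes below are illustrative placeholders rather than the original values:

library(igraph)

# 2-by-2 preference matrix: the diagonal (intra-cluster) probabilities are
# larger than the off-diagonal (inter-cluster) probability
P <- matrix(c(0.30, 0.02,
              0.02, 0.30), nrow = 2)

# 100 nodes split into two blocks of 50 each
g <- sample_sbm(100, pref.matrix = P, block.sizes = c(50, 50))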

Is there some way to recover the ground-truth clusters (i.e., the clusters that sample_sbm presumably creates using pref.matrix)? It seems using cluster_leading_eigen worked for this particular case, but I was wondering if it’d be possible to recover the information used to generate the network in the first place. I’m having some trouble finding a solution in the documentation, etc.

If this is not possible, which of the clustering algorithms that igraph provides would be best suited (in terms of accuracy and time)? I’m working with larger networks with more clusters as well, so any way to recover this information efficiently would be fantastic!

Thank you!

If I understand the documentation correctly, the nodes are simply assigned to the blocks in consecutive order. That is, nodes 1 to block.sizes[1] are in block 1, nodes block.sizes[1] + 1 to block.sizes[1] + block.sizes[2] are in block 2, and so on. So, in your example, nodes 1-50 should be in block 1 while nodes 51-100 are in block 2. Does that answer your question?
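
In other words, the ground-truth membership can be reconstructed directly from the block sizes that were passed to sample_sbm; a minimal sketch, assuming the 100-node, two-block example above:

# Block sizes used when generating the graph (assumed here: 50 and 50)
block.sizes <- c(50, 50)

# Nodes are assigned to blocks consecutively, so the ground-truth
# membership vector is 50 ones followed by 50 twos
ground_truth <- rep(seq_along(block.sizes), times = block.sizes)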

For extracting communities from the graph there are quite a number of options; see all the cluster_* methods. In my experience, cluster_infomap and cluster_louvain work quite well. Hopefully the cluster_leiden method will soon become available in R as well; it improves on the cluster_louvain method (disclaimer: I am the author of the Leiden algorithm).
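
To check how well any of these methods recovers the planted partition, the detected membership can be compared against the ground-truth labels with igraph's compare(); a rough sketch, assuming g and ground_truth from the snippets above:

cl <- cluster_louvain(g)      # or cluster_infomap(g), cluster_leading_eigen(g), ...
membership(cl)                # detected community label of each node
compare(membership(cl), ground_truth, method = "nmi")  # 1 means perfect recovery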

Ah, thank you!

Are you getting that from the description of the block.sizes argument?

Using cluster_louvain on another dataset, this indeed seems to be the case:

   [1]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  [35]  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2  2  2  2  2
  [69]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
 [103]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2
 [137]  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2 11 11 11
 [171] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11
 [205] 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
 [239]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
 [273]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  5  5  5
 [307]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5
 [341]  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5

Thank you so much for your help!

Well, I must admit that the documentation is a bit unclear at this point, which should be improved. I just opened an issue for that on GitHub.

The C documentation and the Python documentation do mention the consecutive node order.

Ah, amazing; the documentation for Python and C clears things up significantly. I should have thought to look there. Thank you so much! Looking forward to the R implementation of cluster_leiden!
