There are a handful of parallel algorithms for community detection, for example this paper and this paper. Are there plans to implement any of them?
Our member who is most closely involved with community detection methods is not around at the moment. I suspect he might respond when he is back.
I assume that the reason you are asking is that you need speed, i.e. to handle larger networks faster. Can you explain what limits you ran into? How big (and of what sort) is the network that you could not handle with existing methods? Which methods did you try?
I don’t really have a hard block. I’ve been running experiments with the SNAP datasets, and in the future I will likely run more of a size similar to the LiveJournal data. I basically tried every algorithm from igraph and networkx, and only Infomap, Louvain, Leiden and fluid communities could finish overnight. The quality of the detection was not good enough, so I will need to rerun with different parameters, or just find better datasets. I understand that parallelism, even where possible, may not help the other algorithms because of their computational complexity. But even if only the tractable algorithms could be sped up, that would accelerate my experiments.
Maybe label propagation also finished, but iirc the result quality was really bad (a few huge communities and tons of single-node communities).
Hi @remysucre, there are no plans to introduce parallel versions of the community detection algorithms, at least not from my side. Although parallelization might improve performance, as you indicate, a number of algorithms do finish (even if it takes overnight). I believe that the Leiden algorithm should be one of the fastest among those (disclaimer: I am one of the authors of that algorithm).
What “quality” was not good enough?
The Leiden algorithm also supports a different objective function (CPM, instead of modularity), and you can also tune the resolution parameter. If your problem is that you need to explore many different resolution parameters, that is trivial to parallelize, and it will give a much larger gain than trying to parallelize the algorithms themselves.
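As a rough illustration (not an official igraph feature, just a sketch assuming the `leidenalg` Python package and a GraphML input file; the file name and the resolution grid are placeholders), a resolution sweep can be distributed over worker processes like this:

```python
# Sketch: parallel sweep over CPM resolution parameters with leidenalg and
# multiprocessing. "network.graphml" and the resolution grid are placeholders.
import multiprocessing as mp

import igraph as ig
import leidenalg as la

GRAPH = None  # set once per worker by the pool initializer


def _init_worker(path):
    # Load the graph once per worker process instead of pickling it per task.
    global GRAPH
    GRAPH = ig.Graph.Read_GraphML(path)


def run_leiden(resolution):
    # One Leiden run with the CPM objective at the given resolution.
    part = la.find_partition(
        GRAPH, la.CPMVertexPartition, resolution_parameter=resolution, seed=42
    )
    return resolution, len(part), part.quality()


if __name__ == "__main__":
    resolutions = [0.001, 0.005, 0.01, 0.05, 0.1]
    with mp.Pool(processes=4, initializer=_init_worker,
                 initargs=("network.graphml",)) as pool:
        for res, n_comms, quality in pool.map(run_leiden, resolutions):
            print(f"resolution={res}: {n_comms} communities, quality={quality:.3f}")
```

Each run is independent of the others, so the speedup is essentially linear in the number of workers.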
I am not sure what the goal of your analysis is. Of course, one dataset is not better than another just because a community detection algorithm somehow works “better” on it.
For context, I’m trying to exploit community structure to improve data analytics. I can share more details offline (via email etc.), but here are my main constraints:
1. There should be few cross-community edges compared to intra-community edges: between any pair of communities, there should be far fewer edges (like 1000x fewer) than within a community.
2. There should be a few communities of similar size, rather than many communities of widely varying sizes. Even better if I can specify the number of communities; usually this number is under 100 and can go below 10.
Note that constraint 2 can be achieved even if the “ground truth” is many communities of different sizes: in that case I can just merge the small ones and sacrifice some intra-community density (I’m fine with that). Given these constraints, I’ve been looking into tools that perform k-way graph partitioning, e.g. KaHyPar, which seems to fit my needs better, but I will still consider community detection.
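For illustration, here is a rough sketch of the merging step I have in mind (my own code, not from any library; the function name and the choice to merge into the most-connected neighbouring community are just my assumptions), using python-igraph:

```python
# Sketch: post-process a partition so that at most k communities remain, by
# repeatedly merging the smallest community into the neighbouring community it
# shares the most edges with. Trades intra-community density for balance.
from collections import Counter

import igraph as ig


def merge_small_communities(graph, membership, k):
    membership = list(membership)
    while len(set(membership)) > k:
        sizes = Counter(membership)
        smallest = min(sizes, key=sizes.get)
        # Count edges from the smallest community to every other community.
        neighbour_edges = Counter()
        for edge in graph.es:
            a, b = membership[edge.source], membership[edge.target]
            if a == smallest and b != smallest:
                neighbour_edges[b] += 1
            elif b == smallest and a != smallest:
                neighbour_edges[a] += 1
        if neighbour_edges:
            target = neighbour_edges.most_common(1)[0][0]
        else:
            # Isolated community: fold it into the next-smallest one instead.
            target = min((c for c in sizes if c != smallest), key=sizes.get)
        membership = [target if m == smallest else m for m in membership]
    return membership


# Example usage (placeholder file name and k):
# g = ig.Graph.Read_GraphML("network.graphml")
# merged = merge_small_communities(g, g.community_multilevel().membership, k=8)
```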
And now to answer your questions, @vtraag:
Leiden is indeed the fastest, awesome work!
The implemented algorithms usually return a few giant communities together with a great many tiny ones, which doesn’t meet my constraints. There are also just too many cross-community edges.
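For reference, this is roughly how I check both of those things with python-igraph (the graph and the detection method below are only stand-ins for my actual data and runs):

```python
# Sketch: report the community size distribution and the fraction of
# cross-community edges for an igraph VertexClustering.
import igraph as ig

g = ig.Graph.Famous("Zachary")          # stand-in for the real network
clustering = g.community_multilevel()   # stand-in for the method under test

sizes = sorted(clustering.sizes(), reverse=True)
crossing = clustering.crossing()        # True for edges between communities
cross_fraction = sum(crossing) / g.ecount()

print(f"{len(sizes)} communities, largest sizes: {sizes[:5]}")
print(f"cross-community edges: {cross_fraction:.1%} of all edges")
```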
That’s absolutely right.
Since my goal is to leverage community structure for data analytics, it might be the case that the datasets I have simply don’t exhibit strong enough community structure. I got some good results by taking the ground-truth communities from synthetic LFR benchmarks, which have really strong community structure.
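For completeness, this is roughly how such an LFR benchmark can be generated with networkx and how the planted (“ground truth”) communities are read back; the parameter values are just the ones from the networkx documentation example, and lowering mu makes the community structure stronger (though generation may then fail to converge):

```python
# Sketch: generate an LFR benchmark graph and extract the planted communities.
import networkx as nx

G = nx.LFR_benchmark_graph(
    n=250, tau1=3, tau2=1.5, mu=0.1,    # mu ~ fraction of cross-community edges
    average_degree=5, min_community=20, seed=10,
)
# Each node stores its planted community as a frozenset under the "community" key.
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(f"{len(communities)} planted communities")
```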