Adding a new community detection algorithm

Hi, I have been a part of developing a new community detection algorithm, SpeakEasy2 (SE2, currently under review for publication). It was originally written in matlab but for performance reasons, I have ported it to C using igraph. The code for the C library is on github. Having read through the contribution pages on the igraph repository, I know there’s still some work that’s needed to be done before it would be ready to be merged in so I wanted to ask if this is something there would be interest in before going forward.

Additionally, as part of writing SE2, I wrote a k-smallest/largest algorithm for igraph vectors. This could be used to make a median function. If that’s something that may be useful, I could break that out into igraph as well.

It’s great to see that igraph is used in this way!

I would say, to be able to make a decision about inclusion in igraph, we’d need at least the following:

  • See an explanation of how this new method relates to older ones, and what it brings over them.
  • See the paper, with a full explanation of the method. Perhaps it’s better to wait until peer review is passed.

Ultimately we’d like to avoid including methods that are unlikely to gain any traction within the community, but this is difficult to judge.

I looked at your repo very briefly and incidentally noticed that the skewness function may be incorrect. value_sq is initialized to 0 then it is repeatedly multiplied by other values. It also tries to return false when the return type is igraph_real_t.

That’s understandable. And thanks, I have fix both those mistakes in skewness.

I can update when we hear more about the publication status. It is currently “accepted pending minor revisions” at genome biology.

For now, we can email a preprint and I can provide a quick summary of the methods goals and how it works, and provide some comparisons with other methods.

SpeakEasy2 is an extension of the speaker-listener propagation algorithm with the intent of producing reproducible partitions even in networks with high levels of cross community edges (common in biological networks). The combination of reproducible and stochastic is intended to instill confidence the node membership is not a coincident of the initial conditions and random seed. Additionally, the algorithm can be used to calculate overlapping partitions (though I haven’t added that to the C version yet, and I don’t think igraph has support for overlapping partitions?*)

Method

For most the iterations, SE2 behaves similar to a basic speaker-listener propagation algorithm (SLPA) but there are some tweaks to improve its performance. Decisions about labeling nodes are based on the expected number of neighbors with a given label rather than the absolute of number of neighbors with a label, there are three modes in addition to the traditional SLPA mode (referred to as the typical mode) to help with convergence and add randomness to step out of local minima, and the algorithm is performed multiple times, selecting the partition that is most representative of all partitions (as determined by pairwise normalized mutual information).

Instead of classifying each node based on the label the highest number of neighbors are tagged with, we calculate the expected number of neighbors with each label based on the global distribution of labels and look for the label with the highest number of neighbors relative to the expected number. This helps develop smaller community despite a small number of total edges compared to nearby larger communities. And intuitively, the labels that are over-represented in the neighborhood are more meaningful than the labels that are simply common.

Outside of typical mode, we perform merges and bubbling. Merges combine communities with abnormally high amounts of inter-community edges (again using expected vs observed, to prevent large communities from attracting other communities based on shear number of edges alone). Bubbling does the opposite, it splits up “poor quality” communities, or more specifically, it randomly relabels nodes that have the worst score of observed vs expected. This is an annealing type method where we repeatedly break up communities and allow them to reform. Then there is a nurturing mode, performed after bubbling, which focuses on finding newly relocated nodes a good community.

Throughout the process, we sample partitions at times where the community structure is deemed stable. Generally, if there are no good merges to be made, this partition is stable, it’s sampled, and then it will be bubbled to destabilize it and the process is continued until a specified number of samples have been made. This whole method is also performed many independent times, starting with different random initial conditions and a different random seed to ensure the final partition is robust. This final partition is selected based on which partition has the greatest overall similarity with all the other partitions based on NMI.

Performance

We have run this method on a large set of LFR benchmark graphs with varying proportions of cross-community edges and compared its performance against popular community detection algorithms (we are adding Leiden to the list but I do not have the results yet). LFR benchmarks have ground truth partitions that are used to generate the benchmark, which we are comparing the detected partitions to in NMI and ARI based comparisons. The results show that while most algorithms have little trouble with networks that have little crosstalk, once the crosstalk approaches 0.7, i.e. 30% of a nodes neighbors are in the same community and 70% of edges are to a separate community, SE2 starts to perform significantly better than previous methods at recapturing the ground truth.

* As an aside, would overlapping partitions be something that would be useful? I have written functions for calculating overlapping versions of modularity based on Extension of modularity density for overlapping community structure, which produce the same results as the crisp versions when a crisp partition is provided, I can try to translate that to igraph as well if there’s interest in overlapping communities.

Hey, as an update, we have published the algorithm. Please let me know your thoughts. If you don’t think it is a fit for igraph, I’ll start packaging externally.