That’s understandable. And thanks, I have fix both those mistakes in skewness
.
I can update when we hear more about the publication status. It is currently “accepted pending minor revisions” at genome biology.
For now, we can email a preprint and I can provide a quick summary of the methods goals and how it works, and provide some comparisons with other methods.
SpeakEasy2 is an extension of the speaker-listener propagation algorithm with the intent of producing reproducible partitions even in networks with high levels of cross community edges (common in biological networks). The combination of reproducible and stochastic is intended to instill confidence the node membership is not a coincident of the initial conditions and random seed. Additionally, the algorithm can be used to calculate overlapping partitions (though I haven’t added that to the C version yet, and I don’t think igraph has support for overlapping partitions?*)
Method
For most the iterations, SE2 behaves similar to a basic speaker-listener propagation algorithm (SLPA) but there are some tweaks to improve its performance. Decisions about labeling nodes are based on the expected number of neighbors with a given label rather than the absolute of number of neighbors with a label, there are three modes in addition to the traditional SLPA mode (referred to as the typical mode) to help with convergence and add randomness to step out of local minima, and the algorithm is performed multiple times, selecting the partition that is most representative of all partitions (as determined by pairwise normalized mutual information).
Instead of classifying each node based on the label the highest number of neighbors are tagged with, we calculate the expected number of neighbors with each label based on the global distribution of labels and look for the label with the highest number of neighbors relative to the expected number. This helps develop smaller community despite a small number of total edges compared to nearby larger communities. And intuitively, the labels that are over-represented in the neighborhood are more meaningful than the labels that are simply common.
Outside of typical mode, we perform merges and bubbling. Merges combine communities with abnormally high amounts of inter-community edges (again using expected vs observed, to prevent large communities from attracting other communities based on shear number of edges alone). Bubbling does the opposite, it splits up “poor quality” communities, or more specifically, it randomly relabels nodes that have the worst score of observed vs expected. This is an annealing type method where we repeatedly break up communities and allow them to reform. Then there is a nurturing mode, performed after bubbling, which focuses on finding newly relocated nodes a good community.
Throughout the process, we sample partitions at times where the community structure is deemed stable. Generally, if there are no good merges to be made, this partition is stable, it’s sampled, and then it will be bubbled to destabilize it and the process is continued until a specified number of samples have been made. This whole method is also performed many independent times, starting with different random initial conditions and a different random seed to ensure the final partition is robust. This final partition is selected based on which partition has the greatest overall similarity with all the other partitions based on NMI.
Performance
We have run this method on a large set of LFR benchmark graphs with varying proportions of cross-community edges and compared its performance against popular community detection algorithms (we are adding Leiden to the list but I do not have the results yet). LFR benchmarks have ground truth partitions that are used to generate the benchmark, which we are comparing the detected partitions to in NMI and ARI based comparisons. The results show that while most algorithms have little trouble with networks that have little crosstalk, once the crosstalk approaches 0.7, i.e. 30% of a nodes neighbors are in the same community and 70% of edges are to a separate community, SE2 starts to perform significantly better than previous methods at recapturing the ground truth.
* As an aside, would overlapping partitions be something that would be useful? I have written functions for calculating overlapping versions of modularity based on Extension of modularity density for overlapping community structure, which produce the same results as the crisp versions when a crisp partition is provided, I can try to translate that to igraph as well if there’s interest in overlapping communities.