I’m very new to network science and would greatly appreciate some guidance with community detection techniques.
Mine is a very simple, undirected edge-weighted graph that I seek to cluster and find communities within. I understand that there are a number of ways to perform community detection and that each way has its own favourable and unfavourable qualities.
The essence of my question is this:
- Is there some metric that can help determine optimal community detection (such as silhouette score in conventional clustering)?
- Is there an ensemble method of community detection that applies multiple approaches and automatically selects the best clustering based on the above-mentioned metric?
I would be very grateful for any thoughts or assistance.
The following reviews / summaries are often recommended:
The classic metric used to characterize the quality of a partitioning is modularity, which is, roughly speaking, the fraction of edges (or fraction of total edge weight) contained within communities, relative to a null model.
Many community detection methods aim to maximize modularity in some manner.
Its good to point out that while using a mathematically well-defined quantity like modularity is convenient, it shouldn’t be taken as the last word on measuring the “quality” of partitionings. For example, it is clearly sensitive to how edge weights are chosen. Suppose you find that your weights are log-normally distributed so you take their logarithm. This change of weight representation will completely change what modularity maximizing methods would produce. There are methods which explicitly choose other quality measures. See e.g. Infomap, or the multilevel and Leiden algorithms in igraph which can use a modified definition of modularity that includes a resolution parameter. This parameter allows for tuning the number of communities the method detects, and makes it possible to resolve even small communities in a large network.
My final advice is that when the python-igraph documentation seems a bit sparse, also take a look a the corresponding function in the docs of the C library: igraph Reference Manual
To add to what @szhorvat already said: there is not a single universally agreed upon definition of the “quality” of a partition. Indeed modularity is used in many contexts, but it also suffers from some flaws.
In addition to the community detection methods that are implemented in
igraph itself, there are also other methods implemented in other packages, such as
networkx, or even other specialised packages. There is a package called
CDLib which tries to provide a central unified interface to all of these different community detection methods. In addition, it also provides some convenience functions for ensemble methods and evaluation / benchmarking. This might prove useful if you want to explore a broad range of techniques.
Great replies: I was going to recommend Fortunato & Hric’s “user guide” as well.
One more thing to add, since you use the word “optimal”. Under metrics such as modularity finding a truly optimal partition that maximizes that metric is “NP Hard”, meaning in practice that solving it optimally on any large graph would take a very long time (months, years or more depending on the size of the graph). Thus most community detection methods try to heuristically approximate optimality. – Dan
Thank you all very much for your incredibly thorough guidance! I can’t think of too many communities as invested in collaboration and helping each other out as this one.
When I finally get my hands on the dataset I described, I intend to work through the suggestions you’ve all made and see what works best. I’m especially interested in the ensemble methods implemented in CDLib, but I want to work through the component community detection algorithms first.
Your point about weight transformation, @szhorvat, conveniently leads me to another question: is there a weight range that works best for these community detection algorithms?
The edge weights in my network will indicate vote counts, with higher integer-valued weights indicating stronger connections between nodes. Is it preferable to consider distance (where a high weight would be transformed to a short distance, for example), or is connection strength equally viable provided a suitable normalisation is performed?