Hi there,
recently, I have been comparing implementations of various centralities in Python graph libraries and it has been surprisingly difficult to obtain identical results.
I particular, I would like to obtain similar results for closeness centrality for directed graphs, where some nodes may have zero in or out-degrees.
Depending on the definition, one can decide to have undefined (nan) or zero centralities for such nodes.
However, igraph currently replaces the distance of these nodes by the number of vertices in the graph. While this is one choice among others, it does not necessarily fit the userās needs.
How difficult would it be to propose a user-defined choice for this length? (e.g. between nan, 0, and using the number of vertices).
I have given a quick look at the implementation but I must say that, for somebody discovering igraphās internals for the first time, the algorithm is not very easy to readā¦
In principle, I do not think the proposed change is difficult to implement and I may be able to propose an implementation if somebody can give me a brief overview of the algorithm but Iām afraid I do not have time to try to figure it out on my own.
In addition to this, a similar question would be whether it is possible to implement the harmonic definition of the closeness centrality:
c_i = \frac{1}{N-1} \sum_{j=1}^N\frac{1}{d_{ij}}
EDIT: to contextualize, I think this feature would provide several benefits,
more versatile
provide ways to get similar results between graph-tool, igraph, and networkx in python
eventually remove the need for the runtimewarning for ādisconnected graphā (they are generally not) if the closeness returned for nodes with zero in/out-degree is NaN by default (could be done in a few releases if deemed desirable); a warning can be overlooked while a NaN tends to stand out
I fully agree that the handling of non-connected graphs is currently not very good. We are looking for solutions to this, and we welcome ideas (and references) on how to handle such graphs.
If I understand you correctly, you are talking only about the case of zero-degree vertices (or zero in- or out-degree ones). This is just one special case of a non-connected graph, and I do not see why it should get special treatment. A proper solution would address all non-connected graphs.
Can you clarify why you are looking to treat zero-degree vertices specially?
Personally, I would say that closeness centrality simply does not make sense for non-connected graphs. No matter what distance is chosen for non-connected pairs of vertices, the interpretation of the result would be dubious.
Non-connected graphs and zero-degree nodes:
Iām not really talking about āspecial-treatmentā for zero degree nodes; properly defining the distance value for disconnected nodes is a large part of how non-connected graphs are handled with respect to closeness.
Both networkx and graph-tool use, by definition, d_{ij} = 0 if there is no path for i to j, unfortunately networkx does not follow the logic to the end and sets the centrality of fully disconnected nodes to zero, while graph-tool consistently sets there centralities to NaN.
With this choice of distance, the last question is how to normalize, which both nx and gt do via the size out the nodeās out-component.
However, networkx does provides a better normalization procedure via the wf_improved argument, which uses the the Wasserman and Faust formula.
Harmonic closeness:
I think the easiest is just to look at the wikipedia section which has all relevant references.
To me, the harmonic implementation is much more intuitive and clean since, taking d_{ij} = \infty if there is no path for i to j, there is no special case and there is a single way to normalize, which is by N - 1, regardless of the number of components.
Since Iām not supposed to open feature requests there, I would open an issue regarding closeness centrality in graphs with ānon-connectedā graphs and propose the different definition of distance, the networkx renormalization, and the harmonic centrality as potential solutions, is that what you meant?
The issue for closeness in disconnected graphs is here:
Since you are already looking at what various other packages do, it would be useful if you could summarize your findings and post them either here or in that issue. You already did that for networkx and graph-tool. Thanks!
Just a small note: Regarding your description of what networkx does. You focus on what the distance to non-reachable nodes should be. I found this way of putting it quite confusing. It does not take the distances to non-reachable nodes to be zero. Instead, it simply ignores non-reachable nodes. There is no distance to those included in the calculation.
This is one reason why I think that asking āwhat the distance to non-reachable nodes should beā is the wrong approach.
BTW Mathematica does the same: it ignores non-reachable vertices. Of course, this is not a very useful thing to do because it results in large centralities for vertices in small connected components. The Wasserman paper tries to address this, but what they do seems rather ad-hoc ā¦ I should read the paper to see if they have any well-founded justification.
OK, Iāll update them with some data; Iām afraid Iām only using networx, igraph, and graph-tool, though, so i already told you all I know regarding what happens in other libraries.
Regarding youāre remark on the distance, for the arithmetic definition of closeness, ignoring the distance or setting it to zero is almost exactly the same, except for the fact that defining the distance gives a non-arbitrary value of the closeness for fully disconnected nodes (NaN), which I think provides a consistent answer the ānon-connectedā issue.
As for the normalization, I agree, the formula is somehow ad-hoc, however (at least to me) the whole arithmetic definition of closeness is somewhat āawkwardā, and this is the best option Iāve found so far, apart from moving to the harmonic definition altogetherā¦ which would be the better option as far as Iām concerned.
Ah, maybe my formulation was not clear enough: for the arithmetic definition, the normalization is done by n - 1 where n is the number of nodes in the out/in-component of the nodes (depending on the mode).
In your examples, that would be respectively 2 and 10.
Since this is not necessarily meaningful, the Wasserman and Faust formula tries to improve this slightly.
But I agree that all of this leads to convoluted and somewhat awkward definitions.
Yet, this is, to my knowledge, the least inconsistent way of defining arithmetic closeness.
The harmonic definition, however, directly uses N - 1, where N is the total number of nodes in the graph (here, indeed, non-connected graphs and two separate graphs give different centralities, which is normal to me).
Great, thank you!
One small question: will it then automatically be included in the Python bindings or should I ask for it on the Python tracker?
Iām afraid Iām used to bridging C and Python through Cython, which does not seem to be the case for igraph so I donāt know how to do it myselfā¦ is there some explanation somewhere in a ādeveloper spaceā part of the doc?
Yes, I plan to do so. Please open an issue for it in the issue tracker of the Python interface nevertheless so I donāt forget about it.
(Note to self: eventually we should re-consider how we implement the C glue code for new functions as it is becoming tedious to hand-craft a new binding for every new function we add to the C core).