Here is the current code for igraph_edge (as in PR #1418 - https://github.com/igraph/igraph/pull/1418; the develop version is equivalent but reads a little differently):
As you can see, igraph performs a perhaps unexpected swap in the case of undirected graphs, using IGRAPH_FROM to fill the to variable and vice versa.
@tamas has done some digging and it seems that, for undirected graphs only, igraph stores the larger vertex in igraph->from (so it comes out with IGRAPH_FROM) and the smaller in igraph->to (i.e. IGRAPH_TO). This function is therefore probably swapping them to satisfy the natural user expectation that when you ask for an edge, the smaller vertex id ends up in from.
The same happens with igraph_edges, which returns a list of size 2 * m in which the to vertex always precedes the from vertex.
I’m wondering whether we should do anything about this. In any case, it surprised me, as I had been using igraph_edge and IGRAPH_FROM/TO interchangeably in our internal functions, and clearly they are not interchangeable. So you might want to pay attention out there.
I think this is rather unexpected. Although it may make sense to have some natural order, it does not make sense to swap them on every call, especially because the output of igraph_edge is now inconsistent with IGRAPH_FROM and IGRAPH_TO. If we want this order, why don’t we simply store the larger vertex in igraph->to instead of igraph->from?
Perhaps there are reasons to keep this for now, but I think it should be corrected in the non-development release (1.0).
Gábor has provided an explanation on GitHub: we cannot change this without breaking the R interface, because in R the user is allowed to save the graph data structure as-is to disk, which basically means that the raw vectors inside the igraph_t are serialized and saved.
If we want to fix this, we need to provide an alternative implementation of type_indexededgelist.c and allow the user to select which internal graph implementation she would like to use at compile time. It also opens up the possibility of testing alternative graph data structures (native adjacency / incidence lists, for instance) if we also add a comprehensive benchmark suite.
For legacy reasons, something like that could be maintained (e.g. igraph_read_old or something).
Generally, I think we should aim for a 1.0 version that no longer suffers from legacy issues. That is, we may still provide some legacy compatibility functionality, such as the one you suggested now, but no more than that. That way, we can design the 1.0 version to be as we would like it to be.
Sure, we just have to ensure that we update the R interface at the same time we change the representation in the C core. Although, as Gábor mentioned, there is probably no performance benefit in simply swapping the treatment of from and to in the internal representation; if we are messing around with the internal representation, then there are probably other changes we could make with a larger impact. In other words, if we are breaking things anyway, why not break them in a way that also yields a significant performance boost.
Yes, indeed, the same holds for the Python interface. However, since we are working with submodules, it is not too problematic to continue using a previous version. But I agree, it would be best to release a new version of all interfaces, especially with each new minor release (i.e. not just a bugfix release).
Perhaps we need to mimic the C branch structure in the different interfaces at some point, so that we can work on integrating the next development version in the different interfaces.
Yes, all for that! I’m not sure if you immediately have other possibilities in mind?
The Python interface uses a serialization format that is independent of the underlying C interface (basically we serialize the constructor args that are needed to reconstruct the graph exactly).
Not yet; first we need a good benchmark suite that resembles real-world use-cases, and then we need to identify the hotspots. I’d rather not start any optimization work without knowing where the biggest bottlenecks are (unless we find some low-hanging fruit). One thing that personally bothers me is the frequent back-and-forth conversions to adjacency or incidence lists in some of the algorithms, but I don’t know what the real performance implications are.
Yes, for many of the static analyses it should perhaps be made easier to iterate over all neighbours/edges, without having to request such an adjacency/incidence list each time.
Of course, some algorithms change the graph, in which case an adjacency list might be more appropriate. As stated in the SNAP paper:
iGraph emphasizes performance at the expense of the flexibility of the underlying graph data structure. Nodes and edges are represented by vectors and indexed for fast access and iterations over nodes and edges. Thus, graph algorithms in iGraph can be very fast. However, iGraph’s representation of graphs is heavily optimized for fast execution of algorithms that operate on a static network. As such, iGraph is prohibitively slow when making incremental changes to the graph structure, such as node and edge additions or deletions.
I’m not sure to what extent it would be possible to have both a fast updatable graph (for which we now typically use adjacency lists) and fast read-access to all necessary vertices and edges.
But it might be worth thinking about the possible data structures in more detail indeed.