Based on some earlier discussions (e.g. igraph_edge and from/to swapping), I wanted to propose a new graph data structure. To be clear, this is a proposal for the long-term, only for a version 1.0.
Background and motivation
The current graph data structure has two main drawbacks:
- The structure is costly to change, adding or removing edges is very costly.
- Iterating over edges/neighbours is typically done in other data structures (i.e.
igraph_inclist
,igraph_adjlist
). A similar thing holds for constructing graphs, essentially due to the first problem.
The problem with 1 is that most people will use igraph in a higher-level language, and in that higher-level language, it is not possible to change the graph efficiently. This exclude a lot of possible graph generating algorithm from being implemented easily in the higher-level language and requires other data structures. Additionally, some algorithm (that may require deleting edges/vertices for example), will never function efficiently in that higher-level interface. This limits the capabilities of igraph I think, and I would favour a graph data structure that does allow this.
The problem with 2 is that we now have several other data types to iterate efficiently over edges/vertices/neighbours/incident edges, etc… Although this is not necessarily a problem, it makes the implementation less efficient, since graphs always have to be translated into a more efficient data structure. Moreover, it increases the maintenance costs of the project, since additional graph types have to be maintained again. Finally, it may make it more difficult for people to start contributing to igraph, wondering what data type they may need for this purpose. Quoting Zen of Python here, “there should be one—and preferably only one—obvious way to do it.” The edge/vertex iteration is already well-implemented, and I think that should be the universal way to loop over edges/vertices instead of using the igraph_inclist
/igraph_adjlist
approach we often use currently).
Proposed data structure
The graph structure I have in mind is the following:
igraph_vector_int_t from; // From node
igraph_vector_int_t to; //To node
igraph_vector_ptr_t in_edges // Vector of vectors pointing to incoming edges
igraph_vector_ptr_t out_edges // Vector of vectors pointing to outgoing edges
igraph_vector_int_t in_idx; // Index in incoming edges
igraph_vector_int_t out_idx; // Index in outgoing edges
igraph_vector_char_t is_sorted; // Keep track of whether incident edges are sorted
The edges are stored (in arbitrary order) in the two vectors from
and to
. This vector is not sorted in any way. In other words, edge e
runs from from[e]
to to[e]
(ignoring the VECTOR
notation).
The two vectors in_edges
and out_edges
contain for each node the edge indices of the incident edges that are incoming or outgoing, respectively. That is, for node i
, its incident incoming edges are listed in in_edges
, so that in_edges[i][0]
is the first incident incoming edge, and that hence, it is pointed to by from[in_edges[i][0]]
, and that to[in_edges[i][0]] == i
. Something similar holds for the outgoing incident edges of course.
The two vectors in_idx
and out_idx
point to the indices in in_edges
/out_edges
in which each edge is pointed to. In other words, suppose that edge e
runs between i
and j
, i.e. from[e] == i
and to[e] == j
, then in_idx[e]
is the index k
in in_edges[j]
, such that in_edges[j][k] == e
. In other words, in_idx
points to the index in the incident edges that is pointing towards that specific edge. The same thing holds for out_idx
, but for the outgoing edges. Each edge will always be listed only twice of course, so these two vectors are all that is needed. It will become apparent later why this is needed.
Finally, the vector is_sorted
keeps track of whether the incident edges are sorted in order of increasing neighbour. That is, if is_sorted[i]
is true, it means that in_edges[i]
and out_edges[i]
are sorted in order of increasing order. Hence, for such a node i
, we then have that from[in_edges[i][k]] <= from[in_edges[i][k+1]]
for each k
and that to[out_edges[i][k]] <= to[out_edges[i][k+1]]
.
Operations
First some general remarks.
Any change in the graph structure will no longer ensure that the incident edges are sorted, so is_sorted
always needs to be set to false for the appropriate nodes.
Iterating over nodes/neighbours should preferably be done via the iteration interface. It is of course also possible to obtain neighbours or incidents edges using specific functions, but in order to guarantee integrity of the graph object (i.e. essentially encapsulating the graph object), these results would need to be copies.
Adding edges
Adding edges simply amounts to adding an edge to the from
and to
vector, adding appropriate references in in_edges
and out_edges
, with appropriate index references from in_idx
and out_idx
.
Complexity: O(1)
Adding nodes
Adding nodes simply corresponds to increasing the length of the in_edges
and out_edges
, and making sure that everything is appropriately initialised to empty vectors.
Complexity: O(1)
Removing edges
Removing an edge is slightly more difficult. Suppose that we have m edges, which are hence indexed from edge 0, ..., m - 1 and that we want to remove edge e. Instead of removing e from the vectors, which would require shifting all edges, we will simply relabel edge m - 1 as e, and then drop edge m - 1. This requires one change in from
and to
, which is efficient and takes O(1). The change in in_edges
and out_edges
is also efficient, because we can efficiently locate the relevant index of edge e
using in_idx[e]
and out_idx[e]
, which hence takes O(1). Finally, after in_edges
and out_edges
are updated, we of course also need to update in_idx
and out_idx
, which also takes O(1).
Complexity: O(1)
Removing nodes
For removing a node we will use the same trick of relabeling as above. That is, suppose we have n nodes, indexed from 0, ..., n - 1, and that we want to remove node i. We first need to remove each edge in in_edges
and out_edges
, which takes O(d_i) where d_i is the degree (in + out) of node i. We then need to relabel node n-1 as node i, which requires updating d_{n-1} edges. It is slightly cheaper than removing an edge though, since only the vectors from
and to
would require updating (the indices can remain the same).
Complexity: O(d_i + d_{n-1}).
Iterating over edges
Iterating over all edges just requires looping over the from
and start
edges, and requires no special structures.
Complexity: O(m), where m is the total number of edges.
Iterating over incident edges
Iterating over incident edges of a node i can simply be done immediately using the vectors in_edges
and out_edges
. Iterating over all edges (i.e. incoming and outgoing) simultaneously can also be done efficiently by simply concatenating the iteration. If the iteration needs to be sorted (i.e. if is_sorted[i]
is true), it would be possible to iterate over all incident edges in order by merging the two sorted vectors on the fly.
Complexity: O(d_i), where d_i is the number of incident (incoming/outgoing/all) edges.
Iterating over neighbours
Iterating over neighbours of a node i is slightly slower than iterating over incident edges, because it requires to fetch the “other” node (which is the from
node for in_edges
and the to
node for out_edges
).
Complexity: O(d_i), where d_i is the number of incident (incoming/outgoing/all) edges.
Finding edge
Getting the from
and start
nodes for a specific edge e
is efficient and simple, but the inverse is more costly. This is also the case in the current implementation, but because everything is sorted in the current implementation it can be done efficiently using binary search, and takes only O(\log d), see igraph_get_eid
. In this proposal, nothing is sorted by default, and hence this will take linear time O(d). However, if the incident edges are sorted (which takes time O(d \log d)), this is of course reduced to O(\log d). Hence, if igraph_get_eid
is expected to be called more than a certain number of times, it might be more efficient to first sort it. This feature is not frequently used I think, but in principle it would allow for a similarly complexity as the current setup.
Complexity: O(d) or O(\log d) if sorted.
Open questions
Undirected graphs
At the moment, the whole data structure is presented as a directed graph. There are multiple possible ways how to handle undirected graphs. We could for example only use in_edges
for simply all edges, and have in_idx
and out_idx
point to the two in_edges
where they are listed. An alternative would be to simply use in_edges
and out_edges
consistent with the from
and to
definition, but to simply ignore the directionality in relevant parts of the code.
Loops
We should discuss how to deal with loops. In this proposal, each loop would be included in the from
and start
vector only once (as it is now). Each loop would be included in the in_edges
once and once in the out_edges
once. This is also further discussed in issue #1193.
32-bits and 64-bits
Edge indices would need to be 64-bits in order to support large graphs (as discussed in issue #1450). However, there is still is the question of whether we would also need to support 64-bits node indices. If we only support 32-bits node indices this would also halve the memory of the graph object. There is a certain trade-off here: in theory, allowing 64-bits node indices may allow larger graphs, but in practice, the limitation will often be limited memory, in which case having 32-bits node indices would actually allow for larger graphs (i.e. for a given memory size). For ease and consistency, it would probably be the easiest to simply use 64-bits indices throughout.
Curious to hear what you all think. Clearly, this would require quite some effort, but I think it would make igraph
more useful to the higher-level interfaces and would reduce maintenance costs in the long run.