New graph data structure proposal

Based on some earlier discussions (e.g. igraph_edge and from/to swapping), I wanted to propose a new graph data structure. To be clear, this is a long-term proposal, intended only for a version 1.0.

Background and motivation

The current graph data structure has two main drawbacks:

  1. The structure is costly to change: adding or removing edges is very expensive.
  2. Iterating over edges/neighbours is typically done via other data structures (e.g. igraph_inclist, igraph_adjlist). A similar thing holds for constructing graphs, essentially due to the first problem.

The problem with 1 is that most people will use igraph from a higher-level language, and in that higher-level language it is not possible to change the graph efficiently. This excludes a lot of possible graph-generating algorithms from being implemented easily in the higher-level language and requires other data structures. Additionally, some algorithms (that may require deleting edges/vertices, for example) will never function efficiently in that higher-level interface. This limits the capabilities of igraph, I think, and I would favour a graph data structure that does allow efficient modification.

The problem with 2 is that we now have several other data types to iterate efficiently over edges/vertices/neighbours/incident edges, etc. Although this is not necessarily a problem in itself, it makes the implementation less efficient, since graphs always have to be translated into a more efficient data structure first. Moreover, it increases the maintenance costs of the project, since these additional graph types have to be maintained as well. Finally, it may make it more difficult for people to start contributing to igraph, since they have to wonder which data type they need for which purpose. Quoting the Zen of Python here, “there should be one—and preferably only one—obvious way to do it.” The edge/vertex iteration is already well implemented, and I think that should be the universal way to loop over edges/vertices, instead of the igraph_inclist/igraph_adjlist approach we often use currently.

Proposed data structure

The graph structure I have in mind is the following:

igraph_vector_int_t from;       // From node
igraph_vector_int_t to;         // To node

igraph_vector_ptr_t in_edges;   // Vector of vectors pointing to incoming edges
igraph_vector_ptr_t out_edges;  // Vector of vectors pointing to outgoing edges

igraph_vector_int_t in_idx;     // Index in incoming edges
igraph_vector_int_t out_idx;    // Index in outgoing edges

igraph_vector_char_t is_sorted; // Keep track of whether incident edges are sorted

The edges are stored (in arbitrary order) in the two vectors from and to. These vectors are not sorted in any way. In other words, edge e runs from from[e] to to[e] (ignoring the VECTOR notation).

The two vectors in_edges and out_edges contain, for each node, the edge indices of the incident edges that are incoming or outgoing, respectively. That is, for node i, its incoming incident edges are listed in in_edges[i], so that in_edges[i][0] is the first incoming incident edge; its source node is from[in_edges[i][0]], and to[in_edges[i][0]] == i. Something similar holds for the outgoing incident edges, of course.

The two vectors in_idx and out_idx point to the positions in in_edges/out_edges at which each edge is listed. In other words, suppose that edge e runs between i and j, i.e. from[e] == i and to[e] == j; then in_idx[e] is the index k in in_edges[j] such that in_edges[j][k] == e. That is, in_idx points to the position in the incident-edge list that refers to that specific edge. The same holds for out_idx, but for the outgoing edges. Each edge is always listed exactly twice (once in in_edges and once in out_edges), so these two vectors are all that is needed. It will become apparent later why this is needed.

Finally, the vector is_sorted keeps track of whether the incident edges are sorted in order of increasing neighbour. That is, if is_sorted[i] is true, it means that in_edges[i] and out_edges[i] are sorted in order of increasing neighbour. Hence, for such a node i, we then have that from[in_edges[i][k]] <= from[in_edges[i][k+1]] for each k and that to[out_edges[i][k]] <= to[out_edges[i][k+1]].
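To make the relations between these fields concrete, here is a minimal C sketch of the proposed structure, with the invariants written out as comments. The type name proposal_graph_t is just a placeholder that the sketches further below also refer to; it is not an actual igraph definition.

#include <igraph.h>

/* Hypothetical sketch of the proposed structure; the type name and the exact
 * field types are placeholders for illustration only. */
typedef struct {
    igraph_vector_int_t from;        /* from[e]: source node of edge e */
    igraph_vector_int_t to;          /* to[e]:   target node of edge e */

    /* Each element is an igraph_vector_int_t* with incident edge ids, so that
     *   to[in_edges[i][k]] == i    for every k,
     *   from[out_edges[i][k]] == i for every k. */
    igraph_vector_ptr_t in_edges;
    igraph_vector_ptr_t out_edges;

    /* Position of each edge inside those lists:
     *   in_edges[to[e]][in_idx[e]] == e,
     *   out_edges[from[e]][out_idx[e]] == e. */
    igraph_vector_int_t in_idx;
    igraph_vector_int_t out_idx;

    /* is_sorted[i]: are in_edges[i] and out_edges[i] sorted by neighbour? */
    igraph_vector_char_t is_sorted;
} proposal_graph_t;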

Operations

First some general remarks.

Any change to the graph structure may leave the incident edges of the affected nodes unsorted, so is_sorted always needs to be set to false for those nodes.

Iterating over nodes/neighbours should preferably be done via the iteration interface. It is of course also possible to obtain neighbours or incident edges using specific functions, but in order to guarantee the integrity of the graph object (i.e. essentially encapsulating the graph object), these results would need to be copies.

Adding edges

Adding an edge simply amounts to appending it to the from and to vectors, adding the appropriate references in in_edges and out_edges, and recording the corresponding index references in in_idx and out_idx.

Complexity: O(1)
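As an illustration, here is a hedged sketch of what adding an edge could look like, against the proposal_graph_t sketch above (the helper name proposal_add_edge is hypothetical; note that the push_back operations are amortized O(1)).

/* Hypothetical sketch; assumes the proposal_graph_t layout sketched earlier. */
igraph_error_t proposal_add_edge(proposal_graph_t *g,
                                 igraph_integer_t i, igraph_integer_t j) {
    igraph_vector_int_t *out_i = VECTOR(g->out_edges)[i];
    igraph_vector_int_t *in_j  = VECTOR(g->in_edges)[j];
    igraph_integer_t e = igraph_vector_int_size(&g->from);  /* id of the new edge */

    /* Register the endpoints. */
    IGRAPH_CHECK(igraph_vector_int_push_back(&g->from, i));
    IGRAPH_CHECK(igraph_vector_int_push_back(&g->to, j));

    /* Remember where e will sit inside the incidence lists ... */
    IGRAPH_CHECK(igraph_vector_int_push_back(&g->out_idx, igraph_vector_int_size(out_i)));
    IGRAPH_CHECK(igraph_vector_int_push_back(&g->in_idx, igraph_vector_int_size(in_j)));

    /* ... and append it to them. */
    IGRAPH_CHECK(igraph_vector_int_push_back(out_i, e));
    IGRAPH_CHECK(igraph_vector_int_push_back(in_j, e));

    /* The incidence lists of i and j are no longer guaranteed to be sorted. */
    VECTOR(g->is_sorted)[i] = 0;
    VECTOR(g->is_sorted)[j] = 0;

    return IGRAPH_SUCCESS;
}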

Adding nodes

Adding nodes simply corresponds to increasing the length of in_edges and out_edges (and of is_sorted), making sure that the new entries are appropriately initialised to empty vectors.

Complexity: O(1)
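A similar sketch for adding a single node, under the same assumptions (proposal_add_vertex is a hypothetical name; cleanup on partial failure is omitted for brevity).

/* Hypothetical sketch; assumes the proposal_graph_t layout sketched earlier. */
igraph_error_t proposal_add_vertex(proposal_graph_t *g) {
    igraph_vector_int_t *in_list = IGRAPH_CALLOC(1, igraph_vector_int_t);
    igraph_vector_int_t *out_list = IGRAPH_CALLOC(1, igraph_vector_int_t);
    if (!in_list || !out_list) {
        IGRAPH_ERROR("Insufficient memory to add vertex.", IGRAPH_ENOMEM);
    }

    /* New incidence lists start out empty. */
    IGRAPH_CHECK(igraph_vector_int_init(in_list, 0));
    IGRAPH_CHECK(igraph_vector_int_init(out_list, 0));
    IGRAPH_CHECK(igraph_vector_ptr_push_back(&g->in_edges, in_list));
    IGRAPH_CHECK(igraph_vector_ptr_push_back(&g->out_edges, out_list));

    /* An empty incidence list is trivially sorted. */
    IGRAPH_CHECK(igraph_vector_char_push_back(&g->is_sorted, 1));

    return IGRAPH_SUCCESS;
}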

Removing edges

Removing an edge is slightly more difficult. Suppose that we have m edges, which are hence indexed 0, ..., m - 1, and that we want to remove edge e. Instead of removing e from the vectors, which would require shifting all subsequent edges, we simply relabel edge m - 1 as e, and then drop edge m - 1. This requires one change in from and to, which is efficient and takes O(1). The change in in_edges and out_edges is also efficient, because we can locate the relevant positions of edge e using in_idx[e] and out_idx[e] in O(1); removing the entry from the middle of an incident-edge list can again be done by swapping it with the last entry of that list and popping it. Finally, after in_edges and out_edges are updated, we of course also need to update in_idx and out_idx, which also takes O(1).

Complexity: O(1)
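Here is a sketch of this relabeling trick, again against the proposal_graph_t layout above (helper names are hypothetical). Removing an entry from the middle of an incidence list uses the same swap-with-last idea, which is exactly what in_idx and out_idx make possible in O(1).

/* Hypothetical sketch; assumes the proposal_graph_t layout sketched earlier.
 * Removes edge e from one incidence list (inc = out_edges[v] or in_edges[v])
 * by swapping it with the last entry and popping, keeping idx consistent. */
static void detach_from_list(igraph_vector_int_t *inc, igraph_vector_int_t *idx,
                             igraph_integer_t e) {
    igraph_integer_t pos = VECTOR(*idx)[e];
    igraph_integer_t moved = VECTOR(*inc)[igraph_vector_int_size(inc) - 1];
    VECTOR(*inc)[pos] = moved;   /* the last entry takes e's slot */
    VECTOR(*idx)[moved] = pos;
    igraph_vector_int_pop_back(inc);
}

/* Remove edge e by relabeling the last edge (m - 1) as e. */
void proposal_remove_edge(proposal_graph_t *g, igraph_integer_t e) {
    igraph_integer_t i = VECTOR(g->from)[e], j = VECTOR(g->to)[e];
    igraph_integer_t last = igraph_vector_int_size(&g->from) - 1;

    /* Detach e from out_edges[i] and in_edges[j] in O(1). */
    detach_from_list(VECTOR(g->out_edges)[i], &g->out_idx, e);
    detach_from_list(VECTOR(g->in_edges)[j], &g->in_idx, e);

    if (e != last) {
        /* Relabel edge `last` as `e`: copy its endpoints and index slots, and
         * repoint the two incidence-list entries that referred to `last`. */
        igraph_integer_t li = VECTOR(g->from)[last], lj = VECTOR(g->to)[last];
        igraph_vector_int_t *out_li = VECTOR(g->out_edges)[li];
        igraph_vector_int_t *in_lj  = VECTOR(g->in_edges)[lj];
        VECTOR(g->from)[e] = li;
        VECTOR(g->to)[e] = lj;
        VECTOR(g->out_idx)[e] = VECTOR(g->out_idx)[last];
        VECTOR(g->in_idx)[e] = VECTOR(g->in_idx)[last];
        VECTOR(*out_li)[VECTOR(g->out_idx)[last]] = e;
        VECTOR(*in_lj)[VECTOR(g->in_idx)[last]] = e;
    }

    /* Drop the now-unused last slot everywhere. */
    igraph_vector_int_pop_back(&g->from);
    igraph_vector_int_pop_back(&g->to);
    igraph_vector_int_pop_back(&g->out_idx);
    igraph_vector_int_pop_back(&g->in_idx);

    /* The lists of i and j were reordered by the swap-with-last removal. */
    VECTOR(g->is_sorted)[i] = 0;
    VECTOR(g->is_sorted)[j] = 0;
}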

Removing nodes

For removing a node we will use the same relabeling trick as above. That is, suppose we have n nodes, indexed 0, ..., n - 1, and that we want to remove node i. We first need to remove each edge in in_edges[i] and out_edges[i], which takes O(d_i), where d_i is the degree (in + out) of node i. We then need to relabel node n - 1 as node i, which requires updating d_{n-1} edges. This relabeling is slightly cheaper than removing an edge though, since only the vectors from and to require updating (the indices can remain the same).

Complexity: O(d_i + d_{n-1}).

Iterating over edges

Iterating over all edges just requires looping over the from and to vectors, and requires no special structures.

Complexity: O(m), where m is the total number of edges.

Iterating over incident edges

Iterating over the incident edges of a node i can be done immediately using the vectors in_edges and out_edges. Iterating over all incident edges (i.e. incoming and outgoing) simultaneously can also be done efficiently by simply concatenating the two iterations. If the iteration needs to be in sorted order (and is_sorted[i] is true), it is possible to iterate over all incident edges in order by merging the two sorted vectors on the fly.

Complexity: O(d_i), where d_i is the number of incident (incoming/outgoing/all) edges.
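For the sorted case, the on-the-fly merge could look roughly as follows (a sketch assuming the proposal_graph_t layout above and that is_sorted[i] is true; the function name is hypothetical).

/* Hypothetical sketch; assumes the proposal_graph_t layout sketched earlier
 * and that is_sorted[i] is true. Collects all incident edges of node i into
 * `res`, ordered by increasing neighbour, by merging the two sorted lists. */
igraph_error_t proposal_incident_sorted(const proposal_graph_t *g,
                                        igraph_integer_t i,
                                        igraph_vector_int_t *res) {
    const igraph_vector_int_t *in_i  = VECTOR(g->in_edges)[i];
    const igraph_vector_int_t *out_i = VECTOR(g->out_edges)[i];
    igraph_integer_t a = 0, na = igraph_vector_int_size(in_i);
    igraph_integer_t b = 0, nb = igraph_vector_int_size(out_i);

    igraph_vector_int_clear(res);
    while (a < na || b < nb) {
        igraph_integer_t e;
        /* Pick the list whose next edge has the smaller neighbour:
         * the neighbour is from[e] for incoming and to[e] for outgoing edges. */
        if (b >= nb || (a < na &&
                VECTOR(g->from)[VECTOR(*in_i)[a]] <= VECTOR(g->to)[VECTOR(*out_i)[b]])) {
            e = VECTOR(*in_i)[a++];
        } else {
            e = VECTOR(*out_i)[b++];
        }
        IGRAPH_CHECK(igraph_vector_int_push_back(res, e));
    }
    return IGRAPH_SUCCESS;
}

Note that with this layout a self-loop of i appears once in each list and is therefore visited twice, which ties into the discussion on self-loops further below.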

Iterating over neighbours

Iterating over the neighbours of a node i is slightly slower than iterating over incident edges, because it requires fetching the “other” node (which is the from node for in_edges and the to node for out_edges).

Complexity: O(d_i), where d_i is the number of incident (incoming/outgoing/all) edges.

Finding edge

Getting the from and to nodes for a specific edge e is efficient and simple, but the inverse is more costly. This is also the case in the current implementation, but because everything is sorted in the current implementation, it can be done efficiently using binary search and takes only O(\log d), see igraph_get_eid. In this proposal, nothing is sorted by default, and hence this will take linear time O(d). However, if the incident edges are sorted (which takes time O(d \log d)), this is of course reduced to O(\log d). Hence, if igraph_get_eid is expected to be called more than a certain number of times, it might be more efficient to sort the incident edges first. This feature is not frequently used, I think, but in principle it would allow for a similar complexity as the current setup.

Complexity: O(d) or O(\log d) if sorted.
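A sketch of both variants (linear scan when unsorted, binary search over out_edges[i] keyed by to[] when sorted), assuming the proposal_graph_t layout above; proposal_get_eid is a hypothetical name and only looks for a directed edge from i to j.

/* Hypothetical sketch; assumes the proposal_graph_t layout sketched earlier.
 * Returns the id of an edge from i to j, or -1 if there is none. */
igraph_integer_t proposal_get_eid(const proposal_graph_t *g,
                                  igraph_integer_t i, igraph_integer_t j) {
    const igraph_vector_int_t *out_i = VECTOR(g->out_edges)[i];
    igraph_integer_t n = igraph_vector_int_size(out_i);

    if (VECTOR(g->is_sorted)[i]) {
        /* out_edges[i] is sorted by to[e]: binary search, O(log d). */
        igraph_integer_t lo = 0, hi = n;
        while (lo < hi) {
            igraph_integer_t mid = lo + (hi - lo) / 2;
            if (VECTOR(g->to)[VECTOR(*out_i)[mid]] < j) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        if (lo < n && VECTOR(g->to)[VECTOR(*out_i)[lo]] == j) {
            return VECTOR(*out_i)[lo];
        }
    } else {
        /* Unsorted: linear scan, O(d). */
        for (igraph_integer_t k = 0; k < n; k++) {
            igraph_integer_t e = VECTOR(*out_i)[k];
            if (VECTOR(g->to)[e] == j) {
                return e;
            }
        }
    }
    return -1;
}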

Open questions

Undirected graphs

At the moment, the whole data structure is presented as a directed graph. There are multiple possible ways to handle undirected graphs. We could, for example, use only in_edges for all edges, and have in_idx and out_idx point to the two positions in in_edges where each edge is listed. An alternative would be to keep in_edges and out_edges consistent with the from and to definition, but simply ignore the directionality in the relevant parts of the code.

Loops

We should discuss how to deal with loops. In this proposal, each loop would be included in the from and to vectors only once (as it is now). Each loop would be included once in in_edges and once in out_edges. This is also further discussed in issue #1193.

32-bits and 64-bits

Edge indices would need to be 64-bit in order to support large graphs (as discussed in issue #1450). However, there is still the question of whether we would also need to support 64-bit node indices. If we only support 32-bit node indices, this would also halve the memory use of the graph object. There is a certain trade-off here: in theory, allowing 64-bit node indices may allow larger graphs, but in practice the limitation will often be the available memory, in which case having 32-bit node indices would actually allow for larger graphs (i.e. for a given memory size). For ease and consistency, it would probably be easiest to simply use 64-bit indices throughout.


Curious to hear what you all think. Clearly, this would require quite some effort, but I think it would make igraph more useful to the higher-level interfaces and would reduce maintenance costs in the long run.


Is (D.E. Knuth, The Art of Computer Programming, Vol 1, Addison-Wesley, 1969, p. 409) still relevant? Or any other source?

Could you perhaps elaborate on what aspect of Knuth is relevant here? I don’t have a copy lying around here, so I don’t immediately know what you are talking about.

I would have thought this problem has been solved many times. So any literature list is helpful.

Hello Vincent,

Please find enclosed copies of a few pages from The Art of Computer Programming (Vol 1). I just wondered if there is any relevant research regarding graph representations.

Regards,

Kees Pippel

It has, but the devil is in the details, and in choosing the right tradeoffs for the specific application. There is no perfect representation that one can just lift from a textbook. Each type of representation will have advantages and disadvantages, and each can be implemented in multiple ways with their own tradeoffs.

If you look at how various algorithms are implemented in a typical graph library, you will see that they often convert the graph to a representation that is best suited to the specific algorithm. What @vtraag is trying to address here is:

  • We noticed that the representation that appears to be most commonly used by igraph’s algorithms at this moment is not the same as the one used by igraph_t. Therefore, there’s a lot of back-and-forth conversion going on.

  • It is typical in high-level interfaces to make many small incremental modifications to a graph, and with the current representation this is costly.

@vtraag Thanks for all your work on this! Responding will take quite a bit of time. I just wanted to indicate that I saw the thread. It will be useful to make a comparison with the current representation and identify where the new one is worse.

One more idea I wanted to throw out there:

Mathematica is closed source, so I don’t know what it does exactly internally, but I can tell that it uses multiple different representations. Given that the best representation is going to be different for each algorithm, a possible design is to allow several representations, and convert between them based on need. I.e., once an adjacency-list-type representation has been created, it would not be thrown away. It could be used by the next function call, if that function also works with adjacency lists.

I am not saying that we should do this (in fact I think we should not, at least not in its most extreme form), but it is useful for brainstorming to phrase this idea explicitly. Allowing both a sorted and a non-sorted representation, as you propose, is one variation on this idea.


Anyway, I really need to spend at least an hour or two going through the proposal, and thinking things through in detail, before I can respond more usefully.

Just adding one point of discussion here which has come up on multiple occasions: parallel edges and self-loops. In the existing implementation, self-loops appear twice when looping over incident edges/neighbours, because a self-loop appears both as an “in” and as an “out” edge for the same vertex. Hence, when gathering all incident edges (for IGRAPH_ALL), each self-loop will appear twice. For some purposes, having self-loops included twice makes sense; for other purposes it makes less sense. Similarly, some applications might want to include parallel edges, whereas others would like to exclude them. We should explicitly consider this aspect in the design of any new graph data structure.

In the proposal above, we would by default have unsorted edges. In principle, this implementation would show rather similar behaviour, as each self-loop would appear both as an “in” and as an “out” edge. Similarly, parallel edges would also appear multiple times (as they should, of course), but it is a relevant question how we could exclude those duplicates efficiently.

I would propose that we have a “scratchpad” “fixed set”. With a “fixed set” I mean a structure that can track whether an integer up to n is included in the set. This can be efficiently implemented using an igraph_vector_int_t in_set and an igraph_vector_int_t set, where in_set[i] >= 0 denotes that integer i is in the set, in which case integer i is included in the vector set. In particular, we have set[in_set[i]] == i. In other words, in_set[i] contains the reference to where integer i is located in the vector set. This supports O(1) operations for adding and removing integers up to n from the set, while clearing a set of size d takes O(d). With a “scratchpad” “fixed set” I mean that we keep one such “fixed set” in memory, so that we can use it for whatever purpose we need instead of repeatedly allocating and freeing the necessary memory. This should make often-repeated operations more efficient (but we can debate about this).
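To make the idea concrete, here is a minimal sketch of such a “fixed set” (these are not existing igraph types; all names are placeholders).

#include <igraph.h>

/* Hypothetical sketch of the "fixed set" described above: tracks a subset of
 * {0, ..., n-1} with O(1) add/remove/test and O(|set|) clear. */
typedef struct {
    igraph_vector_int_t in_set;  /* in_set[i] >= 0: position of i in `set`, -1 otherwise */
    igraph_vector_int_t set;     /* the elements currently in the set */
} fixed_set_t;

igraph_error_t fixed_set_init(fixed_set_t *fs, igraph_integer_t n) {
    IGRAPH_CHECK(igraph_vector_int_init(&fs->in_set, n));
    igraph_vector_int_fill(&fs->in_set, -1);
    IGRAPH_CHECK(igraph_vector_int_init(&fs->set, 0));
    return IGRAPH_SUCCESS;
}

void fixed_set_destroy(fixed_set_t *fs) {
    igraph_vector_int_destroy(&fs->in_set);
    igraph_vector_int_destroy(&fs->set);
}

igraph_bool_t fixed_set_contains(const fixed_set_t *fs, igraph_integer_t i) {
    return VECTOR(fs->in_set)[i] >= 0;
}

igraph_error_t fixed_set_add(fixed_set_t *fs, igraph_integer_t i) {
    if (!fixed_set_contains(fs, i)) {
        VECTOR(fs->in_set)[i] = igraph_vector_int_size(&fs->set);
        IGRAPH_CHECK(igraph_vector_int_push_back(&fs->set, i));
    }
    return IGRAPH_SUCCESS;
}

void fixed_set_remove(fixed_set_t *fs, igraph_integer_t i) {
    igraph_integer_t pos = VECTOR(fs->in_set)[i];
    if (pos >= 0) {
        /* Swap i with the last element of `set` and pop. */
        igraph_integer_t last = igraph_vector_int_pop_back(&fs->set);
        if (last != i) {
            VECTOR(fs->set)[pos] = last;
            VECTOR(fs->in_set)[last] = pos;
        }
        VECTOR(fs->in_set)[i] = -1;
    }
}

void fixed_set_clear(fixed_set_t *fs) {
    /* O(|set|): only reset the entries that are actually in the set. */
    igraph_integer_t n = igraph_vector_int_size(&fs->set);
    for (igraph_integer_t k = 0; k < n; k++) {
        VECTOR(fs->in_set)[VECTOR(fs->set)[k]] = -1;
    }
    igraph_vector_int_clear(&fs->set);
}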

In a function akin to igraph_incident we could then easily keep track of which neighbours have already been returned by using this “fixed set”. In this way, we can efficiently de-duplicate incident edges without requiring them to be sorted in any way. Hence, we could efficiently filter out parallel edges “on the fly”. Filtering out self-loops, to ensure they are included only once (or twice), can be done even more easily, by only including the self-loops from the “in” side (in_edges in the proposed data structure), even when requesting all edges (i.e. IGRAPH_ALL). Something similar can of course also be done in the current data structure (cf. the discussion in issue #1474).

To be sure, whether parallel edges should be filtered out, and whether self-loops should be filtered out, included once, or included twice, should be left to the user, I think. Hence, users should be able to indicate which incident edges they would like to receive.

Are there any other aspects of this particular point that would require attention? Happy to discuss this point further.

Just adding one other possible alternative for dealing with self-loops. In addition to in_edges and out_edges, we could also have self_edges, in which self-loops are stored instead of in in_edges and out_edges. This would allow us to easily and efficiently include or exclude self-loops, or count them once or twice, in various settings.

For example, suppose we want to calculate the degree and make it possible to count self-loops not at all/once/twice. Without considering self-loops, the in-degree of node i would simply be the number of elements in in_edges[i]. If we do not register the self-loops separately, we would always need to loop over the actual edges to determine which of them are self-loops, so that getting the degree would no longer be an O(1) operation, but an O(d) operation.
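A sketch of how the degree stays O(1) under this alternative, assuming a hypothetical extra field self_edges (an igraph_vector_ptr_t of per-node igraph_vector_int_t, analogous to in_edges/out_edges) added to the proposal_graph_t sketch above.

/* Hypothetical sketch; assumes a proposal_graph_t extended with a per-node
 * self_edges list that stores self-loops instead of in_edges/out_edges.
 * loop_count says how often each self-loop contributes: 0, 1 or 2. */
igraph_integer_t proposal_degree(const proposal_graph_t *g,
                                 igraph_integer_t i, int loop_count) {
    const igraph_vector_int_t *in_i   = VECTOR(g->in_edges)[i];
    const igraph_vector_int_t *out_i  = VECTOR(g->out_edges)[i];
    const igraph_vector_int_t *self_i = VECTOR(g->self_edges)[i];

    /* With self-loops registered separately, every variant is O(1). */
    return igraph_vector_int_size(in_i)
           + igraph_vector_int_size(out_i)
           + loop_count * igraph_vector_int_size(self_i);
}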

The in_idx and out_idx could then simply point to the index of the self-loop in self_edges, in case the edge is a self-loop (which is easily verified by checking from[e] == to[e], obviously). The in_idx and out_idx are only used when adding or removing edges. Hence, the need to check whether an edge is a self-loop, and is hence listed in self_edges or not, only arises in those cases, and should not further deteriorate performance when simply working with an existing graph.