Based on some earlier discussions (e.g. igraph_edge and from/to swapping), I wanted to propose a new graph data structure. To be clear, this is a proposal for the long-term, only for a version 1.0.

# Background and motivation

The current graph data structure has two main drawbacks:

- The structure is costly to change, adding or removing edges is very costly.
- Iterating over edges/neighbours is typically done in other data structures (i.e.
`igraph_inclist`

,`igraph_adjlist`

). A similar thing holds for constructing graphs, essentially due to the first problem.

The problem with 1 is that most people will use igraph in a higher-level language, and in that higher-level language, it is not possible to change the graph efficiently. This exclude a lot of possible graph generating algorithm from being implemented easily in the higher-level language and requires other data structures. Additionally, some algorithm (that may require deleting edges/vertices for example), will never function efficiently in that higher-level interface. This limits the capabilities of igraph I think, and I would favour a graph data structure that does allow this.

The problem with 2 is that we now have several other data types to iterate efficiently over edges/vertices/neighbours/incident edges, etcâ€¦ Although this is not necessarily a problem, it makes the implementation less efficient, since graphs always have to be translated into a more efficient data structure. Moreover, it increases the maintenance costs of the project, since additional graph types have to be maintained again. Finally, it may make it more difficult for people to start contributing to igraph, wondering what data type they may need for this purpose. Quoting Zen of Python here, â€śthere should be oneâ€”and preferably only oneâ€”obvious way to do it.â€ť The edge/vertex iteration is already well-implemented, and I think that should be the universal way to loop over edges/vertices instead of using the `igraph_inclist`

/`igraph_adjlist`

approach we often use currently).

# Proposed data structure

The graph structure I have in mind is the following:

```
igraph_vector_int_t from; // From node
igraph_vector_int_t to; //To node
igraph_vector_ptr_t in_edges // Vector of vectors pointing to incoming edges
igraph_vector_ptr_t out_edges // Vector of vectors pointing to outgoing edges
igraph_vector_int_t in_idx; // Index in incoming edges
igraph_vector_int_t out_idx; // Index in outgoing edges
igraph_vector_char_t is_sorted; // Keep track of whether incident edges are sorted
```

The edges are stored (in arbitrary order) in the two vectors `from`

and `to`

. This vector is not sorted in any way. In other words, edge `e`

runs from `from[e]`

to `to[e]`

(ignoring the `VECTOR`

notation).

The two vectors `in_edges`

and `out_edges`

contain for each node the edge indices of the incident edges that are incoming or outgoing, respectively. That is, for node `i`

, its incident incoming edges are listed in `in_edges`

, so that `in_edges[i][0]`

is the first incident incoming edge, and that hence, it is pointed to by `from[in_edges[i][0]]`

, and that `to[in_edges[i][0]] == i`

. Something similar holds for the outgoing incident edges of course.

The two vectors `in_idx`

and `out_idx`

point to the indices in `in_edges`

/`out_edges`

in which each edge is pointed to. In other words, suppose that edge `e`

runs between `i`

and `j`

, i.e. `from[e] == i`

and `to[e] == j`

, then `in_idx[e]`

is the index `k`

in `in_edges[j]`

, such that `in_edges[j][k] == e`

. In other words, `in_idx`

points to the index in the incident edges that is pointing towards that specific edge. The same thing holds for `out_idx`

, but for the outgoing edges. Each edge will always be listed only twice of course, so these two vectors are all that is needed. It will become apparent later why this is needed.

Finally, the vector `is_sorted`

keeps track of whether the incident edges are sorted in order of increasing neighbour. That is, if `is_sorted[i]`

is true, it means that `in_edges[i]`

and `out_edges[i]`

are sorted in order of increasing order. Hence, for such a node `i`

, we then have that `from[in_edges[i][k]] <= from[in_edges[i][k+1]]`

for each `k`

and that `to[out_edges[i][k]] <= to[out_edges[i][k+1]]`

.

# Operations

First some general remarks.

Any change in the graph structure will no longer ensure that the incident edges are sorted, so `is_sorted`

always needs to be set to false for the appropriate nodes.

Iterating over nodes/neighbours should preferably be done via the iteration interface. It is of course also possible to obtain neighbours or incidents edges using specific functions, but in order to guarantee integrity of the graph object (i.e. essentially encapsulating the graph object), these results would need to be copies.

## Adding edges

Adding edges simply amounts to adding an edge to the `from`

and `to`

vector, adding appropriate references in `in_edges`

and `out_edges`

, with appropriate index references from `in_idx`

and `out_idx`

.

Complexity: O(1)

## Adding nodes

Adding nodes simply corresponds to increasing the length of the `in_edges`

and `out_edges`

, and making sure that everything is appropriately initialised to empty vectors.

Complexity: O(1)

## Removing edges

Removing an edge is slightly more difficult. Suppose that we have m edges, which are hence indexed from edge 0, ..., m - 1 and that we want to remove edge e. Instead of removing e from the vectors, which would require shifting all edges, we will simply relabel edge m - 1 as e, and then drop edge m - 1. This requires one change in `from`

and `to`

, which is efficient and takes O(1). The change in `in_edges`

and `out_edges`

is also efficient, because we can efficiently locate the relevant index of edge `e`

using `in_idx[e]`

and `out_idx[e]`

, which hence takes O(1). Finally, after `in_edges`

and `out_edges`

are updated, we of course also need to update `in_idx`

and `out_idx`

, which also takes O(1).

Complexity: O(1)

## Removing nodes

For removing a node we will use the same trick of relabeling as above. That is, suppose we have n nodes, indexed from 0, ..., n - 1, and that we want to remove node i. We first need to remove each edge in `in_edges`

and `out_edges`

, which takes O(d_i) where d_i is the degree (in + out) of node i. We then need to relabel node n-1 as node i, which requires updating d_{n-1} edges. It is slightly cheaper than removing an edge though, since only the vectors `from`

and `to`

would require updating (the indices can remain the same).

Complexity: O(d_i + d_{n-1}).

## Iterating over edges

Iterating over all edges just requires looping over the `from`

and `start`

edges, and requires no special structures.

Complexity: O(m), where m is the total number of edges.

## Iterating over incident edges

Iterating over incident edges of a node i can simply be done immediately using the vectors `in_edges`

and `out_edges`

. Iterating over all edges (i.e. incoming and outgoing) simultaneously can also be done efficiently by simply concatenating the iteration. If the iteration needs to be sorted (i.e. if `is_sorted[i]`

is true), it would be possible to iterate over all incident edges in order by merging the two sorted vectors on the fly.

Complexity: O(d_i), where d_i is the number of incident (incoming/outgoing/all) edges.

## Iterating over neighbours

Iterating over neighbours of a node i is slightly slower than iterating over incident edges, because it requires to fetch the â€śotherâ€ť node (which is the `from`

node for `in_edges`

and the `to`

node for `out_edges`

).

Complexity: O(d_i), where d_i is the number of incident (incoming/outgoing/all) edges.

## Finding edge

Getting the `from`

and `start`

nodes for a specific edge `e`

is efficient and simple, but the inverse is more costly. This is also the case in the current implementation, but because everything is sorted in the current implementation it can be done efficiently using binary search, and takes only O(\log d), see `igraph_get_eid`

. In this proposal, nothing is sorted by default, and hence this will take linear time O(d). However, if the incident edges are sorted (which takes time O(d \log d)), this is of course reduced to O(\log d). Hence, if `igraph_get_eid`

is expected to be called more than a certain number of times, it might be more efficient to first sort it. This feature is not frequently used I think, but in principle it would allow for a similarly complexity as the current setup.

Complexity: O(d) or O(\log d) if sorted.

# Open questions

## Undirected graphs

At the moment, the whole data structure is presented as a directed graph. There are multiple possible ways how to handle undirected graphs. We could for example only use `in_edges`

for simply all edges, and have `in_idx`

and `out_idx`

point to the two `in_edges`

where they are listed. An alternative would be to simply use `in_edges`

and `out_edges`

consistent with the `from`

and `to`

definition, but to simply ignore the directionality in relevant parts of the code.

## Loops

We should discuss how to deal with loops. In this proposal, each loop would be included in the `from`

and `start`

vector only once (as it is now). Each loop would be included in the `in_edges`

once and once in the `out_edges`

once. This is also further discussed in issue #1193.

## 32-bits and 64-bits

Edge indices would need to be 64-bits in order to support large graphs (as discussed in issue #1450). However, there is still is the question of whether we would also need to support 64-bits node indices. If we only support 32-bits node indices this would also halve the memory of the graph object. There is a certain trade-off here: in theory, allowing 64-bits node indices may allow larger graphs, but in practice, the limitation will often be limited memory, in which case having 32-bits node indices would actually allow for larger graphs (i.e. for a given memory size). For ease and consistency, it would probably be the easiest to simply use 64-bits indices throughout.

Curious to hear what you all think. Clearly, this would require quite some effort, but I think it would make `igraph`

more useful to the higher-level interfaces and would reduce maintenance costs in the long run.