I think this is very useful @szhorvat, and a very important discussion. I agree that indexing vertices in igraph
should be made easier, simpler, and more intuitive. Having something like VName(x)
sounds useful, but I’m not sure whether this is very practical.
It might be good to check out how similar problems are solved in other libraries. Most notably, in pandas
they had a similar problem, where indices could represent either row numbers (i.e. integers) or labels (which can be any hashable Python object). At some point, they decided to switch to two separate indexers: an integer-based one that selects rows by position, called iloc
and a label-based one, called loc
.
I would like to propose a similar logic for igraph
. Many operations in the Python interface work on vertex sequences. I would propose two separate indexing functions for obtaining a vertex sequence. We would have ivs
for the integer-based vertex sequence and vs
for the label-based vertex sequence. The integer-based vertex sequence ivs
only accepts integers and always interprets them as the underlying integer representation of the vertices. The label-based vertex sequence vs
accepts any hashable Python object, functioning essentially like a dictionary. Hence, when passed a single hashable object, it should interpret it as a single index/label. Additionally, both vertex sequences should accept iterables, so that it’s easy to simply pass a list of vertices to select them, e.g. G.ivs[[0, 3, 5]]
or G.vs[['Bob', 'Maria', 'Celine']]
. Both ivs
and vs
should return a derived type of VertexSequence
, so that both can be accepted in functions as a valid vertex sequence.
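As a rough sketch of how the two indexers could sit side by side, here is a toy model in plain Python. None of this is the igraph API: ToyGraph, IntegerIndexer, LabelIndexer and this minimal VertexSequence are hypothetical names used only for illustration, and the label indexer uses the simplification that a list means several labels and anything else means one label.

```python
class VertexSequence:
    """Toy vertex sequence: a graph plus a list of integer vertex ids."""

    def __init__(self, graph, ids):
        self.graph = graph
        self.ids = list(ids)

    @property
    def labels(self):
        return [self.graph.labels[i] for i in self.ids]


class IntegerIndexer:
    """Plays the role of the proposed ivs: integers only."""

    def __init__(self, graph):
        self.graph = graph

    def __getitem__(self, key):
        # A single integer or an iterable of integers.
        ids = [key] if isinstance(key, int) else list(key)
        for i in ids:
            if not isinstance(i, int):
                raise TypeError(f"ivs accepts only integers, got {i!r}")
            if not 0 <= i < len(self.graph.labels):
                raise IndexError(f"no vertex with id {i}")
        return VertexSequence(self.graph, ids)


class LabelIndexer:
    """Plays the role of the proposed vs: labels, looked up like dict keys."""

    def __init__(self, graph):
        self.graph = graph

    def __getitem__(self, key):
        # Simplification: a list means several labels, anything else one label.
        labels = key if isinstance(key, list) else [key]
        return VertexSequence(self.graph, [self.graph.position[lab] for lab in labels])


class ToyGraph:
    def __init__(self, labels):
        self.labels = list(labels)                                    # id -> label
        self.position = {lab: i for i, lab in enumerate(self.labels)}  # label -> id
        self.ivs = IntegerIndexer(self)
        self.vs = LabelIndexer(self)


G = ToyGraph(["Bob", "Maria", "Celine", "John"])
by_id = G.ivs[[0, 2]]
by_label = G.vs[["Bob", "Celine"]]
print(by_id.labels, by_label.ids)
```

Both calls return the same kind of object, so a function that receives it does not need to know which indexer produced it.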
The question of whether a tuple represents a list or a label can be resolved consistently with the usual approach in Python. Tuples are immutable and, provided their elements are hashable, hashable themselves, so they should be allowed as vertex labels. When vs
is passed an immutable, hashable iterable (e.g. a tuple
) it should be interpreted as a vertex label. When vs
is passed a non-hashable iterable (e.g. a list
), it should be interpreted as a list of vertex labels. Since a list is non-hashable, this provides an easy way to distinguish between the two. Hence, G.vs[(0, 2)]
should be interpreted as a single node with label (0, 2)
, while G.vs[ [(0, 2), (1, 3)] ]
should be interpreted as a list of two node labels, (0, 2)
and (1, 3)
. I am not 100% sure about other relevant iterables. For example, we should make sure that a str
is interpreted as a vertex label, not as an iterable of characters, but str
is hashable, so I guess that’s consistent with this general rule. There might be some issues with other iterators, I’m not sure. @tamas, @iosonofabio, you perhaps know more about this.
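The hashability rule can be sketched in a few lines. The helper name interpret is hypothetical; it only classifies a key and does not touch any graph. The sketch also makes the open question about other iterables concrete: in Python, range objects and generators are hashable, so under this rule they would be treated as a single label rather than as a collection of labels.

```python
def interpret(key):
    """Classify key as one label or several labels, based on hashability."""
    try:
        hash(key)
    except TypeError:
        # Unhashable iterable (list, set, dict, ...): a collection of labels.
        return ("labels", list(key))
    # Hashable (str, int, tuple of hashables, ...): a single label.
    return ("label", key)


print(interpret("Bob"))
print(interpret((0, 2)))
print(interpret([(0, 2), (1, 3)]))
print(interpret(range(3))[0])
```

Note that a tuple containing an unhashable element, such as ([0], 2), is itself unhashable and would therefore fall into the "several labels" branch.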
For any function that accepts a list of vertices, I would propose that it should accept anything that represents a VertexSequence
. If it is not a VertexSequence
, any such function should assume it represents something that should be passed to vs
in order to obtain a proper VertexSequence
. That is, it should default to a label-based vertex sequence. This way, you should be able to call for example G.induced_subgraph(['John', 'Marie'])
, where ['John', 'Marie']
is then passed to G.vs
in order to obtain a valid VertexSequence
. If people want to use integer-based selection, they should always do so explicitly. This way, people are not forced to explicitly call vs
or ivs
, but the rule is clear.
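A minimal sketch of this defaulting rule, again on a toy class rather than igraph itself. The methods by_label and by_id stand in for the proposed vs[...] and ivs[...], and induced_subgraph_ids is a hypothetical placeholder that simply returns the integer ids a real induced_subgraph would keep.

```python
class VertexSequence:
    """Toy vertex sequence holding integer vertex ids."""

    def __init__(self, ids):
        self.ids = list(ids)


class ToyGraph:
    def __init__(self, labels):
        self.position = {lab: i for i, lab in enumerate(labels)}

    def by_label(self, labels):
        """Stand-in for the proposed label-based vs[...]."""
        return VertexSequence(self.position[lab] for lab in labels)

    def by_id(self, ids):
        """Stand-in for the proposed integer-based ivs[...]."""
        return VertexSequence(ids)

    def _as_vertex_sequence(self, vertices):
        # Already a vertex sequence: use it unchanged.
        if isinstance(vertices, VertexSequence):
            return vertices
        # Anything else is assumed to be labels, i.e. passed to vs.
        return self.by_label(vertices)

    def induced_subgraph_ids(self, vertices):
        return self._as_vertex_sequence(vertices).ids


G = ToyGraph(["John", "Marie", "Ann"])
print(G.induced_subgraph_ids(["John", "Ann"]))   # labels by default
print(G.induced_subgraph_ids(G.by_id([1, 2])))   # integers only when explicit
```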
Additionally, there is the question of using vertex attributes. At the moment, they can be accessed using vs
, for example G.vs['name']
for the vertex attribute name
. Although this is convenient, it also creates an ambiguity: is the value being passed a vertex label or a vertex attribute? Again, looking to pandas
for inspiration, they have row- and column-based selection. That is, pandas
uses df.loc[rows, columns]
to index rows
and columns
. We don’t need to index multiple columns simultaneously, I think. But it could be worthwhile to explicitly distinguish between vertex indices and vertex attributes. I would therefore propose that we use the syntax vs[vertex_indices, vertex_attribute]
to select vertex indices and a vertex attribute. We can use the same syntax for ivs
. Note that I think we could limit support to just using a single vertex_attribute
, not multiple vertex attributes. If you are only interested in a single vertex attribute, but don’t want to make a selection on any vertices, you could use G.vs[:, vertex_attribute]
.
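Here is a small sketch of what two-axis indexing could look like, using a hypothetical LabelIndexer class that stores attributes as plain lists. It assumes the "all vertices" case is spelled with a full slice, as in pandas. One caveat worth keeping in mind: Python hands vs[a, b] and vs[(a, b)] to __getitem__ as the same tuple, so this sketch treats every 2-tuple as (vertices, attribute), and combining it with tuple-valued labels would need an additional rule.

```python
class LabelIndexer:
    """Toy stand-in for vs supporting vs[vertex_labels, attribute]."""

    def __init__(self, labels, attributes):
        self.labels = list(labels)
        self.position = {lab: i for i, lab in enumerate(self.labels)}
        self.attributes = attributes  # attribute name -> one value per vertex

    def __getitem__(self, key):
        # vs[a, b] arrives here as the tuple (a, b).
        if not (isinstance(key, tuple) and len(key) == 2):
            raise TypeError("expected vs[vertices, attribute]")
        vertices, attribute = key
        if isinstance(vertices, slice) and vertices == slice(None):
            ids = range(len(self.labels))                   # vs[:, attribute]
        else:
            ids = [self.position[lab] for lab in vertices]  # list of labels
        values = self.attributes[attribute]
        return [values[i] for i in ids]


vs = LabelIndexer(["Bob", "Maria", "Celine"], {"age": [31, 27, 45]})
print(vs[["Bob", "Celine"], "age"])
print(vs[:, "age"])
```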
One nice thing about this proposal, I think, is that it then mimics the pandas
interface. A lot of people are already familiar with working with pandas
I believe. In a sense, the results that you get by using G.vs
are then similar to what you would get by using G.get_vertex_dataframe().loc
. There is hence a certain degree of consistency between the various functions. Of course, we would only support a subset of the indexing logic that is in pandas
.
We sometimes explicitly want only a single vertex. At the moment, there is the function vs.find
which returns a single (i.e. the first) matching vertex. I think this is perhaps not the most useful function. Instead, I would again propose a function for accessing individual nodes directly, available on both sequences, namely at
. This function should accept a single vertex index/label and return the requested vertex. When called on ivs
the argument is interpreted as an integer-based index, while on vs
it is interpreted as a vertex label. That is, you call ivs.at(3)
to get vertex number 3, or vs.at('Bob')
to get the vertex with label Bob
. Again, this is similar to the functionality of iat
and at
in pandas
respectively.
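A short sketch of the single-vertex accessor, with hypothetical IntegerAccessor and LabelAccessor classes standing in for ivs and vs, and a toy Vertex record in place of igraph's own vertex type.

```python
from typing import NamedTuple


class Vertex(NamedTuple):
    index: int     # underlying integer id
    label: object  # any hashable label


class IntegerAccessor:
    """Stand-in for ivs: at() takes an integer id."""

    def __init__(self, labels):
        self._labels = list(labels)

    def at(self, index):
        if not isinstance(index, int):
            raise TypeError(f"ivs.at expects an integer, got {index!r}")
        return Vertex(index, self._labels[index])


class LabelAccessor:
    """Stand-in for vs: at() takes a label."""

    def __init__(self, labels):
        self._position = {label: i for i, label in enumerate(labels)}

    def at(self, label):
        return Vertex(self._position[label], label)


labels = ["Bob", "Maria", "Celine", "John"]
ivs, vs = IntegerAccessor(labels), LabelAccessor(labels)
print(ivs.at(3))
print(vs.at("Bob"))
```

Unlike a find that returns the first match, at either returns exactly the requested vertex or raises an error.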
While we’re at it, at the moment there is the select
function of a VertexSequence
. First of all, I think it would make more sense to call it filter
. Secondly, the syntax is really not that great. Of the existing options, I think only the callable object makes sense, i.e. G.vs.filter(lambda v: v['weight'] < 5)
. In addition, it might make sense to support an iterable of Booleans, indicating for each vertex whether it should be included or not, i.e. G.vs.filter([True, False, False, True])
would include only the first and last vertex. In pandas
this is actually also supported in the loc
function, but I think this actually makes things confusing.
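The two forms of filter described here could be sketched as follows. This is a free function over a list of attribute dictionaries, not an igraph method, and the name filter_vertices is hypothetical.

```python
def filter_vertices(vertices, condition):
    """Keep vertices matching a callable or an iterable of booleans."""
    if callable(condition):
        return [v for v in vertices if condition(v)]
    mask = list(condition)
    if len(mask) != len(vertices):
        raise ValueError("mask length must match the number of vertices")
    return [v for v, keep in zip(vertices, mask) if keep]


vertices = [
    {"name": "a", "weight": 2},
    {"name": "b", "weight": 7},
    {"name": "c", "weight": 9},
    {"name": "d", "weight": 4},
]

light = filter_vertices(vertices, lambda v: v["weight"] < 5)
ends = filter_vertices(vertices, [True, False, False, True])
print([v["name"] for v in light], [v["name"] for v in ends])
```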
Presumably, using only callable functions in filter
is rather inefficient when using vertex attributes, and I therefore understand that perhaps some alternative should be offered. Again, looking to pandas
for inspiration, the notation that is used in query
is perhaps most sensible. This would allow a user to specify a query in a fairly natural way. For example, G.vs.query("age > 18 and employed")
would return all vertices where age > 18
and where employed
is True
.
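To give an idea of the semantics only, here is a deliberately naive sketch that evaluates the query string against each vertex's attributes using Python's eval. The function name query and the data are hypothetical, this is not how pandas implements its query, and eval should not be used on untrusted input. A real implementation would presumably parse the expression once and evaluate it over whole attribute vectors, which is where the efficiency gain over a per-vertex callable would come from.

```python
def query(vertices, expression):
    """Return the vertices whose attributes satisfy the expression."""
    code = compile(expression, "<query>", "eval")
    # Attribute names in the expression resolve to the vertex's attributes.
    return [v for v in vertices if eval(code, {"__builtins__": {}}, dict(v))]


vertices = [
    {"name": "Bob", "age": 25, "employed": True},
    {"name": "Maria", "age": 17, "employed": True},
    {"name": "Celine", "age": 40, "employed": False},
]

adults_with_job = query(vertices, "age > 18 and employed")
print([v["name"] for v in adults_with_job])
```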
At the moment, the existing select
function accepts integers as representing the “current vertex set”. That is, when chaining several calls to select
, you cannot use the integer representation of the vertices as present in the graph; you have to refer to the integer positions within the “current vertex set”. This is very confusing and makes it difficult to work with.
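The effect can be illustrated with plain Python lists. The helper select_positions is hypothetical and only mimics position-based selection within the current set; it is not igraph code.

```python
def select_positions(current, positions):
    """Pick entries of the current vertex set by position within that set."""
    return [current[p] for p in positions]


all_vertices = [0, 1, 2, 3, 4, 5]

# First selection: positions coincide with graph vertex ids.
step1 = select_positions(all_vertices, [2, 3, 4])

# Second selection: 0 and 1 now mean "first and second of step1",
# not graph vertices 0 and 1.
step2 = select_positions(step1, [0, 1])
print(step1, step2)
```

After the first call, the integers passed in no longer refer to vertices of the graph, which is exactly what makes chained calls hard to read.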
The proposed filter
and query
are identical for both ivs
and vs
, since no reference to vertex indices is made directly. For edge sequences a similar syntax could be used. Since edges have no labels (and I don’t think they are necessary), we can simply stick to having only es
instead of having also ies
, with the difference that es
always only accepts integer-based edge indices.
Clearly, all functions should accept vertex arguments in identical ways. Right now, some functions accept only integer vertices, some accept a Vertex
, others accept a name index, and it is not always clear to users what should be used.
Finally then, I think it would be worthwhile to take our time when thinking about this. This is a very fundamental part of the library, and makes a big difference in user experience. If this works intuitively, it greatly facilitates working with igraph
. If it does not, working with the library becomes much more of a pain. So, let’s keep discussing this, and if anybody has other proposals, feel free to contribute to the discussion!