Making it possible to use arbitrary vertex names

Currently, python-igraph has good support for referring to vertices by index (as an int starting at 0) or by a string-name (a str). Other Python objects are not well-supported as vertex names.

It would be useful to make it possible (and convenient) to use other objects as well. Tuples in particular are very useful for many graph-theory problems. Consider e.g. a Kautz graph where the vertices are in fact tuples. There are countless applications. Numbers are also useful as vertex names (not indices), for example for keeping track of which vertex was which after having deleted a number of vertices.

The the problem is: how to make it clear which of the following an object represents?

  • A vertex index
  • A vertex name
  • A list of vertex indices
  • A list of vertex names

For example, does (1,2) refer to two vertices with indices 1 and 2? Two vertices named 1 and 2 (where their index may be different from their name)? A single vertex named (1,2)?

Here’s my proposal for solving this problem. When reading this, keep in mind that I am not very well educated in Python, so I may be missing basic things.

Let us introduce a “wrapper” named VName (or whatever). Then VName(x) will always refer to a single vertex named x.

This way, the following interpretations can be kept as they are:

  • Numbers are interpreted as vertex indices, strings are interpreted as vertex names
  • Iterables of numbers or strings are interpreted as lists of vertex indices or names, respectively

But additionally:

  • VName(x) is always interpreted as a single vertex with name (never index!) x, regardless of whether x is a number, string, other Python object, or iterable.

This way, we can get the degree of a single vertex named (1,2) as g.degree(VName((1,2))). We can select vertices whose “names” (not necessarily strings) are stored in mylist as g.vs[(VName(v) for v in mylist)]. It would become easy to use numbers of vertex names, which is currently not very practical.

I think this is very useful @szhorvat, and a very important discussion. I agree that indexing vertices in igraph should be made easier, simpler, and more intuitive. Having something like VName(x) sounds useful, but I’m not sure whether this is very practical.

It might be good to check out how similar problems are solved in other libraries. Most notably, in pandas they had a similar problem, where indices could represent either row numbers (i.e. integer numbers) or labels (which can be any Python hashable object). At some point, they decided to switch to having two separate index functions, namely a specific integer based one, selecting rows, called iloc and a specific label based one, called loc.

I would like to propose a similar logic for igraph. Many operations in the Python interface work on vertex sequences. I would propose two separate indexing functions for obtaining a vertex sequence. We would have ivs for the integer-based vertex sequence and vs for the index-based vertex sequence. The integer-based vertex sequence ivs only accepts integers and always interprets them as the underlying integer representation of the vertices. The label-based vertex sequence vs accepts any hashable python object, functioning essentially similar to a dictionary. Hence, when being passed a single hashable object, it should interpret it as a single index/label. Additionally, both vertex sequences should accept iterables, so that it’s easy to simply pass a list of vertices to select them, e.g. G.ivs[[0, 3, 5]] or G.vs[['Bob', 'Maria', 'Celine']]. Both ivs and vs should return a derived type of VertexSequence, so that both can be accepted in functions as a valid vertex sequence.

The question about whether tuples represent a list or a label can be solved consistent with the typical approach in Python. Tuples are non-mutable, therefore hashable, and should therefore be allowed as vertex indices. When vs is passed a non-mutable, hashable, iterable (e.g. a tuple) it should be interpreted as a vertex label. When vs is passed a non-hashable iterable (e.g. a list), it should be interpreted as a list of vertex indices. Since a list is non-hashable, this provides an easy way to distinguish between the two. Hence, G.vs[(0, 2)] should be interpreted as a single node with label (0, 2), while G.vs[ [(0, 2), (1, 3)] ] should be interpreted as a list of two node indices (0, 2) and (1, 3). I am not 100% sure about other relevant iterables. For example, we should make sure that a str is interpreted as a vertex label, not as an iterable of characters, but str is hashable, so I guess that’s consistent with this general rule. There might be some issues with other iterators, I’m not sure. @tamas, @iosonofabio, you perhaps know more about this.

For any function that accepts a list of vertices, I would propose that it should accept anything that represents a VertexSequence. If it is not a VertexSequence, any such function should assume it represents something that should be passed to vs in order to obtain a proper VertexSequence. That is, it should default to an label-based vertex sequence. This way, you should be able to call for example G.induced_subgraph(['John', 'Marie']), where ['John', 'Marie'] is then passed to G.vs in order to obtain a valid VertexSequence. If people want to use integer-based selection, they should always do so explicitly. This way, we avoid forcing people to always explicitly have to call vs or ivs, but the rule is clear.

Additionally, there is the question of using vertex attributes. At the moment, they can be accessed using vs, for example G.vs['name'] for the vertex attribute name. Although this is convenient, it also complicates the usage: is what is being passed a vertex label or a vertex attribute? Again, looking to pandas for inspiration, they have row and column based selection. That is, pandas uses df.loc[rows, columns] to index rows and columns. We don’t need to index multiple columns simultaneously, I think. But it could be worthwhile to explicitly distinguish between vertex indices and vertex attributes. I would therefore propose that we use the syntax vs[vertex_indices, vertex_attribute] to select vertex indices and a vertex attribute. We can use the same syntax for ivs. Note that I think we could limit support to just using a single vertex_attribute, not multiple vertex attributes. If you are only interested in a single vertex attribute, but don’t want to make a selection on any vertices, you could use G.vs[,vertex_attribute].

One nice thing about this proposal, I think, is that it then mimics the pandas interface. A lot of people are already familiar with working with pandas I believe. In a sense, the results that you get by using G.vs are then similar to what you would get by using G.get_vertex_dataframe().loc. There is hence a certain degree of consistency between the various functions. Of course, we would only support a subset of the indexing logic that is in pandas.

We sometimes explicitly want only a single vertex. At the moment, there is the function vs.find which returns a single (i.e. the first) matching vertex. I think this is perhaps not the most useful function. Instead, I would propose again two functions for accessing individual nodes directly, namely at. This function should accept a single vertex index/label and returns the requested vertex. When called on ivs this is interpreted as a integer based index, while on vs this is interpreted as a a vertex label. That is, you call ivs.at(3) to get vertex number 3, or ivs.at('Bob') to get vertex with label Bob. Again, this is similar to the functionality of at and iat in pandas respectively.

While we’re at it, at the moment there is the select function of a VertexSequence. First of all, I think it would make more sense to call it filter. Secondly, the syntax is really not that great. Of the existing options, I think only the callable object makes sense, i.e. G.vs.filter(lambda v: v['weight'] < 5). In addition, it might make sense to support an iterable of Booleans, indicating for each vertex whether it should be included or not, i.e. G.vs.filter([True, False, False, True]) would include only the first and last vertex. In pandas this is actually also supported in the loc function, but this is actually making things confusing I think.

Presumably, using only callable functions in filter is rather inefficient when using vertex attributes, and I therefore understand that perhaps some alternative should be offered. Again, looking to pandas for inspiration, the notation that is used in query is perhaps most sensible. This would allow a user to simply specify a query in a rather natural language. For example, G.vs.query("age > 18 and employed") would return all vertices where age > 18 and where employed is True.

At the moment, the existing select function accepts integers as representing the “current vertex set”. That is, concatenating various calls to select you cannot use the integer representation of the vertices as present in the graph, and you have to refer to the integer indices as present in the “current vertex set”. This is very confusing and makes it difficult to work with.

The proposed filter and query are identical for both ivs and vs, since no reference to vertex indices is made directly. For edge sequences a similar syntax could be used. Since we have no edge indices (and I don’t think this is necessary), we can simply stick to having only es instead of having also ies, with the difference that es always only accepts integer-based edge indices.

Clearly, all functions should accept vertex arguments in identical ways. Right now, some functions accept only integer vertices, some accept a Vertex, others accept a name index, and it is not always clear to users what should be used.

Finally then, I think it would be worthwhile to take our time when thinking about this. This is a very fundamental part of the library, and makes a big difference in user experience. If this works intuitively, it greatly facilitates working with igraph. If this does not work intuitively, it makes it much more of a pain to work around. So, let’s keep discussing this, and if anybody has other proposals, feel free to contribute to the discussion!

1 Like

Let’s discuss this in person.

pandas loc/iloc/ix/__getitem__/at is a total, utter mess. Not blaming them: they became too popular too fast and have to respect the API, but it’s a good lesson for us. None of my data science collaborators uses it consistently, and most people routinely rely on crazy corner cases of the implementation.

Tuples are another mess: they are already tricky to use in numpy, but in pandas they are almost impossible to get right. Again, not blaming anyone, I understand it’s hard. In pandas, it can mean a multiindex, a row/col name (“it’s a hashable right?”), a sequence of rows/columns, a sequence of entries, etc. depending on subtle syntax details.

One library I tried to get inspiration from in the past is xarray, we could learn something from them too.

The most important piece of pandas is not the accessor functions, but rather the fwd dictionary names → indices. We could implement that without the rest of the API and we’d be 75% of the way there.

Responding to @vtraag. When reading my answer, please keep in mind that this is coming from a Python novice.


Regarding the proposal to have .ivs and .vs:

Right now, we have many functions that can take either a single vertex or a list of vertices. One example is .degree(). How do you propose we proceed with such functions?

For example, g.degree(1) gets the degree of vertex 1. g.degree(['Bob', 'Jane']) get the list of vertices of Bob and Jane. We need to distinguish between: (1) single vertex or multiple vertices (2) vertex index or vertex name.

For (1), you propose checking if the object is hashable. I will discuss this later. But what about (2)? How do we distinguish between vertex name and indices? How do I get the degree of the vertex named 1? Do we do them g.degree(g.vs[1])?

One drawback of this compared to VName is that now we have to refer to the graph object twice, e.g. g1.degree(g1.vs[1]) and g2.degree(g2.vs[2]). One of the advantages of using integers as names is so that we can keep track of which vertex is which after having deleted (or added) some vertices. It’s nice to be able to say VName(1) and have it refer to vertex 1 in both g1 and g2, without having to mention the graph twice.

Note: If we have to use a vertex selector every time we refer to vertices by name in a function/method, then .vs and .ivs won’t be more concise than VName.


Regarding determining whether an object represents a single vertex or a list:

You propose using the hashable property. As a Python novice, I find this extremely unintuitive. It is quite clear even to beginners that both (1,2,3,4) and [1,2,3,4] represent a sequence of things. But that one is “hashable” and the other is not is going to sound like an arcane property to many academic users who just want to get network analysis done but aren’t really programmers (our main target audience). If one behaves differently from the other, that is unintuitive. Can it be learned? Sure it can. But isn’t it better if things work the way people expect them to instead of having to learn about them?


Regarding the fact that g.vs[string] refers to an attribute and not a vertex name. Very good point! This is an inconsistency, and it would be nice to do something about it.

If we go with the VName solution, it would be nice if any “vertex reference” would behave the same in all contexts. I.e. if "Bob" refers to a vertex in g.degree("Bob"), it should also refer to a vertex in g.vs["Bob"]. There are different ways to ensure full consistency, but all of them are breaking changes … Breaking changes are fine in general, but reinterpreting a certain syntax can be extremely confusing to users. From this perspective, having g.vs["Bob"] return a vertex is not a great thing. Maybe we can require g.vs[VName("Bob")], but somehow I don’t like the idea to requiring VName even for strings in every context.

P.S. I agree with Fabio about discussing it in person, but I think it’s nice to have some things written down first. This is why I responded here.

Yes, I agree, and I wouldn’t want to mimic the complete API of pandas or anything. In my opinion pandas tries to do too much with various variants of indexing, especially how it relates to hierarchical indices. Luckily, we don’t have to deal with that. However, it used to be an even bigger mess. Somewhere before 0.20 (I think) they mixed integer indices and label indices, which was even more confusing. The clear separation between integer indices and label indices is I think a step forward. This issue is similar for igraph I think, and we should clearly separate integer indices from label indices I think.

Yes, indeed in pandas the indexing is sometimes quite confusing, especially when it comes to slicing notation with hierarchical indices and all that and tuples are not necessarily improving things. However, I’m not sure if this is necessarily a problem for igraph: we would only use tuples as labels for nodes, not for anything else. In a sense, I think this is similar to the typical Python dictionary: if you want you can use a tuple as a key for a dictionary, I don’t think that causes a lot of confusion.

Yes, I would be happy to learn from that! If you have some concrete proposal based on that, let us know.

I’m not sure if I understand this correctly. The whole problem is exactly about getting some vertex names to integer indices, isn’t it? I’m not sure what you consider the rest of the API? At any rate, I would prefer to discuss and think about this for a longer time rather than to start implementing something (and exposing it) which requires a change a few versions later again.

What I would propose is that any argument to a function that takes a VertexSequence that is not a VertexSequence should be piped through g.vs. That is, you would be able to write g.degree(1) and internally it would pass 1 to obtain the VertexSequence returned by g.vs[1]. I would propose that this is always piped through vs not through ivs, so that you by default select on the name of the node, not the integer index of the node.

No, as explained previously, I propose to always pipe the default argument to g.vs, meaning that g1.degree(1) and g2.degree(1) would both obtain the degree of the vertex named 1.

What is perhaps unclear in my original proposal is that integers are explicitly allowed as vertex names. Hence, g.ivs refers to using the underlying integer index, while g.vs refers to the vertex names (which can be integers, string or tuples or anything hashable).

Yes, I understand that the difference is perhaps not immediately obvious. However, if you ever use a dictionary where you want to use a list as a dictionary you quickly learn that you cannot use a list for indexing, exactly because it is not hashable. I’m not sure how “advanced” this knowledge is. If there’s a clearer distinction between the two, I would have no problem with it.

The problem I have with VName is that you would be forced to write it if you want to select a list of such vertices. Suppose that I have a list of names v = ["Bob", "John", "Marie"]. In my proposal, you could simply do g.degree(v) and you would obtain the degrees of the nodes named "Bob", "John" and "Marie". If you have a list of names v = [1, 3, 8] doing g.degree(v) would also get you the degrees of the nodes named 1, 3 and 8. If you would use VName to make the distinction, you would somehow need to do g.degree([VName(x) for x in v]), which, personally, I find rather distracting and not intuitive to use.

Yes, I agree. Such a breaking change is indeed very confusing, and therefore I think it should not be done lightly. To be clear, I wouldn’t want to implement something like this on a short notice. If we ever make such a breaking change, we should preferably not have to change it again ever.

Yes, it might be helpful to discuss this in person indeed! But I think it’s also useful so have something in writing about this. Perhaps there are also some igraph users who can chip in?

1 Like

I’m reading the discussion with great interest; however, at some point we would need to start working on a proposal instead of throwing ideas around. One way to do it would be similar to PEPs (i.e. Python Enhancement Proposals); we could write one or two enhancement proposals (maybe one based on the VName idea and one based on the vs/ivs idea), complete with code examples and then vote based on the complete proposals. One advantage of this would be that we would be forced to think through all the implications of a proposed change while writing the enhancement proposal.

I’d also like to throw a radical idea into the mix: what if we completely do away with vertex indices? After all, it’s an implementation detail that stems from the fact that the core of igraph is written in C. However, the user does not need to know about it. I’m not sure about it, actually, but I wonder what it would take to say that each vertex is identified by an object, and in the absence of user-defined identifiers we simply use integers from [0; |V|-1]. I highly doubt that the users would actually want to mix index-based and name-based vertex access.

2 Likes

Ha, yes, indeed something along those lines would be a good idea. Not sure how to set this up? Also not sure if we are already there yet.

I agree that at some point we need to make things concrete. However, I think it would be good to brainstorm, let ideas float around and ponder on things for a while. I don’t think we’re in a rush with this.

Yes, this sounds very appealing! The only reason I though that integer indices might still be necessary was efficiency and clarity about the ordering of vertices. But other than that, I agree that it would in principle be better to only use labels, and never underlying integer indices, if possible.

About relying on the hashable property to distinguish between single vertices or sets of vertices:

There is a significant difference between how one works with dictionaries, and what is being proposed here. It’s the same kind of difference I was concerned about when discussing breaking changes in my post above:

Trying to use a non-hashable key with a dictionary results in an error. No damage was done. The error clearly mentions the term “unhashable”, so a novice user has something to go on.

Thus for dictionaries we have: hashable → works, unhashable → doesn’t work. This is fine in my opinion.

What is not fine, IMO, is this: hashable → interpretation 1, unhashable → interpretation 2. I do not like the idea of relying on an invisible and unintuitive property to decide what behaviour to follow.

I did not miss this, and it’s a good point. Strings demonstrate that we can’t just blindly check if something is iterable to decide if something is a list of vertices (vs single vertex). Still, strings don’t break intuition because they don’t “look like” lists of things, while tuples do.

I wanted to note that it is now becoming clear that the two problems we are discussing are fairly independent, and do not need to be solved in the same way:

  1. Decide if something represents a vertex name or vertex index.
  2. Decide if something represents a list of vertices or a single vertex.

It’s possible to mix and match the various solutions proposed here for problems 1 and 2.

I am wondering, is there precedent in the Python world for using the “hashable” property to decide whether something represents a collection of objects vs a single object?

Are there containers which are reasonable to be used to store vertices, yet are hashable? Perhaps sometimes we need to store vertices in a hashable container. Last year, at the CZI presentation, I showed Tamás’s clique percolation implementation, which relies on building a “meta-graph” of intersecting cliques. The vertices of this “meta-graph” are themselves lists of vertices (cliques), i.e. need to be some kind of hashable container (in that case we used frozenset).

This example shows that if we use the hashable property for distinguishing between lists vs single objects, at some point igraph users will explicitly need to start converting between hashable and non-hashable containers, just to be able to change the interpretation from “vertex” to “list of vertices”. The problem with this is that what is actually being done (convert between hashable/unhashable) does not represent the intention well (convert between unit/list of units).

What I am looking for here is to be able to have code that clearly communicates intent, even to someone who is not an expert.

While I originally suggested a VName wrapper to distinguish between names and indices, similar wrappers can also be used to distinguish between lists and units. Here’s a rough example (not to be taken as a concrete proposal): We can have Vertex(v) always represent a single vertex, regardless of what v is (I know that Vertex is a taken name, this is just an example). Or we can have VertexList(vl) always represent a list, and never a single vertex. For example, if I have a frozenset called fs, writing set(fs) to treat it as a sequence is not nice because it does not communicate intent well. Writing VertexList(fs) is much clearer. That said, it seems to me that disambiguating single vertices (Vertex) may be more useful than disambiguating lists (VertexList).

Yes, this is indeed the crux. The reason I suggested doing this based on the hashable property is that you can then easily enter a single vertex or a list without using separate notation. However, I do agree that it might also make things less intuitive, instead of more intuitive.

Yes, I like the idea of separately having to refer to a single vertex or a vertex sequence. Perhaps we could use G.v to refer to a single vertex, similar to how we use G.vs to refer to a vertex sequence. That is, we could then refer to G.v["Bob"] to get the single vertex named "Bob". This should then work identical to a dictionary indeed, if the argument is non-hashable is simply raises an error. G.vs would then always expect an iterable, so you would be able to pass a tuple or frozenset as iterables. If you then enter G.vs["Bob"] it will be interpreted as the sequence ["B" , "o", "b"].

One question for clarification @szhorvat. You mention VName as something that is separate from a graph G. This is presumably intentional, you want to be able to simply mark something that can then be used in multiple graphs. But how do you see this work to get at the actual vertices to read/write vertex attributes for example?

Yes, in my original proposal this is simply a disambiguation wrapper, not a functional object. Unlike a “vertex object” (like the current Vertex), it is not tied to a graph in any way. It just communicates to the lookup functions how to interpret an object (name vs index, or unit vs list). Importantly, it does not get invalidated when the graph is changed. This means that one can continue to use the same v = VName(14) to refer to “a vertex called 14” in any graph object, or in a graph before/after a modification.

The same way as we use vertex names now. Before a vertex attribute can be manipulated, we need to get a handle to the vertex, i.e. the vertex has to be looked up. VName simply communicates to whatever lookup function is being used (e.g. g.vs) how the lookup should be done (i.e. “this is a name” vs “this is an index”).

When I say “disambiguation”, I also mean that we might choose not to make the use of such wrappers mandatory. For example, we can have a default interpretation for each kind of object, but also a way to force a particular, non-default interpretation though the wrapper. Is this a good idea? I don’t know—it could be dangerous. It could lead to people assuming a certain interpretation in their code that won’t always be valid.

For example, suppose that we interpret all iterables as lists by default. A user writes a function where a function argument is understood to refer to a single vertex. But they naively omit the wrapper, as in most interactive use the wrapper was not necessary:

def setRoot(tree, vertex):
    tree.vs[vertex]["root"] = True

Now suppose someone passes in a tree like this:

image

using setRoot(tree, (1,2)). The intention is to interpret (1,2) as a single vertex, after all the function appears to promise that the second argument is only for single vertices. But the implementation is “flawed”, so it ends up treating it as two vertices, without issuing an error.


Side note: Yes, it makes sense to have a single graph where some vertices are tuples of integers, and some are integers. Consider for example a bipartite graph of sets and elements where edges represent inclusion in a set.


My point here is just that we have to be careful. It might still work well if we do it right. As Tamás said, we need a large number of code example for each proposal. I intentionally wrote the above setRoot example assuming that iterables always represent lists of objects (except maybe for strings). What if we go the hashable/unhashable route instead? Well, a nice thing about that is that non-hashables cannot possibly be vertices at this moment. So let me rephrase your hashable/unhashable proposal Vincent:

  • An object will be interpreted as a single vertex if it could, potentially, represent a vertex in some graph.

Now it makes a bit more sense. Hashability just happens to be a requirement for objects to be usable as vertices. Perhaps this is what you were thinking of from the beginning, I just didn’t get it until now.

With the hashable/unhashable proposal, there would be no problem at all with the above setRoot example.


Finally, I think we should definitely have multiple proposals, for comparison, and we should all contribute to creating code examples for all of them, to see the consequences. I am certainly not wed to my original proposal, just trying to figure out what works within each of them.

see why I wanted to discuss this in person? :slight_smile:

@vtraag re pandas API and dictionary, I’m afraid I wasn’t very clear. The most useful piece of pandas's API in that respect, at least in my hands, is fast, vectorized dictionary lookup, e.g.:

index = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])
index[['b', 'd', 'a']] <-- fast vectorized lookup

This simple operation is much easier on the eyes than a list comprehension and it’s faster too, so it’s really useful. Having something like that for vertex/edge names would already be a great first step for this:

g = ig.Graph.Full(5)
g.vs['name'] = ['a', 'b', 'c', 'd', 'e']
g.vindex[['b', 'd', 'a']] --> [1, 3, 0]

The remaining part of the pandas API, in addition to the fast numpy-style vectorized dictionary, is the multi-indexing and the square brackets of course. In pandas, you can:

df[index]
df.loc[index]
df.iloc[index]
df.loc[:, index]
df[:, index]
df.ix[index]
df.at[index]

Now notice that all of these require square brackets although everyone on Earth would expect most of these to take round (i.e. function-call) brackets.

How does that affect us? Well, I guess it’s important for us to keep in mind that select currently uses round brackets (functional notation), similar to pandas's own DataFrame.query, but Graph.vs takes square brackets. If we want to inject a seamless dictionary lookup in those Graph.vs brackets, we should be careful not to end up like pandas i.e. with 95% of users expecting the other thing.

Re @tamas 's proposal to get rid of indices at all, aligning with networkx. pandas has a default behaviour to use integers from 0 up as an index if you don’t specify one as construction time. We could do the same: have some kind of index argument at construction time (or index_col, or index_attribute, etc. depending on the exact classmethod) that defaults to integers. But yes, having a proposal would be better than just ranting.

Great discussion, eager to see what comes of it!

Two small contributions:

  • Recently, NetworkX started a formal “enhancement proposal” process: NXEPs — NetworkX 2.6.2 documentation.

  • As for examples for indexing tabular data in Python-land, there’s also datatable, which aims to have a nicer API than pandas.

Hi,

I’ve started skimming this issue. I haven’t had time to ingest the full thing, and I’m not familiar w/ some of the context, but here are some thoughts.

  • 1 on this. I do not see any scenario in which I’d want to use the underlaying integer indices.
    Like @vtraag says below, maybe performance? But I think it makes more sense to focus on the performance of using labels. For example, iterating over labels / translating them to indices in C and not Python.

Although I like having two separate interfaces for integer indices and labels more than having one interface that “guesses”, I think not even having both concepts is even better.

Regarding passing a sequence of vertex labels vs. a label that is a sequence. I think the least surprising option is best. I agree that using the hashable vs. non hashable distinction is too complex. A couple ideas:

  1. A list indicates a sequence of vertices" since a list can never be a vertex label. Anything else should be a hashable label.
  2. Two different functions, w/ the single vertex one internally just wrapping the multi-vertex one.
  3. A sentinel wrapper like class VertexSequence(List[Hashable]): ..... This will also make IDEs/linters error in some scenarios (a good thing).

I particularly like (1). It’s easy to document (users understand what is a list and what is not) and there is no ambiguity (since a list as a label would be an error). Real typed example.

First of all, nice to see the engagement on this topic!

This is indeed essentially what I had in mind. Note that a string is hashable, so will never be interpreted as an iterable. Indeed, a set will also be seen as an iterable, because it is not hashable, while a frozenset will be interpreted as a a vertex label. However, it might still make sense to make it possible to explicitly separate identification of multiple vertices (i.e. G.vs) and single vertices (i.e. G.v), just to make it more explicit for a user.

Great, then I think we’re on the same page here!

The other indexing operation of pandas, with regards to ix, xs etc, are not necessary I think. We do have one additional notation, namely G['graph_attribute'] to store graph attributes (i.e. something associated with the graph as a whole). We might also rethink whether it makes sense to keep that notation, although I don’t see any harm immediately.

Yes, we do need to consider this, but I think the array indexing notation is actually quite natural for many of these operations.

Yes, indeed, I think some default like that would be useful. Such defaults should obviously not propagate if we take subgraphs or something, because keeping the original labels is then exactly what makes it more useful. We should also probably use the actual underlying integer representation as default labels.

Great, thanks for the pointer! That looks like a useful starting point to explore the possibilities further.

Again, thanks for the pointer! It might be interesting to explore that further as well indeed. I do see that they actually mix integer indices and non-integer labels, but it seems difficult to use integers as labels. But I’ll have to explore it further. Did you have any experience with this package that you found particularly easy or useful?

Yes, it would be nice to ensure to keep good performance while using labels!

However, I’m not sure if is possible to get rid of them entirely, indeed also for purposes of efficiency. There are situations in which there could be a certain “order” of the vertices, for which there still is some idea of underlying integer indices. For example, suppose we call G.degree(), which returns the degree for each node. If there is no concept of an underlying integer index, it would need to return the node label for each node along with the degree (indeed this is what networkx does). Just returning a list of degrees would then not be possible, because it is not clear what position would correspond to what node without having an actual integer index. If we want to use more efficient representations for such results (e.g. numpy arrays), not having integer indices at all would limit the possibilities indeed. We could of course resort to pandas Series to represent such results (i.e. essentially a numpy array with a labeled index), but I’m not sure if we want to create such a foundational dependency on pandas.

Yes, I would agree indeed.

Just so that I understand this correctly: you would propose that only a list would be interpreted as a sequence of vertices? How about other non-hashable objects such as a set or some iterator? These cannot be used as labels, but you would propose that they should also not be interpreted as sequences of vertices? Or would they be allowed as iterables? Would the distinction then not essentially come down to the distinction hashable/non-hashable: anything that is hashable is always interpreted as a node label.

Nice to see a concrete example! Just to make sure I understand this correctly: this would require all nodes to always use the same type as the label, wouldn’t it?

This is perhaps going somewhere.

pandas Series are essentially ordered fast dictionaries. Basically, that’s the compromise between no-order networkx-style and the current situation. If we don’t want to depend on pandas, we could reimplement this particular data structure ourselves in CPython, with limited accessor functions (i.e. just a vectorized get that goes sequence of objects → sequence of hashes → sequence of indices), none of the other 100 nice methods of pandas.Series.

The C/Python boundary if you want to find one, of course, would be that C never knows about Python objects, but rather knows about their string/bytestring hashes only.

set/frozenset and list/tuple: let’s not go there. Most users will mess up. Constantly. Think about when people will start using numpy arrays, or pandas.Index for this… we really ought not to silently take a stance but neither want the user to copy the whole array into a list because we only accept list or tuple and no other sequence.

Honestly, rather than distinguishing between list and tuple, which creates infinite headaches in numpy, I’d be in favour of stricter typing for function signatures: if the function needs a sequence of vertices, you can’t put a single one and we are nice and wrap it in a singleton list. You would now get an error message instead or, if you used a tuple to mean a single vertex, an error that those vertices don’t exist.

I think this is by far the least surprising approach, indeed better than only accepting a list as the iterable container.
The con is that you now need to maintain 2 methods for everything. If you want to support all combinations of sequences, labels and integer indices that’s now an exponentially growing combination of methods.
But at least the sequence vs. single item methods can be easily be done in Python (i.e. have a method that takes a single item, wraps it in a list and calls the method that takes a sequence).

It could be a union type (example). But I don’t want to derail this convo into typing Python any more than to say that whatever solution is landed upon, a good benchmark for the quality of the API may be how easily it can be understood by MyPy.