How to create a graph from pandas dataframes?

I have two pandas dataframes, nodes_df and edges_df. The nodes_df contains node identifiers and attributes, while the edges_df contains the source and target nodes and edge attributes (in my case, weights and edge type).

I would like to create an ig.Graph (I use import igraph as ig) from these two dataframes, but I am not sure how best to do this. I started with a blank graph, G = ig.Graph(), and then tried to add edges using G.add_edges(edges_df), but I received this error:

TypeError: iterable must return pairs of integers or strings

How should I proceed? I am using python-igraph 0.7.1 in Python 3.7.3.

Update

You can now directly construct a graph based on a pandas DataFrame using Graph.DataFrame.

Old answer

The easiest is to use Graph.DictList as follows:

G = ig.Graph.DictList(
    vertices=nodes_df.to_dict('records'),
    edges=edges_df.to_dict('records'),
    directed=True,
    vertex_name_attr='id',
    edge_foreign_keys=('source', 'target'),
)

Here, vertex_name_attr names the column of nodes_df that contains the node identifier (assumed to be id here), and edge_foreign_keys names the columns of edges_df that contain the source and target identifiers of each edge. All other columns are automatically added as node or edge attributes. For example, if nodes_df has a column group, it becomes accessible as G.vs['group'], and if edges_df has a column weight, it becomes available as G.es['weight']. This assumes that none of the relevant columns are used as the index, because to_dict('records') does not include the index. The downside of this approach is that it is somewhat slow.

A somewhat faster variant is provided by:

G = ig.Graph.TupleList(
    edges_df.values,
    directed=True,
    # weights=True cannot be combined with edge_attrs
    edge_attrs=list(edges_df.columns[2:]),
)

This assumes that the first two columns of edges_df are the source and the target, in that order. The names of these two columns are irrelevant; only their position matters. Every remaining column, including the edge weight, is listed in edge_attrs and becomes an edge attribute under its own column name, so a column named weight becomes available as G.es['weight'].

We then still have to add the node attributes separately. Graph.TupleList automatically stores the node identifiers in the vertex attribute name, which we can use to look up the matching rows of nodes_df. Assuming nodes_df is indexed by the node identifier, you can do the following:

for column in nodes_df:
    # Reorder the rows to match the vertex order, then pass a plain list
    G.vs[column] = nodes_df.loc[G.vs['name'], column].tolist()

If you still have to set the node identifier as the index, you can do so with nodes_df = nodes_df.set_index('id'), assuming the node identifier is in the column id.

If you still encounter issues, let us know.

Hey, sorry, but could you provide a reproducible example? The answer is otherwise difficult to understand.

What exactly is not reproducible about this? If you have a node dataframe with a column named id, and an edge dataframe with two columns named source and target, the code is immediately usable.
