Double-weighted directed graph adjacency list

I have a graph theoretical problem, which is unfortunate for me given I know nothing about graph theory. Apologies in advance for butchering the terminology.

Suppose I have a directed graph (with edges (A,B) and (B,A) both possible and distinct from each other), with vertices and edges assigned positive integer-valued weights. Each vertex's weight will always be greater than or equal to the sum of the weights of the edges connecting it to other vertices.

My question concerns how such a graph would be represented in an adjacency list. I’ve seen weighted graphs represented in adjacency lists with the edges and their weights noted, but I’ve not encountered a way to represent both edge weights and vertex weights in an adjacency list. How would I do this in igraph?

I would be very grateful for any guidance anyone could offer.

It feels like there is another question hidden behind this post, which doesn’t come through because you did not explain the motivation.

It’s not really clear what you are looking for when you say “how to represent it as an adjacency list”. igraph’s main feature is that it offers a graph data structure which can store arbitrary graphs, and assign arbitrary attributes to their vertices and edges. It can store the data that you describe.

Since you tagged this post as Python, the python-igraph tutorial shows how to set attributes: Tutorial

Fair point, there is another question hidden beyond my first one.

Given my lack of familiarity with graph theory, I’m hoping to ask the community how I might both cluster such a graph (as in, what would be the most appropriate clustering algorithm for this positive integer-valued, double-weighted directed graph: HCS or Kernel K-Means, for example), and further, how to determine the strongest path within each cluster.

I have a metric in [0,1] that quantifies the strength of each path, with values closer to 1 indicating preferred paths. I’m not sure whether I can modify an existing path search algorithm (such as Dijkstra’s) or if I have to iteratively try every path in the cluster to determine the one with the greatest score.

I know that’s a bit more involved than what I originally wrote, but I would be incredibly grateful for any guidance.

Since you say you are not familiar with graph theory, I’ll try to give you a few pointers and keywords. Dealing with real-world network datasets is the domain of network science (a useful keyword for searching). Graph theory, by contrast, refers to a branch of pure mathematics.

First, you need to formulate your problem in mathematically precise terms.

A well-understood problem is finding shortest paths. The length of a path is the sum of the lengths of the edges along it. Think about whether you can transform your weights into lengths in such a way that looking for shortest paths becomes useful. igraph has several shortest path algorithms.
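One standard transform worth knowing (an assumption about your metric, not something established in this thread): if each edge has a "strength" in (0, 1] and a path's strength is the product of its edges' strengths, then setting length = -log(strength) turns "maximize the product" into the usual "minimize the sum" that shortest-path routines solve:

```python
# Sketch of the -log weight transform: maximizing a product of
# per-edge strengths is equivalent to minimizing a sum of lengths.
import math

strengths = [0.9, 0.8, 0.95]                  # illustrative edge strengths
lengths = [-math.log(s) for s in strengths]   # all non-negative, Dijkstra-safe

path_strength = math.prod(strengths)
path_length = sum(lengths)

# The two views agree: exp(-total length) recovers the product of strengths.
assert abs(math.exp(-path_length) - path_strength) < 1e-12
```

Whether this fits depends on whether your strength metric really composes multiplicatively along a path; that is a modelling decision, not a given.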

One can also look for paths that maximize or minimize other properties. igraph 0.10 has an implementation of searching for widest paths. The width of a path is the width of the narrowest edge along it.
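To illustrate the widest-path idea without relying on any particular API, here is a library-free sketch: a max-min variant of Dijkstra's algorithm that keeps, for each node, the best bottleneck width found so far (igraph 0.10 provides this natively; this only shows the logic):

```python
# Widest path: maximize the minimum edge width along the path.
import heapq
from math import inf

def widest_path_width(adj, src, dst):
    """adj: {u: [(v, width), ...]}. Returns the best achievable
    bottleneck width from src to dst, or 0 if unreachable."""
    best = {src: inf}            # widest known bottleneck to each node
    heap = [(-inf, src)]         # max-heap via negated widths
    while heap:
        neg_w, u = heapq.heappop(heap)
        w = -neg_w
        if u == dst:
            return w
        if w < best.get(u, 0):   # stale heap entry
            continue
        for v, width in adj.get(u, []):
            cand = min(w, width)          # bottleneck of the path through u
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0

adj = {"A": [("B", 3), ("C", 7)], "C": [("B", 5)]}
# Direct A->B has width 3; the detour A->C->B has width min(7, 5) = 5.
print(widest_path_width(adj, "A", "B"))
```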

Clustering of network data is usually referred to as community detection. After looking at the relevant Wikipedia page, you can check review papers on the topic for an overview.

igraph contains several community detection algorithms. They may use “weights” in different ways. Most use the concept of modularity in some way. Thus, once again, the first step is to try to make your problem more precise and think about how the edge and vertex weights you have should affect clustering. Then look for a suitable algorithm, and a suitable way to transform your weights for that algorithm. For example: do larger weights represent stronger or weaker connections? What is the meaning of vertex weights, and how do you want them to affect clustering? Most available algorithms do not use vertex weights as their interpretation is not obvious, unlike that of edge weights.

I hope this helps a bit.

Thank you very much for your incredibly thorough and helpful response, I really appreciate it!

I’m working my way through the resources you’ve offered and am beginning to make sense of the field. However, I still think that the only way to represent my network is with both vertex and edge weights, though I’d be grateful for a push in the right direction if there’s a way to represent the problem in another, more conventional way.

Suppose I’m interested in analysing internet traffic on and between social media sites. Visits to a single social media site (like Facebook) would be represented by node weights, whereas the transit from one social media site to another (from Facebook directly to Twitter, for example) would be represented by edge weights. I consider the node weights as distinct from the edge weights since the latter signify transit from one social media site to another, while node weights are concerned solely with visits to a single specific social media site without traversing another social media site, perhaps coming directly to the site or via another, non-social-media site.

When I say I would like to determine the strongest path within this graph, I want to determine the most well-travelled path between social media sites as demonstrated by node and edge visitor counts; I want to know the behaviour of the average social media user. Both node and edge weights will be positive integers, so I had thought to measure path ‘strength’ by calculating the average number of node visits across all nodes and the average number of edge traversals across all edges, and then calculating the Poisson cumulative distribution function value for each. For example, if the average number of direct social media visits across all such sites is 100, and I find that Facebook’s visit count is 120, the path ‘strength’ for Facebook alone is 0.9773. The higher, the better for this application.
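For readers who want to reproduce the worked number above, the Poisson CDF can be computed with the standard library alone (summing the PMF in log space to stay numerically stable for large counts):

```python
# Poisson CDF P(X <= k) for rate lam, using only the stdlib.
import math

def poisson_cdf(k, lam):
    return sum(math.exp(i * math.log(lam) - lam - math.lgamma(i + 1))
               for i in range(k + 1))

# The example from the text: average 100 visits, Facebook at 120.
strength = poisson_cdf(120, 100)
print(round(strength, 4))    # close to the 0.9773 quoted above
```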

Where I think my application differs from the shortest path, widest path, and other path search optimisations discussed in the literature is that I don’t care where the path begins and ends. It may be that the strongest path, as determined by the above metric, progresses from Facebook to Twitter to TikTok in that specific order, and that preceding and subsequent nodes (such as MySpace or Google+) do not significantly contribute to the path strength as determined by the combination of their node and edge visit counts. Or, similarly, Facebook alone may so significantly dominate the entire social media landscape that any path to or from it only detracts from that strength metric, in which case the entire social media industry is basically driven by Facebook and its community alone; I consider an analogous case among cryptocurrencies where Bitcoin is essentially the prime mover of the entire market.

I know that description moves away from the mathematical precision you’ve suggested, and I agree that I need to frame the problem in precise terms, but I’m struggling to find comparable situations that already exist in the literature. Without wanting to take advantage of your generosity, I’d be very grateful for any further guidance you can offer.

Jeff, one quick comment on this:

My question concerns how such a graph would be represented in an adjacency list. I’ve seen weighted graphs represented in adjacency lists with the edges and their weights noted, but I’ve not encountered a way to represent both edge weights and vertex weights in an adjacency list. How would I do this in igraph?

An adjacency list is a representation of edges. You cannot represent node attributes in an adjacency list.

Make a separate tabular data structure (e.g., csv) for node attributes, and include the node weights there as an attribute. Make sure that every node ID that appears in the adjacency list also appears in this node list, no more and no less. Otherwise you get errors in the final step.

Read them in as tables (data frames or tibbles, for example using read.csv or the tibble equivalent). Then use graph_from_data_frame to construct the igraph from the edge list and vertex data.
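Since the thread is tagged Python, here is a library-free sketch of the same two-table layout, including the consistency check the post recommends (file contents are inlined for the example; the node and weight values are made up):

```python
# Two-table layout: an edge list with edge weights, plus a separate
# node table with node weights. Every node referenced by an edge must
# appear in the node table -- "no more and no less".
import csv, io

edges_csv = "from,to,weight\nA,B,3\nB,A,2\nB,C,4\n"
nodes_csv = "name,weight\nA,10\nB,8\nC,5\n"

edges = list(csv.DictReader(io.StringIO(edges_csv)))
nodes = {row["name"]: int(row["weight"])
         for row in csv.DictReader(io.StringIO(nodes_csv))}

edge_endpoints = {r["from"] for r in edges} | {r["to"] for r in edges}
assert edge_endpoints == set(nodes)   # catches mismatches before graph build

print(len(edges), nodes)
```

From here the two tables can be handed to the graph constructor of your library of choice (in R, `graph_from_data_frame`; python-igraph has an analogous data-frame constructor).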


For me this seems like the problem of finding the stationary distribution of a random walker on the graph. The walker starts from a node picked at random (with probabilities proportional to the direct visitor counts of the social media sites), then follows links to other social media sites, picking outbound edges proportionally to the observed transit counts from one site to another, until a certain stopping condition is met, at which point it picks another start node at random. You need to decide the condition that triggers a “reset” to a randomly picked node: it could be a uniform probability at each node (say, the walker gets bored with following links with a probability of 0.1 after arriving at a new site), it could fire only when the walker gets stuck in a sink node with no outbound edges, or it could be something more complex (like different teleportation probabilities for different social media sites). If you stick to a standard uniform teleportation probability, then the so-called personalized PageRank score of each node is what you are looking for, with a personalization vector where each entry is proportional to the direct visitor count.
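To make the mechanics above concrete, here is a stdlib-only power-iteration sketch of personalized PageRank: with probability (1 - d) teleport to a node drawn from the reset distribution (proportional to direct visit counts), otherwise follow out-edges proportionally to transit counts. All numbers are made up; in practice igraph's `personalized_pagerank()` does this for you.

```python
# Personalized PageRank by power iteration (illustrative only).
def personalized_pagerank(out_edges, reset, d=0.9, iters=200):
    nodes = list(reset)
    total = sum(reset.values())
    reset_p = {u: reset[u] / total for u in nodes}   # teleport distribution
    pr = dict(reset_p)
    for _ in range(iters):
        nxt = {u: (1 - d) * reset_p[u] for u in nodes}
        for u in nodes:
            targets = out_edges.get(u, {})
            w_sum = sum(targets.values())
            if w_sum == 0:                 # sink node: forced reset
                for v in nodes:
                    nxt[v] += d * pr[u] * reset_p[v]
            else:                          # follow out-edges by weight
                for v, w in targets.items():
                    nxt[v] += d * pr[u] * w / w_sum
        pr = nxt
    return pr

# Made-up transit counts (edges) and direct visit counts (nodes).
out_edges = {"FB": {"TW": 8, "TT": 2}, "TW": {"TT": 5}, "TT": {}}
visits = {"FB": 120, "TW": 60, "TT": 20}
pr = personalized_pagerank(out_edges, visits)
print(pr)    # the scores form a probability distribution over nodes
```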

Once you have the personalized PageRank score of each vertex, this gives you the probability of finding the random walker at a given vertex after an infinite number of steps. If you want to derive “most likely” paths from this, you can simulate lots of walks using igraph_random_walk() where the start vertex is picked according to the personalized PageRank distribution, but I think that in non-trivial graphs you are not likely to get too many identical paths.
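A stdlib sketch of that simulation idea (with made-up counts and an assumed "gets bored" stop probability of 0.1, as in the discussion above): start each walk at a node drawn from the visit-count distribution, step along out-edges proportionally to transit counts, stop at a sink or with the stop probability, and count how often each exact path occurs.

```python
# Simulating many random walks and tallying the exact paths taken.
import random
from collections import Counter

random.seed(42)
out_edges = {"FB": {"TW": 8, "TT": 2}, "TW": {"TT": 5}, "TT": {}}
visits = {"FB": 120, "TW": 60, "TT": 20}

def one_walk(stop_p=0.1):
    node = random.choices(list(visits), weights=visits.values())[0]
    path = [node]
    # continue while the current node has out-edges and we don't stop
    while out_edges.get(node) and random.random() > stop_p:
        nbrs = out_edges[node]
        node = random.choices(list(nbrs), weights=nbrs.values())[0]
        path.append(node)
    return tuple(path)

paths = Counter(one_walk() for _ in range(10_000))
print(paths.most_common(3))   # the most frequent exact paths
```

Even in this three-node toy there are several distinct paths; in a non-trivial graph the counts spread thin quickly, which is the variability concern raised above.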

I had thought it might be possible to list edges between a node and itself in the adjacency list, but if that’s not possible in igraph then it’ll be just as simple to maintain that information in a separate table. Thanks for your advice!

I really like the idea of simulating lots of walks to find the most well-travelled one, but I suspect the paths found in non-trivial graphs will be too variable, as you suggest.

I think I need to firm up my definition of the strongest path as determined by the visit counts. As discussed in another post, if I have a decent metric then I should be able to adapt something like the Floyd-Warshall algorithm to find the strongest path in the graph. Would you agree?

If so, perhaps the metric should combine the stationary distribution with the total number of visits across nodes in the path to calculate a probability-weighted total visit count, the higher the better? I would be very grateful for your thoughts on the matter.

It’s possible, but I believe that the quantity you mention does not belong on an artificial loop edge; it is a property of the node itself. For a loop edge, I think the semantics would be something like “the number of times the user travelled from one page within that social network to another page within the same social network”, which is a completely different thing.

Exactly, but this really indicates that there is no such thing as the “most well-travelled” path, as there are so many possibilities. Personally, I would consider the stationary distribution of the random walk on the nodes a more accurate representation. If you want a quantity that relates to the edges, you can assign a probability to each edge by calculating the probability of seeing the random walker traverse that edge (which can be derived from the probability of seeing the walker at the source of the edge, times the probability that the walker picks that particular edge given that it is already at the source). Then you can pick the top k edges with the highest probabilities, choosing k in a manner that yields a “backbone” of the graph.
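The edge-probability computation just described is a one-liner once the stationary distribution is in hand. A stdlib sketch, with a made-up stationary distribution `pi` and made-up transit counts:

```python
# Edge traversal probability: pi(u) times the chance of picking edge
# (u, v) out of u's weighted out-edges; the top-k edges form a backbone.
out_edges = {"FB": {"TW": 8, "TT": 2}, "TW": {"TT": 5}, "TT": {}}
pi = {"FB": 0.27, "TW": 0.33, "TT": 0.40}   # illustrative stationary probs

edge_prob = {
    (u, v): pi[u] * w / sum(nbrs.values())
    for u, nbrs in out_edges.items() if nbrs
    for v, w in nbrs.items()
}
backbone = sorted(edge_prob, key=edge_prob.get, reverse=True)[:2]  # top k=2
print(backbone)
```

Here TW→TT gets probability 0.33 and FB→TW gets 0.27 × 0.8 = 0.216, so those two edges form the k = 2 backbone.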

I think this is a wonderful idea and exactly the kind of guidance I was hoping for. I hadn’t really considered the graph to be a Markov chain, so I didn’t fully grasp the implication of your initial suggestion to determine the stationary distribution. As you say, picking the k edges with the highest probabilities will expose the backbone of the graph; it’s just a matter of determining the optimal k. Thanks very much for your help!