Implementing the LDBC Graphalytics benchmark

Dear igraph forum members,

I’d like to start an implementation of the LDBC Graphalytics benchmark on top of igraph’s C API.

The typical challenges for implementing this benchmark have mostly do to with the specific algorithms prescribed by the specification. The ones that could work out of the box:

  • Connected components (WCC), single-source shortest path (SSSP) are quite unambigious in general so they are usually trivial to implement.
  • The BFS algorithm requires the “levels” of the nodes. I think this is called rank in igraph_bfs.

The ones that are more problematic:

  • The PageRank algorithm treats dangling vertices different from others, following the approach of the “Ranking the Web Frontier” paper. This seems unsupported by igraph.
  • The LCC (local clustering coefficient) algorithm, a.k.a. transitivity, has a directed variant in Graphalytics. I think this is not supported in igraph.
  • The CDLP (community detection using label propagation) algorithm uses a deterministic variant which requires the selection of the “min mode” label (i.e. the smallest value of the most common label) among the neighbours. I don’t think this is currently supported in igraph_community_label_propagation.

Additionally, loading the data from the vertex/edge files usually takes a bit of “data hammering”.

I’d like to start this project in September, preferably as a student project. Let me know if you have any comments or suggestions.

Gabor Szarnyas

2 Likes

Do you mean the level in the BFS tree, i.e. distance from root? Is that not the same as (unweighted) SSSP?

There is more than one way to generalize this to directed graphs, and we have not yet decided which variant(s) to include. If you have any thoughts on this, let us know (here or in https://github.com/igraph/igraph/issues/1218)

@vtraag, any input on this?

Yes it is the same problem. What’s the preferred way of doing this in igraph?

I will comment on the transitivity issue.

You can use igraph_shortest_paths. I would expect it to be more efficient than igraph_bfs, which retains more information about the BFS tree (I have not tested their performance!).

I don’t understand the reason for doing this. It of course is a clear way to break ties, but it is also more likely to lead to overly large communities (although that is a general problem of LPA anyway). Do you have any particular reference that this is a good idea and preferred over breaking the tied randomly?

Another consideration is that the current implementation is asynchronous, and nodes are updated in a random order. Hence, even if breaking ties would be changed, it would still not be deterministic. Of course, this can be solved by seeding the RNG. However, from your description, the objective might actually be to get identical results across implementations. That would then require identical RNG implementations, but that might be going to far.

Can you elaborate on the requirement of making it deterministic?

So the SSSP and the BFS are not run as separate benchmarks then? After all, they are identical problems.

I don’t understand the reason for doing this. It of course is a clear way to break ties, but it is also more likely to lead to overly large communities (although that is a general problem of LPA anyway). Do you have any particular reference that this is a good idea and preferred over breaking the tied randomly?

The tie-breaking rule for selecting the minimum value isn’t there to ensure better communities. Instead, its sole purpose is to ensure determinism. It is not very realistic and it also gave me some headache during the GraphBLAS implementation.

See Section 2.3.4 Community Detection using Label Propagation (CDLP) in the spec for the formal specification of the propagation rule.

So the SSSP and the BFS are not run as separate benchmarks then? After all, they are identical problems.

SSSP is only executed for data sets with edge weights. For data sets without edge weights, the SSSP algorithm is skipped.

Asynchronous updates were suggested in the preprint of the label propagation algorithm on arXiv - I don’t know whether the final publication is identical, but most likely it is. Basically, the authors argue that synchronous updates are discouraged because it may cause oscillations if the graph contains (near-)bipartite subgraphs.

Yes, this was also my understanding indeed.

From the graphylitics specs it seems that a synchronous version would also be required (in addition to the tie breaking requirement). Perhaps I don’t fully understand it, but why should a benchmark include an algorithm that in, that particular implementation, is actually never going to be used?

Well, on one hand I totally understand the need for eliminating randomness from benchmarks where possible – it easily becomes unmanageable (in terms of time / computational effort) if you need to average the runtime of an algorithm over, say, 10000 runs in order to get a reliable estimate of its real performance.

On the other hand, there are graph algorithms where randomness is inherently needed for the algorithm to work as intended. The original label propagation algorithm of Raghavan et al is one such example. In the manuscript, the authors even propose to run the algorithm several times in order to discover alternative solutions when the community structure is not clear-cut (which also implies that there is not even a single “correct” solution that one could use to validate an implementation of the algorithm). In my opinion, these algorithms are not well-suited for benchmarks because you lose the very essence of the algorithm if you eliminate non-determinism.

Label propagation was probably chosen because it scales up to larger graphs easily due to its nearly linear nature. I’m saying “nearly” because you need an extra BFS step if you want to ensure connectedness for the detected communities. And also because it is probably much easier to implement in a distributed manner over large graphs.

Yes, I agree.

Exactly! It seems a bit strange to me to require an algorithm that is only implemented for the sole purpose of the benchmark.

This is probably another reason why a synchronous implementation would be required.

Regarding determinism, I understand your stance. Adding a new algorithm, optimizing it, then maintaining it is probably too much of a burden if its sole purpose is to be used in a benchmark.

I wasn’t part of the team when the benchmark was designed in 2014-2016 but since then, I’ve had many discussions with members of the original team. They have emphasized that one of their guiding principles for designing the benchmark was ensuring deterministic behaviour because it makes both the performance comparison and the validation of results easier. This is so important that the fact that the algorithms in benchmark are deterministic is mentioned right away in the abstract of the Graphalytics VLDB paper. The team also told me that some algorithms, such as forest fire and a spring-based graph layout (visualization) algorithm were discarded from the benchmark as they couldn’t be made deterministic in a meaningful way.

From a benchmarking perspective, CDLP is an interesting algorithm because it collects the labels from the neighbours of a vertex and pushes them through a non-trivial aggregation function, “random mode value”, which makes for a more complicated access pattern than the one used in BFS/SSSP/PageRank. By making it deterministic, this aggregation function becomes “min mode value” which is even more difficult. For reference, the matrix-based GraphBLAS implementations of BFS/SSSP/PageRank use matrix-vector multiplication; the non-deterministic CDLP would need an adjacency matrix-diagonal matrix multiplication with row-wise reduce (but I have not yet implemented this), while the deterministic CDLP implementation also needs sorting; and LCC needs matrix-matrix multiplication.

It may very well be the case that deterministic CDLP algorithm does not make a whole lot of sense for practical graph analytics – after all, there was no network scientist involved in the design of the benchmark. According to the Graphalytics VLDB paper, the algorithms in the benchmarks were selected after surveying 168 “graph analysis articles published in ten representative conferences on databases, high-performance computing, and distributed systems (e.g., VLDB, SIGMOD, SC, PPoPP)”, so there’s a clear bias for algorithms that are used in core system engineering (DB/HPC, parallel programming) papers.

In the benchmark implementation efforts so far, libraries either didn’t have a CDLP algorithm or had fairly simple ones. When a library didn’t have a CDLP algorithm, we just implemented a deterministic one. For the ones with an existing CDLP/LPA algorithm, we could adjust them to ensure deterministic behaviour, typically by adding an if condition and a few lines of code. It seems that neither is the case for igraph, which uses a fairly sophisticated CDLP algorithm.

So… for the time being, we might opt to skip this algorithm and focus on the other algorithms instead. The Graphalytics specification is still released as “v1.0 draft” and there are no audited benchmarks yet, so, in theory, the specification can be adjusted. This would need thorough discussions with the Graphalytics team and approval from the LDBC board but it is possible. We would still need some way of validating results, e.g. the researcher/auditor running the benchmark should sample the nodes and ascertain whether the community structure is roughly as expected. Let us know if you have any suggestions for this.

PS: I found the comment about the oscillation very interesting. I have actually witnessed this when running the deterministic CDLP algorithm in small graphs, now I know why it happens.

@szarnyasg igraph 0.9 is expected to be released in less than 2 week. It would be nice to look into benchmarking after that. Do you have an update on this, did you find an interested student?

@szhorvat unfortunately, I haven’t found an interested student. I drafted a Python script to implement the benchmark in December. I got this far: igraph-graphalytics.py · GitHub

BFS, SSSP, and WCC work fine. The rest would require some development - CDLP: deterministic choice of label, LCC: support a directed definition, PageRank: different stopping criterion (fixed number of iterations vs. reaching a fixed point). Adjusting these seems doable but non-trivial.

Do you see any of changes regarding these algorithms in v0.9?

PS: the Graphalytics specification is now on arXiv: [2011.15028] The LDBC Graphalytics Benchmark

I’d say the first priority would be to set up a reliable benchmarking framework and methodology. Benchmarking is not easy. At the moment we all work on laptops, which are often unstable. For some reason, with igraph’s existing benchmark programs I was having a lot of difficulty getting consistent results, and I’m not sure why (see tests/benchmarks/igraph_random_walk.c). Perhaps my laptop is just usntable (thermal throttling?) But that won’t explain why this benchmark behaves worse than others …

Since you have have worked extensively with benchmarks, some advice/guidance/tips would be very welcome! I assume that benchmarking on laptops is just not the right way.

Regarding the algorithms, from the above discussion it seems in order to be able to support consistent and deterministic benchmarking, Graphalytics had to choose some algorithm variants which are not the best for practical use. Therefore the way forward is not to include them into igraph directly, but to implement them using igraph (in C of course) as a separate program. I think that they would still be a very useful benchmark for us as they exercise parts of the library.

  • CDLP — According to the discussion above, the Graphalytics version is not justified from a usefulness/practicality (as opposed to benchmarking) perspective. Thus this will need a separate implementation.

  • LCC — There are many possible generalizations of the clustering coefficient to directed graphs. We simply need more time come up with a sufficiently general approach that will allow computing multiple reasonable variants. Likely the Graphalytics version will be among these too.

    But perhaps including a directed clustering coefficient in Graphalytics is not a best choice … I expect different existing libraries won’t agree on their interpretation of this concept (and if they do, it will be because they copied it from each other, not because their variant is justified from a network analysis perspective).

  • PageRank: Graphalytics seems to use power iteration, so this too needs to be implemented separately. Power iteration is not the best or most robust way to compute PageRank (at least for graphs that fit in memory). igraph does not use it anymore (it was removed from 0.9). The old (now discarded) implementation can be used as a starting point for writing a separate benchmarking-specific version.

    Regarding the specific interpretation of PageRank and the handling of dangling nodes: after a quick glance at the Graphalytics specification, it seems to me that igraph uses the very same definition. However: We now have two implementation of PageRank: one uses PRPACK and use uses ARPACK. Until now PRPACK version did not handle sink nodes (“dangling nodes”) the same way as the ARPACK version. I fixed this a couple of weeks ago. It would be great if you could verify that both give the result you expect in the development version, or if you could give me an expected result for a small graph to compare to.

    About the paper you linked (ref [14] in the Graphalytics specs): I took a quick look, and I do not like the description there very much … while it is heavy on notation, it does not seem to be unambiguous. E.g. they talk about partitioning the graph into strongly connected components and nodes with zero out-degee. But the graph may have nodes which are neither in a (larger than size one) connected component, nor have zero out-degree—there is no comment on this case. They also describe adding an (n+1) th “virtual node” with extra connections to create an amended, connected graph. IMO this just muddies the waters and potentially confuses the reader: in the end the PageRank definition for non-connected graphs still won’t be (or should not be) the same as the classical PageRank on this amended, connected graph, because of the possibility of teleportation. I would not choose to cite this paper.

I have recently discussed microbenchmarking with the author of the Efficient Java Matrix Library. Basically, using a well-cooled desktop/workstation machine is the minimum required setup for meaningful microbenchmarks. The ideal way to go would be a bare-metal server with Xeon/EPYC CPUs.

The cited PageRank paper is indeed somewhat ambiguous. However, I haven’t been able to find a better paper to cite for the Graphalytics benchmark. (The Graphalytics VLDB paper only cited the original PageRank paper which handles dangling vertices in a crude way by removing them during computation.) Another citation I can think of is the Adaptive Methods for the Computation of PageRank paper.

I agree that having Graphalytics-specific implementations is a good approach and I’m glad to assist with these, I might be able invest some time in coding these during the spring. I have a bunch of deadline in the next two weeks but I will pick up this thread afterwards.

1 Like