Implementing the LDBC Graphalytics benchmark

Dear igraph forum members,

I’d like to start an implementation of the LDBC Graphalytics benchmark on top of igraph’s C API.

The typical challenges for implementing this benchmark have mostly do to with the specific algorithms prescribed by the specification. The ones that could work out of the box:

  • Connected components (WCC), single-source shortest path (SSSP) are quite unambigious in general so they are usually trivial to implement.
  • The BFS algorithm requires the “levels” of the nodes. I think this is called rank in igraph_bfs.

The ones that are more problematic:

  • The PageRank algorithm treats dangling vertices different from others, following the approach of the “Ranking the Web Frontier” paper. This seems unsupported by igraph.
  • The LCC (local clustering coefficient) algorithm, a.k.a. transitivity, has a directed variant in Graphalytics. I think this is not supported in igraph.
  • The CDLP (community detection using label propagation) algorithm uses a deterministic variant which requires the selection of the “min mode” label (i.e. the smallest value of the most common label) among the neighbours. I don’t think this is currently supported in igraph_community_label_propagation.

Additionally, loading the data from the vertex/edge files usually takes a bit of “data hammering”.

I’d like to start this project in September, preferably as a student project. Let me know if you have any comments or suggestions.

Gabor Szarnyas

2 Likes

Do you mean the level in the BFS tree, i.e. distance from root? Is that not the same as (unweighted) SSSP?

There is more than one way to generalize this to directed graphs, and we have not yet decided which variant(s) to include. If you have any thoughts on this, let us know (here or in https://github.com/igraph/igraph/issues/1218)

@vtraag, any input on this?

Yes it is the same problem. What’s the preferred way of doing this in igraph?

I will comment on the transitivity issue.

You can use igraph_shortest_paths. I would expect it to be more efficient than igraph_bfs, which retains more information about the BFS tree (I have not tested their performance!).

I don’t understand the reason for doing this. It of course is a clear way to break ties, but it is also more likely to lead to overly large communities (although that is a general problem of LPA anyway). Do you have any particular reference that this is a good idea and preferred over breaking the tied randomly?

Another consideration is that the current implementation is asynchronous, and nodes are updated in a random order. Hence, even if breaking ties would be changed, it would still not be deterministic. Of course, this can be solved by seeding the RNG. However, from your description, the objective might actually be to get identical results across implementations. That would then require identical RNG implementations, but that might be going to far.

Can you elaborate on the requirement of making it deterministic?

So the SSSP and the BFS are not run as separate benchmarks then? After all, they are identical problems.

I don’t understand the reason for doing this. It of course is a clear way to break ties, but it is also more likely to lead to overly large communities (although that is a general problem of LPA anyway). Do you have any particular reference that this is a good idea and preferred over breaking the tied randomly?

The tie-breaking rule for selecting the minimum value isn’t there to ensure better communities. Instead, its sole purpose is to ensure determinism. It is not very realistic and it also gave me some headache during the GraphBLAS implementation.

See Section 2.3.4 Community Detection using Label Propagation (CDLP) in the spec for the formal specification of the propagation rule.

So the SSSP and the BFS are not run as separate benchmarks then? After all, they are identical problems.

SSSP is only executed for data sets with edge weights. For data sets without edge weights, the SSSP algorithm is skipped.

Asynchronous updates were suggested in the preprint of the label propagation algorithm on arXiv - I don’t know whether the final publication is identical, but most likely it is. Basically, the authors argue that synchronous updates are discouraged because it may cause oscillations if the graph contains (near-)bipartite subgraphs.

Yes, this was also my understanding indeed.

From the graphylitics specs it seems that a synchronous version would also be required (in addition to the tie breaking requirement). Perhaps I don’t fully understand it, but why should a benchmark include an algorithm that in, that particular implementation, is actually never going to be used?

Well, on one hand I totally understand the need for eliminating randomness from benchmarks where possible – it easily becomes unmanageable (in terms of time / computational effort) if you need to average the runtime of an algorithm over, say, 10000 runs in order to get a reliable estimate of its real performance.

On the other hand, there are graph algorithms where randomness is inherently needed for the algorithm to work as intended. The original label propagation algorithm of Raghavan et al is one such example. In the manuscript, the authors even propose to run the algorithm several times in order to discover alternative solutions when the community structure is not clear-cut (which also implies that there is not even a single “correct” solution that one could use to validate an implementation of the algorithm). In my opinion, these algorithms are not well-suited for benchmarks because you lose the very essence of the algorithm if you eliminate non-determinism.

Label propagation was probably chosen because it scales up to larger graphs easily due to its nearly linear nature. I’m saying “nearly” because you need an extra BFS step if you want to ensure connectedness for the detected communities. And also because it is probably much easier to implement in a distributed manner over large graphs.

Yes, I agree.

Exactly! It seems a bit strange to me to require an algorithm that is only implemented for the sole purpose of the benchmark.

This is probably another reason why a synchronous implementation would be required.

Regarding determinism, I understand your stance. Adding a new algorithm, optimizing it, then maintaining it is probably too much of a burden if its sole purpose is to be used in a benchmark.

I wasn’t part of the team when the benchmark was designed in 2014-2016 but since then, I’ve had many discussions with members of the original team. They have emphasized that one of their guiding principles for designing the benchmark was ensuring deterministic behaviour because it makes both the performance comparison and the validation of results easier. This is so important that the fact that the algorithms in benchmark are deterministic is mentioned right away in the abstract of the Graphalytics VLDB paper. The team also told me that some algorithms, such as forest fire and a spring-based graph layout (visualization) algorithm were discarded from the benchmark as they couldn’t be made deterministic in a meaningful way.

From a benchmarking perspective, CDLP is an interesting algorithm because it collects the labels from the neighbours of a vertex and pushes them through a non-trivial aggregation function, “random mode value”, which makes for a more complicated access pattern than the one used in BFS/SSSP/PageRank. By making it deterministic, this aggregation function becomes “min mode value” which is even more difficult. For reference, the matrix-based GraphBLAS implementations of BFS/SSSP/PageRank use matrix-vector multiplication; the non-deterministic CDLP would need an adjacency matrix-diagonal matrix multiplication with row-wise reduce (but I have not yet implemented this), while the deterministic CDLP implementation also needs sorting; and LCC needs matrix-matrix multiplication.

It may very well be the case that deterministic CDLP algorithm does not make a whole lot of sense for practical graph analytics – after all, there was no network scientist involved in the design of the benchmark. According to the Graphalytics VLDB paper, the algorithms in the benchmarks were selected after surveying 168 “graph analysis articles published in ten representative conferences on databases, high-performance computing, and distributed systems (e.g., VLDB, SIGMOD, SC, PPoPP)”, so there’s a clear bias for algorithms that are used in core system engineering (DB/HPC, parallel programming) papers.

In the benchmark implementation efforts so far, libraries either didn’t have a CDLP algorithm or had fairly simple ones. When a library didn’t have a CDLP algorithm, we just implemented a deterministic one. For the ones with an existing CDLP/LPA algorithm, we could adjust them to ensure deterministic behaviour, typically by adding an if condition and a few lines of code. It seems that neither is the case for igraph, which uses a fairly sophisticated CDLP algorithm.

So… for the time being, we might opt to skip this algorithm and focus on the other algorithms instead. The Graphalytics specification is still released as “v1.0 draft” and there are no audited benchmarks yet, so, in theory, the specification can be adjusted. This would need thorough discussions with the Graphalytics team and approval from the LDBC board but it is possible. We would still need some way of validating results, e.g. the researcher/auditor running the benchmark should sample the nodes and ascertain whether the community structure is roughly as expected. Let us know if you have any suggestions for this.

PS: I found the comment about the oscillation very interesting. I have actually witnessed this when running the deterministic CDLP algorithm in small graphs, now I know why it happens.