Regarding determinism, I understand your stance. Adding a new algorithm, optimizing it, then maintaining it is probably too much of a burden if its sole purpose is to be used in a benchmark.

I wasn’t part of the team when the benchmark was designed in 2014–2016, but since then I’ve had many discussions with members of the original team. They have emphasized that one of their guiding principles when designing the benchmark was ensuring deterministic behaviour, because it makes both performance comparison and result validation easier. Determinism is considered so important that the fact that the benchmark’s algorithms are deterministic is mentioned right away in the abstract of the Graphalytics VLDB paper. The team also told me that some algorithms, such as forest fire and a spring-based graph layout (visualization) algorithm, were discarded from the benchmark because they couldn’t be made deterministic in a meaningful way.

From a benchmarking perspective, CDLP is an interesting algorithm because it collects the labels from the neighbours of a vertex and pushes them through a non-trivial aggregation function, “random mode value”, which makes for a more complicated access pattern than the one used in BFS/SSSP/PageRank. Making it deterministic turns this aggregation into “min mode value”, which is even more difficult. For reference, the matrix-based GraphBLAS implementations of BFS/SSSP/PageRank use matrix-vector multiplication; the non-deterministic CDLP would need an adjacency matrix–diagonal matrix multiplication with a row-wise reduce (but I have not yet implemented this), while the deterministic CDLP implementation additionally needs sorting; and LCC needs matrix-matrix multiplication.
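To make the tie-breaking concrete, here is a minimal sketch of the “min mode value” aggregation in Python. The function name `min_mode` is mine, not from the specification; the point is only that among the labels with the highest frequency, the smallest one is chosen deterministically.

```python
from collections import Counter

def min_mode(labels):
    """Among the labels that occur most frequently, return the
    smallest one ("min mode value") -- a deterministic tie-break."""
    counts = Counter(labels)
    max_count = max(counts.values())
    return min(l for l, c in counts.items() if c == max_count)

# Labels 1 and 2 both occur twice; the smaller one wins:
print(min_mode([3, 1, 2, 1, 2]))  # -> 1
```

A non-deterministic implementation would instead pick uniformly at random among the tied labels, which is why two runs can legitimately disagree.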

It may very well be the case that the deterministic CDLP algorithm does not make a whole lot of sense for practical graph analytics – after all, no network scientist was involved in the design of the benchmark. According to the Graphalytics VLDB paper, the algorithms in the benchmark were selected after surveying 168 “graph analysis articles published in ten representative conferences on databases, high-performance computing, and distributed systems (e.g., VLDB, SIGMOD, SC, PPoPP)”, so there’s a clear bias towards algorithms used in core systems engineering (DB/HPC, parallel programming) papers.

In the benchmark implementation efforts so far, libraries either didn’t have a CDLP algorithm or had fairly simple ones. When a library didn’t have a CDLP algorithm, we just implemented a deterministic one. For libraries with an existing CDLP/LPA algorithm, we could usually adjust it to ensure deterministic behaviour, typically by adding an if condition and a few lines of code. It seems that neither is the case for igraph, which uses a fairly sophisticated CDLP algorithm.

So… for the time being, we might opt to skip this algorithm and focus on the other algorithms instead. The Graphalytics specification is still released as “v1.0 draft” and there are no audited benchmarks yet, so, in theory, the specification can be adjusted. This would need thorough discussions with the Graphalytics team and approval from the LDBC board but it is possible. We would still need some way of validating results, e.g. the researcher/auditor running the benchmark should sample the nodes and ascertain whether the community structure is roughly as expected. Let us know if you have any suggestions for this.

PS: I found the comment about the oscillation very interesting. I have actually witnessed this when running the deterministic CDLP algorithm on small graphs; now I know why it happens.