testing umap layout with avx2

leuc · 27 January 2023 07:49

I let ChatGPT optimize two functions in src/layout/umap.c using AVX2.

It still passes the test and is ~.40s faster.

before

Small graph, epochs: 5000, repetitions: 40   2.84s   2.83s  0.012s
Small graph, epochs: 500, repetitions: 400   2.87s   2.87s  0.003s
Larger graph, epochs: 60, repetitions: 1     1.55s   1.55s  0.001s

after

Small graph, epochs: 5000, repetitions: 40   2.21s   2.21s  0s
Small graph, epochs: 500, repetitions: 400   2.25s   2.25s  0s
Larger graph, epochs: 60, repetitions: 1     1.02s   1.02s  0s

Here is the .patch file:

gist.github.com

https://gist.github.com/leuc/20a43793d9200b5278ea4fa5b4314a85

umap.avx2.patch

 src/layout/umap.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/src/layout/umap.c b/src/layout/umap.c
index d1f334313..a9e40d7d7 100644
--- a/src/layout/umap.c
+++ b/src/layout/umap.c
@@ -28,6 +28,7 @@

 #include "layout/layout_internal.h"

This file has been truncated. show original

tamas · 27 January 2023 11:07

Thanks, although most likely it means that the code won’t compile any more on ARM as is. (At least it does not compile on an M1 Mac, with a warning saying “This header is only meant to be used on x86 and x64 architecture”).

In theory we could provide optimized vectorized versions of code hot spots like this if it truly improves performance, and hide them behind #ifdef barriers so they are not attempted to be used when the compiler does not support them. But, in order to merge this, we should first:

Verify that whatever ChatGPT came up with is a correct translation of the original function into vector intrinsics. No matter how awesome ChatGPT is, it is essentially a text generation engine and is not expected to provide correct code. Our tests are not really exhaustive for UMAP either (it’s a layout algorithm so it’s hard to test formally).
Verify that we cannot achieve the same speedup with compiler flags or something similar. After all, optimizations like this should eventually be done by the compiler if possible so we can keep on writing code that is easier to read and debug by humans.

leuc · 27 January 2023 12:46

All true, this was purely a test not a PR.

The small speed-up here, is only due to the use of avx2 math and the function being repeatedly called. It would be better to optimize loops, so computations can be done in batches of 4/8/32 (SSE3/AVX2/AVX512). However, that often requires a complete re-structure of functions.

On the matter of portability, I was advised (by an actual human) to use the Hardware capabilities feature of the linker. That requires a move to using shared objects for the various algorithms. However, if done correctly it’s a fairly clean solution and avoids the #ifdef mess.

The issue with optimization by the compiler is, they only use avx2 with -march=x86-64-v2 or -march=native and very few distributions offer such binaries. Using the linker allows building for generic x86/arm etc, while still offering a optimized code path on some platforms.

(In this test both benchmark_igraph_layout_umap binaries were build using -march=native. What the compiler optimized, the avx2 math speed-up was on top of that.)

szhorvat · 27 January 2023 19:22

You say it’s 0.4 seconds faster—but faster than what? Can you state the full result?

leuc · 27 January 2023 20:53

The benchmark output is above, fixed the formatting.

szhorvat · 27 January 2023 21:31

There are some safe and simple (no ChatGPT) and cross-platform refactorings that one could try here without making the code harder to maintain. I would try and benchmark these first. For example,

- (2 * a * b * pow(dsq, b - 1.)) / (1. + a * pow(dsq, b));

could be written as

(-2.0 * a * b) / (dsq * (a + pow(dsq, -b)));

The pow() is likely to be a bottleneck since it’s a function call, and it may also prevent vectorization.

One issue with platform-specific code is the handling of NaN and Inf values—I’m not sure if these are allowed or useful in the UMAP implementation. If they are, the behaviour with these should not change.

Another issue is that, as I understand, this is essentially a force-directed layout. These tend to be extremely sensitive to starting conditions, and minor changes, even like the one I suggest above, may change the end result. I don’t know if this is the case here. But if it is, ideally we should try to avoid getting different results depending on platform. Different results between versions—that’s unavoidable, Different results with the same version depending on platform—I wouldn’t really want that.

szhorvat · 27 January 2023 21:38

If you are going to play with this idea, I’d suggest also trying replacing pow(x,y) with exp(log(x)*y) and see if it makes a noticeable difference. Not sure this will make any difference though.

Topic		Replies	Views
Investigate using link-time optimization for official binaries Development	5	504	6 February 2020
The unit tests thread Development C	56	1908	26 April 2021
Infomap / MapEquation performance / runtime improvement for large directed graphs Usage R	1	262	31 August 2022
Implementing the LDBC Graphalytics benchmark Usage C	14	1166	6 February 2021
Update for matlab-igraph toolbox Development	12	155	11 April 2025

testing umap layout with avx2

Related topics