testing umap layout with avx2

I let ChatGPT optimize two functions in src/layout/umap.c using AVX2.

It still passes the test and is ~.40s faster.


Small graph, epochs: 5000, repetitions: 40   2.84s   2.83s  0.012s
Small graph, epochs: 500, repetitions: 400   2.87s   2.87s  0.003s
Larger graph, epochs: 60, repetitions: 1     1.55s   1.55s  0.001s


Small graph, epochs: 5000, repetitions: 40   2.21s   2.21s  0s
Small graph, epochs: 500, repetitions: 400   2.25s   2.25s  0s
Larger graph, epochs: 60, repetitions: 1     1.02s   1.02s  0s

Here is the .patch file:

Thanks, although most likely it means that the code won’t compile any more on ARM as is. (At least it does not compile on an M1 Mac, with a warning saying “This header is only meant to be used on x86 and x64 architecture”).

In theory we could provide optimized vectorized versions of code hot spots like this if it truly improves performance, and hide them behind #ifdef barriers so they are not attempted to be used when the compiler does not support them. But, in order to merge this, we should first:

  1. Verify that whatever ChatGPT came up with is a correct translation of the original function into vector intrinsics. No matter how awesome ChatGPT is, it is essentially a text generation engine and is not expected to provide correct code. Our tests are not really exhaustive for UMAP either (it’s a layout algorithm so it’s hard to test formally).

  2. Verify that we cannot achieve the same speedup with compiler flags or something similar. After all, optimizations like this should eventually be done by the compiler if possible so we can keep on writing code that is easier to read and debug by humans.

All true, this was purely a test not a PR.

The small speed-up here, is only due to the use of avx2 math and the function being repeatedly called. It would be better to optimize loops, so computations can be done in batches of 4/8/32 (SSE3/AVX2/AVX512). However, that often requires a complete re-structure of functions.

On the matter of portability, I was advised (by an actual human) to use the Hardware capabilities feature of the linker. That requires a move to using shared objects for the various algorithms. However, if done correctly it’s a fairly clean solution and avoids the #ifdef mess.

The issue with optimization by the compiler is, they only use avx2 with -march=x86-64-v2 or -march=native and very few distributions offer such binaries. Using the linker allows building for generic x86/arm etc, while still offering a optimized code path on some platforms.

(In this test both benchmark_igraph_layout_umap binaries were build using -march=native. What the compiler optimized, the avx2 math speed-up was on top of that.)

You say it’s 0.4 seconds faster—but faster than what? Can you state the full result?

The benchmark output is above, fixed the formatting.

There are some safe and simple (no ChatGPT) and cross-platform refactorings that one could try here without making the code harder to maintain. I would try and benchmark these first. For example,

- (2 * a * b * pow(dsq, b - 1.)) / (1. + a * pow(dsq, b));

could be written as

(-2.0 * a * b) / (dsq * (a + pow(dsq, -b)));

The pow() is likely to be a bottleneck since it’s a function call, and it may also prevent vectorization.

One issue with platform-specific code is the handling of NaN and Inf values—I’m not sure if these are allowed or useful in the UMAP implementation. If they are, the behaviour with these should not change.

Another issue is that, as I understand, this is essentially a force-directed layout. These tend to be extremely sensitive to starting conditions, and minor changes, even like the one I suggest above, may change the end result. I don’t know if this is the case here. But if it is, ideally we should try to avoid getting different results depending on platform. Different results between versions—that’s unavoidable, Different results with the same version depending on platform—I wouldn’t really want that.

If you are going to play with this idea, I’d suggest also trying replacing pow(x,y) with exp(log(x)*y) and see if it makes a noticeable difference. Not sure this will make any difference though.