Consistency of RNG_BEGIN and RNG_END

,

Should RNG_BEGIN and RNG_END be put around every function call that call other functions that use randomness?

So if A calls B and B calls C and C calls RNG_INTEGER, should A have

RNG_BEGIN();
B();
RNG_END();

and B have

RNG_BEGIN();
C();
RNG_END();

and then C

RNG_BEGIN();
RND_INTEGER(0, n);
RNG_END();

?

That seems what happens in practice sometimes, but at least for the non-R definitions of RNG_BEGIN and RNG_END this doesn’t seem to be necessary.

Good question! It actually begs the question if the nesting is doing things properly.

That is, if we effectively have two calls to RNG_BEGIN() after each other, it seems to imply that the seed is being read multiple time (which seems to be what GetRNGState does). This might effectively mean that the RNG is always reset to the same seed whenever RNG_BEGIN() is being called. The resulting state is only being written back by PutRNGState, which is done by RNG_END(). If this interpretation is correct, having these nested calls then effectively reduce the “randomness” substantially. I am not sure if this interpretation is correct though, perhaps @Gabor can weigh in indeed.

Perhaps it would be better if the GetRNGstate() and PutRNGstate() would actually only be present in the R-C interface glue code. This is also what seems to be done by Rcpp, if I understand correctly.

Here are the docs for these functions, but I don’t fully understand what they do and why they are needed:

https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Random-numbers

I don’t think that’s what it does (if it did, things would be broken), see here:

https://stat.ethz.ch/pipermail/r-help/2004-March/046967.html

But I don’t understand the purpose of these R functions.

I don’t believe this mail addresses my question. The approaches therein are indeed equivalent, if you interpret each GetRNGstate and PutRNGstate to effectuate a reading / setting of a seed. However, if you do this:

GetRNGstate();
for(i=1; i < 1000; i++){
  x[i]=norm_rand();
}
// This below possibly resets the seed
GetRNGstate();
for(i=1; i < 1000; i++){
  y[i]=norm_rand();
}
PutRNGstate();

You would obtain twice the same random numbers, and x == y, if my interpretation is correct.

The purpose of GetRNGstate and PutRNGstate seems to be to ensure that every call to the R random number generator is consistent. Why that state is not simply being kept internally I don’t really understand either.

I might be wrong, but I think the purpose of GetRNGState() and PutRNGState() is the following:

The RNG is implemented in C, and thus its state is kept in a memory location that is easily accessible from C. But the RNG state is supposed to represented by the .Random.seed (misleading name—it’s not a seed but the state) variable in R. These functions simply ensure that the “C variable” and the “R variable” representing the state are always in sync.

Do you agree @vtraag?

Yes, this is what I meant. So, if the .Random.seed is always read upon GetRNGstate, and sets the current state of the RNG in C, it effectively always resets the state.

The consequences are that for as long as it’s guaranteed that RNG_BEGIN() and RNG_END() always come in a pair (i.e. RNG_END() is guaranteed to execute whenever RNG_BEGIN() does, unless there is an error) we are fine.

Then the guideline for using these would be:

Use RNG_BEGIN and RNG_END only in functions that directly access the random number generator, i.e. call igraph_rng_ functions (possibly through macros).

In this example the answer would be to only use RNG_BEGIN and RNG_END in function C (although using them in A and B won’t cause problems, except for some wasted CPU cycles).

Yes, I agree. However, I am not sure if this is always guaranteed throughout the code base. For example, if you would do something like this:

RNG_BEGIN();

x = RNG_INTEGER(0, 100);

igraph_vector_shuffle(&v);

RNG_END();

the random state will get reset in igraph_vector_shuffle because it calls RNG_BEGIN. Hence, this “reduces the randomness” in that sense.

1 Like

All in all, I think the safest option is to always only get the random state for R on the entry point in C, which is done in the R-C glue code in the R interface, and again save the random state upon return in R.

So then all of this RNG_BEGIN and RNG_END is mostly useless leftover then?

Well, yes, if we move all those calls to the R interface. However, before drawing that conclusion, perhaps we should wait for @Gabor to weigh in. Perhaps there are some other difficulties that we may have not considered yet.

Of course we do need to initialize the default RNG somehow, but that is a separate question, and might be done in a different way.

I have checked this with some small C code:

#include <R.h>

void rng_state(double* x1, double* x2)
{
    GetRNGstate();
    *x1 = unif_rand();
    GetRNGstate();
    *x2 = unif_rand();
    PutRNGstate();
}

You can compile this using R CMD SHLIB test_rng.c. Then fire up R and calling this function yields:

> dyn.load("test_rng.so")
> .C("rng_state", x1=0.0, x2=0.0)
$x1
[1] 0.1239771

$x2
[1] 0.7680124

> .C("rng_state", x1=0.0, x2=0.0)
$x1
[1] 0.46593

$x2
[1] 0.46593

> .C("rng_state", x1=0.0, x2=0.0)
$x1
[1] 0.9483618

$x2
[1] 0.9483618

> .C("rng_state", x1=0.0, x2=0.0)
$x1
[1] 0.6552871

$x2
[1] 0.6552871

As you can see, on the first call, the state is set for the first time, so the two results are not identical. However, any subsequent calls actually do lead to identical results. This shows that there might be potentials problems with the RNG_BEGIN and RNG_END calls for the R interface that “reduce the randomness”.

In general, I think we shouldn’t try to match the semantics of igraph’s RNG_BEGIN() and RNG_END() to what R expects because there are other higher-level interfaces to igraph, which might need different semantics. Or they might not need it at all. @szhorvat, do you use these functions in the Mathematica interface? The Python interface does not make any use of them.

Anyway, I think the expected behaviour of these functions (I mean, what an ordinary user of the library would expect) should allow nested calls. There should be a counter that starts from zero and is increased with every call to RNG_BEGIN() and decreased with every call to RNG_END(). R’s GetRNGState() should be called when the counter goes from 0 to 1, and PutRNGState() should be called when the counter goes from 1 to 0. igraph should abort() when RNG_END() is called while the counter is zero as this indicates incorrect nesting. It is an open question whether the counter should be thread-local (maybe R’s RNG is not thread-safe?) or not.

No, I don’t use them.

I think the general question is why we have them at all. They mostly seem to serve the R interface, and arguably, dealing with this should be done inside the R interface if possible, not in the C library. The only other usecase is to initialize the random number generator, but there might be other ways of dealing with this. Frankly, if we could simply get rid of the use of RNG_BEGIN and RNG_END all together, this would be my preference.

One use-case that I could think of is lazy auto-seeding of the RNG from the current time, which can be done in RNG_BEGIN(). Actually, this also seems to be a side effect of GetRNGstate() in R, judging from the source code – note that GetRNGstate() may call Randomize(), which sets the random seed. This can also explain why Vincent saw that the first call of his test code produced different values for x1 and x2 but not subsequent ones – the first call to GetRNGstate() that copied the RNG state from R-world to C-world copied an unseeded RNG so GetRNGstate() seeded the RNG, then the next call to GetRNGstate() copied the RNG state again, making the RNG unseeded, and then it re-seeded the RNG again. From the second call onwards, the base state was seeded so the results were identical.

If you want to lift RNG_BEGIN() and RNG_END() to the R-to-C glue code, then the glue code must either know whether the igraph function being called will use the RNG and prepare the RNG state accordingly, or must assume that all igraph functions can potentially use the RNG from R, and call GetRNGstate() or PutRNGstate() unconditionally. Even when doing so, we cannot prevent nested calls to RNG_BEGIN() and RNG_END(): think about an igraph function that takes a callback function written in R – the callback function may call back to igraph via the glue code again, leading to a potential nesting.

So, my opinion so far, based on my limited understanding is that:

  • it seems hard to delegate the responsibility of knowing when the RNG will be called (and thus the responsibiity of calling GetRNGstate() and PutRNGstate() to the R-to-C glue code.

  • even if we did so, we cannot prevent nested calls so the counter-based approach that I outlined is probably still necessary

Yes, this is exactly what is the only other use case that is currently present. However, the lazy auto-seeding does not need to be in there. To have these RNG_BEGIN (and unnecessary RNG_END) calls, just so that a user does not have to seed the RNG seems a bit overkill to me. It just requires a user to specify a single call to seed it. For higher level interfaces this can always be done (or does not need to be done), so most users will not even know the difference.

Yes, all functions will simply have an RNG_BEGIN and RNG_END in this scenario. Callback functions need to be looked at carefully indeed, because they should not have this in there. However, most callback functions will not be entry points into the C from R, so I doubt that this will become a problem. This needs to be double checked though.

The solution that you propose does solve the problem indeed, regardless of whether the nesting is correct or not. Nevertheless, nested calls in themselves are in principle not a problem, it is the fact that there may be something happening prior to the nested call.

Let me check the R interface to see if there are such potential problems with callbacks that cannot be solved.