I have 3 identical processes (each running in its own Docker container) that are simultaneously calculating the shortest path between n different pairs of coordinates. I’m currently testing with only 3 units, so at most 3 paths are being calculated at once.
The code will work fine-- the units will move where they’re supposed to go, they’ll avoid impassable terrain, etc.-- until I get a Segmentation Fault. Sometimes it happens after 6 or so paths have been calculated, sometimes after a hundred. I’ve spent over 40 hours trying to debug this, and I haven’t been able to find any pattern.
The backtrace seems to show the fault occurring in threading.py-- or at the very least, I can say that every line of the backtrace that’s more recent than the call to get_shortest_path_astar() is either in threading.py or debug.py. I’m not using multithreading (which is why each of the pathfinding units is implemented as a process, not a thread), but I can’t speak for igraph, nor can I vouch for all of the other packages that get included directly or via dependencies.
My first guess is that there’s something wrong with the heuristics function that I supply to A*. It’s dead simple: it just calculates the straight-line distance between the starting coordinates and the ending coordinates. (My map consists of vertices that form a grid of boxes-- basically, graph paper. Or square tiles, where my vertices are the corners of the squares, however you want to think of it.) So it doesn’t seem like there would be much room for me to screw that up, but since I haven’t been able to find a single example of an A* heuristics function, I can’t help but think that maybe I’ve written it wrong-- I mean, functionally it seems to get the job done, but maybe it needs to be written with some thread-safety-related keywords or something-- I’m just making guesses at this point.
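In case a concrete example helps, here is a stripped-down sketch of roughly what my heuristic looks like. I’m going from memory on the callback signature (the graph plus two vertex IDs), and the "x"/"y" vertex attributes are just how my map happens to store coordinates:

```python
import math

# Rough sketch of my heuristic, not the exact code: assumes the callback
# receives the graph and two vertex IDs, and that grid coordinates are
# stored in "x"/"y" vertex attributes.
def straight_line_heuristic(graph, u, target):
    dx = graph.vs[u]["x"] - graph.vs[target]["x"]
    dy = graph.vs[u]["y"] - graph.vs[target]["y"]
    return math.hypot(dx, dy)

# Called roughly like this (signature from memory):
# path = graph.get_shortest_path_astar(start, goal, straight_line_heuristic,
#                                      weights="weight")
```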
Specific questions:

1. Based on this very general description of the behavior, does anyone have a guess as to what might be the problem?
2. Does anyone have an example of a heuristics function that they would be willing to share? I don’t really care about the mathematics of how the return value is calculated; I just want to make sure I’m not missing something fundamental about the syntax or side effects.
3. Does anyone have any advice for how I might make progress debugging this? I’m relatively new to Python, but I’ve been programming for 30+ years, and I’m not a total idiot, so I’d like to think I’ve covered the obvious options; but at this point I’d rather someone assume I have the intelligence of a small woodland animal and make even super-obvious suggestions than be left completely to my own devices for much longer.
When something goes wrong, the first step is always to try the latest version (0.11.6). Is the problem still present with that?
Regarding the heuristics function: If your nodes are in physical space, using the straight-line distance is fine. But be sure to also provide edge weights that are at least as large as the straight-line distances between the edge’s endpoints. One of the requirements on the heuristic is that it should never overestimate the exact distance.
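As a quick illustration of that relationship (the 3x3 grid and the "x"/"y" coordinate attributes here are made up for the example): if every edge weight is at least the straight-line distance between its endpoints, a Euclidean heuristic can never overestimate the remaining distance.

```python
import math
import igraph as ig

# Toy 3x3 grid for illustration; coordinates kept in "x"/"y" vertex attributes.
g = ig.Graph.Lattice([3, 3], circular=False)
for i, v in enumerate(g.vs):
    v["x"], v["y"] = i % 3, i // 3

# Make each edge's weight equal to the straight-line distance between its
# endpoints; with such weights, a Euclidean heuristic never overestimates
# the true remaining distance, which is what A* requires.
for e in g.es:
    s, t = g.vs[e.source], g.vs[e.target]
    e["weight"] = math.hypot(s["x"] - t["x"], s["y"] - t["y"])
```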
If the heuristics are wrong, there still shouldn’t be a crash, but to be honest, we haven’t really tested this well. If the crash is due to wrong heuristics (very unlikely!), that should be fixed.
Personally I have no threading experience in Python (or much Python experience at all), but I’ll ask someone to have a look. Do you see the crash only with threading, and is it somewhat random even for the same input?
Since you were so kind as to reply with such alacrity, I want to give you a quick but not-yet-conclusive update:
Updating to 0.11.6 seems to have fixed the issue. The reason I hadn’t already done that is that I was under the (apparently mistaken) impression that apt install would always grab the latest version, but apt install still grabs 0.11.4.
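(In case it helps anyone else hitting the same mismatch, a quick sanity check I can run inside each container to confirm which build is actually being loaded:)

```python
import igraph

# Print the python-igraph version this container actually loads; it should
# report 0.11.6 after the upgrade rather than the 0.11.4 that apt installs.
print(igraph.__version__)
```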
I’ll update again when I can confirm that this solves the issue for sure, but again, since you replied so quickly, I wanted to give you an equally quick update.
Ah, it looks like my first test was a fluke. The next two tests faulted with what appears to be an almost identical backtrace, but the message has changed from “Segmentation Fault” to “Bus Error”. A Bus Error gives me some new things to look for (memory leaks, not enough disk space, maybe some others), so perhaps this is progress of a sort.
EDIT: And since I didn’t stop and restart my process in between test runs, either a resource leak or a lack of disk space seems reasonable, as either would get increasingly worse over the lifetime of the process.
So, in answer to your question: if you would be so kind as to compile a binary for me (I’m running on Ubuntu 24.04 LTS, with an Intel CPU), I would really appreciate it. I will get set up to build the binary myself as soon as I can, so that no one needs to do this for me again in the future, but it will probably be the weekend before I can do that, and my work life would dramatically improve if I were able to get past this roadblock before then.
Also, I’m about to respond to this thread with a separate message giving (I hope) much more useful information about when the issue occurs-- just an FYI.
Thanks again, I’m honestly overwhelmed at the speed and usefulness of szhorvat’s replies and yours. It’s humbling.
After a lot more experimentation/investigation, I believe that there is most likely a timing / race condition involved. Here’s a summary of what I’ve learned:
If I replace my (very, very simple) heuristics function with one that just returns a constant value, I am not able to replicate the segfault. HOWEVER, if I am correct about this being a timing issue, it’s very possible that this is because I can do so few tests (they’re manual and take a while), and that what’s really happened is that the window for when the timing issue occurs is greatly reduced, but not eliminated, by the simplified heuristic.
I rewrote my heuristic function so that it does not rely on any variables other than those passed in: I hardcoded numbers that had previously been variables stored on the Graph object. The heuristic function now does not use the passed-in Graph object at all and is purely a mathematical formula over the start and destination coordinates passed in as parameters. The segfault still reproduces just as reliably as it did when the function used variables that were part of the Graph object that the A* algorithm passes in to the heuristic function.
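To make this concrete, the rewritten function is shaped roughly like the sketch below. The grid width and the way I derive coordinates from the vertex IDs are stand-ins for the numbers I actually hardcoded; the point is just that nothing is read from the Graph object or any other shared state:

```python
import math

# Stand-in for the rewritten heuristic: the grid width is hardcoded (it used
# to be read from the Graph object), and the result depends only on the
# arguments. The graph parameter is accepted but never touched.
GRID_WIDTH = 100  # made-up value for illustration

def pure_heuristic(graph, u, target):
    ux, uy = u % GRID_WIDTH, u // GRID_WIDTH
    tx, ty = target % GRID_WIDTH, target // GRID_WIDTH
    return math.hypot(ux - tx, uy - ty)
```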
I added a print statement after every line (all 4 of them) in my heuristic function, mainly just to see if I could identify the exact line it made it to, but when I did so, I was no longer able to cause the segfault. The same caveat applies: because it takes me so long to reproduce the issue, this isn’t completely conclusive, and the prints may simply have made the fault much less likely to trigger. In the past, when I have seen print or logging statements make buggy behavior go away, it has always been due to timing issues like race conditions, accessing memory in a manner that isn’t thread safe, or similar.
My application does not use multithreading, but the process that uses igraph can be called into via RPC by more than one other process (there are a bunch of celery workers grabbing work from a queue, each unit on the board needs pathfinding done, and each unit will usually have a different celery worker processing it). However, I don’t see evidence that being called into simultaneously is the problem; it seems like several consecutive calls from the same celery worker wind up leading to the segfault. Still, it can get pretty confusing with multiple processes running, so it’s possible that I’m missing something.
I’m 99% sure this is nothing, but for the sake of completeness, I’ll mention it: I have n (currently 4) separate processes, each in its own Docker container, all running my mapping/pathfinding code. So they each create their own Graph object (reading from the same data file, since it’s the same map). Units on the board then make an RPC call that’s handled by haproxy and sent to one of my pathfinding processes. Any given unit may get any given pathfinding process on any given turn-- within a turn, the unit has a 1:1 relationship with a pathfinding process, but it could get a different process the next turn. Since these are separate processes, and are even running in separate Docker containers, I assume that they each load their own copy of the igraph C library, and that each loaded C library has its own memory, etc. But if I’m just fundamentally confused about how Linux handles processes and containers, I guess there would be a non-zero chance that they’re all sharing the same in-memory instance of the C library, which would make something like a race condition much more likely than it currently seems to me to be, given that I’m not using multiple threads. Again, I’m 99% sure that there’s no cross-process sharing of the library going on, but I’m explaining the architecture in case I’m wrong.
If you point me to the source code fix, I’ll build it myself; I’m pretty much stuck at this point, so figuring out how to build the C library (which, if things go smoothly, I assume won’t take much time) is at the top of my priority list. Thanks!