Child art process in DataLogger crashes due to a race in SharedMemoryManager
This issue is caused by missing synchronization in the ShmBuffer structure, which manages the state of a shared memory buffer between the datalogger and its child art process. The race occurs in the short window after the child art process has marked a "Full" buffer that has timed out (or is close to timing out) as "Reading", but has not yet updated "last_touch_time". The datalogger interprets the observed state as the art process having timed out while reading the buffer, and resets the buffer state to "Full" while the art process continues to read it. This triggers a "buffer not in the correct state" error, which causes the art process to disconnect from the shared memory and segfault shortly afterwards. One way to address this race condition is to combine the buffer state (the "sem" field) and last_touch_time into a single 64-bit "C" structure, and rely on the hardware atomics and cache coherence guarantees provided by Intel/AMD CPUs. The code could also be redesigned to use semaphores….
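The combined-word idea can be sketched as follows. This is an illustration only, not the artdaq_core implementation: the names ShmBufferSketch, BufferState, pack, try_claim_for_reading and reset_if_stale are hypothetical, the timestamp is assumed to fit in 32 bits, and std::atomic<std::uint64_t> is used in place of a raw "C" structure (it is assumed to be lock-free on the target platform, which is required for use in shared memory).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical buffer states, named after the states mentioned in this issue.
enum class BufferState : std::uint32_t { Empty = 0, Writing = 1, Full = 2, Reading = 3 };

// Pack the state (upper 32 bits) and the touch time (lower 32 bits) into one word.
inline std::uint64_t pack(BufferState s, std::uint32_t touch) {
    return (static_cast<std::uint64_t>(s) << 32) | touch;
}
inline BufferState state_of(std::uint64_t w) {
    return static_cast<BufferState>(static_cast<std::uint32_t>(w >> 32));
}
inline std::uint32_t touch_of(std::uint64_t w) {
    return static_cast<std::uint32_t>(w & 0xFFFFFFFFu);
}

// Hypothetical stand-in for ShmBuffer: state and touch time share one atomic word,
// so no observer can see "Reading" together with a stale touch time.
struct ShmBufferSketch {
    std::atomic<std::uint64_t> status{pack(BufferState::Empty, 0)};
};

// Reader side: move Full -> Reading and refresh the touch time in a single step.
inline bool try_claim_for_reading(ShmBufferSketch& b, std::uint32_t now) {
    std::uint64_t cur = b.status.load(std::memory_order_acquire);
    while (state_of(cur) == BufferState::Full) {
        // On failure, cur is reloaded and the loop re-checks the state.
        if (b.status.compare_exchange_weak(cur, pack(BufferState::Reading, now),
                                           std::memory_order_acq_rel,
                                           std::memory_order_acquire)) {
            return true;
        }
    }
    return false;
}

// Manager side: reset a timed-out Reading buffer to Full, but only if the word
// is still exactly the one that was judged stale.
inline bool reset_if_stale(ShmBufferSketch& b, std::uint32_t now, std::uint32_t timeout) {
    std::uint64_t cur = b.status.load(std::memory_order_acquire);
    if (state_of(cur) != BufferState::Reading) return false;
    if (now - touch_of(cur) <= timeout) return false;
    return b.status.compare_exchange_strong(cur, pack(BufferState::Full, now),
                                            std::memory_order_acq_rel,
                                            std::memory_order_acquire);
}
```

With this layout the window described above disappears in the sketch: the reader's claim and its fresh timestamp become visible together, and the manager's reset fails if the reader has changed the word since the manager inspected it.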
#1 Updated by Eric Flumerfelt about 1 year ago
- Category set to Known Issues
- Status changed from Work in progress to Closed
- Target version set to artdaq_core v3_04_02
We have fixed this for now by allowing unowned buffers to be touched at any time, and by updating the buffer's last_touch_time before setting the mode to "Reading".
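The ordering used by this fix can be illustrated with a minimal sketch. It is not the code committed to artdaq_core: BufferSketch, Mode, touch, begin_read and looks_timed_out are hypothetical names, and the owner field with -1 meaning "unowned" is an assumption made for the example.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical buffer modes, named after the states mentioned in this issue.
enum class Mode : int { Empty, Writing, Full, Reading };

// Hypothetical stand-in for the shared buffer header.
struct BufferSketch {
    std::atomic<Mode> sem{Mode::Empty};
    std::atomic<std::int16_t> owner{-1};            // assumption: -1 means unowned
    std::atomic<std::uint64_t> last_touch_time{0};
};

// An unowned buffer may be touched by anyone; an owned buffer only by its owner.
inline bool touch(BufferSketch& b, std::int16_t id, std::uint64_t now) {
    const std::int16_t current_owner = b.owner.load();
    if (current_owner != -1 && current_owner != id) return false;
    b.last_touch_time.store(now);
    return true;
}

// Reader side: refresh the touch time first, then switch Full -> Reading.
inline bool begin_read(BufferSketch& b, std::int16_t id, std::uint64_t now) {
    if (b.sem.load() != Mode::Full) return false;
    if (!touch(b, id, now)) return false;           // timestamp is fresh before the mode changes
    Mode expected = Mode::Full;
    if (!b.sem.compare_exchange_strong(expected, Mode::Reading)) return false;
    b.owner.store(id);
    return true;
}

// Manager side: a buffer looks timed out only if it is in Reading with an old touch time.
inline bool looks_timed_out(const BufferSketch& b, std::uint64_t now, std::uint64_t timeout) {
    return b.sem.load() == Mode::Reading && now - b.last_touch_time.load() > timeout;
}
```

Because the timestamp is written before the mode changes, a manager that observes "Reading" in this sketch also observes the refreshed last_touch_time, so a freshly claimed buffer is not mistaken for a timed-out one.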