Bug #20528

Child art process in DataLogger crashes due to a race in SharedMemoryManager

Added by Gennadiy Lukhanin 12 months ago. Updated 10 months ago.

Status: Closed
Priority: Normal
Assignee: -
Category: Known Issues
Target version:
Start date: 08/06/2018
Due date:
% Done: 80%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

This issue is caused by a lack of synchronization in the ShmBuffer structure, which tracks the state of a shared memory buffer between the datalogger and its child art process. The race occurs in the short window after the child art process has marked a "Full" buffer that has timed out (or is close to timing out) as "Reading", but has not yet updated "last_touch_time". The datalogger interprets this state as the art process having timed out while reading the buffer, and resets the buffer to "Full" while the art process is still reading it. This triggers a "buffer not in the correct state" error, which causes the art process to disconnect from the shared memory and segfault shortly afterwards.

One way to address this race condition is to combine the buffer state (the "sem" field) and last_touch_time into a single 64-bit C structure, and rely on the hardware atomics and cache coherence guarantees provided by Intel/AMD CPUs. The code could also be redesigned to use semaphores….
See commits:
64b024ad76ef57daec731e4d2e412280d6c4521c
f47d2d195e6acaee37f9e1c955426e9b84157819
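
The packed-word idea from the description could look roughly like the sketch below. It is not taken from the commits above. The names `BufState`, `PackedBuffer`, `try_claim_for_read` and `reset_if_stale`, the state codes, and the 32/32-bit split between state and touch time are all assumptions made for illustration. The point it shows is that when the state and the touch time live in one 64-bit atomic word, the datalogger can never observe "Reading" together with the old touch time.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical state codes; the real ShmBuffer values may differ.
enum class BufState : std::uint32_t { Empty = 0, Writing = 1, Full = 2, Reading = 3 };

// Assumption: state in the high 32 bits, touch time (seconds) in the low 32 bits.
constexpr std::uint64_t pack(BufState s, std::uint32_t touch) {
    return (static_cast<std::uint64_t>(s) << 32) | touch;
}
constexpr BufState state_of(std::uint64_t w) { return static_cast<BufState>(w >> 32); }
constexpr std::uint32_t touch_of(std::uint64_t w) { return static_cast<std::uint32_t>(w); }

// A word placed in shared memory must be lock-free to be safe across processes.
static_assert(ATOMIC_LLONG_LOCK_FREE == 2, "64-bit atomics must be lock-free");

struct PackedBuffer {
    std::atomic<std::uint64_t> word{pack(BufState::Empty, 0)};
};

// Reader side: move Full -> Reading and stamp the touch time in ONE atomic step.
inline bool try_claim_for_read(PackedBuffer& b, std::uint32_t now) {
    std::uint64_t seen = b.word.load(std::memory_order_acquire);
    while (state_of(seen) == BufState::Full) {
        if (b.word.compare_exchange_weak(seen, pack(BufState::Reading, now),
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
            return true;
        }
        // On failure 'seen' holds the current word; the loop re-checks its state.
    }
    return false;
}

// Datalogger side: reset a stale Reading buffer to Full, but only if the word
// is still exactly the one that was judged stale.
inline bool reset_if_stale(PackedBuffer& b, std::uint32_t now, std::uint32_t timeout) {
    std::uint64_t seen = b.word.load(std::memory_order_acquire);
    while (state_of(seen) == BufState::Reading && now - touch_of(seen) > timeout) {
        if (b.word.compare_exchange_weak(seen, pack(BufState::Full, now),
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
            return true;
        }
    }
    return false;
}
```

Because both sides use compare-and-swap on the same word, a reset attempt that races with a fresh claim fails and re-evaluates the new state instead of overwriting it.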

History

#1 Updated by Eric Flumerfelt 10 months ago

  • Category set to Known Issues
  • Status changed from Work in progress to Closed
  • Target version set to artdaq_core v3_04_02

We have fixed this for now by allowing unowned buffers to be touched at any time, and by updating the buffer's last_touch_time before setting the mode to "Reading".
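
The ordering described in this update can be sketched as follows. This is a minimal illustration, not the code from artdaq_core: only the field names `sem` and `last_touch_time` come from this issue, while `Mode`, `Buffer`, `begin_read` and `looks_timed_out` are hypothetical names, and the integer time unit is an assumption.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical mode codes for the "sem" field.
enum Mode : int { Empty = 0, Writing = 1, Full = 2, Reading = 3 };

struct Buffer {
    std::atomic<int> sem{Empty};
    std::atomic<std::uint64_t> last_touch_time{0};
};

// Reader: refresh the touch time FIRST (allowed although the buffer is not
// yet owned), and only then publish the Reading mode.
inline bool begin_read(Buffer& b, std::uint64_t now) {
    if (b.sem.load(std::memory_order_acquire) != Full) {
        return false;  // nothing to claim; leave the touch time alone
    }
    b.last_touch_time.store(now, std::memory_order_release);
    int expected = Full;
    return b.sem.compare_exchange_strong(expected, Reading,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire);
}

// Datalogger: a buffer counts as timed out only if it is in Reading mode and
// its touch time is older than the timeout.
inline bool looks_timed_out(const Buffer& b, std::uint64_t now, std::uint64_t timeout) {
    if (b.sem.load(std::memory_order_acquire) != Reading) {
        return false;
    }
    return now - b.last_touch_time.load(std::memory_order_acquire) > timeout;
}
```

With this ordering, any observer that sees "Reading" also sees a touch time at least as recent as the claim, so the window described in the issue no longer produces a false timeout.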


