Review Request #6196
NovaRunControl Code Review
The NovaRunControl package exhibits some of the more persistent
problems in the NOvA DAQ Software suite. Crashes of the rcServer
program have resisted multiple fixes that initially appeared promising.
The purpose of this review is principally to examine the NovaRunControl code
for characteristics that may be responsible for these crashes and the
difficulty in eliminating them despite more than an FTE-month of effort spent
trying to do so. Feedback on other coding issues would also be welcome.
Please communicate questions, answers, comments, and findings via this redmine issue.
#2 Updated by Kurt Biery over 6 years ago
The source code can be viewed via LXR here: http://cdcvs.fnal.gov/lxr/nova/source/Online/pkgs/NovaRunControl/
#3 Updated by Kurt Biery over 6 years ago
- The RunControl server crashes at random times during the run at FarDet. (There is a theory that the times are not random, but are related to the switching from one subrun to another.)
- The RunControl server crashes at various points in the lifecycle of the DAQ. This happens more often after an initial run has been started and ended and the system is being reconfigured and/or a subsequent run started.
- At various times in the lifecycle of RunControl, the communication between the GUI and the server appears to hang. Typing a command (or gibberish) in the ExecuteCommand box in the GUI and submitting that command restores the communication between the two. This happens at PrepareConfiguration, LoadConnectionConfiguration, MakeConnections, ConfigureRun, etc. transitions.
- When running from the novadaq account on novatest01, if there is a stale partition defined in the ResourceManager, removing that stale partition from the ResMgr causes the RunControl server to crash or exit.
#4 Updated by Kurt Biery over 6 years ago
- ssh email@example.com (Marc and John have been added to the .k5login of the novadaq account)
'setup_online -z 3'
- to view the code, look in $SRT_PRIVATE_CONTEXT/NovaRunControl/cxx/include and cxx/src.
- to start the necessary processes to run the system:
'ospl start; startDAQApplicationManager.sh; startRunControl.sh'
- from the RunControl GUI:
Select Resources (keep the existing choices)
- to stop the main processes after running the system:
'stopRunControl.sh; stopDAQApplicationManager.sh; ospl stop'
- to set up Allinea tools:
'setup allinea v4_02_00'
#5 Updated by Kurt Biery over 6 years ago
Here is information on the RunControl state machine as of April 2012: https://cdcvs.fnal.gov/redmine/projects/nova-runcontrol/wiki/Run_Control_State_Machine
Additional information is available from the NOvA Run Control Wiki: https://cdcvs.fnal.gov/redmine/projects/nova-runcontrol/wiki
The Redmine repository view is available from here: https://cdcvs.fnal.gov/redmine/projects/nova-runcontrol/repository
#6 Updated by Kurt Biery over 6 years ago
- The primary class for the RunControl server is RCServer: https://cdcvs.fnal.gov/redmine/projects/nova-runcontrol/repository/changes/cxx/src/RCServer.cpp
- The primary class for the RC client is RCClientGUI: https://cdcvs.fnal.gov/redmine/projects/nova-runcontrol/repository/changes/cxx/src/GUI/RCClientGUI.cpp
The RunControl server talks to the majority of the processes in the DAQ system using RMS/DDS (Responsive Messaging System/Data Delivery Service). RMS is a set of wrapper classes around the third-party implementation of DDS that we use from PrismTech. Basically, users of RMS create RmsSenders and RmsReceivers (specifying a message type and a target destination, or recipient) to send and receive messages. We probably won't need to spend too much time talking about RMS/DDS; however, there is one important aspect to keep in mind. The RMS libraries allow messages to be received asynchronously. That means that the callbacks invoked when RMS/DDS messages are received generally run in threads outside of the Qt event processing thread.
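To make the threading concern concrete, here is a minimal sketch of one common way to keep receiver-thread callbacks from touching shared state directly: each callback only posts a task, and a single owner thread runs the tasks. It uses only the C++ standard library. EventQueue and handleOnOwnerThread are hypothetical names invented for this illustration; they are not part of RMS, DDS, or NovaRunControl, and this is not a description of how rcServer currently handles its callbacks. In Qt, a queued signal/slot connection (Qt::QueuedConnection) plays a similar role by delivering the call in the receiving object's thread.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an event loop: tasks may be posted from any
// thread, but they are executed one at a time on the thread calling drain().
class EventQueue {
public:
    void post(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

    // Runs exactly `count` tasks on the calling thread, blocking as needed.
    void drain(std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return !tasks_.empty(); });
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // executed outside the lock, on the owner thread
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
};

// Simulates `receivers` message callbacks firing on separate threads.
// Each callback only posts work; `handled` is modified solely by the
// thread that calls this function (the "owner" thread).
std::vector<std::string> handleOnOwnerThread(int receivers) {
    assert(receivers >= 0);
    EventQueue queue;
    std::vector<std::string> handled;
    std::vector<std::thread> threads;
    for (int i = 0; i < receivers; ++i) {
        threads.emplace_back([&queue, &handled, i] {
            std::string msg = "msg" + std::to_string(i);
            queue.post([&handled, msg] { handled.push_back(msg); });
        });
    }
    queue.drain(static_cast<std::size_t>(receivers));
    for (auto& t : threads) t.join();
    return handled;
}
```

Calling handleOnOwnerThread(8) returns all eight messages, in whatever order the threads happened to post them. The point of the design is that the shared container needs no lock of its own, because only one thread ever touches it.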
The RunControl server talks to the RCGUI using a custom protocol implemented using Qt sockets.
Both the RCServer and the RCGUI make use of the Qt toolkit (http://qt-project.org/doc/qt-4.7/classes.html).
#7 Updated by Kurt Biery over 6 years ago
Suggestions from Jon:
I don't want this to turn into a red herring, but my leading suspicion for the cause of the rcServer crashes is some kind of [race?] condition involving mutexes (mutices?). rcServer sends simple text messages over a TCP/IP socket connection to the GUI and ResourceManager clients that tell them what to do. The method that sends the text message to the clients is called via a Qt signal, and it instantiates a mutex. What I think is happening is that some Qt signal is being emitted, the text message is sent out but the "send" method either hangs or takes too long to return, and in the meantime more Qt signals are being emitted that are trying to send text messages. Eventually somebody stomps on someone else, the "data" the rcServer is trying to send to the clients gets wiped out, and the "send" method accesses data that no longer exists. I don't know if this is even possible, but I think something like this is happening.
So my suggestion is to maybe start looking at how these messages are being sent, determine if mutexes are even needed, and maybe think about additional diagnostics I can add to the code to help pinpoint the problem. I don't know if understanding the overall state machine is really critical from the get go, but that's up to you.
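One thing worth checking when looking at the send path: a mutex only serializes callers if they all lock the same mutex object. If "instantiates a mutex" means a new mutex is created on each call (an assumption; the source has not been checked here), then concurrent calls would not exclude one another at all. The sketch below shows the shared-mutex arrangement using only the C++ standard library. ClientChannel and sendFromManyThreads are hypothetical names made up for this illustration, and the vector stands in for the real socket write.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the object that writes text to a client socket.
class ClientChannel {
public:
    // The lock is taken on a member mutex, so every caller of send() on
    // this channel contends for the same lock. A mutex declared as a
    // local variable inside send() would be private to each call and
    // would protect nothing.
    void send(const std::string& text) {
        std::lock_guard<std::mutex> lock(mutex_);
        sent_.push_back(text);  // stand-in for the socket write
    }

    std::size_t sentCount() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return sent_.size();
    }

private:
    mutable std::mutex mutex_;       // one mutex per channel, not per call
    std::vector<std::string> sent_;  // shared state guarded by mutex_
};

// Drives one channel from several threads at once and reports how many
// messages were recorded.
std::size_t sendFromManyThreads(int threads, int perThread) {
    assert(threads >= 0 && perThread >= 0);
    ClientChannel channel;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&channel, t, perThread] {
            for (int i = 0; i < perThread; ++i) {
                // The string is built by the sending thread and stored by
                // value, so it cannot be "wiped out" by another sender.
                channel.send("cmd " + std::to_string(t) + ":" + std::to_string(i));
            }
        });
    }
    for (auto& w : workers) w.join();
    return channel.sentCount();
}
```

With the member mutex in place, sendFromManyThreads(4, 1000) records all 4000 messages. A diagnostic in the same spirit for rcServer would be to log the thread ID on entry to the send method, which would show whether it is ever entered from more than one thread.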