How to handle long delays in table update acknowledgements
At protoDUNE, I noticed a situation in which a BoardReader (BR) seemingly never replied to the table update messages from the RoutingMaster (RM). Since the current RM logic retries the exact same table update until all receivers have acknowledged it, dataflow stopped entirely, even though the incomplete events could have been timed out in the EventBuilders (presuming that the bad BR would also not have sent data fragments).
Eric and I talked a little about this, and there are some subtleties in how best to include additional tokens in updates while still keeping the un-acknowledged information in them. So, we are thinking about this more.
In the meantime, I want to capture a tentative code change that I made in RoutingMasterCore. Basically, all I did was fix a bug in incrementing the 'counter' variable and provide some debug TRACE messages when there are a few stragglers that haven't acknowledged the table update.
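
For illustration only, here is a minimal standalone sketch of the kind of acknowledgement wait described above. It is not the RoutingMasterCore code: the names wait_for_acks, AckWaitResult, poll_acks, max_polls and straggler_threshold are all hypothetical, the log vector stands in for the TRACE debug messages, and the upper bound on polling passes is an assumption added so that the example terminates (the current RM logic has no such bound). This note does not say what the 'counter' bug was, so the sketch simply increments the counter exactly once per pass.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical result type for one acknowledgement wait; not an artdaq type.
struct AckWaitResult {
    bool all_acked;                // true if every receiver acknowledged
    std::size_t polls;             // number of polling passes performed
    std::set<int> stragglers;      // ranks that never acknowledged
    std::vector<std::string> log;  // stand-in for TRACE debug messages
};

// poll_acks(pass) returns the ranks whose acknowledgements arrived on that pass.
// max_polls is an assumed bound; straggler_threshold is the number of pending
// receivers at or below which the remaining ranks are reported by name.
AckWaitResult wait_for_acks(const std::set<int>& receivers,
                            const std::function<std::set<int>(std::size_t)>& poll_acks,
                            std::size_t max_polls,
                            std::size_t straggler_threshold)
{
    assert(max_polls > 0);
    std::set<int> pending = receivers;
    AckWaitResult result{false, 0, {}, {}};
    std::size_t counter = 0;

    while (!pending.empty() && counter < max_polls) {
        for (int rank : poll_acks(counter)) {
            pending.erase(rank);
        }
        ++counter;  // exactly once per pass, whether or not any acks arrived

        // Report by name once only a few stragglers are left.
        if (!pending.empty() && pending.size() <= straggler_threshold) {
            std::string msg = "pass " + std::to_string(counter) +
                              ": still waiting on rank(s)";
            for (int rank : pending) {
                msg += " " + std::to_string(rank);
            }
            result.log.push_back(msg);
        }
    }

    result.all_acked = pending.empty();
    result.polls = counter;
    result.stragglers = pending;
    return result;
}
```

With three receivers, one of which never replies, a caller would see all_acked == false and the silent rank listed in stragglers, along with one log line per pass naming it. What the RM should do at that point (for example, how to carry the un-acknowledged information into the next update) is the open question discussed above, and the sketch deliberately leaves it out.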