Complete the implementation of graceful handling of backpressure
At the moment, when backpressure in the system lasts longer than 5-10 seconds, the EventBuilders and Aggregators drop events. We should fix this so that events are dropped only when a run ends and there truly is no way to recover.
Some work has already been done to prepare for this. There is now a method in EventStore that tries to handle a new event fragment, but returns the fragment if it can't be processed within a specified timeout.
We should continue to print out warning messages when backpressure is significant.
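The intended behavior can be sketched as a retry loop around a timed insert. This is only an illustrative Python sketch, not the artdaq code: `BoundedStore`, `try_insert`, `insert_with_retry`, and the `run_is_ending` callback are hypothetical names, and the warning text is modeled loosely on the existing messages.

```python
import queue


class BoundedStore:
    """Hypothetical stand-in for an event store with limited capacity."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def try_insert(self, fragment, timeout_s):
        """Return None on success, or hand the fragment back on timeout."""
        try:
            self._queue.put(fragment, timeout=timeout_s)
            return None
        except queue.Full:
            return fragment

    def take(self):
        """Remove and return the oldest stored fragment (blocks if empty)."""
        return self._queue.get()


def insert_with_retry(store, fragment, timeout_s, run_is_ending, warn=print):
    """Retry until the store accepts the fragment.

    Returns True if the fragment was stored, False if it was dropped
    because the run is ending while back-pressure persists.
    """
    pending = fragment
    while True:
        pending = store.try_insert(pending, timeout_s)
        if pending is None:
            return True
        if run_is_ending():
            warn("Dropping fragment: run is ending and back-pressure persists")
            return False
        # Keep warning on every failed attempt so back-pressure stays visible.
        warn("Unable to process fragment because of back-pressure - retrying...")
```

In this sketch the only path that drops data is the end-of-run check; under transient backpressure the caller simply blocks, warning once per timeout, until the store has room again.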
After these changes are done, we should test that the system performs as expected under both transient and permanent backpressure (within a given run).
#8 Updated by Kurt Biery over 6 years ago
- Status changed from Resolved to Closed
I've verified that there are no longer lost events when we experience back-pressure (tests run at the DS-50 WH14NE teststand).
Note that we'll all need to change how we look for back-pressure in the logs. We used to be able to search for "FAIL". Now we'll need to search for a substring of messages like the following:
Wed Apr 23 11:27:14 -0500 2014: %MSG-w EventBuilderCore: Aggregator-dsfr6-6650 MF-online
Wed Apr 23 11:27:14 -0500 2014: Unable to process event 10652 because of back-pressure - retrying...
Wed Apr 23 11:27:14 -0500 2014: %MSG
Wed Apr 23 11:27:16 -0500 2014: %MSG-w EventBuilderCore: EventBuilder-dseb8-6642 MF-online
Wed Apr 23 11:27:16 -0500 2014: Unable to process fragment 2 in event 10762 because of back-pressure - retrying...
Wed Apr 23 11:27:16 -0500 2014: %MSG
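For a quick check, searching for the substring "because of back-pressure" matches both message forms shown above. As a rough sketch of a more structured search, the Python snippet below also extracts the event and fragment numbers. The function name and regular expression are illustrative assumptions based only on the two sample messages, so other message variants may need a different pattern.

```python
import re

# Pattern derived from the two sample messages above; the fragment part is optional.
BACKPRESSURE_RE = re.compile(
    r"Unable to process (?:fragment (?P<fragment>\d+) in )?event (?P<event>\d+) "
    r"because of back-pressure - retrying"
)


def find_backpressure_warnings(lines):
    """Return (event, fragment) tuples; fragment is None for whole-event messages."""
    hits = []
    for line in lines:
        match = BACKPRESSURE_RE.search(line)
        if match:
            fragment = match.group("fragment")
            hits.append((int(match.group("event")),
                         int(fragment) if fragment else None))
    return hits


sample = [
    "Wed Apr 23 11:27:14 -0500 2014: %MSG-w EventBuilderCore: Aggregator-dsfr6-6650 MF-online",
    "Wed Apr 23 11:27:14 -0500 2014: Unable to process event 10652 because of back-pressure - retrying...",
    "Wed Apr 23 11:27:14 -0500 2014: %MSG",
    "Wed Apr 23 11:27:16 -0500 2014: %MSG-w EventBuilderCore: EventBuilder-dseb8-6642 MF-online",
    "Wed Apr 23 11:27:16 -0500 2014: Unable to process fragment 2 in event 10762 because of back-pressure - retrying...",
    "Wed Apr 23 11:27:16 -0500 2014: %MSG",
]

print(find_backpressure_warnings(sample))  # [(10652, None), (10762, 2)]
```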