Bug #7892

EventBuilders saturate CPU when BoardReaders send them fragments too quickly

Added by Eric Flumerfelt over 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 02/17/2015
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

Using the start1x2x2System.sh script, I was doing some performance testing of the artdaq-demo system. I noticed that when the BoardReader throttle_usecs parameter was set to 0, the EventBuilders saturate the CPU and report an event rate of only 22 Events/sec. With Ron, I did more testing, varying this parameter, and found that anything below about 2000 us causes this problem. It appears that the EventBuilder cannot tolerate more than about 80 Fragments/sec; if given data faster, it quickly decays to the 22 Events/sec rate. This can be a major problem for "bursty" experiments like mu2e, whose BoardReaders will send thousands of events in a short time with the expectation that the EventBuilders will queue them and work through them over the rest of the Main Injector supercycle.

History

#1 Updated by Eric Flumerfelt over 5 years ago

  • Status changed from New to Resolved

This was due to a full-queue waiting loop in EventStore. Apparently, the sleep interval was already configurable, but it had a minimum of 10000 usecs (the configuration had it running at 50000 us). The sleep time will now be set based on the size of the fragment waiting to go into the queue, since, in theory, smaller fragments should move through the queue faster than larger ones.
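
For context, here is a minimal, self-contained sketch of what a size-scaled full-queue wait could look like. The types, the kUsecPerWord constant, and the enqueueFragment name below are illustrative stand-ins for this note, not the actual EventStore code.

#include <unistd.h>   // usleep, useconds_t
#include <cstddef>
#include <cstdint>
#include <memory>
#include <queue>
#include <vector>

struct Fragment {
  std::vector<uint64_t> words;
  size_t size() const { return words.size(); }
};
using FragmentPtr = std::unique_ptr<Fragment>;

class FragmentQueue {
public:
  explicit FragmentQueue(size_t capacity) : capacity_(capacity) {}
  bool full() const { return q_.size() >= capacity_; }
  void push(FragmentPtr f) { q_.push(std::move(f)); }
private:
  size_t capacity_;
  std::queue<FragmentPtr> q_;
};

// Try to enqueue a fragment. If the queue is full, sleep for a time
// proportional to the fragment size before re-checking, so that small
// fragments (which should drain quickly) are retried more often.
bool enqueueFragment(FragmentQueue& queue, FragmentPtr& frag,
                     int maxLoops = 10, double kUsecPerWord = 1.0 /* hypothetical scaling */)
{
  int loops = 0;
  while (queue.full() && loops < maxLoops) {
    useconds_t sleepTime = static_cast<useconds_t>(frag->size() * kUsecPerWord);
    usleep(sleepTime);
    ++loops;
  }
  if (queue.full()) {
    return false;              // caller still owns frag and can report the timeout
  }
  queue.push(std::move(frag));
  return true;
}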

#2 Updated by Kurt Biery over 5 years ago

As I've mentioned to Eric, I believe that this bug was affecting LBNE 35t, too. Martin Haigh and John Freeman have been chasing problems where small fragments at high rate were not being sent through a 1x2x2 system as quickly as expected.

#3 Updated by Kurt Biery over 5 years ago

Hi Eric,
I think that it would be good to chat about the candidate fix that you committed.

As background: it would probably have been better if I had written the problematic code in the following way:

if (queue_.full()) {
  int MAX_LOOPS = 10;
  size_t fragSize = pfrag->size();
  // Break the configured enqueue timeout into MAX_LOOPS equal sleeps.
  size_t sleepTime = 1000000.0 * (enq_timeout_.count() / MAX_LOOPS);
  int loopCount = 0;
  while (loopCount < MAX_LOOPS && queue_.full()) {
    ++loopCount;
    usleep(sleepTime);
  }
  // Still full after the whole timeout: hand the fragment back to the caller.
  if (queue_.full()) {
    rejectedFragment = std::move(pfrag);
    return false;
  }
}

With this code, it might have been clearer that the idea was to break the configured timeout interval into 10 parts.

I think that what we want to do to fix the bug is increase the number of sub-intervals inside the configurable timeout.

The implementation that exists now breaks the link between the size of the sleep interval and the number of loops. And, truthfully, the fragment size is only a secondary indicator of the rate at which fragments are being received.

One thought that I had was to base the sleep interval on the rate at which fragments are being received using the EVENT_RATE statistics that are accumulated in the other "insert" method. With that, we could determine the sleepTime first (based on a fraction of the time between inserts), calculate the number of sleep loops from the sleepTime and the overall enqueue timeout, and use those two values in the loop that checks to see if there is space in the queue.
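
A minimal sketch of that rate-based idea follows. How the recent fragment interval is obtained, and the 10% fraction, are assumptions for illustration here; they are not the actual EVENT_RATE accessor.

#include <unistd.h>    // useconds_t, usleep
#include <algorithm>

struct WaitPlan {
  useconds_t sleepUsec;  // one sub-interval sleep
  int maxLoops;          // how many such sleeps fit in the enqueue timeout
};

WaitPlan planQueueWait(double recentFragmentIntervalSec,  // e.g. 1.0 / measured fragment rate
                       double enqTimeoutSec,              // configured enqueue timeout
                       double fractionOfInterval = 0.1)   // hypothetical tuning knob
{
  // Sleep for a fraction of the time between fragment arrivals,
  // with a small floor so we never busy-spin.
  double sleepSec = std::max(recentFragmentIntervalSec * fractionOfInterval, 1.0e-5);
  int loops = std::max(1, static_cast<int>(enqTimeoutSec / sleepSec));
  return { static_cast<useconds_t>(sleepSec * 1.0e6), loops };
}

// Usage inside the full-queue check (names as in the snippet above):
//   WaitPlan plan = planQueueWait(1.0 / measuredFragmentRateHz, enq_timeout_.count());
//   int loopCount = 0;
//   while (loopCount < plan.maxLoops && queue_.full()) { ++loopCount; usleep(plan.sleepUsec); }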

Please let me know what you think.
Kurt

#4 Updated by Kurt Biery over 5 years ago

  • Target version set to v1_12_07

#5 Updated by Eric Flumerfelt over 4 years ago

  • Status changed from Resolved to Closed
