Feature #22125

Avoiding long reconnection attempts helps improve resilience against loss of EventBuilders

Added by Kurt Biery 11 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 03/13/2019
Due date:
% Done: 0%
Estimated time:
Experiment: -
Co-Assignees:
Duration:

Description

In tests of killing EventBuilders (EBs) at protoDUNE, I noticed that the BoardReaders (BRs) would spend 20 seconds per event trying to send data to the failed EB (10 seconds trying to reconnect and 10 seconds trying to send the data, I think). In a system with 20 buffers per EB (and therefore 20 entries in the routing table per EB), this would essentially hobble the system for ~400 seconds (20 routing-table entries x 20 seconds each), and it was often hard for the system to recover from that. (Giovanna and others at CERN had noticed that reducing the number of EB buffers from 20 to 1 allowed the system to continue gracefully. I presume that they still endured the 20-second pause while the BRs tried to reconnect to the failed EB, but I haven't verified that, and I would guess that they didn't notice it.)

To help avoid such long attempts to reconnect and send data to a failed EB, I made some candidate code changes in TCPSocket_transfer.cc. The spirit of the changes was to keep the existing retries when initially connecting, but to make only one reconnection attempt (per 'call') after the initial connection has been lost.
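The retry behavior described above might be sketched roughly as follows. This is only an illustration of the idea, not the actual code in TCPSocket_transfer.cc: the class name `ReconnectPolicy`, its methods, and the `initial_retries` parameter are all hypothetical, and the choice to stay in single-attempt mode after any loss is an assumption.

```cpp
#include <functional>

// Hypothetical sketch of the retry policy; not the artdaq implementation.
class ReconnectPolicy {
public:
    explicit ReconnectPolicy(int initial_retries)
        : initial_retries_(initial_retries) {}

    // try_connect is a caller-supplied callable returning true on success.
    // Returns true if a connection exists when this call finishes.
    bool ensure_connected(const std::function<bool()>& try_connect) {
        if (connected_) return true;
        // Before the first successful connection, use the full retry budget.
        // After a connection has been lost, make a single attempt per call
        // so the caller is not stalled by a destination that has failed.
        const int attempts = connection_was_lost_ ? 1 : initial_retries_;
        for (int i = 0; i < attempts; ++i) {
            if (try_connect()) {
                connected_ = true;
                return true;
            }
        }
        return false;
    }

    // Record that an established connection has dropped.
    void mark_disconnected() {
        if (connected_) {
            connected_ = false;
            connection_was_lost_ = true;  // assumption: never reset afterwards
        }
    }

    bool connection_was_lost() const { return connection_was_lost_; }

private:
    int initial_retries_;
    bool connected_ = false;
    bool connection_was_lost_ = false;
};
```

With a policy like this, a sender calling `ensure_connected` for each routing-table entry that points at a dead EB would pay for one failed attempt per call rather than the full retry sequence.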

History

#1 Updated by Eric Flumerfelt 10 months ago

Should the connection_was_lost_ variable be initialized in the member initialization list rather than the body of the constructor?
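For reference, the two styles in question look roughly like this. The class names below are hypothetical stand-ins, not the real transfer class; only the member name connection_was_lost_ comes from this ticket.

```cpp
// Hypothetical stand-in: member assigned in the constructor body.
class TransferBodyInit {
public:
    TransferBodyInit() {
        connection_was_lost_ = false;  // assignment after the member already exists
    }
    bool connection_was_lost() const { return connection_was_lost_; }

private:
    bool connection_was_lost_;
};

// Hypothetical stand-in: member set in the member initialization list.
class TransferListInit {
public:
    TransferListInit() : connection_was_lost_(false) {}
    bool connection_was_lost() const { return connection_was_lost_; }

private:
    bool connection_was_lost_;
};
```

For a plain bool the observable behavior is the same; the initialization list is the more conventional form and ensures the member has its value before the constructor body runs.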

#2 Updated by Eric Flumerfelt 10 months ago

  • Status changed from Assigned to Resolved

Moving issue through state machine

#3 Updated by Eric Flumerfelt 10 months ago

  • Status changed from Resolved to Reviewed
  • Tracker changed from Idea to Feature
  • Co-Assignees Eric Flumerfelt added

I have reviewed the code and done before/after testing, using runTransferTest and the routing_master_example simple_test_config.

#4 Updated by Eric Flumerfelt 9 months ago

  • Target version set to artdaq v3_05_00
  • Status changed from Reviewed to Closed

