Avoiding long reconnection attempts helps improve resilience against loss of EventBuilders
In tests of killing EventBuilders at protoDUNE, I noticed that the BoardReaders would spend 20 seconds per event trying to send data to the failed EB (10 seconds trying to reconnect and 10 seconds trying to send the data, I think). In a system with 20 buffers per EB (and therefore 20 entries in the routing table per EB), this would essentially hobble the system for ~400 seconds (20 entries x 20 seconds each), and it was often hard for the system to recover from that. (Giovanna and others at CERN had noticed that reducing the number of EB buffers from 20 to 1 allowed the system to continue gracefully. I presume that they still endured the 20-second pause while the BRs tried to reconnect to the failed EB, but I haven't verified that, and I would guess that they didn't notice it.)
To help avoid such long attempts to reconnect and send data to a failed EB, I made some candidate code changes in TCPSocket_transfer.cc. The spirit of the changes was to keep the existing retries when initially connecting, but to try reconnecting only once per 'call' after an established connection has been lost.
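
As a rough illustration of that retry policy (this is not the actual change in TCPSocket_transfer.cc), a minimal C++ sketch might look like the following. The names ReconnectPolicy, attempts_allowed, connect_with_policy, and try_connect are hypothetical and exist only for this example; the real transfer code and its retry counts may be organized quite differently.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper for illustration only; not part of TCPSocket_transfer.cc.
struct ReconnectPolicy {
    int initial_attempts;         // attempts allowed before the first successful connection (assumed value)
    bool ever_connected = false;  // becomes true once any connection attempt has succeeded

    explicit ReconnectPolicy(int attempts) : initial_attempts(attempts) {
        assert(attempts >= 1);  // at least one attempt must always be allowed
    }

    // Full retry budget for the initial connection, a single attempt afterwards.
    int attempts_allowed() const { return ever_connected ? 1 : initial_attempts; }
};

// Runs one 'call' worth of connection attempts and returns true on success.
// 'try_connect' stands in for whatever the real socket connect step is.
inline bool connect_with_policy(ReconnectPolicy& policy,
                                const std::function<bool()>& try_connect) {
    const int limit = policy.attempts_allowed();
    for (int attempt = 0; attempt < limit; ++attempt) {
        if (try_connect()) {
            policy.ever_connected = true;
            return true;
        }
    }
    return false;  // give up for this call; the caller decides what to do with the data
}
```

Under these assumptions, a sender that has never reached its EB still gets the full retry budget at startup, while a sender whose EB has died makes only one reconnection attempt per call before returning control to the caller.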