Feature #8650
reliable retries
Start date: 05/06/2015
Due date:
% Done: 100%
Estimated time:
Description
Hello,
We are losing grid jobs to transient storage issues at the "ifdh cp..." stage. The situation could be improved by having ifdh retry failed commands after a delay. For production jobs, I would like the delays to start at a random value of a few seconds and increase exponentially up to the order of an hour. I think these values are appropriate because a typical production job runs for several hours, so waiting a fraction of that time to preserve its result makes sense.
Andrei
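The retry policy requested above (random initial delay of a few seconds, exponential growth, cap on the order of an hour) could be sketched as follows. This is only an illustration of the idea, not the code that went into ifdh: the function names, the 1 to 10 second starting range, the doubling factor, the 3600 second cap, and the default of 10 attempts are all assumptions chosen for the example.

```python
import random
import subprocess
import time


def backoff_delays(max_tries, first_delay_range=(1.0, 10.0), factor=2.0,
                   max_delay=3600.0, rng=random):
    """Yield the wait in seconds before each retry (max_tries - 1 waits).

    All default values are illustrative assumptions, not ifdh settings.
    """
    # A random starting point keeps many jobs from retrying in lockstep.
    delay = rng.uniform(*first_delay_range)
    for _ in range(max_tries - 1):
        yield min(delay, max_delay)  # never wait longer than the cap
        delay *= factor              # exponential growth


def run_with_retries(cmd, max_tries=10, sleep=time.sleep,
                     run=subprocess.call, **backoff):
    """Run cmd until it exits with 0 or the attempts are used up.

    The command is always run at least once. Returns the last exit status.
    """
    delays = backoff_delays(max_tries, **backoff)
    while True:
        status = run(cmd)
        if status == 0:
            return status
        delay = next(delays, None)
        if delay is None:  # no retries left, report the failure
            return status
        sleep(delay)
```

A job wrapper might then call something like `run_with_retries(["ifdh", "cp", src, dst])`, where `src` and `dst` are placeholders for the actual paths. The `sleep` and `run` parameters exist only so the policy can be exercised without real waiting or real transfers.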
History
#1 Updated by Marc Mengel over 4 years ago
- Assignee set to Marc Mengel
- Target version set to v1_8_3
- % Done changed from 0 to 100
Changes are in d571b579d66, 7b9c69387d72 and 47dbe534ae8
#2 Updated by Marc Mengel about 3 years ago
- Status changed from New to Resolved
#3 Updated by Marc Mengel about 3 years ago
- Status changed from Resolved to Closed