Feature #8650

reliable retries

Added by Andrei Gaponenko over 4 years ago. Updated about 3 years ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 05/06/2015
Due date:
% Done: 100%
Estimated time:
Duration:

Description

Hello,

We are losing grid jobs to transient storage issues at the "ifdh cp..." stage. The situation can be improved by having ifdh retry failed commands after a delay. For production jobs I would like the delays to start at a random value of a few seconds and increase exponentially up to the order of an hour. I think these values are appropriate because a typical production job runs for several hours, so waiting a fraction of that time to preserve its result makes sense.

Andrei
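
The requested behaviour (a random initial delay of a few seconds, growing exponentially up to roughly an hour) can be sketched as below. This is only an illustration of the idea, not how ifdh implements it: the function names, the 1 to 5 second starting range, the doubling factor, and the 3600 second cap are all assumptions chosen to match the description.

```python
import random
import subprocess
import time


def backoff_delays(first_min=1.0, first_max=5.0, factor=2.0, cap=3600.0, rng=random):
    """Yield retry delays in seconds.

    The first delay is random within [first_min, first_max]; each later
    delay is multiplied by `factor` until it would reach `cap`, at which
    point `cap` itself is yielded once and the sequence ends.
    All default values are assumptions, not ifdh settings.
    """
    delay = rng.uniform(first_min, first_max)
    while delay < cap:
        yield delay
        delay *= factor
    yield cap


def run_with_retries(command, delays, run=subprocess.call, sleep=time.sleep):
    """Run `command` until it exits with status 0 or the delays run out.

    Returns the exit status of the last attempt. `run` and `sleep` are
    injectable so the retry logic can be exercised without real commands.
    """
    delays = iter(delays)
    while True:
        status = run(command)
        if status == 0:
            return status
        delay = next(delays, None)
        if delay is None:
            # No delays left: give up and report the last failure.
            return status
        sleep(delay)
```

A job wrapper could then call something like `run_with_retries(["ifdh", "cp", src, dst], backoff_delays())`, where `src` and `dst` are placeholders for the actual transfer arguments. With these assumed defaults the total waiting time before giving up is bounded at a few hours in the worst case, which stays comparable to the length of a typical production job.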

History

#1 Updated by Marc Mengel over 4 years ago

  • Assignee set to Marc Mengel
  • Target version set to v1_8_3
  • % Done changed from 0 to 100

#2 Updated by Marc Mengel about 3 years ago

  • Status changed from New to Resolved

#3 Updated by Marc Mengel about 3 years ago

  • Status changed from Resolved to Closed

