NotesOnTimeouts

There are really two items: timeouts and retries.

Retries

The behavior that most often bothers people is the retry logic, which tries to keep things running in the face of network dropouts, stuck dCache servers, etc., but is really annoying when you just have a bug in your script and it's actually trying to copy a nonexistent file. The solution is to set IFDH_CP_MAXRETRIES=0 (or maybe 1 or 2) in the environment, especially when testing out new job scripts.
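As a minimal sketch of the above, assuming a Bourne-style shell (sh/bash) and that your job script inherits the environment it is launched from, retries can be switched off like this. The commented-out copy and its paths are hypothetical placeholders, not part of ifdh itself.

```shell
# Disable ifdh copy retries while debugging a new job script,
# so a copy of a nonexistent file fails right away.
export IFDH_CP_MAXRETRIES=0

# Hypothetical copy with placeholder paths; uncomment where ifdh is set up:
# ifdh cp /path/to/input.root ./input.root

echo "IFDH_CP_MAXRETRIES=${IFDH_CP_MAXRETRIES}"
```

Once the script works, unset the variable (or raise it to 1 or 2) so that production jobs still ride out the occasional network or dCache hiccup.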

Timeouts

There is also a timeout set on most of the copies, again for dCache, which allows for files that might be on tape or copies that are queued because dCache pools are busy. To change it, you have to set the right override variable, with the right option, for the type of copy involved. If you look in the ifdh.cfg file at the various [protocol gsiftp] and [protocol root] stanzas, e.g.

https://cdcvs.fnal.gov/redmine/projects/ifdhc/repository/revisions/develop/entry/ifdh.cfg#L422

you can see the "extra_env" entry, which names the environment variable used to add options to that kind of copy, and the "cp_cmd" entry, which gives the usual 14400-second timeout options for such copies. So in the gsiftp and root cases, you could

setenv IFDH_GSIFTP_EXTRA "--stall-timeout 60"

setenv IFDH_ROOT_EXTRA "-DIRequestTimeout 60"

if you want to make the copies wait at most 60 seconds for a file to start transferring. However, I don't recommend changing these unless you've prestaged your data and have really tight time limits on your jobs, as they can make your copies fail prematurely.
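For sh/bash users, the same overrides might look like the sketch below. It assumes the variable names are IFDH_GSIFTP_EXTRA and IFDH_ROOT_EXTRA (check the "extra_env" entries in your own ifdh.cfg), and the helper function set_ifdh_stall_timeout is a hypothetical convenience, not something ifdh provides.

```shell
# Hypothetical helper: apply one timeout (in seconds) to both the
# gsiftp and root copy types by exporting the override variables.
set_ifdh_stall_timeout() {
    secs="${1:-60}"   # default to 60 seconds if no argument is given
    export IFDH_GSIFTP_EXTRA="--stall-timeout ${secs}"
    export IFDH_ROOT_EXTRA="-DIRequestTimeout ${secs}"
}

# Only worth doing for prestaged data with tight job time limits.
set_ifdh_stall_timeout 60

echo "gsiftp extra options: ${IFDH_GSIFTP_EXTRA}"
echo "root extra options:   ${IFDH_ROOT_EXTRA}"
```

Keeping both settings in one function makes it easy to leave the call out of ordinary jobs, where the default long timeout is the safer choice.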