Support #23167

match art/root behavior to dCache

Added by Raymond Culbertson 25 days ago. Updated 13 days ago.

Status: Feedback
Priority: Normal
Assignee: -
Target version: -
Start date: 08/23/2019
Due date:
% Done: 0%
Estimated time:
Scope: Internal
Experiment: Mu2e
SSI Package: art
Co-Assignees:
Duration:

Description

We are having trouble with xrootd, and the solution may lie
partially in art. From observations of log files and discussions with
dCache experts, we have three cases:
  1. The file is in tape-backed dCache and not on disk at the moment
    of the request. In this case, dCache returns via xrootd a code
    that indicates this state. They say a user should wait and retry,
    but we saw that root/art aborts immediately. (ifdh and nfs block,
    so it isn't an issue there.)
  2. If a server is overloaded and the request goes into a dCache
    queue, it will return an error after 30s (see below). In this case the
    right thing to do is retry for a while.
  3. There are transient errors (we've seen mysterious DNS errors),
    and there should be retries.

In previous discussions with Kyle and Philippe, we had concluded
that root should currently be configured to retry many times
for perhaps an hour. If I understood correctly, this does not happen
because art catches the error and treats all non-info messages as fatal.

As I recall, in the case of a file that is on tape and not staged,
the return code was special, "resource not available",
so it could be recognized and treated properly: wait and retry. Ideally
xrootd would just block.

The error returned in the overloaded case is essentially "file not found"
(see below) so it is not distinguishable from an actual missing file.
Ideally we can get those separated so we can treat them differently
and, better still, it would block for longer than 30s.

The transient error case should be handled with at least a few retries.
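
To make the request concrete, here is a rough sketch of how the three cases
could map onto different retry budgets. It is written in Python for brevity
and is not art, ROOT, or xrootd code. The names OpenStatus, RetryPolicy and
open_with_retry are hypothetical, the time budgets are illustrative guesses
(only "perhaps an hour" and the 30s queue timeout come from the discussion
above), and try_open stands in for whatever call actually opens the file.

```python
import enum
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


class OpenStatus(enum.Enum):
    """Hypothetical outcome categories; not real xrootd or dCache codes."""
    OK = "ok"
    NOT_STAGED = "resource not available"  # case 1: on tape, not on disk
    NOT_FOUND = "file not found"           # case 2 looks like a missing file
    TRANSIENT = "transient"                # case 3: e.g. a DNS failure


@dataclass(frozen=True)
class RetryPolicy:
    max_wait_s: float  # total time budget, measured from the first attempt
    delay_s: float     # pause between attempts


# Illustrative budgets only (assumptions, not agreed values).
POLICIES: Dict[OpenStatus, RetryPolicy] = {
    OpenStatus.NOT_STAGED: RetryPolicy(max_wait_s=3600.0, delay_s=60.0),
    # Overload cannot be told apart from a missing file, so keep this shorter.
    OpenStatus.NOT_FOUND: RetryPolicy(max_wait_s=600.0, delay_s=30.0),
    OpenStatus.TRANSIENT: RetryPolicy(max_wait_s=120.0, delay_s=10.0),
}


def open_with_retry(
    try_open: Callable[[str], OpenStatus],
    url: str,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
    policies: Dict[OpenStatus, RetryPolicy] = POLICIES,
) -> Tuple[OpenStatus, int]:
    """Retry try_open(url) until it succeeds or the budget for the most
    recent kind of failure would be exceeded by one more wait.

    Returns (final_status, number_of_attempts).
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        status = try_open(url)
        if status is OpenStatus.OK:
            return status, attempts
        policy = policies[status]
        # Give up if another wait would run past this failure kind's budget.
        if clock() - start + policy.delay_s > policy.max_wait_s:
            return status, attempts
        sleep(policy.delay_s)
```

The point of the sketch is only that the decision to abort is made per error
category, not on the first non-info message. With the numbers above, a file
that is genuinely missing would be reported after about ten minutes, while a
file that is being staged from tape would be waited on for up to an hour.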

22-Aug-2019 18:04:35 UTC  Initiating request to open input file 
"xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/persistent/users/
mu2epro/workflow/MDC2018_DS-cosmic-mix_i_0/good/22585566.00/00/00145/
dig.mu2e.DS-cosmic-mix.MDC2018i.001002_00000780.art" 

%MSG-s ArtException:  PostEndJob 22-Aug-2019 18:05:01 UTC ModuleEndJob
cet::exception caught in art
---- OtherArt BEGIN
---- FileOpenError BEGIN
RootInputFileSequence::initFile(): Input file 
xroot://fndca1.fnal.gov/pnfs/fnal.gov/usr/mu2e/persistent/users/mu2epro/
workflow/MDC2018_DS-cosmic-mix_i_0/good/22585566.00/00/00145/
dig.mu2e.DS-cosmic-mix.MDC2018i.001002_00000780.art was not found or could not be opened.

History

#1 Updated by Kyle Knoepfel 13 days ago

  • Status changed from New to Feedback

This will take some investigation. How does this relate to issue #21638?


