Feature #21289

Allow root to retry xroot file open

Added by Raymond Culbertson almost 2 years ago. Updated almost 2 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: Infrastructure
Target version:
Start date: 11/01/2018
Due date:
% Done: 100%
Estimated time: 4.00 h
Spent time:
Scope: Internal
Experiment: Mu2e
SSI Package: art
Duration:

Description

A few months ago, Mu2e started regularly using xroot for reading
art input files. We get errors like:

---- FatalRootError BEGIN
Fatal Root Error: @SUB=TUnixSystem::GetHostByName
getaddrinfo failed for 'fndca1.fnal.gov': Temporary failure in name resolution
---- FatalRootError END

at the percent level. In a ticket, sysadmins and networking
experts could not find any real errors in DNS or the network,
so we seem to be stuck with this error for the foreseeable future.

While investigating, we found that ROOT can retry. Philippe Canal said:

The error is not supposed to be fatal. If the (ROOT) error handler
used in your executable is turning it into a fatal error then this
would prevent any retries ....

So it appears art is blocking ROOT from retrying. This ticket
requests that art allow the retries. Ideally we would have some control
over the retry pattern: this problem may be completely fixed
with one retry, so we would want to enable one retry only. Other,
more serious errors that fail all retries for hours would
just eat up grid time. Also ideally, there would be some distinction
between errors. For example, "file not found" or "url syntax error"
would not retry, but "DNS lookup" would retry for a few minutes,
and "xroot server not responding" would retry for an hour.
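
The per-error retry pattern requested above could be sketched roughly as follows. This is only an illustration in Python, not art or ROOT code: the error-class names, the `OpenError` exception, the `open_with_retry` helper, and the concrete time budgets (3 minutes for DNS, 1 hour for an unresponsive server) are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    max_wait_s: float  # total time budget spent waiting between attempts
    interval_s: float  # pause before each retry; 0 means "never retry"


# Hypothetical error classes and budgets, modeled on the examples in the ticket.
POLICIES = {
    "file_not_found": RetryPolicy(0, 0),
    "url_syntax_error": RetryPolicy(0, 0),
    "dns_lookup": RetryPolicy(180, 30),               # "a few minutes" (assumed)
    "server_not_responding": RetryPolicy(3600, 300),  # "an hour"
}


class OpenError(Exception):
    """Hypothetical exception carrying the error class of a failed open."""

    def __init__(self, kind, message=""):
        super().__init__(message or kind)
        self.kind = kind


def open_with_retry(opener, url, sleep=time.sleep):
    """Call opener(url), retrying according to the policy for each error class."""
    waited = 0.0
    while True:
        try:
            return opener(url)
        except OpenError as err:
            # Unknown error classes get no retries.
            policy = POLICIES.get(err.kind, RetryPolicy(0, 0))
            out_of_budget = waited + policy.interval_s > policy.max_wait_s
            if policy.interval_s <= 0 or out_of_budget:
                raise
            sleep(policy.interval_s)
            waited += policy.interval_s
```

Passing `sleep` as a parameter keeps the loop testable without real delays. A "one retry only" pattern would correspond to a policy whose `max_wait_s` equals its `interval_s`.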

We are using art v2_10_04

History

#1 Updated by Kyle Knoepfel almost 2 years ago

  • Project changed from cet-is to art
  • Status changed from New to Assigned
  • Assignee set to Kyle Knoepfel
  • Target version set to 2.11.04
  • Estimated time set to 4.00 h

Based on conversation with Rob Kutschke, we will target this bug fix for art 2.11.04.

#2 Updated by Kyle Knoepfel almost 2 years ago

  • Description updated (diff)

#3 Updated by Kyle Knoepfel almost 2 years ago

  • Category set to Infrastructure
  • Status changed from Assigned to Resolved
  • % Done changed from 0 to 100

This bug has been fixed with commit art:9f251fa6. The commit also re-enables art's custom ROOT handler, which was accidentally disabled for the art 2.11 series.

It is quite difficult for us to anticipate the set of XRootD/file-handling errors that should induce a retry and those that should be fatal. For that reason, we will update the list of non-fatal errors as they are encountered. For now, the following errors are not fatal:

  • Any error from TUnixSystem::GetHostByName
  • Any error from TNetXNGFile::Open that is not marked as "FATAL" by XRootD
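
As a rough illustration of the two rules above, the decision could be written as a small predicate. This is a Python sketch, not the code from the art commit: the function name `is_fatal` is hypothetical, it assumes XRootD marks fatal errors with the substring "[FATAL]" in the message text, and it simply treats errors from any other location as fatal, which is a simplification not described in this ticket.

```python
# Locations whose errors are never treated as fatal (first rule in the list).
ALWAYS_NONFATAL_LOCATIONS = {"TUnixSystem::GetHostByName"}


def is_fatal(location: str, message: str) -> bool:
    """Decide whether a ROOT error should abort the job (illustrative only)."""
    if location in ALWAYS_NONFATAL_LOCATIONS:
        return False
    if location == "TNetXNGFile::Open":
        # Assumption: XRootD tags fatal errors with "[FATAL]" in the message.
        return "[FATAL]" in message
    # Simplification for this sketch: all other locations stay fatal.
    return True
```

With this predicate, the `GetHostByName` failure quoted in the description would be classified as non-fatal, leaving ROOT free to retry.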

#4 Updated by Kyle Knoepfel almost 2 years ago

  • Status changed from Resolved to Closed

