Support #21638

What is art file open retry behavior?

Added by Raymond Culbertson 8 months ago. Updated 8 months ago.

Status: Feedback
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 01/08/2019
Due date:
% Done: 0%
Estimated time:
Spent time:
Scope: Internal
Experiment: -
SSI Package:
Duration:
Description

In some recent jobs, I was reading a list of 8 art files as input to an art (v2_11_05) executable. I have 40 jobs running on 8 files each, and 38 finished OK. Two appear to hang on a file open. I asked dCache what they saw, and they said their logs show a series of quick connects followed by disconnects. I can get more information, or you can join INC000001011024 and ask questions. For now I'd like to ask what I should expect from RootInput in terms of retries, at the art and ROOT layers, and whether you recognize this behavior. I'd also like to know what might be done to log any retries on the art side.

For now, this is just a request to confirm what art behavior is expected. I have to correlate that with dCache behavior and figure out what is failing. Eventually, we might ask for additional retry behavior (dCache predicted a retry would work in this case). When copying files to disk, we can do retries, but because of FermiGrid disk contention, which will not be fixed anytime soon, we are being forced to use xroot streaming file access more, so that access must be highly reliable.

History

#1 Updated by Kyle Knoepfel 8 months ago

  • Project changed from messagefacility to art
  • Scope set to Internal
  • Experiment - added

#2 Updated by Kyle Knoepfel 8 months ago

  • Status changed from New to Assigned
  • Assignee set to Kyle Knoepfel

I do not believe you are seeing anything art-specific. I will talk with Philippe to figure out what the expected behavior from ROOT is.

#3 Updated by Kyle Knoepfel 8 months ago

The system.rootrc file included with the ROOT distributions we provide includes the following parameters:

# NetXNG.ConnectionWindow     - A time window for the connection establishment. A
#                               connection failure is declared if the connection
#                               is not established within the time window. If a
#                               connection failure happens earlier then another
#                               connection attempt will only be made at the
#                               beginning of the next window.
NetXNG.ConnectionWindow: 30

# NetXNG.ConnectionRetry      - Number of connection attempts that should be
#                               made (number of available connection windows)
#                               before declaring a permanent failure.
NetXNG.ConnectionRetry: 4096

# NetXNG.RequestTimeout       - Default value for the time after which an error
#                               is declared if it was impossible to get a
#                               response to a request.
NetXNG.RequestTimeout: 14400

# NetXNG.RedirectLimit        - Maximum number of allowed redirections.
NetXNG.RedirectLimit: 64

Unless a .rootrc file is provided by the user that overrides these defaults, or unless the XRD_* environment variables are set, these settings will be used by ROOT. Up to 4096 connection attempts are allowed with a timeout of 4 hours (14,400 seconds). It may be that the 4-hour timeout is much too long for your case, in which case it should be overridden. This can be done by setting the XRD_REQUESTTIMEOUT environment variable to the desired number of seconds.
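As a rough illustration of the override described above, the following Python sketch launches a job with XRD_REQUESTTIMEOUT set in the child's environment. The helper names (xrd_environment, run_with_timeout_override) and the 600-second value are hypothetical, chosen only for the example; the only facts taken from this ticket are the variable name XRD_REQUESTTIMEOUT and the system.rootrc defaults. A child process that echoes the variable stands in for the real art job.

```python
import os
import subprocess
import sys

# Defaults quoted above from system.rootrc (seconds, except the retry count).
ROOTRC_DEFAULTS = {
    "NetXNG.ConnectionWindow": 30,
    "NetXNG.ConnectionRetry": 4096,
    "NetXNG.RequestTimeout": 14400,
}


def xrd_environment(request_timeout_s, base_env=None):
    """Return a copy of the environment with XRD_REQUESTTIMEOUT set.

    Hypothetical helper: the caller's own environment is not modified.
    """
    if request_timeout_s <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    env = dict(os.environ if base_env is None else base_env)
    env["XRD_REQUESTTIMEOUT"] = str(int(request_timeout_s))
    return env


def run_with_timeout_override(command, request_timeout_s):
    """Run `command` (a list of arguments) with the overridden request timeout."""
    return subprocess.run(
        command,
        env=xrd_environment(request_timeout_s),
        check=False,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # The default request timeout expressed in hours (14400 s / 3600 s per hour).
    print(ROOTRC_DEFAULTS["NetXNG.RequestTimeout"] / 3600)
    # Stand-in for the real art job: a child process that echoes the variable.
    probe = [sys.executable, "-c",
             "import os; print(os.environ['XRD_REQUESTTIMEOUT'])"]
    print(run_with_timeout_override(probe, 600).stdout.strip())
```

In a real job the probe command would be replaced by the art command line. Setting the variable only for the child process keeps the override local to that job rather than changing the whole session.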

Chris Green (on the watchers list) recalls that the bulk of the XRootD errors he encountered were due to authentication timeouts, and that the timeout window for authentication was not configurable. We do not know whether this is still the case. Chris, please correct me where necessary.

#4 Updated by Kyle Knoepfel 8 months ago

  • Status changed from Assigned to Feedback

Ray, is the information above sufficient for you to proceed?

#5 Updated by Raymond Culbertson 8 months ago

Thanks, that's very useful information. Do you know if the
failures before a connect can be logged? I'd like to log
every failed attempt, including the error. Then I think that's
all I need out of this ticket.


