Feature #24554

Warn about grid resource unavailable

Added by Marco Mambelli 4 months ago. Updated 26 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 06/20/2020
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

Most glideins are submitted using the grid universe.
When there is an error in the grid_resource string of an HTCondor grid universe submission (e.g. a wrong hostname or port), the HTCondor jobs remain idle instead of failing.
This may mask errors in the Factory configuration.
This did not happen in the past, e.g. with GRAM jobs (Globus GT2).

The Factory should warn the admins about this, e.g. by inspecting idle jobs for the GridResourceUnavailableTime attribute or the log files for "Detected Down Grid Resource" messages.
This could be a temporary, recoverable error, but it could also be a permanent one, such as a misspelling.
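
A minimal sketch of the attribute check suggested above, in Python. It assumes the job ads have already been fetched from the schedd and are available as plain dictionaries. The function name, the one-hour threshold, and the sample ads are hypothetical and not part of the Factory code. JobStatus 1 is HTCondor's value for Idle, and GridResourceUnavailableTime is treated as the epoch time at which the resource was first seen down.

```python
import time

IDLE = 1  # HTCondor JobStatus value for Idle


def find_unavailable_idle_jobs(job_ads, max_unavailable_secs=3600, now=None):
    """Return (job_id, seconds_unavailable) for idle jobs whose grid
    resource has been reported down for at least max_unavailable_secs.

    job_ads is an iterable of dict-like job ads (hypothetical input format).
    """
    now = time.time() if now is None else now
    flagged = []
    for ad in job_ads:
        if ad.get("JobStatus") != IDLE:
            continue  # only idle jobs hide this kind of problem
        down_since = ad.get("GridResourceUnavailableTime")
        if down_since is None:
            continue  # resource is not currently marked as unavailable
        elapsed = now - down_since
        if elapsed >= max_unavailable_secs:
            job_id = "%s.%s" % (ad.get("ClusterId"), ad.get("ProcId"))
            flagged.append((job_id, int(elapsed)))
    return flagged
```

The threshold is what separates a short outage of the CE from a likely configuration error; a suitable value would have to be chosen by the Factory admins.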

Maybe the glidein could be put on hold if the resource has been unavailable for a while. Hold is what is normally used to report errors.

Below is a message explaining the HTCondor approach:

Any hostnames in a grid_resource line are not validated at submit time, so condor_submit won’t fail because of a typo in a name.
You should see a "Detected Down Grid Resource" event in the job log, if you have one. Also, the attribute GridResourceUnavailableTime will be set in the job ad. This is done for errors that may be temporary.

If a failure to talk to the CE is not temporary, then our guideline is to put the job on hold. An invalid hostname should probably be treated as such. A mistyped port number or schedd name is trickier. These may be caused by the CE service being down temporarily.
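
The log-based check mentioned in the message above could look roughly like the following Python sketch. It only relies on the "Detected Down Grid Resource" text quoted in the message. The assumption that the same line carries the job id in the form (cluster.proc.subproc) reflects the usual HTCondor user log event header and should be verified against real Factory logs. The function name is hypothetical.

```python
import re

DOWN_MARKER = "Detected Down Grid Resource"
# Assumed event-header job id format: (cluster.proc.subproc)
JOB_ID_RE = re.compile(r"\((\d+)\.(\d+)\.\d+\)")


def jobs_with_down_resource(log_lines):
    """Return the sorted ids of jobs that logged a down-resource event."""
    hits = set()
    for line in log_lines:
        if DOWN_MARKER not in line:
            continue
        match = JOB_ID_RE.search(line)
        if match:
            hits.add("%d.%d" % (int(match.group(1)), int(match.group(2))))
        else:
            hits.add("unknown")  # marker found but no parsable job id
    return sorted(hits)
```

A job that logged this event may have recovered since, so the result is better treated as a list of candidates to cross-check against GridResourceUnavailableTime than as a list of failures.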

History

#1 Updated by Marco Mambelli 4 months ago

My follow-up email. This discussion should continue with the HTCondor team.

I have seen similar things happen in the EC2 universe: the ec2_gahp gets hung up and the glideins show idle forever.

Steve
From: HTCondor-users <htcondor-users-bounces@cs.wisc.edu> on behalf of Marco Mambelli <marcom@fnal.gov>
Sent: Tuesday, June 16, 2020 6:48 PM
To: HTCondor-Users Mail List <htcondor-users@cs.wisc.edu>
Subject: Re: [HTCondor-users] condor jobs remaining idle if the host name or the port are misspelled

Thanks Jaime,
I see both the "Detected Down Grid Resource" event and the GridResourceUnavailableTime attribute, this helps.

The behavior surprised me because I remember the opposite in the past: e.g. with GRAM the job went on hold, and in GWMS we had a table of recoverable errors for which we triggered a release.
I interpreted idle as "just wait, all is OK, I'm working on it" (no need to investigate further, things will eventually run) and expected a hold for problems, letting the submitter decide whether to recover/release or fail.
Here jobs may stay idle for days if there is a typo, and in GWMS we use the number of idle jobs as a measure of the pressure on the system.

I will inspect the log and GridResourceUnavailableTime to trigger a warning and fail the jobs.
I guess GridResourceUnavailableTime is set for failures with any type of resource in the grid universe.
Is there any other attribute I should look for, in the grid universe or in other universes, to alert about possible problems while the job is still idle?

Thank you,
Marco

#2 Updated by Marco Mambelli about 1 month ago

  • Target version changed from v3_6_4 to v3_6_5

#3 Updated by Marco Mambelli 26 days ago

  • Target version changed from v3_6_5 to v3_6_6
