Feature #25199

Avoid crashing the Factory (and the Frontend) if condor is down/not responsive

Added by Marco Mambelli 4 months ago. Updated 13 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version:
Start date: 11/13/2020
Due date:
% Done: 0%
Estimated time:
Stakeholders: HEPCloud
Duration:
Description

Currently, both the Frontend and the Factory fail, and at times crash, if HTCondor is unresponsive for long enough.
The exit, the printed messages, and the log entries should be more controlled.
A stack trace is never a good way to end a program.

Steve Tim (HEPCloud) suggested adding exponential back-off and extending the retries.

A problem seen in the HEPCloud Factory is that during FNAL security scans HTCondor was unresponsive for 10-15 min, and this crashed the Factory.
After the scan HTCondor restarted, but the Factory stayed down.
The service had to be restarted by a watchdog script.

We should determine the best behavior and implement it:
- Each operation (condor_status, condor_q, advertise, condor_submit) has a timeout. Exit if the operation fails or times out multiple times.
- Should the operator choose the number of retries? Or the timeouts?
- Should there be exponential back-off and a more resilient approach (more attempts, longer timeouts)?
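
The options above could be combined roughly as in the following sketch. It is only an illustration, not existing Factory or Frontend code: the function name `run_with_retries`, the exception `CommandFailed`, and all default values are hypothetical. Each attempt has its own timeout, failures are retried with exponentially growing waits, and the caller gets a single controlled exception instead of a stack trace from deep inside the operation.

```python
import subprocess
import time


class CommandFailed(Exception):
    """Raised when a command keeps failing after all retries (hypothetical name)."""


def run_with_retries(cmd, timeout=60, max_attempts=5, base_delay=1.0,
                     max_delay=300.0, sleep=time.sleep):
    """Run cmd, retrying with exponential back-off on failure or timeout.

    Returns the command's stdout as text. Raises CommandFailed once
    max_attempts attempts have all failed. All defaults are assumptions.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = subprocess.run(cmd, capture_output=True, text=True,
                                    timeout=timeout)
            if result.returncode == 0:
                return result.stdout
            last_error = "exit code %d: %s" % (result.returncode,
                                               result.stderr.strip())
        except subprocess.TimeoutExpired:
            last_error = "timed out after %s s" % timeout
        except OSError as err:
            last_error = "could not start: %s" % err
        # Wait before the next attempt; the wait doubles each time, up to max_delay.
        if attempt < max_attempts - 1:
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise CommandFailed("%s failed %d times; last error: %s"
                        % (cmd[0], max_attempts, last_error))
```

A caller could wrap a query such as `run_with_retries(["condor_q"], timeout=120)` in a `try`/`except CommandFailed` block, log the message, and exit cleanly. The `sleep` parameter exists only so that the waits can be replaced in tests.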

Some thoughts:
  • without HTCondor the Factory (or the Frontend) is unable to perform any function:
    • no pilot submission
    • no communication between Factory and Frontend
    • no monitoring (monitoring info can be printed, but it is stale)
  • developers should decide the retry and timeout values (no new parameter)
  • maybe: fail to start if HTCondor is not there
  • increase resiliency during operation: log problems right away but extend waits and retries (interruptions of up to 30 min should be tolerable, since glideins and jobs keep running)
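
To get a feel for what tolerating a 30 min interruption means for the retry schedule, here is a small sketch. The 30 min budget comes from the note above; the 5 s initial wait, the 5 min cap, and the name `backoff_delays` are assumptions chosen only for illustration.

```python
def backoff_delays(base=5.0, cap=300.0, budget=1800.0):
    """Yield waits (seconds) that double up to cap and sum to exactly budget.

    base and cap are illustrative assumptions; budget is the 30 min
    interruption that should be tolerated.
    """
    remaining = budget
    attempt = 0
    while remaining > 0:
        # Never wait longer than the cap or than what is left of the budget.
        delay = min(cap, base * 2 ** attempt, remaining)
        yield delay
        remaining -= delay
        attempt += 1


if __name__ == "__main__":
    delays = list(backoff_delays())
    print(len(delays), "waits, total", sum(delays), "s")
    print(delays)
```

With these values the schedule is 5, 10, 20, 40, 80, 160 s, then four waits of 300 s and a final one of 285 s: 11 waits covering 1800 s. Capping the wait keeps the Factory from taking too long to notice that HTCondor is back after the outage ends.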

History

#1 Updated by Marco Mambelli 3 months ago

  • Target version changed from v3_6_6 to v3_6_7

#2 Updated by Marco Mambelli 13 days ago

  • Target version changed from v3_6_7 to v3_7_4
