Avoid crashing the Factory (and the Frontend) if HTCondor is down or unresponsive
Currently, both the Frontend and the Factory fail and at times crash if HTCondor is unresponsive for a long enough time.
The exit, the printed messages, and the log entries should be more controlled.
A stack trace is never a good way to end a program.
Steve Tim (HEPCloud) suggested adding exponential back-off and extending the retries.
In the HEPCloud Factory, HTCondor was unresponsive for 10-15 min during FNAL security scans, and this crashed the Factory.
After the scan HTCondor restarted, but the Factory stayed down.
The service had to be restarted by a watchdog script.
We should determine the best behavior and implement it:
- each operation (condor_status, condor_q, advertise, condor_submit) has a timeout. Exit if the operation fails or times out multiple times
- should the operator choose the number of retries? or the timeouts?
- should there be exponential backoff and a more resilient approach (more attempts, longer timeouts)?
- without HTCondor the Factory (or the Frontend) is unable to perform any function:
  - no pilot submission
  - no communication between Factory and Frontend
  - no monitoring (can print monitoring info but it is stale)
- developers should decide (no new parameter)
- maybe: fail to start if HTCondor is not there
- increase resiliency during operation: log problems right away but extend waits and retries (interruptions of up to 30 min should be tolerable, since glideins and jobs keep running)
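
One possible shape for the options above is sketched below. It is not taken from the GlideinWMS code: the names `backoff_delays`, `retry_with_backoff`, `run_condor_command` and `CondorUnavailableError` are hypothetical, and the 5 s initial wait, 300 s cap and 60 s per-command timeout are assumed values. Only the 30 min (1800 s) total comes from this ticket. Each failure is logged right away, the waits grow exponentially, and the caller gets a single controlled exception instead of a stack trace from deep inside the command wrapper.

```python
import logging
import subprocess
import time

log = logging.getLogger("condor_retry")  # hypothetical logger name

# Failures treated as "HTCondor is temporarily unresponsive".
# A missing binary (FileNotFoundError) is deliberately not retried.
RETRYABLE = (subprocess.TimeoutExpired, subprocess.CalledProcessError)


class CondorUnavailableError(RuntimeError):
    """HTCondor stayed unresponsive for longer than the tolerated outage."""


def backoff_delays(initial=5.0, factor=2.0, max_delay=300.0, max_total=1800.0):
    """Yield exponentially growing waits, capped per wait and in total.

    The total counts only the waits, not the time spent inside the
    operations themselves (for example, their timeouts).
    """
    delay, total = initial, 0.0
    while total < max_total:
        wait = min(delay, max_delay, max_total - total)
        yield wait
        total += wait
        delay *= factor


def retry_with_backoff(operation, delays, sleep=time.sleep):
    """Call operation() until it succeeds or the delays run out."""
    delays = iter(delays)
    while True:
        try:
            return operation()
        except RETRYABLE as err:
            wait = next(delays, None)
            if wait is None:
                raise CondorUnavailableError(f"giving up: {err}") from err
            # Log the problem immediately, then wait before the next attempt.
            log.warning("HTCondor operation failed (%s); retrying in %.0f s", err, wait)
            sleep(wait)


def run_condor_command(args, timeout=60):
    """Run one HTCondor command line tool with a per-call timeout (assumed 60 s)."""
    done = subprocess.run(args, capture_output=True, text=True,
                          timeout=timeout, check=True)
    return done.stdout
```

A call could look like `retry_with_backoff(lambda: run_condor_command(["condor_status", "-total"]), backoff_delays())`. With the assumed defaults the waits are 5, 10, 20, 40, 80 and 160 s, then capped at 300 s until the 1800 s budget is used up. Keeping the delay schedule in a separate generator means the developers can change the policy in one place without exposing a new operator parameter.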