Add a configurable limit to the rate of jobs running and fail the glidein if the rate is exceeded
HEPcloud, Fermilab, CMS
Sometimes a worker node passes all the tests and the glidein starts OK, but all jobs keep failing for some internal reason (something VO-specific, not covered by the tests).
It would be nice to have a settable threshold to recognize this situation: JOB_RATE_FAILURE_TRIGGER (e.g. the number of jobs completed in the last 5 min).
Whenever too many jobs complete too quickly, the glidein should retire and apply all the preventive measures used when tests fail, to avoid black-hole effects (e.g. holding the node for an additional 20 min).
The rate can be evaluated in different ways, but the documentation should be clear about the choice made:
- every 3 or 5 minutes
- only at the beginning, evaluated periodically, or as a moving average every minute
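As a purely illustrative sketch of what such a knob could look like, assuming it would be exposed as a regular attribute in the Frontend configuration XML (the name, type, and default value here are placeholders, not an agreed interface):

<!-- hypothetical knob; 0 could mean "detection disabled" -->
<attr name="JOB_RATE_FAILURE_TRIGGER" glidein_publish="true" job_publish="false" parameter="true" type="int" value="0"/>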
This was suggested in a discussion w/ Tony Tiradani
TODO: This ticket covers the setup of the variables that support blackhole detection: development of the whole mechanism, publishing attributes in the classAd, and writing to the glidein logs at the end of the job.
Since it's trickier, the logging part will be covered in a separate ticket: #23253
#6 Updated by Lorena Lobato Pardavila over 1 year ago
We commented that the solution might involve a STARTD_CRON job that tracks RecentJobStarts (recent = past 20 min) against some threshold and, if that threshold is met, sets the START expression to FALSE or something like that.
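A minimal condor-configuration sketch of that idea (the macro name and the threshold of 20 starts are illustrative assumptions, not an agreed interface):

# Illustrative only: refuse to match new jobs once too many jobs
# have started within the recent (20 min) statistics window.
BLACKHOLE_MAX_RECENT_STARTS = 20
START = ($(START)) && (RecentJobStarts < $(BLACKHOLE_MAX_RECENT_STARTS))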
After several discussions with the HTCondor team, they've decided on a simpler implementation (http://condor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6698). As commented above, the possible solution implies marking the slot as a black hole if jobs are starting/finishing too quickly, and choosing a good default for the amount of time that passes before it's no longer marked as a black hole.
I'll follow this up
#7 Updated by Lorena Lobato Pardavila over 1 year ago
I was talking to the HTCondor team, but I was told that I shouldn't rely on this being fixed any time soon, because the person this ticket is assigned to is also going to be really busy writing the annual NSF report, so he'll likely be unresponsive.
I reported this at the GlideinWMS meeting and we agreed to talk to him during the CMS-HTCondor meeting to get a clearer picture of the situation. If we see no possibility of having this fixed in the short term, we might leave it for the next release and wait for their fix.
#9 Updated by Lorena Lobato Pardavila over 1 year ago
Today, Marco and I had a call with Brian Lin and TJ (the person who is going to implement it) to discuss and clarify our request. New updates:
There was confusion about when and how the averages are calculated.
We thought that they were gathering information for 20 min and updating every 4 min.
RecentJobBusyTimeAvg is a “Miron statistic” about the last 20 min, including count, sum, max, min, std dev, and avg of the value.
The value is calculated using 5 buckets, each with stats for a 4 min interval. As time passes, the oldest bucket is dropped and a new one is started (covering the next 4 min).
The values are updated each time a job completes (starter terminates).
This will be available in the next development release (8.7.10), probably in a month.
The values are available immediately if the startd is queried directly (START expression, direct query from a client, or startd cron script). The START expression is evaluated every time a job starts, so it is a good place for the check.
JobStarts counts the jobs that started. There could be many in partitionable slots, so it's not accurate. For our purposes, jobs that ended (starter exited) are more important; they will publish this as well.
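For example, the check could live directly in the START expression (a sketch; the 60-second threshold and the macro name are illustrative). Note that this naive version also rejects jobs before any starter has exited, since the average starts out undefined; see the next comment for that case:

# Illustrative: stop matching when starters have, on average, been
# exiting very quickly over the recent (20 min) window.
BLACKHOLE_MIN_AVG_LIFETIME = 60
START = ($(START)) && (RecentJobBusyTimeAvg > $(BLACKHOLE_MIN_AVG_LIFETIME))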
#10 Updated by Lorena Lobato Pardavila over 1 year ago
For the case of JobBusyTime where no jobs have completed yet, and after discussion with the HTCondor team, we have all decided to leave the value undefined. The reason is that it allows differentiating between the case where no jobs have completed yet and the case where RecentJobBusyTimeAvg drops to 0.
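With that convention, the check sketched in the previous comment can treat the undefined case explicitly (same illustrative macros as above):

# Undefined means no starter has exited yet: not (yet) a black hole.
# A small defined value means starters are exiting almost immediately.
IS_BLACKHOLE = (RecentJobBusyTimeAvg =!= UNDEFINED) && \
               (RecentJobBusyTimeAvg < $(BLACKHOLE_MIN_AVG_LIFETIME))
START = ($(START)) && !($(IS_BLACKHOLE))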
#11 Updated by Lorena Lobato Pardavila over 1 year ago
- If the average lifetime of a starter drops below some threshold and the glidein stops accepting jobs, then that average will never change again, and the glidein will keep refusing jobs until it dies (around 20 min later).
- If the glidein is working and then stops working, then yes, it will take a few jobs for the average to drop to the point where we detect the problem. Otherwise a single short job would have the potential to take down your whole pool; with a lifetime average, a single short job will only quickly kill off a startd that doesn't have a record of success. This is why the 20 minute windowed average is not as useful as the overall lifetime average.
These two new probes will generate attributes in each slot ad when enabled. They won't be published in the classAd by default, as there are a lot of them. If we want them published, we can add the following to the configuration of the execute nodes:
STARTD.STATISTICS_TO_PUBLISH_LIST = $(STATISTICS_TO_PUBLISH_LIST) JobDuration, JobBusyTime
For the record: Due to the way the statistics probes work, the qualifier Avg will appear at the end of the attribute name, not at the beginning.
An example: for two 2-minute jobs, you would see something like this:
JobBusyTimeAvg = 122.2497668266296
JobBusyTimeCount = 2
JobBusyTimeMax = 122.2925918102264
JobBusyTimeMin = 122.2069418430328
JobDurationAvg = 120.0414055
JobDurationCount = 2
JobDurationMax = 120.061154
JobDurationMin = 120.021657
RecentJobBusyTimeAvg = 122.2497668266296
RecentJobBusyTimeCount = 2
RecentJobBusyTimeMax = 122.2925918102264
RecentJobBusyTimeMin = 122.2069418430328
RecentJobDurationAvg = 120.0414055
RecentJobDurationCount = 2
RecentJobDurationMax = 120.061154
RecentJobDurationMin = 120.021657
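Once published, the values can be inspected with condor_status, for example (attribute names as above):

condor_status -af Name JobDurationCount JobDurationAvg RecentJobDurationAvg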
#15 Updated by Lorena Lobato Pardavila 8 months ago
Some updates after research and discussions with stakeholders and the GWMS team:
- The glidein will go into the retire state (rather than shutdown) and wait at least 20 min, even longer if there are jobs that still need to complete. This avoids a new glidein starting right away on the same node and draining jobs again.
- I will change the classAd to signal that the STARTD is considered a black hole (recording which limit was triggered). The new attributes would have names like e.g. IS_BLACKHOLE_TRIGGERED=true/false, BLACKHOLE_REASON=a string explaining the condition and containing the limit, plus the limit variable itself (published by condor). A sketch follows after this list.
- Threshold: job_rate_failure_trigger. This attribute can be specified both at the global and at the group level. If it is not specified for a group, the global value is used; if it is specified for a group, it overrides the global value. This attribute holds a number and we could have three different values:
- 0: the default; it means unlimited.
- 1/None: use whatever is set globally (for example, when you have defined the attribute globally and don't want a group-specific value).
- a value (number of jobs / time).
Note: Eventually we'll add a (time) attribute which will specify the time window of the limit (num jobs / time) to play with. By default, we use the “Recent” attributes, which calculate the rates over the last 20 min.
- On the Factory operators' side, it was suggested to add some lines related to blackhole detection to the Frontend group logs to help troubleshooting. Something like: “X glideins are retiring because of high failure rate”. I am investigating how difficult it would be to add this information to the Frontend logs.
- Stats are based on all jobs and are determined by measuring the lifetime of each Starter that has exited. The attributes will be undefined until the first Starter has exited, and jobs that never started are not counted. We discussed with the HTCondor team the possibility of NOT taking succeeded jobs into consideration, meaning the stats would be based only on the jobs that failed during the Starter lifetime. Unfortunately, the HTCondor team confirmed that there is no way to do this with the current code. Any job that exits in a short amount of time must be presumed to have failed, regardless of the exit code of the job.
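As referenced above, a hedged condor-configuration sketch of the classAd side, assuming glideinWMS translates job_rate_failure_trigger into a condor-side limit on the recent average starter lifetime (the attribute names match the proposal above; the macros, expressions, and reason string are illustrative, not the final implementation):

# Illustrative sketch, not the final implementation.
BLACKHOLE_MIN_AVG_LIFETIME = 60
IS_BLACKHOLE_TRIGGERED = (RecentJobBusyTimeAvg =!= UNDEFINED) && \
                         (RecentJobBusyTimeAvg < $(BLACKHOLE_MIN_AVG_LIFETIME))
BLACKHOLE_REASON = ifThenElse($(IS_BLACKHOLE_TRIGGERED), \
    "RecentJobBusyTimeAvg below $(BLACKHOLE_MIN_AVG_LIFETIME)s", "")
# Publish both attributes in the slot classAd.
STARTD_ATTRS = $(STARTD_ATTRS) IS_BLACKHOLE_TRIGGERED BLACKHOLE_REASON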
#22 Updated by Lorena Lobato Pardavila 3 months ago
Changes done in v35/19214 (mechanism + documentation).
The pool was added to landscape to better verify that the attributes in the machine classAd are being published correctly. Unfortunately, the values of attributes like GLIDEIN_BLACKHOLE (True or False) are shown in landscape as the internal unevaluated expression (instead of the boolean value). We're trying to figure out which HTCondor macro would help solve this.
#23 Updated by Lorena Lobato Pardavila 2 months ago
We have realized there is a misconfigured attribute in our blackhole-detection configuration. The attribute was not using the last value during evaluation to calculate the next one in the classAd, creating an inconsistency.
Marco has asked the HTCondor team if there is any way to retain the values from the machine classAd when condor ends. After discussion, it seems the HTCondor team will add a new straightforward feature to add them to the classAd and also to the StartdLog (which also makes our life easier for #23253).
#24 Updated by Lorena Lobato Pardavila about 2 months ago
Updates from HTCondor team:
- Print the last machine classad in the startd log. Already implemented; hopefully it will be added to the next release.
- Add history to machine ads (similar to job ad updates):