Need more prompt fake running glidein detection
The OSG GFactory has noticed that for some sites the rundiff often grows really fast, resulting in very few glideins actually running on the site most of the time.
We need tools that allow us to spot the fake running glideins, so we can promptly remove them.
#2 Updated by Igor Sfiligoi over 5 years ago
My suggested solution would be to periodically poll all the served VO collectors, and compare the advertised glidein attributes with what we have in our own queues.
There is enough information in the glidein classads to make the exact mapping.
If a glidein is not seen several times in a row (to account for both delays and occasional classad drops), it is marked as "fake running".
Automatic removal may be an option, but having a "report only" mode would also be highly desirable.
- Do we have the collector information we need?
- What kind of security envelope do we want to operate under? (The GF itself does not have a credential.)
- How often do we want to poll, and how many missed polls are needed to declare a glidein fake?
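The "not seen several times in a row" rule could be sketched roughly as below. This is only an illustration, not existing glideinWMS code: the class name `FakeRunningTracker`, the `max_missed` threshold, and the use of plain string glidein IDs are all assumptions, and the two input sets are expected to come from whatever code queries the factory queues and the VO collectors.

```python
class FakeRunningTracker:
    """Hypothetical helper: counts consecutive polls in which a glidein
    that is running in our queue was not seen in any VO collector."""

    def __init__(self, max_missed=3):
        # max_missed is an assumed default; the right value is an open question
        self.max_missed = max_missed
        self.missed = {}  # glidein id -> consecutive missed polls

    def update(self, running_in_queue, seen_in_collectors):
        """Record one poll and return the glideins currently flagged as fake running."""
        running = set(running_in_queue)
        seen = set(seen_in_collectors)

        # Forget glideins that are no longer running in our queue
        for gid in list(self.missed):
            if gid not in running:
                del self.missed[gid]

        for gid in running:
            if gid in seen:
                # Seen at least once: reset the streak (tolerates classad drops)
                self.missed.pop(gid, None)
            else:
                self.missed[gid] = self.missed.get(gid, 0) + 1

        return sorted(g for g, n in self.missed.items() if n >= self.max_missed)
```

In a "report only" mode the returned list would simply be printed or logged; automatic removal would act on the same list.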
#3 Updated by Parag Mhashilkar over 5 years ago
- Target version set to v3_2_x
To start with, we can come up with a standalone tool that operates using the info in the glideclient classad.
- We already have info about the collector in the form of GlideinParamGLIDEIN_Collector = "fermicloud030.fnal.gov:9619-9620"
- We can use switchboard and VO specified proxies for querying if need be, but in most cases collectors are open for read access
- Once we understand the implications of running it as a tool, we can decide on the frequency of polling
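As a rough sketch of what the standalone tool would have to do with that attribute, the value can be expanded into one endpoint per port. The function name `expand_collectors` is hypothetical, and the code assumes the value is a comma-separated list of `host:port` or `host:first-last` entries, as in the example above; other address forms are not handled.

```python
def expand_collectors(value):
    """Expand 'host:9619-9620,other:9618' into a list of host:port strings.

    Assumes comma-separated entries of the form host, host:port or
    host:first-last (an assumption based on the example value).
    """
    endpoints = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, ports = entry.partition(":")
        if not sep:
            # No port given: keep the bare host name
            endpoints.append(host)
            continue
        first, dash, last = ports.partition("-")
        low = int(first)
        high = int(last) if dash else low
        endpoints.extend(f"{host}:{port}" for port in range(low, high + 1))
    return endpoints
```

Each resulting endpoint would then be queried separately for glidein classads.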
If we do not care about marking each fake glidein and just care about the rundiff, we can use the info in the glideclient classads to account for the total reported glideins.
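A minimal sketch of that aggregate check, assuming the rundiff is the number of glideins running in the factory queues minus the number the clients report as registered. The attribute name `ReportedRunning` is a placeholder, not a real glideclient attribute, and the classads are represented as plain dicts.

```python
def rundiff(factory_running, client_ads, attr="ReportedRunning"):
    """Return factory-side running count minus the total reported by clients.

    attr is a placeholder name; substitute the actual glideclient
    classad attribute that carries the reported glidein count.
    """
    reported = sum(int(ad.get(attr, 0)) for ad in client_ads)
    return factory_running - reported
```

A large positive value for a site would point to fake running glideins there, without identifying which ones.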
#4 Updated by Igor Sfiligoi over 5 years ago
- Priority changed from Normal to High
It's a bit annoying that we only have the sub-collectors, due to the sheer number of them.
But should be manageable nevertheless.
And I am all for having a separate script, at least in the beginning.
PS: Raising the priority, since it is not an isolated problem... OSG GF is again in the same situation for several sites, and it has a significant impact on operations.