Feature #5309

Need more prompt fake running glidein detection

Added by Igor Sfiligoi almost 6 years ago. Updated over 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
Factory
Target version:
Start date:
01/31/2014
Due date:
% Done:

0%

Estimated time:
Stakeholders:

CMS, OSG

Duration:

Description

The OSG GFactory has noticed that for some sites the rundiff often grows really fast, resulting in very few glideins actually running on the site most of the time.

We need tools that allow us to spot the fake running glideins, so we can promptly remove them.

rundiff.png (28.4 KB) Igor Sfiligoi, 01/31/2014 11:52 AM

Related issues

Related to GlideinWMS - Feature #3217: Better stale glidein cleanup or factory (New, 01/03/2013)

Related to GlideinWMS - Idea #3389: Add a Collector for glidein monitoring to the factory (New, 09/08/2014)

History

#1 Updated by Igor Sfiligoi almost 6 years ago

Attached is an example site graph.

#2 Updated by Igor Sfiligoi almost 6 years ago

My suggested solution would be to periodically poll all the served VO collectors, and compare the advertised glidein attributes with what we have in our own queues.
There is enough information in the glidein classads to make the exact mapping.

If a glidein is not seen several times in a row (to account for both delays and occasional classad drops), it is marked as "fake running".
Automatic removal may be an option, but having a "report only" mode would also be highly desirable.
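
A minimal sketch of the "missed several times in a row" bookkeeping, in report-only mode. It assumes the caller has already mapped the glideins in the factory queue and the ones advertised in the VO collectors to a common ID; the class name, the `max_missed` default and the ID strings are hypothetical and not part of GlideinWMS.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRunningTracker:
    """Count consecutive polls in which a running glidein was not seen."""

    max_missed: int = 3  # assumed threshold; the right value is an open question
    missed: dict = field(default_factory=dict)

    def update(self, running_in_queue, seen_in_collectors):
        """Record one poll and return the glidein IDs now considered fake."""
        running = set(running_in_queue)
        seen = set(seen_in_collectors)
        # Forget glideins that have left the factory queue.
        self.missed = {g: n for g, n in self.missed.items() if g in running}
        for glidein in running:
            if glidein in seen:
                self.missed[glidein] = 0  # any sighting resets the counter
            else:
                self.missed[glidein] = self.missed.get(glidein, 0) + 1
        # Report only: nothing is removed here.
        return sorted(g for g, n in self.missed.items() if n >= self.max_missed)
```

Each call to `update` represents one polling cycle; the returned list could feed either a report or, later, an automatic removal step.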

The major potential problems/choices I see are:
  • Do we have the collector information we need?
  • What kind of security envelope do we want to operate under? (The GF itself does not have a credential.)
  • How often do we want to poll, and how many missed polls are needed to declare a glidein fake?

#3 Updated by Parag Mhashilkar almost 6 years ago

  • Target version set to v3_2_x

To start with we can come up with a standalone tool that can operate using the info in glideclient classad.

  • We already have info about the collector in the form of GlideinParamGLIDEIN_Collector = "fermicloud030.fnal.gov:9619-9620"
  • We can use switchboard and VO-specified proxies for querying if need be, but in most cases collectors are open for read access
  • Once we understand the implications of running it as a tool, we can decide on the frequency of polling
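
As a rough sketch, a standalone tool would first have to expand a value like the one above into individual collector endpoints to poll. The helper below is hypothetical and assumes the value is a comma-separated list of `host:port` or `host:low-high` entries; other forms the attribute may take are not handled.

```python
def expand_collectors(value):
    """Expand 'host:low-high' entries into one 'host:port' string per port."""
    endpoints = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, ports = entry.rpartition(":")
        if not sep:
            endpoints.append(entry)  # no port given; keep the entry as-is
            continue
        low, dash, high = ports.partition("-")
        first = int(low)
        last = int(high) if dash else first
        endpoints.extend(f"{host}:{port}" for port in range(first, last + 1))
    return endpoints


print(expand_collectors("fermicloud030.fnal.gov:9619-9620"))
```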

If we do not care about marking each fake glidein and just care about the run diff, we can use info in glideclient classads to account for total reported glideins.
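
The aggregate variant could look roughly like this, assuming the run diff is the number of glideins running in the factory queue minus the total the clients report for the same entry. The function name and the input shapes are hypothetical; extracting the counts from the glideclient classads is left out.

```python
def run_diff_by_entry(factory_running, client_reports):
    """Return {entry: running in factory queue - total reported by clients}."""
    reported = {}
    for entry, count in client_reports:
        # Several clients may report glideins for the same entry.
        reported[entry] = reported.get(entry, 0) + count
    return {
        entry: running - reported.get(entry, 0)
        for entry, running in factory_running.items()
    }
```

A persistently large value for an entry would point at fake running glideins on that site without identifying which ones they are.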

#4 Updated by Igor Sfiligoi almost 6 years ago

  • Priority changed from Normal to High

It's a bit annoying that we only have the sub-collectors, due to the sheer number of them.
But should be manageable nevertheless.

And I am all for having a separate script, at least in the beginning.
PS: Raising the priority, since it is not an isolated problem... OSG GF is again in the same situation for several sites, and it has a significant impact on operations.

#5 Updated by Parag Mhashilkar almost 6 years ago

  • Target version changed from v3_2_x to v3_2_5

#6 Updated by Parag Mhashilkar over 5 years ago

  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

#7 Updated by Igor Sfiligoi over 5 years ago

After discussing with Jeff, we think doing this via #3389 is the best solution.
Let me know if anyone objects.

#8 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_5 to v3_2_6

#9 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_6 to v3_2_7

#10 Updated by Parag Mhashilkar about 5 years ago

  • Target version changed from v3_2_7 to v3_x

#11 Updated by Parag Mhashilkar over 4 years ago

  • Priority changed from High to Normal

#12 Updated by Parag Mhashilkar over 4 years ago

  • Assignee deleted (Igor Sfiligoi)

Igor Sfiligoi's tickets. These were some ideas brewing up in his mind but they never materialized or got priority.


