Milestone #4989

"Why is my job not running"?

Added by Burt Holzman about 6 years ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee: -
Category: Factory
Target version:
Start date: 11/21/2013
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS, OSG
Duration:
Description

This is for tracking the milestones from the Oct-2013 stakeholders' meeting.

Ideas: new monitoring plots, frontend-level tools ("Why does my job not invoke glidein requests?")

frontend_match_ana (6.93 KB), Igor Sfiligoi, 11/21/2013 06:17 PM
frontend_match_list (5.25 KB), Igor Sfiligoi, 11/21/2013 06:17 PM

Related issues

Blocks GlideinWMS - Feature #3203: OSG: condor_q -analyze analogue for glideins (Assigned, 12/21/2012)

History

#1 Updated by Burt Holzman about 6 years ago

  • Target version changed from v3_2_x to v3_2_4

#2 Updated by Burt Holzman about 6 years ago

  • Tracker changed from Feature to Milestone

#3 Updated by Igor Sfiligoi about 6 years ago

A couple of years ago I tried to create such tools, but never managed to get satisfactory output.

Attaching the code as-is. Not claiming it still works, but could be a starting point for anyone
looking for inspiration.

#4 Updated by Igor Sfiligoi about 6 years ago

One tool could answer:
Which FE group x FE does a job map to?

#5 Updated by Parag Mhashilkar almost 6 years ago

  • Assignee changed from Burt Holzman to Parag Mhashilkar

#6 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_4 to v3_2_5

#7 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_5 to v3_2_6

#8 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_6 to v3_2_7

#9 Updated by Parag Mhashilkar about 5 years ago

  • Target version changed from v3_2_7 to v3_2_8

#10 Updated by Parag Mhashilkar about 5 years ago

  • Target version changed from v3_2_8 to v3_2_9

#11 Updated by Brian Bockelman almost 5 years ago

Note - I've got a working script that does precisely this here:

https://github.com/bbockelm/kestrel/blob/master/src/gwms_analyze_job

#12 Updated by Parag Mhashilkar almost 5 years ago

  • Target version changed from v3_2_9 to v3_2_x

#13 Updated by Parag Mhashilkar over 4 years ago

  • Target version changed from v3_2_x to v3_2_12

#14 Updated by Parag Mhashilkar about 4 years ago

  • Target version changed from v3_2_12 to v3_2_13

#15 Updated by Parag Mhashilkar almost 4 years ago

  • Target version changed from v3_2_13 to v3_2_14

#16 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_14 to v3_2_15

#17 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_15 to v3_2_16

#18 Updated by Parag Mhashilkar about 3 years ago

  • Target version changed from v3_2_16 to v3_2_17

#19 Updated by Parag Mhashilkar almost 3 years ago

  • Target version changed from v3_2_17 to v3_2_18

#20 Updated by Parag Mhashilkar almost 3 years ago

  • Assignee changed from Parag Mhashilkar to HyunWoo Kim

#21 Updated by Marco Mambelli almost 3 years ago

  • Target version changed from v3_2_18 to v3_2_19

#22 Updated by HyunWoo Kim over 2 years ago

I have been developing a new script called traceglidein (it is not complete yet).
It tries to answer the following questions (currently it implements only some of them):

1. Focusing on glideresources (matching Jobs and Entries)

Does a given idle job match any Entry (glidefactory, glideresource, glideclient, glidefactoryclient)?
How many glideresources satisfy the match expression for the given idle job?
(A glideresource is the result of matching a group and an Entry; the matching is performed by the FE.)
(match_expr comes from glideinwms/creation/lib/cvWParamDict.py and also from match_expr in /etc/gwms-frontend/frontend.xml)

How many Idle jobs are there (from all registered schedulers) that satisfy the given match expression?
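
A minimal sketch of the counting described in this section, assuming the match expression is a Python expression over two dictionaries named job and glidein (as in frontend.xml). The helper name count_matches and all sample jobs, entries, and attribute values are hypothetical; this is not the traceglidein code.

```python
def count_matches(match_expr, jobs, entries):
    """Return {entry_name: number of jobs satisfying match_expr}.

    match_expr is evaluated once per (job, entry) pair with the names
    `job` and `glidein` bound to plain dictionaries (an assumption).
    """
    code = compile(match_expr, "<match_expr>", "eval")
    counts = {}
    for name, glidein in entries.items():
        matched = 0
        for job in jobs:
            try:
                ok = eval(code, {"__builtins__": {}},
                          {"job": job, "glidein": glidein})
            except (KeyError, AttributeError, TypeError):
                ok = False  # a missing attribute is treated as "no match"
            if ok:
                matched += 1
        counts[name] = matched
    return counts


# Hypothetical idle jobs and entries, for illustration only.
sample_jobs = [
    {"ClusterId": 1, "DESIRED_Sites": "SiteA,SiteB"},
    {"ClusterId": 2, "DESIRED_Sites": "SiteC"},
    {"ClusterId": 3},  # no DESIRED_Sites attribute at all
]
sample_entries = {
    "entry_a": {"attrs": {"GLIDEIN_Site": "SiteA"}},
    "entry_c": {"attrs": {"GLIDEIN_Site": "SiteC"}},
    "entry_d": {"attrs": {"GLIDEIN_Site": "SiteD"}},
}
sample_expr = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'

match_counts = count_matches(sample_expr, sample_jobs, sample_entries)
```

An entry with a count of zero is one the job can never run on; if every entry is at zero, the job is not expected to trigger any glidein request.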

2. Frontend

Does the FE send glidein requests for a given glideresource/glidefactory (Entry) to the Factory via glideclient?
If not, is it because a limit was triggered? Show any limits and curbs triggered, as reported or recorded in glideresource.
From this information (the number of requested glideins), we can guess whether there is a communication problem between Frontend and Factory.

Does the FE communicate properly with the Factory?

3. Factory, Entry

Does the Entry submit glidein_startups to the attached Entry-Scheduler (Grid or Cloud resources)?
Show the number of queued glidein_startups
Show the number of running glidein_startups
Are any limits triggered on the Entry side?
(Look at glidefactoryclient for any curbs and limits triggered for the number of glidein_startup.sh)

Is the grid or cloud resource down?
Does the user have a proper x509 certificate or AWSAccessKey to access the grid/cloud resource?

4. Machine ClassAd

Are there Machine ClassAds in the USRCollector?
(A Machine ClassAd = a running glidein_startup = either a busy glidein or an idle glidein.)
(This might indicate whether the grid or cloud resource is down or working.)
Show the number of busy glideins (running glidein_startups where a user job is running)
Show the number of idle glideins (running glidein_startups where no user job is running)
(A Machine ClassAd is not created in the USRCollector until glidein_startup is running on the grid or cloud side.)
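
The busy/idle count could be sketched as below, assuming the Machine ClassAds have already been fetched and converted to plain dictionaries. Treating State "Claimed" with Activity "Busy" as a busy glidein is a simplification, and the function name and sample ads are hypothetical.

```python
def summarize_glideins(machine_ads):
    """Count busy and idle glideins from a list of Machine ClassAd dicts.

    Assumption: a glidein is busy when its ad reports State "Claimed"
    and Activity "Busy"; every other ad is counted as idle.
    """
    summary = {"busy": 0, "idle": 0}
    for ad in machine_ads:
        is_busy = ad.get("State") == "Claimed" and ad.get("Activity") == "Busy"
        summary["busy" if is_busy else "idle"] += 1
    return summary


# Hypothetical ads, for illustration only.
sample_ads = [
    {"Name": "slot1@glidein_1", "State": "Claimed", "Activity": "Busy"},
    {"Name": "slot1@glidein_2", "State": "Unclaimed", "Activity": "Idle"},
    {"Name": "slot1@glidein_3", "State": "Claimed", "Activity": "Busy"},
]
glidein_summary = summarize_glideins(sample_ads)
```

If both counts are zero, no glidein_startup has reported back yet, which points to the Factory or the grid/cloud resource rather than to job matching.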

5. Job Matching.

Finally, in the USRCollector, do Job and Machine match properly?
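
The five groups of questions above can be read as a chain in which the first failing stage explains why the job is idle. A minimal sketch of that ordering follows, assuming the relevant numbers have already been collected into a dictionary. All field names, stage labels, and messages are hypothetical.

```python
def diagnose(snapshot):
    """Return (stage, reason) for the first failing stage, or None.

    `snapshot` is a dict of already-collected numbers; its keys are
    illustrative and do not correspond to real ClassAd attributes.
    """
    checks = [
        ("matching", lambda s: s["matching_entries"] > 0,
         "no Entry satisfies the job's match expression"),
        ("frontend", lambda s: s["requested_glideins"] > 0,
         "the Frontend is not requesting glideins (check limits and curbs)"),
        ("factory", lambda s: s["queued_glideins"] + s["running_glideins"] > 0,
         "the Entry has not submitted any glidein_startup"),
        ("machine ads", lambda s: s["machine_ads"] > 0,
         "no Machine ClassAd has reached the collector"),
        ("job matching", lambda s: s["job_matches_machine"],
         "glideins are up but the job does not match them"),
    ]
    for stage, passed, reason in checks:
        if not passed(snapshot):
            return stage, reason
    return None  # every stage looks healthy


# Hypothetical snapshot: requests are sent, but nothing was submitted.
sample_snapshot = {
    "matching_entries": 2,
    "requested_glideins": 5,
    "queued_glideins": 0,
    "running_glideins": 0,
    "machine_ads": 0,
    "job_matches_machine": False,
}
first_problem = diagnose(sample_snapshot)
```

Stopping at the first failure keeps the report short: later stages cannot succeed while an earlier one is failing.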

#23 Updated by Marco Mambelli over 2 years ago

  • Target version changed from v3_2_19 to v3_2_20

#24 Updated by Marco Mambelli over 2 years ago

  • Target version changed from v3_2_20 to v3_2_21

#25 Updated by Marco Mambelli about 2 years ago

  • Target version changed from v3_2_21 to v3_2_x

Parag did some work that is in v3/4989
HyunWoo did some work that is in v3/4989_2

The activity in the 2 branches should be merged.
Furthermore, we should try to solve one issue at a time:
1. check that Frontend is fine
2. then check the factory ...
3. ...

#26 Updated by Marco Mambelli about 2 years ago

  • Status changed from Assigned to New
  • Assignee deleted (HyunWoo Kim)

#27 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_2_x to v3_4_x

#28 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_4_x to v3_5_x

#29 Updated by Marco Mambelli 2 months ago

  • Target version changed from v3_5_x to v3_6_x
