"Why is my job not running"?
This is for tracking the milestones from the Oct-2013 stakeholders' meeting.
Ideas: new monitoring plots, frontend-level tools ("Why does my job not invoke glidein requests?")
#3 Updated by Igor Sfiligoi almost 6 years ago
A couple of years ago, I tried to create such tools, but never managed to get satisfactory output.
Attaching the code as-is. I am not claiming it still works, but it could be a starting point for anyone
looking for inspiration.
#11 Updated by Brian Bockelman over 4 years ago
Note - I've got a working script that does precisely this here:
#22 Updated by HyunWoo Kim over 2 years ago
I have been developing a new script called traceglidein (it is not complete yet).
It tries to address the following questions (currently it implements only some of them):
1. Focusing on glideresources (matching Jobs and Entries)
Does a given Idle Job have any Entry (glidefactory, glideresource, glideclient, glidefactoryclient) matched?
How many glideresources with the given idle job satisfy its match expression?
(A glideresource is the result of matching a group and an Entry; the matching is performed by the FE.)
(match_expr comes from glideinwms/creation/lib/cvWParamDict.py and also from match_expr in /etc/gwms-frontend/frontend.xml.)
How many Idle jobs are there (from all registered schedulers) that satisfy the given match expression?
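As a rough illustration of the last question above, the sketch below counts the idle jobs that satisfy a match expression for one Entry. It assumes the match expression is a Python expression that refers to a `job` and a `glidein` dictionary, with Entry attributes stored under `glidein["attrs"]`. The function name `count_matching_idle_jobs`, the sample ads, and the attributes `DESIRED_Sites` and `GLIDEIN_Site` are hypothetical and used only for illustration. Jobs are plain dicts here; a real tool would query them from the registered schedulers.

```python
IDLE = 1  # HTCondor JobStatus code for an idle job


def count_matching_idle_jobs(jobs, glidein, match_expr):
    """Count idle jobs for which match_expr evaluates to a true value.

    jobs: list of dicts standing in for job ClassAds (hypothetical input).
    glidein: dict standing in for one Entry, attributes under "attrs".
    match_expr: Python expression string that may refer to job and glidein.
    """
    code = compile(match_expr, "<match_expr>", "eval")
    matched = 0
    for job in jobs:
        if job.get("JobStatus") != IDLE:
            continue
        try:
            if eval(code, {}, {"job": job, "glidein": glidein}):
                matched += 1
        except Exception:
            # A job lacking an attribute used by the expression counts as no match.
            pass
    return matched


# Made-up sample data, not taken from any real pool.
jobs = [
    {"ClusterId": 1, "JobStatus": 1, "DESIRED_Sites": "SiteA,SiteB"},
    {"ClusterId": 2, "JobStatus": 1, "DESIRED_Sites": "SiteC"},
    {"ClusterId": 3, "JobStatus": 2, "DESIRED_Sites": "SiteA"},
    {"ClusterId": 4, "JobStatus": 1},
]
entry = {"attrs": {"GLIDEIN_Site": "SiteA"}}
expr = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
print(count_matching_idle_jobs(jobs, entry, expr))
```

Evaluating the expression with `eval` is only acceptable because the expression comes from the administrator's own configuration; it should never be applied to untrusted input.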
Does the FE send glidein requests for a given glideresource/glidefactory (Entry) via glideclient to the Factory?
If not, is it because any limits were triggered? Show any limits and curbs triggered, as reported or recorded in the glideresource.
From this information (the number of requested glideins), we can guess whether there is a communication problem between Frontend and Factory: does the FE properly communicate with the Factory?
3. Factory, Entry
Does the Entry submit a number of glidein_startups to the attached Entry-Scheduler (Grid or Cloud resources)?
Show the number of queued glidein_startups.
Show the number of running glidein_startups.
Are any limits triggered on the Entry side? (Look at glidefactoryclient for any curbs and limits triggered for the number of glidein_startup.sh.)
Is the grid or cloud resource down?
Does the user have a proper x509 certificate or AWSAccessKey to access the grid/cloud resource?
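The queued and running counts could be derived from the glidein_startup job ads of one Entry roughly as follows. This is a minimal sketch over plain dicts: the function name `summarize_glidein_jobs` and the sample ads are hypothetical, and it relies only on the standard HTCondor JobStatus codes (1 = Idle, 2 = Running, 5 = Held).

```python
from collections import Counter

# Standard HTCondor JobStatus codes relevant to glidein_startup jobs.
JOB_STATUS_NAMES = {1: "queued", 2: "running", 5: "held"}


def summarize_glidein_jobs(ads):
    """Return a dict counting glidein_startup jobs per status name.

    ads: list of dicts standing in for the job ads of one Entry
    (hypothetical input; a real tool would query the Entry's scheduler).
    """
    counts = Counter(
        JOB_STATUS_NAMES.get(ad.get("JobStatus"), "other") for ad in ads
    )
    return dict(counts)


# Made-up sample data.
sample_ads = [{"JobStatus": 1}, {"JobStatus": 1}, {"JobStatus": 2}, {"JobStatus": 5}]
print(summarize_glidein_jobs(sample_ads))
```

Many held jobs alongside few running ones would point at the resource or the credentials rather than at the Frontend.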
4. Machine ClassAd
Are there Machine ClassAds in the USRCollector? (A Machine ClassAd = a running glidein_startup = either a busy glidein or an idle glidein.) (This might indicate whether the grid or cloud resource is down or working.)
Show the number of busy glideins (running glidein_startup where the actual user job is running).
Show the number of idle glideins (running glidein_startup where the actual user job is NOT running).
(A Machine ClassAd is not created in the USRCollector until glidein_startup is running on the grid or cloud side.)
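A possible way to split Machine ClassAds into busy and idle glideins is sketched below. It assumes that a glidein whose `State` attribute is "Claimed" is running a user job and that one in state "Unclaimed" is idle; ads in any other state are counted only in the total. The function name `count_glideins` and the sample ads are hypothetical, and the ads are plain dicts rather than the result of a real collector query.

```python
def count_glideins(machine_ads):
    """Count busy and idle glideins from machine ads given as dicts.

    Assumption: State == "Claimed" means a user job is running (busy),
    State == "Unclaimed" means the glidein is waiting for a job (idle).
    """
    busy = sum(1 for ad in machine_ads if ad.get("State") == "Claimed")
    idle = sum(1 for ad in machine_ads if ad.get("State") == "Unclaimed")
    return {"total": len(machine_ads), "busy": busy, "idle": idle}


# Made-up sample data.
sample_machine_ads = [
    {"Name": "glidein_1", "State": "Claimed"},
    {"Name": "glidein_2", "State": "Unclaimed"},
    {"Name": "glidein_3", "State": "Claimed"},
    {"Name": "glidein_4", "State": "Preempting"},
]
print(count_glideins(sample_machine_ads))
```

A total of zero suggests that no glidein_startup has started on the grid or cloud side yet.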
5. Job Matching.
Finally, in the USRCollector, do Job and Machine match properly?
#25 Updated by Marco Mambelli almost 2 years ago
- Target version changed from v3_2_21 to v3_2_x
Parag did some work that is in v3/4989
HyunWoo did some work that is in v3/4989_2
The activity in the 2 branches should be merged.
Furthermore, we should try to solve one issue at a time:
1. check that Frontend is fine
2. then check the factory ...
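The "one issue at a time" approach could be organized as an ordered chain of checks that stops at the first failing stage. The sketch below is only an illustration: the function name `first_failing_stage`, the stage names, and the placeholder checks are hypothetical, and real checks would query the Frontend, the Factory, and so on.

```python
def first_failing_stage(checks):
    """Run (name, check) pairs in order and return the first failing name.

    checks: list of (stage name, zero-argument callable returning bool).
    Returns None when every stage passes. Later stages are not run once
    one stage fails, so each problem is looked at in isolation.
    """
    for name, check in checks:
        if not check():
            return name
    return None


# Placeholder checks; real ones would inspect the actual services.
example_checks = [
    ("frontend", lambda: True),
    ("factory", lambda: False),
    ("machine ads", lambda: True),
]
print(first_failing_stage(example_checks))
```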