Bug #18588

Fix factory monitoring

Added by Marco Mambelli over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 12/18/2017
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: Factory Ops
Duration:

Description

This is a continuation of 17825. Some of the numbers are still incorrect.
Here is the email from Jeff:

Hi Marco,

I have glideinWMS 3.2.20 installed on the GOC-ITB, and started re-running some of my tests. I see some issues where the numbers just don't line up quite right. Like I said on the stakeholder meeting, if at all possible, it would be great to push the fixes into 3.2.21.

I probably have more minor cosmetic fix requests later on but those can be postponed to later releases.

For this first test, I submitted 6 single-core jobs; the FE configured the pilots' slot layout to be "fixed" and they are 8-core pilots. Once submitted, we should see 1 running pilot on the queue, with 8 cores claimed, 6 cores utilized, 2 unmatched.

The results are only partially correct. The most correct seems to be the factory_status monitoring. Please see the factory_status.png. Registered cores is correctly showing 8, Running cores is correctly showing 6, Unmatched cores is showing 2. So all the FE stats look like they are displaying correctly. However, the stats from the factory, Frunning cores, which should be 1 running glidein x 8 since GLIDEIN_CPUS is 8, is actually showing 9.

Next, look at factory_stat_now.png. For UCSDSleep, it gets worse. Running cores is oddly only showing 1. Claimed cores is showing 2 instead of 6. Unmatched cores is showing 0 instead of 2. Registered cores is showing 0 instead of 8.

I can confirm from a manual condor_q that we had one 8-core pilot running, and one idle 8-core pilot that hadn't started up yet. It looks like the wrongly counted 9 for "Running cores" also made its way into the schedd_status.xml. The other numbers corresponding to registered, claimed, and unmatched cores all look correct in the xml:
RunningCores="9" (should be 8)

rest ok:
CoresTotal="8" (registered)
CoresRunning="6" (claimed)
CoresIdle="2" (unmatched)
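
The arithmetic behind these expected values can be sketched as follows. This is only an illustration of the numbers in this test, not the factory's actual counting code: the helper `expected_counts` is hypothetical, and it assumes every core of a running pilot has registered with the collector.

```python
# Expected core counts for the test above: one running 8-core fixed-slot
# pilot with six single-core jobs matched to it.
GLIDEIN_CPUS = 8        # cores per pilot, as configured for the entry
running_glideins = 1    # running pilots seen with condor_q
single_core_jobs = 6    # user jobs submitted


def expected_counts(glideins, cpus_per_glidein, claimed_cores):
    """Hypothetical helper: core counts the monitoring should report.

    Assumes all cores of every running pilot are registered.
    """
    total = glideins * cpus_per_glidein
    return {
        "RunningCores": total,                 # factory view
        "CoresTotal": total,                   # registered
        "CoresRunning": claimed_cores,         # claimed
        "CoresIdle": total - claimed_cores,    # unmatched
    }


# Values quoted from schedd_status.xml in this ticket.
reported = {"RunningCores": 9, "CoresTotal": 8, "CoresRunning": 6, "CoresIdle": 2}

expected = expected_counts(running_glideins, GLIDEIN_CPUS, single_core_jobs)
mismatches = {
    name: (reported[name], expected[name])
    for name in expected
    if reported[name] != expected[name]
}
print(mismatches)  # {'RunningCores': (9, 8)}
```

Only RunningCores disagrees with the expectation, which matches the observation above.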

For completeness, I include this xml, the classad of the entry from the factory collector, output of entry_q along with the plot snapshots.

I still have 2 more tests to try, submitting single core jobs to single core pilots, and submitting n core jobs to m core partitionable pilots, but I'd like to get your opinion on what's wrong with the above test before continuing.

Thanks,
Jeff

History

#1 Updated by Marco Mambelli over 1 year ago

Issue 1 (9 instead of 8) was caused by double counting of multicore pilots and has been fixed in branch v3/17825_2.
Issue 2, claimed and unmatched cores not being counted correctly, is still to be solved.

#2 Updated by Marco Mambelli over 1 year ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Dennis Box

The fix is in v3/18588.
The changes fix the problem and improve the JavaScript code in creation/web_base/factoryStatusNow.html.

NOTE FOR THE FUTURE:
For this ticket the JavaScript rewriting was limited because we don't know if this code will stick around.
If the monitoring web pages are kept long term, the JavaScript code should be rewritten to eliminate repetition and to make things more uniform and parametrized. Currently the same hardcoded numbers are repeated in several parts of the code, and instead of using functions, the same tasks are implemented in different places, sometimes in the same way and sometimes differently.
Linting should also be applied, and its suggestions implemented carefully (there are dynamic callbacks).

#3 Updated by Marco Mambelli over 1 year ago

  • Assignee changed from Dennis Box to Parag Mhashilkar

#4 Updated by Parag Mhashilkar over 1 year ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Looks ok to merge.

#5 Updated by Marco Mambelli over 1 year ago

  • Status changed from Feedback to Resolved

#6 Updated by Marco Mambelli over 1 year ago

  • Status changed from Resolved to Assigned

Received some new feedback from Jeff asking for some new features.
Since the changes are small, I'm reopening this ticket. The changes will be in v3/18588_2.

Hi,

Thank you Marco, I re-ran my tests and all the numbers are correct now. I know you're on leave at the moment, but if Parag or someone else can take care of it, I have 2 special requests to solidify the monitoring before 3.2.21 comes out.

1) The definition of Rundiff has to be fixed, please see the plot rundiff.png (this page is accessed when clicking the "troubleshoot" checkmark on the factoryStatusNow page). It was calculated as Rundiff = "Running" - "Registered", but to be consistent with the core counts it must be Rundiff = "Running Cores" - "Registered Cores".

In the png file, this was one 8-core fixed-slot pilot. It shows 1 "Running" pilot, 8 "Registered" (meaning slots) and -7 Rundiff.

Instead I'd expect to see the columns show 8 "Running cores", 8 "Registered Cores" and 0 for rundiff.

2) Get rid of extraneous columns, all they do is add confusion.  To be removed:
Factory status now:
"Max Cores" 
"Idle Cores" 
"Claimed" 
"Unmatched" 
"Registered" 

I think "max cores" and "idle cores" were from your first iteration with me, but I had mentioned before these numbers aren't useful for us.

Also, now that we have the corresponding *core* versions of "claimed, unmatched, registered", the original columns, which are really showing slot counts, have no benefit for site debugging.

Same goes for the factoryStatus plots, we do not want the corresponding:
"Max requested cores" 
"Requested idle cores" 
"Glideins claimed by user jobs" 
"Glideins not matched" 
"Glideins at Collector" 
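
The Rundiff change requested in item 1) can be illustrated with the numbers from rundiff.png. This is a minimal sketch, not the code in factoryStatusNow.html; the function names are hypothetical.

```python
def rundiff_old(running_glideins, registered_slots):
    """Previous definition: running pilots minus registered slots."""
    return running_glideins - registered_slots


def rundiff_cores(running_cores, registered_cores):
    """Requested definition: both terms are expressed in cores."""
    return running_cores - registered_cores


# One 8-core fixed-slot pilot, as in rundiff.png.
print(rundiff_old(1, 8))    # -7: mixes a pilot count with a slot count
print(rundiff_cores(8, 8))  # 0: consistent units
```

With both terms counted in cores, a healthy fully registered pilot gives a Rundiff of 0, so a nonzero value points to a real discrepancy.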

#7 Updated by Marco Mambelli over 1 year ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Marco Mambelli to Dennis Box

New changes are in v3/18588_2
And deployed on http://gwms-dev-factory.fnal.gov/factory/monitor/

#8 Updated by Dennis Box over 1 year ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Dennis Box to Marco Mambelli

ok to merge

#9 Updated by Parag Mhashilkar over 1 year ago

  • Status changed from Resolved to Closed

#10 Updated by Parag Mhashilkar over 1 year ago

  • Stakeholders updated

