Fix factory monitoring
This is a continuation of 17825. Some of the numbers are still incorrect.
Here is the email from Jeff:
Hi Marco,

I have glideinWMS 3.2.20 installed on the GOC-ITB, and started re-running some of my tests. I see some issues where the numbers just don't line up quite right. As I said at the stakeholder meeting, if at all possible, it would be great if we could push fixes to 3.2.21. I will probably have more minor cosmetic fix requests later on, but those can be postponed to later releases.

For this first test, I submitted 6 single-core jobs; the FE configured the pilots' slot layout to be "fixed" and they are 8-core pilots. Once submitted, we should see 1 running pilot on the queue, with 8 cores claimed, 6 cores utilized, 2 unmatched.

The results are only partially correct. The most correct seems to be the factory_status monitoring. Please see factory_status.png. Registered cores is correctly showing 8, Running cores is correctly showing 6, unmatched cores showing 2. So all the FE stats look like they are displaying correctly. However, the stats from the factory, Frunning cores, which should be 1 running glidein x 8 since GLIDEIN_CPUS is 8, is actually showing 9.

Next, look at factory_stat_now.png. For UCSDSleep, it gets worse:
- Running cores is oddly only showing 1.
- Claimed cores is showing 2 instead of 6.
- Unmatched cores is showing 0 instead of 2.
- Registered cores is showing 0 instead of 8.

I can assure you from a manual condor_q that we had one 8-core pilot running, and one idle 8-core pilot that hadn't started up yet.

It looks like the wrongly counted 9 for "Running cores" also made its way into the schedd_status.xml. The other numbers corresponding to registered, claimed, unmatched cores all look correct in the xml:
RunningCores="9" (should be 8)
rest ok:
CoresTotal="8" (registered)
CoresRunning="6" (claimed)
CoresIdle="2" (unmatched)

For completeness, I include this xml, the classad of the entry from the factory collector, and the output of entry_q along with the plot snapshots.

I still have 2 more tests to try, submitting single-core jobs to single-core pilots, and submitting n-core jobs to m-core partitionable pilots, but I'd like to get your opinion on what's wrong with the above test before continuing.

Thanks,
Jeff
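For reference, the numbers Jeff expects for this first test can be reproduced with a few lines of arithmetic. This is only a sketch and not glideinWMS code: the function name is hypothetical, the dictionary keys simply reuse the attribute names quoted from schedd_status.xml, and it assumes fixed-slot pilots that have all registered and are running single-core jobs.

```python
def expected_core_counts(running_pilots, cores_per_pilot, single_core_jobs):
    """Expected monitoring values for fixed-slot pilots (hypothetical helper).

    Assumes every running pilot has registered with the collector and that
    each job uses exactly one core.
    """
    running_cores = running_pilots * cores_per_pilot  # factory view
    registered = running_cores                        # cores seen at the collector
    claimed = min(single_core_jobs, registered)       # cores running user jobs
    unmatched = registered - claimed                  # cores with no job matched
    return {
        "RunningCores": running_cores,
        "CoresTotal": registered,
        "CoresRunning": claimed,
        "CoresIdle": unmatched,
    }


# Jeff's test: one running 8-core pilot, 6 single-core jobs.
print(expected_core_counts(running_pilots=1, cores_per_pilot=8, single_core_jobs=6))
```

With these inputs RunningCores comes out as 8, which is the value the factory should report instead of 9.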
#2 Updated by Marco Mambelli almost 3 years ago
- Status changed from New to Feedback
- Assignee changed from Marco Mambelli to Dennis Box
Fix is in v3/18588
NOTE for THE FUTURE:
Linting should also be applied and suggestions carefully implemented (there are dynamic callbacks)
#6 Updated by Marco Mambelli almost 3 years ago
- Status changed from Resolved to Assigned
Received new feedback from Jeff asking for some new features.
Since the changes are small, I'm reopening this ticket; the changes will be in v3/18588_2
Hi,

Thank you Marco, I re-ran my tests and all the numbers are correct now. I know you're on leave at the moment, but if Parag or someone else can take care of it, I have 2 special requests to solidify the monitoring before 3.2.21 comes out.

1) The definition of Rundiff has to be fixed; please see the plot rundiff.png. (This page is accessed when clicking the "troubleshoot" checkmark on the factoryStatusNow page.) It was calculated as Rundiff = "Running" - "Registered", but to be consistent with the core counts it must be Rundiff = "Running Cores" - "Registered Cores". In the png file, this was one 8-core fixed-slot pilot. It shows 1 "Running" pilot, 8 "registered" (meaning slots) and -7 rundiff. Instead I'd expect to see the columns show 8 "Running cores", 8 "Registered Cores" and 0 for rundiff.

2) Get rid of extraneous columns; all they do is add confusion. To be removed from Factory status now:
"Max Cores"
"Idle Cores"
"Claimed"
"Unmatched"
"Registered"
I think "max cores" and "idle cores" were from your first iteration with me, but I had mentioned before that these numbers aren't useful for us. Also, now that we have the corresponding *core* versions of "claimed, unmatched, registered", the original columns, which really show slot counts, have no benefit for site debugging.

The same goes for the factoryStatus plots; we do not want the corresponding:
"Max requested cores"
"Requested idle cores"
"Glideins claimed by user jobs"
"Glideins not matched"
"Glideins at Collector"
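
The Rundiff change requested in item 1 can be illustrated with the numbers from rundiff.png. This is a minimal sketch with hypothetical function names, not the actual monitoring code; it only shows why mixing pilot counts with slot counts gives -7 while a cores-to-cores comparison gives 0.

```python
def rundiff_pilots_vs_slots(running_pilots, registered_slots):
    """Old definition: running pilots minus registered slots (mixed units)."""
    return running_pilots - registered_slots


def rundiff_cores(running_cores, registered_cores):
    """Requested definition: both terms are counted in cores."""
    return running_cores - registered_cores


# One 8-core fixed-slot pilot, as in rundiff.png.
print(rundiff_pilots_vs_slots(running_pilots=1, registered_slots=8))  # -7
print(rundiff_cores(running_cores=8, registered_cores=8))             # 0
```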