Bug #5239

Factory monitoring broken - significantly scaled up

Added by Igor Sfiligoi over 5 years ago. Updated almost 5 years ago.

Status: Closed
Priority: High
Assignee: Parag Mhashilkar
Category: Factory Monitoring
Target version:
Start date: 03/25/2014
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

The CMS AnaOps team has noticed that the FE monitoring of running glideins is way off;
to be specific, it is 10x lower than what the factory is reporting.

Since the CMS AnaOps FE uses 10 pilot proxies, we suspect that is the root cause, but we are not certain.
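
As a purely hypothetical illustration of how a factor equal to the number of proxies could appear (this is not GlideinWMS code; the function names and numbers below are made up for the example), consider an aggregator that sums one report per proxy. If each report carries the full total instead of that proxy's share, the sum is inflated by the number of proxies:

```python
# Hypothetical sketch only: illustrates an N-fold counting discrepancy.
# Nothing here is taken from the GlideinWMS code base.

def per_proxy_reports(total_running, n_proxies, split):
    """Build one 'running glideins' report per proxy.

    split=True  -> each proxy reports only its own share of the total.
    split=False -> each proxy repeats the full total (the suspected mistake).
    """
    if split:
        base, extra = divmod(total_running, n_proxies)
        return [base + (1 if i < extra else 0) for i in range(n_proxies)]
    return [total_running] * n_proxies


def aggregate(reports):
    """Sum the per-proxy reports, as a naive aggregator might."""
    return sum(reports)


true_total = 500  # made-up number of running glideins
n_proxies = 10    # the CMS AnaOps FE uses 10 pilot proxies

correct = aggregate(per_proxy_reports(true_total, n_proxies, split=True))
inflated = aggregate(per_proxy_reports(true_total, n_proxies, split=False))
print(correct, inflated)
```

Comparing the two sums shows the same 10x ratio seen between the two graphs, although that alone does not say which side is miscounting.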

frontend.png (93.6 KB), Igor Sfiligoi, 01/22/2014 01:39 PM
factory.png (105 KB), Igor Sfiligoi, 01/22/2014 01:39 PM

Subtasks

Bug #5751: #5239 may still have issues with UserRunning (Closed, Parag Mhashilkar)


Related issues

Related to GlideinWMS - Bug #5414: Multiple proxies not being handled correctly - numbers too low (New, 02/11/2014)

Related to GlideinWMS - Bug #6201: Frontend matchmaking times too long after Monitoring patch (Closed, 05/08/2014)

History

#1 Updated by Igor Sfiligoi over 5 years ago

Here are the graphs for one entry on one factory.

#2 Updated by Igor Sfiligoi over 5 years ago

I pinged Parag, and he first suspected a javascriptrrd-related problem.

I tried both 0.6.1 and 1.1.1, and there is no change.

#3 Updated by Igor Sfiligoi over 5 years ago

  • Subject changed from Frontend monitoring broken - significantly scaled down to Factory monitoring broken - significantly scaled up
  • Category changed from Frontend Monitoring to Factory Monitoring

Actually...
looking at condor_status, it looks like the factory numbers are the ones that are 10x what they should be!

#4 Updated by Parag Mhashilkar over 5 years ago

  • Target version set to v3_2_4

#5 Updated by Parag Mhashilkar over 5 years ago

Just an update: the changes so far in v3/5239 fix the factory-side monitoring. The info that comes from frontend monitoring is still broken.

#6 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Fixed the monitoring info. I think I covered different use cases and this is ready for review.

#7 Updated by Marco Mambelli over 5 years ago

Feedback sent to Parag; we discussed the changes, and Parag applied them.

#8 Updated by Marco Mambelli over 5 years ago

  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Reassigning it back to Parag; it's ready to be merged.

PS: I checked self.trust_domain in the Credential in frontend/glideinFrontendInterface.py. It's OK to use the string "None". It is compared with string values written in the XML config file, which may include "None" and "Any" besides the actual trust domain.

#9 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Resolved

Merged to branch_v3_2. Closing.

#10 Updated by Parag Mhashilkar over 5 years ago

Closing the issues that were taken care of in v3.2.4.

#11 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Resolved to Closed

#12 Updated by Igor Sfiligoi over 5 years ago

  • Status changed from Closed to Assigned
  • Target version changed from v3_2_4 to v3_2_5

CMS just upgraded to 3_2_5 and half of the monitoring is now completely broken!
In particular, all glidein-related numbers are advertised as 0.
(PS: we also upgraded the factory to 3_2_5, no difference)
GlideinMonitorGlideinsIdle = 0
GlideinMonitorGlideinsRunning = 0
GlideinMonitorGlideinsTotal = 0
GlideinMonitorRunningHere = 0

Reopening this ticket.
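
The four attributes quoted above can be checked mechanically. The following is a minimal sketch that assumes the values are available as plain `name = value` text with integer values, exactly as pasted in this comment; the helper names `parse_attrs` and `all_zero` are hypothetical and not part of GlideinWMS or HTCondor:

```python
# Minimal sketch: flag a dump in which every GlideinMonitor* value is 0.
# Assumes "name = integer" lines, as quoted in this comment.

DUMP = """\
GlideinMonitorGlideinsIdle = 0
GlideinMonitorGlideinsRunning = 0
GlideinMonitorGlideinsTotal = 0
GlideinMonitorRunningHere = 0
"""


def parse_attrs(text):
    """Parse 'name = value' lines into a dict of integer values."""
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip lines that are not assignments
        attrs[key.strip()] = int(value.strip())
    return attrs


def all_zero(attrs, prefix="GlideinMonitor"):
    """True if at least one attribute has the prefix and all of them are 0."""
    selected = [v for k, v in attrs.items() if k.startswith(prefix)]
    return bool(selected) and all(v == 0 for v in selected)


attrs = parse_attrs(DUMP)
print(all_zero(attrs))
```

Such a check only flags the symptom; it says nothing about whether the zeros are a reporting bug or a real absence of matching glideins.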

#13 Updated by Igor Sfiligoi over 5 years ago

Maybe I spoke too early...
It looks like the FE is indeed properly counting the glideins started with the new factory
(but ignoring the old ones, for reporting purposes).

Waiting a little bit more, but I may be able to re-close the ticket.

#14 Updated by Igor Sfiligoi over 5 years ago

  • Status changed from Assigned to Closed

False alarm indeed. Closing again.

Sorry for the fuss.

#15 Updated by Parag Mhashilkar over 5 years ago

  • Target version changed from v3_2_5 to v3_2_4

PLEASE!!! Do not reopen tickets that were closed as part of released versions. Just create a new ticket pointing to the original ticket.


