Bug #21880

Factory GlideinMonitor* classads appear to be erased periodically

Added by Anthony Tiradani 12 months ago. Updated 10 months ago.

Status: Closed
Priority: High
Category: Factory & Frontend Monitoring
Target version:
Start date: 02/11/2019
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: HEPCloud
Duration:

Description

I installed version 3.4.3-1 of the factory and discovered that the GlideinMonitor* classads appear to be periodically erased. I took snapshots of the classads (via cron + condor_status) and extracted one entry where you can clearly see the values flipping. The output is listed below:

GlideinMonitorTotalClientMonitorGlideRunning = 5292, GlideinMonitorTotalClientMonitorJobsRunning = 10366, GlideinMonitorTotalStatusRunning = 336, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 63, LastHeardFrom = 1549916635, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5292, GlideinMonitorTotalClientMonitorJobsRunning = 10366, GlideinMonitorTotalStatusRunning = 336, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 63, LastHeardFrom = 1549916635, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5292, GlideinMonitorTotalClientMonitorJobsRunning = 10366, GlideinMonitorTotalStatusRunning = 336, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 63, LastHeardFrom = 1549916635, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5292, GlideinMonitorTotalClientMonitorJobsRunning = 10366, GlideinMonitorTotalStatusRunning = 336, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 63, LastHeardFrom = 1549916635, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5292, GlideinMonitorTotalClientMonitorJobsRunning = 10366, GlideinMonitorTotalStatusRunning = 336, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 63, LastHeardFrom = 1549916635, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = undefined, GlideinMonitorTotalClientMonitorJobsRunning = undefined, GlideinMonitorTotalStatusRunning = 0, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 64, LastHeardFrom = 1549916936, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = undefined, GlideinMonitorTotalClientMonitorJobsRunning = undefined, GlideinMonitorTotalStatusRunning = 0, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 64, LastHeardFrom = 1549916936, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = undefined, GlideinMonitorTotalClientMonitorJobsRunning = undefined, GlideinMonitorTotalStatusRunning = 0, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 64, LastHeardFrom = 1549916936, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = undefined, GlideinMonitorTotalClientMonitorJobsRunning = undefined, GlideinMonitorTotalStatusRunning = 0, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 64, LastHeardFrom = 1549916936, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = undefined, GlideinMonitorTotalClientMonitorJobsRunning = undefined, GlideinMonitorTotalStatusRunning = 0, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 64, LastHeardFrom = 1549916936, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5294, GlideinMonitorTotalClientMonitorJobsRunning = 10270, GlideinMonitorTotalStatusRunning = 338, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 65, LastHeardFrom = 1549917236, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5294, GlideinMonitorTotalClientMonitorJobsRunning = 10270, GlideinMonitorTotalStatusRunning = 338, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 65, LastHeardFrom = 1549917236, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5294, GlideinMonitorTotalClientMonitorJobsRunning = 10270, GlideinMonitorTotalStatusRunning = 338, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 65, LastHeardFrom = 1549917236, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5294, GlideinMonitorTotalClientMonitorJobsRunning = 10270, GlideinMonitorTotalStatusRunning = 338, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 65, LastHeardFrom = 1549917236, MyAddress = <X.X.X.X>
GlideinMonitorTotalClientMonitorGlideRunning = 5294, GlideinMonitorTotalClientMonitorJobsRunning = 10270, GlideinMonitorTotalStatusRunning = 338, Name = CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_fermifactory01@gfactory_service_fermifactory01, UpdateSequenceNumber = 65, LastHeardFrom = 1549917236, MyAddress = <X.X.X.X>

Is there some error condition where the factory is unable to complete the classad but is sending an incomplete classad anyway?
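
For anyone wanting to scan a longer snapshot file for the same pattern, here is a minimal Python sketch. It assumes each snapshot line has the comma-separated `Key = value` layout shown above; the function names are hypothetical and not part of GlideinWMS or HTCondor, and the sample lines are trimmed copies of the output above.

```python
import re

# Attributes from the snapshot above that are expected to carry a number.
MONITOR_ATTRS = (
    "GlideinMonitorTotalClientMonitorGlideRunning",
    "GlideinMonitorTotalClientMonitorJobsRunning",
)


def parse_snapshot_line(line):
    """Split one 'Key = value, Key = value, ...' line into a dict of strings."""
    pairs = re.findall(r"(\w+) = ([^,]+)", line)
    return {key: value.strip() for key, value in pairs}


def find_erased_updates(lines):
    """Return the sorted UpdateSequenceNumber values whose monitored
    attributes are missing or undefined."""
    erased = set()
    for line in lines:
        ad = parse_snapshot_line(line)
        if "UpdateSequenceNumber" not in ad:
            continue  # blank or unrelated line
        if any(ad.get(attr, "undefined") == "undefined" for attr in MONITOR_ATTRS):
            erased.add(int(ad["UpdateSequenceNumber"]))
    return sorted(erased)


# Trimmed copies of three snapshot lines from the output above.
sample = [
    "GlideinMonitorTotalClientMonitorGlideRunning = 5292, "
    "GlideinMonitorTotalClientMonitorJobsRunning = 10366, "
    "GlideinMonitorTotalStatusRunning = 336, UpdateSequenceNumber = 63",
    "GlideinMonitorTotalClientMonitorGlideRunning = undefined, "
    "GlideinMonitorTotalClientMonitorJobsRunning = undefined, "
    "GlideinMonitorTotalStatusRunning = 0, UpdateSequenceNumber = 64",
    "GlideinMonitorTotalClientMonitorGlideRunning = 5294, "
    "GlideinMonitorTotalClientMonitorJobsRunning = 10270, "
    "GlideinMonitorTotalStatusRunning = 338, UpdateSequenceNumber = 65",
]

erased_updates = find_erased_updates(sample)
print(erased_updates)
```

Grouping by UpdateSequenceNumber rather than by line means the five identical cron samples per update count only once.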


Related issues

Blocks GlideinWMS - Feature #22163: Check if there are load changes in Factory and solve TODOs added in #21880 (New, 03/19/2019)

Associated revisions

Revision f190a150 (diff)
Added by Marco Mambelli 10 months ago

Feedback changes for #21880

History

#1 Updated by Parag Mhashilkar 11 months ago

  • Target version changed from v3_4_x to v3_5
  • Priority changed from Normal to High
  • Assignee set to Marco Mambelli

#2 Updated by Marco Mambelli 11 months ago

  • Target version changed from v3_5 to v3_4_4

#3 Updated by Marco Mambelli 11 months ago

  • Status changed from New to Work in progress

Progress update. The host presenting the problem is cmssrv280
cmssrv280 is running w/ 3.4.3.rc1 (minimal differences from 3.4.3 - compared w/ tag in git and checked changes)

The problem seems to be caused by glideinFactoryMonitoring.py condorQStats.get_total() being called w/o job statistics.
I noticed that the up/downtime status in schedd_status.xml is incorrect. This happens because set_downtime() is called only in glideFactoryEntry unit_work_v3() (entry.gflFactoryConfig.qc_stats.set_downtime(in_downtime)), which is not called here because there are no requests for this entry.
This is a bug and needs to be fixed.
Downtime info in classad (condor_status -any -constraint 'MyType == "glidefactory"' -af GlideinMonitorTotalStatusIdle GlideinMonitorTotalStatusRunning GLIDEIN_In_Downtime Name) and status (/usr/sbin/gwms-factory statusdown -entry entries) are correct.

Still, the status (from condor_q) should also be updated when there are no requests. This finding does not explain the missing condor_q info.
Check if this is the case considered in [#21741], but it should not be.
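
The downtime part of the finding can be illustrated with a toy model. This is only a sketch of the pattern described above, not the GlideinWMS code and not the change in the fix branch: the class and both `run_cycle_*` functions are hypothetical stand-ins, and only the `set_downtime(in_downtime)` call shape comes from the comment.

```python
class QueueStats:
    """Toy stand-in for a per-entry statistics object (hypothetical)."""

    def __init__(self):
        self.downtime = None  # None means "never set during this cycle"

    def set_downtime(self, in_downtime):
        self.downtime = "Down" if in_downtime else "Up"


def run_cycle_buggy(stats, requests, in_downtime):
    """Downtime is recorded only inside the per-request work,
    so an entry with no requests never gets it set."""
    for _request in requests:
        stats.set_downtime(in_downtime)
    return stats.downtime


def run_cycle_fixed(stats, requests, in_downtime):
    """One possible shape of a fix: record downtime once per cycle,
    whether or not there are requests."""
    stats.set_downtime(in_downtime)
    for _request in requests:
        pass  # per-request work would go here
    return stats.downtime


print(run_cycle_buggy(QueueStats(), [], True))  # downtime never set
print(run_cycle_fixed(QueueStats(), [], True))  # downtime set without requests
```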

#4 Updated by Marco Mambelli 11 months ago

  • Stakeholders updated (diff)

code in v34/21880
Branch code being tested in cmssrv280.

Tested successfully w/ Frontends stopped (no glideclient classad) and glideins removed by hand.
Double check correct handling of glideins in excess.

#5 Updated by Marco Mambelli 10 months ago

  • Assignee changed from Marco Mambelli to Lorena Lobato Pardavila
  • Status changed from Work in progress to Feedback
  • Occurs In v3_4_3 added
  • Occurs In deleted (v3_4_x)

Ready for review, code in v34/21880
The 2nd commit adds glidein sanitization and several changes not affecting the code
Tested on 395/398

#6 Updated by Marco Mambelli 10 months ago

  • Blocks Feature #22163: Check if there are load changes in Factory and solve TODOs added in #21880 added

#7 Updated by Lorena Lobato Pardavila 10 months ago

  • Assignee changed from Lorena Lobato Pardavila to Marco Mambelli
  • Status changed from Feedback to Accepted

Feedback provided by email and slack. Proposed changes were already applied.

It looks good, it can be merged.

#8 Updated by Marco Mambelli 10 months ago

  • Status changed from Accepted to Resolved

#9 Updated by Marco Mambelli 10 months ago

Added branch v34/21880_1 to better handle dictionary modification (problem seen in RC testing)
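
The comment does not say what the dictionary problem was. One common Python failure of this kind is deleting keys from a dictionary while iterating over it, which raises RuntimeError. Assuming that is the sort of issue meant, a minimal sketch of the failure and the usual workaround follows; the function names and the sample job IDs are hypothetical, not taken from the branch.

```python
def remove_stale_unsafe(glideins, stale_ids):
    """Deletes while iterating over the live dict: raises RuntimeError
    on the iteration step after the first deletion."""
    for job_id in glideins:
        if job_id in stale_ids:
            del glideins[job_id]
    return glideins


def remove_stale_safe(glideins, stale_ids):
    """Iterates over a snapshot of the keys, so deleting is safe."""
    for job_id in list(glideins):
        if job_id in stale_ids:
            del glideins[job_id]
    return glideins


# Hypothetical data: job id -> status
jobs = {"1.0": "Running", "2.0": "Idle", "3.0": "Running"}

try:
    remove_stale_unsafe(dict(jobs), {"1.0"})
    unsafe_raised = False
except RuntimeError:
    unsafe_raised = True

cleaned = remove_stale_safe(dict(jobs), {"1.0", "3.0"})
print(unsafe_raised, cleaned)
```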

#10 Updated by Marco Mambelli 10 months ago

  • Status changed from Resolved to Closed

