Bug #21525

Stale running and held glidein numbers reported in factory classads

Added by Marco Mambelli over 1 year ago. Updated over 1 year ago.

Status: Closed
Priority: Urgent
Category: Factory
Target version:
Start date: 12/11/2018
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: HEPCloud
Duration:
Description

It seems that the Factory is not updating the classads when there are no new requests from the Frontend.
This should not happen because the number of glideins can change.

Here is the email from Steve Timm:

This appears to be a bug in the factory itself. 
I am seeing held jobs and running cores reported from CMSHTPC_T3_US_NERSC_Cori_shared
by the factory but there are no such jobs held or running.  However, the data blocks as
reported by the decision engine are accurately replicating these errors in the factory.

There should not be 8 running cores on Cori Shared or 4 jobs held.  Likewise there should not be 36 running cores showing on Edison shared.

[root@cmssrv280 entry_CMSHTPC_T3_US_NERSC_Cori_shared]# condor_status -any -constraint 'MyType=="glidefactory"' -af Name GlideinMonitorTotalStatusHeld GlideinMonitorTotalStatusRunningCores  | grep NERSC
CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_cmssrv280@gfactory_service_cmssrv280 0 0
CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_cmssrv280@gfactory_service_cmssrv280 0 0
CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280 4 8
CMSHTPC_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280 4 0
CMSHTPC_T3_US_NERSC_Edison_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280 0 36

All glideinwms staff have root login to cmssrv280 if you would like to check.

History

#1 Updated by Steven Timm over 1 year ago

Update: if left long enough, some of those fields become NaN and then eventually just undefined.

#2 Updated by Dennis Box over 1 year ago

Hello Steve,
I understand that Marco Mambelli is working on this but I thought I would log on and look at logs and condor_queries to see if I could be helpful. However, I don't appear to have access to cmssrv280. Can you put me in the k5login?
Thanks
Dennis

#3 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_5 to v3_4_3

#4 Updated by Marco Mambelli over 1 year ago

  • Stakeholders updated (diff)

#5 Updated by Steven Timm over 1 year ago

Sorry, I did not see the request from Dennis Box earlier this week to be added to .k5login on cmssrv280.
This is now done.

Is there any update on the underlying bug? Cause known?

Steve Timm

#6 Updated by Steven Timm over 1 year ago

Do we have any schedule for delivery of a release candidate with this bug fixed? HEPCloud integration testing cannot complete until this bug is fixed. I am worried that there has been no update on this ticket at all in the last two weeks.

Steven Timm

#7 Updated by Marco Mambelli over 1 year ago

  • File gwms21525_min.patch added
  • File glideFactoryEntryGroup.py added
  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Marco Mascheroni

Attached are the patch and the changed file.
There are a couple of other changes in the branch, but they are not in code (only comments and strings).

#8 Updated by Marco Mambelli over 1 year ago

A clarification of the changes:
- Before, the Factory was updating the classads only if there were active requests from the Frontend.
- HEPCloud also listens to the classads when there are no active requests, to know the number of glideins running at the different entries.
So the Factory now publishes the classads all the time. This will increase the load on HTCondor but should still be OK.
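
The change in publishing behavior described above can be pictured with a small Python sketch. This is not the attached patch and not GlideinWMS code; `entries_to_advertise`, `active_requests`, and `always_publish` are hypothetical names used only for illustration.

```python
def entries_to_advertise(entries, active_requests, always_publish):
    """Return the entries whose classads get published in one cycle.

    Hypothetical helper for illustration; not GlideinWMS code.
    entries: iterable of entry names
    active_requests: dict mapping entry name -> number of active requests
    always_publish: False models the old behavior, True the new one
    """
    if always_publish:
        return list(entries)
    # Old behavior: only entries with at least one active request.
    return [e for e in entries if active_requests.get(e, 0) > 0]


entries = ["CMSHTPC_T3_US_NERSC_Cori_shared", "CMSHTPC_T3_US_NERSC_Edison"]
requests = {"CMSHTPC_T3_US_NERSC_Edison": 1}
old = entries_to_advertise(entries, requests, always_publish=False)
new = entries_to_advertise(entries, requests, always_publish=True)
```

With the old behavior only the entry that has a request is refreshed, so the classad of the other entry can go stale.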

#9 Updated by Marco Mascheroni over 1 year ago

I discussed this with Marco Mambelli. We believe the same result as the patch can be achieved by setting:

<glidein advertise_delay="1" ...
...

in the Factory configuration. advertise_delay is set to this value in the UCSD factory.

#10 Updated by Marco Mascheroni over 1 year ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli

#11 Updated by Steven Timm over 1 year ago

Before

[root@cmssrv280 ~]# condor_status -any -constraint 'MyType=="glidefactory"' -af GlideinMonitorTotalStatusRunning Name
0 CMSHTPC_T3_US_Bridges@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
6 CMSHTPC_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
80 CMSHTPC_T3_US_NERSC_Edison_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMSHTPC_T3_US_TACC@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMS_T1_US_FNAL_condce2@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMS_T1_US_FNAL_condce3@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMS_T1_US_FNAL_condce4@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 CMS_T1_US_FNAL_condce@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori_KNL@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Edison_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 FNAL_HEPCLOUD_AWS_us-east-1a_m3.2xlarge@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_AWS_us-east-1a_m3.xlarge@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_GOOGLE_us-central1-a_custom-16-32768@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_GOOGLE_us-central1-a_custom-32-65536@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
0 FNAL_HEPCLOUD_GOOGLE_us-central1-a_n1-standard-1@gfactory_instance_cmssrv280@gfactory_service_cmssrv280

OK, I installed the patch and restarted the factory. There was no difference in the behavior: the monitoring values remained undefined in the entries that weren't being used.

I then stopped the factory, changed the advertise delay to 1, reconfigured the factory, then stopped and restarted HTCondor before starting the factory to be sure I had clean classads.

This made every single glidefactory classad have a value of "undefined".

After

[root@cmssrv280 entry_FIFE_T3_US_NERSC_Cori]# condor_status -any -constraint 'MyType=="glidefactory" && GlideinMonitorTotalStatusRunning=?=undefined' -af GlideinMonitorTotalStatusRunning Name
undefined CMSHTPC_T3_US_Bridges@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMSHTPC_T3_US_NERSC_Cori@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMSHTPC_T3_US_NERSC_Cori_KNL@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMSHTPC_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMSHTPC_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMSHTPC_T3_US_NERSC_Edison_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMSHTPC_T3_US_TACC@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMS_T1_US_FNAL_condce2@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMS_T1_US_FNAL_condce3@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMS_T1_US_FNAL_condce4@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined CMS_T1_US_FNAL_condce@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori_KNL@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Edison_shared@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_AWS_us-east-1a_m3.2xlarge@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_AWS_us-east-1a_m3.xlarge@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_GOOGLE_us-central1-a_custom-16-32768@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_GOOGLE_us-central1-a_custom-32-65536@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FNAL_HEPCLOUD_GOOGLE_us-central1-a_n1-standard-1@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
[root@cmssrv280 entry_FIFE_T3_US_NERSC_Cori]#


The factory is now dead in the water and all figures of merit, rather than just a few, are "Not a number" because of the undefined values.

From the emails earlier today it is not clear that you and I are trying to fix the same bug. We need to get into the same room and look at the same condor_status output so you can really see the issue.

Steve Timm

#12 Updated by Steven Timm over 1 year ago

Some further clarification:
Since 21:10 last night when I restarted the cmssrv280 factory with the patch, it has not successfully been able to submit any new glideins and the queue is gradually draining.
Looking at /var/log/gwms-factory/server/entry_CMSHTPC_T3_US_NERSC_Edison. (which is the entry from which the DE is requesting glideins) I see

[2019-01-02 21:10:19,722] DEBUG: glideFactoryEntry:1040: Checking security credentials for client hepcsvc03-fnal-gov_hepcloud_decisionengine.cms_all
[2019-01-02 21:10:19,723] WARNING: glideFactoryEntry:427: Entry CMSHTPC_T3_US_NERSC_Edison has hit the limit for total glideins, cannot submit any more
[2019-01-02 21:10:19,730] WARNING: glideFactoryEntry:1455: Malformed classad for client 693452_CMSHTPC_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280@hepcsvc03-fnal-gov_hepcloud_decisionengine.cms_all, missing web parameters, skipping request.

I am not sure how a patch that differs in just one line of code could make that difference but it clearly did.

Later last night the entry went under its limit, but the other message about the malformed classad continued.

#13 Updated by Steven Timm over 1 year ago

A further clarification, which perhaps was not clear from the initial ticket: the problem is the undefined monitoring values of the glidefactory classads.
Increasing the frequency of classad advertising will only help if you are advertising the right thing. Our code logic in the DE expects values such as GlideinMonitorTotalStatusRunning to never be undefined; they always need to be numbers. We need to discuss the best path to get us there.
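
To make the expectation concrete, here is a minimal Python sketch of a consumer-side check that separates numeric values from undefined ones. It is not the decision engine code; `parse_monitor_output` is a hypothetical helper, and it assumes input lines of the form `<value> <Name>`, as printed by the `condor_status -af GlideinMonitorTotalStatusRunning Name` queries in this ticket.

```python
def parse_monitor_output(text):
    """Map entry name -> int, or None when the value is not a number.

    Hypothetical helper for illustration. Expects "<value> <Name>" lines.
    """
    values = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        raw, name = parts
        # Keep only the entry name, dropping the @gfactory_... suffix.
        entry = name.strip().split("@", 1)[0]
        values[entry] = int(raw) if raw.isdigit() else None
    return values


sample = """\
6 CMSHTPC_T3_US_NERSC_Edison@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
undefined FIFE_T3_US_NERSC_Cori@gfactory_instance_cmssrv280@gfactory_service_cmssrv280
"""
parsed = parse_monitor_output(sample)
# Entries a consumer could not use in numeric calculations.
missing = [name for name, value in parsed.items() if value is None]
```

Any entry that lands in `missing` is one where a numeric figure of merit cannot be computed.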

#14 Updated by Marco Mambelli over 1 year ago

Further clarification. Today I worked with Steve on cmssrv280 to better understand the problem:
it is not a classad publishing frequency problem, an HTCondor bug, or a bug recently introduced in GlideinWMS.

The "Undefined" values are due to the way the current Factory-Frontend protocol works, which can be improved to better suit clients like the decision engine.

The current protocol includes some monitoring information in the glidefactoryclient classad and in a section (Totals) of the glidefactory classad that is published only in response to requests from a client (Frontend). So it is not there until a request is sent, and then it is not updated if there are no interactions with clients.
This information has 3 parts: Requests (from the client), ClientMonitor (monitoring from the client), and Status (from the Factory).

Possible improvements:
1. The Status information could be published from the start (with 0s) and updated regularly, even if there are no requests
2. The Requests and ClientMonitor information could publish 0 values, at least in the totals, if there is no information or update from the clients

With Steve we agreed that:
This ticket will focus on 1.
2 is trickier (it requires the Factory to agree with the client on valid attributes) and I will open a separate ticket for future consideration.
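
A minimal Python sketch of improvement 1, under stated assumptions: the attribute names are the ones queried with condor_status in this ticket, while `status_totals` and its `counts` argument are hypothetical and do not reflect the actual Factory code. The point is only that every Status attribute gets a numeric value, defaulting to 0, even before any client request arrives.

```python
# Attribute names taken from the condor_status queries in this ticket.
STATUS_ATTRS = (
    "GlideinMonitorTotalStatusRunning",
    "GlideinMonitorTotalStatusRunningCores",
    "GlideinMonitorTotalStatusHeld",
)


def status_totals(counts=None):
    """Return Status totals with every attribute defined.

    Hypothetical helper for illustration. counts maps attribute name to
    a number and may be empty or None when nothing is known yet.
    """
    counts = counts or {}
    return {attr: int(counts.get(attr, 0)) for attr in STATUS_ATTRS}


at_startup = status_totals()
later = status_totals({"GlideinMonitorTotalStatusRunning": 6})
```

Publishing `at_startup` right away, and a refreshed dictionary on every cycle, would keep the values numeric for clients that read them without sending requests.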

#15 Updated by Marco Mambelli over 1 year ago

  • Assignee changed from Marco Mambelli to Lorena Lobato Pardavila

#16 Updated by Marco Mambelli over 1 year ago

  • File deleted (gwms21525_min.patch)

#17 Updated by Marco Mambelli over 1 year ago

  • File deleted (glideFactoryEntryGroup.py)

#18 Updated by Lorena Lobato Pardavila over 1 year ago

  • Assignee changed from Lorena Lobato Pardavila to Marco Mambelli
  • Status changed from Feedback to Assigned

#19 Updated by Marco Mambelli over 1 year ago

Removed the patch provided on 1/2. It was not solving the problem. If needed I'll provide a new patch, but it will not be trivial because multiple files are involved and there may be overlap with other changes in 3.4.3.

#20 Updated by Marco Mambelli over 1 year ago

  • Status changed from Assigned to Resolved

Fixed TODOs as suggested in feedback.
This ticket fixed the reported bug.
I also opened a new ticket to revise, with bigger changes, how stats are calculated: [#21741]

#21 Updated by Marco Mambelli over 1 year ago

  • Status changed from Resolved to Closed

