Bug #12914

Factory forgets RFC pilots on proxy renewal

Added by Parag Mhashilkar over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Target version:
Start date: 06/13/2016
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: CMS
Duration:

Description

Begin forwarded message:

From: Brian Bockelman
Subject: Re: Factory "forgets" RFC pilots on proxy renewal
Date: June 11, 2016 at 3:40:42 PM CDT
To: Farrukh Aftab Khan
Cc: Jeffrey Michael Dost, glideinwms-support, James Letts, "cms-htcondor-admins (A group HTCondor experts to debug related issues)"

Hi Farrukh,

Indeed, it looks like the frontend generates a new credential ID for each renewal of RFC proxies - but keeps the same ID for non-RFC proxies.

Try the below patch to have it use the EEC DN instead of proxy DN (warning: untested, but should be obvious what the patch is attempting...).

Looking at a separate frontend I have access to confirms the same behavior - many IDs for RFC-based proxies and a single ID for non-RFC proxies.

Brian

--- a/frontend/glideinFrontendInterface.py
+++ b/frontend/glideinFrontendInterface.py
@@ -29,6 +29,7 @@ from glideinwms.lib import condorMonitor
 from glideinwms.lib import condorManager
 from glideinwms.lib import classadSupport
 from glideinwms.lib import logSupport
+from glideinwms.lib import x509Support

 ############################################################ #
@@ -402,8 +403,7 @@ class Credential:

     def file_id(self,filename,ignoredn=False):
         if (("grid_proxy" in self.type) and not ignoredn):
-            dn_list = condorExe.iexe_cmd("openssl x509 -subject -in %s -noout" % (filename))
-            dn = dn_list[0]
+            dn = x509Support.extract_DN(filename)
             hash_str=filename+dn
         else:
             hash_str=filename
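
To illustrate why hashing the EEC DN instead of the proxy DN keeps the ID stable, here is a minimal sketch. It assumes that RFC 3820 proxies append a numeric CN= component to the issuer's subject and that legacy proxies append CN=proxy or CN=limited proxy. The helpers eec_dn and credential_id are hypothetical stand-ins written for this example; they are not the actual x509Support.extract_DN or file_id implementations, which are not shown in this thread.

```python
import hashlib
import re

# Assumption: a proxy certificate's subject is its issuer's subject plus one
# trailing CN component (numeric for RFC 3820, "proxy"/"limited proxy" for legacy).
_PROXY_CN = re.compile(r"/CN=(?:\d+|proxy|limited proxy)$")


def eec_dn(subject):
    """Hypothetical helper: strip trailing proxy CN components from a subject DN.

    Caveat: an end-entity DN whose last CN is purely numeric would be
    stripped too, so real code should walk the certificate chain instead.
    """
    while True:
        stripped = _PROXY_CN.sub("", subject)
        if stripped == subject:
            return subject
        subject = stripped


def credential_id(filename, dn):
    """Hypothetical stand-in for file_id: a numeric ID derived from filename + DN."""
    digest = hashlib.sha256((filename + dn).encode("utf-8")).hexdigest()
    return int(digest, 16) % 1000000


# The two proxy subjects quoted later in this thread, before and after a renewal.
before = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=765962965"
after = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1586594039"

# "pilot_proxy" is a placeholder filename used only for this example.
print("proxy DN IDs:", credential_id("pilot_proxy", before), credential_id("pilot_proxy", after))
print("EEC DN IDs:  ", credential_id("pilot_proxy", eec_dn(before)), credential_id("pilot_proxy", eec_dn(after)))
```

Because both subjects reduce to the same end-entity DN, the two EEC-based IDs are identical, whereas the proxy-DN-based IDs depend on the per-renewal CN= number.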

On Jun 11, 2016, at 12:48 PM, Farrukh Aftab Khan wrote:

Hi Jeff,

We've now set the proxy generation script to generate a proxy once every 24 hours. For drop off correlation purposes: a new proxy is going to be generated at 15:00 hours (GVA time) every day.

Also, looking at the factory graph, I can confirm that these were the exact times our proxies were being renewed at the frontend.

Best regards,
Farrukh
________________________________
From: Jeffrey Dost
Sent: 11 June 2016 12:03 AM
Subject: Factory "forgets" RFC pilots on proxy renewal

Hi all,

Over the past week since Farrukh changed the CMS global pool to use RFC
compliant proxies for the pilots, we've seen a strange behavior where
the factories just can't seem to ramp up. Please see the monitoring on
our new soon to be replacement factory for CERN:
http://vocms0342.cern.ch/monitor/factoryStatus.html

Note the strange drop offs that seem to happen at regular intervals. I
don't have proof yet but I suspect those are the times the CMS frontend
is renewing its pilot proxies.

Essentially whenever this happens, the factory no longer counts any
previous pilots from before the renewal. For example, for UCSD gw6 we
have 32 idle in the queue:
entry_q CMSHTPC_T2_US_UCSD_gw6 | grep ' I ' | wc -l
32

however in our factory log:
grep 'schedd status' CMSHTPC_T2_US_UCSD_gw6.info.log
[2016-06-10 23:51:01,255] INFO: Client CMSG-v1_0.main (secid:
CMSG-v1_0_cmspilot) schedd status {1: 15, 1002: 15}

the factory only counts 15 idle.

Here, the most recent proxy (credential 64035) matches 15:
entry_q CMSHTPC_T2_US_UCSD_gw6 -const 'jobstatus==1' -af x509userproxy | sort | uniq -c
3 /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_43773
6 /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_507447
4 /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_640004
15 /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_64035
2 /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_850702
2 /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_986123

So the other 17 idle pilots no longer exist from the point of view of
the factory. It looks like whenever RFC proxies are renewed, the CN=
number changes in the proxy subject. So from the factory point of view I
guess it looks like a new pilot proxy rather than a renewed one.
Compare the output for the current valid proxy [1] with that for the most
recent proxy [2].
subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=765962965
subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1586594039
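
The gap between the 32 queued pilots and the 15 the factory counts can be tallied with a short sketch. It assumes the credential ID is the part of the x509userproxy path after the last underscore, and it uses the per-credential counts from the entry_q output above. The names credential_suffix and tally_idle are hypothetical and exist only for this example.

```python
from collections import Counter

# Path prefix and per-credential idle counts copied from the entry_q output above.
PREFIX = ("/var/lib/gwms-factory/client-proxies/user__fecmsglobal/"
          "glidein_v3_2/credential_CMSG-v1_0.main_")
IDLE_PER_CREDENTIAL = {
    "43773": 3,
    "507447": 6,
    "640004": 4,
    "64035": 15,
    "850702": 2,
    "986123": 2,
}


def credential_suffix(proxy_path):
    """Assumption: the credential ID is whatever follows the last underscore."""
    return proxy_path.rsplit("_", 1)[-1]


def tally_idle(proxy_paths, current_id):
    """Return (idle pilots under the current credential, idle pilots under older ones)."""
    counts = Counter(credential_suffix(path) for path in proxy_paths)
    counted = counts.get(current_id, 0)
    forgotten = sum(counts.values()) - counted
    return counted, forgotten


# One path per idle pilot, as 'entry_q ... -af x509userproxy' would list them.
proxy_paths = [PREFIX + cid for cid, n in IDLE_PER_CREDENTIAL.items() for _ in range(n)]

counted, forgotten = tally_idle(proxy_paths, "64035")
print("counted by factory:", counted)
print("no longer counted: ", forgotten)
```

With the numbers from this report, the tally gives 15 pilots under credential 64035 and 17 under the five older credential IDs.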

This is basically breaking submission and monitoring on the factory for
CMS, so this is urgent to get fixed.

James, Farrukh, can you change your proxy renewal to not be so frequent,
maybe once every 12h rather than every 3h? That will at least lessen
the impact. I don't think reverting to non-RFC proxies is an option,
most European cream sites require them at this point.

Jeff Dost
OSG Glidein Factory Operations

[1]
sudo voms-proxy-info -all -file /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_64035
subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=765962965
issuer : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
identity : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
type : RFC3820 compliant impersonation proxy
strength : 1024
path : /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_64035
timeleft : 184:53:57
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO : cms
subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
issuer : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=pilot/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/dcms/Role=NULL/Capability=NULL
attribute : /cms/escms/Role=NULL/Capability=NULL
attribute : /cms/itcms/Role=NULL/Capability=NULL
attribute : /cms/local/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft : 184:53:57
uri : voms2.cern.ch:15002

[2]
sudo voms-proxy-info -all -file /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_507447
subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1586594039
issuer : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
identity : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
type : RFC3820 compliant impersonation proxy
strength : 1024
path : /var/lib/gwms-factory/client-proxies/user__fecmsglobal/glidein_v3_2/credential_CMSG-v1_0.main_507447
timeleft : 179:52:26
key usage : Digital Signature, Key Encipherment
=== VO cms extension information ===
VO : cms
subject : /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
issuer : /DC=ch/DC=cern/OU=computers/CN=voms2.cern.ch
attribute : /cms/Role=pilot/Capability=NULL
attribute : /cms/Role=NULL/Capability=NULL
attribute : /cms/dcms/Role=NULL/Capability=NULL
attribute : /cms/escms/Role=NULL/Capability=NULL
attribute : /cms/itcms/Role=NULL/Capability=NULL
attribute : /cms/local/Role=NULL/Capability=NULL
attribute : /cms/uscms/Role=NULL/Capability=NULL
timeleft : 179:52:26
uri : voms2.cern.ch:15002

History

#1 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from New to Resolved

According to the email chain between the CMS frontend admins and Factory ops, this patch fixes the issue. However, the first time the fix is applied, you should expect similar behavior once more, because the DN used for the credential ID changes again (to the correct one that should be used). After subsequent proxy renewals, factory monitoring, and hence the glideins, should be OK.

Merging it to release branches

#2 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

