Bug #19547

On worker nodes, jobs get an invalid Production proxy

Added by Vito Di Benedetto over 1 year ago. Updated 5 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 04/03/2018
Due date:
% Done: 40%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

When I submit jobs with role Production using the CI, I use a Production proxy generated with my credential.
In the mu2e case it looks like:

subject   : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito/CN=4213116889/CN=2487702558
issuer    : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito/CN=4213116889
identity  : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito/CN=4213116889
type      : RFC compliant proxy
strength  : 1024 bits
path      : /var/tmp/ci.mu2epro.proxy
timeleft  : 149:48:37
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO fermilab extension information ===
VO        : fermilab
subject   : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito
issuer    : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
attribute : /fermilab/mu2e/Role=Production/Capability=NULL
attribute : /fermilab/argoneut/Role=NULL/Capability=NULL
attribute : /fermilab/genie/Role=NULL/Capability=NULL
attribute : /fermilab/gm2/Role=NULL/Capability=NULL
attribute : /fermilab/icarus/Role=NULL/Capability=NULL
attribute : /fermilab/lariat/Role=NULL/Capability=NULL
attribute : /fermilab/minerva/Role=NULL/Capability=NULL
attribute : /fermilab/minos/Role=NULL/Capability=NULL
attribute : /fermilab/mu2e/Role=NULL/Capability=NULL
attribute : /fermilab/Role=NULL/Capability=NULL
attribute : /fermilab/next/Role=NULL/Capability=NULL
attribute : /fermilab/nova/Role=NULL/Capability=NULL
attribute : /fermilab/sbnd/Role=NULL/Capability=NULL
attribute : /fermilab/uboone/Role=NULL/Capability=NULL
timeleft  : 101:48:38
uri       : voms2.fnal.gov:15001

but then, on the worker node, my job gets a proxy generated by the managed proxy service that looks like:

X509_USER_PROXY: /storage/local/data1/condor/execute/dir_34806/x509cc_mu2epro_Production
subject   : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282/CN=638718144
issuer    : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282
identity  : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282
type      : RFC compliant proxy
strength  : 1024 bits
path      : /storage/local/data1/condor/execute/dir_34806/x509cc_mu2epro_Production
timeleft  : 23:36:25
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO fermilab extension information ===
VO        : fermilab
subject   : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov
issuer    : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
attribute : /fermilab/mu2e/Role=Production/Capability=NULL
attribute : /fermilab/mu2e/Role=NULL/Capability=NULL
attribute : /fermilab/Role=NULL/Capability=NULL
timeleft  : 23:36:25
uri       : voms2.fnal.gov:15001

This proxy has five CN=[:digits:] groups, which makes the proxy invalid, so I can't use it within my scripts that run on the worker node.

Is there a way to replace a proxy on the worker node with a brand new one or, alternatively, to get a proxy that has been renewed fewer times?

Sometimes on the worker node I get a Production proxy generated from my credential; in that case the proxy has only three CN=[:digits:] groups.
I see this behavior with both the mu2e and uboone VOs. I haven't had a chance to test this with other VOs.
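For reference, the number of CN=<digits> groups in a subject can be counted with a short Python sketch. The helper name count_numeric_cns is hypothetical and not part of jobsub; the two subjects are copied from the voms-proxy-info output in this description.

```python
import re

# One proxy-delegation component: a CN whose value consists only of digits.
# The lookahead requires the digits to end at the next "/" or at end of string.
NUMERIC_CN = re.compile(r"/CN=[0-9]+(?=/|$)")


def count_numeric_cns(subject: str) -> int:
    """Return how many /CN=<digits> components a subject DN contains.

    Hypothetical helper for illustration only; not taken from jobsub.
    """
    return len(NUMERIC_CN.findall(subject))


# Subject of the proxy used at submission time (from this ticket).
user_proxy = (
    "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory"
    "/OU=People/CN=Vito Di Benedetto/CN=UID:vito"
    "/CN=4213116889/CN=2487702558"
)

# Subject of the managed proxy seen on the worker node (from this ticket).
worker_proxy = (
    "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services"
    "/CN=production/mu2egpvm01.fnal.gov"
    "/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282/CN=638718144"
)

print(count_numeric_cns(user_proxy))    # 2
print(count_numeric_cns(worker_proxy))  # 5
```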

History

#1 Updated by Shreyas Bhat over 1 year ago

  • Status changed from New to Assigned
  • Assignee set to Shreyas Bhat

#2 Updated by Shreyas Bhat about 1 year ago

The server runs the equivalent of the following:

1. Get proxy from myproxy:

X509_USER_CERT=/etc/grid-security/jobsub/fifebatch-hostcert.pem X509_USER_KEY=/etc/grid-security/jobsub/fifebatch-hostkey.pem myproxy-logon -n -l "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Shreyas Bhat/CN=UID:sbhat" -t 24 -s myproxy.fnal.gov -o /tmp/sbhat

2. Run voms-proxy-init on it, store it in ANOTHER file:

voms-proxy-init -noregen -rfc -ignorewarn -valid 24:00 -bits 1024 -cert /tmp/sbhat -key /tmp/sbhat -out /tmp/sbhat_proxy -voms fermilab:/fermilab/nova
...
voms-proxy-info -subject -file /tmp/sbhat_proxy
/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Shreyas Bhat/CN=UID:sbhat/CN=3664933323/CN=810124332/CN=3697910925

No matter how many times I run this voms-proxy-init command, I only end up with three numeric CN strings (the original proxy pulled from myproxy had two).

So the extra CN strings do not come from the jobsub server running voms-proxy-init too many times.

#3 Updated by Shreyas Bhat about 1 year ago

  • Status changed from Assigned to Work in progress

#4 Updated by Shreyas Bhat about 1 year ago

  • % Done changed from 0 to 40

There are two issues here. The first, that the DN gets overridden with a different one, is already solved. That was fixed some time ago using hash_nondefault_proxy for the mu2e jobsub group.

The second problem is that the CI jobs Vito runs use the proxy we push to submit other jobs. One proxy, for example, looks like this:

/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=Robots/CN=larsoftdev6.fnal.gov/CN=cron/CN=Vito Di Benedetto/CN=UID:vito/CN=2434948135/CN=1569208275/CN=1119005695

This has only three numeric CN strings (like any other user proxy). However, because he is using it to submit jobs, jobsub_client checks it, and that check counts every CN=* string, not just those matching CN=[0-9]+. Because this is a cron cert, it already has four CN strings in addition to the three numeric CN strings that are added as proxies are created from this credential. jobsub_client has these lines:

        parts = subject.split('/CN=')
        if len(parts) > 7:
            <ERROR CONDITION>

The seven CN strings mean that parts has 8 elements, and thus the job submission within the CI job fails.

We should correct this so that we do something like the following:

parts = re.findall('/CN=[0-9]+', DN)
if len(parts) > 7:
    <ERROR>
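The difference between the two checks can be seen on the cron DN quoted above. This is only a sketch: the function names rejects_by_split and rejects_by_numeric_cns are hypothetical, the limit of 7 is simply copied from the snippets in this comment, and the code actually merged into jobsub_client may differ.

```python
import re

# The cron-cert proxy subject quoted earlier in this comment.
dn = (
    "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory"
    "/OU=Robots/CN=larsoftdev6.fnal.gov/CN=cron/CN=Vito Di Benedetto"
    "/CN=UID:vito/CN=2434948135/CN=1569208275/CN=1119005695"
)


def rejects_by_split(subject: str, limit: int = 7) -> bool:
    """Current-style check: counts every /CN= component (plus the prefix)."""
    parts = subject.split("/CN=")
    return len(parts) > limit


def rejects_by_numeric_cns(subject: str, limit: int = 7) -> bool:
    """Proposed-style check: counts only /CN=<digits> components."""
    parts = re.findall(r"/CN=[0-9]+", subject)
    return len(parts) > limit


print(len(dn.split("/CN=")))                # 8: seven CN strings plus the prefix
print(len(re.findall(r"/CN=[0-9]+", dn)))   # 3: only the numeric CN strings
print(rejects_by_split(dn))                 # True, submission fails
print(rejects_by_numeric_cns(dn))           # False, submission is allowed
```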

#5 Updated by Shreyas Bhat about 1 year ago

  • Occurs In v1.2.8.2 added

#6 Updated by Shreyas Bhat about 1 year ago

  • Target version set to v1.2.9

#7 Updated by Shreyas Bhat about 1 year ago

  • Assignee changed from Shreyas Bhat to Dennis Box

So this is now fixed in jobsub client, and I've pushed the branch 19547 to redmine. Please check it out, Dennis, and merge it into 1.2.9 if you think it looks good.

On a separate note, the managed proxies (one of which Vito uses) have 5 CN=## strings by the time they make it onto the worker node: two from running myproxy-init on the service certs that generate these proxies, one from the jobsub server pulling down the proxy from myproxy, one from the jobsub server running voms-proxy-init on that credential, and I assume one during transfer to the worker node. This means that Vito's case of needing to use that proxy to submit jobs won't work, since we limit the number of CN strings to 5.

I'll investigate fixing this by changing that first step in a separate ticket, #21111.

#8 Updated by Dennis Box 11 months ago

  • Target version changed from v1.2.9 to v1.2.9.rc_x

#9 Updated by Dennis Box 8 months ago

  • Target version changed from v1.2.9.rc_x to v1.3

#10 Updated by Dennis Box 5 months ago

  • Status changed from Work in progress to Resolved

I am pretty sure this is resolved by #20988. Closing the ticket.


