Jobs on worker nodes get an invalid Production proxy
When I submit jobs with role Production using the CI, I use a Production proxy generated with my credential.
In the mu2e case it looks like:
subject : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito/CN=4213116889/CN=2487702558
issuer : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito/CN=4213116889
identity : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito/CN=4213116889
type : RFC compliant proxy
strength : 1024 bits
path : /var/tmp/ci.mu2epro.proxy
timeleft : 149:48:37
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Vito Di Benedetto/CN=UID:vito
issuer : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
attribute : /fermilab/mu2e/Role=Production/Capability=NULL
attribute : /fermilab/argoneut/Role=NULL/Capability=NULL
attribute : /fermilab/genie/Role=NULL/Capability=NULL
attribute : /fermilab/gm2/Role=NULL/Capability=NULL
attribute : /fermilab/icarus/Role=NULL/Capability=NULL
attribute : /fermilab/lariat/Role=NULL/Capability=NULL
attribute : /fermilab/minerva/Role=NULL/Capability=NULL
attribute : /fermilab/minos/Role=NULL/Capability=NULL
attribute : /fermilab/mu2e/Role=NULL/Capability=NULL
attribute : /fermilab/Role=NULL/Capability=NULL
attribute : /fermilab/next/Role=NULL/Capability=NULL
attribute : /fermilab/nova/Role=NULL/Capability=NULL
attribute : /fermilab/sbnd/Role=NULL/Capability=NULL
attribute : /fermilab/uboone/Role=NULL/Capability=NULL
timeleft : 101:48:38
uri : voms2.fnal.gov:15001
but then, on the worker node, my job gets a proxy generated by the managed proxy service that looks like:
X509_USER_PROXY: /storage/local/data1/condor/execute/dir_34806/x509cc_mu2epro_Production
subject : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282/CN=638718144
issuer : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282
identity : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294/CN=1929993103/CN=1983765282
type : RFC compliant proxy
strength : 1024 bits
path : /storage/local/data1/condor/execute/dir_34806/x509cc_mu2epro_Production
timeleft : 23:36:25
key usage : Digital Signature, Key Encipherment, Data Encipherment
=== VO fermilab extension information ===
VO : fermilab
subject : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=production/mu2egpvm01.fnal.gov
issuer : /DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=voms2.fnal.gov
attribute : /fermilab/mu2e/Role=Production/Capability=NULL
attribute : /fermilab/mu2e/Role=NULL/Capability=NULL
attribute : /fermilab/Role=NULL/Capability=NULL
timeleft : 23:36:25
uri : voms2.fnal.gov:15001
This proxy has five CN=[:digits:] groups, which makes it invalid, so I can't use it within the script that runs on the worker node.
Is there a way to replace the proxy on the worker node with a brand new one or, alternatively, to get a proxy that has been renewed fewer times?
Sometimes on the worker node I get a Production proxy generated from my credential; in that case the proxy has only three CN=[:digits:] groups.
I see this behavior with both the mu2e and uboone VOs; I haven't had a chance to test it with other VOs.
#2 Updated by Shreyas Bhat 8 months ago
The server runs the equivalent of the following:
1. Get proxy from myproxy:
X509_USER_CERT=/etc/grid-security/jobsub/fifebatch-hostcert.pem X509_USER_KEY=/etc/grid-security/jobsub/fifebatch-hostkey.pem myproxy-logon -n -l "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Shreyas Bhat/CN=UID:sbhat" -t 24 -s myproxy.fnal.gov -o /tmp/sbhat
2. Run voms-proxy-init on it, store it in ANOTHER file:
voms-proxy-init -noregen -rfc -ignorewarn -valid 24:00 -bits 1024 -cert /tmp/sbhat -key /tmp/sbhat -out /tmp/sbhat_proxy -voms fermilab:/fermilab/nova
...
voms-proxy-info -subject -file /tmp/sbhat_proxy
/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=People/CN=Shreyas Bhat/CN=UID:sbhat/CN=3664933323/CN=810124332/CN=3697910925
No matter how many times I run this voms-proxy-init command, we only end up with three CN strings (the original proxy pulled from myproxy had two).
So it's not the jobsub server running voms-proxy-init too many times.
#4 Updated by Shreyas Bhat 8 months ago
- % Done changed from 0 to 40
So there are two issues here. One is already solved - that the DN gets overridden with a different one. That was solved some time ago using hash_nondefault_proxy for the mu2e jobsub group.
The second problem is that the CI jobs Vito runs use the proxy we push to submit other jobs. One proxy, for example, looks like this:
/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory/OU=Robots/CN=larsoftdev6.fnal.gov/CN=cron/CN=Vito Di Benedetto/CN=UID:vito/CN=2434948135/CN=1569208275/CN=1119005695
This has only three numeric CN strings (like any other user proxy). However, because he's using it to submit jobs, it goes through jobsub_client, which counts all CN=* strings, not just those matching CN=[0-9]+. Because this is a cron cert, it already has four CN strings in addition to the three numeric CN strings that are added as proxies are created from this credential. jobsub_client has these lines:
parts = subject.split('/CN=')
if len(parts) > 7:
    <ERROR CONDITION>
The seven CN strings mean that parts has 8 elements, and thus the job submission within the CI job fails.
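To check the arithmetic, here is a minimal sketch that applies the same split to the cron-cert proxy subject quoted above. The helper name `split_on_cn` is hypothetical; only the `split('/CN=')` call and the `> 7` comparison come from the jobsub_client lines.

```python
# Subject of the cron-cert proxy quoted in this ticket.
SUBJECT = (
    "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory"
    "/OU=Robots/CN=larsoftdev6.fnal.gov/CN=cron/CN=Vito Di Benedetto"
    "/CN=UID:vito/CN=2434948135/CN=1569208275/CN=1119005695"
)


def split_on_cn(subject):
    """Split the way jobsub_client does (hypothetical wrapper).

    The first element is whatever precedes the first '/CN=', so the list
    is always one longer than the number of '/CN=' occurrences.
    """
    return subject.split('/CN=')


parts = split_on_cn(SUBJECT)
print(len(parts))      # number of elements after the split
print(len(parts) > 7)  # True means the client's error condition fires
```

The leading element of `parts` is the non-CN prefix of the DN, which is why seven CN strings already exceed the threshold.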
We should correct this so that we do something like the following:
parts = re.findall('/CN=[0-9]+', DN)
if len(parts) > 7:
    <ERROR>
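As a rough illustration of that idea, the sketch below counts only the all-digit CN components in two subjects taken from this ticket. The function name `count_numeric_cns` is hypothetical, and the lookahead `(?=/|$)` is an addition that is not in the pattern above; it is there so that a component such as `/CN=123abc` would not be counted as numeric.

```python
import re

# Matches a /CN= component consisting only of digits. The lookahead is an
# assumption added for this sketch, not part of the ticket's pattern.
NUMERIC_CN = re.compile(r'/CN=[0-9]+(?=/|$)')


def count_numeric_cns(dn):
    """Return how many all-digit CN components the DN contains."""
    return len(NUMERIC_CN.findall(dn))


# Cron-cert proxy subject quoted in this ticket.
CRON_PROXY = (
    "/DC=org/DC=cilogon/C=US/O=Fermi National Accelerator Laboratory"
    "/OU=Robots/CN=larsoftdev6.fnal.gov/CN=cron/CN=Vito Di Benedetto"
    "/CN=UID:vito/CN=2434948135/CN=1569208275/CN=1119005695"
)

# Managed proxy subject seen on the worker node, also from this ticket.
WORKER_PROXY = (
    "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services"
    "/CN=production/mu2egpvm01.fnal.gov/CN=1916782200/CN=316828294"
    "/CN=1929993103/CN=1983765282/CN=638718144"
)

print(count_numeric_cns(CRON_PROXY))
print(count_numeric_cns(WORKER_PROXY))
```

Counting this way ignores the non-numeric CN strings that a cron or service certificate carries from the start, so only the proxy delegation depth is measured.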
#7 Updated by Shreyas Bhat 8 months ago
- Assignee changed from Shreyas Bhat to Dennis Box
So this is now fixed in jobsub client, and I've pushed the branch 19547 to redmine. Please check it out, Dennis, and merge it into 1.2.9 if you think it looks good.
On a separate note, the managed proxies (one of which Vito uses) have 5 CN=## strings by the time they make it onto the worker node: two from running myproxy-init on the service certs that generate these proxies, one from the jobsub server pulling down the proxy from myproxy, one from the jobsub server running voms-proxy-init on that credential, and I assume one during transfer to the worker node. This means that Vito's case of needing to use that proxy to submit jobs won't work, since we limit the number of CN strings to 5.
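The tally in the previous paragraph can be sketched as follows. This is only a simulation under stated assumptions: each step appends one `/CN=<number>` per proxy level, mirroring the subjects shown in this ticket; the numbers are random placeholders rather than real values; the worker-node transfer step is the assumed one mentioned above; and the names `STEPS` and `delegate` are hypothetical.

```python
import random

# Service certificate subject, taken from the VO extension block in this ticket.
SERVICE_DN = (
    "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services"
    "/CN=production/mu2egpvm01.fnal.gov"
)

# Steps described in the comment, with the number of proxy levels each adds.
STEPS = [
    ("myproxy-init on the service cert", 2),
    ("jobsub server pulls the proxy from myproxy", 1),
    ("jobsub server runs voms-proxy-init", 1),
    ("transfer to the worker node (assumed)", 1),
]


def delegate(dn, levels, rng):
    """Append one placeholder /CN=<number> component per proxy level."""
    for _ in range(levels):
        dn += "/CN=%d" % rng.randrange(1, 2**32)
    return dn


rng = random.Random(0)  # fixed seed so the placeholder numbers are repeatable
dn = SERVICE_DN
for label, levels in STEPS:
    dn = delegate(dn, levels, rng)

total = sum(levels for _, levels in STEPS)
print(total)  # numeric CN components added on top of the service DN
print(dn)
```

Removing a level from the first step is the only entry in this list that changes the total without touching the jobsub server or the transfer to the worker node.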
I'll investigate fixing this by changing that first step in a separate ticket, #21111