Pilot using an expired proxy when authenticating to the collector
This bug was reported by Justas, who is using a "whole node" setup (160 cores per node) with very long pilots (one week). What is happening is that the credential managed by condor:
is being copied into this file by one of the glideinwms validation scripts (https://github.com/glideinWMS/glideinwms/blob/master/creation/web_base/setup_x509.sh#L166-L173):
This copy is what the startd eventually uses to authenticate to the collector. However, the former is refreshed by condor, while the latter is not:
[root@blade-03-02 ~]# cd /wntmp/condor/execute/dir_164995/
[root@blade-03-02 dir_164995]# openssl x509 -in credential_CMSG-v1_0.custom_start_411868 -text -noout | grep Not
Not Before: Jan 17 15:10:05 2020 GMT
Not After : Jan 20 15:00:08 2020 GMT
[root@blade-03-02 dir_164995]# openssl x509 -in glide_OCJG3i/ticket/myproxy -text -noout | grep Not
Not Before: Jan 14 04:42:40 2020 GMT
Not After : Jan 16 23:00:09 2020 GMT
[root@blade-03-02 dir_164995]# date
Fri Jan 17 07:59:51 PST 2020
And this causes the startd to fail with the following error:
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Error with credential: The proxy credential: /wntmp/condor/execute/dir_164995/glide_OCJG3i/ticket/myproxy
with subject: /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1892936088/CN=83956108
expired 206 minutes ago.
but the pilot stays around executing its pending jobs for a very long time. This is particularly bad for the site because of its specific configuration.
Full pilot logs are available here: https://mmascher.web.cern.ch/mmascher/pilot20200117.tar.gz
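To make the gap between the two credentials explicit, here is a small sketch that redoes the arithmetic on the `openssl` and `date` output quoted above. The helper names (`parse_openssl_time`, `seconds_left`) are made up for illustration and are not part of glideinwms or condor. It also assumes an English locale for the month abbreviations.

```python
from datetime import datetime, timedelta, timezone


def parse_openssl_time(text: str) -> datetime:
    """Parse an openssl validity value such as 'Jan 16 23:00:09 2020 GMT'."""
    stamp = text.strip()
    # openssl prints these timestamps in GMT; drop the suffix and attach UTC explicitly.
    if stamp.endswith("GMT"):
        stamp = stamp[:-3].strip()
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y").replace(tzinfo=timezone.utc)


def seconds_left(not_after: str, now: datetime) -> float:
    """Seconds until expiry; a negative value means the credential has already expired."""
    return (parse_openssl_time(not_after) - now).total_seconds()


# `date` printed "Fri Jan 17 07:59:51 PST 2020"; PST is UTC-8.
now = datetime(2020, 1, 17, 7, 59, 51, tzinfo=timezone(timedelta(hours=-8)))

condor_managed = seconds_left("Jan 20 15:00:08 2020 GMT", now)  # credential_CMSG-v1_0...
glidein_copy = seconds_left("Jan 16 23:00:09 2020 GMT", now)    # glide_OCJG3i/ticket/myproxy

print(timedelta(seconds=condor_managed))  # 2 days, 23:00:17 of validity left
print(timedelta(seconds=-glidein_copy))   # 16:59:42 since the copy expired
```

So at the time of the `date` call the condor-managed credential still had almost three days left, while the copy used by the startd had been expired for about 17 hours. The "206 minutes" in the error message is consistent with that message having been logged earlier, at roughly 02:26 GMT on Jan 17.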
#1 Updated by Marco Mascheroni about 1 month ago
I just did a git blame: this bug has been around for 14 years :o
The thing I would propose is to save the old proxy location in a variable like:
And then periodically copy X509_USER_PROXY_ORIGINAL into X509_USER_PROXY (maybe with a condor cron?). Marco Mambelli, thoughts on this?
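A minimal sketch of what the periodic copy step could look like, assuming both `X509_USER_PROXY_ORIGINAL` (the variable proposed above) and `X509_USER_PROXY` are set in the environment of whatever runs it. The function name `refresh_proxy_copy` is hypothetical, and how it gets scheduled (condor cron or otherwise) is left open.

```python
import os
import shutil
import tempfile


def refresh_proxy_copy(original: str, copy: str) -> bool:
    """Overwrite `copy` with the current contents of `original`.

    Returns True if the copy was refreshed, False if the original is missing or empty
    (in which case the existing copy is left untouched).
    """
    if not os.path.isfile(original) or os.path.getsize(original) == 0:
        return False
    target_dir = os.path.dirname(os.path.abspath(copy))
    # Write to a temp file in the same directory so that os.replace is an atomic
    # rename: a reader never sees a half-written proxy file.
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, prefix=".proxy_refresh_")
    try:
        with os.fdopen(fd, "wb") as dst, open(original, "rb") as src:
            shutil.copyfileobj(src, dst)
        os.chmod(tmp_path, 0o600)  # proxy files must be readable by the owner only
        os.replace(tmp_path, copy)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True


if __name__ == "__main__":
    # Assumption: both variables are exported by the glidein setup script.
    original = os.environ.get("X509_USER_PROXY_ORIGINAL")
    copy = os.environ.get("X509_USER_PROXY")
    if original and copy:
        refresh_proxy_copy(original, copy)
```

Going through a temporary file and a rename is a design choice, not a requirement: it avoids the startd reading a partially written file if it happens to authenticate while the copy is in progress.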
The other comment is that, despite being a 14-year-old bug, this seems a high priority ticket for CMS: the landscape of resources is changing, and we might have more and more resources with a similar setup (large number of cores, long pilots). Until now we were lucky that pilots always lasted around 48 hours, and even when they lasted longer nobody had the advanced monitoring or the interest to spot this (those pilots behave like "draining pilots"). The pilot having 160 cores is what motivated Justas to get to the bottom of this (160 cores means you waste a lot of resources when you are in this fake draining phase).
#2 Updated by Marco Mambelli about 1 month ago
In 2018, v3.4, we introduced GLIDEIN_Ignore_X509_Duration, defaulting to true.
Before then the glidein lifetime was shorter than the initial proxy lifetime.
Proxy renewal should have been considered at the time.
At the time I misunderstood the schedd-collector session handling: I thought the session would last longer than the glidein lifetime, independently of the proxy lifetime, which I thought was used only to establish the session.
A broader discussion is needed to think through the implications of the proposed changes. I asked for a meeting w/ the condor team (Jaime, Todd) to discuss the following:
1. Is there a workaround in the condor configuration that lets us extend the collector-schedd session length? Should the configuration change go in the schedd, the collector, or both?
This would allow a temporary fix for the deployed systems (no sw change).
2. The glidein could periodically run the proxy setup script and keep copying the proxy from the original location, or the glidein code could be changed to point directly there instead of at the copy. Currently, everything is copied into a new directory so that the glidein runs in a sandboxed directory local to the node. This is both for reliability and security: the initial dir depends on the batch system, could be on a shared file system or shared across processes, and could contain info from other jobs running on the node, ...
3. Once tokens are also in the picture, the mechanism used to authenticate w/ the CE could be different from the one used to talk to the collector. We'd like to start transferring our credentials as encrypted files via condor transfer. Is there some automatic renewal mechanism, or should we implement something in glidein-land on top of what condor provides, e.g. using chirp?