Bug #23920

Pilot using an expired proxy when authenticating to the collector

Added by Marco Mascheroni about 1 month ago. Updated 4 days ago.

Status: Feedback
Priority: High
Category: Glidein
Target version:
Start date: 01/17/2020
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: CMS
Duration:

Description

This bug was reported by Justas, who is using a "whole node" setup (160 cores per node) with very long pilots (one week). What is happening is that the credential managed by condor:

X509_USER_PROXY=/wntmp/condor/execute/dir_164995/credential_CMSG-v1_0.custom_start_411868

is being copied into the file below by one of the glideinWMS validation scripts (https://github.com/glideinWMS/glideinwms/blob/master/creation/web_base/setup_x509.sh#L166-L173):

/wntmp/condor/execute/dir_164995/glide_OCJG3i/ticket/myproxy

This copy is what the startd eventually uses to authenticate to the collector. However, condor keeps refreshing the original credential, while the copy is never refreshed:

[root@blade-03-02 ~]# cd /wntmp/condor/execute/dir_164995/
[root@blade-03-02 dir_164995]# openssl x509 -in credential_CMSG-v1_0.custom_start_411868 -text -noout | grep Not
           Not Before: Jan 17 15:10:05 2020 GMT
           Not After : Jan 20 15:00:08 2020 GMT
[root@blade-03-02 dir_164995]# openssl x509 -in glide_OCJG3i/ticket/myproxy -text -noout | grep Not
           Not Before: Jan 14 04:42:40 2020 GMT
           Not After : Jan 16 23:00:09 2020 GMT
[root@blade-03-02 dir_164995]# date
Fri Jan 17 07:59:51 PST 2020

This causes the startd to fail with the following error:

GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_gsi_gssapi: Error with gss credential handle
globus_credential: Error with credential: The proxy credential: /wntmp/condor/execute/dir_164995/glide_OCJG3i/ticket/myproxy
      with subject: /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1892936088/CN=83956108
      expired 206 minutes ago.

However, the pilot stays around for a very long time, executing the pending jobs. This is particularly bad for the site due to its specific configuration.
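The manual check in the transcript above can be scripted. The following is a minimal sketch, not part of glideinWMS: the function names `proxy_is_valid` and `report_proxy_expiry` are hypothetical, and it assumes `openssl` is available on the worker node. It relies on `openssl x509 -checkend N`, which exits 0 only if the first certificate in the file does not expire within the next N seconds.

```shell
#!/bin/bash
# Hypothetical helpers (names are illustrative, not from glideinWMS).

# proxy_is_valid FILE [MIN_SECONDS]
# Returns 0 if the first certificate in FILE is still valid for at least
# MIN_SECONDS more seconds (default 0, i.e. "not expired right now"),
# 1 if it expires sooner or cannot be parsed, 2 if FILE is not readable.
proxy_is_valid() {
    local proxy_file="$1"
    local min_seconds="${2:-0}"
    [ -r "$proxy_file" ] || return 2
    openssl x509 -in "$proxy_file" -noout -checkend "$min_seconds" >/dev/null 2>&1
}

# report_proxy_expiry FILE...
# Prints the notAfter date of each file, so the original credential and
# the glidein copy can be compared side by side.
report_proxy_expiry() {
    local f
    for f in "$@"; do
        printf '%s: ' "$f"
        openssl x509 -in "$f" -noout -enddate 2>/dev/null || echo "unreadable"
    done
}
```

For the case in this ticket, one would call `report_proxy_expiry` on the condor-managed credential and on `glide_OCJG3i/ticket/myproxy`, and could use `proxy_is_valid` on the copy to detect that it is about to expire while the original is still fresh.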

Full pilot logs are available here: https://mmascher.web.cern.ch/mmascher/pilot20200117.tar.gz

History

#1 Updated by Marco Mascheroni about 1 month ago

I just did a git blame: this bug has been around for 14 years :o

https://github.com/glideinWMS/glideinwms/blame/master/creation/web_base/setup_x509.sh#L166-L173

What I would propose is to save the original proxy location in a variable like:

X509_USER_PROXY_ORIGINAL=/wntmp/condor/execute/dir_164995/credential_CMSG-v1_0.custom_start_411868

And then periodically copy X509_USER_PROXY_ORIGINAL into X509_USER_PROXY (maybe with a condor cron?). Marco Mambelli, thoughts on this?
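A minimal sketch of the copy step proposed above, under stated assumptions: `refresh_proxy_copy` is a hypothetical function name, not existing glideinWMS code, and how it gets scheduled (condor cron or otherwise) is left open. It copies the source over the destination only when the contents differ, and goes through a temporary file in the same directory so that a reader such as the startd never sees a half-written proxy.

```shell
#!/bin/bash
# Hypothetical helper (illustrative only, not the glideinWMS implementation).

# refresh_proxy_copy SRC DST
# Copies SRC over DST if their contents differ. Returns 0 on success or
# when DST is already up to date, 1 on any failure (DST is left untouched).
refresh_proxy_copy() {
    local src="$1" dst="$2" tmp
    if [ ! -r "$src" ]; then
        echo "source proxy not readable: $src" >&2
        return 1
    fi
    # Nothing to do if the copy already matches the original
    if [ -e "$dst" ] && cmp -s "$src" "$dst"; then
        return 0
    fi
    # Write to a temporary file next to DST, then rename it into place;
    # a rename within one directory replaces DST atomically
    tmp=$(mktemp "${dst}.XXXXXX") || return 1
    if cp "$src" "$tmp" && chmod 600 "$tmp" && mv -f "$tmp" "$dst"; then
        return 0
    fi
    rm -f "$tmp"
    return 1
}
```

With the variables named in this comment, a periodic job would run `refresh_proxy_copy "$X509_USER_PROXY_ORIGINAL" "$X509_USER_PROXY"`.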

The other comment is that, despite being a 14-year-old bug, this seems to be a high-priority ticket for CMS: the landscape of resources is changing, and we might have more and more resources with a similar setup (large number of cores, long pilots). Until now we were lucky that pilots always lasted around 48 hours, and even if they lasted longer nobody had the advanced monitoring or the interest to spot this (those pilots behave like "draining pilots"). The pilot having 160 cores is what motivated Justas to get to the bottom of this (160 cores means you waste a lot of resources when you are in this fake draining phase).

#2 Updated by Marco Mambelli about 1 month ago

In 2018, v3.4, we introduced GLIDEIN_Ignore_X509_Duration, defaulting to true.
Before then the glidein lifetime was shorter than the initial proxy lifetime.
Proxy renewal should have been considered at the time.
At the time I misunderstood the schedd-collector session handling: I thought the session would last longer than the glidein lifetime, independently of the proxy lifetime, which I thought was used only to establish the session.

A broader discussion is needed to think through the implications of the proposed changes. I asked for a meeting w/ the condor team (Jaime, Todd) to discuss the following:
1. Is there a workaround in the condor configuration that lets us extend the collector-schedd session length? Should the configuration change go in the schedd, the collector, or both?
This would give us a temporary fix for the deployed systems (no sw change).

2. The glidein could periodically run the proxy setup script and keep copying the proxy from the original location, or the glidein code could be changed to point directly there instead of at the copy. Currently, everything is copied into a new directory so that the glidein runs in a sandboxed directory local to the node. This is both for reliability and security: the initial dir depends on the batch system, could be on a shared file system or shared across processes, and could contain info from other jobs running on the node, ...

3. With tokens also in the picture, the mechanism used to authenticate w/ the CE could be different from the one used to talk to the collector. We'd like to start transferring our credentials as encrypted files via condor transfer. Is there some automatic renewal mechanism, or should we implement something in glidein-land on top of what condor provides, e.g. using chirp?

#3 Updated by Marco Mascheroni 25 days ago

  • Assignee changed from Marco Mascheroni to Marco Mambelli
  • Status changed from New to Feedback

Solution number 2 has been implemented in v36/23920

#4 Updated by Marco Mambelli 25 days ago

  • Status changed from Feedback to Accepted

#5 Updated by Marco Mambelli 25 days ago

  • Status changed from Accepted to Feedback

#6 Updated by Marco Mambelli 4 days ago

  • Assignee changed from Marco Mambelli to Marco Mascheroni

