Bug #22779

Not escaped comma in GSI_DAEMON_NAME causing problems

Added by Marco Mambelli 6 months ago. Updated 4 months ago.

Status: Closed
Priority: High
Assignee:
Category: Glidein
Target version:
Start date: 06/20/2019
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

GSI_DAEMON_NAME is used by the master daemon in the Glideins to authenticate.
The new UCSD DN contains a comma (not present in previous DNs), which is not escaped in the GSI_DAEMON_NAME of the Glidein condor configuration.
The comma is normally a list separator, so it breaks the attribute in the configuration/classad.
The problem affects only the master in the Glidein, which cannot advertise itself to the collector (owned by the VO).
It also affects the ability to kill the glideins from the Frontend (glidein_off, limited debug use). The regular cycle and killing from the Factory (via schedd) are OK, so it is not an emergency.
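
To illustrate the failure mode, here is a minimal Python sketch. It only mimics comma-separated list semantics with an exact-match check; it is not HTCondor's actual parser, and the helper names `split_name_list` and `is_trusted` are made up for this example. The DN is the one from the master log in this ticket.

```python
# Illustrative only: mimics a comma-separated list of trusted subjects.
# split_name_list and is_trusted are hypothetical helpers, not HTCondor code.
UCSD_DN = (
    "/DC=org/DC=incommon/C=US/ST=CA/L=La Jolla/"
    "O=University of California, San Diego/OU=UCSD/CN=osg-ligo-1.t2.ucsd.edu"
)


def split_name_list(value):
    """Split a comma-separated setting into trimmed, non-empty entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


def is_trusted(subject, setting):
    """Return True if subject equals one of the entries (exact match only)."""
    return subject in split_name_list(setting)


print(len(split_name_list(UCSD_DN)))   # prints 2: the single DN becomes two entries
print(is_trusted(UCSD_DN, UCSD_DN))    # prints False: no fragment equals the full DN
```

Neither fragment matches the subject presented by the collector, which is consistent with the "is not currently trusted by you" errors in the log.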

The startd in the Glidein uses a mapfile for authentication.
The suggestion is to use a mapfile for the master as well; it may be a different one if the list of DNs (currently in GSI_DAEMON_NAME) is different.
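
As a rough sketch of why a mapfile avoids the comma issue: each entry is a regular expression on its own line, so a comma in the DN is just an ordinary character. The Python below builds one such line, assuming the three-field layout (method, quoted pattern, canonical name). The helper `mapfile_line` and the canonical name `condor` are hypothetical, and this is not the script GlideinWMS actually uses to write its mapfile.

```python
import re

# DN taken from the master log in this ticket.
UCSD_DN = (
    "/DC=org/DC=incommon/C=US/ST=CA/L=La Jolla/"
    "O=University of California, San Diego/OU=UCSD/CN=osg-ligo-1.t2.ucsd.edu"
)


def mapfile_line(dn, canonical_user, method="GSI"):
    """Build one map file line whose anchored pattern matches exactly this DN.

    Hypothetical helper: assumes the DN contains no double quotes.
    """
    pattern = "^" + re.escape(dn) + "$"
    return f'{method} "{pattern}" {canonical_user}'


line = mapfile_line(UCSD_DN, "condor")  # "condor" is an assumed canonical name
print(line)

# Sanity check: the generated pattern matches the DN, comma included.
pattern = line.split('"')[1]
print(re.match(pattern, UCSD_DN) is not None)  # prints True
```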

Also make sure that:
  • eliminating GSI_DAEMON_NAME and the related config changes does not break anything
  • GSI_DAEMON_NAME is not used somewhere else (Frontend or Factory). If it is, it should be replaced with the mapfile, or the content should be escaped to allow commas in the DN

Here is the problem as reported by Edgar:

Indeed this is a wide-scale problem, since all pools at UCSD have the problem. It is not limited to NIKHEF. We looked at a pilot at UCSD and it has the same problem in the master log:

06/12/19 10:16:45 (pid:45400) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5006:Failed to authenticate because the subject '/DC=org/DC=incommon/C=US/ST=CA/L=La Jolla/O=University of California, San Diego/OU=UCSD/CN=osg-ligo-1.t2.ucsd.edu' is not currently trusted by you.  If it should be, add it to GSI_DAEMON_NAME or undefine GSI_DAEMON_NAME.
06/12/19 10:16:45 (pid:45400) CCBListener: connection to CCB server osg-ligo-1.t2.ucsd.edu:9630 failed; will try to reconnect in 60 seconds.
06/12/19 10:17:46 (pid:45400) SECMAN: required authentication with collector osg-ligo-1.t2.ucsd.edu:9630 failed, so aborting command CCB_REGISTER.
06/12/19 10:17:46 (pid:45400) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5006:Failed to authenticate because the subject '/DC=org/DC=incommon/C=US/ST=CA/L=La Jolla/O=University of California, San Diego/OU=UCSD/CN=osg-ligo-1.t2.ucsd.edu' is not currently trusted by you.  If it should be, add it to GSI_DAEMON_NAME or undefine GSI_DAEMON_NAME.
06/12/19 10:17:46 (pid:45400) CCBListener: connection to CCB server osg-ligo-1.t2.ucsd.edu:9630 failed; will try to reconnect in 60 seconds.
06/12/19 10:18:30 (pid:45400) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
06/12/19 10:18:30 (pid:45400) Got SIGTERM. Performing graceful shutdown.
06/12/19 10:18:30 (pid:45400) SECMAN: required authentication with collector osg-ligo-1.t2.ucsd.edu:9622 failed, so aborting command INVALIDATE_MASTER_ADS.
06/12/19 10:18:30 (pid:45400) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5006:Failed to authenticate because the subject '/DC=org/DC=incommon/C=US/ST=CA/L=La Jolla/O=University of California, San Diego/OU=UCSD/CN=osg-ligo-1.t2.ucsd.edu' is not currently trusted by you.  If it should be, add it to GSI_DAEMON_NAME or undefine GSI_DAEMON_NAME.
06/12/19 10:18:30 (pid:45400) Failed to send update to collector osg-ligo-1.t2.ucsd.edu:9622.
06/12/19 10:18:30 (pid:45400) All daemons are gone.  Exiting.

However, this affects only the master daemon. The startd daemon is fine, and that is why we do not have thousands of users screaming at us. So it is a wide problem, but jobs keep running. It is important to fix this, but we can live with it. I am not sure why, but it looks like someone already thought about this problem in advance and did this:

https://github.com/glideinWMS/glideinwms/blob/91885100835a5272ccec91266aa92f4e0ff5fb46/creation/web_base/condor_config#L101

Since GSI_DAEMON_NAME is a comma-separated list, there is no way around this. At the central managers we got rid of it completely and use only the grid map file. I think the best way would be to unset GSI_DAEMON_NAME completely and just use the grid map file.

So my question is: how can I patch the code to avoid setting GSI_DAEMON_NAME in the pilot condor config?

History

#1 Updated by Marco Mambelli 6 months ago

  • Target version changed from v3_5_1 to v3_4_6

#2 Updated by Marco Mambelli 5 months ago

  • Description updated (diff)

#3 Updated by Dennis Box 4 months ago

Activity on this has been going on via email, in fits and starts due to vacation schedules. Unfortunately the ticket was not updated to track progress.

I determined that a single-line edit on the factory, in the file /var/lib/gwms-factory/web-base/condor_vars.lst,
commenting out line 95 like so:

  #X509_GRIDMAP_TRUSTED_DNS C - GSI_DAEMON_NAME Y N -

causes the master daemon on new glideins to use the mapfile instead of GSI_DAEMON_NAME. The mapfile is escaped properly to handle commas, so a frontend operator with commas in their DN should be able to kill the glideins as they could previously.
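
For admins who prefer to script the edit rather than do it by hand, here is a small Python sketch. It assumes that the fourth whitespace-separated column of a condor_vars.lst entry is the condor attribute name (as in the line quoted above) and that a leading `#` marks a comment. The function `comment_out_condor_name` and the second sample entry are hypothetical; check the result against the real file before a reconfig.

```python
def comment_out_condor_name(text, condor_name="GSI_DAEMON_NAME"):
    """Prefix with '#' every active line whose 4th column equals condor_name.

    Assumption: columns are whitespace-separated and column 4 is the
    condor attribute name. Lines already commented out are left alone.
    """
    out = []
    for line in text.splitlines():
        fields = line.split()
        active = bool(fields) and not line.lstrip().startswith("#")
        if active and len(fields) >= 4 and fields[3] == condor_name:
            line = "#" + line
        out.append(line)
    return "\n".join(out)


# First entry is from this ticket; the second is a made-up placeholder.
sample = (
    "X509_GRIDMAP_TRUSTED_DNS C - GSI_DAEMON_NAME Y N -\n"
    "EXAMPLE_VAR C - EXAMPLE_CONDOR_NAME N N -"
)
print(comment_out_condor_name(sample))
```

Running the function twice gives the same result, since lines that already start with `#` are skipped.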

I was not able to find a recipe to change this behavior from the frontend.

To test the proposed fix, we need a frontend with a comma in the DN, which I do not have easy access to. It was proposed that Marco or some other factory admin modify the condor_vars.lst file on the CERN ITB factory, do a reconfig, and provide a security name to Edgar so the LIGO frontend (which has commas in the DN) can be connected to the CERN ITB.

Edgar should then be able to submit jobs and more importantly kill the glideins.

#4 Updated by Dennis Box 4 months ago

Edgar and Ilan successfully tested this change and request that it get into the next release.

#5 Updated by Dennis Box 4 months ago

  • Assignee changed from Dennis Box to Marco Mambelli
  • Status changed from New to Feedback

One line commented out in creation/web_base/condor_vars.lst, for review.

#6 Updated by Marco Mambelli 4 months ago

  • Assignee changed from Marco Mambelli to Dennis Box

Did you consider install/services/Condor.py and install/glidecondor_addDN?
Do they need changes?
The documentation may then need changes too.
Consider that we are dropping the tar-ball installation, so just add a note that you will not change a document if it is only used for the tar-ball installation.

#7 Updated by Dennis Box 4 months ago

  • Status changed from Feedback to Resolved

merged to branch_v3_4

#8 Updated by Marco Mambelli 4 months ago

  • Status changed from Resolved to Closed

