Bug #19827

Authentication error in glidein

Added by Marco Mambelli about 1 year ago. Updated about 1 year ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 04/27/2018
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

The glidein cannot connect back and accept jobs.
There is an authentication error visible in the startd log.
Glideins run on different hosts that are not known a priori. This requires setting an exception in HTCondor, using the GSI_SKIP_HOST_CHECK_CERT_REGEX mechanism.
It seems that on one site there is a mismatch between the DN in the exception and the one actually used. The difference is in the "ending" of the DN, the part that is modified when proxies are created.

Looking at the code, it seems it could be improved to be more robust with proxies.
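As a rough illustration of what "more robust with proxies" could mean, here is a minimal Python sketch that builds a pattern accepting the base DN followed by any number of proxy CN components. The function name `proxy_tolerant_pattern` is hypothetical, the list of proxy suffixes (numeric CNs, `proxy`, `limited proxy`) is an assumption, and this is not the GlideinWMS code; whether HTCondor's regex engine treats the pattern the same way as Python's `re` module would need to be checked.

```python
import re


def proxy_tolerant_pattern(eec_dn):
    """Return a regex matching eec_dn with zero or more proxy CN suffixes.

    Assumption: proxy certificates only append components of the form
    /CN=<digits>, /CN=proxy or /CN=limited proxy to the base DN.
    """
    return re.escape(eec_dn) + r"(?:/CN=(?:\d+|proxy|limited proxy))*"


# DNs taken from references [2] and [3] of the GGUS excerpt in this ticket.
EEC_DN = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch"
PROXY_DN = EEC_DN + "/CN=1400921378/CN=317110013"

pattern = proxy_tolerant_pattern(EEC_DN)
print(re.fullmatch(pattern, EEC_DN) is not None)
print(re.fullmatch(pattern, PROXY_DN) is not None)
```

With a pattern of this shape, both the plain DN and the DN with proxy components appended would be accepted, while a DN with some other extra component would not.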

In any case, the event happened at only one site, while the same CMS proxy is used at many sites, so further investigation is needed to understand why it happened there.

Below is an excerpt from the GGUS ticket:
https://ggus.eu/index.php?mode=ticket_info&ticket_id=134763

From GGUS ticket:

Int. Diary:
Notified CMS Glidein Factory team of this ticket.
Public Diary:
Looking at the Startd logs of that specific pilot, I see [1]. The DNS check it complains about should be skipped, because we normally put the pilot DN in both GSI_DAEMON_NAME and GSI_SKIP_HOST_CHECK_CERT_REGEX in the pilot, and I can indeed see those variables in the pilot's log.err file (job.6248907.3.out).

The problem seems to be that, in the variables above (GSI_SKIP_HOST_CHECK_CERT_REGEX and GSI_DAEMON_NAME), I see the pilot DN added as [2] for this pilot, whereas in another, healthy pilot I see it as [3], so I think the problem is in the definition of these variables.

I've set GlideinFactory as the Support unit, so that somebody from the factory team can look at this.

Regards,

Diego

[1] SECMAN: required authentication with daemon at <172.30.1.141:25058> failed, so aborting command DC_CHILDALIVE.
04/26/18 10:25:50 (pid:400351) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.30.1.141:25058> (try 1 of 3): AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5008:We are trying to connect to a daemon with certificate DN (/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch), but the host name in the certificate does not match any DNS name associated with the host to which we are connecting (host name is 'wn-a2-21.brunel.ac.uk', IP is '2001:630:10:f001::1c8d', Condor connection address is '<[2001:630:10:f001::1c8d]:25058?addrs=172.30.1.141-25058+[2001-630-10-f001--1c8d]-25058>'). Check that DNS is correctly configured. If the certificate is for a DNS alias, configure HOST_ALIAS in the daemon's configuration. If you wish to use a daemon certificate that does not match the daemon's host name, make GSI_SKIP_HOST_CHECK_CERT_REGEX match the DN, or disable all host name checks by setting GSI_SKIP_HOST_CHECK=true or by defining GSI_DAEMON_NAME.

[2] /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1400921378/CN=317110013
[3] /DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
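The effect of the difference between [2] and [3] can be illustrated with a small Python sketch. It assumes that the exception pattern must match the whole certificate DN (an assumption about the matching semantics, not something stated in the ticket) and uses Python's `re` module as a stand-in for HTCondor's matching. The helper `dn_matches` is hypothetical.

```python
import re

# DN configured on the failing pilot, reference [2].
CONFIGURED_DN = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1400921378/CN=317110013"
# DN reported in the GSI:5008 error and seen on the healthy pilot, reference [3].
CERT_DN = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch"


def dn_matches(configured_dn, cert_dn):
    """True if a literal pattern built from configured_dn matches all of cert_dn."""
    return re.fullmatch(re.escape(configured_dn), cert_dn) is not None


print(dn_matches(CERT_DN, CERT_DN))        # healthy pilot: pattern built from [3]
print(dn_matches(CONFIGURED_DN, CERT_DN))  # failing pilot: pattern built from [2]
```

Under that assumption, a pattern built from [2] cannot match the shorter DN [3] presented by the daemon, which would be consistent with the host name check not being skipped.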

History

#1 Updated by Marco Mambelli about 1 year ago

  • Subject changed from Authentcation error in glidein (startd logs) to Authentcation error in glidein

#2 Updated by Marco Mambelli about 1 year ago

  • Subject changed from Authentcation error in glidein to Authentication error in glidein

#3 Updated by Marco Mambelli about 1 year ago

The site in question uses IPv6.

#4 Updated by Brian Bockelman about 1 year ago

Note that HTCondor should be matching GSI_SKIP_HOST_CHECK_CERT_REGEX against the EEC identity, not the DN of the proxy itself. In the case of a hostcert, the two should be the same thing.

Could be a condor bug. Can't see how this would be a GWMS bug.
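To make the EEC-versus-proxy distinction concrete, here is a minimal Python sketch that reduces a proxy DN to its EEC identity by stripping trailing proxy components. It is a string heuristic only: the function name `eec_identity` is hypothetical, the set of proxy suffixes is an assumption, and a real implementation would derive the identity from the certificate chain rather than from the DN text.

```python
import re

# Assumption: proxy certificates append /CN=<digits> (RFC 3820 style)
# or /CN=proxy, /CN=limited proxy (legacy style) to the issuer's DN.
_PROXY_SUFFIX = re.compile(r"/CN=(?:\d+|proxy|limited proxy)$")


def eec_identity(dn):
    """Strip trailing proxy CN components until none are left."""
    while True:
        stripped = _PROXY_SUFFIX.sub("", dn)
        if stripped == dn:
            return dn
        dn = stripped


# Proxy DN from reference [2] of the GGUS excerpt.
proxy_dn = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1400921378/CN=317110013"
print(eec_identity(proxy_dn))
```

Because only trailing components are removed, the slash inside `CN=cmspilot02/vocms080.cern.ch` is left untouched.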

#5 Updated by Brian Bockelman about 1 year ago

Oh - I should point out that the proxy will be continuously updated by some CEs throughout the lifetime of the pilot. So, we can't really hardcode a particular proxy DN here regardless.

Another thing to note is that HTCondor automatically sets up the security sessions between daemons (this was done 2-3 years ago) so all of this is superfluous. One approach may be to just rip it all out.

#6 Updated by Marco Mambelli about 1 year ago

  • Status changed from New to Closed

Marco Mascheroni reported that the problem was caused by IPv6 interactions in HTCondor.
It was solved by disabling IPv6.
There is no problem with GlideinWMS.


