Authentication error in glidein
The glidein cannot connect back and accept jobs.
There is an authentication error visible in the startd log.
Glideins run on hosts that are not known a priori, so HTCondor's host name check needs an exception, which is set through the GSI_SKIP_HOST_CHECK_CERT_REGEX mechanism.
It seems that at one site the DN in the exception does not match the DN actually used. The mismatch is in the "ending" of the DN, the part that is modified when proxies are created.
Looking at the code, it seems it could be made more robust in how it handles proxies.
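One possible way to make the exception tolerant of proxy suffixes is sketched below in Python. This is an illustration only, not the current glideinWMS code: the helper name skip_host_check_regex is hypothetical, and the sketch assumes that HTCondor requires the regular expression to match the certificate subject in full and that proxy creation only appends /CN=... components to the end of the DN.

```python
import re


def skip_host_check_regex(base_dn: str) -> str:
    """Build a regex matching base_dn with zero or more trailing /CN=... components.

    Hypothetical helper: the anchors make the full-match requirement explicit.
    """
    return "^" + re.escape(base_dn) + r"(/CN=[^/]+)*$"


# DN taken from the GGUS ticket excerpt in this issue.
base = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch"
pattern = skip_host_check_regex(base)

# Candidate value for the pilot configuration (assumed usage).
print("GSI_SKIP_HOST_CHECK_CERT_REGEX = " + pattern)
```

With a pattern of this shape, both the plain DN and the same DN followed by proxy components such as /CN=1400921378/CN=317110013 would match, while a DN for a different host would not.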
Anyway, this happened at only one site, while the same CMS proxy is used at many sites, so further investigation is needed to understand why it happened there.
Below is an excerpt from the GGUS ticket:
From GGUS ticket:

Int. Diary: Notified CMS Glidein Factory team of this ticket.

Public Diary: Looking at the Startd logs of that specific pilot, I see . That DNS checking it complains about, must be skipped because we normally put the pilot DN in both GSI_DAEMON_NAME and also in GSI_SKIP_HOST_CHECK_CERT_REGEX in the pilot and Actually I can see those variables on the pilot log.err file (job.6248907.3.out).

What it seems to be the problem is that, in the variables above (GSI_SKIP_HOST_CHECK_CERT_REGEX and GSI_DAEMON_NAME) I see the pilot DN added as  for this pilot, whereas in another healthy pilot I see it as , so I think the problem is in the definition of this variables. I've set GlideinFactory on the Support unit, so we can have somebody from the factory team to look at this.

Regards, Diego

SECMAN: required authentication with daemon at <172.30.1.141:25058> failed, so aborting command DC_CHILDALIVE.
04/26/18 10:25:50 (pid:400351) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <172.30.1.141:25058> (try 1 of 3): AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5008:We are trying to connect to a daemon with certificate DN (/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch), but the host name in the certificate does not match any DNS name associated with the host to which we are connecting (host name is 'wn-a2-21.brunel.ac.uk', IP is '2001:630:10:f001::1c8d', Condor connection address is '<[2001:630:10:f001::1c8d]:25058?addrs=172.30.1.141-25058+[2001-630-10-f001--1c8d]-25058>'). Check that DNS is correctly configured. If the certificate is for a DNS alias, configure HOST_ALIAS in the daemon's configuration. If you wish to use a daemon certificate that does not match the daemon's host name, make GSI_SKIP_HOST_CHECK_CERT_REGEX match the DN, or disable all host name checks by setting GSI_SKIP_HOST_CHECK=true or by defining GSI_DAEMON_NAME.
/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1400921378/CN=317110013
/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch
#5 Updated by Brian Bockelman over 1 year ago
Oh - I should point out that the proxy will be continuously updated by some CEs throughout the lifetime of the pilot. So, we can't really hardcode a particular proxy DN here regardless.
Another thing to note is that HTCondor automatically sets up the security sessions between daemons (this was done 2-3 years ago) so all of this is superfluous. One approach may be to just rip it all out.
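If the settings are kept rather than removed, the point above about the proxy DN changing suggests comparing identities rather than full proxy DNs. A minimal Python sketch of that idea follows; it is not existing glideinWMS or HTCondor code. The function name identity_dn is hypothetical, and the sketch assumes proxy components are either numeric CNs (RFC 3820 style) or the legacy "proxy" / "limited proxy" CNs. An end-entity certificate whose last CN is purely numeric would be stripped too, so this is a heuristic.

```python
import re

# Trailing components assumed to come from proxy creation:
# numeric CNs (RFC 3820 style) or legacy "proxy" / "limited proxy" CNs.
_PROXY_SUFFIX = re.compile(r"(/CN=(\d+|proxy|limited proxy))+$")


def identity_dn(dn: str) -> str:
    """Return dn with any trailing proxy components removed (heuristic)."""
    return _PROXY_SUFFIX.sub("", dn)


# The two DNs quoted in the GGUS ticket excerpt.
dn_a = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch/CN=1400921378/CN=317110013"
dn_b = "/DC=ch/DC=cern/OU=computers/CN=cmspilot02/vocms080.cern.ch"

# True when both DNs belong to the same underlying identity.
same_identity = identity_dn(dn_a) == identity_dn(dn_b)
print(same_identity)
```

Under these assumptions the two DNs from the ticket reduce to the same identity, which is consistent with the mismatch being only in the proxy-generated ending.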