Bug #10142

Jobs failing after match because startd "Couldn't send ALIVE to schedd"

Added by Marco Mambelli over 4 years ago. Updated over 4 years ago.

Status: Closed
Priority: Urgent
Category: -
Target version:
Start date: 09/14/2015
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: OSG
Duration:

Description

After upgrading the factory to 3.2.11-1, jobs from some VOs are failing. CMS jobs are running fine.
SBGrid and OSG-Connect jobs are failing.

The following is visible in the StartdLog included in the glidein logs:

JobCurrentStartExecutingDate = 1441180074
JOB_GLIDEIN_Schedd = "schedd_glideins4@gfactory-1.t2.ucsd.edu" 
BlockWrites = 0
09/02/15 01:15:33 (pid:56300) ERROR: SECMAN:2009:DENIED authorization of server 'submit-side@matchsession/134.174.140.230' (I am acting as the client): reason: CLIENT authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 134.174.140.230,sch.med.harvard.edu, hostname size = 1, original ip address = 134.174.140.230.
09/02/15 01:15:33 (pid:56300) Couldn't send ALIVE to schedd at <134.174.140.230:9615?sock=6783_7ee8_5>
…
09/02/15 01:19:02 (pid:56300) ERROR: SECMAN:2009:DENIED authorization of server 'submit-side@matchsession/134.174.140.230' (I am acting as the client): reason: cached result for CLIENT; see first case for the full reason.
09/02/15 01:19:02 (pid:56300) Couldn't send ALIVE to schedd at <134.174.140.230:9615?sock=6783_7ee8_5>
JobPid = 56388
...

History

#1 Updated by Parag Mhashilkar over 4 years ago

We should try adding the following to the glidein's condor config:

STARTD.ALLOW_DAEMON = submit-side@matchsession/*

You can do this by adding the following attribute to the factory config:

<attr name="STARTD.ALLOW_DAEMON" const="True" glidein_publish="True" job_publish="False" parameter="True" publish="False" type="expr" value="submit-side@matchsession/*"/>

Let me know if this works and we can make a point release with these changes.

#2 Updated by Parag Mhashilkar over 4 years ago

  • Target version set to v3_2_11_2
  • First Occurred set to v3_2_11
  • Stakeholders updated (diff)

#3 Updated by Marco Mambelli over 4 years ago

The error suggests adding:

submit-side@matchsession/134.174.140.230

The manual section on SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION suggests adding:

ALLOW_DAEMON = submit-side@matchsession/192.168.123.*

The change in behavior seems connected to the security changes made for #7807:

@@ -85,10 +85,19 @@ GSI_DAEMON_DIRECTORY=$(LOCAL_DIR)
 SEC_DEFAULT_AUTHENTICATION = REQUIRED
 SEC_DEFAULT_AUTHENTICATION_METHODS = GSI 

+
+SEC_CLIENT_AUTHENTICATION_METHODS = CLAIMTOBE, $(SEC_DEFAULT_AUTHENTICATION_METHODS)
+STARTD.ALLOW_CLIENT = collector*/*, anonymous@claimtobe/*, frontend*/*, condor/*
+STARTD.GSI_DAEMON_NAME =
+STARTD.GSI_SKIP_HOST_CHECK = true
+
+
 DENY_WRITE = anonymous@*
 DENY_ADMINISTRATOR = anonymous@*
 DENY_DAEMON = anonymous@*
 DENY_NEGOTIATOR = anonymous@*
+DENY_OWNER = anonymous@*
+DENY_CONFIG = anonymous@*

 LOCAL_CONFIG_FILE       = 
 

#4 Updated by Marco Mambelli over 4 years ago

I've been trying to reproduce the error using VMs in fermicloud.

The jobs are matched and started, but the startd cannot authenticate back to the schedd, which then assumes that the job died.
Authentication should work by trusting the collector, which has authenticated with both the schedd and the startd and sets up an authentication token between them at match time.
The following must be set on both the startd and the schedd for that to work:
SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION=True

I checked and it is set in all the glideins (startd).
If the submit hosts were installed with the GlideinWMS package, they should have that variable set to true.
The default differs depending on the HTCondor version.

I don’t know why the change in GlideinWMS version is causing the problem.
If the jobs are submitted by the frontend + user collector, this will not happen because the DN of the user collector is in the condor_mapfile (so it is trusted anyway).
I did the following tests, all successful:
- glidein-factory 3.2.11.1-1.osg32.el6, condor 8.2.9
- glidein-vofrontend 3.2.11.1-1.osg32.el6, condor 8.2.9

All the combinations of
  • jobs submitted by:
    - the vofrontend
    - custom condor 8.2.9 submit host
    - glideinwms-userschedd 8.2.10-1, condor 8.2.9
  • and the following glideins:
    - condor 8.2.6
    - condor 8.3.8
    - condor 8.2.8 downloaded from UCSD (the tar ball used in the jobs that failed)

Simple test jobs all run successfully.

I also tried artificially setting SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION=False; the error observed is different (xxx.xxx.xxx.aaa is the submit host):

09/14/15 15:34:14 (pid:6655) DC_AUTHENTICATE: required authentication of xxx.xxx.xxx.aaa failed: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5005:Failed to authenticate with client.  Client does not trust our certificate.  You may want to check the GSI_DAEMON_NAME in the condor_config

#5 Updated by Marco Mambelli over 4 years ago

Several attempts to trigger the error by changing the configuration failed.

In the end, the trigger seems to be simply a job running longer than 20 min, which is when the startd sends the ALIVE message.
With such jobs I reproduced the same error:
- jobs are evicted and keep being resubmitted (in a never ending loop since I did not set a max number of re-submissions)
- the error above is visible in the startd logs
- glideins keep running and the same jobs are re-assigned to them

It also seems to happen with the frontend's schedd.

The fix, tested by patching manually, is to add "submit-side@matchsession/*" to STARTD.ALLOW_CLIENT rather than to STARTD.ALLOW_DAEMON.
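
As a sketch of that manual patch, assuming it was applied to the STARTD.ALLOW_CLIENT line introduced by the #7807 changes quoted in update #3 (the exact line committed in v3/10142 is not shown in this ticket):

```
# Glidein condor config (sketch, not the committed change):
# allow the submit side's match-session identity as a peer when the
# startd acts as a client, e.g. when it sends ALIVE to the schedd.
STARTD.ALLOW_CLIENT = collector*/*, anonymous@claimtobe/*, frontend*/*, condor/*, submit-side@matchsession/*
```

With an entry like this, the SECMAN:2009 check that reported "CLIENT authorization policy contains no matching ALLOW entry" should find a match for 'submit-side@matchsession/<ip>'.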

The error also happens when the glidein (startd) is HTCondor 8.3.8.
This is puzzling because we had attributed the fact that CMS jobs were not affected to the condor version.

Here is the error with 8.3.8:

09/17/15 16:49:37 (pid:21490) ERROR: SECMAN:2009:DENIED authorization of server 'submit-side@matchsession/131.225.154.190' (I am acting as the client): reason: CLIENT authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 131.225.154.190,fermicloud055.fnal.gov, hostname size = 1, original ip address = 131.225.154.190.
09/17/15 16:49:37 (pid:21490) Couldn't send ALIVE to schedd at <131.225.154.190:9615?sock=26118_1441_3>
09/17/15 16:49:42 (pid:21490) SharedPortClient: sent connection request to schedd at <131.225.154.190:9615> for shared port id 26118_1441_3
09/17/15 16:49:42 (pid:21490) SECMAN: command 441 ALIVE to schedd at <131.225.154.190:9615> from TCP port 38837 (blocking).
09/17/15 16:49:42 (pid:21490) Using requested session <131.225.154.159:59276>#1442526170#1.
09/17/15 16:49:42 (pid:21490) SECMAN: found cached session id <131.225.154.159:59276>#1442526170#1 for {<131.225.154.190:9615?sock=26118_1441_3>,<441>}.

#6 Updated by Marco Mambelli over 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar
  • Priority changed from Normal to Urgent
  • Occurs In v3_2_11_1 added

Changes are in v3/10142

These changes fix the problem.
It remains to understand why this was not occurring before the #7807 changes or with CMS jobs.

#8 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Looks ok. Merged to branch_v3_2.

#9 Updated by Marco Mambelli over 4 years ago

I did some more tests with long jobs to explain why CMS and GLOW were not experiencing the problem.

With a schedd running HTCondor 8.3.8, long jobs run fine with both 8.3 and 8.2 glideins.
So the version of the schedd and the length of the jobs seem to be the deciding factors.

The problem (startd unable to send the keepalive message) happens when:
- jobs are longer than 20 min (e.g. 1500 sec jobs)
- the schedd is HTCondor 8.2

Whether the schedd is separate from the frontend and which version the startd runs do not seem to be deciding factors.

#10 Updated by Marco Mambelli over 4 years ago

Adding here a message from Jaime clarifying that HTCondor 8.3.2 or later does not show the problem because it uses a different keepalive mechanism:

The documentation is incomplete. To take advantage of SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION, you need to modify both ALLOW_DAEMON and ALLOW_CLIENT to include submit-side@matchsession/* (or some restrictive set of IPs). I believe the ALLOW_CLIENT part was overlooked because the default ALLOW_CLIENT setting is *, i.e. allow everything. So if ALLOW_CLIENT isn’t set in the config file, everything still works. Most people never set ALLOW_CLIENT to something more restrictive. I’ll see that the HTCondor manual is updated.

Two changes in HTCondor 8.3 may help explain the difference in behavior you’re seeing between different frontends and glideins. First, SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION is enabled by default starting with 8.3.6. Second, starting with 8.3.2, ALIVE messages are not sent from the startd to the schedd while a job is running. Instead, HTCondor relies on the long-lived connection between the starter and shadow to detect a disconnect or failure.

For 8.4.1 and beyond, we are working on removing the requirement of setting ALLOW_DAEMON and ALLOW_CLIENT for SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION to work properly.

Just to avoid confusion: the messages that followed established that ALLOW_DAEMON is not actually needed.
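
For the "restrictive set of IPs" option mentioned in Jaime's message, a sketch of a narrower ALLOW_CLIENT entry follows. It assumes that the submit host from the description (134.174.140.230) is the only one that needs to be trusted and reuses the other entries from the #7807 change; a real deployment would list its own submit hosts.

```
# Sketch only: name the submit host explicitly instead of using a wildcard.
# 134.174.140.230 is the schedd address from the StartdLog in the description.
STARTD.ALLOW_CLIENT = collector*/*, anonymous@claimtobe/*, frontend*/*, condor/*, submit-side@matchsession/134.174.140.230
```

Since glideins may be matched to many submit hosts, the wildcard form is probably more practical for a factory-wide setting.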

#11 Updated by Parag Mhashilkar over 4 years ago

  • Status changed from Resolved to Closed

