Decouple the jobsub server from the condor schedd
As we discovered yesterday, shutting down a jobsub server while leaving the associated condor schedd running causes problems,
since jobsub apparently expects a server for each schedd. That is, even if a jobsub_submit request goes to fifebatch2, if
the jobsub server decides the job should go to fifebatch1, it routes the request through the fifebatch1 jobsub server rather than directly to the fifebatch1 schedd.
Decoupling the two would support future scaling out, high availability (HA), and potentially even group-specific schedds.
#1 Updated by Dennis Box over 4 years ago
- Status changed from New to Assigned
- Assignee set to Dennis Box
- Target version set to v1.2.4
This issue is also tracked as RITM0423108; I am going to close that request with a pointer to this one.
I have worked out most of the details of how this could be done, with the exception of DAGs. It appears the server-side implementation of jobsub_submit_dag will need some rewriting along the lines of the last few sentences of this excerpt from the condor_submit_dag man page:
Submit condor_dagman to a remote machine, specifically the condor_schedd daemon on that machine. The condor_dagman job will not run on the local condor_schedd (the submit machine), but on the specified one. This is implemented using the -remote option to condor_submit. Note that this option does not currently specify input files for condor_dagman, nor the individual nodes to be taken along! It is assumed that any necessary files will be present on the remote computer, possibly via a shared file system between the local computer and the remote computer. It is also necessary that the user has appropriate permissions to submit a job to the remote machine; the permissions are the same as those required to use condor_submit's -remote option. If other options are desired, including transfer of other input files, consider using the -no_submit option, modifying the resulting submit file for specific needs, and then using condor_submit on that.
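For reference, the -no_submit workflow suggested at the end of that excerpt might look roughly like this. The DAG file name and the target hostname are placeholders for illustration, not the actual fifebatch configuration:

```shell
# Generate the DAGMan submit description file locally without submitting it.
condor_submit_dag -no_submit mydag.dag

# Hand-edit the generated mydag.dag.condor.sub as needed, e.g. to add
# transfer_input_files entries for the node submit files and any input
# data the DAG nodes require on the remote side.

# Submit the modified DAGMan job to the remote schedd directly,
# rather than going through the jobsub server on that host.
condor_submit -remote fifebatch1.fnal.gov mydag.dag.condor.sub
```

The server-side rewrite would presumably automate these steps, since -remote alone does not transfer the DAG's input files.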