Track jobs that span multiple nodes, e.g. HPC submission
HPC resources frequently impose a low limit on the number of jobs, but each submitted job can span multiple nodes.
This is achieved by adding directives to the batch system that start the same glidein_startup.sh script with the same parameters on more than one node.
This confuses the GWMS system, which expects one glidein with N cpus and instead receives M (number of nodes) glideins with M*N total cpus.
It is problematic for provisioning (the problem is sometimes referred to as "job bombing" because the number of provided cpus explodes), and Dirk also seemed to report a problem with accounting (the same parameters for the glidein script may cause problems; this needs investigation).
This ticket is about a temporary solution that could solve the problem and would require changes limited to the glidein submission.
A more complete solution, treating the nodes of multi-node jobs as first-class citizens, will follow in a different ticket and will require pervasive changes in both frontend and factories.
#1 Updated by Marco Mambelli almost 3 years ago
Parag and I discussed assigning to GLIDEIN_CPUS the total number of cpus (N*M), adding a new GLIDEIN_NODES (name may change) parameter with the number of nodes that the job will span (default 1), and having the glidein_startup.sh script divide GLIDEIN_CPUS by the number of nodes.
This should make things work and limit the parts to change.
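The division proposed in this note can be sketched as follows. This is only an illustration in Python (the real division would happen inside glidein_startup.sh); the function name `cpus_per_node` and the choice to drop any remainder are assumptions, not part of the proposal.

```python
def cpus_per_node(glidein_cpus: int, glidein_nodes: int = 1) -> int:
    """Split the total cpu count (N*M) across the nodes spanned by one job.

    glidein_cpus  -- total cpus for the whole job (GLIDEIN_CPUS in this proposal)
    glidein_nodes -- number of nodes the job spans (GLIDEIN_NODES, default 1)
    """
    if glidein_nodes < 1:
        raise ValueError("GLIDEIN_NODES must be at least 1")
    if glidein_cpus < glidein_nodes:
        raise ValueError("GLIDEIN_CPUS must be at least GLIDEIN_NODES")
    # Assumption: integer division, any remainder is simply not advertised.
    return glidein_cpus // glidein_nodes


# A job spanning 4 nodes with 32 cpus in total: each glidein advertises 8.
print(cpus_per_node(32, 4))
# A single-node job keeps all its cpus.
print(cpus_per_node(16))
```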
#4 Updated by Marco Mambelli over 2 years ago
Notes from a discussion with Dirk:
- the mechanism in place in the SLURM BLAHP already allows multijob submission by setting it in the factory
- a problem for debugging is that the logs from different nodes all end up interleaved in a single file (condor stdout/err mechanism - single job submitted by condor-G)
- draining multiple nodes may be problematic. The user is billed until the last node is freed, so it makes no sense to release a few nodes early. Nodes should ideally coordinate (locally in the spool directory or via a classad)
- OK to define the GLIDEIN_NODES in the factory
- Glidein timeout may be problematic (on HPC resources you have to stay queued for days)
- Most important is to make Frontend aware of the tot # of cpus, to scale down submission
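The last point can be illustrated with a little arithmetic. The sketch below is not Frontend code; `jobs_to_submit` and its parameters are hypothetical names, and it only shows why knowing the total number of cpus per submitted job lets the submission be scaled down.

```python
import math


def jobs_to_submit(cpus_needed: int, total_cpus_per_job: int) -> int:
    """Number of glidein jobs required to cover cpus_needed.

    total_cpus_per_job is the cpu count of one submitted job summed over
    all the nodes it spans (hypothetical parameter name).
    """
    if total_cpus_per_job < 1:
        raise ValueError("total_cpus_per_job must be at least 1")
    return math.ceil(cpus_needed / total_cpus_per_job)


# Example: 256 cpus are needed; each job spans 4 nodes with 16 cpus each.
unaware = jobs_to_submit(256, 16)    # Frontend counts only one node per job
aware = jobs_to_submit(256, 4 * 16)  # Frontend knows the total per job

print(unaware, "jobs ->", unaware * 4 * 16, "cpus actually provisioned")
print(aware, "jobs ->", aware * 4 * 16, "cpus actually provisioned")
```

With these numbers the unaware estimate requests four times more jobs than needed, which is the over-provisioning described in the ticket.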
#8 Updated by Marco Mambelli over 1 year ago
Adding a note: the relationship between GLIDEIN_NODES and GLIDEIN_CPUS has changed from the earlier proposal:
- GLIDEIN_NODES is still the number of nodes the job will span
- GLIDEIN_CPUS is the number of CPUs expected on one node
So the total # of CPUs will be GLIDEIN_NODES*GLIDEIN_CPUS
This was changed because it is better to keep GLIDEIN_CPUS independent:
- admin may want to set GLIDEIN_CPUS artificially to create "virtual" CPUs on the nodes
- GLIDEIN_CPUS could be the result of automatic discovery
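Under the revised definitions, the total is a plain product. A minimal sketch, assuming both values have already been resolved to integers (for example after any automatic discovery); the function name `total_cpus` and the use of a dictionary of attributes are illustrative only.

```python
def total_cpus(attrs: dict) -> int:
    """Total cpus provided by one multi-node job.

    attrs maps attribute names to integer values. GLIDEIN_NODES defaults
    to 1, as stated in the ticket; GLIDEIN_CPUS is the per-node count and
    is required here (assumption of this sketch).
    """
    nodes = int(attrs.get("GLIDEIN_NODES", 1))
    cpus = int(attrs["GLIDEIN_CPUS"])
    if nodes < 1 or cpus < 1:
        raise ValueError("GLIDEIN_NODES and GLIDEIN_CPUS must be positive")
    return nodes * cpus


# 4 nodes with 16 cpus each
print(total_cpus({"GLIDEIN_NODES": 4, "GLIDEIN_CPUS": 16}))
# Single-node job: GLIDEIN_NODES omitted, default 1
print(total_cpus({"GLIDEIN_CPUS": 8}))
```

Keeping GLIDEIN_CPUS per node means an admin can still inflate it to create "virtual" cpus without touching GLIDEIN_NODES.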
#13 Updated by Marco Mambelli 10 months ago
- Assignee changed from Marco Mambelli to Dennis Box
- Status changed from Work in progress to Feedback
changes in v35/15176
undocumented testing feature:
attribute GLIDEIN_MULTIGLIDEIN (in frontend or factory) forks multiple glideins, all w/ the same parameters (similar to multinode submission, but all on one node)
e.g. <attr name="GLIDEIN_MULTIGLIDEIN" glidein_publish="True" job_publish="True" parameter="True" type="int" value="3"/>
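For quick checks while testing, the value of such an attribute can be read back with a few lines of Python. This is a standalone sketch using the standard library XML parser, not the way GWMS itself reads its configuration; `read_int_attr` and the fallback default of 1 are assumptions.

```python
import xml.etree.ElementTree as ET

# The example attribute from this note, as a one-element XML fragment.
ATTR_XML = (
    '<attr name="GLIDEIN_MULTIGLIDEIN" glidein_publish="True" '
    'job_publish="True" parameter="True" type="int" value="3"/>'
)


def read_int_attr(xml_text: str, name: str, default: int = 1) -> int:
    """Return the integer value of an <attr> element with the given name.

    Falls back to `default` if the element is not the requested attribute
    (assumption: an unset attribute means a single glidein).
    """
    elem = ET.fromstring(xml_text)
    if elem.tag != "attr" or elem.get("name") != name:
        return default
    if elem.get("type") != "int":
        raise ValueError(f"{name} is not declared as type int")
    return int(elem.get("value"))


multiglidein = read_int_attr(ATTR_XML, "GLIDEIN_MULTIGLIDEIN")
print("glideins that would be forked on the node:", multiglidein)
```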