Feature #15176

Track jobs that span multiple nodes, e.g. HPC submission

Added by Marco Mambelli almost 3 years ago. Updated 9 months ago.

Status: Closed
Priority: High
Category: -
Target version:
Start date: 01/17/2017
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS, HEPCloud
Duration:

Description

HPC resources frequently impose a low limit on the number of jobs, but each submitted job can span multiple nodes.
This is achieved by adding directives to the batch system that start the same glidein_startup.sh script with the same parameters on more than one node.
This confuses the GWMS system, which expects one glidein with N cpus and instead receives M (number of nodes) glideins with M*N total cpus.
It is problematic for provisioning (the problem is sometimes referred to as "job bombing" because the number of provided cpus explodes), and Dirk seemed to report a problem with accounting as well (identical parameters for the glidein script may cause problems; this needs investigation).
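
To put numbers on the mismatch, here is a minimal sketch of the arithmetic. The function name and the example values are hypothetical and are not part of GWMS; it only restates the N versus M*N relationship described above.

```python
def cpu_mismatch(cpus_per_glidein: int, nodes: int) -> dict:
    """Compare what GWMS expects with what a multi-node job delivers.

    cpus_per_glidein is N and nodes is M in the description above.
    """
    if cpus_per_glidein < 1 or nodes < 1:
        raise ValueError("cpus_per_glidein and nodes must be positive")
    expected_cpus = cpus_per_glidein          # one glidein with N cpus
    provided_cpus = nodes * cpus_per_glidein  # M glideins with N cpus each
    return {
        "expected_glideins": 1,
        "provided_glideins": nodes,
        "expected_cpus": expected_cpus,
        "provided_cpus": provided_cpus,
        "extra_cpus": provided_cpus - expected_cpus,
    }
```

For instance, with N=8 and M=4 the system expects 8 cpus and receives 32, so 24 cpus are unaccounted for in the provisioning request.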

This ticket is about a temporary solution that could solve the problem and would require changes limited to the glidein submission.

A more complete solution, treating the nodes of multi-node jobs as first-class citizens, will follow in a different ticket and will require pervasive changes in both the frontend and the factories.

History

#1 Updated by Marco Mambelli almost 3 years ago

Parag and I discussed assigning the total number of cpus (N*M) to GLIDEIN_CPUS, adding a new GLIDEIN_NODES parameter (name may change) with the number of nodes that the job will span (default 1), and having the glidein_startup.sh script divide GLIDEIN_CPUS by the number of nodes.
This should make things work and limit the parts to change.
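
A rough sketch of the division step under this proposal, written in Python for illustration only (the real logic would live in glidein_startup.sh). It assumes both values are plain integers passed in an environment-like dictionary; the function name and the choice to reject totals that do not divide evenly are assumptions, not decisions recorded in this ticket.

```python
def cpus_per_node(env: dict) -> int:
    """Per-node cpu count when GLIDEIN_CPUS holds the total (N*M)."""
    total = int(env["GLIDEIN_CPUS"])
    nodes = int(env.get("GLIDEIN_NODES", "1"))  # default 1, as proposed
    if nodes < 1:
        raise ValueError("GLIDEIN_NODES must be at least 1")
    if total % nodes != 0:
        # Assumption: an uneven split is treated as a configuration error.
        raise ValueError("GLIDEIN_CPUS is not a multiple of GLIDEIN_NODES")
    return total // nodes
```

With GLIDEIN_CPUS=32 and GLIDEIN_NODES=4, each node's glidein would advertise 8 cpus; without GLIDEIN_NODES the value is used unchanged.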

#2 Updated by Parag Mhashilkar almost 3 years ago

  • Priority changed from Normal to High

#3 Updated by Parag Mhashilkar almost 3 years ago

  • Target version changed from v3_3_x to v3_3_3

#4 Updated by Marco Mambelli over 2 years ago

Notes from a discussion with Dirk:
- the mechanism in place in the SLURM BLAHP already allows multijob submission by setting it in the factory
- a problem for debugging is that the logs from different nodes overlap in a single file (condor stdout/err mechanism, since a single job is submitted by condor-G)
- draining multiple nodes may be problematic. The user is billed until the last node is freed, so it makes no sense to release a few nodes early. Nodes should ideally coordinate (locally in the spool directory or via a classad)
- OK to define GLIDEIN_NODES in the factory
- the glidein timeout may be problematic (on HPC resources you may have to stay queued for days)
- most important is to make the Frontend aware of the total number of cpus, so that it can scale down submission
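
The last point can be sketched as follows: once the Frontend knows how many cpus one submitted job really provides, it can request fewer jobs. This is only an illustration of the scaling arithmetic; the function and parameter names are hypothetical and this is not how the Frontend is actually implemented.

```python
def glideins_to_request(cpu_demand: int, total_cpus_per_job: int) -> int:
    """Number of multi-node jobs needed to cover cpu_demand.

    total_cpus_per_job is the cpu count summed over all nodes of one job.
    """
    if cpu_demand < 0:
        raise ValueError("cpu_demand must not be negative")
    if total_cpus_per_job < 1:
        raise ValueError("total_cpus_per_job must be positive")
    # Ceiling division using integers only.
    return (cpu_demand + total_cpus_per_job - 1) // total_cpus_per_job
```

For a demand of 100 cpus, a Frontend that assumes 8 cpus per job would request 13 jobs, while one that knows each job spans 4 nodes of 8 cpus (32 in total) would request 4.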

#5 Updated by Parag Mhashilkar over 2 years ago

  • Stakeholders updated

#6 Updated by Marco Mambelli about 2 years ago

Code is in mst/15176

#7 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_3_3 to v3_3_4

#8 Updated by Marco Mambelli over 1 year ago

Adding a note: the relationship between GLIDEIN_NODES and GLIDEIN_CPUS has changed from the earlier proposal:
- GLIDEIN_NODES is still the number of nodes the job will span
- GLIDEIN_CPUS is the number of CPUs expected on one node
So the total # of CPUs will be GLIDEIN_NODES*GLIDEIN_CPUS

This was changed because it is better to keep GLIDEIN_CPUS independent:
- admin may want to set GLIDEIN_CPUS artificially to create "virtual" CPUs on the nodes
- GLIDEIN_CPUS could be the result of automatic discovery
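
A small sketch of the revised relationship, for comparison with the earlier proposal. It assumes GLIDEIN_CPUS has already been resolved to an integer (for example after automatic discovery); the function name is hypothetical.

```python
def total_cpus(glidein_cpus: int, glidein_nodes: int = 1) -> int:
    """Total cpus of one job when GLIDEIN_CPUS is the per-node count."""
    if glidein_cpus < 1 or glidein_nodes < 1:
        raise ValueError("glidein_cpus and glidein_nodes must be positive")
    return glidein_nodes * glidein_cpus
```

With GLIDEIN_CPUS=8 and GLIDEIN_NODES=4 the total is 32; an admin who sets GLIDEIN_CPUS to 16 to create "virtual" CPUs gets 64 without touching GLIDEIN_NODES.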

#9 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_3_4 to v3_4_0

#10 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_4_0 to v3_4_1

#11 Updated by Marco Mambelli over 1 year ago

  • Target version changed from v3_4_1 to v3_5

#12 Updated by Marco Mambelli about 1 year ago

  • Status changed from New to Work in progress

#13 Updated by Marco Mambelli 10 months ago

  • Assignee changed from Marco Mambelli to Dennis Box
  • Status changed from Work in progress to Feedback

changes in v35/15176

undocumented testing feature:
the attribute GLIDEIN_MULTIGLIDEIN (in frontend or factory) forks multiple glideins, all with the same parameters (similar to multi-node submission, but all on one node)
e.g. <attr name="GLIDEIN_MULTIGLIDEIN" glidein_publish="True" job_publish="True" parameter="True" type="int" value="3"/>
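
To show the idea behind the testing feature, here is a generic sketch that starts several copies of one command with identical arguments and waits for all of them. It is not the code in v35/15176: the function name is hypothetical and a trivial Python command stands in for the glidein script.

```python
import subprocess
import sys
from typing import List, Sequence


def launch_copies(cmd: Sequence[str], copies: int) -> List[int]:
    """Start `copies` processes running the same command, return exit codes."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Start every copy first so that they run concurrently.
    procs = [subprocess.Popen(list(cmd)) for _ in range(copies)]
    # Then wait for each one and collect its exit code.
    return [p.wait() for p in procs]


# Example: three copies of a stand-in command that exits immediately.
codes = launch_copies([sys.executable, "-c", "pass"], copies=3)
```

Because every copy gets the same arguments, this reproduces on one node the situation a multi-node submission creates across nodes, which is what makes it useful for testing.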

#14 Updated by Marco Mambelli 10 months ago

  • Assignee changed from Dennis Box to Marco Mambelli
  • Status changed from Feedback to Resolved

#15 Updated by Marco Mambelli 10 months ago

  • Target version changed from v3_5 to v3_4_4

#16 Updated by Marco Mambelli 9 months ago

  • Status changed from Resolved to Closed

