Feature #10092

Support GPUs as a resource that's analogous to CPUs

Added by Parag Mhashilkar over 4 years ago. Updated about 4 years ago.

Status:
Closed
Priority:
Urgent
Category:
-
Target version:
Start date:
09/09/2015
Due date:
% Done:

0%

Estimated time:
Stakeholders:

OSG

Duration:

Description

Just like with cpus, we can extend the scheme to look at GLIDEIN_GPUS and attach it as a resource to SLOT_TYPE_1 if it is present.

Parag and Marco,

Derek and I have been working on getting GPU slots working for the OSG and GLOW VOs. We have factory and frontend entries and basic jobs are working, but we have now found a problem due to HTCondor not knowing that the GPU is a consumable item. This is important because users want to request a number of regular CPU cores to go along with the GPU. Thus we have partitionable slots enabled. For example, a user might do:

request_cpus = 4
request_gpus = 1

This might match a glidein with 8 CPUs and 1 GPU. Because HTCondor does not account for the GPU, it might then match a second job like the first one, which will not work since the GPU is already in use.
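Written out as a complete minimal submit file (the executable name here is just a placeholder for illustration):

```
# Minimal sketch of such a submit file; gpu_job.sh is a placeholder
universe     = vanilla
executable   = gpu_job.sh
request_cpus = 4
request_gpus = 1
queue
```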

It seems like what we are missing is a GPU entry here:

https://github.com/holzman/glideinWMS/blob/master/creation/web_base/condor_startup.sh#L567

So that in the final config file we have:

SLOT_TYPE_1 = cpus=8, gpus=1

We tried to override SLOT_TYPE_1 in our frontend config, but condor_startup.sh does not allow it to be overridden.
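For reference, HTCondor can also treat GPUs as a custom machine resource via its `MACHINE_RESOURCE_<name>` knobs; a hedged sketch of what a final glidein config could look like under that scheme (the values here are illustrative, not the actual file condor_startup.sh generates):

```
# Sketch (illustrative values, not actual generated config): declare GPUs
# as a custom machine resource and fold them into the partitionable slot
# alongside the CPUs.
MACHINE_RESOURCE_GPUS = 1
SLOT_TYPE_1 = cpus=8, gpus=1
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1
```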

We also discussed using GLIDEIN_Resource_Slots, but that seems to just add a SLOT_TYPE_3 instead of replacing SLOT_TYPE_1, right?

So, any thoughts on how to do this correctly? Allow for appending to or overriding SLOT_TYPE_1? Have GLIDEIN_Resource_Slots control everything?

Thanks,

--
Mats Rynge
USC/ISI - Pegasus Team <http://pegasus.isi.edu>

History

#1 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

#2 Updated by Brian Bockelman over 4 years ago

  • Priority changed from Normal to Urgent

#3 Updated by Parag Mhashilkar over 4 years ago

Marco, please review the hotfix in v3/10092-hotfix. This should not be merged into the production branches, in favor of a better long-term solution.

#4 Updated by Marco Mambelli over 4 years ago

The hotfix seems correct, but it is only for partitionable slots, which is OK given the request.
Working on a solution in v3/10092_2.

#5 Updated by Marco Mambelli over 4 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Marco Mambelli to Parag Mhashilkar

Changes are in v3/10092_2

Notes:
- There is an extra commit. I tried using amend but it resulted in a merge ...
- condor_gpu_discovery (in sbin) is needed for autodiscovery to work. It is available only for HTCondor >= 8.1.6. Should we add it to the tarballs or let people build their own?
- If the # of GPUs is not specified and autodiscovery is not working (condor_gpu_discovery missing or returning an error, which is different from 0 GPUs found), then I'm assuming 1 GPU. Should I change this to 0?
- I'm not making use of the variable used in the hotfix. Should I?
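The fallback described in the second and third notes can be sketched as follows. This is a hypothetical illustration, not the actual condor_startup.sh code, and it assumes condor_gpu_discovery prints a line like `DetectedGPUs=<n>` (output format assumed; the tool ships with HTCondor >= 8.1.6):

```shell
#!/bin/sh
# Hypothetical sketch of the GPU-count fallback: NOT the actual
# condor_startup.sh code. Assumes condor_gpu_discovery reports a line
# of the form "DetectedGPUs=<n>".
detect_gpus() {
    # An explicitly configured GLIDEIN_GPUS wins over autodiscovery
    if [ -n "$GLIDEIN_GPUS" ]; then
        echo "$GLIDEIN_GPUS"
        return 0
    fi
    # Try autodiscovery; tolerate a missing tool or a failing run
    out=$(condor_gpu_discovery 2>/dev/null) || out=""
    n=$(printf '%s\n' "$out" | sed -n 's/^DetectedGPUs=\([0-9][0-9]*\).*/\1/p')
    if [ -n "$n" ]; then
        # A reported 0 is a valid answer, distinct from a failed discovery
        echo "$n"
    else
        # Discovery unavailable or unparsable: the current default is 1 GPU
        echo 1
    fi
}
```

Per the note above, the open question is whether that final fallback should echo 0 instead of 1.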

#6 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Parag Mhashilkar to Marco Mambelli

Sent feedback separately. The rest looks OK to merge.

#7 Updated by Marco Mambelli about 4 years ago

  • Status changed from Feedback to Resolved

Tested on HCC_US_Omaha_crane_gpu.
Changes in v3/10092_2 merged to branch_v3_2.
Opened #10910 to review how the slots layout is handled (resources added to main require partitionable slots).

#8 Updated by Marco Mambelli about 4 years ago

Here is a blurb that I sent to Joel about the use of the Lucille cluster for the testing of this ticket:

Use of Lucille to test a new feature for GlideinWMS.

The Glidein Workflow Management System [1] is a late-binding workflow manager used to run the majority of the jobs in the Open Science Grid. The glideins are pilot jobs that start on grid or cloud resources, validate the node, create a uniform environment, and advertise their capabilities (e.g. number of cores, available RAM and disk space) so that user jobs can make use of them. GlideinWMS works by provisioning heterogeneous resources via glideins and making them available to OSG users, creating an overlay cluster with thousands of CPUs.
A new feature being added to the next release of GlideinWMS (ticket #10092 [2]) allows glideins to auto-discover and advertise GPUs. There are only a few clusters in OSG that have GPUs. Some experiments, like IceCube, are using private clusters with GPUs and have expressed interest in using all the GPUs available on OSG [3]. This feature will make these resources more available to all OSG users.
The ability to submit glideins to lutgw1.lunet.edu was instrumental in adding this new feature. Lucille was one of the two clusters where I could test and validate the auto-discovery and advertising of GPUs. With this new feature we expect that more science collaborations will ask for opportunistic use of GPU resources on OSG, and clusters like Lucille will be more in demand.

[1] http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/index.html
[2] https://cdcvs.fnal.gov/redmine/issues/10092
[3] http://www.opensciencegrid.org/icecube-is-providing-a-unique-view-into-the-universe/

#9 Updated by Parag Mhashilkar about 4 years ago

  • Status changed from Resolved to Closed
