Support gpus as a resource that's analogous to cpus
Just like cpus, we can extend the scheme to look at GLIDEIN_GPUS and, if it is present, attach it as a resource to SLOT_TYPE_1.
Parag and Marco,
Derek and I have been working on getting GPU slots working for the OSG and GLOW VOs. We have factory and frontend entries and basic jobs are working, but we have now found a problem due to HTCondor not knowing that the GPU is a consumable item. This is important because users want to request a number of regular CPU cores to go along with the GPU, so we have partitionable slots enabled. For example, a user might do:
request_cpus = 4
request_gpus = 1
This might match a glidein with 8 CPUs and 1 GPU. As HTCondor does not account for the GPU, it might match another job like the first one, which will not work as the GPU is already in use.
It seems like what we are missing is a GPU entry here:
So that in the final config file we have:
SLOT_TYPE_1 = cpus=8, gpus=1
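For reference, a minimal sketch of the startd configuration we are aiming for, assuming an 8-CPU, 1-GPU glidein with partitionable slots enabled (the MACHINE_RESOURCE_GPUs line and the surrounding slot lines are my assumptions about the generated config, not actual condor_startup.sh output):

```
# Declare the GPU as a custom machine resource so HTCondor
# accounts for it as consumable, like cpus and memory
MACHINE_RESOURCE_GPUs = 1
# A single partitionable slot owning all CPUs and the GPU;
# a match for request_gpus=1 then leaves gpus=0 in the parent slot
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=8, gpus=1
SLOT_TYPE_1_PARTITIONABLE = True
```

With this in place, the second job in the example above would no longer match the glidein once its GPU is in use.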
We tried to override SLOT_TYPE_1 in our frontend config, but condor_startup.sh does not allow that to be overridden.
We also discussed using the GLIDEIN_Resource_Slots, but that seems to just add a SLOT_TYPE_3 instead of replacing SLOT_TYPE_1, right?
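For what it's worth, my understanding of the current GLIDEIN_Resource_Slots behavior, sketched as the condor config it would generate on an 8-CPU, 1-GPU node (the slot numbers and exact lines are assumptions based on the behavior described here, not verified output):

```
# Main slot, still unaware of the GPU
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = True
NUM_SLOTS_TYPE_1 = 1
# Additional slot created for the declared resource -- added
# alongside SLOT_TYPE_1 rather than replacing it
SLOT_TYPE_3 = gpus=1
NUM_SLOTS_TYPE_3 = 1
```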
So, any thoughts on how to do this correctly? Allow for appending to or overriding SLOT_TYPE_1? Have GLIDEIN_Resource_Slots control everything?
USC/ISI - Pegasus Team <http://pegasus.isi.edu>
#5 Updated by Marco Mambelli over 4 years ago
- Status changed from Assigned to Feedback
- Assignee changed from Marco Mambelli to Parag Mhashilkar
Changes are in v3/10092_2
- there is an extra commit. I tried to use amend but it resulted in a merge ...
- condor_gpu_discovery (in sbin) is needed for autodiscovery to work. It is available only for HTCondor >= 8.1.6. Should we add it to the tarballs or let people build their own?
- if the # of GPUs is not specified and autodiscovery is not working (condor_gpu_discovery missing or returning an error, as opposed to finding 0 GPUs), then I'm assuming 1 GPU. Should I change this to 0?
- I'm not making use of the variable used in the hotfix. Should I?
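To make the fallback logic above concrete, here is a hedged sketch (not the actual condor_startup.sh code) of how the GPU count could be resolved; the DetectedGPUs output format and the default of 1 are assumptions matching the behavior described in this comment:

```shell
#!/bin/sh
# Sketch: resolve the number of GPUs for the glidein.
# Order: explicit GLIDEIN_GPUS, then condor_gpu_discovery, then a default.
detect_gpus() {
    # 1. An explicit factory/frontend setting wins
    if [ -n "$GLIDEIN_GPUS" ]; then
        echo "$GLIDEIN_GPUS"
        return 0
    fi
    # 2. Autodiscovery (condor_gpu_discovery ships with HTCondor >= 8.1.6);
    #    its output is assumed here to contain a line like
    #    DetectedGPUs="CUDA0, CUDA1", or DetectedGPUs=0 when none are found
    if command -v condor_gpu_discovery >/dev/null 2>&1; then
        detected=$(condor_gpu_discovery 2>/dev/null | grep '^DetectedGPUs=')
        if [ -n "$detected" ]; then
            if [ "$detected" = 'DetectedGPUs=0' ]; then
                echo 0
            else
                # Count the comma-separated device names
                echo "$detected" | awk -F, '{print NF}'
            fi
            return 0
        fi
    fi
    # 3. Discovery missing or erroring: assume 1 GPU
    #    (the default questioned above; could become 0 instead)
    echo 1
}

detect_gpus
```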
#8 Updated by Marco Mambelli over 4 years ago
Here is a blurb that I sent to Joel about the use of the Lucille cluster for testing this ticket.
Use of Lucille to test a new feature for GlideinWMS.
The Glidein Workflow Management System is a late-binding workflow manager used to run the majority of the jobs in the Open Science Grid. The glideins are pilot jobs that start on grid or cloud resources, validate the node, create a uniform environment, and advertise their capabilities (e.g. number of cores, available RAM and disk space) so that user jobs can make use of them. GlideinWMS works by provisioning heterogeneous resources via glideins and making them available to OSG users, creating an overlay cluster with thousands of CPUs.
A new feature being added to the next release of GlideinWMS (ticket #10092) allows glideins to auto-discover and advertise GPUs. There are a few clusters in OSG that have GPUs. Some experiments, like IceCube, are using private clusters with GPUs and have expressed interest in using all the GPUs available on OSG. This feature will make these resources more available to all OSG users.
The ability to submit glideins to lutgw1.lunet.edu was instrumental in adding this new feature. Lucille was one of the two clusters where I could test and validate the auto-discovery and advertising of GPUs. With this new feature we expect that more science collaborations will ask for opportunistic use of GPU resources on OSG, and clusters like Lucille will be in greater demand.