Fixed partitioning broken
The fixed partitioning of multi-core glideins seems to be completely broken in 3_2_3.
The problem stems from the fact that the glideins try to allocate all of the memory to every slot,
which of course fails as soon as the second slot is requested.
#1 Updated by Igor Sfiligoi over 5 years ago
Example error message from Condor
03/07/14 20:28:53 (pid:25443) ERROR: Can't allocate 2nd slot of type 1
Requesting: slot type 1: Cpus: 1, Memory: 16000, Swap: auto, Disk: auto
Available: Slot #1: Cpus: 7, Memory: 0, Swap: 100.00%, Disk: 100.00%
03/07/14 20:28:53 (pid:25443) ERROR "Ran out of system resources" at line 122 in file /slots/10/dir_25749/userdir/src/condor_startd.V6/slot_builder.cpp
And here is the configuration Condor was working with:
MEMORY=16000
GLIDEIN_MaxMemMBs=16000
SLOT_TYPE_1 = cpus=1, memory=$(GLIDEIN_MaxMemMBs)
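The failure can be reproduced with a little arithmetic. The following is only a rough sketch of the bookkeeping, not the actual startd code: the function name and error text are made up for illustration, and the 8-core total is an assumption inferred from the "Cpus: 7" left over after the first slot in the log above.

```python
def allocate_fixed_slots(total_cpus, total_memory_mb, cpus_per_slot, memory_per_slot_mb):
    """Carve fixed slots one at a time out of a shared resource pool.

    Hypothetical model of the bookkeeping, not HTCondor's implementation.
    Returns a list of (cpus, memory_mb) tuples, or raises RuntimeError
    when a slot requests more memory than is left.
    """
    free_cpus = total_cpus
    free_memory_mb = total_memory_mb
    slots = []
    while free_cpus >= cpus_per_slot:
        if memory_per_slot_mb > free_memory_mb:
            raise RuntimeError(
                f"cannot allocate slot #{len(slots) + 1}: "
                f"requested {memory_per_slot_mb} MB, {free_memory_mb} MB available"
            )
        free_cpus -= cpus_per_slot
        free_memory_mb -= memory_per_slot_mb
        slots.append((cpus_per_slot, memory_per_slot_mb))
    return slots


if __name__ == "__main__":
    # Every slot asks for the full 16000 MB, as in the configuration above.
    try:
        allocate_fixed_slots(8, 16000, 1, 16000)
    except RuntimeError as err:
        print(err)
```

With each slot requesting the full 16000 MB, the first slot drains the pool and the second request finds 0 MB available, which matches the "Memory: 0" in the Available line of the log.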
#2 Updated by Igor Sfiligoi over 5 years ago
- Status changed from New to Feedback
- Assignee changed from Igor Sfiligoi to Parag Mhashilkar
I have fixed the problem by reverting to HTCondor auto-partitioning for fixed slots.
Since GLIDEIN_MaxMemMBs is used to set the total amount of memory the glidein can use, there is no need for further micromanagement.
The code is in branch v3/5621 (branched from v3_2_3 tag).
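For illustration, this is roughly what an even split of the total looks like. It is a simplified sketch under assumptions, not HTCondor's actual algorithm: the function name is hypothetical, the 8-slot count is inferred from the log in update #1, and real auto-partitioning may round or reserve resources differently.

```python
def split_memory_evenly(total_memory_mb, num_slots):
    """Divide the total memory equally among the slots (integer MB).

    Hypothetical helper for illustration only; any remainder from the
    integer division is simply left unassigned.
    """
    if num_slots <= 0:
        raise ValueError("num_slots must be positive")
    per_slot_mb = total_memory_mb // num_slots
    return [per_slot_mb] * num_slots


if __name__ == "__main__":
    # GLIDEIN_MaxMemMBs=16000 shared by 8 single-core slots (assumed count).
    shares = split_memory_evenly(16000, 8)
    print(shares[0], sum(shares))
```

Because the per-slot shares are derived from the total, their sum can never exceed GLIDEIN_MaxMemMBs, so no slot request runs out of memory.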