Bug #5621

Fixed partitioning broken

Added by Igor Sfiligoi over 5 years ago. Updated over 5 years ago.

Status:
Closed
Priority:
Urgent
Assignee:
Igor Sfiligoi
Category:
Glidein
Target version:
Start date:
03/10/2014
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

The fixed partitioning of multi-core glideins seems to be completely broken in 3_2_3.

The glideins try to allocate the entire memory to every slot,
so allocation fails as soon as the first slot has claimed it all.

History

#1 Updated by Igor Sfiligoi over 5 years ago

Example error message from Condor:

03/07/14 20:28:53 (pid:25443) ERROR: Can't allocate 2nd slot of type 1
        Requesting: slot type 1: Cpus: 1, Memory: 16000, Swap: auto, Disk: auto
        Available:  Slot #1: Cpus: 7, Memory: 0, Swap: 100.00%, Disk: 100.00%
03/07/14 20:28:53 (pid:25443) ERROR "Ran out of system resources" at line 122 in file /slots/10/dir_25749/userdir/src/condor_startd.V6/slot_builder.cpp

And here is the configuration Condor was working with:

MEMORY=16000
GLIDEIN_MaxMemMBs=16000
SLOT_TYPE_1 = cpus=1, memory=$(GLIDEIN_MaxMemMBs)
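
The failure can be reproduced with a small model of the allocation. This is a minimal sketch, not HTCondor's slot_builder code: the function name `allocate_fixed_slots` is hypothetical, and the total of 8 cores is inferred from the error above (7 CPUs remain after one 1-CPU slot has been carved out).

```python
def allocate_fixed_slots(total_cpus, total_memory_mb, cpus_per_slot, memory_per_slot_mb):
    """Carve identical slots out of the machine until the CPUs run out.

    Hypothetical model of fixed partitioning; raises RuntimeError when a
    slot asks for more memory than is still free.
    """
    free_cpus, free_mem = total_cpus, total_memory_mb
    slots = []
    while free_cpus >= cpus_per_slot:
        if memory_per_slot_mb > free_mem:
            raise RuntimeError(
                f"Can't allocate slot #{len(slots) + 1}: "
                f"requested {memory_per_slot_mb} MB, {free_mem} MB available"
            )
        free_cpus -= cpus_per_slot
        free_mem -= memory_per_slot_mb
        slots.append((cpus_per_slot, memory_per_slot_mb))
    return slots


# Every slot requests the full 16000 MB, as in the configuration above.
try:
    allocate_fixed_slots(total_cpus=8, total_memory_mb=16000,
                         cpus_per_slot=1, memory_per_slot_mb=16000)
except RuntimeError as err:
    print(err)  # Can't allocate slot #2: requested 16000 MB, 0 MB available
```

The first slot takes all 16000 MB, so the second request finds 0 MB free, which matches the "Memory: 0" in the Available line of the error.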

#2 Updated by Igor Sfiligoi over 5 years ago

  • Status changed from New to Feedback
  • Assignee changed from Igor Sfiligoi to Parag Mhashilkar

I have fixed the problem by reverting to HTCondor auto-partitioning for fixed slots.
Since GLIDEIN_MaxMemMBs is used to set the total amount of memory the glidein can use, there is no need for further micromanagement.
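
The arithmetic behind an even split can be sketched as follows. This illustrates the idea only and is not HTCondor's actual partitioning logic; `even_partition` is a hypothetical helper, and the 8-core count is inferred from the error message in comment #1.

```python
def even_partition(total_cpus, total_memory_mb):
    """Give each single-core slot an equal share of the total memory.

    Hypothetical illustration of an even split; integer division drops
    any remainder.
    """
    per_slot_mb = total_memory_mb // total_cpus
    return [(1, per_slot_mb) for _ in range(total_cpus)]


slots = even_partition(total_cpus=8, total_memory_mb=16000)
print(len(slots), slots[0])  # 8 (1, 2000)
```

With the total capped at GLIDEIN_MaxMemMBs, the per-slot shares sum to no more than the cap, so no per-slot memory setting is needed.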

The code is in branch v3/5621 (branched from v3_2_3 tag).

Please review.

#3 Updated by Parag Mhashilkar over 5 years ago

  • Status changed from Feedback to Closed
  • Assignee changed from Parag Mhashilkar to Igor Sfiligoi

Merged into branch_v3_2.


