Bug #22151

--blacklist option doesn't block sites like it's supposed to do

Added by Shreyas Bhat 4 months ago. Updated 2 months ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 03/20/2019
Due date:
% Done: 100%
Estimated time: (Total: 0.00 h)
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

A user reported in INC000001040864 that they had tried to use the --blacklist option of jobsub to block the site "Pisa", but the jobs ran there anyway.

Upon further investigation of the job, we can see in its main .cmd file that the requirements string was set to:

requirements  = target.machine =!= MachineAttrMachine1 && target.machine =!= MachineAttrMachine2 && (stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites))  && (isUndefined(DesiredOS) || stringListsIntersect(toUpper(DesiredOS),IFOS_installed)) && (stringListsIntersect(toUpper(target.HAS_usage_model), toUpper(my.DESIRED_usage_model))) && (TARGET.HAS_CVMFS_gm2_opensciencegrid_org==true)

The stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites) clause there is exactly the opposite of what we intend: we're actually selecting the site that's been blacklisted, rather than excluding it. The offending line is in lib/groupsettings/JobSettings.py, line 1847:

            _default = '&& (stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites)) '

That is the default that gets added to the job requirements if a job has --blacklist specified. This line needs to be changed to something like:

            _default = '&& (stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites) == FALSE) '
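To illustrate why the missing negation matters, here is a minimal Python sketch that mimics the two clauses. The helper string_list_imember is a hypothetical stand-in for the ClassAd function stringListIMember (simplified here to a case-insensitive lookup in a comma-separated list); none of these functions are part of jobsub or HTCondor, and "SomewhereElse" is a made-up site name.

```python
def string_list_imember(item, string_list):
    """Hypothetical stand-in for ClassAd stringListIMember:
    case-insensitive membership test in a comma-separated list."""
    members = [m.strip().lower() for m in string_list.split(",")]
    return item.lower() in members


def buggy_clause(glidein_site, blacklist_sites):
    # Mirrors the original default: true for slots AT a blacklisted site.
    return string_list_imember(glidein_site, blacklist_sites)


def fixed_clause(glidein_site, blacklist_sites):
    # Mirrors the proposed default: true for slots NOT at a blacklisted site.
    return not string_list_imember(glidein_site, blacklist_sites)


blacklist = "Pisa"
for site in ("Pisa", "SomewhereElse"):
    print(site, buggy_clause(site, blacklist), fixed_clause(site, blacklist))
# prints:
# Pisa True False
# SomewhereElse False True
```

With the original clause, a slot at Pisa is the only one that satisfies the requirement, which matches what the user observed.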

Subtasks

Bug #22178: Review request [commit:b92ca687b7b0e5ff4c506cb13f4da1d81978cbda: If --site or --blacklist are specified in the command line, don't run if the slot's GLIDEIN_Site is undefined] (Status: Closed, Assignee: Dennis Box)

History

#1 Updated by Shreyas Bhat 4 months ago

  • Assignee set to Shreyas Bhat

#2 Updated by Shreyas Bhat 4 months ago

  • % Done changed from 0 to 50
  • Status changed from New to Work in progress

Change made and pushed to a branch on Redmine. Testing will be a bit unusual for this one, since we can't replicate sites on our dev machines as easily as we can test this fully in ITB. The procedure will be as follows:

1) Test on my dev machines that the correct job requirements string is generated in the classad.
2) Get this into a RC.
3) When we're ready to release the next version, run a test specific to this ticket: blacklist a site and make sure the job doesn't go there; then blacklist a site and select the same site, and make sure that the job doesn't run.

#3 Updated by Shreyas Bhat 4 months ago

(1) is done. Any additional code changes will be committed shortly.

Dennis and I decided to change one behavior: if --site or --blacklist is specified and GLIDEIN_Site is not defined in a slot, the job will not match that slot, so it will not run there.

Here are the notes from testing that:

Site:

&& ((isUndefined(target.GLIDEIN_Site) == FALSE) && (stringListIMember(target.GLIDEIN_Site,my.DESIRED_Sites)))

Test 1: select the site
Test 2: don't select the site

Both should run

Blacklist:

&& ((isUndefined(target.GLIDEIN_Site) == FALSE) && (stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites) == FALSE))

Test 3: blacklist somewhere else (job should run)
Test 4: blacklist site (job should not run)
Test 5: blacklist site AND others (job should not run)

Test undefined:

Remove GLIDEIN_Site from condor config, restart condor

Test 6: --site specified (job should not run)
Test 7: --blacklist specified (job should not run)
Test 8: neither --site nor --blacklist specified (job should run)

While this is proceeding, the jobs from tests (4) and (5) should stay idle.
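The test plan above can be summarized as a small truth table. The following Python sketch is only a model of the two site-related clauses, not jobsub code: slot_matches and in_list are hypothetical names, None stands in for an undefined GLIDEIN_Site, and all other requirement terms are ignored. It assumes a job submitted without --site or --blacklist carries no site clause at all.

```python
def in_list(item, string_list):
    # Simplified stand-in for stringListIMember (case-insensitive, comma-separated).
    return item.lower() in [m.strip().lower() for m in string_list.split(",")]


def slot_matches(glidein_site, desired_sites=None, blacklist_sites=None):
    """Model of the --site / --blacklist clauses for one slot.

    glidein_site is None when the slot does not define GLIDEIN_Site.
    An option left as None adds no clause to the requirements.
    """
    if desired_sites is not None:
        # (isUndefined(GLIDEIN_Site) == FALSE) && stringListIMember(..., DESIRED_Sites)
        if glidein_site is None or not in_list(glidein_site, desired_sites):
            return False
    if blacklist_sites is not None:
        # (isUndefined(GLIDEIN_Site) == FALSE) && (stringListIMember(..., Blacklist_Sites) == FALSE)
        if glidein_site is None or in_list(glidein_site, blacklist_sites):
            return False
    return True


# (test number, slot's GLIDEIN_Site, job options, should the job run?)
EXPECTED = [
    (1, "MYTESTSITE", {"desired_sites": "MYTESTSITE"}, True),
    (2, "MYTESTSITE", {}, True),
    (3, "MYTESTSITE", {"blacklist_sites": "MYTESTSITEFAKE"}, True),
    (4, "MYTESTSITE", {"blacklist_sites": "MYTESTSITE"}, False),
    (5, "MYTESTSITE", {"blacklist_sites": "MYTESTSITE,MYOTHERTESTSITE"}, False),
    (6, None, {"desired_sites": "MYTESTSITE"}, False),
    (7, None, {"blacklist_sites": "MYTESTSITE"}, False),
    (8, None, {}, True),
]

for number, slot_site, options, should_run in EXPECTED:
    assert slot_matches(slot_site, **options) == should_run, number
```

Because the undefined check comes first, the model never has to decide what the membership test would return for an undefined site.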


On server:

[root@fermicloud074 groupsettings]# condor_status -af GLIDEIN_Site
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE
MYTESTSITE

Site tests:

(1)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov --site MYTESTSITE file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 303.0@fermicloud074.fnal.gov to retrieve output

(2)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 304.0@fermicloud074.fnal.gov to retrieve output

Blacklist Tests:

(3)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov --blacklist MYTESTSITEFAKE file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 305.0@fermicloud074.fnal.gov to retrieve output

(4)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov --blacklist MYTESTSITE file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 306.0@fermicloud074.fnal.gov to retrieve output

(5)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov --blacklist MYTESTSITE,MYOTHERTESTSITE file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 307.0@fermicloud074.fnal.gov to retrieve output

Test results:

[root@fermicloud074 groupsettings]# condor_q

-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 303.0   sbhat           3/20 12:08   0+00:05:10 R  0   0.0  probe_20190320_120
 304.0   sbhat           3/20 12:09   0+00:04:50 R  0   0.0  probe_20190320_120
 305.0   sbhat           3/20 12:10   0+00:03:49 R  0   0.0  probe_20190320_121
 306.0   sbhat           3/20 12:10   0+00:00:00 I  0   0.0  probe_20190320_121
 307.0   sbhat           3/20 12:13   0+00:00:00 I  0   0.0  probe_20190320_121

This is what we expect. All but 306 and 307 should run. Looking at 306 more closely:

[root@fermicloud074 groupsettings]# condor_q 306 -better-analyze

-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
User priority for sbhat@fermicloud074.fnal.gov is not available, attempting to analyze without it.
---
306.000:  Run analysis summary.  Of 20 machines,
     20 are rejected by your job's requirements
      0 reject your job because of their own requirements
      0 match and are already running your jobs
      0 match but are serving other users
      0 are available to run your job
    No successful match recorded.
    Last failed match: Wed Mar 20 12:12:02 2019

    Reason for last match failure: no match found

WARNING:  Be advised:
   No resources matched request's constraints

The Requirements expression for your job is:

    ( target.machine isnt MachineAttrMachine1 &&
      target.machine isnt MachineAttrMachine2 &&
      ( ( isUndefined(target.GLIDEIN_Site) == false ) &&
        ( stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites) == false ) ) &&
      ( isUndefined(DesiredOS) ||
        stringListsIntersect(toUpper(DesiredOS),IFOS_installed) ) ) &&
    ( TARGET.Arch == "X86_64" ) && ( TARGET.OpSys == "LINUX" ) &&
    ( TARGET.Disk >= RequestDisk ) && ( TARGET.Memory >= RequestMemory ) &&
    ( TARGET.HasFileTransfer )

Your job defines the following attributes:

    Blacklist_Sites = "MYTESTSITE" 
    DiskUsage = 20
    ImageSize = 7
    RequestDisk = 20
    RequestMemory = 1

The Requirements expression for your job reduces to these conditions:

         Slots
Step    Matched  Condition
-----  --------  ---------
[0]          20  target.machine isnt MachineAttrMachine1
[1]          20  target.machine isnt MachineAttrMachine2
[4]           0  stringListIMember(target.GLIDEIN_Site,my.Blacklist_Sites) == false
[7]          20  isUndefined(DesiredOS)

Suggestions:

    Condition                         Machines Matched    Suggestion
    ---------                         ----------------    ----------
1   ( ( isUndefined(target.GLIDEIN_Site) == false ) && ( stringListIMember(target.GLIDEIN_Site,"MYTESTSITE") == false ) )
                                      0                   REMOVE
2   target.machine isnt MachineAttrMachine1
                                      20
3   target.machine isnt MachineAttrMachine2
                                      20
4   ( isUndefined(DesiredOS) || stringListsIntersect(toUpper(DesiredOS),IFOS_installed) )
                                      20
5   ( TARGET.Arch == "X86_64" )       20
6   ( TARGET.OpSys == "LINUX" )       20
7   ( TARGET.Disk >= 20 )             20
8   ( TARGET.Memory >= ifthenelse(MemoryUsage isnt undefined,MemoryUsage,1) )
                                      20
9   ( TARGET.HasFileTransfer )        20

Good. 306 can't run because we say in our job that we don't want MYTESTSITE. The better-analyze output from 307 looks similar.

The other jobs ran but these won't. That's what we expect. I'm going to turn off GLIDEIN_Site, and these should stay idle because of the first condition, isUndefined(target.GLIDEIN_Site) == FALSE.

[root@fermicloud074 groupsettings]# condor_q

-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 306.0   sbhat           3/20 12:10   0+00:00:00 I  0   0.0  probe_20190320_121
 307.0   sbhat           3/20 12:13   0+00:00:00 I  0   0.0  probe_20190320_121

2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

All good:

[root@fermicloud074 groupsettings]# condor_status -af GLIDEIN_Site
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined
undefined

[root@fermicloud074 groupsettings]# condor_q

-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 306.0   sbhat           3/20 12:10   0+00:00:00 I  0   0.0  probe_20190320_121
 307.0   sbhat           3/20 12:13   0+00:00:00 I  0   0.0  probe_20190320_121

2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

(6)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov --site MYTESTSITE file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 308.0@fermicloud074.fnal.gov to retrieve output

(7)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov --blacklist MYTESTSITE file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 309.0@fermicloud074.fnal.gov to retrieve output

(8)

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit -G nova --jobsub-server=fermicloud074.fnal.gov file:///home/sbhat/probe SLEEP 600 | grep "Use job id" 
Use job id 310.0@fermicloud074.fnal.gov to retrieve output

We see exactly what we expect. 310 starts to run, but 306-309 don't:

[root@fermicloud074 groupsettings]# condor_q

-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 306.0   sbhat           3/20 12:10   0+00:00:00 I  0   0.0  probe_20190320_121
 307.0   sbhat           3/20 12:13   0+00:00:00 I  0   0.0  probe_20190320_121
 308.0   sbhat           3/20 12:37   0+00:00:00 I  0   0.0  probe_20190320_123
 309.0   sbhat           3/20 12:37   0+00:00:00 I  0   0.0  probe_20190320_123
 310.0   sbhat           3/20 12:37   0+00:00:50 R  0   0.0  probe_20190320_123

About 25 minutes later, still good:

[root@fermicloud074 groupsettings]# condor_q

-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 306.0   sbhat           3/20 12:10   0+00:00:00 I  0   0.0  probe_20190320_121
 307.0   sbhat           3/20 12:13   0+00:00:00 I  0   0.0  probe_20190320_121
 308.0   sbhat           3/20 12:37   0+00:00:00 I  0   0.0  probe_20190320_123
 309.0   sbhat           3/20 12:37   0+00:00:00 I  0   0.0  probe_20190320_123

4 jobs; 0 completed, 0 removed, 4 idle, 0 running, 0 held, 0 suspended

We'll call this a success.
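For repeating this check later, the condor_q tables above could be compared against the expected states with a few lines of Python. This is a rough sketch under stated assumptions: job_states and ROW are hypothetical names, and the pattern assumes the column layout shown in this ticket (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST), which differs between HTCondor versions and condor_q options.

```python
import re

# One data row: ID, OWNER, submit date, submit time, RUN_TIME, then the ST column.
ROW = re.compile(r"^\s*(\d+\.\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+(\S+)\s")


def job_states(condor_q_text):
    """Return {job id: state letter} for each data row in the table."""
    states = {}
    for line in condor_q_text.splitlines():
        match = ROW.match(line)
        if match:
            states[match.group(1)] = match.group(2)
    return states


# Table copied from the condor_q output taken right after tests (6) to (8).
SAMPLE = """
-- Schedd: fermicloud074.fnal.gov : <131.225.154.254:64758>
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
 306.0   sbhat           3/20 12:10   0+00:00:00 I  0   0.0  probe_20190320_121
 307.0   sbhat           3/20 12:13   0+00:00:00 I  0   0.0  probe_20190320_121
 308.0   sbhat           3/20 12:37   0+00:00:00 I  0   0.0  probe_20190320_123
 309.0   sbhat           3/20 12:37   0+00:00:00 I  0   0.0  probe_20190320_123
 310.0   sbhat           3/20 12:37   0+00:00:50 R  0   0.0  probe_20190320_123
"""

states = job_states(SAMPLE)
idle = sorted(job for job, state in states.items() if state == "I")
running = sorted(job for job, state in states.items() if state == "R")
print(idle)     # ['306.0', '307.0', '308.0', '309.0']
print(running)  # ['310.0']
```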

#4 Updated by Shreyas Bhat 4 months ago

  • Start date changed from 03/18/2019 to 03/20/2019
  • Due date set to 03/20/2019

due to changes in a related task: #22178

#5 Updated by Shreyas Bhat 4 months ago

  • Status changed from Work in progress to Feedback

#6 Updated by Dennis Box 3 months ago

  • Target version set to v1.3
  • Status changed from Feedback to Resolved

Merged this but never updated the tickets.


