Roadmap for Singularity support
Through a discussion with GlideinWMS stakeholders we want to define and implement GWMS behavior regarding Singularity.
A first meeting was on 3/13:
Attending: Mats, Brian, Joe, Dave Dykstra, Dennis, Lorena, Mambelli, Mascheroni, Parag, Shreyas, Jeff, Edgar, Antonio, Tony
- Marco Mambelli explained how things work right now
Do a test for FIFE, using default images (would like to work out of the box)
Singularity activation and binary
glexec_bin was necessary before OSG was distributed as RPMs, when it was necessary to distinguish between OSG and LCG. Can we rely on PATH?
BB: It would be a good improvement to move away from SINGULARITY_BIN
Mambelli: We need a trigger, to disable or enforce via Factory
Antonio: We need an independent trigger. On CMS we need to be able to set it from the VO side.
Mats: The OSG VO is running with Singularity wherever available. If it is required, we need to be careful when there are worker nodes with issues.
JD: Sites are not enabling it for the entire cluster; some do, some don't.
MR: Make "if available" the default for the VO, use it, and validate.
JD: If a site wants, we can turn it off. The default is on.
TT: Fermi security requires Singularity to be set as required.
BB: Reluctant to give sites the option to set the path in the environment. If we give them the option, it creates more work for Jeff.
RESULT: BIN SHOULD BE AUTO DISCOVERED. USE OSG AS REFERENCE FOR HOW TO FIND IT.
FACTORY HAS SINGLE VARIABLE TO MIRROR THE FRONTEND OPTION required/optional(if available)/never(turn it off). THIS WILL REPLACE REQUIRE_SINGULARITY
RESULT (rephrased): No singularity_bin (better not to give even the option to avoid the possibility of extra work for Factory operators with error-prone configurations),
GWMS will auto-discover based on standard ways: path, environment and module, like OSG is doing.
A script sets up grid environment and sets up the path environment, before GWMS starts.
If the CE/worker node is not configured correctly, Singularity will be unavailable and jobs may fail (depending on site/job requirements)
It is up to the facility to maintain things correctly.
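A minimal sketch of the auto-discovery described above, assuming a PATH lookup followed by a list of extra candidate directories. This is not the OSG or GWMS implementation: the function name `find_singularity` and the `extra_dirs` parameter are hypothetical, and the module-based lookup is left out because it depends on how each site sets up modules.

```python
import os
import shutil


def find_singularity(path=None, extra_dirs=()):
    """Return the full path of an executable 'singularity', or None.

    path: search path string; defaults to the PATH environment variable.
    extra_dirs: hypothetical extra directories tried after PATH.
    """
    search = path if path is not None else os.environ.get("PATH", "")
    # First choice: whatever the (already set up) PATH provides.
    found = shutil.which("singularity", path=search)
    if found:
        return found
    # Fallback: explicit candidate directories, in the given order.
    for directory in extra_dirs:
        candidate = os.path.join(directory, "singularity")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Not found: Singularity is treated as unavailable on this node.
    return None
```

A `None` result corresponds to the "Singularity will be unavailable" case; whether that fails the Glidein depends on the required/optional setting.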
QUESTION (MM): If singularity (bin, not images) is in CVMFS, should it be used also if the facility is not configured for it (not in PATH, module)? If yes, how will GWMS know about it?
MT: OSG has multiple defaults, for EL6 and EL7, and provides automatic ways to transition from one EL to the other. OSG uses images from CVMFS repositories, but looking at the repository there is no way to find out what the default is; that is done by a pre-script. Some percentage of nodes start with one image, some with the other (the script selects at random using given percentages).
MT: We will have to auto-detect default images. How do we make the decision?
BB: An attribute, OSG_IMG=path, or a hook to provide a bash function. CMS and OSG differ in how they look for images, and a batch script is needed to detect them on the fly.
MM: We can make a community-defined image that can be overwritten. We need it if we want GWMS to work out of the box for VOs that know nothing about Singularity.
BB: WE WILL PUT A DEFAULT IMG SOMEWHERE AND THE FRONTEND WILL ALLOW IT TO BE OVERWRITTEN
MR: I sense a conflict between the variable being set and CVMFS. You also need to check for CVMFS.
BB: Part of validation is to check that it is accessible.
BB: CMS and OSG have common heritage for the scripts.
Discussion on DEFAULT, DEFAULT6 and DEFAULT7
MR: DEFAULT is needed for validation. Then, when a job comes in, it is fine if it wants to switch later.
Need to pass a basic test with the Singularity binary and the default image; if it fails, set the Singularity availability to fail.
BB: What happens if you are in the job wrapper and the Singularity binary becomes inaccessible?
Need to give users a hook to do validation.
Job wrapper needs 2 customizations: 1. pre-launch actions; 2. failure handling (what we should do if failures happen).
Edgar: Would like validation to run outside Singularity first and then inside Singularity. In Singularity, first run inside the default image, then re-run when the job arrives. This also applies while the job is running.
Periodic checks against the default image verify that Singularity works. Periodic validation should trigger draining of jobs in case of failures.
Edgar: Validation scripts are executed twice: outside Singularity first, then in Singularity before the job.
A periodic script inside Singularity is needed to check the status and also to keep CVMFS alive (e.g. something weird happens on Lustre, Singularity stops working)
Current VO practices:
- GWMS currently asks for a script that finds and sets the default images (VO provides a bash function)
- CMS looks for job requests
- OSG: some percentage of nodes start w/ one image, some w/ the other (script selects at random between 2 images using given percentages)
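The percentage-based selection described for OSG can be sketched as a weighted random choice. This only illustrates the idea and is not the actual OSG script: the function name `pick_default_image` and the image paths in the usage line are hypothetical.

```python
import random


def pick_default_image(images, rng=random):
    """Pick one image path at random, weighted by the given percentages.

    images: mapping of image path -> percentage of nodes that should use it.
    rng: random source; pass random.Random(seed) for reproducible picks.
    """
    paths = list(images)
    weights = [images[p] for p in paths]
    # random.choices draws with replacement; k=1 gives a single pick.
    return rng.choices(paths, weights=weights, k=1)[0]
```

For example, `pick_default_image({"/cvmfs/example/el6-image": 30, "/cvmfs/example/el7-image": 70})` would start roughly 30% of the nodes with the first (hypothetical) image and 70% with the second.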
PROPOSED to discuss and define:
OSG will maintain a default image (one or one per OS, EL6, EL7 - limited matrix); put it somewhere (not in the release, e.g. in CVMFS) and make it available
GWMS default script will pick an OSG provided default image
VO and Job both can override these default images
- VO can provide a setting in the Frontend or a script
- Job will have an attribute (optional) to select the image
If the default image (as specified by GWMS and VO, not job) is not available Singularity should fail
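One possible reading of the override order above, as a sketch: the VO setting overrides the OSG default, the job attribute overrides both, and a missing default image fails Singularity regardless of what the job asks for. The function name `select_image`, its parameters, and the `exists` hook are hypothetical; how the Frontend setting and the job attribute are actually passed is still to be defined.

```python
import os


def select_image(job_image=None, vo_image=None, osg_default=None,
                 exists=os.path.exists):
    """Return the image to use, or raise if the default is unavailable.

    The default is the VO image if set, otherwise the OSG-provided one.
    The job image, if any, overrides the default after the default check.
    """
    default = vo_image or osg_default
    # The default (GWMS/VO level, not job level) must be available.
    if default is None or not exists(default):
        raise RuntimeError("default Singularity image not available")
    return job_image or default
```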
Tests should be able to run:
- outside Singularity
- in Singularity w/ the default image (during Glidein activation, before being available for matching)
- in Singularity before the job, with the image the job will run in
GWMS should provide some tests
Should be able to add tests in any of the 3 places via Frontend (VO) and Factory (Facility) configuration
If the tests fail Singularity should fail
If Singularity is required by the site or job, the Glidein will fail if Singularity fails; if it is optional, then the job will run w/o Singularity
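The rule above can be written down as a small decision function. This is a sketch under the assumption that any failed test counts as a Singularity failure; the function name `singularity_outcome` and the returned strings are hypothetical, and the "never" setting is not covered.

```python
def singularity_outcome(required, test_results):
    """Decide what happens after the Singularity tests.

    required: True if the site or the job requires Singularity.
    test_results: iterable of booleans, one per test, from any of the
                  three places (outside, default image, job image).
    """
    if all(test_results):
        return "run in Singularity"
    # At least one test failed, so Singularity is considered failed.
    if required:
        return "fail"
    return "run without Singularity"
```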
QUESTION: Should tests all run in the standard Singularity invocation or be able to pass different parameters (command line flags, CVMFS mounts, bind mounts, ...)?
QUESTION: Should VO decide what to do in case of failure? One option, different actions for different tests, ...
Things to consider:
- Tests before matching affect only the Glidein; should anti-blackhole measures be in place?
- Tests running periodically or before the job will affect also other jobs already running on the node:
  - Should the Glidein kill all and quit?
  - Should there be a timeout (graceful shutdown)?
  - Should it retire and drain, letting running jobs complete?
A follow-up meeting and GWMS development activities need to be scheduled
#2 Updated by Marco Mambelli over 1 year ago
- Support mechanism for VO & possibly site to specify list of potential bind mountpoints for singularity [#17970] WIP
- Include unprivileged singularity in pilot software [#17560] - TO_DISCUSS - should we deliver Singularity w/ the Glidein?
- Add a variable for VOs to add extra singularity bind mount points [#20030] WIP