Milestone #19515

Roadmap for Singularity support

Added by Marco Mambelli over 1 year ago. Updated over 1 year ago.

Status: New
Priority: Normal
Category: -
Target version:
Start date: 03/27/2018
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

Through a discussion with GlideinWMS stakeholders we want to define and implement GWMS behavior regarding Singularity.

A first meeting was on 3/13:

Attending: Mats, Brian, Joe, Dave Dykstra, Dennis, Lorena, Mambelli, Mascheroni, Parag, Shreyas, Jeff, Edgar, Antonio, Tony

- Marco Mambelli explained how things work right now
https://docs.google.com/document/d/1WT33Y32H49a9sovLj3gA-_1xMM5DCyVU4Kny_rH5uiw/edit?usp=sharing

singularity.sh
Do a test for FIFE, using default images (would like to work out of the box)

Singularity activation and binary

glexec_bin was needed before OSG was distributed as RPMs, when it was necessary to distinguish between OSG and LCG. Can we rely on PATH?
BB: It would be a good improvement to move away from SINGULARITY_BIN

Mambelli: We need a trigger to disable or enforce Singularity via the Factory.
Antonio: We need an independent trigger. For CMS we need to be able to set it from the VO side.
Mats: The OSG VO runs with Singularity wherever it is available. If it is required, we need to be careful when there are worker nodes with issues.
JD: Sites are not always enabling it for the entire cluster; some do, some don't.
MR: Make "if available" the default for the VO, use it and validate.
JD: If a site wants, we can turn it off. The default is on.
TT: Fermi security requires Singularity to be set as required.
BB: Reluctant to give sites the option to set the path or environment. If we give them the option, it means more work for Jeff.

RESULT: BIN SHOULD BE AUTO DISCOVERED. USE OSG AS REFERENCE FOR HOW TO FIND IT.
FACTORY HAS SINGLE VARIABLE TO MIRROR THE FRONTEND OPTION required/optional(if available)/never(turn it off). THIS WILL REPLACE REQUIRE_SINGULARITY

RESULT (rephrased): No singularity_bin (better not to give even the option, to avoid extra work for Factory operators caused by error-prone configurations).
GWMS will auto-discover the binary in the standard ways (PATH, environment and module), as OSG does.
A script sets up the grid environment and the PATH before GWMS starts.
If the CE/worker node is not configured correctly, Singularity will be unavailable and jobs may fail (depending on site/job requirements).
It is up to the facility to maintain things correctly.
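
A minimal sketch of the auto-discovery described above, assuming a plain PATH lookup followed by an optional list of extra directories. The function name find_singularity and the fallback_dirs parameter are hypothetical; the notes do not list any specific paths, and loading an environment module is not covered here.

```python
import os
import shutil


def find_singularity(env=None, fallback_dirs=()):
    """Return the path of a usable singularity binary, or None.

    fallback_dirs is a hypothetical hook for extra locations
    (e.g. directories exposed by an environment module).
    """
    env = os.environ if env is None else env
    # 1. Standard lookup in PATH, as set up by the site before GWMS starts.
    found = shutil.which("singularity", path=env.get("PATH", ""))
    if found:
        return found
    # 2. Extra directories, checked in the order given.
    for directory in fallback_dirs:
        candidate = os.path.join(directory, "singularity")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Nothing found: Singularity is treated as unavailable on this node.
    return None
```

A None result corresponds to the "Singularity will be unavailable" case; whether the job then fails depends on the site/job requirements.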

QUESTION (MM): If singularity (bin, not images) is in CVMFS, should it be used also if the facility is not configured for it (not in PATH, module)? If yes, how will GWMS know about it?

Singularity images

MT: OSG has multiple defaults, for EL6 and EL7, and provides automatic ways to transition from one EL to the other. OSG uses images from CVMFS repos, but looking at the repo there is no way to find out what the default is; that is done by a pre-script. Some percentage of nodes start w/ one image, some w/ the other (the script selects at random using given percentages).
MT: We will have to auto-detect default images. How do we make the decision?
BB: An attribute, OSG_IMG=path, or a hook to provide a bash function. CMS and OSG differ in how they look for images, and they need a batch script to detect them on the fly.
MM: We can make a community-defined image that can be overridden. We need it if we want GWMS to work out of the box for VOs that know nothing about Singularity.
BB: WE WILL PROVIDE A DEFAULT IMAGE, PUT IT SOMEWHERE, AND THE FRONTEND WILL ALLOW IT TO BE OVERRIDDEN
MR: I sense a conflict between the variable being set and CVMFS. You also need to check for CVMFS.
BB: Part of validation is to check whether it is accessible.
BB: The CMS and OSG scripts have a common heritage.

Discussion on DEFAULT, DEFAULT6 and DEFAULT7
MR: DEFAULT is needed for validation. If a job that comes in later wants to switch image, that is fine.

The Glidein needs to pass a basic test with the Singularity binary and the default image; if the test fails, Singularity availability is set to failed.
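
As an illustration of such a basic test, here is a sketch that runs a trivial command inside the default image and reports success or failure. It assumes the image contains a true command; the function name singularity_works, the run parameter (there only to make the check easy to stub) and the 60-second timeout are hypothetical choices, not something decided in the meeting.

```python
import subprocess


def singularity_works(singularity_bin, image, run=subprocess.run, timeout=60):
    """Return True if a trivial command runs inside the image."""
    try:
        result = run(
            [singularity_bin, "exec", image, "true"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Missing or non-executable binary, or a hung invocation.
        return False
    return result.returncode == 0
```

The same check could be reused for the periodic validation discussed below, with the caller deciding what a False result triggers.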

BB: What happens if you are in the job wrapper and the Singularity binary becomes inaccessible?
We need to give users a hook to do validation.
Job wrapper: 2 customizations: 1. pre-launch actions 2. what to do when failures happen.
Edgar: Would like validation to run outside Singularity first and then inside Singularity. Inside Singularity, run first in the default image, then re-run when the job arrives. This also applies while the job is running.

Periodic checks against the default image verify that Singularity still works. Periodic validation should trigger draining of jobs in case of failures.

Edgar: Validation scripts are executed twice: outside Singularity first, then in Singularity before the job.

A periodic script inside Singularity is needed to check the status and also to keep CVMFS alive (e.g. something weird happens on Lustre and Singularity stops working).

Current VO practices:
- GWMS currently asks for a script that finds and sets the default images (VO provides a bash function)
- CMS looks for job requests
- OSG: some percentage of nodes start w/ one image, some w/ the other (the script selects at random between 2 images using given percentages)
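
The OSG practice of choosing between two images by percentage could be sketched roughly as follows. This is not the OSG script: the function name, the image paths and the 30/70 split are placeholders for illustration only.

```python
import random


def pick_default_image(weighted_images, rng=random):
    """Pick one image at random, honoring the given percentages.

    weighted_images maps image path -> weight (e.g. a percentage).
    """
    images = list(weighted_images)
    weights = [weighted_images[image] for image in images]
    return rng.choices(images, weights=weights, k=1)[0]


# Placeholder paths and percentages, not actual OSG values.
EXAMPLE_IMAGES = {"/path/to/el6-image": 30, "/path/to/el7-image": 70}
```

Moving the percentages over time is what lets a VO transition gradually from one EL to the other.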

PROPOSED to discuss and define:
OSG will maintain a default image (one or one per OS, EL6, EL7 - limited matrix); put it somewhere (not in the release, e.g. in CVMFS) and make it available
GWMS default script will pick an OSG provided default image
VO and Job both can override these default images
- VO can provide a setting in the Frontend or a script
- Job will have an attribute (optional) to select the image
If the default image (as specified by GWMS and VO, not job) is not available Singularity should fail
Tests should be able to run:
- outside Singularity
- in Singularity w/ the default image (during Glidein activation, before being available for matching)
- in Singularity before the job, with the image the job will run in
GWMS should provide some tests
Should be able to add tests in any of the 3 places via Frontend (VO) and Factory (Facility) configuration
If the tests fail Singularity should fail
If Singularity is required by the site or job, the Glidein will fail if Singularity fails; if it is optional, then the job will run w/o Singularity
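
To make the proposal above easier to discuss, here is a sketch of the image override order and of the required/optional/never outcome. It is only one possible reading of the proposal: the function names, the returned action strings and the use of a file-existence check for "available" are all assumptions.

```python
import os

MODES = ("required", "optional", "never")


def select_image(default_image, vo_image=None, job_image=None,
                 exists=os.path.exists):
    """Return the image to use, or None if Singularity should fail.

    The job overrides the VO, and the VO overrides the default image.
    """
    baseline = vo_image or default_image
    # The baseline (GWMS/VO default, not the job's choice) must be available.
    if not baseline or not exists(baseline):
        return None
    return job_image or baseline


def decide(mode, singularity_ok):
    """Map the configured mode and the test outcome to an action."""
    if mode not in MODES:
        raise ValueError("unknown Singularity mode: %r" % (mode,))
    if mode == "never":
        return "run_without_singularity"
    if singularity_ok:
        return "run_in_singularity"
    # Singularity failed: fatal only when it is required.
    if mode == "required":
        return "fail_glidein"
    return "run_without_singularity"
```

For example, decide("optional", False) falls back to running without Singularity, while decide("required", False) fails the Glidein.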

QUESTION: Should all tests run with the standard Singularity invocation, or should they be able to pass different parameters (command line flags, CVMFS mounts, bind mounts, ...)?

QUESTION: Should VO decide what to do in case of failure? One option, different actions for different tests, ...
Things to consider:
  • Tests before matching affect only the Glidein; should anti-blackhole measures be in place?
  • Tests running periodically or before the job will affect also other jobs already running on the node:
    • Should the Glidein kill all and quit?
    • Should there be a timeout (graceful shutdown)?
    • Should it retire and drain, letting running jobs complete?

A follow-up meeting and GWMS development activities need to be scheduled.


Related issues

Related to GlideinWMS - Feature #20030: Add a variable for VOs to add extra singularity bind mount points and improve/update the Singularity scripts (Closed, 05/25/2018)

Related to GlideinWMS - Feature #17970: Support mechanism for VO & possibly site to specify list of potential bind mountpoints for singularity (Closed, 10/19/2017)

Related to GlideinWMS - Feature #20811: Adopt Singularity mechanisms provided by HTCondor (New, 09/12/2018)

Related to GlideinWMS - Feature #21875: Invoke Singularity via HTCondor INSTEAD of GWMS job wrapper (New, 02/09/2019)

Related to GlideinWMS - Feature #21711: Add a portable condor_chirp for jobs running under GlideinWMS (New, 01/16/2019)

Related to GlideinWMS - Feature #21639: Include OSG distributed unprivileged Singularity to the search path and do a full test of Singularity (Closed, 01/08/2019)

Related to GlideinWMS - Feature #21635: Increase verbosity to help Singularity troubleshooting (Closed, 01/07/2019)

Related to GlideinWMS - Feature #20776: Ask for feedback about the new Singularity scripts and remove the TODOs (New, 09/07/2018)

Related to GlideinWMS - Support #20749: library problem sometimes when other scripts are using the GWMS HTCondor (New, 09/05/2018)

Related to GlideinWMS - Feature #21885: Support to run test and periodic scripts within Singularity (New, 02/11/2019)

Related to GlideinWMS - Feature #21886: Custom modules for job wrappers (New, 02/11/2019)

Related to GlideinWMS - Feature #23290: Support condor_ssh_to_job also for Singularity jobs (New, 09/17/2019)

History

#1 Updated by Marco Mambelli over 1 year ago

  • Target version set to v_collections

#2 Updated by Marco Mambelli over 1 year ago

Related tickets:
- Support mechanism for VO & possibly site to specify list of potential bind mountpoints for singularity [#17970] WIP
- Include unprivileged singularity in pilot software [#17560] - TO_DISCUSS - should we deliver Singularity w/ the Glidein?
- Add a variable for VOs to add extra singularity bind mount points [#20030] WIP

#3 Updated by Marco Mambelli about 1 year ago

  • Related to Feature #20030: Add a variable for VOs to add extra singularity bind mount points and improve/update the Singularity scripts added

#4 Updated by Marco Mambelli about 1 year ago

  • Related to Feature #17970: Support mechanism for VO & possibly site to specify list of potential bind mountpoints for singularity added

#5 Updated by Marco Mambelli about 1 year ago

  • Related to Feature #20811: Adopt Singularity mechanisms provided by HTCondor added

#6 Updated by Marco Mambelli 10 months ago

  • Related to Feature #21875: Invoke Singularity via HTCondor INSTEAD of GWMS job wrapper added

#7 Updated by Marco Mambelli 10 months ago

  • Related to Feature #21711: Add a portable condor_chirp for jobs running under GlideinWMS added

#8 Updated by Marco Mambelli 10 months ago

  • Related to Feature #21639: Include OSG distributed unprivileged Singularity to the search path and do a full test of Singularity added

#9 Updated by Marco Mambelli 10 months ago

  • Related to Feature #21635: Increase verbosity to help Singularity troubleshooting added

#10 Updated by Marco Mambelli 10 months ago

  • Related to Feature #20776: Ask for feedback about the new Singularity scripts and remove the TODOs added

#11 Updated by Marco Mambelli 10 months ago

  • Related to Support #20749: library problem sometimes when other scripts are using the GWMS HTCondor added

#12 Updated by Marco Mambelli 10 months ago

  • Related to Feature #21885: Support to run test and periodic scripts within Singularity added

#13 Updated by Marco Mambelli 10 months ago

#14 Updated by Marco Mambelli 3 months ago

  • Related to Feature #23290: Support condor_ssh_to_job also for Singularity jobs added

