Support #21291

Set up an ITB frontend for GWMS and FIFE testing

Added by Marco Mambelli 11 months ago. Updated 22 days ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 11/01/2018
Due date:
% Done: 0%
Estimated time:
Stakeholders: FIFE
Duration:

Description

Set up an ITB frontend for GWMS and FIFE testing:
- connected to the ITB (or production) Factory
- w/ access for FIFE developers (Shreyas, Ken)

History

#1 Updated by Lorena Lobato Pardavila 10 months ago

  • Assignee set to Lorena Lobato Pardavila

#2 Updated by Lorena Lobato Pardavila 3 months ago

  • Status changed from New to Work in progress

At the moment, I have set up an SL7 ITB Frontend with GlideinWMS 3.4.5 and HTCondor 8.6 running. The machine is fermicloud308.fnal.gov and I've already given access to Ken, Shreyas, and Marco.
Currently, it is connected to one of my Factories for testing while I finish matching the Frontend configuration to the existing one in Production and wait for access to the OSG Factory in production.

I did some tests with the fermilab and fife_test_singularity groups enabled, submitting to the CE fermicloud127.fnal.gov. Tests have been successful so far.
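
As a quick sanity check, here is a minimal sketch of how glideins from those groups show up in the Frontend collector (run on fermicloud308; the attribute names are the ones GWMS glideins normally advertise, so treat this as an assumption rather than the exact check that was run):

# Count glidein slots in the frontend pool, broken down by frontend group and site.
condor_status -pool fermicloud308.fnal.gov -af GLIDECLIENT_Group GLIDEIN_Site | sort | uniq -c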

Updates in the coming days.

#3 Updated by Dennis Box about 1 month ago

I have set up a jobsub server on fermicloud030 that submits to the ITB frontend fermicloud308.
So far I have been able to run singularity jobs at BNL and UCSD with the following jobsub_submit incantation:

jobsub_submit -G nova -l "+SingularityJob=True" -l '+REQUIREDOS=\"rhel7\"' -l '+REQUIRED_OS=\"rhel7\"' --jobsub-server fermicloud030.fnal.gov --site 'BNL,UCSD' --append_condor_requirements TARGET.HAS_SINGULARITY  file://job_sleep.sh

Here are some UCSD singularity jobs caught in the act:

[dbox@fermicloud042 ~]$ jobsub_q --jobsub-server fermicloud030.fnal.gov --run
JOBSUBJOBID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
dbox 08/07 13:44 0+00:02:46 R 0 0.0 UCSD_fermilab_sleep.sh_20190807_134421_857183_0_1_wrap.sh
dbox 08/07 13:53 0+00:02:46 R 0 0.0 UCSD_nova_sleep.sh_20190807_135305_863408_0_1_wrap.sh
dbox 08/07 14:02 0+00:02:46 R 0 0.0 UCSD_uboone_sleep.sh_20190807_140229_871090_0_1_wrap.sh

Jobs submitted with the above incantation to MWT2, Michigan, Wisconsin, and UChicago start but go on hold with error messages like these:

[dbox@fermicloud042 ~]$ jobsub_q --jobsub-server fermicloud030.fnal.gov --hold
JOBSUBJOBID OWNER HELD_SINCE HOLDREASON
dbox 08/07 13:44 Error from : STARTER at 72.36.96.44 failed to send file(s) to <131.225.154.164:9618>: error reading from /scratch.local/condor/execute/dir_458387/glide_I46CaR/execute/dir_426007/.empty_file: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <72.36.96.44:43795>
dbox 08/07 13:57 Error from : STARTER at 128.104.100.113 failed to send file(s) to <131.225.154.164:9618>: error reading from /var/lib/condor/execute/slot1/dir_100007/glide_ORlUIL/execute/dir_113407/.empty_file: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <128.104.100.113:33765>
dbox 08/07 14:04 Error from : STARTER at 192.41.230.245 failed to send file(s) to <131.225.154.164:9618>: error reading from /tmp/condor/execute/dir_1104297/glide_vMg7QJ/execute/dir_1123208/.empty_file: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <192.41.230.245:43269>
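
A minimal sketch for digging into these holds directly on the schedd (assuming shell access on fermicloud030; this is a generic HTCondor recipe, not necessarily what was run here):

# Print the full hold reason for every held job, prefixed with the job id.
condor_q -constraint 'JobStatus == 5' -af:j HoldReason
# Once the underlying issue is fixed, release the held jobs so they can retry.
condor_release -constraint 'JobStatus == 5'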

#4 Updated by Dennis Box about 1 month ago

I left all my idle jobs in the queue overnight, and singularity jobs ran for nova, fermilab, and uboone at Colorado.

I also tried submitting non-singularity jobs for the nova vo to see where they would start and run.

As before, they start and get held at UChicago and MWT2. Also as before, they start and run at BNL.

Unlike singularity jobs, they also start but get held at FZU.

Non-singularity jobs start and run at Wisconsin, unlike their singularitied (I don't think that's a word but I am going with it) brethren, which start but get held.

Non-singularity jobs ran to completion at SU-ITS.

If this is interesting to anyone, I should make a table.

#5 Updated by Lorena Lobato Pardavila about 1 month ago

  • Status changed from Work in progress to Feedback

Apart from jobsub, the ITB is also working for normal HTCondor submission: 5 simple jobs calculating pi.

llobato@fermicloud308:~$ condor_q

-- Schedd: fermicloud308.fnal.gov : <131.225.154.84:9618?... @ 08/08/19 16:18:14
OWNER   BATCH_NAME                 SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
llobato CMD: jobpi_test.sh   8/8  16:18      _      4      1      5 19.0-4

5 jobs; 0 completed, 0 removed, 1 idle, 4 running, 0 held, 0 suspended

llobato@fermicloud308:~$ condor_history -l 19.3 | grep  MATCH_EXP_JOB_GLIDEIN_Site
MATCH_EXP_JOB_GLIDEIN_SiteWMS = "HTCondor"
MATCH_EXP_JOB_GLIDEIN_SiteWMS_Queue = "osggrid01.hep.wisc.edu"
MATCH_EXP_JOB_GLIDEIN_Site = "Wisconsin"
MATCH_EXP_JOB_GLIDEIN_SiteWMS_JobId = "2063227.0"
MATCH_EXP_JOB_GLIDEIN_SiteWMS_Slot = "slot1_9@e445.chtc.wisc.edu"

#6 Updated by Lorena Lobato Pardavila about 1 month ago

For Singularity, I have upgraded the Frontend to v3.4.6.RC1, as it contains some Singularity-related fixes. I have executed tests for normal submission and the jobs run.
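
For reference, a quick way to confirm the installed version and that the Frontend restarted cleanly (package and service names assumed from a standard OSG RPM install on EL7):

# Check the installed GWMS Frontend and HTCondor packages, then the service status.
rpm -q glideinwms-vofrontend condor
systemctl status gwms-frontend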

Details

My submission file:

Executable = job_singularity_test.sh
Arguments  = 10000000
Log        = /cloud/login/llobato/logssingularity/job.$(Cluster).log
Output     = /cloud/login/llobato/logssingularity/job.$(Cluster).$(Process).out
Error      = /cloud/login/llobato/logssingularity/job.$(Cluster).$(Process).err
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
+REQUIRED_OS="rhel6" 
+DESIRED_Sites = "OSC" 
+JobFactoryType="itb" 
+AccountingGroup="llobato.group_fermilab" 
+SingularityImage="/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl6:latest" 

queue 5

The script:

#!/bin/bash

echo "Test for FIFE ITB Frontend and singularity" 
ps -ef
echo "The following commands will say different things because containers still run with the host OS kernel" 
echo "1. The information related to the system: " 
uname -a
echo "2.The redhat-release is: " 
cat /etc/redhat-release
echo "The environment:" 
printenv
echo "##################" 
echo "Singularity information:" 
cat /image-build-info.txt
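
A minimal sketch of how the 5 jobs can be submitted and tracked (the .sub file name is an assumption; the actual workflow may have differed):

# Submit the test jobs and capture the cluster id that condor_submit prints.
cluster=$(condor_submit -terse job_singularity_test.sub | head -1 | cut -d. -f1)
# Watch them in the queue.
condor_q -nobatch "$cluster"
# After they finish, check which glidein site each job matched.
condor_history "$cluster" -af ProcId MATCH_EXP_JOB_GLIDEIN_Site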

I submitted several jobs, and they ran without problems. I don't display the full output of the jobs since it's too long, but here is the information related to Singularity:

llobato@fermicloud308:~$ cat ./logssingularity/job.24.3.out | grep -i singularity
Test for FIFE ITB Frontend and singularity
cvmfs     222853       1  0 19:13 ?        00:00:00 /usr/bin/cvmfs2 -o rw,fsname=cvmfs2,allow_other,grab_mountpoint,uid=496,gid=495 singularity.opensciencegrid.org /cvmfs/singularity.opensciencegrid.org
cvmfs     222857       1  0 19:13 ?        00:00:00 /usr/bin/cvmfs2 -o rw,fsname=cvmfs2,allow_other,grab_mountpoint,uid=496,gid=495 singularity.opensciencegrid.org /cvmfs/singularity.opensciencegrid.org
nobody    224414  224410  0 19:15 ?        00:00:00 /bin/bash /var/lib/condor/execute/dir_224410/condor_exec.exe -v std -name gfactory_instance -entry Glow_US_Syracuse_condor-ce2_rhel6 -clientname fermicloud308-fnal-gov_OSG_gWMSFrontend.fife_test_singularity -schedd schedd_glideins2@gfactory-2.opensciencegrid.org -proxy OSG -factory OSG -web http://gfactory-2.opensciencegrid.org/factory/stage -sign 4e4c9d5ec15bc3bfc7e3e6f04e495b0ae9033447 -signentry ad3b283495fec75a2256e654e2c6513572566fb7 -signtype sha1 -descript description.j887wD.cfg -descriptentry description.j887wD.cfg -dir Condor -param_GLIDEIN_Client fermicloud308-fnal-gov_OSG_gWMSFrontend.fife_test_singularity -submitcredid 389917 -slotslayout fixed -clientweb http://gwms-web.fnal.gov/fermicloud308/vofrontend/stage -clientsign f9375b26560c7e0b7b25634f572ac08eecbc1e31 -clientsigntype sha1 -clientdescript description.j88i3p.cfg -clientgroup fife_test_singularity -clientwebgroup http://gwms-web.fnal.gov/fermicloud308/vofrontend/stage/group_fife_test_singularity -clientsigngroup fa381765ac532bff7d783b112b97115022dadd19 -clientdescriptgroup description.j88i3p.cfg -param_CONDOR_VERSION default -param_FIFE_DESC GenericOffsite -param_GLIDEIN_Job_Max_Time 34800 -param_GLIDECLIENT_ReqNode gfactory.minus,2.dot,opensciencegrid.dot,org -param_GLIDECLIENT_Rank 1 -param_GLIDEIN_Report_Failed NEVER -param_MIN_DISK_GBS 1 -param_GLIDEIN_Glexec_Use NEVER -param_GLIDEIN_DEBUG_OUTPUT True -param_GLIDEIN_Monitoring_Enabled False -param_CONDOR_ARCH default -param_UPDATE_COLLECTOR_WITH_TCP True -param_USE_MATCH_AUTH True -param_CONDOR_OS default -param_GLIDEIN_Collector fermicloud308.dot,fnal.dot,gov.colon,9618.question,sock.eq,collector20 -cluster 557007 -subcluster 0
HAS_CVMFS_singularity_opensciencegrid_org=True
GWMS_SINGULARITY_VERSION=
GLIDECLIENT_Name=fermicloud308-fnal-gov_OSG_gWMSFrontend.fife_test_singularity
GWMS_SINGULARITY_AUTOLOAD=0
GWMS_SINGULARITY_IMAGE=/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl6:latest
GLIDEIN_Singularity_Use=PREFERRED
GLIDECLIENT_Group=fife_test_singularity
GWMS_SINGULARITY_PATH=
HAS_SINGULARITY=0
GWMS_SINGULARITY_STATUS=
GWMS_SINGULARITY_BIND_CVMFS=1
SINGULARITY_IMAGES_DICT=rhel6:/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el6:latest,rhel7:/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest
GWMS_SINGULARITY_IMAGES_DICT=rhel6:/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el6:latest,rhel7:/cvmfs/singularity.opensciencegrid.org/opensciencegrid/osgvo-el7:latest
GWMS_SINGULARITY_BIND_GPU_LIBS=1
CVMFS_singularity_opensciencegrid_org_REVISION=66601
GWMS_SINGULARITY_LIB_VERSION=1

#7 Updated by Lorena Lobato Pardavila about 1 month ago

I added a function to my script, similar to one done by Mambelli, which checks whether the job is running inside Singularity. The original function is:

function singularity_is_inside {
    # Return true (0) if in Singularity false (1) otherwise
    # In Singularity SINGULARITY_NAME and SINGULARITY_CONTAINER are defined
    # In the default GWMS wrapper GWMS_SINGULARITY_REEXEC=1
    # The process 1 in singularity is called init-shim (v>=2.6), not init
    # If the parent is 1 and is not init (very likely)
    [[ -n "$SINGULARITY_NAME" ]] && { true; return; }
    [[ -n "$GWMS_SINGULARITY_REEXEC" ]] && { true; return; }
    [[ "x`ps -p1 -ocomm=`" = "xshim-init" ]] && { true; return; }
    [[ "x$PPID" = x1 ]] && [[ "x`ps -p1 -ocomm=`" != "xinit" ]] && { true; return; }
    false
    return
}
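
For completeness, a minimal sketch of how the check can be used inside a job script (assuming the function above is defined earlier in the same script); this matches the "Singularity is: true" line in the log excerpt below:

# Report whether the payload ended up inside a Singularity container.
if singularity_is_inside; then
    echo "Singularity is: true"
else
    echo "Singularity is: false"
fi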

I can confirm that it was working for normal Condor submission; the job logs show:

##################
Is singularity working? Checkings:
1 : true
2 : true
3: true
4 : true
Singularity is: true

#8 Updated by Dennis Box 22 days ago

I have rerun my jobsub submissions to test Singularity. The issue of jobs going held at certain sites was resolved by Marco with a configuration change. I also tested singularity_is_inside as part of these tests. I think we can mark this ticket resolved.

#9 Updated by Marco Mambelli 22 days ago

  • Status changed from Feedback to Closed

