Project

General

Profile

Best Practices

Job Submission

This Job Submission best practice session will help you on maximizing your chances to run your jobs faster and in a successful way.
There are some basic concepts that will help you understand how the jobs act once they are submitted and what kind of problems you can face.
First of all, let's start saying that to maximize the chances to run your job, you can use different kinds of resources, those are separated in:
  • Dedicated
  • Opportunistic
  • Offsite

Depending on where your job will run it's possible that your script will work or not, based on how well written it is.
The majority of the problems that we face are related to scripts that aren't sufficiently robust to operate outside of FermiGrid.

Number of jobs allowed

You should limit the number of batch processes submitted at a time. It would be best for an individual submission to not be over 5000 individual batch jobs. When a submission occurs, the batch system has to internally create a large number of directories and files for each job. We have seen problems when submissions get over 10,000 jobs.

Pre-emption and common errors

The main difference when running outside of GPGrid is the pre-emption. What does preemption mean?
Pre-emption means that if other jobs with higher priority come into the cluster, your job may be evicted in favor of those.
Your job will automatically go back to the queue and eventually restart from the beginning, meaning that the produced stdout and stderr will be lost.
Analyzing condor logs you will find that your job has been evicted or disconnected.
Other possible errors when your job is sent to OSG are:
  • BlueArc areas are not mounted (You shouldn't run code directly from there )
  • Some system libraries may not be installed

How to run faster

So you may think "what's the biggest factor that will help my job to run faster and be successful?"
The answer is "your job's resource request" (CPU, disk, memory, runtime).
OSG is completely based on opportunistic usage, so we don't have control on your job's priority at a remote site.
All the requests are handled on a First come, First served methodology; this means that, depending on the resources available at the site, you may run first if the resource request is smaller than others.
You also have to consider that we may not have Glideins running at a given site all the time. Glideins can take several hours to start so it will be best for you to have jobs in the queue all the time.

Profiling your jobs

Profiling your jobs and requesting only the resources that you need will help you to get assigned slots on the grid faster.Don't just stick with the default resource request even if it works for you.
If you only need two hours or one GB or Ram, modify the request; It will make a huge difference in the waiting time.
Consider also that Pre-emption probability increases with run time. Aim for shorter (few hours) jobs.
Splitting a six hours job in two three-hour jobs or something similar will make the difference.

Write robust scripts

Let's start from the initial assumption that anything that runs on OSG, will also run on GPGrid, but that doesn't mean that if something runs on GPGrid, it will run also on OSG.
This initial assumption should let you focus on the fact that you should write your job scripts with OSG in mind from the beginning.

The best way to ensure success on the OSG is to write a good script:
  • Check for the existence of all needed CVMFS repositories before starting to run
  • Check OS environment and set Env variables as needed (such as UPS_OVERRIDE)
  • Check the exit status of data transfer commands and executables
  • Avoid hard-coding any paths for input or output files (CVMFS paths are ok )
  • Don't make assumption about what username and/or home directory the job sees

Script examples

Easy way Proper Way
#!/bin/bash
. /grid/fermiapp/products/common/etc/setup
setup ifdhc
. /grid/fermiapp/products/larsoft/etc/setup
setup mycode v1_2_3_4 -q whatever

if [ ! -d /cvmfs/myexperiment.opensciencegrid.org ]; then
    echo “experiment CVMFS repo seems to not be present. Sleeping and then exiting.”
    sleep 1000
    exit 1
fi
. /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup
setup ifdhc
. /cvmfs/fermilab.opensciencegrid.org/products/larsoft/etc/setup
setup mycode v1_2_3_4 -q whatever
ifdh cp -D /pnfs/myexp/scratch/users/${GRID_USER}/myinput.tar.gz ./

ifdh cp -D /pnfs/myexp/scratch/users/${GRID_USER}/myinput.tar.gz ./
IFDH_RESULT=$?
ifdh log “ifdh cp of myinput.tar.gz - job $JOBSUBJOBID finished at `date` - status $IFDH_RESULT”
if [ $IFDH_RESULT -ne 0 ]; then echo “Error copying input tar” ; exit 2 ; fi
ifdh cp -D myoutfile1 myoutfile2 /pnfs/myexp/scratch/users/$USER/
myoutputs/
ifdh cp -D myoutfile1 myoutfile2 /pnfs/myexp/scratch/users/${GRID_USER}/
myoutputs/

Always test first

The most important thing of all: Do NOT send many jobs without sending tests first.
A fermicloud node is available to run tests. Copy your scripts there and run it first.
If it works, send a single test job to a remote site, then some more, etc.
There is also a node called fnpctest1.fnal.gov where you can run test workflows. Open a Service Desk ticket to Distributed Computing Support to obtain access.

StashCache

StashCache, as the name says is a different approach to accessing data in an opportunistic way for OSG sites.
With StashCache a file is downloaded locally to the cache from the origin server on the first access, and then, on future accesses, the local copy will be used.
When more space is needed, old files are removed ( by some algorithm which decides the definition of "old" )

With StashCache, we have that several cache servers are placed at several strategic cache locations across the OSG
The caching infrastructure is based on SLAC Xrootd server and xrootd protocol

StashCache offers different kinds of user interfaces:
  • "cp"-like
  • HTCondor file transfer
  • POSIX

"cp"-like

All glideins are instrumented to have "stashcp" in the $PATH environment variable
Stashcp emulates the command line interface of the POSIX command "cp"
All you need to do is just write:

stashcp stash:/user/bbockelm/foo $PWD

A summary of statistics, performance and errors encountered will be injected back to HTCondor ClassAd.

HTCondor File Transfer

The workflow system can manage file transfers directly.
The pilot configuration provides a callout script for handling a given URL type
Using HTCondor file transfer plugins provides a mechanism for concurrency management,policy-based retries, and removes need for error handling in user jobs
Users add the following line to the JDL:

transfer_input_files = stash://user/bbockelm/foo

POSIX

Stashcp and HTCondor file transfer plugins require the entire file to be downloaded locally, but it's possible that not all worker nodes have large enough scratch disks.
Using an LD_PRELOAD library from the Xrootd team, we can make StashCache appear to be a POSIX filesystem to the application.
As many applications perform small reads, it uses the local filesystem as a cache for accesses portions of the file.
All "normal" POSIX utilities and API's will work ( think "ls", "cat", "tail", ect )
All you have to do is simply set :

+UseStashCachePosix=True

in the HTCondor submit file.
LD_PRELOAD can have some overhead and may not work in all cases; hence, users must explicitly ask for it

Sam4Users

SAM for Users is a Utility created to assist individual users in making use of the SAM catalog for their own data.
Using SAM the user has different advantages:
  • Users' own data will be just like production data:
    • You can submit grid jobs using SAM project,
    • You can use the existing tools and monitoring for SAM jobs
  • Moving files between different storage locations is made simple;
  • Only the copies of the files will be removed ( if not explicitly told to )

Getting Started

All the commands of SAM4User live in the fife_utils package, so the first thing to do if you want to use this utility is to setup your environment.
On your experiments’ GPVMs, where /grid/fermiapp or cvmfs is mounted:

# ups setup
source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setup
# get a valid proxy from sam
kx509
kxlist -p
voms-proxy-init --rfc --voms=fermilab:/fermilab/nova/Role=Analysis --noregen
# setup fife_utils
# set some environment variables for nova (if nova is your experiment)
export GROUP=nova
export EXPERIMENT=nova
export SAM_EXPERIMENT=nova
export IFDH_BASE_URI=http://samweb.fnal.gov:8480/sam/nova/api
# for minerva
#export IFDH_BASE_URI="http://samweb-minerva.fnal.gov:20004/sam/minerva/api" 

The metadata in SAM belongs to the user who declares it, so only the user who declared the file will be allowed to modify the metadata, add or remove locations, or retire the metadata.
Other users can’t change anything (except admins).
The files in the storage system will be owned by the mapped credentials used to copy them:
  • For dCache this means an individual user
  • For offsite locations this may vary; quite likely to be shared ownership

Then next command will be to add a dataset to SAM

sam_add_dataset --name=${USER}_stuff --directory=/nova/data/user/${USER}/stuff

The previous command will create a dataset named "${USER}_stuff" containing all the files in the folder "/nova/data/user/${USER}/stuff"
Note that all the files available in the folder will be renamed with an uuid.

Before Rename After Rename
neardet_r00011382_s16_nuexsec.root 19153c3f-fd03-492a-a688-d8d66e541ae0-neardet_r00011382_s16_nuexsec.root

Each dataset have his own simple metadata, that we can visualize trough the command

samweb describe-definition <name of the dataset> 

This command will result in an output similar to the following:
Definition Name: mfattoru_tutorial_20170425
  Definition Id: 523513
  Creation Date: 2017-04-25T15:21:41.747907+00:00
       Username: mfattoru
          Group: nova
     Dimensions: dataset.tag mfattoru_tutorial_20170425

The metadata also contains a tag name so you can retrieve a whole group of files.
It's even possible, If you need more detail, to add your own metadata for each file, using all the normal SAM features

#add an additional parameter to the metadata
samweb add-parameter Users.AnalysisGroup string
samweb add-parameter file_format string
#modify the file metadata
samweb modify-metadata ${FILE_NAME} ${METADATA_JSON_FILE}

the structure of the JSON file could be like the following:

{
   "Users.AnalysisGroup": "NDXsec",
   "file_format" : "root" 
}

Once you have declared a dataset it's possible to obtain different informations from it.if's possible to obtain the list of files associated with it:

samweb list-definition-files mfattoru_tutorial_20170425

or it's possible to check where the files are located:
samweb locate-file 19153c3f-fd03-492a-a688-d8d66e541ae0-neardet_r00011382_s16_nuexsec.root 

or get the metadata associated with the file
samweb get-metadata 19153c3f-fd03-492a-a688-d8d66e541ae0-neardet_r00011382_s16_nuexsec.root 

Ideally, each experiment that uses this will be able to develop local expertise that can act as the first source of advice for new users.
Until then, feel free to ask us for advice or Report problems and bugs through the ServiceDesk (it has its own entry in the service catalog)

For more information, click Here

Data Handling

Data Handling
Removing Bluearc Dependencies

IFDH

IFDH is a data transfer framework used to transfer data between different storage locations, abstracting the user from the low-level protocol used during the transfer.

IFDH will handle all the necessary logic for the data transfer

There are a few items that we really recommend to our users:
  • Use it (don't use low-level tools directly)
If you are using plain cp, or cpn, or calling globus-url-copy , etc directly in your script or program; change them to use ifdh.
  • Your script will more likely work offsite
  • You won't have to add special options to deal with DCache when you move your data there.
  • You will be insulated against other storage service changes we haven't even thought of yet.
  • Use plain paths (not full URL's ) wherever possible

Giving in input to ifdh a URL involves giving it a protocol. Giving to ifdh a protocol.
When Ifdh receives a protocol, it interpret your request as "you have to use this protocol", and this tie ifdh's hands.
We want ifdh to have the flexibility to pick a different protocol
You can switch the "schema" to xrootd when you use estabilishProcess/getNextFile/getchInput to get streaming, or not as needed.

  • Avoid "ifdh cp -r"

Ifdh doesn't execute recursive copy by itself.
We have some utilities that don't support it and others that just support it partially.
Also while using "ifdh cp -r" it's really easy to have big errors, just misspelling a path could possibly make you copy a much bigger amount of files than you can imagine

  • avoid the --force option

We implemented the --force option to work around problems.
If you need to use the --force in an experiment production script, we should be fixing something and then you should take it out of your script

  • Use IFDH_STAGE_VIA where appropriate

At some remote sites, connectivity is better to their storage nodes than their worked notes.
Using IFDH_STAGE_VIA="site.werever=>proto://node/path" will let you use it conditionally at a particular site,without bothering others.
Doing so can speed up jobs at such site, and take a lot of load off of DCache.

  • Use ifdh from python and/or C++

All of the ifdh calls are available in the C++ and python bindings.
Ifdh can make your Python and C++ scripts and programs easier to write:
In python the file list is already split into lists and the code throws real exceptions in appropriate cases
In C++ list results are STL vectors, of SL strings, etc and the code throws real exceptions in appropriate cases

Field Guide to Error Messages

Error Possible Cause
error: globus_ftp_client: the server
responded with an error
451 Operation failed: Path exists but is not
of the expected type
You forgot to use -D, or there's a file where you want a
directory.
Error: globus_ftp_control:
gss_init_sec_context faile
globus_gsi_gssapi: Error with gss credential
handle
globus_credential: Valid credentials could
not be found in any of the possible locations
specified by the credential search order.
You don't have a proxy, you need to get one.
globus_credential: Valid credentials could
not be found in any of the possible locations
specified by the credential search order.
Valid credentials cou
You don't have a proxy, and need to get one
globus_credential: Error with credential: The
proxy credential: /tmp/x509up_u1733
with subject:
/DC=gov/DC=fnal/O=Fermilab/OU=People/CN=Marc
W. Mengel/CN=UID:mengel/CN=811559939
expired 3720 minutes ago.
You had a proxy, but it expired.
error: globus_xio: Unable to open file
/tmp/misspeled
globus_xio: System error in open: No such
file or directory
globus_xio: A system call failed: No such
file or directory
You misspelled a local filename, or tried to copy a
non-existent file
451 Operation failed: Path exists but is not
of the expected type
program: globus-url-copy -rst-retries 1
-gridftp2 -nodcau -restart -stall-timeout
14400 -dp
gsiftp://if-gridftp-fermilab.fnal.gov/
no/such/dir/file gsiftp://stkendca50a.fnal
.gov/pnfs/fnal.gov/usr/nova/scratch/users/m
engel exited status 1
You misspelled a local directory, or specified a nonexistent
one, so ifdh decide it is on bluearc, and...
Sometimes it could happen that IFDH just sits there and looks like it's doing nothing.
Sometimes DCache is just slow. Sometimes DCache is stuck on particular files. Other times it can be waiting to fetch a file form tape if it was not already cached on disk.
If you can't tell what is happening from the error message, or in case the error message is not there:
  • Check Fifemon's dcache pages, look for big queues.
  • Could your file be on tape? Reading a file from the tape takes more time than accessing it directly from the disk
  • If you still can't tell, open a ticket. Be specific, where, when and How.

dCache

dCache is a distributed storage system that provides non-POSIX and POSIX-like access to Scientific data stored on immutable files distributed across multiple data servers.
This is not a filesystem mounted off a local disk. Latencies associated with staging or even accessing each file over the network need to be taken in consideration

Avoid millions of small files ( current definition of small file is 200MB. )

For the same amount of data, a big number of small files cause:
  • Increase in the number of files per directory;
  • Increase in the depth and the width of the directory tree;
  • Datasets harder to manage, because they contain million of files;
When dealing with tape, different problems can occur:
  • Bigger per file overhead to write a file mask ( on writes )
  • Tape back-hitching when data streaming to tape is interrupted ( e.g. for the next file in queue )
  • Bigger mount latency on writes and reads
  • Bigger unload time on write and reads
  • Increased Seek time on writes and reads.

Read tape overheads per file contribute directly to "slow down" file delivery for files that are not cached on disk.
For the writing process instead, the user is shielded from tape overheads as they occur out of the band.

Small File Aggregation

For a big number of small files, we offer a Small File Aggregation Feature (SFA) in Enstore that automatically and seamlessly packs small files into large files that are stored on tape.
This feature has Pros and Cons:
  • Pros:
    • The user don't need to worry about small files.
    • Small files in the system have less impact on others when dealing on tapes as latencies associated with small files are minimized (less queueing on Enstore )
  • Cons:
    • Due to the relatively random packaging of files in SFA packages, staging small files back from tape may cause significant latencies when reading files that are in some way related each other ( e.g. data file and a JSON file that describes the data ) and end up to be stored in different packages.

We encourage the experiments to pack or concatenate/merge their files intended for data analysis based on their own criteria rather than relying on automated SFA packaging.
Only experiments themselves know their data and can provide locality for efficient access ( e.g. packaging aux. files together with their data file for efficient access )

SFA is good for independent small files that do not require reading in specific sequence so that in case they need to be staged there is no dependency on files across other packages.

Copying or Streaming?

The pool nodes don't have an infinite I/O and bandwidth, so to avoid subsystem overheating on pool nodes we have to restrict the number of active GFTP movers (transfers) per pool to 20.
On the other hand, slow transfers over the WAN occupy active slots for a long time and will eventually clog up the system, leading to GFTP transfer request queueing, causing low job efficiency.
To fix this possible is possible to stream the data event by event; this doesn't stress the I/O pool and allow us to open a lot more movers slots per pool ( E.g. 1000 XRootD 20 GFTP active movers )

Copy command (Instead of this) (Do this) Streaming command
{cp,ifdh cp, globus-url-copy, wget, xrdcp} /pnfs/foo/bar /local/disk
…
root[0] f = TFile::Open(“/local/disk/bar”);
// access via xrootd
root[0] f = TFile::Open(“root://fndca1/pnfs/fs/usr/foo/bar”);
…
// or dcap
root[0] f = TFile::Open(“dcap://fndca1/pnfs/fs/usr/foo/bar”);
…
// or over NFS v4.1 or NFS v3 and dcap preload library
root[0] f = TFile::Open(“/pnfs/fs/usr/foo/bar”);

Run on ONLINE files to avoid queueing

To keep job efficiency high and avoid wasting resources, it is best practice to run on ONLINE files, that is files that are in dCache pools.
To check if the file is online you can use the following command:

#Structure of the command:
#cat /pnfs/<experiment>/”.(get)(<file name>)(locality)”
#Command for the file ep047d08.0042dila
cat ".(get)(ep047d08.0042dila)(locality)" 
#Output
NEARLINE

The possible output status are:

AVAILABLE STATUS MEANING
ONLINE File is only in disk cache (and not on tape)
ONLINE_AND_NEARLINE File is in disk cache (and also on tape)
NEARLINE File is not in cache (on tape)
UNAVAILABLE file is unavailable (e.g. it is not on tape and the pool where it is located is down)

Check for your file once they are copied

To be sure that your files have been copied successfully, you can check the Alder32 checksum of the original file and its copy in PNFS after you copied it in dCache. Only SRM protocol does it automatically.

To extract the checksum of a file from PNFS,you can run the following command:

#Structure of the command
#cat /pnfs/<experiment>/”.(get)(<file name>)(checksum)”
#Command for the file ep047d08.0042dila
cat /pnfs/fs/usr/test/litvinse/world_readable/".(get)(ep047d08.0042dila)(checksum)" 
#Output
ADLER32:fe99e83e

CVMFS

The CernVM File System provides a scalable, reliable and low-maintenance software distribution service. It was developed to assist High Energy Physics (HEP) collaborations to deploy software on the worldwide distributed computing infrastructure used to run data processing applications.
CernVM-FS is implemented as a POSIX read-only file system in user space (a FUSE (link is external) module). Files and directories are hosted on standard web servers and mounted in the universal namespace /cvmfs.

Best file size

It's recommended to don't upload ON CVMFS files bigger than 100MB, for files of size bigger than 100MB it's recommended to use dCache or StashCache

Check for access to CVMFS

It's good practice to always check if the cvmfs directory is accessible before to call any source or setup

if [ ! -d /cvmfs/myexperiment.opensciencegrid.org ]; then
    echo “experiment CVMFS repo seems to not be present. Sleeping and then exiting.”
    sleep 1000
    exit 1
fi

FIFEMon

FIFEMon is a platform used for monitoring purposes.
Thanks to recent advances in deep learning, we are able to distil the thousands of monitoring inputs received every second into a single, targeted heuristic that tells you what the state of the scientific computing systems, batch systems, and your jobs are right now.

This saves you from having to drill down into all the graphs and tables to figure out if everything is okay. Instead, just sit back and watch the blinking lights!

Link: Dashboard

Overwhelmed by the number of jobs showing up in a table? Check out the list of filters in the drop-down above the table to help narrow it down.

Not sure why your jobs are getting held? With new limits enforcement, you probably need to increase the resources requested (including runtime). The Why Are My Jobs Held? dashboard will show you the reason.

Interested in seeing what the batch jobs for your SAM project are doing? Go to the SAM Project Summary dashboard and select your project name from the dropdown.

The User Overview dashboard, will show you at a glance the recent status of your batch jobs, SAM projects, and IFDH file transfers.

Is the batch system down? There are several resources that provide you the latest news on any known outages or service degradations:

  1. The "FIFE Summary": FIFE Summary dashboard has notes with known outages.
  2. Service Now maintains updated knowledge base articles for maintenance outages, and other news.

Want additional information? Visit the Help page