Getting Started with the Jobsub Client¶
This article is a brief introduction that shows how to submit, monitor, and manipulate jobs using the jobsub client. We assume that your computing account is
set up and that you are a member of an experiment.
Setting up¶
These examples use the latest version of jobsub from UPS. To set it up, run the following:
-bash-4.1$ source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups.sh
-bash-4.1$ setup jobsub_client
You can verify the version of jobsub_client by running:
-bash-4.1$ jobsub_submit --version
1.2.9
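In a script, it can help to confirm that the setup step worked before submitting anything. The sketch below assumes only that a successful setup puts the jobsub commands used in this article on your PATH. The function name require_jobsub is made up for this example and is not part of jobsub.

```shell
# Hypothetical helper (not part of jobsub): succeed only if the jobsub client
# commands used in this article can be found in the current shell.
require_jobsub() {
    local cmd
    for cmd in jobsub_submit jobsub_q jobsub_hold jobsub_release jobsub_rm; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing command: $cmd (did you run 'setup jobsub_client'?)" >&2
            return 1
        fi
    done
    return 0
}

# Typical use near the top of a script:
#   require_jobsub || exit 1
```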
Submitting a job¶
To submit a job, you need to know two things:
- Your experiment (for these examples, we'll assume that's nova)
- An executable to run (for this example, we'll use /grid/fermiapp/common/tools/probe)
Then, to submit the job, use:
jobsub_submit -G nova file:///grid/fermiapp/common/tools/probe
If job submission is successful, you'll get an output like this:
/fife/local/scratch/uploads/nova/sbhat/2019-03-06_162756.803965_1644
/fife/local/scratch/uploads/nova/sbhat/2019-03-06_162756.803965_1644/probe_20190306_162757_82154_0_1_.cmd
submitting....
Submitting job(s).
1 job(s) submitted to cluster 17006020.
JobsubJobId of first job: 17006020.0@jobsub02.fnal.gov
Use job id 17006020.0@jobsub02.fnal.gov to retrieve output
A few notes here:
- The -G option specifies group, which is your experiment
- The file:// URI tells jobsub client what executable you want to run
These are the only two items required to submit a job. However, we recommend tailoring your resource requests rather than relying on the defaults,
since this can help you get slots more quickly. Some commonly used and highly encouraged options are:
- --role: Specify your VOMS role (used mainly for Production jobs)
- --resource-provides=usage_model: This is how you can control whether your jobs run onsite or offsite (in HEPCloud, this might go away):
  - To get onsite: --resource-provides=usage_model=DEDICATED
  - To get offsite: --resource-provides=usage_model=OFFSITE
  - To run anywhere (recommended): --resource-provides=usage_model=DEDICATED,OFFSITE
- --expected-lifetime: 'short'|'medium'|'long'|NUMBER[UNITS]. UNITS are 's', 'm', 'h', or 'd'. This is how long you expect your jobs to run, for example --expected-lifetime 4h. The default is 8h.
- --disk: This is the amount of disk space you expect your job needs. An example would be --disk 10GB. The default is 35GB.
- --memory: This is how much memory you want to request for your job. An example is --memory 8GB. The default is 2GB.
- --OS: The OS you want your job to run in. An example is --OS=SL5. The default is SL6 (SL7 is also available).
- --debug: Provide debug output
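The accepted forms of --expected-lifetime can be checked in a script before submitting. This is a minimal sketch based only on the description in the list above. The function name valid_lifetime is hypothetical, and the sketch assumes that NUMBER is a whole number, which the description does not state; jobsub's own parsing may differ.

```shell
# Hypothetical helper (not part of jobsub): return 0 if the argument looks like
# a valid --expected-lifetime value as described above, non-zero otherwise.
valid_lifetime() {
    case "$1" in
        short|medium|long) return 0 ;;
    esac
    # NUMBER[UNITS]: digits, optionally followed by one of s, m, h, d.
    # Assumption: NUMBER is a whole number.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[smhd]?$'
}

# Example:
#   valid_lifetime 4h && echo "4h is accepted by this check"
```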
To see the other options, run jobsub_submit -G <experiment> --help or see here. Putting these together, a well-formed jobsub_submit command might be:
jobsub_submit -G nova --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=20m --disk=10GB --memory=1GB file:///grid/fermiapp/common/tools/probe
Any arguments after the file:// URI will be assumed to be arguments to the executable specified in the file:// URI.
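To make the argument ordering concrete, the sketch below assembles a command like the one above and prints it for review instead of running it. The function name build_submit_cmd and the arguments arg1 and arg2 are placeholders invented for this example; the resource values are simply the ones from the command above.

```shell
# Hypothetical helper (not part of jobsub): print a jobsub_submit command for
# the given group and executable path. Everything after the executable path is
# placed after the file:// URI, so it is passed to the executable, not jobsub.
# The output is meant for reading only; quoting of arguments is not preserved.
build_submit_cmd() {
    local group=$1
    local exe=$2
    shift 2
    echo jobsub_submit -G "$group" \
        --resource-provides=usage_model=DEDICATED,OFFSITE \
        --expected-lifetime=20m --disk=10GB --memory=1GB \
        "file://$exe" "$@"
}

# Example (arg1 and arg2 are placeholder arguments for the executable):
#   build_submit_cmd nova /grid/fermiapp/common/tools/probe arg1 arg2
```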
Monitoring your jobs¶
The easiest way to monitor your jobs from jobsub_client is to run jobsub_q, using your username or the jobid as a filter, like this:
Use your username:
-bash-4.1$ jobsub_q -G nova --user=sbhat
JOBSUBJOBID                   OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
16956646.0@jobsub02.fnal.gov  sbhat   03/05 13:26  0+00:00:00  I  0   0.0  probesleep.sh_20190305_132632_173264_0_1_wrap.sh
17001093.0@jobsub02.fnal.gov  sbhat   03/06 14:22  0+00:00:00  I  0   0.0  probesleep.sh_20190306_142246_3511893_0_1_wrap.sh
2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended
Use the jobid:
-bash-4.1$ jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov
JOBSUBJOBID                   OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
16956646.0@jobsub02.fnal.gov  sbhat   03/05 13:26  0+00:00:00  I  0   0.0  probesleep.sh_20190305_132632_173264_0_1_wrap.sh
1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
A more user-friendly way is the User Batch Details page on Fifemon: select your username from the dropdown,
then select your job/cluster at the bottom of the page.
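To filter by jobid without copying the id by hand, it can be pulled out of the text that jobsub_submit prints. This is a sketch that assumes the submission output contains a "Use job id ... to retrieve output" line, as in the sample output in the submission section, and that this text arrives on standard output. The function name extract_jobid is hypothetical.

```shell
# Hypothetical helper (not part of jobsub): read jobsub_submit output on stdin
# and print the job id from the "Use job id ... to retrieve output" line.
extract_jobid() {
    sed -n 's/^.*Use job id \([^ ]*\) to retrieve output.*$/\1/p'
}

# Possible use, assuming the output format shown in this article:
#   jobid=$(jobsub_submit -G nova file:///grid/fermiapp/common/tools/probe | extract_jobid)
#   jobsub_q -G nova --jobid="$jobid"
```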
Holding jobs¶
Holding a job stops it if it is running and keeps it in the queue without letting it run. You might do this if, for example, your job needs files on dCache that haven't been staged from tape yet. Holding
the job keeps it from starting and failing in the meantime, and means you do not have to resubmit it later.
If your job uses more resources than you requested, it will often be automatically put into the "Held" status by the batch system.
To hold one of the example jobs above, run:
-bash-4.1$ jobsub_hold -G nova --jobid=16956646.0@jobsub02.fnal.gov
Holding job with jobid=16956646.0@jobsub02.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied
Running jobsub_q again shows that the "ST" (status) column is now "H" for held:
-bash-4.1$ jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov
JOBSUBJOBID                   OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
16956646.0@jobsub02.fnal.gov  sbhat   03/05 13:26  0+00:00:00  H  0   0.0  probesleep.sh_20190305_132632_173264_0_1_wrap.sh
1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended
Note that this can be run with the --user argument above to hold ALL your jobs. BE CAREFUL WITH THIS.
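The status of a single job can also be read from jobsub_q output in a script. The sketch below assumes the column layout shown in the jobsub_q examples in this article, where SUBMITTED takes up two whitespace-separated fields (date and time), which makes ST the sixth field. The function name job_status is hypothetical.

```shell
# Hypothetical helper (not part of jobsub): read jobsub_q output on stdin and
# print the ST column for the row whose first field equals the given job id.
# Assumes the layout shown in this article, so ST is field 6.
job_status() {
    awk -v id="$1" '$1 == id { print $6 }'
}

# Possible use:
#   jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov | job_status 16956646.0@jobsub02.fnal.gov
```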
Releasing jobs¶
Releasing a held job will allow it to be put back in the queue and run when resources are available. To release the job that was previously held, run:
-bash-4.1$ jobsub_release -G nova --jobid=16956646.0@jobsub02.fnal.gov
Releasing job with jobid=16956646.0@jobsub02.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied
Note that this can be run with the --user argument above to release ALL your jobs. BE CAREFUL WITH THIS.
Removing jobs¶
Removing a job stops it (if it's running) and deletes it from the queue. To remove the job that was released above, run:
-bash-4.1$ jobsub_rm -G nova --jobid=16956646.0@jobsub02.fnal.gov
Removing job with jobid=16956646.0@jobsub02.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied
jobsub_q now shows that the job is gone:
-bash-4.1$ jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov
JOBSUBJOBID                   OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
Note that this can be run with the --user argument above to remove ALL your jobs. BE VERY CAREFUL WITH THIS.
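One way to guard against acting on all of your jobs by accident is a small wrapper that refuses any command containing --user unless you opt in explicitly. This is only a sketch of the idea. The function name guarded and the variable JOBSUB_ALLOW_ALL are invented for this example; jobsub itself knows nothing about them.

```shell
# Hypothetical guard (not part of jobsub): run the given command, but refuse
# when --user is among its arguments unless JOBSUB_ALLOW_ALL=yes is set.
guarded() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --user|--user=*)
                if [ "${JOBSUB_ALLOW_ALL:-no}" != "yes" ]; then
                    echo "refusing to act on ALL jobs; set JOBSUB_ALLOW_ALL=yes to proceed" >&2
                    return 1
                fi
                ;;
        esac
    done
    "$@"
}

# Possible use:
#   guarded jobsub_rm -G nova --jobid=16956646.0@jobsub02.fnal.gov    # runs
#   guarded jobsub_rm -G nova --user=sbhat                            # refused
#   JOBSUB_ALLOW_ALL=yes guarded jobsub_rm -G nova --user=sbhat       # runs
```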
Why doesn't job submission work?¶
To submit a job, you need either a Kerberos ticket or the environment variable X509_USER_PROXY pointing at a VOMS proxy you can use
(like a Managed Proxy). You can check for a ticket with klist:
-bash-4.1$ klist
Ticket cache: FILE:/tmp/krb5cc_10610_QprSm31484
Default principal: sbhat@FNAL.GOV
Valid starting     Expires            Service principal
03/06/19 14:21:19  03/07/19 13:55:15  krbtgt/FNAL.GOV@FNAL.GOV
If neither is the case, please run kinit <username>@FNAL.GOV to obtain a Kerberos ticket.
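The two credential checks can be combined into a single pre-flight test for scripts. This is a sketch under stated assumptions: it treats a readable file named by X509_USER_PROXY as good enough without checking that the proxy is still valid, and it relies on klist -s, which in MIT Kerberos prints nothing and exits with status 0 only when the credential cache can be read and has not expired. The function name have_credentials is hypothetical.

```shell
# Hypothetical pre-flight check (not part of jobsub): succeed if either of the
# credentials described above appears to be available.
have_credentials() {
    # A proxy file named by X509_USER_PROXY counts if it is readable.
    # Assumption: readability is enough; proxy validity is not checked here.
    if [ -n "${X509_USER_PROXY:-}" ] && [ -r "$X509_USER_PROXY" ]; then
        return 0
    fi
    # klist -s is silent and reports ticket state through its exit status.
    if command -v klist >/dev/null 2>&1 && klist -s; then
        return 0
    fi
    return 1
}

# Possible use:
#   have_credentials || echo "run kinit <username>@FNAL.GOV first" >&2
```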
If credentials are not the problem, the best thing to do is to open a service desk ticket ("Report an issue") against the Batch Job Management (jobsub condorsubmit)
service here. Most of the time, the issue is that the user is not properly registered
in their experiment.