Getting Started with the Jobsub Client

This article is a brief introduction to submitting, monitoring, and manipulating jobs with the jobsub client. It assumes that your computing account
is set up and that you belong to an experiment.

Setting up

The rest of these examples use the latest version of jobsub from UPS. To set it up, run the following:

-bash-4.1$ source /cvmfs/fermilab.opensciencegrid.org/products/common/etc/setups.sh
-bash-4.1$ setup jobsub_client

You can verify the version of jobsub_client by running:

-bash-4.1$ jobsub_submit --version
1.2.9
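
A script that depends on the client can test for it before doing anything else. The following is a minimal sketch; the function name check_jobsub is hypothetical and not part of jobsub_client:

```shell
# Hypothetical helper: report whether the jobsub client is set up in this shell.
check_jobsub() {
    if command -v jobsub_submit >/dev/null 2>&1; then
        # Same version query as shown above.
        echo "jobsub_client found: $(jobsub_submit --version)"
    else
        echo "jobsub_client not set up"
        return 1
    fi
}
```

Calling check_jobsub || exit 1 at the top of a submission script stops it early if the setup step was skipped.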

Submitting a job

To submit a job, you need to know two things:

  1. Your experiment (for these examples, we'll assume that's nova)
  2. An executable to run (for this example, we'll use /grid/fermiapp/common/tools/probe)

Then, to submit the job, use:

jobsub_submit -G nova file:///grid/fermiapp/common/tools/probe

If job submission is successful, you'll get an output like this:

/fife/local/scratch/uploads/nova/sbhat/2019-03-06_162756.803965_1644

/fife/local/scratch/uploads/nova/sbhat/2019-03-06_162756.803965_1644/probe_20190306_162757_82154_0_1_.cmd

submitting....

Submitting job(s).

1 job(s) submitted to cluster 17006020.

JobsubJobId of first job: 17006020.0@jobsub02.fnal.gov

Use job id 17006020.0@jobsub02.fnal.gov to retrieve output

A few notes here:

  • The -G option specifies the group, which is your experiment
  • The file:// URI tells jobsub client what executable you want to run

These are the only two items required to submit jobs. However, we recommend tailoring your resource requests with the options below rather than
relying on the defaults, since this can help you get slots more quickly. Some commonly used and highly encouraged options are:

  • --role: Specify your VOMS role (used mainly for Production jobs)
  • --resource-provides=usage_model: This is how you can control whether your jobs run onsite or offsite (in HEPCloud, this might go away):
    • To get onsite: --resource-provides=usage_model=DEDICATED
    • To get offsite: --resource-provides=usage_model=OFFSITE
    • To run anywhere (recommended): --resource-provides=usage_model=DEDICATED,OFFSITE
  • --expected-lifetime: 'short'|'medium'|'long'|NUMBER[UNITS]. UNITS are 's', 'm', 'h', or 'd'. This is how long you expect your jobs to run, for example --expected-lifetime 4h. The default is 8h.
  • --disk: This is the amount of disk space you expect your job needs. An example would be --disk 10GB. The default is 35GB.
  • --memory: This is how much memory you want to request for your job. An example is --memory 8GB. The default is 2GB.
  • --OS: The OS you want your job to run in. An example is --OS=SL5. The default is SL6 (SL7 is also available).
  • --debug: Provide debug output
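
The --expected-lifetime format described in the list can be checked in a script before submitting. This is a minimal sketch, not how jobsub itself validates the value: the function name valid_lifetime is hypothetical, and it assumes NUMBER is a non-negative integer, which the option description does not state.

```shell
# Hypothetical helper: succeed if the argument looks like
# 'short', 'medium', 'long', or NUMBER with an optional s/m/h/d unit.
valid_lifetime() {
    case "$1" in
        short|medium|long) return 0 ;;
    esac
    # Assumption: NUMBER is a plain non-negative integer.
    printf '%s\n' "$1" | grep -Eq '^[0-9]+[smhd]?$'
}
```

For example, valid_lifetime 4h succeeds, while valid_lifetime 4hours fails.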

To see the other options, run jobsub_submit -G <experiment> --help. A well-formed jobsub_submit command might be:

jobsub_submit -G nova --resource-provides=usage_model=DEDICATED,OFFSITE --expected-lifetime=20m --disk=10GB --memory=1GB file:///grid/fermiapp/common/tools/probe

Any arguments after the file:// URI will be assumed to be arguments to the executable specified in the file:// URI.
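
To illustrate where executable arguments go, the sketch below assembles a command line and prints it without submitting anything. The script path /path/to/myscript.sh and its arguments (--events 100 input.root) are hypothetical placeholders:

```shell
# Hypothetical executable; replace with a real script path.
exe=/path/to/myscript.sh

# Options before the file:// URI belong to jobsub_submit.
# Everything after the URI is passed to the executable.
cmd="jobsub_submit -G nova --memory=1GB file://${exe} --events 100 input.root"

# Print the command for inspection instead of running it.
echo "$cmd"
```

Here -G and --memory are read by jobsub_submit, while --events 100 input.root would be handed to myscript.sh.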

Monitoring your jobs

The easiest way to monitor your jobs with jobsub_client is to filter by your username or by the jobid:

Use your username:

-bash-4.1$ jobsub_q -G nova --user=sbhat
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
16956646.0@jobsub02.fnal.gov          sbhat           03/05 13:26   0+00:00:00 I   0   0.0 probesleep.sh_20190305_132632_173264_0_1_wrap.sh
17001093.0@jobsub02.fnal.gov          sbhat           03/06 14:22   0+00:00:00 I   0   0.0 probesleep.sh_20190306_142246_3511893_0_1_wrap.sh

2 jobs; 0 completed, 0 removed, 2 idle, 0 running, 0 held, 0 suspended

Use the jobid:

-bash-4.1$ jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
16956646.0@jobsub02.fnal.gov          sbhat           03/05 13:26   0+00:00:00 I   0   0.0 probesleep.sh_20190305_132632_173264_0_1_wrap.sh

1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
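
In scripts, the jobid can be taken from the submission output rather than copied by hand. This is a minimal sketch, assuming the output keeps the "JobsubJobId of first job:" line shown in the submission example; the sample text below is copied from that example rather than produced by a live submission:

```shell
# Sample submission output, copied from the example in "Submitting a job".
submit_output='Submitting job(s).
1 job(s) submitted to cluster 17006020.
JobsubJobId of first job: 17006020.0@jobsub02.fnal.gov
Use job id 17006020.0@jobsub02.fnal.gov to retrieve output'

# Keep only the line with the "JobsubJobId of first job:" prefix and strip the prefix.
jobid=$(printf '%s\n' "$submit_output" | sed -n 's/^JobsubJobId of first job: //p')
echo "$jobid"

# With the client set up, the id could then be passed on, for example:
#   jobsub_q -G nova --jobid="$jobid"
```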

A more user-friendly way is to look at the User Batch Details page on Fifemon (select your username from the dropdown),
and then you can select your job/cluster at the bottom.

Holding jobs

If you hold a job, it is stopped if it is running and kept in the queue, but it will not run until it is released. You might do this if, for example, your jobs need files on dCache that haven't been staged from tape yet. Holding
the jobs keeps them from starting and failing in the meantime, and saves you from having to resubmit them.

If your job uses more resources than you requested, it will often be automatically put into the "Held" status by the batch system.

To hold one of my example jobs above, I'll run:

-bash-4.1$ jobsub_hold -G nova --jobid=16956646.0@jobsub02.fnal.gov
Holding job with jobid=16956646.0@jobsub02.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

If you now run jobsub_q, the "ST" (status) column shows "H" for held:

-bash-4.1$ jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD
16956646.0@jobsub02.fnal.gov          sbhat           03/05 13:26   0+00:00:00 H   0   0.0 probesleep.sh_20190305_132632_173264_0_1_wrap.sh

1 jobs; 0 completed, 0 removed, 0 idle, 0 running, 1 held, 0 suspended

Note that this can be run with the --user argument above to hold ALL your jobs. BE CAREFUL WITH THIS.

Releasing jobs

Releasing a held job will allow it to be put back in the queue and run when resources are available. To release the job that was previously held, run:

-bash-4.1$ jobsub_release -G nova --jobid=16956646.0@jobsub02.fnal.gov
Releasing job with jobid=16956646.0@jobsub02.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

Note that this can be run with the --user argument above to release ALL your jobs. BE CAREFUL WITH THIS.

Removing jobs

Removing a job stops it (if it's running) and takes it out of the queue. To remove the job I previously released, I'll run:

-bash-4.1$ jobsub_rm -G nova --jobid=16956646.0@jobsub02.fnal.gov
Removing job with jobid=16956646.0@jobsub02.fnal.gov
1 Succeeded, 0 Failed, 0 Not Found, 0 Bad Status, 0 Already Done, 0 Permission Denied

jobsub_q now shows that the job is gone:

-bash-4.1$ jobsub_q -G nova --jobid=16956646.0@jobsub02.fnal.gov
JOBSUBJOBID                           OWNER           SUBMITTED     RUN_TIME   ST PRI SIZE CMD

Note that this can be run with the --user argument above to remove ALL your jobs. BE VERY CAREFUL WITH THIS.
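
Because the --user form of jobsub_hold, jobsub_release, and jobsub_rm acts on every job you own, it can help to preview the command first. The wrapper below is a hypothetical sketch, not part of jobsub_client: it prints the command by default and runs it only when --yes is given.

```shell
# Hypothetical helper: act on ALL of a user's jobs only when explicitly confirmed.
# Usage: bulk_jobs <hold|release|rm> <group> <user> [--yes]
bulk_jobs() {
    action="$1"; group="$2"; user="$3"; confirm="${4:-}"
    case "$action" in
        hold|release|rm) ;;
        *) echo "unknown action: $action" >&2; return 2 ;;
    esac
    if [ "$confirm" = "--yes" ]; then
        # Really run jobsub_hold, jobsub_release, or jobsub_rm.
        "jobsub_${action}" -G "$group" "--user=${user}"
    else
        # Dry run: show what would be executed without touching any jobs.
        echo "would run: jobsub_${action} -G ${group} --user=${user}"
    fi
}
```

For example, bulk_jobs rm nova sbhat only prints the jobsub_rm command, while bulk_jobs rm nova sbhat --yes would execute it.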

Why doesn't job submission work?

To submit a job, make sure you have a Kerberos ticket, or that the environment variable X509_USER_PROXY points at a VOMS proxy you can use
(like a Managed Proxy). You can check for a ticket with klist:

-bash-4.1$ klist
Ticket cache: FILE:/tmp/krb5cc_10610_QprSm31484
Default principal: sbhat@FNAL.GOV

Valid starting     Expires            Service principal
03/06/19 14:21:19  03/07/19 13:55:15  krbtgt/FNAL.GOV@FNAL.GOV

If neither is the case, run kinit <username>@FNAL.GOV to obtain a Kerberos ticket.
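
The two credential checks can be combined into one pre-submission test. This is a minimal sketch; the function name have_credentials is hypothetical. It only confirms that the proxy file is readable, not that the proxy is still valid, and it relies on klist -s, which prints nothing and reports through its exit status whether a usable ticket cache exists.

```shell
# Hypothetical helper: succeed if a credential for submission appears to be present.
have_credentials() {
    # A proxy counts if X509_USER_PROXY names a readable file (validity is not checked).
    if [ -n "${X509_USER_PROXY:-}" ] && [ -r "$X509_USER_PROXY" ]; then
        echo "proxy: $X509_USER_PROXY"
        return 0
    fi
    # klist -s exits 0 only when a readable, unexpired ticket cache is found.
    if command -v klist >/dev/null 2>&1 && klist -s; then
        echo "kerberos ticket found"
        return 0
    fi
    echo "no credentials: run kinit <username>@FNAL.GOV"
    return 1
}
```

A submission script could start with have_credentials || exit 1 to fail early with a clear message.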

If that is not the problem, the best thing to do is to open a service desk ticket ("Report an issue") against the Batch Job Management (jobsub condorsubmit)
service. Most often, the issue is that the user is not properly registered
in their experiment.