Project

General

Profile

jobsub_client

Official jobsub support

Project Home
https://cdcvs.fnal.gov/redmine/projects/jobsub

Project Wiki (Admins & Users)
https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki

Project Technical/Design Documentation
https://cdcvs.fnal.gov/redmine/projects/jobsub/documents

Mailing list for discussions and guidance from developers

Overview

jobsub_client may be referred to in various documents as:
  • jobsub_client
  • client-server
  • HA (high availability)
  • fifebatch
The major goals for jobsub_client:
  • Provide proxies for jobs automatically.
    • Users no longer need to log into gpsn01 or minos25 to run kcroninit and maintain a crontab running kproxy.
  • Eliminate the ${CONDOR_TMP} NFS shared file system containing .log, .err, and .out files
    • Condor will no longer be affected by user overloads of this system
    • Permissions could let experts review these files
    • Traditional location was /<project>/data/condor-tmp/<username>
  • Thin jobsub client software
    • Easy to maintain and deploy the thin client
    • Easy to test and update the single server
  • High Availability and Capacity servers in the backend
    • jobid has been extended to include the server name.

Minos Usage

For use by Minos
  • source /grid/fermiapp/minos/scripts/jobsub.sh
    This sets up several aliases and functions for compatibility with jobsub

Then submit with:

PROBE=/grid/fermiapp/minos/scripts/probe
jobsub file://${PROBE}

Fetch stderr/stdout/log for a cluster to ${CONDOR_TMP} with
  • jobget <jobid>

COMMANDS

COMMAND     RUNS             NOTES
jobget      jobsub_fetchlog  [cluster] - files go to ${CONDOR_TMP}/<jobid>
jobq        jobsub_q         [jobid]
jobrelease  jobsub_release   [jobid]
jobrm       jobsub_rm        [jobid]
jobsub      jobsub_submit    file:///path/to/your/script
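
The contents of jobsub.sh are not reproduced on this page. As a rough sketch only, wrappers like those in the table above could be written as POSIX shell functions. The helper to_file_uri is hypothetical (it is not part of jobsub_client), and the sketch assumes the script is the first argument to jobsub; the real wrappers may differ.

```shell
# Hypothetical helper (not part of jobsub_client): turn a path into a
# file:// URI, leaving arguments that already carry a scheme untouched.
to_file_uri() {
    case "$1" in
        *://*) printf '%s\n' "$1" ;;                      # already a URI
        /*)    printf 'file://%s\n' "$1" ;;               # absolute path
        *)     printf 'file://%s/%s\n' "$(pwd)" "$1" ;;   # relative path
    esac
}

# Sketch of a jobsub-style wrapper. Assumption: the first argument is the
# script; all remaining arguments are passed to jobsub_submit unchanged.
jobsub() {
    [ $# -ge 1 ] || { echo "usage: jobsub script [args...]" >&2; return 1; }
    script=$1
    shift
    jobsub_submit "$(to_file_uri "$script")" "$@"
}

# The remaining wrappers simply forward their arguments.
jobq()       { jobsub_q "$@"; }
jobrelease() { jobsub_release "$@"; }
jobrm()      { jobsub_rm "$@"; }
```

With these definitions, `jobsub /grid/fermiapp/minos/scripts/probe` would call jobsub_submit with file:///grid/fermiapp/minos/scripts/probe, so the file:// prefix no longer has to be typed by hand.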

Get a list of available job IDs for fetching (requires a Kerberos proxy in the browser):
https://fifebatch.fnal.gov:8443/jobsub/acctgroups/minos/jobs/0/sandbox/

Direct Usage

You can set up jobsub_client directly instead of using the jobsub.sh Minos wrappers.

Usage: https://cdcvs.fnal.gov/redmine/projects/jobsub/wiki/Using_the_Client

There are multiple fifebatch job servers, with no direct user logins.

Typical usage:

. /grid/fermiapp/products/common/etc/setups.sh
setup jobsub_client
export JOBSUB_GROUP=minos
jobsub_submit file:///grid/fermiapp/common/tools/probe
...
Use job id 76867.0@fifebatch2.fnal.gov to retrieve output
...
jobsub_q --user=kreymer

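# Derive a per-job output directory from the job id, then fetch and read the logs
ARG=76867.0@fifebatch2.fnal.gov
HOST=$(echo "${ARG}" | cut -f 2 -d '@' | cut -f 1 -d .)   # server short name: fifebatch2
CLUS=$(echo "${ARG}" | cut -f 1 -d .)                     # cluster number: 76867
MDC=/minos/data/condor-tmp/$(whoami)/${HOST}-${CLUS}
jobsub_fetchlog --job=${ARG} --unzip=${MDC}
cd ${MDC}
less probe_20140926_082903_24357_0_1.out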
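
The job id parsing in the example above can also be done without cut, using shell parameter expansion. This is a minimal sketch, assuming job ids always have the form <cluster>.<process>@<server>; the function names are illustrative and not part of jobsub_client.

```shell
# Split a job id of the form <cluster>.<process>@<server> into its parts.
# Function names are illustrative only.

# Cluster number: drop everything from '@', then everything from the first '.'
jobid_cluster() { id=${1%%@*}; printf '%s\n' "${id%%.*}"; }

# Server short name: drop everything up to '@', then everything from the first '.'
jobid_host() { srv=${1#*@}; printf '%s\n' "${srv%%.*}"; }

# Output directory in the <HOST>-<CLUS> form used above.
# $1 is the job id, $2 is the base directory.
jobid_logdir() {
    printf '%s/%s-%s\n' "$2" "$(jobid_host "$1")" "$(jobid_cluster "$1")"
}
```

For example, `jobid_logdir 76867.0@fifebatch2.fnal.gov /minos/data/condor-tmp/$(whoami)` yields the same path that the MDC variable holds in the example above.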


FIFEMON provides extensive overall and detailed monitoring:
https://fifemon.fnal.gov
You will need to log in with your Services password when connecting.

Job status, somewhat like condor_q output, is available at
https://fifebatch2.fnal.gov:8443/jobsub/acctgroups/minos/jobs/

PDF API document is https://cdcvs.fnal.gov/redmine/attachments/download/15450/JobSub-API-v0.5.pdf
This describes the internal REST/JSON interface to the server.

Tracking of version 1.0 work is at https://cdcvs.fnal.gov/redmine/projects/jobsub/issues?query_id=107

Wrappers

To make the jobsub_* commands look more like the current jobsub,
I have created several functions and aliases, set up with

Issues

Immediate - affecting ongoing tests

ISSUE                      comment
condor_q -l                need this, restrict to single process
condor_q --better-analyze  --jobid should be restricted to a single process

Release - needed for ongoing support now that we have shut down gpsn01, minos25/54

ISSUE                comment
hierarchical quotas  query and set user and subgroup priorities, formerly done with condor_userprio
admin                authorize designated users in each project to hold/release/rm all project jobs

Maintenance - needed for longterm support
FIFEMON          need group breakdown and batch details
jobsub           jobsub wrapper to obviate file:///
shared accounts  support SSL certs

Documents
ISSUE            comment
SLA              Need careful statement of Minos/Fermigrid support during beta testing
-a -p            -h mentions obsolete Parrot options. Remove or support these
--no_log_buffer  -h explains why this is bad. Remove it or limit to -N 1

Resolved

ISSUE             comment
condor_hold       04/03 v0_2 jobsub_hold
condor_release    04/03 v0_2 jobsub_release
condor_rm         04/03 v0_2 jobsub_rm
set Role          04/03 v0_2 --role
condor_history    04/17 v0_2_1_rc1 jobsub_history
local batch       gpwn and Minos local batch are shut down. By Summer 2015, jobs will run with User UID on Fermigrid
minos_q           jobsub_q --summary (does this support -G <group> for a subset?)
FIFEMON           04/01 http://fifemon1.fnal.gov/monitor/pool/fifebatch1
proxy names       names have changed, confusing ifdhc locks. Upgraded ifdhc to use current cpn v1.6, getting name from proxy content
zip output        default is now tar tgz, and transparent pipe to output areas is supported
getcert           Client cert should be automatic, renewed, and unique to jobsub
                  Users must do manual kinit ; getcert -s initially and at least once a day
tar               working file fetch needs to use pipe to tar rather than intermediate file https://cdcvs.fnal.gov/redmine/issues/6552
jobid             I don't like the form. Better would be
jobsub            jobsub compatible wrapper for jobsub_submit --group minos --jobsub-server https://fifebatch1.fnal.gov:8443 -g
                  and to obviate file:///
                  default group and servers are set, still don't like file:///
kcron tickets     kcron tickets need to be allowed, so cron jobs can submit and monitor
shared accounts   works with individual's keytab.
available jobids  document URL to list available jobs for fetching