Feature #24329

jobsub_lite

Added by Dennis Box 7 months ago. Updated 2 months ago.

Status: New
Priority: Normal
Assignee:
Category: -
Target version: -
Start date: 04/20/2020
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

History

#1 Updated by Dennis Box 7 months ago

  • Assignee set to Marc Mengel

#2 Updated by Marc Mengel 7 months ago

The prototype on branch 24329 now generates a submit file and a wrapper script that look like they would work for simple submissions, and it prints the condor_submit line it would use.

I need to go back through the templates for the various options, especially the dag-based options (the dagnabbit bits in particular), and also the cvmfs tarball upload service calls.

Questions at the moment:

  • I have a "get_schedd()" call that currently just picks from jobsub01..3 at random; how should the thin-client jobsub pick a schedd? Should the available schedds be listed in a config, or should there be a web page we should check, or???
  • Do we want to support resilient dCache for tarball dropoff, or only the cvmfs publication service (which is how I have it at the moment)?
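The "listed in a config" option from the first question could be sketched roughly as below. This is only an illustration: the `[submit]` section, the `schedds` key, the helper `load_schedds()`, and the bare host names are assumptions made for the example; only the name `get_schedd()` and the random pick come from the note above.

```python
import configparser
import random

# Fallback list mirroring the "jobsub01..3" pool mentioned above
# (the exact host names are an assumption).
DEFAULT_SCHEDDS = ["jobsub01", "jobsub02", "jobsub03"]


def load_schedds(config_text=""):
    """Return the schedd list from hypothetical INI text, or the defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback="" also covers a missing [submit] section.
    raw = parser.get("submit", "schedds", fallback="")
    names = [name.strip() for name in raw.split(",") if name.strip()]
    return names or list(DEFAULT_SCHEDDS)


def get_schedd(config_text="", rng=random):
    """Pick one schedd at random from the configured list."""
    return rng.choice(load_schedds(config_text))
```

Calling `get_schedd("[submit]\nschedds = jobsub01, jobsub02\n")` would pick between the two configured hosts; with no config text it falls back to the built-in list. A web page or a collector query could replace `load_schedds()` without changing the caller.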

#3 Updated by Dennis Box 7 months ago

Recipe for submitting using raw condor commands on jobsubdevgpvm01. According to condor_q -better-analyze, these jobs will eventually run.

1) make a proxy using kx509 or voms-proxy-init or whatever. In this example it's called 'nova_proxy'

2) scp nova_proxy rexbatch@jobsubdevgpvm01:dbox/submit_dir/testjobs

3) ssh

-bash-4.2$ cd dbox/submit_dir/testjobs

-bash-4.2$ cat testjob.jdf
universe = vanilla

executable = system-info.sh
output = joboutput/out.$(cluster).$(process)
error = joboutput/err.$(cluster).$(process)
log = joboutput/log.$(cluster).$(process)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

#x509userproxy=/local/home/testuser/security/grid_proxy
x509userproxy=nova_proxy

Requirements =

queue 3

-bash-4.2$ condor_submit testjob.jdf
Submitting job(s)...
3 job(s) submitted to cluster 39107.

-bash-4.2$ condor_q rexbatch

-- Schedd: jobsubdevgpvm01.fnal.gov : <131.225.240.23:9615?... @ 05/05/20 14:54:50
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
rexbatch ID: 39107 5/5 14:42 _ _ 3 3 39107.0-2

Total for query: 3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
Total for all users: 93 jobs; 0 completed, 0 removed, 93 idle, 0 running, 0 held, 0 suspended

-bash-4.2$
-bash-4.2$ ls -lart joboutput/
total 20
drwxr-xr-x 3 rexbatch fife 4096 May 5 14:42 ..
-rw-r--r-- 1 rexbatch fife 1113 May 5 14:42 log.39107.1
-rw-r--r-- 1 rexbatch fife 1113 May 5 14:42 log.39107.0
drwxr-xr-x 2 rexbatch fife 4096 May 5 14:42 .
-rw-r--r-- 1 rexbatch fife 1113 May 5 14:42 log.39107.2

-bash-4.2$ condor_q -better-analyze 39107.0

-- Schedd: jobsubdevgpvm01.fnal.gov : <131.225.240.23:9615?...
The Requirements expression for job 39107.000 is

(TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) &&
(TARGET.HasFileTransfer)

Job 39107.000 defines the following attributes:

JobUniverse = 5
RequestDisk = ifThenElse(JobUniverse != 7,10000000,16000)
RequestMemory = ifThenElse(JobUniverse != 7,2000,10)

The Requirements expression for job 39107.000 reduces to these conditions:

Slots
Step Matched Condition
----- -------- ---------
[0] 15923 TARGET.Arch == "X86_64"
[1] 15923 TARGET.OpSys == "LINUX"
[3] 14079 TARGET.Disk >= RequestDisk
[5] 9509 TARGET.Memory >= RequestMemory
[6] 8551 [3] && [5]

No successful match recorded.
Last failed match: Tue May 5 14:44:17 2020

Reason for last match failure: no match found

39107.000: Run analysis summary ignoring user priority. Of 2135 machines,
97 are rejected by your job's requirements
1721 reject your job because of their own requirements
0 match and are already running your jobs
0 match but are serving other users
317 are able to run your job

-bash-4.2$
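For comparison with the prototype, which generates a submit file and prints the condor_submit line it would use, the manual recipe above could be sketched in Python roughly as follows. This is not the code on branch 24329: the function names and the use of `string.Template` are assumptions for illustration, the empty Requirements line from testjob.jdf is left out, and nothing is actually submitted.

```python
import shlex
from string import Template

# Template mirroring testjob.jdf from the recipe above.
# "$$" is string.Template's escape for a literal "$", so the
# condor macros $(cluster) and $(process) pass through unchanged.
SUBMIT_TEMPLATE = Template("""\
universe = vanilla

executable = $executable
output = joboutput/out.$$(cluster).$$(process)
error = joboutput/err.$$(cluster).$$(process)
log = joboutput/log.$$(cluster).$$(process)

should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT

x509userproxy = $proxy

queue $count
""")


def render_submit_file(executable, proxy, count):
    """Return the submit file text for a simple vanilla-universe job."""
    return SUBMIT_TEMPLATE.substitute(
        executable=executable, proxy=proxy, count=count
    )


def condor_submit_command(submit_path):
    """Return the condor_submit line as a string; nothing is executed."""
    return "condor_submit " + shlex.quote(submit_path)
```

With the values from the recipe, `render_submit_file("system-info.sh", "nova_proxy", 3)` gives the text to write to testjob.jdf, and `condor_submit_command("testjob.jdf")` gives the line to print or run.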

#4 Updated by Marc Mengel 5 months ago

Notes from jobsub_lite code review:

  1. ✓ Dictionary from arguments -- use vars(x) over x.__dict__
  2. ✓ Maybe move arg parser to separate file: many loc, not otherwise complicated.
  3. ✓ Possibly other splitouts? Tarfiles? dagnabbit parser?
  4. File versus params on command line -- if fixed lose if's about it and debug prints
  5. Add unit test/example comments to methods, esp. unit converter; would make it clearer.
  6. Document taking either upper/lower case.
  7. ✓ verify file: prefix in argparse(?) use type=func for argparse to check it.
  8. ✓ Look at using condor bindings to check what schedds are there,
  9. ✓ and possibly also for submitting.
  10. ✓ Use rdcs round-robin DNS and drop picking server from list on tarball upload
  11. -Review n ways to upload files: question: still support dropbox:/tardir: on -f parameters?
  12. Authentication: for current stuff need cigetcert/myproxy call;
  13. for future stuff need suitable tokens bits.
  14. Check more if dagnabbit should support nested <parallel> ... <serial> </serial> </parallel>
  15. typo requirement in template files.
  16. ✓ Looks much easier to maintain, and easier to find things.
  17. Do we want % formatting or .format() ? I'm in the habit of % formatting, bears discussion.
  18. ✓ Should we use os.path() versus formatted paths?
  19. Put future suggestions in ticket; make subtickets on this ticket or obvious stuff just fix and commit.
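Items 1, 5, 6 and 7 above can be illustrated together in a small sketch. Everything named here is hypothetical and not taken from the jobsub_lite source: the `--memory` option, the `file_uri()` and `to_bytes()` helpers, and the 1024-based unit convention are assumptions chosen only to show `vars()`, an argparse `type=` check, and example comments on a unit converter.

```python
import argparse


def file_uri(value):
    """argparse type function: accept only values starting with 'file:'."""
    if not value.startswith("file:"):
        raise argparse.ArgumentTypeError(
            f"expected a file: prefix, got {value!r}"
        )
    return value


# Multipliers to bytes; the 1024-based convention is an assumption.
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def to_bytes(text):
    """Convert a size such as '2GB' or '500mb' to bytes.

    Units are accepted in upper or lower case; a bare number means bytes.

    >>> to_bytes("2GB")
    2147483648
    >>> to_bytes("500mb")
    524288000
    >>> to_bytes("10")
    10
    """
    text = text.strip().upper()
    # Try two-letter units before the bare "B" suffix.
    for unit in sorted(_UNITS, key=len, reverse=True):
        if text.endswith(unit):
            return int(float(text[: -len(unit)]) * _UNITS[unit])
    return int(text)


def build_parser():
    """Build a tiny parser; the option names are illustrative only."""
    parser = argparse.ArgumentParser(prog="jobsub_lite_sketch")
    parser.add_argument("--memory", type=to_bytes, default=to_bytes("2GB"))
    parser.add_argument("executable", type=file_uri)
    return parser


def parse_to_dict(argv):
    """Parse argv and return the options as a plain dict via vars()."""
    return vars(build_parser().parse_args(argv))
```

Passing the check as `type=` means argparse reports a bad prefix or unit as a normal usage error, and the docstring examples double as doctests for the converter.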

Thanks to Vito, Bruno, and especially Shreyas for attending.

#5 Updated by Marc Mengel 2 months ago

So there is an example of using scitokens in a submit file:

https://github.com/htcondor/scitokens-credmon/blob/master/examples/submit/scitoken_example/single_scitoken.submit

which is supposed to let you say what token bits your job needs (?).
Do we add these to our jobsub_lite templates?

There are notes at scitokens-credmon: Testing the credmon...

Also https://cdcvs.fnal.gov/redmine/projects/glideinwms/wiki/CredMonScitokensDocker

#6 Updated by Marc Mengel 2 months ago

Also adding Jobsub_lite_design_document per Tanya's request.
