Bug #16775

avoid a round trip if tarball is in pnfs

Added by Dennis Box almost 4 years ago. Updated about 3 years ago.

Status:
Closed
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
06/06/2017
Due date:
02/01/2018
% Done:

90%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration: 241

Description

Also check whether the tarball size is > (say) 100 MB; if so, complain and tell the user to put it in pnfs instead.
Ken will ask (Gabe? Someone else?) whether 100 MB is the right limit.

History

#1 Updated by Dennis Box over 3 years ago

  • Target version changed from v1.2.4 to v1.2.5

#2 Updated by Dennis Box over 3 years ago

  • Target version changed from v1.2.5 to v1.2.7

#3 Updated by Dennis Box over 3 years ago

  • Target version changed from v1.2.7 to v1.2.6

Related to #17828; these are all essentially the same ticket: do the right thing with -f file://pnfs, --tar-file-name file://pnfs, --dropbox file://pnfs..., jobsub-submit file://pnfs....

#4 Updated by Dennis Box over 3 years ago

  • Target version changed from v1.2.6 to v1.2.7

In general: do the right thing with -f, dropbox://, file://, and /pnfs.

These are all related tickets (and should maybe be merged into one?): #17167 #17828 #16775 #12871 #13785

#5 Updated by Dennis Box over 3 years ago

  • Assignee changed from Dennis Box to Shreyas Bhat
  • Priority changed from Normal to High
  • Target version changed from v1.2.7 to v1.2.6

This has suddenly jumped up in priority.

Background:
Jobsub_submit has a potentially useful feature, --tar_file_name, documented here:

[dbox@novagpvm01 ~]$ jobsub_submit -G nova --help --jobsub-server fifebatch.fnal.gov
Usage: jobsub_submit [Client Options] [Server Options] file://user_script [user_script_args]

[a bunch of stuff]

--tar_file_name=dropbox://PATH/TO/TAR_FILE
specify tarball to transfer to worker node. TAR_FILE
will be copied to the jobsub server and added to the
transfer_input_files list. TAR_FILE will be accessible
to the user job on the worker node via the environment
variable $INPUT_TAR_FILE.

A (poorly documented) feature of this option is that if /PATH/TO/TAR_FILE is a directory, a tar file of that directory is created, transferred to the jobsub server via curl, and sent on to worker nodes via whatever mechanism condor has been configured with. If there are many processes in the cluster and/or the resulting tarball is large, this renders the jobsub server I/O bound and unable to service other requests. The feature as implemented is deprecated, but it is now needed for bluearc decommissioning.

The feature needs to be modified so that, on the client side, jobsub_submit creates the tarball somewhere ifdh can access it, like pnfs scratch or dCache. On the server side, the jobsub job wrapper that is generated and run on the worker node needs to be modified so that it transfers the created tarfile in via ifdh and untars it, rather than relying on condor to do it.

#6 Updated by Shreyas Bhat over 3 years ago

Rough steps:
  1. Recreate behavior using production, and then test jobsub server
    1. Create dummy dir with a few files, dummy script that untars $INPUT_TAR_FILE and does ls or something in a job
    2. Submit job with that dir given, make sure it works as advertised. See if I can follow the file around from the submit node, to the schedd, to the worker node
  2. Make sure that jobsub_submit now creates the tarball in pnfs scratch space. WHERE WOULD BE A GOOD PLACE? EXPERIMENT SCRATCH, SINCE WE KNOW THE EXPERIMENT FROM THE -g OPTION?
  3. Change wrapper script so that it transfers in the file using ifdh and untars it.
  4. Test on test cluster

#7 Updated by Shreyas Bhat over 3 years ago

The expected current behavior is that we should be able to use --tar_file_name=tardir://PATH/TO/DIR, and this will make jobsub tar up the dir, transfer the tarball to the jobsub server, transfer it to the worker node, and then untar it. This works for the dropbox:// URI when we provide a pre-tarred file. It does not, however, work for the directory option. Note that the URI patterns are in client/constants.py.

When --tar_file_name is specified, the code starting at client/jobsubClient.py:146 parses the dropbox URI (if it's tardir://, the get_directory_tar_map call at line 152 does the actual parsing). Note that in get_directory_tar_map, we use uri2path (defined at line 1449) to grab the path to the file, and then os.path.basename gives us a filename to name our tarball. IMPORTANT: this means that if you pass something like tardir://PATH/TO/DIR/ (with a trailing slash), os.path.basename will return a blank string, which will cause the untarring in the job to fail, since it will try to untar $_CONDOR_JOB_IWD.
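
For reference, this is standard os.path.basename behavior, easy to verify in an interpreter:

import os
os.path.basename('/PATH/TO/DIR')   # returns 'DIR'
os.path.basename('/PATH/TO/DIR/')  # returns '' - the trailing slash leaves an empty basename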

The function dropbox_upload at line 248 does the actual upload of the tar file to the jobsub server via curl, as Dennis mentioned above.

On the server side, though, there's an issue with the current implementation, which can be seen in lib/groupsettings/JobSettings.py. At line 270, we have the following:

        if settings['tar_file_name']:
            settings['tar_file_basename'] = os.path.basename(
                settings['tar_file_name'])

So if the tar_file_name is given as /path/to/dir, settings['tar_file_basename'] gets set to "dir" rather than "dir.tar". This becomes a problem later on, when the wrapper script is auto-generated (line 947):

        if 'tar_file_basename' in settings:
            f.write(
                "export INPUT_TAR_FILE=${_CONDOR_JOB_IWD}/%s\n" % settings['tar_file_basename'])

So in this case, $INPUT_TAR_FILE becomes ${_CONDOR_JOB_IWD}/dir, and then, the very next line written to the wrapper script is:

if [ -e "$INPUT_TAR_FILE" ]; then tar xvf "$INPUT_TAR_FILE" ; fi

So we try to untar a non-existent tarball called "dir" rather than "dir.tar". We can see this in a test job I ran:

Thu Jan 4 20:46:12 UTC 2018 BEGIN EXECUTION ./test_untar.sh
Starting run now
Checking that $INPUT_TAR_FILE is set
/storage/local/data1/condor/execute/dir_15071/test_dir                            ## This was $INPUT_TAR_FILE
/storage/local/data1/condor/execute/dir_15071:
total 36
-rw-r--r-- 1 sbhat sbn     0 Jan  4 20:46 _condor_stderr
-rw-r--r-- 1 sbhat sbn     0 Jan  4 20:46 _condor_stdout
-rwxr-xr-x 1 sbhat sbn  6617 Jan  4 20:46 condor_exec.exe
drwxrwxr-x 2 sbhat sbn  4096 Jan  4 20:46 jsb_tmp
drwxrwxr-x 3 sbhat sbn  4096 Jan  4 20:46 no_xfer
-rwxrwxr-x 1 sbhat sbn   418 Jan  4 20:46 test_untar.sh
-rw------- 1 sbhat sbn 12408 Jan  4 20:46 x509cc_sbhat_Analysis

/storage/local/data1/condor/execute/dir_15071/jsb_tmp:
total 12
-rw-rw-r-- 1 sbhat sbn  60 Jan  4 20:46 JOBSUB_ERR_FILE
-rw-rw-r-- 1 sbhat sbn 169 Jan  4 20:46 JOBSUB_LOG_FILE
-rwxrwxr-x 1 sbhat sbn 812 Jan  4 20:46 ifdh.sh

/storage/local/data1/condor/execute/dir_15071/no_xfer:
total 4
drwxrwxr-x 3 sbhat sbn 4096 Jan  4 20:46 0

/storage/local/data1/condor/execute/dir_15071/no_xfer/0:
total 4
drwxrwxr-x 2 sbhat sbn 4096 Jan  4 20:46 TRANSFERRED_INPUT_FILES

/storage/local/data1/condor/execute/dir_15071/no_xfer/0/TRANSFERRED_INPUT_FILES:
total 0
Thu Jan 4 20:46:12 UTC 2018 ./test_untar.sh COMPLETED with exit status 0

So before changing where the tar file for the directory is created on the jobsub server (behavior that seems to work as designed, judging by /fife/local/scratch/dropbox/), we'll need to address this.

An example of the correct (and expected) behavior, using a pre-created tar is .

One more issue - CVMFS doesn't seem to be mounted on my jobsub dev server, so I'm submitting test jobs from my test client (fermicloud362) to the production servers for now. Eventually, I'll need to figure out the CVMFS problem.

#8 Updated by Shreyas Bhat over 3 years ago

  • Status changed from New to Work in progress

Got CVMFS mounted on my test jobsub server (fermicloud035). Running test job there to make sure I can reproduce behavior seen yesterday in production.

#9 Updated by Shreyas Bhat over 3 years ago

Behavior reproduced:

Submit command:

./jobsub/client/jobsub_submit.py -G nova --debug --jobsub-server=fermicloud035.fnal.gov  --tar_file_name=tardir:///home/sbhat/test_dir file:///home/sbhat/test_untar.sh

Dropbox created with tarred dir:

[root@fermicloud035 e1823da154acb2a89c51da04460d200e490c5f61]# pwd
/fife/local/scratch/dropbox/nova/sbhat/e1823da154acb2a89c51da04460d200e490c5f61
[root@fermicloud035 e1823da154acb2a89c51da04460d200e490c5f61]# ls
test_dir.tar
[root@fermicloud035 e1823da154acb2a89c51da04460d200e490c5f61]# tar -tf test_dir.tar
d
c
a
b

But for the reasons brought up yesterday, the tar file doesn't show up in the job, and the wrapper script tries to untar test_dir rather than test_dir.tar:

[root@fermicloud035 688.0@fermicloud035.fnal.gov]# pwd
/fife/local/scratch/uploads/nova/sbhat/688.0@fermicloud035.fnal.gov
[root@fermicloud035 688.0@fermicloud035.fnal.gov]# cat *.out
Fri Jan 5 16:44:50 CST 2018 BEGIN EXECUTION ./test_untar.sh
Starting run now
Checking that $INPUT_TAR_FILE is set
/var/lib/condor/execute/dir_2586613/test_dir
/var/lib/condor/execute/dir_2586613:
total 36
-rw-r--r-- 1 sbhat sbhat     0 Jan  5 16:44 _condor_stderr
-rw-r--r-- 1 sbhat sbhat     0 Jan  5 16:44 _condor_stdout
-rwxr-xr-x 1 sbhat sbhat  6626 Jan  5 16:44 condor_exec.exe
drwxrwxr-x 2 sbhat sbhat  4096 Jan  5 16:44 jsb_tmp
drwxrwxr-x 3 sbhat sbhat  4096 Jan  5 16:44 no_xfer
-rwxrwxr-x 1 sbhat sbhat   418 Jan  5 16:44 test_untar.sh
-rw------- 1 sbhat sbhat 12429 Jan  5 16:44 x509cc_sbhat_Analysis

/var/lib/condor/execute/dir_2586613/jsb_tmp:
total 12
-rw-rw-r-- 1 sbhat sbhat  60 Jan  5 16:44 JOBSUB_ERR_FILE
-rw-rw-r-- 1 sbhat sbhat 159 Jan  5 16:44 JOBSUB_LOG_FILE
-rwxrwxr-x 1 sbhat sbhat 812 Jan  5 16:44 ifdh.sh

/var/lib/condor/execute/dir_2586613/no_xfer:
total 4
drwxrwxr-x 3 sbhat sbhat 4096 Jan  5 16:44 0

/var/lib/condor/execute/dir_2586613/no_xfer/0:
total 4
drwxrwxr-x 2 sbhat sbhat 4096 Jan  5 16:44 TRANSFERRED_INPUT_FILES

/var/lib/condor/execute/dir_2586613/no_xfer/0/TRANSFERRED_INPUT_FILES:
total 0
Fri Jan 5 16:44:50 CST 2018 ./test_untar.sh COMPLETED with exit status 0

#10 Updated by Shreyas Bhat about 3 years ago

As mentioned before, JobSettings.py (in lib/groupsettings) writes the wrapper script in such a way that if you pass in a directory to get tarred up, the wrapper script tries to untar the directory name ("my_dir") rather than the tar file that was created on the dropbox server ("my_dir.tar"). I've made a tentative change to fix that and committed it to my branch. It needs to be strengthened, though: the way I wrote it, we assume the file in the dropbox is a tarred directory whenever ".tar" isn't in the --tar_file_name argument. I'm not sure that's a safe assumption; in fact, I don't think it is.
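
A rough sketch of the shape of that tentative fix (hypothetical; the actual commit is on my branch), in the style of the JobSettings.py excerpt above, which also shows why the check is tenuous:

        if settings['tar_file_name']:
            basename = os.path.basename(settings['tar_file_name'])
            if '.tar' not in basename:
                # Assume a tarred-up directory and append '.tar' -- tenuous, since
                # a dropbox file isn't guaranteed to follow this naming convention.
                basename += '.tar'
            settings['tar_file_basename'] = basename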

The other issue, which I pointed out last update, is that the tarred-up file doesn't get transferred to the worker node in this case. This is because in jobsubClient.py, line 192, we take the literal argument passed to jobsub and add it to the transfer_input_files list (line 214) if it's a dropbox URI. I added a few print statements, and we can see this below (the lines where I say "Arg is"; those are the values that get added):

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit.py -G nova --debug --jobsub-server=fermicloud035.fnal.gov  --tar_file_name=tardir:///home/sbhat/test_dir file:///home/sbhat/test_untar.sh
SERVER_ARGS:  ['--tar_file_name=tardir:///home/sbhat/test_dir', 'file:///home/sbhat/test_untar.sh']
CLIENT_ARGS:  {'dag': False, 'help': False, 'acctRole': None, 'jobid_output_only': False, 'debug': True, 'dropboxServer': None, 'acctGroup': 'nova', 'jobsubServer': 'fermicloud035.fnal.gov'}
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
/usr/bin/cigetcert -s fermicloud035.fnal.gov -n -o /tmp/x509up_u501
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
stdout: Checking if /tmp/x509up_u501 can be reused ..... yes

stderr:
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
tar_file=test_dir.tar tar_path=/home/sbhat/test_dir cwd=/home/sbhat
creating tar of /home/sbhat/test_dir
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
SERVER RESPONSE CODE: 200
Arg is:  --tar_file_name
Arg is:  tardir:///home/sbhat/test_dir
Arg is:  file:///home/sbhat/test_untar.sh
ltdict={'stime': '01/08/18 15:54:15', 'etime': '01/09/18 17:38:57'}
URL            : https://fermicloud035.fnal.gov:8443/jobsub/acctgroups/nova/jobs/ LS10YXJfZmlsZV9uYW1lIHRhcmRpcjovLy9ob21lL3NiaGF0L3Rlc3RfZGlyIEAvaG9tZS9zYmhhdC90ZXN0X3VudGFyLnNo

CREDENTIALS    : {'cert': '/tmp/x509up_u501', 'proxy': '/tmp/x509up_u501', 'key': '/tmp/x509up_u501'}

SUBMIT_URL     : https://fermicloud035.fnal.gov:8443/jobsub/acctgroups/nova/jobs/

SERVER_ARGS_B64: --tar_file_name tardir:///home/sbhat/test_dir @/home/sbhat/test_untar.sh

checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
/fife/local/scratch/uploads/nova/sbhat/2018-01-09_164216.834052_9184

tar_file_name is  tardir:///home/sbhat/test_dir

tar_file_basename set to test_dir.tar

/fife/local/scratch/uploads/nova/sbhat/2018-01-09_164216.834052_9184/test_untar.sh_20180109_164217_2664112_0_1_.cmd

submitting....

Submitting job(s).

1 job(s) submitted to cluster 697.

JobsubJobId of first job: 697.0@fermicloud035.fnal.gov

Use job id 697.0@fermicloud035.fnal.gov to retrieve output

JOBSUB SERVER CONTACTED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONDED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONSE CODE : 200 (Success)
JOBSUB SERVER SERVICED IN   : 0.989804983139 sec
JOBSUB CLIENT FQDN          : fermicloud362.fnal.gov
JOBSUB CLIENT SERVICED TIME : 09/Jan/2018 16:42:17 

Since the args are not in dropbox URI form, they don't get added at all. So that's the next thing to fix.

#11 Updated by Shreyas Bhat about 3 years ago

The issue stated yesterday is resolved. I've added a condition to jobsubClient so that if a server arg matches the DIRECTORY_SUPPORTED_URI pattern (tardir://), we grab the dropbox:// URI already generated for the created tarball and use that as the arg (sketched below). My test worked.
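
A rough sketch of that condition (names other than DIRECTORY_SUPPORTED_URI, which lives in client/constants.py, are hypothetical):

import re
from constants import DIRECTORY_SUPPORTED_URI  # hypothetical import path; pattern is in client/constants.py

for i, arg in enumerate(server_argv):
    if re.match(DIRECTORY_SUPPORTED_URI, arg):
        # Swap the tardir:// arg for the dropbox:// URI of the tarball we just
        # created, so it gets added to transfer_input_files like any other
        # dropbox arg (tardir_dropbox_map is the tardir -> dropbox map).
        server_argv[i] = tardir_dropbox_map[arg]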

Submission:

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit.py -G nova --debug --jobsub-server=fermicloud035.fnal.gov  --tar_file_name=tardir:///home/sbhat/test_dir file:///home/sbhat/test_untar.sh
SERVER_ARGS:  ['--tar_file_name=tardir:///home/sbhat/test_dir', 'file:///home/sbhat/test_untar.sh']
CLIENT_ARGS:  {'dag': False, 'help': False, 'acctRole': None, 'jobid_output_only': False, 'debug': True, 'dropboxServer': None, 'acctGroup': 'nova', 'jobsubServer': 'fermicloud035.fnal.gov'}
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
/usr/bin/cigetcert -s fermicloud035.fnal.gov -n -o /tmp/x509up_u501
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
stdout: Checking if /tmp/x509up_u501 can be reused ..... yes

stderr:
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
tar_file=test_dir.tar tar_path=/home/sbhat/test_dir cwd=/home/sbhat
creating tar of /home/sbhat/test_dir
checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
SERVER RESPONSE CODE: 200
Arg is:  --tar_file_name
Arg is:  tardir:///home/sbhat/test_dir
woo!
dropbox://test_dir.tar
{'dropbox://test_dir.tar': 'c165758a1f7f641e73791b28a31cd723e25c6606'}
{'tardir:///home/sbhat/test_dir': 'dropbox://test_dir.tar'}
c165758a1f7f641e73791b28a31cd723e25c6606
{u'url': u'/jobsub/acctgroups/nova/dropbox/c165758a1f7f641e73791b28a31cd723e25c6606/test_dir.tar', u'path': u'/fife/local/scratch/dropbox/blahblah/nova/sbhat/c165758a1f7f641e73791b28a31cd723e25c6606/test_dir.tar', u'host': u'fermicloud035.fnal.gov'}
Arg is:  file:///home/sbhat/test_untar.sh
Transfer input files is:  /fife/local/scratch/dropbox/blahblah/nova/sbhat/c165758a1f7f641e73791b28a31cd723e25c6606/test_dir.tar
ltdict={'stime': '01/09/18 22:25:28', 'etime': '01/11/18 00:25:24'}
URL            : https://fermicloud035.fnal.gov:8443/jobsub/acctgroups/nova/jobs/ LS1leHBvcnRfZW52PVpYaHdiM0owSUZSU1FVNVRSa1ZTWDBsT1VGVlVYMFpKVEVWVFBTOW1hV1psTDJ4dlkyRnNMM05qY21GMFkyZ3ZaSEp2Y0dKdmVDOWliR0ZvWW14aGFDOXViM1poTDNOaWFHRjBMMk14TmpVM05UaGhNV1kzWmpZME1XVTNNemM1TVdJeU9HRXpNV05rTnpJelpUSTFZelkyTURZdmRHVnpkRjlrYVhJdWRHRnlPdz09IC0tdGFyX2ZpbGVfbmFtZSAvZmlmZS9sb2NhbC9zY3JhdGNoL2Ryb3Bib3gvYmxhaGJsYWgvbm92YS9zYmhhdC9jMTY1NzU4YTFmN2Y2NDFlNzM3OTFiMjhhMzFjZDcyM2UyNWM2NjA2L3Rlc3RfZGlyLnRhciBAL2hvbWUvc2JoYXQvdGVzdF91bnRhci5zaA==

CREDENTIALS    : {'cert': '/tmp/x509up_u501', 'proxy': '/tmp/x509up_u501', 'key': '/tmp/x509up_u501'}

SUBMIT_URL     : https://fermicloud035.fnal.gov:8443/jobsub/acctgroups/nova/jobs/

SERVER_ARGS_B64: --export_env=ZXhwb3J0IFRSQU5TRkVSX0lOUFVUX0ZJTEVTPS9maWZlL2xvY2FsL3NjcmF0Y2gvZHJvcGJveC9ibGFoYmxhaC9ub3ZhL3NiaGF0L2MxNjU3NThhMWY3ZjY0MWU3Mzc5MWIyOGEzMWNkNzIzZTI1YzY2MDYvdGVzdF9kaXIudGFyOw== --tar_file_name /fife/local/scratch/dropbox/blahblah/nova/sbhat/c165758a1f7f641e73791b28a31cd723e25c6606/test_dir.tar @/home/sbhat/test_untar.sh

checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
/fife/local/scratch/uploads/nova/sbhat/2018-01-10_115110.260476_654

tar_file_name is  /fife/local/scratch/dropbox/blahblah/nova/sbhat/c165758a1f7f641e73791b28a31cd723e25c6606/test_dir.tar

tar_file_basename set to test_dir.tar

/fife/local/scratch/uploads/nova/sbhat/2018-01-10_115110.260476_654/test_untar.sh_20180110_115110_2684250_0_1_.cmd

submitting....

Submitting job(s).

1 job(s) submitted to cluster 710.

JobsubJobId of first job: 710.0@fermicloud035.fnal.gov

Use job id 710.0@fermicloud035.fnal.gov to retrieve output

JOBSUB SERVER CONTACTED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONDED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONSE CODE : 200 (Success)
JOBSUB SERVER SERVICED IN   : 0.977149963379 sec
JOBSUB CLIENT FQDN          : fermicloud362.fnal.gov
JOBSUB CLIENT SERVICED TIME : 10/Jan/2018 11:51:10

And the job output:

d
c
a
b
Wed Jan 10 11:51:21 CST 2018 BEGIN EXECUTION ./test_untar.sh
Starting run now
Checking that $INPUT_TAR_FILE is set
/var/lib/condor/execute/dir_2684362/test_dir.tar
/var/lib/condor/execute/dir_2684362:
total 60
-rw-r--r-- 1 sbhat sbhat     0 Jan 10 11:51 _condor_stderr
-rw-r--r-- 1 sbhat sbhat     8 Jan 10 11:51 _condor_stdout
-rw-r--r-- 1 sbhat sbhat     5 Jan  4 09:51 a
-rw-r--r-- 1 sbhat sbhat     5 Jan  4 09:51 b
-rw-r--r-- 1 sbhat sbhat     5 Jan  4 09:51 c
-rwxr-xr-x 1 sbhat sbhat  6702 Jan 10 11:51 condor_exec.exe
-rw-r--r-- 1 sbhat sbhat     5 Jan  4 09:51 d
drwxrwxr-x 2 sbhat sbhat  4096 Jan 10 11:51 jsb_tmp
drwxrwxr-x 3 sbhat sbhat  4096 Jan 10 11:51 no_xfer
-rw-r--r-- 1 sbhat sbhat   168 Jan 10 11:51 test_dir.tar
-rwxrwxr-x 1 sbhat sbhat   418 Jan 10 11:51 test_untar.sh
-rw------- 1 sbhat sbhat 12433 Jan 10 11:51 x509cc_sbhat_Analysis

/var/lib/condor/execute/dir_2684362/jsb_tmp:
total 12
-rw-rw-r-- 1 sbhat sbhat  61 Jan 10 11:51 JOBSUB_ERR_FILE
-rw-rw-r-- 1 sbhat sbhat 164 Jan 10 11:51 JOBSUB_LOG_FILE
-rwxrwxr-x 1 sbhat sbhat 812 Jan 10 11:51 ifdh.sh

/var/lib/condor/execute/dir_2684362/no_xfer:
total 4
drwxrwxr-x 3 sbhat sbhat 4096 Jan 10 11:51 0

/var/lib/condor/execute/dir_2684362/no_xfer/0:
total 4
drwxrwxr-x 2 sbhat sbhat 4096 Jan 10 11:51 TRANSFERRED_INPUT_FILES

/var/lib/condor/execute/dir_2684362/no_xfer/0/TRANSFERRED_INPUT_FILES:
total 0
Wed Jan 10 11:51:21 CST 2018 ./test_untar.sh COMPLETED with exit status 0
So now:
  1. The dir is tarred up at submission point by the client
  2. The new tarred file is added to the list of files to transfer into the job
  3. The wrapper script now correctly writes out which tar file to unpack in the job
  4. The tarfile is downloaded from the dropbox correctly
  5. The tarfile gets unwrapped

So far, the progress has been in getting steps 2 through 4 in place (the first and last steps were previously working correctly).

#12 Updated by Shreyas Bhat about 3 years ago

Dennis and I discussed the changes. As I mentioned in a previous comment, getting step 3 in the last comment to work relied on a tenuous check (that ".tar" was in the filename) that I wanted to do away with in the final version. We're going to go with passing a new setting from the client (which already determines whether the passed-in arg is a directory or a tarball) to the server, and then checking that setting rather than checking the name.

Another point: apparently the dCache folks weren't too happy with the idea of us copying tarballs to dCache as the standard --tar_file_name behavior. We'll discuss alternatives at tomorrow's FIFE meeting.

#13 Updated by Shreyas Bhat about 3 years ago

Pushed that change to branch here:

https://cdcvs.fnal.gov/redmine/projects/jobsub/repository?utf8=%E2%9C%93&rev=16775

If the client determines that the --tar_file_name URI is "tardir://", it will append --is_tardir to the server args. The server side (JobSettings.py) looks for that arg, and if it's set, will adjust the tar_file_basename setting as described previously. This is a much stronger check.

#14 Updated by Shreyas Bhat about 3 years ago

  • Due date set to 02/01/2018

The decision is to use /pnfs/<expt>/scratch for now. We need to code this in a way that's easily changeable.

I discussed this with Dennis, and our plan is to change the client so that it queries the server for where to upload the files (using the authentication queries as a model of good coding practice), then uses ifdh to do the upload (changing the dropbox_upload method for that). The server will store the location in the jobsub.ini file.

The wrapper script will have to be changed to ifdh cp from that location.
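
A hypothetical sketch of what that wrapper change might look like, in the f.write style JobSettings.py already uses (the exact ifdh invocation is an assumption):

        if 'tar_file_basename' in settings:
            f.write(
                "export INPUT_TAR_FILE=${_CONDOR_JOB_IWD}/%s\n" % settings['tar_file_basename'])
            # Pull the staged tarball in from pnfs rather than relying on condor:
            f.write("ifdh cp %s $INPUT_TAR_FILE\n" % settings['tar_file_name'])
            f.write('if [ -e "$INPUT_TAR_FILE" ]; then tar xvf "$INPUT_TAR_FILE" ; fi\n')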

#15 Updated by Shreyas Bhat about 3 years ago

Made some progress on the web-server side. Need to discuss some details with Dennis.

I set up another URL (%s/jobsub/acctgroups/%s/tdirdropboxlocation/) that links to methods in tardir_dropbox_location.py (heavily modeled on auth_methods.py). I've called it from the client, but for some reason I'm getting a 404. I tried just changing some log statements on the server side, and nothing seems to change what's actually output. I'll need to discuss this with Dennis.

#16 Updated by Shreyas Bhat about 3 years ago

Restarting the webserver fixed it.

The API I want to create will be at /acctgroups/<group>/tardirdropboxlocation/. So I'm going to need to:
  • Server side:
    • Put info into jobsub.ini (do one acctgroup first, then do a default and test)
    • Create a new module to implement the above API
    • Create a helper function in jobsub.py
    • Instantiate a new class from AccountingGroupsResource
  • Client side:
    • Rough-code in jobsubClient to query the server
    • Move the URL into the constants file and test that
    • If that works, move that client code into its own function like the auth query is

Server:

Put info into jobsub.ini (do one acctgroup first, then do a default and test)

This field in jobsub.ini is called "tardir_dropbox_location". I've implemented it with nova.

Create a new module to implement the above API

I've created a new python module called tardir_dropbox_location.py, modeled after auth_methods.py. The doGET method calls jobsub.get_tardir_dropbox (described below) to get the dropbox location from jobsub.py. If it gets a string, it'll return that in a JSON document. If it gets None back, it'll throw a 404.
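
A hedged sketch of that doGET logic (assuming the server's CherryPy conventions; the {'out': ...} shape matches the response shown in a later comment):

import cherrypy
import jobsub

class TarballDropboxLocationResource(object):
    def doGET(self, acctgroup, **kwargs):
        location = jobsub.get_tardir_dropbox(acctgroup)
        if location is None:
            # No tardir_dropbox_location configured for this group
            raise cherrypy.HTTPError(404)
        return {'out': location}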

Create a helper function in jobsub.py

get_tardir_dropbox instantiates a JobsubConfigParser, parses the file to look for tardir_dropbox_location, and returns it. If a VO doesn't have a tardir_dropbox_location key, the code will return None.
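
A minimal sketch of that helper (assuming a ConfigParser-style get on JobsubConfigParser; the real implementation is in jobsub.py):

def get_tardir_dropbox(acctgroup):
    parser = JobsubConfigParser()
    # Returns the configured location for this VO, or None if the key is absent
    location = parser.get(acctgroup, 'tardir_dropbox_location')
    return location if location else None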

Instantiate new class from AccountingGroupsResource

self.tardirdropboxlocation = TarballDropboxLocationResource() in init of AccountingGroupsResource. (accounting_group.py)

#17 Updated by Shreyas Bhat about 3 years ago

Client:

Rough-code in jobsubClient to query the server

Just copied code from auth query, hardcoded URL. Successful:

{"out": "/pnfs/nova/scratch"}

Move the URL into the constants file and test that

JOBSUB_TARDIR_DROPBOX_LOCATION_URL_PATTERN is the name of the new constant. Subbing this in the client code worked.

If that works, move that client code into its own function like the auth query is

The method is called tardirDropboxLocation, and it follows essentially the same logic as the serverAuthMethods method of the jobsubClient class. Testing worked.

All these changes are committed and pushed to the feature branch on Redmine.

#18 Updated by Shreyas Bhat about 3 years ago

Default value: I added tardir_dropbox_location to the defaults section of jobsub.ini (value /pnfs/%%s/scratch), so that we assume the value should be /pnfs/<acctgroup>/scratch unless otherwise stated in an experiment's section. I tested submitting with a subgroup, and it seems the subgroup is added later in the submission chain, so we're OK.

The dropbox helper function get_tardir_dropbox tries to substitute the acctgroup into that string, returning the new string if the substitution succeeds and the original if it doesn't. So if the function got /pnfs/nova/scratch back from the file, the substitution fails and it returns /pnfs/nova/scratch unchanged. If it had gotten the default /pnfs/%s/scratch, the substitution would succeed, and it'd return the filled-in string.
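
A sketch of that substitution logic (hypothetical function name; the behavior is as described above):

def expand_location(location, acctgroup):
    try:
        # The default '/pnfs/%s/scratch' has a placeholder to fill in...
        return location % acctgroup
    except TypeError:
        # ...while an explicit path like '/pnfs/nova/scratch' has none, so
        # the substitution raises and we return the string unchanged.
        return location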

#19 Updated by Shreyas Bhat about 3 years ago

Next steps:

  • Make "jobsub_stage" dir in /pnfs/<expt>/scratch * Store our tarball in dir that's named SHA-1 of tarball * So we can check, and if it's already there, don't re-upload
  • Create new code path for this. Don't take out existing code.

Dennis will work on the JDF and wrapper script updates to pull the file out of /pnfs/<expt>/scratch/jobsub_stage.

#20 Updated by Shreyas Bhat about 3 years ago

Notes from meeting with M. Mengel, D. Box:

  • Want to make dir on /pnfs/<expt>/scratch - OK
  • ifdh mkdir_p? --- YES
  • python bindings? --- YES, if possible
  • Cleanup:
    • ifdh apply - go through files, match some sort of pattern (GLOB), run command X on them. This is so we can do ifdh rmdir
    • files disappear after a while
    • you can actively clean it out if you want, but we probably don't need to
  • Upload:
    • Avoid ifdh ls if possible
    • Turn retries off
    • So we should create a separate code path (if-else) for when tar_dropbox_location is a /pnfs path. If so, just attempt the copy and check the resulting error (look for "File exists" or "exists"); if that error comes back, move on.
    • Stick the username on the front of the tarfile name, e.g. /pnfs/<expt>/scratch/jobsub_stage/<hash>/<grid_user>_filename.tar. Need to look for that on transfer out too. (We think it's grid_user.)
  • Python bindings:
    • setup ifdhc
    • Make sure we set up the one that's built against OUR version of python
    • -bash-4.1$ ups list -aK+ ifdhc
      ...
      "ifdhc" "v2_1_0" "Darwin" "" ""
      "ifdhc" "v2_2_3" "Linux64bit+2.6-2.12" "" ""
      "ifdhc" "v2_2_3" "Darwin64bit+15.6.0" "python27" ""
      "ifdhc" "v2_2_3" "Linux64bit+3.10-2.17" "python27" ""
      "ifdhc" "v2_2_3" "Linux64bit+3.10-2.17" "e14" ""
      "ifdhc" "v2_2_3" "Linux64bit+2.6-2.12" "e14" ""
      "ifdhc" "v2_2_3" "Linux64bit+3.10-2.17" "" ""
    • Need to change ups so that setting up jobsub_client also sets up ifdhc
    • Can use D0:user@hostname:path for scp, etc.
  • ifdh stage via: --- Not now, but worth playing with it later (esp. for CVMFS)
    • Put file in a staging area.
    • Use an IFDH_Stage area in dCache
    • ifdh_copyback in a cron job --- finishes up any transfers that aren't done.
  • possible web service to throttle before we try to transfer in, or transfer out (we'd basically check this)

#21 Updated by Shreyas Bhat about 3 years ago

The latest commit allows us to use ifdh to transfer files to the directory specified in the jobsub INI file. The client queries the server for the location it should send files to, then makes the jobsub_stage dir if needed via an ifdh mkdir_p. We then try to copy the bzipped tarfile of the directory in using ifdh cp. If the file already exists, we catch that error by searching for "File exists" in the error text; if we find it, we just move on. If it's some other error, we exit out.

I changed the default compression from gzip to bz2 because the latter's output is time-independent (gzip embeds a timestamp in its header; bz2 does not). This way, the hashes of identical directories' tar files will be the same even if the directories were tarred up at different times, which lets us avoid uploading multiple identical tarballs.
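
To illustrate the point (a hypothetical demonstration, not the jobsub code):

import hashlib
import os
import tarfile

def tar_and_hash(dir_path, tar_path):
    # 'w:bz2' output carries no timestamp, so tarring an unchanged directory
    # twice yields byte-identical files and identical digests; 'w:gz' would
    # embed the compression time and break that.
    with tarfile.open(tar_path, 'w:bz2') as tf:
        tf.add(dir_path, arcname=os.path.basename(dir_path))
    with open(tar_path, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest()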

What's left:
  • Authentication. Should be good as long as EXPERIMENT and GROUP are set (ifdh is smart enough on interactive nodes to figure it out).
  • Make sure that all dropbox:// and tardir:// URIs get their files handed off to pnfs

#22 Updated by Shreyas Bhat about 3 years ago

We decided that tarred dirs and tar files (and indeed any file given a dropbox:// or tardir:// URI) should be treated the same. That is, we want any file specified with --tar_file_name, under either URI, to be transferred via ifdh to the location specified in the jobsub ini file.

I've changed the code in jobsubClient.py to this end - so that we do the following:

  1. Generate the dropbox URI-hash map for any files passed in with the dropbox:// URI (like we do currently).
  2. Create a tarball from any directory that's specified with the tardir:// URI. Create a dropbox URI for that, and add it to the dropbox URI-hash map. Create a new map that maps the dropbox URI to the original tardir URI (we do this currently too).
  3. Transfer the file via ifdh, using the dropbox-hash map to generate the directory where the file should land (creating it if it doesn't already exist).
    1. Check whether the dropbox key appears as a value in the tardir-dropbox URI map. If it does, we know the tarfile created from the tardir is in the current working directory; find the tarfile there using the dropbox URI.
    2. If the dropbox key is not a value in the tardir-dropbox URI map, we know our source file is exactly where the user told us in the dropbox:// URI passed to jobsub. Use that.
  4. Check the error from the ifdh instance, handle it like detailed above.
  5. If we're good to go, instead of returning an exit code (since we really don't need it), we instead return the destination that the file has been transferred to.
  6. That destination is used as the value for the server argument --tar_file_name instead of the original URI. This way, the server can pass on to the job JDF and the wrapper script where exactly to look for the file.

Here are examples from using a tardir and using a dropbox:

[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit.py -G nova --subgroup testsubgroup --debug --jobsub-server=fermicloud035.fnal.gov  --tar_file_name=tardir:///home/sbhat/test_dir file:///home/sbhat/test_untar.sh -n
SERVER_ARGS:  ['--subgroup', 'testsubgroup', '--tar_file_name=tardir:///home/sbhat/test_dir', 'file:///home/sbhat/test_untar.sh', '-n']
CLIENT_ARGS:  {'dag': False, 'help': False, 'acctRole': None, 'jobid_output_only': False, 'debug': True, 'dropboxServer': None, 'acctGroup': 'nova', 'jobsubServer': 'fermicloud035.fnal.gov'}

...

File /home/sbhat/test_dir.tar uploaded to /pnfs/nova/scratch/jobsub_stage/8b65ceca677730ee0015c4d7cf903e22eda1727a/test_dir.tar

...

SERVER_ARGS_B64: --subgroup testsubgroup --tar_file_name /pnfs/nova/scratch/jobsub_stage/8b65ceca677730ee0015c4d7cf903e22eda1727a/test_dir.tar @/home/sbhat/test_untar.sh -n

checking /etc/grid-security/certificates
Using CA_DIR: /etc/grid-security/certificates
/fife/local/scratch/uploads/nova/sbhat/2018-01-25_222041.485660_1914

tar_file_name is  /pnfs/nova/scratch/jobsub_stage/8b65ceca677730ee0015c4d7cf903e22eda1727a/test_dir.tar

tar_file_basename set to test_dir.tar

submitting....

Submitting job(s).

1 job(s) submitted to cluster 821.

JobsubJobId of first job: 821.0@fermicloud035.fnal.gov

Use job id 821.0@fermicloud035.fnal.gov to retrieve output

JOBSUB SERVER CONTACTED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONDED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONSE CODE : 200 (Success)
JOBSUB SERVER SERVICED IN   : 1.03104496002 sec
JOBSUB CLIENT FQDN          : fermicloud362.fnal.gov
JOBSUB CLIENT SERVICED TIME : 25/Jan/2018 22:20:42
[sbhat@fermicloud362 ~]$ ./jobsub/client/jobsub_submit.py -G nova --subgroup testsubgroup --debug --jobsub-server=fermicloud035.fnal.gov  --tar_file_name=dropbox:///home/sbhat/test_dir.tar file:///home/sbhat/test_untar.sh -n
SERVER_ARGS:  ['--subgroup', 'testsubgroup', '--tar_file_name=dropbox:///home/sbhat/test_dir.tar', 'file:///home/sbhat/test_untar.sh', '-n']
CLIENT_ARGS:  {'dag': False, 'help': False, 'acctRole': None, 'jobid_output_only': False, 'debug': True, 'dropboxServer': None, 'acctGroup': 'nova', 'jobsubServer': 'fermicloud035.fnal.gov'}
...

File /home/sbhat/test_dir.tar uploaded to /pnfs/nova/scratch/jobsub_stage/8b65ceca677730ee0015c4d7cf903e22eda1727a/test_dir.tar

...

SERVER_ARGS_B64: --subgroup testsubgroup --tar_file_name /pnfs/nova/scratch/jobsub_stage/8b65ceca677730ee0015c4d7cf903e22eda1727a/test_dir.tar @/home/sbhat/test_untar.sh -n

tar_file_name is  /pnfs/nova/scratch/jobsub_stage/8b65ceca677730ee0015c4d7cf903e22eda1727a/test_dir.tar

tar_file_basename set to test_dir.tar

submitting....

Submitting job(s).

1 job(s) submitted to cluster 823.

JobsubJobId of first job: 823.0@fermicloud035.fnal.gov

Use job id 823.0@fermicloud035.fnal.gov to retrieve output

JOBSUB SERVER CONTACTED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONDED     : https://fermicloud035.fnal.gov:8443
JOBSUB SERVER RESPONSE CODE : 200 (Success)
JOBSUB SERVER SERVICED IN   : 1.00821590424 sec
JOBSUB CLIENT FQDN          : fermicloud362.fnal.gov
JOBSUB CLIENT SERVICED TIME : 25/Jan/2018 22:34:45

One note about this, though: if dCache is slow (as it was during these tests), jobsub can appear to take a very long time. Perhaps we should print status messages ("Transferring tarball to dropbox...", etc.), or maybe set a timeout for ifdh to follow.

The most recent changes have been checked into the redmine repo. I think that from the client side we're pretty much done, except for possible code cleanup or extra logging as mentioned above.

#23 Updated by Shreyas Bhat about 3 years ago

One extra note: perhaps the wrapper script could use the filename passed in the server arguments to decide whether or not to try to untar a file. Or try to untar it, and catch the exception if that doesn't work.

#24 Updated by Shreyas Bhat about 3 years ago

We're going to change the API URL to dropboxlocation. This will involve changes on the server side as well as the client side (we just need to make sure we curl the right URL).

#25 Updated by Shreyas Bhat about 3 years ago

This change has been made, and I tested with both tardir and dropbox URIs. I've committed and checked in the change on my feature branch.

#26 Updated by Shreyas Bhat about 3 years ago

Implemented API URL change

The API endpoint is now /acctgroups/<group>/dropboxlocation/. This involved changing the python module that implements that endpoint (which is now dropbox_location.py), the class that instantiates this (accounting_group.py), and changing the helper function in jobsub.py to query the new config option in jobsub.ini. The client also has been changed to query this new endpoint to get the location.

Implemented blocking mechanism per accounting group

After discussing with Dennis, we now also have a mechanism for the server to inform the jobsub client that it cannot use the dropbox. If we set dropbox_location = off in the ini file under any group, the server returns a 403 error at the endpoint listed above. To accomplish this, I implemented a reduced version of ConfigParser.getboolean that uses the same value dictionary as the standard library. When passed a section and an option, it runs the same config-items modification method that JobsubConfigParser.get runs, then mimics the logic of ConfigParser.getboolean to convert values such as "true", "1", "on", etc. to the Python bool True, and "false", "0", "off", etc. to False. We then use getboolean as a check in the jobsub.py helper function that is called from dropbox_location.py. I tested both the case where this feature is turned off and the case where it's on.
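
A minimal sketch of that reduced getboolean (the value dictionary below is the standard library's; the JobsubConfigParser plumbing is elided):

_boolean_states = {'1': True, 'yes': True, 'true': True, 'on': True,
                   '0': False, 'no': False, 'false': False, 'off': False}

def getboolean(self, section, option):
    # self.get runs the usual config-items modification method first
    value = str(self.get(section, option)).lower()
    if value not in _boolean_states:
        raise ValueError('Not a boolean: %s' % value)
    return _boolean_states[value]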

Things to do

  • Testing:
    • If I'm in dir a and tell jobsub to tar up dir b (somewhere else), do we end up tarring and transferring the right file? What about the dropbox URI?
    • Specify a random directory for an acct group in jobsub.ini. It should still work.
  • Make sure is_tardir (which I've disabled on the client end) doesn't matter anywhere in the server. It might still be an issue for the wrapper script (generated from ...../lib/groupsettings/JobSettings.py).

#27 Updated by Shreyas Bhat about 3 years ago

"If I'm in dir a, and tell jobsub to tar up dir b (somewhere else), do we end up tarring and transferring the right file? What about the dropbox uri?"
"Specify random directory for an acct group in jobsub.ini. It should still work"
Both work.

#28 Updated by Shreyas Bhat about 3 years ago

Other things to do:

Have IFDH pin files as they get copied up. Then we'll have a refresh script that does a condor_q -g, gets the tarballs in use from there, and pins them each time the script runs. We'll pin files for 2 days.

Issue: preliminary tests with ifdh pin haven't worked. Need to talk to Marc about that.

#29 Updated by Shreyas Bhat about 3 years ago

Made change to wrapper script generation (JobSettings.py) and tested. Change has been checked in.

#30 Updated by Shreyas Bhat about 3 years ago

Added a feature that lets us reset the clock on files needed by a job. Instead of doing ifdh pin (which is no longer supported), we took Marc Mengel's suggestion of copying the first 16 bytes of any file that's needed, so that its "last access" time gets updated. This is done using globus-url-copy. The caveat is that since IFDH is the only thing in the submission chain doing a voms-proxy-init, and it stores the proxy in a non-default location, I needed to set X509_USER_PROXY temporarily while the globus-url-copy operation was running.
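
A hedged sketch of that touch (hypothetical function; -len is globus-url-copy's partial-transfer length flag):

import os
import subprocess

def touch_last_access(gridftp_url, proxy_path):
    # Read just the first 16 bytes so dCache refreshes the file's
    # last-access time without doing a full transfer.
    env = dict(os.environ, X509_USER_PROXY=proxy_path)
    subprocess.check_call(
        ['globus-url-copy', '-len', '16', gridftp_url, 'file:///dev/null'],
        env=env)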

All changes are checked in.

The only thing left is the cleanup script (that will check the jobs for the tarballs they're using and reset their last access times similar to above), which Dennis is working on.

#31 Updated by Shreyas Bhat about 3 years ago

We are also now checking the tarball size.

#32 Updated by Shreyas Bhat about 3 years ago

  • % Done changed from 0 to 90

One last change was made: we now catch the errors if ifdh mkdir_p or ifdh cp fails and raise a JobSubClientError. We want job submission to fail if a bad directory is given, etc.
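
A sketch of the upload path with that error handling (hypothetical function name; assumes the ifdhc python bindings expose mkdir_p and cp like the CLI, and JobSubClientError from jobsubClient.py):

import os
import ifdh   # ifdhc python bindings

def stage_tarball(src, dest_dir):
    handle = ifdh.ifdh()
    dest = os.path.join(dest_dir, os.path.basename(src))
    try:
        handle.mkdir_p(dest_dir)
        handle.cp([src, dest])
    except Exception as err:
        if 'File exists' in str(err):
            return dest   # an identical tarball is already staged; reuse it
        raise JobSubClientError('upload to %s failed: %s' % (dest_dir, err))
    return dest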

#33 Updated by Dennis Box about 3 years ago

  • Status changed from Work in progress to Closed
