Jobsub_submit_dag saves multiple copies of files in payload.tgz
Jobsub_submit_dag saves multiple identical copies of files in payload.tgz.
#2 Updated by Dennis Box about 1 month ago
- Assignee set to Shreyas Bhat
- Status changed from New to Feedback
- Subject changed from Jobsub_submit_dag saves multiple copies of files in payload.tgz to Jobsub_submit_dag fails to clean up payload.tgz when submission fails
Changing subject name of this ticket from 'Jobsub_submit_dag saves multiple copies of files in payload.tgz' to 'Jobsub_submit_dag fails to clean up payload.tgz when submission fails'.
Proposed change is in branch 22048. Shreyas, please review.
#3 Updated by Dennis Box about 1 month ago
- Assignee changed from Shreyas Bhat to Dennis Box
- Status changed from Feedback to Assigned
- Subject changed from Jobsub_submit_dag fails to clean up payload.tgz when submission fails to Jobsub_submit_dag saves multiple copies of files in payload.tgz
After playing around with jobsub_submit a bit, I was able to reproduce Herb's original issue: multiple copies of the same file in the tarball. Changing the issue title and status back to the original.
#4 Updated by Dennis Box about 1 month ago
- Assignee changed from Dennis Box to Shreyas Bhat
- Status changed from Assigned to Feedback
Reassigning to Shreyas for review. See branch 22408.
commit the first: addresses INC000001091220
commit the second: addresses Herb's issue of multiple copies of the same file in payload.tgz
#5 Updated by Shreyas Bhat about 1 month ago
This looks good, with two exceptions (that we discussed a bit offline).
1) I'm concerned that users will try to throw tons of files into the tarball, and so if we now have a list of contents that we have to iterate through for each file (see https://cdcvs.fnal.gov/redmine/projects/jobsub/repository/revisions/22408/entry/client/jobsubClient.py#L762 ), that might slow down the submission. Instead, perhaps we should have a dict of Nones, so line 752 would be:
contents = dict()
And line 765 would be:
contents[b] = None
That would speed up the lookup.
2) The key for these contents is the basename as defined by python. Ostensibly, if a user had the same filename for two files (mydir1/coolfile and mydir2/coolfile), python's os.path.basename would give them the same value, which could result in files being left out of contents:
>>> import os
>>> os.path.basename('/path/to/mydir1/coolfile')
'coolfile'
>>> os.path.basename('/path/to/mydir2/coolfile')
'coolfile'
Perhaps we should key on the full path (os.path.abspath)?
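To make both suggestions concrete, here is a rough sketch of what I have in mind. It is not the code in jobsubClient.py: the function name `dedupe_payload_files` and its `paths` argument are made up for illustration, and only the `contents` dict mirrors the names above. It keys on os.path.abspath and uses a dict so each membership check is a hash lookup rather than a scan of a list.

```python
import os


def dedupe_payload_files(paths):
    """Return the absolute paths from `paths` with duplicates dropped.

    Hypothetical helper for illustration only; not the actual
    jobsubClient.py implementation.
    """
    contents = dict()  # keys are absolute paths; the values are unused
    unique = []        # keeps first-seen order without relying on dict ordering
    for p in paths:
        # Key on the full path so that same-named files in different
        # directories are treated as distinct entries.
        key = os.path.abspath(p)
        # Dict membership is a hash lookup, so the cost per file does not
        # grow with the number of files already recorded.
        if key not in contents:
            contents[key] = None
            unique.append(key)
    return unique


if __name__ == "__main__":
    files = [
        "/path/to/mydir1/coolfile",
        "/path/to/mydir2/coolfile",
        "/path/to/mydir1/coolfile",  # exact repeat, should be dropped
    ]
    for f in dedupe_payload_files(files):
        print(f)
```

With this keying, mydir1/coolfile and mydir2/coolfile both survive, and only the exact repeat is dropped. How two same-named files should then be named inside payload.tgz is a separate question that this sketch does not address.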