Bug #22408

Jobsub_submit_dag saves multiple copies of files in payload.tgz

Added by Herbert Greenlee 9 months ago. Updated 14 days ago.

Status: Resolved
Priority: Normal
Assignee:
Category: -
Target version:
Start date: 04/18/2019
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Jobsub_submit_dag saves multiple identical copies of files in payload.tgz.

History

#1 Updated by Dennis Box 8 months ago

  • Target version set to v1.3.1

#2 Updated by Dennis Box about 1 month ago

  • Assignee set to Shreyas Bhat
  • Status changed from New to Feedback
  • Subject changed from Jobsub_submit_dag saves multiple copies of files in payload.tgz to Jobsub_submit_dag fails to clean up payload.tgz when submission fails

Changing subject name of this ticket from 'Jobsub_submit_dag saves multiple copies of files in payload.tgz' to 'Jobsub_submit_dag fails to clean up payload.tgz when submission fails'.

Proposed change is in branch 22408. Shreyas, please review.

#3 Updated by Dennis Box about 1 month ago

  • Assignee changed from Shreyas Bhat to Dennis Box
  • Status changed from Feedback to Assigned
  • Subject changed from Jobsub_submit_dag fails to clean up payload.tgz when submission fails to Jobsub_submit_dag saves multiple copies of files in payload.tgz

After playing around with jobsub_submit a bit, I was able to reproduce Herb's original issue: multiple copies of the same file in the tarball. Changing the issue title and status back to the original.

#4 Updated by Dennis Box about 1 month ago

  • Assignee changed from Dennis Box to Shreyas Bhat
  • Status changed from Assigned to Feedback

Reassigning to Shreyas for review. See branch 22408:
first commit: addresses INC000001091220
second commit: addresses Herb's issue of multiple copies of the same file in payload.tgz

#5 Updated by Shreyas Bhat about 1 month ago

This looks good, with two exceptions (that we discussed a bit offline).

1) I'm concerned that users will try to put very large numbers of files into the tarball. If contents is a list that we have to iterate through for each file (see https://cdcvs.fnal.gov/redmine/projects/jobsub/repository/revisions/22408/entry/client/jobsubClient.py#L762 ), that might slow down the submission. Instead, perhaps we should have a dict of Nones, so line 752 would be:

contents = dict()

And line 765 would be:

contents[b] = None

That would speed up the lookup.
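
To make the suggestion concrete, here is a minimal standalone sketch of the dict-based membership check. It is not the actual jobsubClient.py code: the function name unique_by_basename and the sample paths are made up for illustration, and only the names contents and b follow the lines quoted above.

```python
import os


def unique_by_basename(paths):
    """Hypothetical helper: keep the first path seen for each basename."""
    contents = dict()  # basename -> None; used only for membership tests
    kept = []
    for p in paths:
        b = os.path.basename(p)
        if b in contents:  # hash lookup instead of scanning a list
            continue
        contents[b] = None
        kept.append(p)
    return kept


# Illustrative input: the same file is listed twice.
files = ["a/x.txt", "b/y.txt", "a/x.txt"]
result = unique_by_basename(files)
```

A set would serve the same purpose here; the dict of Nones is used only to match the wording above.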

2) The key for these contents is the basename as defined by Python. If a user had two files with the same filename in different directories (mydir1/coolfile and mydir2/coolfile), Python's os.path.basename would give them the same value, which could result in files being left out of contents:

>>> import os
>>> os.path.basename('/path/to/mydir1/coolfile')
'coolfile'
>>> os.path.basename('/path/to/mydir2/coolfile')
'coolfile'

Perhaps we should key on the full path (os.path.abspath)?
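
A rough sketch of what keying on the full path could look like, again as a standalone illustration rather than the real client code (unique_by_abspath and the sample paths are hypothetical):

```python
import os


def unique_by_abspath(paths):
    """Hypothetical helper: keep the first path seen for each absolute path."""
    seen = dict()  # absolute path -> None
    kept = []
    for p in paths:
        key = os.path.abspath(p)  # normalizes relative forms such as ./dir/file
        if key in seen:
            continue
        seen[key] = None
        kept.append(p)
    return kept


# Two different files share a basename; the third entry repeats the first.
files = ["mydir1/coolfile", "mydir2/coolfile", "./mydir1/coolfile"]
kept = unique_by_abspath(files)
```

This only changes the key used to detect duplicates. How two distinct files with the same basename should be named inside the tarball is a separate question.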

#6 Updated by Shreyas Bhat about 1 month ago

  • Assignee changed from Shreyas Bhat to Dennis Box
  • Status changed from Feedback to Under Discussion

#7 Updated by Dennis Box 14 days ago

  • Status changed from Under Discussion to Resolved

merged to master


