Bug #5805

Attacking parameter set bloat

Added by Rob Kutschke over 6 years ago. Updated almost 6 years ago.

Status: Closed
Priority: High
Category: Metadata
Target version:
Start date: 04/01/2014
Due date:
% Done: 100%
Estimated time:
Spent time:
Occurs In:
Scope: Internal
Experiment: Mu2e
SSI Package: FHiCL
Duration:

Description

We have long surmised that the Mu2e last-stage jobs take 30 minutes to open the art input file because of massive bloat in the parameter set registry (and therefore in the RootFileDB inside the art event-data disk file).

I can show that, for typical mixed files, the size of the disk file is 2.2 GB and that the on-disk size of the RootFileDB is 1.2 GB! The registry contains over 1,000,000 parameter sets. So we have two problems: a huge hit on disk space, and it takes forever to open the input files.

Andrei has long believed that the bloat in the ParameterSetRegistry is caused by three variables:

firstRun
firstSubRun
baseSeed

I bet he is right. When we run a grid process we set

firstRun:grid_cluster_number
firstSubRun:grid_process_number
baseSeed:random_number

Our script that scrubs the grid output for failed jobs also looks for duplicate base seeds. This algorithm effectively guarantees unique event IDs (until the grid starts recycling cluster numbers).

I have done some homework that will allow us to test the above hypothesis. A quick look suggests that we can easily confirm it, but doing a proper test is beyond my regexp ability.

On mu2egpvm*, in the directory
/mu2e/data/users/kutschke/ParameterSetBloat

There are 4 files:
ls -s *.txt
512 beamFlashOnMothers.txt
985056 dataMerged.txt
197408 detectorBeamFlash_f3.txt
197408 detectorBeamFlash.txt

These files contain a text image of the contents of the ParameterSetRegistry found in 4 of Andrei's files.
Each line in each file is one parameter set, in the one-liner string representation.

If someone has the regexp ability please try the following on all 4 files. Try it on the small file first and
when it works, try it on the bigger files:

1)
Replace "firstRun:* " with "firstRun: " everywhere in the file.
That is, simply erase the value of the parameter.
Do the same for the other two variables.

2)
wc original_file
sort -u modified_file | wc

I hope that dataMerged.txt will drop from over 1,000,000 lines to as few as a few hundred.
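
As a minimal sketch of steps 1) and 2), assuming every parameter value is a single token with no embedded spaces: the function name strip_job_params and the three sample lines below are made up for illustration and are not taken from the real registry dumps.

```shell
#!/bin/sh
# Blank the values of the three suspect parameters on every input line.
# Assumption: each value is a single token with no embedded spaces.
strip_job_params() {
    sed -e 's/firstRun:[^ ]*/firstRun:/g' \
        -e 's/firstSubRun:[^ ]*/firstSubRun:/g' \
        -e 's/baseSeed:[^ ]*/baseSeed:/g'
}

# Synthetic stand-in for a registry dump (hypothetical, not real Mu2e data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
firstRun:101 firstSubRun:1 baseSeed:12345 module_type:Demo
firstRun:101 firstSubRun:2 baseSeed:67890 module_type:Demo
firstRun:102 firstSubRun:1 baseSeed:24680 module_type:Demo
EOF

# Step 2: count distinct lines before and after the substitution.
orig=$(sort -u "$sample" | wc -l | tr -d ' ')
tweaked=$(strip_job_params < "$sample" | sort -u | wc -l | tr -d ' ')
echo "orig: $orig  tweaked: $tweaked"
rm -f "$sample"
```

On this toy input the three lines differ only in the per-job values, so they collapse to a single distinct line once those values are blanked.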

If this proves to be true we can discuss a strategy for fixing these files: I think this involves compressing the ParameterSets, rebuilding the registry and then remapping all of the parameterset Ids in all of the Provenance objects.

And it is time to revisit the question about controlled and uncontrolled information in configurations. I don't know what the right answer is, but the present solution is not viable.


Related issues

Related to art - Feature #5217: File opening slow due to parsing of saved configuration. (Closed, 01/17/2014)

Associated revisions

Revision b7cf3c2e (diff)
Added by Christopher Green about 6 years ago

Implement use of fhicl-cpp improvements for issue #5805.

History

#1 Updated by Andrei Gaponenko over 6 years ago

Hi Rob,

Are you looking for something like this?

cd /mu2e/data/users/kutschke/ParameterSetBloat

for i in beamFlashOnMothers.txt detectorBeamFlash.txt detectorBeamFlash_f3.txt dataMerged.txt; do echo "$i orig: $(sort -u $i|wc -l)  tweaked: $(sed -e 's/baseSeed:[^ ]*//g' -e 's/firstRun:[^ ]*//g' -e 's/firstRun:[^ ]*//g' beamFlashOnMothers.txt |sort -u |wc -l)"; done

beamFlashOnMothers.txt orig: 594  tweaked: 295
detectorBeamFlash.txt orig: 212270  tweaked: 295
detectorBeamFlash_f3.txt orig: 212292  tweaked: 295
dataMerged.txt orig: 1059010  tweaked: 295

Andrei

#2 Updated by Rob Kutschke over 6 years ago

Thanks Andrei - that's the regexp I was looking for.

I fixed two typos, and here are my results.

beamFlashOnMothers.txt orig: 594 tweaked: 97 size: 62 kB
detectorBeamFlash.txt orig: 212270 tweaked: 1782 size: 13 MB
detectorBeamFlash_f3.txt orig: 212292 tweaked: 1803 size: 13 MB
dataMerged.txt orig: 1059010 tweaked: 8475 size: 64 MB

where I added the size by hand, using ls -s

The typos were:
- repeated the removal of firstRun and did not remove firstSubRun
- all 4 runs were for beamFlashOnMothers.txt - the other 3 files were not done
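
The exact corrected command is not recorded in this ticket. A sketch of what it might look like, with both typos addressed, is below; the helper name count_unique is made up for illustration, and the patterns again assume values without embedded spaces.

```shell
#!/bin/sh
# Count distinct lines in one file before and after removing the three
# per-job parameters (baseSeed, firstRun, firstSubRun).
count_unique() {
    file=$1
    orig=$(sort -u "$file" | wc -l | tr -d ' ')
    tweaked=$(sed -e 's/baseSeed:[^ ]*//g' \
                  -e 's/firstRun:[^ ]*//g' \
                  -e 's/firstSubRun:[^ ]*//g' "$file" |
              sort -u | wc -l | tr -d ' ')
    echo "$file orig: $orig  tweaked: $tweaked"
}

# sed now reads the loop variable, so each file is processed in turn.
# Files that are not present in the current directory are skipped.
for i in beamFlashOnMothers.txt detectorBeamFlash.txt \
         detectorBeamFlash_f3.txt dataMerged.txt; do
    if [ -r "$i" ]; then count_unique "$i"; fi
done
```

Run from /mu2e/data/users/kutschke/ParameterSetBloat, this prints one "orig / tweaked" line per file in the same layout as the tables in this thread.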

Just for kicks I also removed the source.fileNames parameter and it really strips things down. I am not proposing that we do this but I wanted to document where the space is going.

beamFlashOnMothers.txt orig: 96 size: 39 kB
detectorBeamFlash.txt orig: 120 size: 51 kB
detectorBeamFlash_f3.txt orig: 140 size: 60 kB
dataMerged.txt orig: 152 size: 62 kB
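
For reference, one way the source.fileNames removal could be expressed is sketched below. It assumes the one-liner representation writes the parameter as fileNames:[...] with no nested brackets inside the list; that format is an assumption, not something confirmed in this ticket, and strip_file_names is a made-up name.

```shell
#!/bin/sh
# Remove the fileNames sequence from each parameter set read on stdin.
# Assumption: the value is a bracketed list containing no nested "]".
strip_file_names() {
    sed -e 's/fileNames:\[[^]]*]//g'
}

# Hypothetical one-line parameter set used only to show the effect.
printf 'source:{fileNames:["a.art","b.art"] maxEvents:-1}\n' | strip_file_names
```

This filter can be chained after the removal of the other three parameters and before sort -u.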

I think that we have the data we need to discuss a solution - both a tool to fix existing files and a long-term solution for making new files.

#3 Updated by Christopher Green over 6 years ago

  • Category set to Metadata
  • Status changed from New to Accepted
  • Priority changed from Normal to High
  • Target version set to 1.09.03
  • Experiment Mu2e added
  • Experiment deleted (-)
  • SSI Package - added
  • SSI Package deleted ()

We believe we have a scheme that needs to be analyzed and fleshed out before we can give a time estimate.

#4 Updated by Christopher Green over 6 years ago

  • Target version changed from 1.09.03 to 1.10.00

#5 Updated by Christopher Green over 6 years ago

  • Target version changed from 1.10.00 to 1.14.00

#6 Updated by Christopher Green over 6 years ago

  • Target version changed from 1.14.00 to 1.11.00

#7 Updated by Christopher Green almost 6 years ago

  • SSI Package FHiCL added
  • SSI Package deleted (-)

#8 Updated by Christopher Green almost 6 years ago

  • Status changed from Accepted to Resolved
  • Assignee set to Christopher Green
  • % Done changed from 0 to 100

We believe we have implemented a set of measures that should reduce the size and parsing time of files with large numbers of saved parameters. Please let us know your experiences. Please note, however, that although the read-in time should be significantly less, the write-out time might increase somewhat.

#9 Updated by Christopher Green almost 6 years ago

  • Status changed from Resolved to Closed

