Attacking parameter set bloat
We have long surmised that the Mu2e last-stage jobs take 30 minutes to open the art input file because of massive bloat in the parameter set registry (and therefore in the RootFileDB inside the art event-data disk file).
I can show that, for typical mixed files, the size of the disk file is 2.2 GB and the on-disk size of the RootFileDB is 1.2 GB! The registry contains over 1,000,000 parameter sets. So we have two problems: a huge hit on disk space, and it takes forever to open the input files.
Andrei has long believed that the bloat in the ParameterSetRegistry is caused by three variables:
I bet he is right. When we run a grid process we set
Our script that scrubs the grid output for failed jobs also looks for duplicate base seeds. This algorithm effectively guarantees unique event IDs (until the grid starts recycling cluster numbers).
I have done some homework that will allow us to test the above hypotheses. A quick look suggests that we can easily confirm them, but doing a proper test is beyond my regexp ability.
On mu2egpvm*, in the directory
There are 4 files:
ls -s *.txt
These files contain a text image of the contents of the ParameterSetRegistry found in 4 of Andrei's files.
Each line in each file is one parameter set, in the one-liner string representation.
If someone has the regexp ability please try the following on all 4 files. Try it on the small file first and
when it works, try it on the bigger files:
substitute "firstRun:* " to "firstRun: " everywhere in the file
What I mean is simply erase the value of the parameter.
same for the other two variables
sort -u modified_file | wc
I hope that dataMerged.txt will drop from over 1,000,000 lines to as few as a few hundred.
If this proves to be true we can discuss a strategy for fixing these files: I think this involves compressing the ParameterSets, rebuilding the registry and then remapping all of the parameterset Ids in all of the Provenance objects.
And it is time to revisit the question of controlled and uncontrolled information in configurations. I don't know what the right answer is, but the present solution is not viable.
#1 Updated by Andrei Gaponenko over 6 years ago
Are you looking for something like this?
cd /mu2e/data/users/kutschke/ParameterSetBloat
for i in beamFlashOnMothers.txt detectorBeamFlash.txt detectorBeamFlash_f3.txt dataMerged.txt; do echo "$i orig: $(sort -u $i|wc -l) tweaked: $(sed -e 's/baseSeed:[^ ]*//g' -e 's/firstRun:[^ ]*//g' -e 's/firstRun:[^ ]*//g' beamFlashOnMothers.txt |sort -u |wc -l)"; done
beamFlashOnMothers.txt orig: 594 tweaked: 295
detectorBeamFlash.txt orig: 212270 tweaked: 295
detectorBeamFlash_f3.txt orig: 212292 tweaked: 295
dataMerged.txt orig: 1059010 tweaked: 295
#2 Updated by Rob Kutschke over 6 years ago
Thanks Andrei - that's the regexp I was looking for.
I fixed two typos and here are my results.
beamFlashOnMothers.txt orig: 594 tweaked: 97 size: 62 kB
detectorBeamFlash.txt orig: 212270 tweaked: 1782 size: 13 MB
detectorBeamFlash_f3.txt orig: 212292 tweaked: 1803 size: 13 MB
dataMerged.txt orig: 1059010 tweaked: 8475 size: 64 MB
where I added the size by hand, using ls -s
The typos were:
- repeated the removal of firstRun and did not remove firstSubRun
- all 4 runs were for beamFlashOnMothers.txt - the other 3 files were not done
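For the record, here is a sketch of what the loop presumably looks like with those two typos fixed. It is reconstructed from the description above, not copied from the command that was actually run. The function name count_unique_psets is made up for illustration. The sed expressions are the ones from update #1 and assume that a parameter value never contains a space.

```shell
#!/bin/sh
# Sketch only: count distinct parameter sets in each file, before and
# after erasing the three per-job parameters.
# Assumptions: one parameter set per line; values contain no spaces.
# count_unique_psets is a hypothetical helper name.
count_unique_psets() {
    for f in "$@"; do
        # Distinct lines in the file as-is (tr strips padding some wc builds add).
        orig=$(sort -u "$f" | wc -l | tr -d ' ')
        # Distinct lines once baseSeed, firstRun and firstSubRun are erased.
        # Note that "$f" is used here, not a hard-coded file name.
        tweaked=$(sed -e 's/baseSeed:[^ ]*//g' \
                      -e 's/firstRun:[^ ]*//g' \
                      -e 's/firstSubRun:[^ ]*//g' "$f" \
                  | sort -u | wc -l | tr -d ' ')
        echo "$f orig: $orig tweaked: $tweaked"
    done
}

# Run on any files given on the command line; does nothing without arguments.
count_unique_psets "$@"
```

Usage would be along the lines of: sh count_psets.sh beamFlashOnMothers.txt detectorBeamFlash.txt detectorBeamFlash_f3.txt dataMerged.txt (the script name is arbitrary).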
Just for kicks I also removed the source.fileNames parameter and it really strips things down. I am not proposing that we do this but I wanted to document where the space is going.
beamFlashOnMothers.txt orig: 96 size: 39 kB
detectorBeamFlash.txt orig: 120 size: 51 kB
detectorBeamFlash_f3.txt orig: 140 size: 60 kB
dataMerged.txt orig: 152 size: 62 kB
I think that we have the data we need to discuss a solution - both a tool to fix existing files and a long-term solution for making new files.
#3 Updated by Christopher Green over 6 years ago
- Category set to Metadata
- Status changed from New to Accepted
- Priority changed from Normal to High
- Target version set to 1.09.03
- Experiment Mu2e added
We believe we have a scheme that needs to be analyzed and fleshed out before we can give a time estimate.
#8 Updated by Christopher Green almost 6 years ago
- Status changed from Accepted to Resolved
- Assignee set to Christopher Green
- % Done changed from 0 to 100
We believe we have implemented a set of measures which should mitigate the size and parsing time of files with large numbers of saved parameters. Please let us know your experiences. Please note, however, that although the read-in time should be significantly less, the write-out time might increase somewhat.