Bug #10433
a scaling problem
Description
Hello,
I noticed that art becomes slower and slower as I process more files
in a single job. The attached plot time-per-file.pdf illustrates the
problem. It shows the time stamp difference between consecutive
"Successfully opened file" printouts, in seconds, vs file number. The
job ran over 51 input file, so there are 50 data points. You can see
that the second file was opened about 300 seconds after the first one.
But it took art about an hour to switch from file 50 to file 51.
Note that the input files have the same size to about 1%. (Except for
the last one, which is slightly smaller.)
Job memory consumption grew up to ~6GB, but swapping was not a
problem. The process ran at 97% average CPU/wall clock time.
To reproduce, please run JobConfig/cd3/beam/beam_flt4detflash.fcl
from Mu2e Offline v5_4_7 over the files in
/mu2e/data/users/gandr/20151006a.cd3-beam-flt4detflash/mu2egridInDir/
Andrei
History
#1 Updated by Marc Paterno over 5 years ago
Andrei, do you have the result of running the TimeTracker on a job like this? If not, could you run again (even a shorter job, say 10 files) using the TimeTracker?
#2 Updated by Kyle Knoepfel over 5 years ago
- File Screen Shot 2015-10-09 at 1.35.47 PM.png Screen Shot 2015-10-09 at 1.35.47 PM.png added
- File Screen Shot 2015-10-09 at 1.35.13 PM.png Screen Shot 2015-10-09 at 1.35.13 PM.png added
Yes, it will be good to see what the TimeTracker
says--that will tell us if it's a problem inside the event calls. However, since I already had your particular version of Mu2e set up, I ran allinea
for an art
job with only five of the input files. The ParameterSet
importation via sqlite
is likely the culprit. See the attached files. The width of the green bars indicate the time spent on executing that particular call, which is clearly increasing each subsequent time sqlite_step
is called.
#3 Updated by Andrei Gaponenko over 5 years ago
Hi,
A TimeTracker printout from running on 10 files is below. A complete log file can be found at
/mu2e/data/users/gandr/20151009a.flt4-tt/timetest.log/mu2e/data/users/gandr/20151009a.flt4-tt/timetest.log
Andrei
======================================================================================================================================== TimeTracker printout (sec) Min Avg Max Median RMS nEvts ======================================================================================================================================== Full event 0.000103659 0.00106275 1.06409 0.00071814 0.00285319 2017753 ---------------------------------------------------------------------------------------------------------------------------------------- detPath:trackerHits:SelectStepPointsByTime 0 0.000133844 0.426743 8.05e-05 0.00126244 2017753 detPath:caloHits:SelectStepPointsByTime 0 9.44635e-05 1.06301 5.1293e-05 0.00129193 2017753 detPath:caloROHits:SelectStepPointsByTime 0 6.14807e-05 0.184917 4.0978e-05 0.000578495 2017753 detPath:detectorFilter:FilterG4Out 0 0.000331726 0.343128 0.000221227 0.00108373 2017753 detPath:TriggerResults:TriggerResultInserter 0 2.33566e-05 0.0242809 2.0888e-05 9.50824e-05 2017753 end_path:genCountLogger:GenEventCountReader 0 2.54201e-06 0.0160496 2.198e-06 2.9023e-05 2017753 end_path:hitTimesTrackerIn:ExtremeHitTimes 0 1.4978e-05 0.0209895 1.3434e-05 7.77562e-05 2017753 end_path:hitTimesTrackerOut:ExtremeHitTimes 0 1.08261e-05 0.023088 9.762e-06 6.58366e-05 2017753 end_path:hitTimesCaloIn:ExtremeHitTimes 0 2.28057e-05 0.0855061 2.0494e-05 0.000117435 2017753 end_path:hitTimesCaloOut:ExtremeHitTimes 0 1.78253e-05 0.018126 1.6059e-05 8.39066e-05 2017753 end_path:hitTimesTOCIn:ExtremeHitTimes 0 3.03768e-05 0.13114 2.7558e-05 0.000145848 2017753 end_path:hitTimesTOCOut:ExtremeHitTimes 0 2.451e-05 0.0190557 2.2266e-05 9.33055e-05 2017753 end_path:beamFlashOutput:RootOutput 0 6.85115e-05 0.506996 8.72e-06 0.00131588 2017753 detPath:compressPVDetector:CompressPhysicalVolumes 0 1.72222e-05 0.00855963 1.493e-05 6.99335e-05 55978 ========================================================================================================================================
#4 Updated by Christopher Green over 5 years ago
I think Kyle's Allinea MAP analysis provides the smoking gun, here. The slowdown due to the size of the existing database is caused by the index used to prevent duplicate entries and speed lookup based on ID. We could do something along the lines of maintaining a local ParameterSetID
-> row number unorderd_map
for lookup and collision detection and dropping the index. We'd still be looking up the ID in a "table", but it would be our "table". Whether this would actually provide the performance improvement desired would require measurement.
#5 Updated by Andrei Gaponenko over 5 years ago
Christopher Green wrote:
The slowdown due to the size of the existing database is caused by the index used to prevent duplicate entries
A user option to provide sorted events on input should address this. Even better, several options: (1) event numbers sorted within each subrun, (2) sorted events and subruns, (3) sorted events, subruns, and runs.
For Mu2e MC samples (1) is automatically satisfied, and we can do (2) and (3) when needed.
#6 Updated by Christopher Green over 5 years ago
This is the ParameterSet database and the index is of parameter set IDs. It has nothing to do with events, sorted or otherwise.
#7 Updated by Christopher Green over 5 years ago
- Category set to I/O
- Status changed from New to Resolved
- Assignee set to Christopher Green
- Target version set to 1.17.03
- SSI Package art added
- SSI Package deleted (
)
We believe that the circumstances in which you are seeing this problem will be mitigated by the solution to issue #10491. If this is not the case, please let us know immediately.
#8 Updated by Kyle Knoepfel about 5 years ago
- % Done changed from 0 to 100
#9 Updated by Kyle Knoepfel about 5 years ago
- Status changed from Resolved to Closed