Problem reading some output files produced using art v3_04_00
After Mu2e upgraded to art v3_04_00 I ran a sequence of simulation jobs. All of the jobs succeeded with art status=0, but about half of the .art output files produced are unreadable by art. All the .art files are readable by interactive root, and I can see no difference in structure or (qualitative) content between the files that art can read and those it can't. Jobs running on the .art output files that can be read proceed normally and produce sensible output. Running one of the jobs that produced unreadable .art output interactively produced a valid, readable output .art file.
I note one possible issue: Mu2e Offline recently upgraded to python3, but our grid submission scripts still rely on python2, so the job submission was done using a different setup than the actual jobs. The jobs were run using a code tarball based on python3, compiled from the head of Mu2e Offline master branch.
An example output file that isn't readable by art is:
while an output file that is readable from the same cluster is:
The fcl for the jobs that produced readable/unreadable output can be found in:
An example of the error message produced by art when trying to open the unreadable file is below:
23-Dec-2019 10:00:52 CST Initiating request to open input file "/pnfs/mu2e/scratch/users/brownd/workflow/default/outstage/27049748/00/00001/dig.brownd.CeEndpoint-mix.MDC2020.001002_00000001.art"
23-Dec-2019 10:00:54 CST Opened input file "/pnfs/mu2e/scratch/users/brownd/workflow/default/outstage/27049748/00/00001/dig.brownd.CeEndpoint-mix.MDC2020.001002_00000001.art"
%MSG-s ArtException: FileDumperOutput:dumper@Construction 23-Dec-2019 10:00:54 CST ModuleConstruction
cet::exception caught in art
---- FileReadError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TBufferFile::ReadClassBuffer
    Could not find the StreamerInfo for version 2 of the class art::ProcessHistory, object skipped at offset 108
  ---- FatalRootError END
---- FileReadError END
%MSG
Art has completed and will exit with status 21.
#1 Updated by David Brown 8 months ago
I reran the same jobs, and exactly the same subset produced unreadable files as before, which shows the problem isn't due to environment differences on the (random) grid node the job lands on. I also ran valgrind on one of the jobs that produces invalid output on the grid; there are invalid reads inside Root and G4, but nothing in Mu2e code.
#4 Updated by David Brown 7 months ago
Additional info: I submitted batch jobs to process the readable output subset from these jobs, but those fail with an art exception (below). The same jobs succeed when run interactively on mu2ebuild01. I suspect this exception and the original unreadable file problem are different manifestations of the same issue.
One possible explanation is that these problems are caused by submitting jobs with an older version of art (v3_03_01) than the one the executable runs with on the grid node (v3_04_00). That mismatch is currently unavoidable, as the mu2e grid submission tools still require python2. The setup script executed on the worker node should insulate us from the submission version, but maybe some environment is bleeding through?
%MSG-s ArtException: PostEndJob 03-Jan-2020 20:58:17 UTC ModuleEndJob
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- ScheduleExecutionFailure BEGIN
    Path: ProcessingStopped.
    ---- FatalRootError BEGIN
      Fatal Root Error: @SUB=TDataMember::GetUnitSize
      Can not determine sizeof(array<art::Ptr<mu2e::StrawGasStep>,2>)
      The above exception was thrown while processing module SelectRecoMC/SelectRecoMC run: 1002 subRun: 0 event: 3
    ---- FatalRootError END
    Exception going through path RecoPath
  ---- ScheduleExecutionFailure END
---- EventProcessorFailure END
%MSG
Art has completed and will exit with status 1.
#5 Updated by Kyle Knoepfel 7 months ago
- Status changed from Assigned to Feedback
Dave, using two different versions of art definitely sounds suspect, and ensuring complete insulation of shells is often difficult. Based on Rob's comment offline, it sounds like a Python-3 compatible version of jobsub is now available. However, even if that is not the case, you can set up a Python-2 qualified version of art 3.04 using the 'py2' qualifier:
setup art v3_04_00 -q +e19:+prof:+py2
I recommend building and submitting the jobs with the same version of art, and then let us know if you're still unable to read the files.
#7 Updated by David Brown 7 months ago
This problem is now understood to come from the error autoloading schema noted by Kyle. Mu2e will address this by providing include files so root doesn't have to autoload. The only possible art issue remaining is why art continued to run and gave return code 'success' after the 'fatal error' autoloading the schema. Disabling autoload in root (if possible) would also be a sensible improvement.
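Until the return-code question is settled, one workaround is to scan the logs of jobs that reported success for fatal ROOT messages. The following is only a minimal sketch: the function names are hypothetical, and the marker strings are taken from the art messages quoted in this thread, on the assumption (not verified here) that the producing jobs log similar text when autoloading fails.

```python
import re

# Marker strings copied from the art messages quoted in this thread.
# Assumption: the jobs that write bad files log something similar.
FATAL_PATTERNS = (
    r"Fatal Root Error",
    r"Could not find the StreamerInfo",
    r"Can not determine sizeof",
)


def find_fatal_lines(log_text, patterns=FATAL_PATTERNS):
    """Return (line_number, stripped_line) pairs for lines matching any pattern."""
    regex = re.compile("|".join(patterns))
    return [
        (number, line.strip())
        for number, line in enumerate(log_text.splitlines(), start=1)
        if regex.search(line)
    ]


def suspicious_success(exit_status, log_text):
    """True if the job exited with status 0 but its log holds a fatal marker."""
    return exit_status == 0 and bool(find_fatal_lines(log_text))
```

Calling suspicious_success(0, log_text) on each job log before its output is accepted would flag files worth re-checking with art.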
#8 Updated by Raymond Culbertson 7 months ago
Years ago when we first started having autoloader errors, Philippe told us they were harmless. I also wonder why we have seen no real problem until now, and why it fails only sometimes. So there is something more complex going on than just the ROOT_INCLUDE_PATH being wrong or incomplete. I hope we can drill down to understand it.
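To rule out the simple case first, the worker-node environment can be checked for ROOT_INCLUDE_PATH entries that do not exist on that node. This is a sketch with a hypothetical helper name. It tests only that each listed directory exists, not that the headers ROOT needs are actually inside it.

```python
import os


def missing_include_dirs(path_value):
    """Return the entries of a path-list string that are not existing directories."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return [entry for entry in entries if not os.path.isdir(entry)]


if __name__ == "__main__":
    # Report every ROOT_INCLUDE_PATH entry that is absent on this node.
    for entry in missing_include_dirs(os.environ.get("ROOT_INCLUDE_PATH", "")):
        print("missing:", entry)
```

Running this both interactively and inside a grid job, then comparing the two outputs, would show whether the environments differ in this respect.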