Bug #23797

Problem reading some output files produced using art v3_04_00

Added by David Brown about 2 months ago. Updated 9 days ago.

Status: Closed
Priority: Normal
Assignee: Kyle Knoepfel
Target version: 1.02.01
Start date: 12/23/2019
Due date:
% Done: 100%
Estimated time: 4.00 h
Spent time:
Scope: Internal
Experiment: Mu2e
Duration:
Description

After Mu2e upgraded to art v3_04_00 I ran a sequence of simulation jobs. All of the jobs succeeded with art status=0, but about half of the .art output files produced are unreadable by art. All of the .art files are readable by interactive ROOT, and I can see no difference in structure or (qualitative) content between the files that art can read and the ones it cannot. Jobs running on the readable .art output files proceed normally and produce sensible output. Running one of the jobs that produced unreadable .art output interactively produced a valid, readable output .art file.

I note one possible issue: Mu2e Offline recently upgraded to python3, but our grid submission scripts still rely on python2, so the job submission was done using a different setup than the one used by the actual jobs. The jobs were run using a code tarball based on python3, compiled from the head of the Mu2e Offline master branch.

An example output file that isn't readable by art is:
/pnfs/mu2e/scratch/users/brownd/workflow/default/outstage/27049748/00/00001/dig.brownd.CeEndpoint-mix.MDC2020.001002_00000001.art
while an output file that is readable from the same cluster is:
/pnfs/mu2e/scratch/users/brownd/workflow/default/outstage/27049748/00/00000/dig.brownd.CeEndpoint-mix.MDC2020.001002_00000000.art

The fcl for the jobs that produced readable/unreadable output can be found in:
/pnfs/mu2e/scratch/users/brownd/workflow/fcl/CeEndpointMix.tgz

An example of the error message produced by art when trying to open the unreadable file is below:

23-Dec-2019 10:00:52 CST  Initiating request to open input file "/pnfs/mu2e/scratch/users/brownd/workflow/default/outstage/27049748/00/00001/dig.brownd.CeEndpoint-mix.MDC2020.001002_00000001.art" 
23-Dec-2019 10:00:54 CST  Opened input file "/pnfs/mu2e/scratch/users/brownd/workflow/default/outstage/27049748/00/00001/dig.brownd.CeEndpoint-mix.MDC2020.001002_00000001.art" 
%MSG-s ArtException:  FileDumperOutput:dumper@Construction  23-Dec-2019 10:00:54 CST ModuleConstruction
cet::exception caught in art
---- FileReadError BEGIN
  ---- FatalRootError BEGIN
    Fatal Root Error: @SUB=TBufferFile::ReadClassBuffer
    Could not find the StreamerInfo for version 2 of the class art::ProcessHistory, object skipped at offset 108
  ---- FatalRootError END
---- FileReadError END
%MSG
Art has completed and will exit with status 21.
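
A quick way to compare a readable file with an unreadable one is to dump the StreamerInfo records that are actually stored in each file. The macro below is only an illustrative sketch (its name and output format are made up here; it is not part of art or Mu2e Offline):

// check_streamerinfo.C -- illustrative sketch only, not part of art or Mu2e Offline.
// Dumps the StreamerInfo records stored in a ROOT/art file so that a readable and
// an unreadable file can be compared, e.g. whether art::ProcessHistory is present.
#include "TFile.h"
#include "TList.h"
#include "TStreamerInfo.h"
#include <cstdio>

void check_streamerinfo(const char* filename)
{
  TFile* f = TFile::Open(filename, "READ");
  if (!f || f->IsZombie()) {
    printf("Could not open %s\n", filename);
    return;
  }
  // GetStreamerInfoList() returns the StreamerInfo records written into this file.
  TList* infos = f->GetStreamerInfoList();
  if (infos) {
    TIter next(infos);
    while (TObject* obj = next()) {
      if (auto* si = dynamic_cast<TStreamerInfo*>(obj))
        printf("%-50s version %d\n", si->GetName(), si->GetClassVersion());
    }
    delete infos;
  }
  f->Close();
}

Running it over the readable and unreadable files listed above, e.g. root -l -b -q 'check_streamerinfo.C("<file>")', would show directly whether the art::ProcessHistory StreamerInfo is missing from the unreadable file, as the error message suggests.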

History

#1 Updated by David Brown about 2 months ago

I reran the same jobs, and the exact same subset produced unreadable files as before, which proves the problem isn't due to environment differences on the (random) grid node a job lands on. I also ran valgrind on one of the jobs that produces invalid output on the grid; there were invalid reads inside ROOT and G4 but nothing in Mu2e code.

#2 Updated by Kyle Knoepfel about 2 months ago

  • Description updated (diff)

#3 Updated by Kyle Knoepfel about 2 months ago

  • Assignee set to Kyle Knoepfel
  • Status changed from New to Assigned

Dave, do you have the log file corresponding to the job that produced the corrupt art/ROOT file?

#4 Updated by David Brown about 2 months ago

Additional info: I submitted batch jobs to process the readable output subset from these jobs, but those fail with an art exception (below). The same jobs succeed when run interactively on mu2ebuild01. I suspect this exception and the original unreadable-file problem are different manifestations of the same issue.

One possible explanation is that these problems are caused by submitting jobs using an older version of art (v3_03_01) than the executable run on the grid node (v3_04_00). That is currently required because the Mu2e grid submission tools still require python2. The setup script executed on the worker node should insulate us from the submission version, but maybe some of the environment is bleeding through?

%MSG-s ArtException:  PostEndJob 03-Jan-2020 20:58:17 UTC ModuleEndJob
---- EventProcessorFailure BEGIN
  EventProcessor: an exception occurred during current event processing
  ---- ScheduleExecutionFailure BEGIN
    Path: ProcessingStopped.
    ---- FatalRootError BEGIN
      Fatal Root Error: @SUB=TDataMember::GetUnitSize
      Can not determine sizeof(array<art::Ptr<mu2e::StrawGasStep>,2>)
      The above exception was thrown while processing module SelectRecoMC/SelectRecoMC run: 1002 subRun: 0 event: 3
    ---- FatalRootError END
    Exception going through path RecoPath
  ---- ScheduleExecutionFailure END
---- EventProcessorFailure END
%MSG
Art has completed and will exit with status 1.

#5 Updated by Kyle Knoepfel about 1 month ago

  • Status changed from Assigned to Feedback

Dave, using two different versions of art definitely sounds suspect, and ensuring complete insulation between shells is often difficult. Based on Rob's comment offline, it sounds like a Python 3-compatible version of jobsub is now available. However, even if that is not the case, you can set up a Python 2-qualified version of art 3.04 using the 'py2' qualifier:

setup art v3_04_00 -q +e19:+prof:+py2

I recommend building and submitting the jobs with the same version of art; then let us know if you're still unable to read the files.

#6 Updated by Kyle Knoepfel about 1 month ago

My previous comment (#23797-5) should have said "e19" instead of "e17". Comment has been corrected.

#7 Updated by David Brown about 1 month ago

This problem is now understood to come from the schema-autoloading error noted by Kyle. Mu2e will address this by providing include files so that ROOT doesn't have to autoload. The only possible art issue remaining is why art continued to run and returned a 'success' status code after the 'fatal error' while autoloading the schema. Disabling autoload in ROOT (if possible) would also be a sensible improvement.
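
For reference, the way art-based experiments usually avoid read-time autoparsing for a type such as std::array<art::Ptr<mu2e::StrawGasStep>,2> is to build a compiled dictionary for it via the classes.h/classes_def.xml pair. A minimal sketch of the classes.h side (the mu2e::StrawGasStep header path is assumed here, not copied from Mu2e Offline):

// classes.h -- sketch only; the StrawGasStep header path is assumed.
// Listing these headers, together with a selection entry in classes_def.xml,
// lets genreflex generate the dictionary at build time, so ROOT does not need
// to autoload/autoparse headers when the file is read back.
#include "canvas/Persistency/Common/Ptr.h"
#include "MCDataProducts/inc/StrawGasStep.hh"  // assumed location of mu2e::StrawGasStep
#include <array>

The corresponding classes_def.xml would then select art::Ptr<mu2e::StrawGasStep> and std::array<art::Ptr<mu2e::StrawGasStep>,2>, so the instantiation named in the TDataMember::GetUnitSize error above has a compiled dictionary.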

#8 Updated by Raymond Culbertson about 1 month ago

Years ago, when we first started having autoloader errors, Philippe told us they were harmless. I also wonder why we have seen no real problem until now, and why it fails only sometimes. So there is something more complex going on than just ROOT_INCLUDE_PATH being wrong or incomplete. I hope we can drill down and understand it.

#9 Updated by Kyle Knoepfel about 1 month ago

  • Estimated time set to 4.00 h
  • Status changed from Feedback to Assigned

We will improve the error-handling for this type of situation.
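
For context only (a sketch of the general ROOT mechanism, not the actual art implementation): ROOT lets a client install its own message handler via SetErrorHandler, and that handler is the natural place to escalate a fatal ROOT message into an exception rather than letting the job continue and exit with status 0.

// Sketch only; not the code that went into art.
// Installs a ROOT message handler that turns kError/kFatal messages into C++
// exceptions and delegates everything milder to ROOT's default handler.
#include "TError.h"
#include <stdexcept>
#include <string>

namespace {
  void throwing_root_handler(int level, Bool_t abort, const char* location, const char* msg)
  {
    if (level >= kError) {
      throw std::runtime_error(std::string("Fatal ROOT error in ") + location + ": " + msg);
    }
    DefaultErrorHandler(level, abort, location, msg);
  }
}

void install_root_error_handler()
{
  // SetErrorHandler returns the previous handler in case it needs restoring.
  SetErrorHandler(throwing_root_handler);
}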

#10 Updated by Kyle Knoepfel about 1 month ago

  • % Done changed from 0 to 100
  • Status changed from Assigned to Resolved

The ROOT custom error-handling system that art uses was reimplemented with commits:

#11 Updated by Kyle Knoepfel 9 days ago

  • Target version set to 1.02.01

#12 Updated by Kyle Knoepfel 9 days ago

  • Status changed from Resolved to Closed

