Project

General

Profile

Bug #23427

Fragment constructor needs bulletproofing

Added by Thomas Junk about 1 month ago. Updated 20 days ago.

Status:
Reviewed
Priority:
Normal
Category:
Known Issues
Target version:
-
Start date:
10/15/2019
Due date:
% Done:

100%

Estimated time:
Experiment:
DUNE
Co-Assignees:
Duration:

Description

A problem has cropped up reading some ProtoDUNE-SP data with artdaq_core v3_05_06. It is more rare than our last problem, and so it probably stems from a data corruption issue from our DAQ. But it causes our offline jobs to crash on a bad memory free call.

The input file (xrootd url):

root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/64/07/52/np04_raw_run005181_0001_dl1.root

The problem is in the containerCTB fragments in the 20th event in the file. Running our unpacker job produces this traceback:

Program terminated with signal SIGABRT, Aborted.
#0 0x00007f440fa19377 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007f440fa19377 in raise () from /lib64/libc.so.6
#1 0x00007f440fa1aa68 in abort () from /lib64/libc.so.6
#2 0x00007f440fa5bec7 in _libc_message () from /lib64/libc.so.6
#3 0x00007f440fa646b9 in _int_free () from /lib64/libc.so.6
#4 0x00007f43faf33652 in artdaq::QuickVec<unsigned long long>::~QuickVec
(this=0x3802ed10, __in_chrg=<optimized out>)
at
/cvmfs/larsoft.opensciencegrid.org/products/artdaq_core/v3_05_06/include/artdaq-core/Core/QuickVec.hh:368
#5 artdaq::Fragment::~Fragment (this=0x3802ed10, __in_chrg=<optimized out>)
at
/cvmfs/larsoft.opensciencegrid.org/products/artdaq_core/v3_05_06/include/artdaq-core/Data/Fragment.hh:85
#6 std::default_delete<artdaq::Fragment>::operator() (this=<optimized out>,
_ptr=0x3802ed10)
at
/cvmfs/larsoft.opensciencegrid.org/products/gcc/v7_3_0/Linux64bit+3.10-2.17/include/c++/7.3.0/bits/unique_ptr.h:78
#7 std::unique_ptr<artdaq::Fragment, std::default_delete<artdaq::Fragment>
>::~unique_ptr (this=<optimized out>, __in_chrg=<optimized out>)
at
/cvmfs/larsoft.opensciencegrid.org/products/gcc/v7_3_0/Linux64bit+3.10-2.17/include/c++/7.3.0/bits/unique_ptr.h:268
#8 PDSPCTBRawDecoder::produce (this=0x746ec00, evt=...)
at
/nashome/d/dladams/dev/dudev02/workdir/srcs/dunetpc/dune/Protodune/singlephase/RawDecoding/PDSPCTBRawDecoder_module.cc:105
#9 0x00007f44132129a8 in art::EDProducer::produceWithFrame (this=0x746ec00,
e=..., frame=...)
....

Our code, PDSPCTBRawDecoder, can be found at /cvmfs/dune.opensciencegrid.org/products/dune/dunetpc/v08_32_00/source/dune/Protodune/singlephase/RawDecoding/PDSPCTBRawDecoder_module.cc.

If I run valgrind on it, the job does not crash but the output does have several instances of "invalid write" and "invalid read"

valgrind --leak-check=yes art --nskip 19 -n 5 -c RunRawDecoder.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune/np04/beam/detector/None/raw/06/64/07/52/np04_raw_run005181_0001_dl1.root >& vgrind1.out

MSG
24270 Invalid write of size 8
24270 at 0x214CF219: QuickVec (QuickVec.hh:348)
24270 by 0x214CF219: artdaq::Fragment::Fragment(unsigned long) (Fragment.cc:24)
24270 by 0x21DF458F: at (ContainerFragment.hh:181)
24270 by 0x21DF458F: operator[] (ContainerFragment.hh:211)
24270 by 0x21DF458F: PDSPCTBRawDecoder::produce(art::Eventx%x
) (PDSPCTBRawDecoder_module.cc:105)
24270 by 0x58DF9A7: art::EDProducer::produceWithFrame(art::Event&, art::ProcessingFrame const&) (EDProducer.cc:88)
24270 by 0x5976580: art::detail::Producer::doEvent(art::EventPrincipal&, art::ModuleContext const&, std::atomic<unsigned long>&, std::atomic<unsigned long>&, std::atomic<unsigned long>&) (Producer.cc:121)
24270 by 0x5E2B4DD: art::Worker::runWorker(art::EventPrincipal&, art::ModuleContext const&) (Worker.cc:591)
24270 by 0x5E2D414: operator() (Worker.cc:569)
24270 by 0x5E2D414: void hep::concurrency::SerialTaskQueueChain::runFunc<art::(anonymous namespace)::RunWorkerFunctor>(art::(anonymous namespace)::RunWorkerFunctor const&) (SerialTaskQueueChain.h:83)
24270 by 0x5E2D5FB: operator() (SerialTaskQueueChain.h:69)
24270 by 0x5E2D5FB: hep::concurrency::QueuedTask<void hep::concurrency::SerialTaskQueueChain::passDown<art::(anonymous namespace)::RunWorkerFunctor>(unsigned int, art::(anonymous namespace)::RunWorkerFunctor const&)::{lambda()#1}>::execute() (SerialTaskQueue.h:87)

....

24270 Invalid read of size 4
24270 at 0x214CF31A: touch (RawFragmentHeader.hh:236)
24270 by 0x214CF31A: artdaq::Fragment::Fragment(unsigned long) (Fragment.cc:39)
24270 by 0x21DF458F: at (ContainerFragment.hh:181)
24270 by 0x21DF458F: operator[] (ContainerFragment.hh:211)
24270 by 0x21DF458F: PDSPCTBRawDecoder::produce(art::Event&) (PDSPCTBRawDecoder_module.cc:105)
24270 by 0x58DF9A7: art::EDProducer::produceWithFrame(art::Event&, art::ProcessingFrame const&) (EDProducer.cc:88)
24270 by 0x5976580: art::detail::Producer::doEvent(art::EventPrincipal&, art::ModuleContext const&, std::atomic<unsigned long>&, std::atomic<unsigned long>&, std::atomic<unsigned long>&) (Producer.cc:121)
24270 by 0x5E2B4DD: art::Worker::runWorker(art::EventPrincipal&, art::ModuleContext const&) (Worker.cc:591)
24270 by 0x5E2D414: operator() (Worker.cc:569)

...

24270 Invalid write of size 4
24270 at 0x214CF31F: touch (RawFragmentHeader.hh:237)
24270 by 0x214CF31F: artdaq::Fragment::Fragment(unsigned long) (Fragment.cc:39)
24270 by 0x21DF458F: at (ContainerFragment.hh:181)
24270 by 0x21DF458F: operator[] (ContainerFragment.hh:211)
24270 by 0x21DF458F: PDSPCTBRawDecoder::produce(art::Event&) (PDSPCTBRawDecoder_module.cc:105)
24270 by 0x58DF9A7: art::EDProducer::produceWithFrame(art::Event&, art::ProcessingFrame const&) (EDProducer.cc:88)

I ran the same decoder job successfully with artdaq_core v3_04_01. Our own module, PDSPCTBDecoder_module.cc has not changed since then.

Most fragments on most events still decode fine. As mentioned above, it may be a case of corrupt data not being handled properly. But there seems not to be any way we can check for corruption before instantiating and deleting the fragment.

Associated revisions

Revision 4f174db0 (diff)
Added by Eric Flumerfelt about 1 month ago

Add a check to make sure that ContainerFragment::at does not call the Fragment
consturctor with a negative size argument. Resolves #23427.

Revision 39dd6969 (diff)
Added by Eric Flumerfelt 21 days ago

Add a check to make sure that ContainerFragment::at does not call the Fragment
consturctor with a negative size argument. Resolves #23427.

History

#1 Updated by Thomas Junk about 1 month ago

I read in the first five events of that file with valgrind running, and did not see "Invalid" in the output.

#2 Updated by Eric Flumerfelt about 1 month ago

  • % Done changed from 0 to 100
  • Status changed from New to Resolved

#3 Updated by Eric Flumerfelt about 1 month ago

  • Category deleted (Needed Enhancements)
  • Experiment deleted (DUNE)

Resolution branch is artdaq-core:bugfix/23427_ContainerFragment_HandleTinyFragments

#4 Updated by Eric Flumerfelt about 1 month ago

  • Assignee set to Eric Flumerfelt
  • Category set to Known Issues
  • Experiment DUNE added

#5 Updated by Tingjun Yang about 1 month ago

Hi Eric,

Thanks for the fix. I wonder if we can request a patch release for artdaq_core v3_04_20. This is what the current ProtoDUNE production uses in dunetpc v08_27_01.

Tingjun

#6 Updated by Tingjun Yang 26 days ago

Would it be OK to make a patch release based on artdaq_core v3_04_20? Also the fix should be merged into a new release that will eventually be used by art/larsoft?

Thanks.

#7 Updated by John Freeman 21 days ago

Doing a code review of commit 4f174db0da00af4417c6e14cac576b1e6cdb7bf0 on bugfix/23427_ContainerFragment_HandleTinyFragments, I have just a couple of questions:

1. Looking at this snippet:

// Subtract RawFragmentHeader::num_words here as Fragment consturctor will allocate n + detail::RawFragmentHeader::num_words(), and we want fragSize to be allocated.
frag.reset(new Fragment((fragSize(index)) / sizeof(RawDataType) - detail::RawFragmentHeader::num_words()));

...if the value of fragSize(index) isn't evenly divisible by sizeof(RawDataType), will fragSize(index)) / sizeof(RawDataType) truncate and therefore not provide enough space? E.g., if fragSize(index) is 100 bytes, the raw data type is 8 bytes, and the header is 24 bytes, will the argument to the Fragment constructor evaluate to 100/8 - 3 == 9, corresponding to 9 8-byte words of payload, or four bytes short of what we'd actually need? Or is this not a concern since the payload of the original fragment in the container is created out of 8-byte words anyway?

2. If someone passes the index of the last fragment in the container, then in the call to fragSize(index), it appears a cet::exception("ArgumentOutOfRange") exception would get thrown because index >= block_count() ; is this the desired behavior?

#8 Updated by Eric Flumerfelt 20 days ago

Those are good points...I'm pretty sure it's not possible to make Fragments which are not an integer number of RawDataType words, but we can always add + (fragSize(index) % sizeof(RawDataType) != 0 ? 1 : 0) if it makes you feel better.

ContainerFragments are 0-indexed, so asking for any index >= block_count is and should be an error. ContainerFragment assumes that the first Fragment is at offset 0, so the first entry in the index is the position of the start of Fragment 1, the second Fragment in the container.

#9 Updated by John Freeman 20 days ago

  • Status changed from Resolved to Reviewed

I see that the artdaq::Fragment's data is stored in a container of RawDataTypes, so my concern over truncation's moot. And C-style indexing means the way the last fragment in a container's handled is totally fine as well. Reviewed.

#10 Updated by Eric Flumerfelt 20 days ago

  • Co-Assignees John Freeman added
  • Co-Assignees deleted (Eric Flumerfelt)

#11 Updated by Tingjun Yang 20 days ago

I saw there were a new commit from yesterday. Do we need a new tag?



Also available in: Atom PDF