Bug #23348

Memory leaks while reading artdaq Fragments from data files

Added by Eric Flumerfelt 12 months ago. Updated 12 months ago.

Status: Closed
Priority: Normal
Category: -
Target version: artdaq_core v3_05_07
Start date: 09/27/2019
% Done: 100%
Experiment: -

Description

This issue arose from close inspection of the DUNE data unpacking while debugging issue #23319.

Relevant Comments from that issue:

Updated by Tingjun Yang 1 day ago

Pengfei Ding wrote:

    Yeah, there are. "Succeeds" meant "art exits with 0" to me.

    [...]

Thanks. Memory usage seems high.
#27 Updated by Pengfei Ding 1 day ago

    File valgrind.out added


Attached file contains the valgrind output with

valgrind --leak-check=yes --tool=memcheck --track-origins=yes --num-callers=50 --log-file=valgrind.out --suppressions=${ROOTSYS}/etc/valgrind-root.supp lar -c  RunRawDecoder.fcl  np04_raw_run009701_0004_dl1.root -n 2

Peak resident memory usage was 2994.43 MB.

I have another valgrind job running over the full 50 events in the raw file (expected to finish in ~40 minutes).
#28 Updated by Pengfei Ding 1 day ago

    File valgrind2.out added


Attached is the valgrind output for the full 50 events in the file. (Possibly lost: ~4.07 GB.)

The issue is likely to be in lines 961-962 of dunetpc/dune/Protodune/singlephase/RawDecoding/PDSPTPCRawDecoder_module.cc

==223821== LEAK SUMMARY:
==223821==    definitely lost: 5,745,989 bytes in 38,317 blocks
==223821==    indirectly lost: 137,621 bytes in 164 blocks
==223821==      possibly lost: 4,068,422,379 bytes in 54,396 blocks
==223821==    still reachable: 266,349,295 bytes in 176,181 blocks
==223821==                       of which reachable via heuristic:
==223821==                         newarray           : 95,704 bytes in 187 blocks
==223821==                         multipleinheritance: 6,768 bytes in 9 blocks
==223821==         suppressed: 1,365,831 bytes in 15,879 blocks
==223821== Reachable blocks (those to which a pointer was found) are not shown.
==223821== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==223821==
==223821== For counts of detected and suppressed errors, rerun with: -v
==223821== ERROR SUMMARY: 268772 errors from 217 contexts (suppressed: 487514 from 1738)

#29 Updated by Thomas Junk 1 day ago

Thanks for the news; that's what I saw in a test, and I thought I had just built dune_raw_data from the bugfix branch incorrectly. It looks like there will need to be some debugging, comparing a working job with this one side by side. The "no valid nticks to check" message comes from the RCE decoder.
#30 Updated by Thomas Junk 1 day ago

I spotted the trouble with failing to unpack the RCE fragments. There is an integrity check on the fragment that needs to look at the header and the trailer, and it was shifted relative to a hardcoded offset; updating that offset allowed the integrity check to pass. I'll ask Eric which combination of the exposed offset methods is the supported way to get the size I need, so the check is more robust against future changes.

There's still a memory leak however.
#31 Updated by Thomas Junk 1 day ago

Commenting out the removeCachedProduct calls doesn't change the memory leak.
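For reference, the calls in question look roughly like this (a sketch only, not the decoder's actual code; the handle and label names are placeholders). removeCachedProduct asks art to drop its cached copy of a product once the module is done with it:

// Sketch only -- approximate shape of the calls being commented out.
art::Handle<artdaq::Fragments> frag_handle;   // artdaq::Fragments = std::vector<artdaq::Fragment>
e.getByLabel(input_tag, frag_handle);         // input_tag: placeholder for the decoder's configured label
// ... unpack the fragments into RawDigits ...
e.removeCachedProduct(frag_handle);           // tell art to drop its cached copy of the product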
#32 Updated by Thomas Junk 1 day ago

I commented out the _processFELIX call and it still leaks, just more slowly.
#33 Updated by Thomas Junk 1 day ago

I trimmed the raw decoder module down to just two getByLabel calls, one for RCE container fragments, the other for FELIX container fragments. It runs very quickly, but still leaks memory.
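A minimal sketch of such a trimmed-down module (the class name, header paths, and input labels below are assumptions for illustration, not the actual DUNE configuration):

// FragmentReadTest_module.cc -- illustrative sketch only.
#include "art/Framework/Core/EDProducer.h"
#include "art/Framework/Core/ModuleMacros.h"
#include "art/Framework/Principal/Event.h"
#include "art/Framework/Principal/Handle.h"
#include "canvas/Utilities/InputTag.h"
#include "artdaq-core/Data/Fragment.hh"
#include "fhiclcpp/ParameterSet.h"

class FragmentReadTest : public art::EDProducer {
public:
  explicit FragmentReadTest(fhicl::ParameterSet const& p) : EDProducer{p} {}

  void produce(art::Event& e) override
  {
    // artdaq::Fragments is std::vector<artdaq::Fragment>.
    art::Handle<artdaq::Fragments> rce_frags;
    art::Handle<artdaq::Fragments> felix_frags;
    e.getByLabel(art::InputTag{"daq", "ContainerTPC"}, rce_frags);     // label assumed
    e.getByLabel(art::InputTag{"daq", "ContainerFELIX"}, felix_frags); // label assumed
    // No unpacking: the two reads alone are enough to reproduce the leak.
    // (The real test module also puts an empty raw::RawDigit collection; see #38 below.)
  }
};

DEFINE_ART_MODULE(FragmentReadTest)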
#34 Updated by Thomas Junk 1 day ago

And replacing getByLabel calls with getValidHandle calls changes nothing -- leaks memory very quickly. No artdaq_core calls involved, just art.
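For comparison, the getValidHandle form of the same read is roughly as follows (same assumed label as in the sketch above; getValidHandle throws if the product is absent, but the underlying read path is the same):

// Sketch only.
auto const& rce_frags =
    *e.getValidHandle<artdaq::Fragments>(art::InputTag{"daq", "ContainerTPC"});  // label assumed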
#35 Updated by Pengfei Ding 1 day ago

Thomas Junk wrote:

    And replacing getByLabel calls with getValidHandle calls changes nothing -- leaks memory very quickly. No artdaq_core calls involved, just art.

Do you have a backtrace of using getValidHandle to compare with getByLabel? The common downstream call might indicate the origin of the memory leak.
#36 Updated by Pengfei Ding 1 day ago

    File Screen Shot 2019-09-26 at 1.12.33 PM.png added
    File Screen Shot 2019-09-26 at 1.11.54 PM.png added


I tried to run the Raw Decoder with memory debugging in DDT. It segfaulted with "address not mapped to object". See the attached screenshots for details. The problem seems to be at line 24 of getEntry.cc in the art_root_io package.
#37 Updated by Kyle Knoepfel about 19 hours ago

I have started to investigate.
#38 Updated by Thomas Junk about 16 hours ago

Here's my super-stripped-down PDSPTPCRawDecoder_module.cc:

/dune/app/users/trj/splitter10/srcs/dunetpc/dune/Protodune/singlephase/RawDecoding/PDSPTPCRawDecoder_module.cc

which just has the two getByLabel calls and makes an empty RawDigit collection in the event (maybe even that can go, but it's a producer module so I figured it would have to produce something). I was going to rename it and put it on Tingjun's test branch. But it would be good to see if memory leaks happen when reading in other data products in other modules.
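The "empty RawDigit collection" part of such a producer might look roughly like this (again a sketch with assumed placement, not the contents of the file at the path above):

// In the constructor body: register the (empty) product so art accepts the producer.
produces<std::vector<raw::RawDigit>>();

// At the end of produce(art::Event& e): put an empty collection into the event.
e.put(std::make_unique<std::vector<raw::RawDigit>>());
// (requires #include "lardataobj/RawData/RawDigit.h" and #include <memory>)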

Attachments

no_fast_clone.log (5.83 KB), Kyle Knoepfel, 09/30/2019 11:46 AM
no_fast_clone.fcl (107 Bytes), Kyle Knoepfel, 09/30/2019 11:46 AM
fast_clone.log (5.83 KB), Kyle Knoepfel, 09/30/2019 11:46 AM
fast_clone.fcl (106 Bytes), Kyle Knoepfel, 09/30/2019 11:46 AM
root_6.18_valgrind.log (75.4 KB), Kyle Knoepfel, 10/01/2019 09:03 AM
root_6.16_valgrind.log (38.8 KB), Kyle Knoepfel, 10/01/2019 09:03 AM
massif.out.437219 (796 KB), Kyle Knoepfel, 10/01/2019 10:10 AM
massif.out.50_events (360 KB), Kyle Knoepfel, 10/01/2019 10:31 AM

Related issues

Related to artdaq - Bug #23319: Problems reading ProtoDUNE-SP raw data file with artdaq_core v3_05_02 (Closed, 09/24/2019)

History

#1 Updated by Eric Flumerfelt 12 months ago

  • Related to Bug #23319: Problems reading ProtoDUNE-SP raw data file with artdaq_core v3_05_02 added

#2 Updated by Thomas Junk 12 months ago

For what it's worth, I ran valgrind on

lar -n 10 -c RunRawDecoder.fcl root://fndca1.fnal.gov:1094/pnfs/fnal.gov/usr/dune/tape_backed/dunepro/protodune-sp/raw/2019/detector/test/None/00/00/97/01/np04_raw_run009701_0004_dl1.root

with an older dunetpc, v08_28_01, which is built on root v6_16_00. As noted in the associated ticket, it doesn't have a big leak.

==6321== LEAK SUMMARY:
==6321==    definitely lost: 5,724,190 bytes in 38,179 blocks
==6321==    indirectly lost: 40,011 bytes in 603 blocks
==6321==      possibly lost: 3,545,856 bytes in 53,457 blocks
==6321==    still reachable: 201,330,025 bytes in 202,554 blocks
==6321==                       of which reachable via heuristic:
==6321==                         newarray           : 93,944 bytes in 177 blocks
==6321==                         multipleinheritance: 7,880 bytes in 9 blocks
==6321==         suppressed: 0 bytes in 0 blocks
==6321== Reachable blocks (those to which a pointer was found) are not shown.
==6321== To see them, rerun with: --leak-check=full --show-leak-kinds=all

#3 Updated by Thomas Junk 12 months ago

I tried my super-stripped-down module reading raw::RawDigits from a decoded file instead of artdaq::Fragments from a raw file, just to see if it leaked too with the new ROOT, and it does not. The input data size is larger because it was uncompressed.

#4 Updated by Thomas Junk 12 months ago

I misspoke in the previous ticket when I said that my super-stripped-down version of the raw decoder, which just calls getByLabel, does not call anything in artdaq. In fact, the constructor artdaq::Fragment::Fragment() gets called during fragment read-in.

#5 Updated by Thomas Junk 12 months ago

I checked out and built the branch bugfix/23345_ContainerFragment_FixFragmentAllocation of artdaq-core and ran my stripped down fragment reader and it still leaks.

#6 Updated by Kyle Knoepfel 12 months ago

See the attached configuration files and job logs. The jobs only read the input file referenced above (all 50 events) and write the products to an output file, with no processing in between. The following maximum RSS is reached:

Fast-cloning enabled:   296 MB
Fast-cloning disabled: 4725 MB

The memory growth occurs throughout the course of the no-fast-cloning job. Will continue to debug.

#7 Updated by Kyle Knoepfel 12 months ago

  • Status changed from New to Assigned

#8 Updated by Eric Flumerfelt 12 months ago

I was running against my old container test file (https://artdaq.fnal.gov/TestDataFiles/RawFragmentHeaderV1_example_file.root) and noticed that it was using far too much memory and taking much longer than with RawFragmentHeaderV2_example_file.root. I found that by replacing the TRACEN calls in Fragment.hh with TLOG calls, the memory usage and processing speed were much improved.

Fix is on artdaq-core:bugfix/23348_Fragment_ReplaceTRACENWithTLOG
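Illustratively, the kind of change described here swaps a printf-style TRACEN macro for a streaming TLOG call; the lines below are an assumed form for illustration only (the branch above has the real diff):

// Assumed form only -- not the actual Fragment.hh change.
// before:
//   TRACEN("Fragment", TLVL_DEBUG, "resizing fragment to %zu words", sz);
// after:
//   TLOG(TLVL_DEBUG, "Fragment") << "resizing fragment to " << sz << " words";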

#9 Updated by Lynn Garren 12 months ago

Kyle, a test build of artdaq_core "v3_05_06" (no tag) is available on scisoftbuild01 under /home/garren/scratch/products.

#10 Updated by Kyle Knoepfel 12 months ago

Eric, with what version of artdaq_core did you start to notice a difference in TRACEN vs. TLOG behavior?

I'm running the bare-bones job (just reading/writing the fragments) with version v3_05_02, which appears to be a version in which only std::cout was used in the Fragment.hh file. However, I'm still seeing substantial memory growth.

#11 Updated by Eric Flumerfelt 12 months ago

I was working with develop. I changed the couts to TRACEN between artdaq_core v3_05_03 and v3_05_04, while fixing the bug that was causing DUNE jobs to crash.

#12 Updated by Kyle Knoepfel 12 months ago

As the input file (np04_raw_run009701_0004_dl1.root) contains only art/artdaq constructs, I was able to reproduce the memory growth by building artdaq_core v3_04_20 against both ROOT 6.16 and 6.18. Specifically, running over 5 events in the input file:

artdaq_core commit | art_root_io version | canvas_root_io version | ROOT version | VmPeak (MB) | VmHWM (MB)
222453e            | 1.00.07 (s86)       | 1.03.04                | 6.16/00      | 1046.39     | 698.094
222453e            | 1.00.09 (s91)       | 1.03.06                | 6.18/04      | 1310.09     | 956.346

The job reads in the input file (fast-cloning disabled) and then writes the products immediately to disk. If I increase the number of events to 50 (the full file), the ROOT 6.18/04 build results in over 4 GB of memory growth.

The only substantive change in art_root_io is the change in canvas_root_io versions; the only substantive change in canvas_root_io is the change in ROOT versions and the addition of a regression test that is not accessible to artdaq_core.

Please see the attached valgrind logs for the jobs referenced above. The most substantial difference between the two valgrind logs appears to be the extra block of "possibly lost" memory for the ROOT 6.18 version:

...

==420317== 417,144,328 bytes in 16 blocks are possibly lost in loss record 12,133 of 12,133
==420317==    at 0x4C2AAF8: operator new[](unsigned long) (vg_replace_malloc.c:423)
==420317==    by 0x105BC9EF: TBuffer::TBuffer(TBuffer::EMode, int) (TBuffer.cxx:85)
==420317==    by 0x10C9C2BD: TBufferIO::TBufferIO(TBuffer::EMode, int) (TBufferIO.cxx:51)
==420317==    by 0x10C8CBD1: TBufferFile::TBufferFile(TBuffer::EMode, int) (TBufferFile.cxx:89)
==420317==    by 0x11A29E67: R__InitializeReadBasketBuffer(TBuffer*, int, TFile*) (TBasket.cxx:419)
==420317==    by 0x11A2A119: TBasket::ReadBasketBuffers(long long, int, TFile*) (TBasket.cxx:504)
==420317==    by 0x11A3A10B: TBranch::GetBasketImpl(int, TBuffer*) (TBranch.cxx:1245)
==420317==    by 0x11A3A85C: TBranch::GetBasketAndFirst(TBasket*&, long long&, TBuffer*) (TBranch.cxx:1362)
==420317==    by 0x11A3B30A: TBranch::GetEntry(long long, int) (TBranch.cxx:1589)
==420317==    by 0x11A4E274: TBranchElement::GetEntry(long long, int) (TBranchElement.cxx:2652)
==420317==    by 0x11A4DE3F: TBranchElement::GetEntry(long long, int) (TBranchElement.cxx:2596)
==420317==    by 0x11A4DE3F: TBranchElement::GetEntry(long long, int) (TBranchElement.cxx:2596)
==420317== 
==420317== LEAK SUMMARY:
==420317==    definitely lost: 13,463 bytes in 97 blocks
==420317==    indirectly lost: 270,311 bytes in 76 blocks
==420317==      possibly lost: 417,150,634 bytes in 44 blocks
==420317==    still reachable: 48,273,146 bytes in 77,422 blocks
==420317==                       of which reachable via heuristic:
==420317==                         newarray           : 58,376 bytes in 95 blocks
==420317==                         multipleinheritance: 928 bytes in 2 blocks
==420317==         suppressed: 497,781 bytes in 5,287 blocks
==420317== Reachable blocks (those to which a pointer was found) are not shown.
==420317== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==420317== 
==420317== For counts of detected and suppressed errors, rerun with: -v
==420317== ERROR SUMMARY: 1087 errors from 39 contexts (suppressed: 10420 from 407)

Philippe, do you have guidance?

#13 Updated by Philippe Canal 12 months ago

The "possibly lost" memory is concerning and needs to be understood and then fixed, BUT it is an order of magnitude too small to explain the 4 GB memory footprint.
Can we run massif (or any other memory profiler) on the failing example to see where the memory is being allocated?

#14 Updated by Kyle Knoepfel 12 months ago

It's an order of magnitude too small because the valgrind report applies to a job with only 5 events processed (instead of the full 50). I will run massif on the ROOT 6.18 build and post it once it's available.

#15 Updated by Kyle Knoepfel 12 months ago

massif output attached -- need to run ms_print on it.

#16 Updated by Philippe Canal 12 months ago

Can you run it on the 50-event case?

#17 Updated by Kyle Knoepfel 12 months ago

50-event massif output attached.

#18 Updated by Philippe Canal 12 months ago

Quite strange ... I will need a debug build of ROOT to try to trace down what is different in this setup compared to any other (i.e. the 'leak' appears to be in the inner guts of TTree ...).

#19 Updated by Lynn Garren 12 months ago

Philippe, debug builds of root are available on scisoftbuild01. Do you need them elsewhere?

#20 Updated by Philippe Canal 12 months ago

Yes ... I am lame/forgetful :). Could I be reminded how to set up and run the failing example with the debug build of ROOT?

Oh that ... and I don't have access to scisoftbuild01

Warning: Permanently added 'scisoftbuild01.fnal.gov' (ECDSA) to the list of known hosts.
Permission denied (gssapi-keyex,gssapi-with-mic).

#21 Updated by Kyle Knoepfel 12 months ago

Once you have permissions on scisoftbuild01, you can rerun the job by doing:

> . /products/setup
> PRODUCTS=/scratch/garren/products:$PRODUCTS
> setup artdaq_core v3_04_22 -q e17:debug:s91
> setup art_root_io v1_00_09 -q e17:debug
> art -c /home/knoepfel/no_fast_clone.fcl /home/knoepfel/input.root

The debugging tools:

> setup valgrind v3_14_0
> setup gdb

If you only want to run with 5 events, add the "-n 5" option to the art command line.

#22 Updated by Philippe Canal 12 months ago

The problem is resolved in https://github.com/root-project/root/pull/4461.
The 'unusual' feature of the file is that it is uncompressed, which triggers a recent bug (in v6.18) that leads to the use of broken code (present since v5.32, but never actually 'executed' per se).

#23 Updated by Tingjun Yang 12 months ago

Thank you Philippe for identifying and fixing the problem. Is the plan to release a new version of root with this bug fix so art/larsoft can use it? Thanks.

#24 Updated by Lynn Garren 12 months ago

Indeed. We are building a new stack with root v6_18_04b. A build of the nutools suite is now available. I am in the process of installing nutools v3_04_03 and artdaq_core v3_05_06 on cvmfs. It should be available in about an hour. Expecting to tag the larsoft release sometime this afternoon. We're pushing hard to get this out.

#25 Updated by Lynn Garren 12 months ago

Forgot to say that Kyle tested with root v6_18_04b and it looks good.

#26 Updated by Tingjun Yang 12 months ago

Sounds great. Thanks!

#27 Updated by Lynn Garren 12 months ago

  • % Done changed from 0 to 100
  • Status changed from Assigned to Resolved

Resolved with root v6_18_04b and artdaq_core v3_05_06 in larsoft v08_32_00.

#28 Updated by Eric Flumerfelt 12 months ago

  • Target version set to artdaq_core v3_05_07
  • Status changed from Resolved to Closed

