Investigation of Memory Leak in NetMonInput

A memory issue was noticed in ARTDAQ systems (first for LBNE-35t), where the Anonymous Pages Ganglia metric would climb over time until the system crashed due to running out of memory. Several different techniques were brought to bear in rooting out the source of this memory leak.

John's Investigation

To investigate this leak, I started with artdaq-demo at its then-HEAD commit, 51fc0d0e58f0928aa0ebf96c7bf4ee968755c553, along with the versions of artdaq-core, etc., implied by that commit's quick-start.sh script. Runs were performed on lbnedaq7. The leak was tracked down in the following manner:

  • Looking at the increase in memory usage on a per-process basis with "top" and then "ps aux", it was clear that the lion's share of the leaking was done by the diskwriting aggregator.
    --Running a 2x2 (2 boardreaders, 2 eventbuilders, 0 aggregators) system with no diskwriting, the memory leak was basically negligible.
  • What was different about the diskwriting aggregator's code as compared to the online monitoring aggregator's?
    --Continuously copied data into shared memory (as opposed to reading it)
    --Used the NetMonInput source to read events sent by the eventbuilders
  • As the memory leak appeared as an almost monotonic increase with a constant slope, the NetMonInputDetail::readNext function seemed to be an ideal place to look for a leak
  • NetMonInputDetail::readNext calls the function NetMonInputDetail::readAndConstructPrincipal. In this function, a TBufferFile object, "msg", has its ReadObjectAny function called repeatedly, with patterns like the following:
    p = msg.ReadObjectAny(event_aux_class); // "p" is of type void*, pointing to a block of memory returned by ReadObjectAny
    // In the actual code, debug statements would appear here; I'm omitting them for simplicity
    event_aux.reset(reinterpret_cast<art::EventAuxiliary*>(p)); // "event_aux" is a unique_ptr object
    

    When event_aux is eventually destructed, it would clean up the memory pointed to by "p". The fact that the program didn't crash when this happened suggested to me that even though TBufferFile::ReadObjectAny created the block of memory, it must not have owned it (otherwise we'd have a double-deletion on our hands). For this reason the following code snippet was quite interesting:
    p = msg.ReadObjectAny(history_class); // Same "p" as before, now pointing to a new block of memory
    // In the actual code, debug statements would appear here; I'm omitting them for simplicity
    history.reset(new art::History(
                              *reinterpret_cast<art::History*>(p))); // "history" is a shared_ptr object
    

    This last line does the following:
  1. Takes the "p" pointer and recasts it as a pointer to the art::History object in the memory pointed to by "p"
  2. Dereferences it - i.e., gives us the actual art::History object
  3. Passes that object to the copy constructor of art::History. As no constructors are declared in the art::History class, this is just the compiler-generated copy constructor
  4. The "new" keyword allocates fresh memory to hold an art::History object, and the copy is constructed into that fresh memory
  5. It's this copy-constructed art::History object in the fresh memory which the "history" shared_ptr takes ownership of - NOT the memory created in the ReadObjectAny call on the first line
  6. The memory returned by the ReadObjectAny call has therefore been orphaned - hence the leak (see the sketch below)
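
The following minimal, standalone C++ sketch illustrates the ownership mistake. It does not use ROOT or art: "Payload" is a placeholder type standing in for art::History, and "fake_ReadObjectAny" is a hypothetical stand-in for TBufferFile::ReadObjectAny, which likewise hands back a heap-allocated object as a void* and leaves ownership with the caller.

    #include <memory>

    struct Payload { int x = 42; };   // placeholder for art::History

    // Stand-in for TBufferFile::ReadObjectAny: allocates an object and
    // returns it as void*, leaving ownership with the caller.
    void* fake_ReadObjectAny() { return new Payload; }

    int main()
    {
        // Leaky pattern (what the original code did): the shared_ptr ends up
        // owning a *copy*; the block returned by fake_ReadObjectAny is never
        // deleted.
        {
            void* p = fake_ReadObjectAny();
            std::shared_ptr<Payload> history;
            history.reset(new Payload(*reinterpret_cast<Payload*>(p)));
        }   // the copy is freed here, but "p" has been orphaned - this is the leak

        // Fixed pattern: the shared_ptr takes ownership of the original
        // allocation directly, so it is freed when "history" goes away.
        {
            void* p = fake_ReadObjectAny();
            std::shared_ptr<Payload> history;
            history.reset(reinterpret_cast<Payload*>(p));
        }

        return 0;
    }

Running the leaky half of this sketch under a tool such as valgrind reports the orphaned Payload block as definitely lost, which is consistent with the steady growth in anonymous pages seen from the aggregator.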

To see the effect of this leak when running a 2x2x2 artdaq-demo system with diskwriting off, take a look at the following plot, created Apr-26-2016:

This shows the use of anonymous pages (allocated memory not backed by a file) for runs under two different artdaq compilations. For the first one, I replaced the line

history.reset(new art::History(
                          *reinterpret_cast<art::History*>(p))); 

with
history.reset(reinterpret_cast<art::History*>(p)); 

and for the second run, I used the original line. The results in the plot speak for themselves - the great majority of the increase in anonymous page usage over time was due to this line of code.
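
For reference, the quantity Ganglia is plotting here can also be checked directly on a node by reading the AnonPages field of /proc/meminfo. The snippet below is a minimal, Linux-specific sketch of that check; it is not part of artdaq or Ganglia.

    // anon_pages.cc - print the AnonPages line from /proc/meminfo (Linux).
    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        std::ifstream meminfo("/proc/meminfo");
        std::string line;
        while (std::getline(meminfo, line)) {
            if (line.compare(0, 10, "AnonPages:") == 0) {
                std::cout << line << '\n';   // e.g. "AnonPages:  1234567 kB"
                return 0;
            }
        }
        return 1;   // field not found
    }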

Eric's Investigation

I was able to collect detailed information about the memory allocations in ARTDAQ using TotalView's MemoryScape program. To do this, I had to link the BoardReader, EventBuilder, and Aggregator applications against the TotalView heap debugging library, libtvheap_64.so. I was able to collect information about the total heap size after initialization but before running, after running for some period of time (still in the running state), and after issuing a stop command to the system. The program also gave some indication of where in the source memory leaks were occurring, but this appeared to be imprecise, as the areas of the source it pointed to were not actually causing leaks.

Testing

To determine whether problems were being caused by leaks or simply stale allocations (after John found the root cause of our problems), I took the above-described snapshots for 1.5h of running and for 15h of running.

Ganglia Plots of Anonymous Pages: 4h view (the last 4h of the 15h test) and day view (13:00-15:00 is the 1.5h test, 15:00 to 08:00 is the 15h test)

The "sawtooth" pattern is interesting. The large increase near the end is due to the collection of heap data by TotalView. From looking at this plot, we can see a ~50-100 MB increase in memory usage over the 15h run.

Heap Usage

Run                      | Before "start" Command | Before "stop" Command | After "stop" Command
1.5h with disk writing   |                        |                       |
15h without disk writing |                        |                       |

Conclusion: John's fix appears to have stabilized the memory usage of artdaq-demo. This technique has potential for identifying components that may be leaking memory and can serve as a guide to more targeted approaches.

Ron's Investigation

In a similar (but completely different) approach, Ron created a library which was loaded using LD_PRELOAD and which overrode the memory management functions such as malloc, free, mmap, and others. The new functions included TRACE printouts so that the memory usage of the program could be viewed at a line-by-line level. Ron also created a Perl script which goes through a trace buffer, keeps track of all un-free'd mallocs, and prints a list of these potentially-leaking malloc calls. Using the TRACE timestamp and other TRACEs in the region, it is possible to identify exactly which line of code is leaking memory. We did discover, however, that care must be taken to make sure that you are actually capturing the part of the program which is leaking memory, and that your traces must be relatively fine-grained to make a positive identification. A minimal sketch of such an interposer library appears below.
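
The following sketch is not Ron's actual library: it only covers malloc and free (the real library also covered mmap and other functions) and logs with fprintf to stderr rather than with TRACE, but it shows the LD_PRELOAD interposition technique itself.

    // malloc_trace.cc - minimal sketch of an LD_PRELOAD interposer library.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE          // for RTLD_NEXT
    #endif
    #include <dlfcn.h>
    #include <cstddef>
    #include <cstdio>

    namespace {
      using malloc_fn = void* (*)(std::size_t);
      using free_fn   = void  (*)(void*);
      malloc_fn real_malloc = nullptr;
      free_fn   real_free   = nullptr;
      thread_local bool in_hook = false;   // avoid recursing when fprintf allocates
    }

    extern "C" void* malloc(std::size_t size)
    {
        if (real_malloc == nullptr)
            real_malloc = reinterpret_cast<malloc_fn>(dlsym(RTLD_NEXT, "malloc"));
        void* p = real_malloc(size);
        if (!in_hook) {
            in_hook = true;
            std::fprintf(stderr, "malloc(%zu) = %p\n", size, p);   // TRACE in the real library
            in_hook = false;
        }
        return p;
    }

    extern "C" void free(void* p)
    {
        if (real_free == nullptr)
            real_free = reinterpret_cast<free_fn>(dlsym(RTLD_NEXT, "free"));
        if (!in_hook && p != nullptr) {
            in_hook = true;
            std::fprintf(stderr, "free(%p)\n", p);
            in_hook = false;
        }
        real_free(p);
    }

A library like this can be built with something like "g++ -std=c++11 -shared -fPIC -o malloc_trace.so malloc_trace.cc -ldl" and loaded via "LD_PRELOAD=./malloc_trace.so <program>"; once every allocation and deallocation is being logged, the un-free'd-malloc bookkeeping described above becomes a matter of post-processing the log.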
