Support #12952

Help NOvA investigate memory usage issues

Added by Brian Rebel over 3 years ago. Updated about 3 years ago.

Status:
Closed
Priority:
High
Assignee:
-
Category:
User Code
Target version:
-
Start date:
06/17/2016
Due date:
% Done:

100%

Estimated time:
Spent time:
Scope:
Internal
Experiment:
NOvA
SSI Package:
Duration:

Description

One of the NOvA analyses is seeing a blow up of memory usage when looping over data. Information that should take up about 400 bytes per object on disk seems to blow up to 2 MB per object when read into memory. I suspect part of the issue is the use of std::shared_ptrs. I already know where in the code the blow up occurs, but could use some critical eyes on the code to help figure out where we can make improvements.

Would the artists be able to help us out?

massif.out.fnex_unsigned_char (1.36 MB) - Massif output after changing the flat_map keys to unsigned char - Brian Rebel, 06/28/2016 02:06 PM
massif.out.fnex_flat_map (1.2 MB) - Massif output using flat_map instead of unordered_map - Brian Rebel, 06/28/2016 02:09 PM

History

#1 Updated by Kyle Knoepfel over 3 years ago

  • Status changed from New to Feedback

We propose meeting to understand what steps have been taken so far and what environment setup we would need to reproduce the issue.

#2 Updated by Brian Rebel over 3 years ago

Sounds like a good plan. Can we meet Monday afternoon (today?), say 3 pm?

#3 Updated by Marc Paterno over 3 years ago

Brian, please come by WH 9SW Monday afternoon at 3PM.

#4 Updated by Brian Rebel over 3 years ago

See you then.

#5 Updated by Brian Rebel over 3 years ago

We have already run massif on a job exhibiting this behavior. You can find the massif output file at

/nova/app/users/brebel/nova_testing/fnex_test/massif.out.fnex_flat_map

which is accessible from any of the novagpvm nodes.

#6 Updated by Brian Rebel over 3 years ago

At the meeting we got several suggestions for how to improve the memory usage. In order of least impact on the code to most, they were:

1. Change the flat_map<std::string, fnex::Var> to flat_map<unsigned char, fnex::Var> in FNEX/utilities/VarVals.h (a sketch of this change follows the list)
2. Remove the virtual functions from FNEX/utilities/Var.h as the vtable is very large compared to the data being stored
3. Change the use of shared_ptr to unique_ptr to avoid carrying around the information about the number of instances
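
A minimal sketch of the first change, under the assumption that the container in question is a boost::container::flat_map; the alias names and the stand-in Var definition are hypothetical and used here only for illustration:

     #include <boost/container/flat_map.hpp>
     #include <string>

     namespace fnex { struct Var { float value; }; }  // stand-in for the real FNEX/utilities/Var.h type

     // Before: every entry carries a std::string key, i.e. its own heap buffer per event.
     using VarsByName = boost::container::flat_map<std::string, fnex::Var>;

     // After: a one-byte identifier as the key; the human-readable names can live in a
     // single shared lookup table instead of being repeated in every event.
     using VarsById = boost::container::flat_map<unsigned char, fnex::Var>;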

I have implemented the first already and saw about a 20% reduction in memory use, going from 1.66 GB to 1.36 GB on my MacBook Pro.

We need to include Jeremy Wolcott in the conversation for the remaining two items. He is available Tuesday or Wednesday for a phone conversation and is then gone for a couple of weeks on vacation and conference travel. I suggest Wednesday at 10:30 am.

#7 Updated by Brian Rebel over 3 years ago

I have run massif with the latest version of the code (changing the flat_map key from std::string to unsigned char) and the output file is attached to this issue as massif.out.fnex_unsigned_char. The total usage drops relative to the 1.96 GB measured when the flat_maps were keyed by std::string (massif.out.fnex_flat_map).

#8 Updated by Kyle Knoepfel over 3 years ago

Thank you for the massif outputs. Indeed, the boost::container::flat_map no longer appears in the output, which is a clear improvement. To assess the remaining memory usage, consider the element type of the boost::flat_map, which for this particular case is std::pair<unsigned char, std::shared_ptr<IndepVar>>. The breakdown of the memory occupied by each element is shown in this diagram:

                                       16 [from 12 = 8 (virtual table pointer) + 4 (float)]
                                          │
std::pair<unsigned char, std::shared_ptr<Var>>
                 │                │
             8 [from 1]           ├──── pointer to Var: 8
                                  │
                                  └──── reference counter (and associated pointer): 16

In tabular form:

  • 8 bytes: unsigned char (should be only 1 byte, but 7 bytes of padding are introduced for alignment reasons)
  • 24 bytes: std::shared_ptr<Var> corresponding to the reference counting and the pointer to Var
  • 16 bytes: the IndepVar object allocated on the heap. A virtual table pointer introduces 8 bytes, the datum is 4 bytes, and an extra 4 bytes of padding are introduced, again, for alignment.
  • 48 bytes total

Succinctly put, that is 48 bytes, or 44 bytes overhead, for accessing a float, which is 4 bytes. Since there are 20 such elements per event, that results in 960 bytes per event, of which only 80 bytes contain data.
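
As a concrete cross-check of those numbers, here is a minimal, self-contained sketch; the Var definition is only a stand-in for fnex::IndepVar (an assumption), and the exact values are implementation-dependent but typical of 64-bit builds:

     #include <cstdio>
     #include <memory>
     #include <utility>

     // Stand-in for fnex::IndepVar (an assumption): one float behind a vtable pointer.
     struct Var {
       virtual ~Var() = default;
       float datum = 0.f;
     };

     int main() {
       using Element = std::pair<unsigned char, std::shared_ptr<Var>>;
       // Typically 16: 8 (vtable pointer) + 4 (float) + 4 (padding).
       std::printf("sizeof(Var)             = %zu\n", sizeof(Var));
       // Typically 16: one pointer to the object plus one to the control block.
       std::printf("sizeof(shared_ptr<Var>) = %zu\n", sizeof(std::shared_ptr<Var>));
       // Typically 24: the 1-byte key is padded out so the shared_ptr stays 8-byte aligned.
       std::printf("sizeof(Element)         = %zu\n", sizeof(Element));
       // The reference-count control block and the Var object are separate heap
       // allocations on top of this, which massif sees but sizeof does not.
     }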

Options to consider:
  • Moving from shared_ptr to unique_ptr: A shared_ptr introduces 16 bytes of overhead for reference counting. Changing to a unique_ptr, if possible, will bring the element size to 32 bytes instead of 48. If that switch is not possible, make sure that std::make_shared is used rather than constructing a std::shared_ptr directly; the free function performs one allocation instead of two (a short sketch follows this list).
     
  • Removing run-time polymorphism of IndepVar: Since virtual functions are present/overridden in the inheritance chain of IndepVar, a pointer to a virtual table is stored for each instance of IndepVar. Removing the polymorphism (e.g. through use of templates) will reduce the memory footprint of the std::pair to 40 bytes. Doing so could also eliminate the need for a std::shared_ptr at all, allowing just the IndepVar type to be stored in the pair. If retaining the interface through run-time polymorphism is required, consider holding a vector<float> member in a class template that inherits from an interface base class:
     #include <vector>

     class VarCollection {
     public:
        virtual ~VarCollection() = default;  // allows deletion through the base pointer
        virtual std::vector<float> const& vars() const = 0;
     };
     
     template <typename T>
     class SpecificCollection : public VarCollection {
     public:
       // constructors, setters, etc.
     private:
       std::vector<float> const& vars() const override { return vars_; }
     
       std::vector<float> vars_;
     };
    
    This would remove most of the overhead but still retain some degree of polymorphism.
     
  • Using a 20-member struct: The most compact representation of storing twenty floats with different names is to just store twenty separate floats in a struct. While maybe not easily usable in other circumstances, such a struct would not result in padding of any kind (due to a well-aligned structure), and the use of pointers (8 bytes each) would be unnecessary. Instead of an object that takes 960 bytes of memory, it would take the minimum 80 bytes of memory necessary to encode the same information.
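
As noted in the first option above, here is a brief sketch of the difference between constructing a shared_ptr directly, using std::make_shared, and switching to unique_ptr; Var is again a stand-in for the real fnex type (an assumption):

     #include <memory>

     // Stand-in for fnex::IndepVar (an assumption).
     struct Var {
       virtual ~Var() = default;
       float datum = 0.f;
     };

     int main() {
       // Two heap allocations: one for the Var, one for the reference-count control block.
       std::shared_ptr<Var> a{new Var};

       // One heap allocation: the Var and its control block share a single block.
       auto b = std::make_shared<Var>();

       // No control block at all, and the handle itself is a single pointer
       // (8 bytes on a 64-bit build, versus 16 for a shared_ptr).
       auto c = std::make_unique<Var>();
     }

Whether the unique_ptr switch is viable depends on whether anything else shares ownership of the Var objects, which only the FNEX code can answer.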

We are happy to have further discussion with you if desired.

#9 Updated by Kyle Knoepfel about 3 years ago

Can you give us an indication of where things stand with this issue? If you do not need further interaction with us at this time, we would like to close this issue. Thanks.

#10 Updated by Brian Rebel about 3 years ago

Yes, please go ahead and close the ticket. We will open a new one should we need more assistance.

Thanks for all your help.

#11 Updated by Kyle Knoepfel about 3 years ago

  • Status changed from Feedback to Closed
  • % Done changed from 0 to 100

Thanks, Brian.


