DDS Memory Footprint increases with number of nodes
The memory usage on any node scales with the number of nodes in the system. While we probably have enough memory available to handle this, it may be a sign of some unnecessary inefficiency creeping into our system.
Our intended usage certainly doesn't require many nodes to know about each other. E.g., DCMs and Buffer Nodes never send messages to each other. We have a multicast configuration in use that groups DCMs by diblock, Buffer Nodes by group, and separates control traffic from message logging traffic. It is clear from Ganglia and overall performance that this configuration works as intended.
This issue is for tracking work on understanding/resolving the memory footprint issue.
#2 Updated by Peter Shanahan over 5 years ago
It occasionally happens that the DCMApplication on one or two DCMs in a large partition never progresses beyond "yellow" (active, but not responsive) during Reserve Resources. The solution is to use the DAQApplicationManager to stop the DCMApp, stop and start DDS on that DCM, and then restart the application.
It was observed that the memory footprint of one such DCM was far smaller than that of the others, about 15-20%. However, it worked absolutely normally during configuration and data-taking, indicating that the large memory footprint really serves no useful role.
#3 Updated by Peter Shanahan over 5 years ago
From the OSPL DDS deployment manual, it sounds like discovery can be limited to nodes that need to know about each other using Scope and Role. It says Scope can contain a comma-separated list of wildcarded Role expressions. Only nodes with a matching Role would be discovered.
I tried Role=DBxx for diblock xx DCMs, BNGyy for Buffer Node group yy buffer nodes, and a catch-all MANAGER role for everything else.
Manager nodes had Scope="DB*,BNG*,MANAGER". For Buffer Nodes and DCMs, I tried Scope="MANAGER", and then Scope="MANAGER,DBxx" or "MANAGER,BNGyy" as appropriate.
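The matching this configuration relies on can be sketched as follows. This is only an illustration of the behaviour described in the manual, not the OpenSplice implementation: it assumes shell-style wildcards, and the function name role_in_scope and the concrete role names DB01 and BNG03 are hypothetical.

```python
from fnmatch import fnmatchcase


def role_in_scope(role: str, scope: str) -> bool:
    """Return True if `role` matches any expression in `scope`.

    `scope` is a comma-separated list of wildcarded Role expressions,
    as described in the deployment manual. Shell-style wildcard
    semantics are an assumption made for this sketch.
    """
    patterns = [p.strip() for p in scope.split(",") if p.strip()]
    return any(fnmatchcase(role, pattern) for pattern in patterns)


# Scopes from the test above, with hypothetical concrete role names.
manager_scope = "DB*,BNG*,MANAGER"
dcm_scope = "MANAGER,DB01"           # a DCM in diblock 01
buffer_node_scope = "MANAGER,BNG03"  # a buffer node in group 03

# A manager should discover every DCM, buffer node, and manager.
print(role_in_scope("DB01", manager_scope))
print(role_in_scope("BNG03", manager_scope))

# A DCM should discover managers and its own diblock, but no buffer nodes.
print(role_in_scope("MANAGER", dcm_scope))
print(role_in_scope("BNG03", dcm_scope))
```

Under these assumptions a DCM never matches a buffer node's Role and vice versa, which is the reduction in discovery the Scope settings were meant to achieve.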
In all cases, networking crashed shortly after startup.
Using Tracing, I see that spliced and networking are interpreting these settings as I intended.
By starting networking by hand and attaching the debugger, I see that the crash happens in nw_rolescopeMatching:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaaac2a5940 (LWP 27397)]
0x000000000040af30 in nw_rolescopeMatching ()
#0 0x000000000040af30 in nw_rolescopeMatching ()
#1 0x000000000040c68f in nw_discoveryReaderMain ()
#2 0x000000000041772a in nw_runnableDispatcher ()
#3 0x00002aaaabbfd561 in os_startRoutineWrapper ()
#4 0x000000339b00683d in start_thread () from /lib64/libpthread.so.0
#5 0x000000339a4d526d in clone () from /lib64/libc.so.6
No symbol table is loaded. Use the "file" command.
To see whether the code was doing something unimaginably stupid with a string length based on the wildcarding, I set both Scope and Role to MANAGER everywhere. (This was in a standalone test between 2 nodes.) That changed nothing.