Profiling DPM

There are three tiers to the DPM design. The top tier is the interface to ACNET. The middle tier manages jobs: it dishes out device requests to the appropriate FE workers, then scales the results and sends them to the ACNET client. The lowest tier consists of the FE workers, which pool requests to a front-end and split the results among the correct jobs.

Without measuring, here are a few design decisions which may be non-optimal:

  • Each Erlang process has its own heap. When a message is sent between processes, the data is copied into the target process' heap. In DPM, the FE worker gets a packet from ACNET and sends portions of it to the interested jobs. The jobs scale the data and send the results to ACNET client processes. These processes, in turn, send the packet to the ACNET connection. That is at least 5 copies (see the sketch after this list). It may not be as bad as it sounds, because binaries over 64 bytes are stored in a binary heap that all processes share, so they aren't copied on each hop. At the very least, we could have the FE worker scale the data and send it directly to the ACNET client, which would cut out the copying to and from the middle tier.
  • All ACNET traffic flows through an ACNET handle. In Erlang, a handle is managed by an Erlang process, so it is important to make that process as efficient as possible.
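
To make the copying concern concrete, here is a minimal sketch (not actual DPM code; the module, process names, and message shapes are made up) of the three-tier message flow described above. Each "!" sends a message whose non-binary parts are copied into the receiving process' heap, while the large binary itself is reference-counted and shared.

-module(copy_demo).
-export([run/0]).

run() ->
    Client = spawn(fun client/0),
    Job    = spawn(fun() -> job(Client) end),
    Worker = spawn(fun() -> fe_worker(Job) end),
    %% A fake reply packet from a front-end.  Binaries larger than 64 bytes
    %% live in a shared, reference-counted heap, so each hop copies only a
    %% small reference rather than the 1024 bytes of payload.
    Worker ! {fe_reply, binary:copy(<<1>>, 1024)},
    ok.

fe_worker(Job) ->
    receive
        {fe_reply, Packet} ->
            %% Hop 1: forward the portion of interest to the job.
            Job ! {raw, Packet}
    end.

job(Client) ->
    receive
        {raw, Raw} ->
            %% Scaling turns the binary into a list of floats; lists *are*
            %% copied between heaps, so this hop pays the full price.
            Scaled = [X * 0.001 || <<X>> <= Raw],
            %% Hop 2: send the scaled result to the ACNET client process.
            Client ! {scaled, Scaled}
    end.

client() ->
    receive
        {scaled, _Values} ->
            %% Hop 3: in DPM this would be sent on to the ACNET connection.
            ok
    end.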

We need to do some profiling to see if these are truly a problem.

Investigations

These are some of the efforts we made towards improving DPM (successes and failures).

Improving erl-acnet

All requests to DPM are sent to the DPMD ACNET handle. This means all replies funnel through a single process. On CLX20, the ACNET process associated with the DPMD handle had the highest number of "reductions" (the Erlang measurement of CPU usage):

(dpm@clx20)> i().
Pid               Initial Call          Heap        Reds Msgs  Registered        Current Function        Stack
...
<0.108.0>         acnet:init/1          4185    37379042    0  dpmd              gen_fsm:loop/7             10
...
ok
(dpm@clx20)> 

The eprof module was used to profile this process.

(dpm@clx20)3> eprof:start().
{ok,<0.304.22>}
(dpm@clx20)4> eprof:start_profiling([list_to_pid("<0.108.0>")]).
profiling
(dpm@clx20)5> 

We let the profiler run for a while, adding more clients to bump up the CPU usage, then stopped the profiling and printed the results.

(dpm@clx20)5> eprof:stop_profiling().
profiling_stopped
(dpm@clx20)6> eprof:analyze().       

****** Process <4607.108.0>    -- 100.00 % of profiled time *** 
FUNCTION                         CALLS        %      TIME  [uS / CALLS]
--------                         -----  -------      ----  [----------]
inet:getaddr/2                  225956     0.34     48342  [      0.21]
inet:getaddr/3                  225956     0.41     58377  [      0.26]
gen_fsm:handle_msg/7            257821     0.48     69697  [      0.27]
inet:stop_timer/1               225956     0.53     75655  [      0.33]
lists:foldl/3                    63730     0.60     85934  [      1.35]
gen_fsm:decode_msg/9            257821     0.61     88357  [      0.34]
inet:getaddr_tm/3               225956     0.62     89211  [      0.39]
prim_inet:ip4_to_bytes/1        225956     0.63     90086  [      0.40]
inet_udp:send/4                 225956     0.66     94973  [      0.42]
inet:start_timer/1              225956     0.67     96473  [      0.43]
erlang:port_get_data/1          225956     0.69     99530  [      0.44]
gen_fsm:from/1                  257821     0.87    125662  [      0.49]
inet_udp:getaddr/1              225956     0.87    125696  [      0.56]
prim_inet:sendto/4              225956     0.89    128139  [      0.57]
inet_udp:getserv/1              225956     0.89    128423  [      0.57]
prim_inet:ip_to_bytes/1         225956     0.97    139523  [      0.62]
inet:getaddrs_tm/3              225956     1.04    149233  [      0.66]
inet_db:lookup_socket/1         225956     1.08    154733  [      0.68]
gen_fsm:cancel_timer/1          225956     1.15    164614  [      0.73]
gen_fsm:start_timer/2           225956     1.31    187981  [      0.83]
erlang:port_command/2           225956     1.47    211870  [      0.94]
gen_fsm:loop/7                  257821     2.43    348955  [      1.35]
gen_fsm:dispatch/4              257821     2.54    365032  [      1.42]
gen_fsm:reply/2                 225956     3.11    446678  [      1.98]
erlang:cancel_timer/1           225956     3.50    502784  [      2.23]
erlang:start_timer/3            225956     3.67    528307  [      2.34]
gen_udp:send/4                  225956     7.24   1041286  [      4.61]
orddict:find/2                 7189731    22.66   3257766  [      0.45]
erts_internal:port_command/3    225956    38.07   5473146  [     24.22]
----------------------------  --------  -------  --------  [----------]
Total:                        13513598  100.00%  14376463  [      1.06]
ok
(dpm@clx20)7>

Almost 40% of its time was in erts_internal:port_command/3. This is the low-level interface to the UDP layer, so there isn't much we can do here. In fact, this tells us that 40% of its time was spent sending data to the socket, which is its primary job.

The next CPU hog is orddict:find/2. Each call is cheap (only ~1/2 microsecond!), but it gets called a lot; roughly 30x more often than the other functions. The Erlang ACNET module uses the ordered dictionary module (orddict) to map request and reply IDs to processes, and every incoming or outgoing packet requires a look-up in this dictionary. The ACNET module was written before Erlang had native maps, so it used orddict, which implements the dictionary as a sorted list of key/value tuples. Since OTP 17, Erlang has supported native maps in the language, so let's modify the routines that use orddict to use maps instead (commit erl-util:6f9c3c13).
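
The substitution is conceptually small. Here is a rough sketch of the kind of change involved (the actual code in erl-util:6f9c3c13 differs; the function and variable names here are illustrative):

%% Before: orddict keeps a sorted list of {Key, Value} tuples, so every
%% look-up walks the list.
find_requester_orddict(ReqId, Requests) ->   % Requests :: orddict:orddict()
    orddict:find(ReqId, Requests).           % {ok, Pid} | error

%% After: the same look-up against a native map (available since OTP 17).
find_requester_map(ReqId, Requests) ->       % Requests :: #{term() => pid()}
    maps:find(ReqId, Requests).              % {ok, Pid} | error

%% Insertions change from orddict:store(ReqId, Pid, Requests) to
%% maps:put(ReqId, Pid, Requests), and the empty dictionary from
%% orddict:new() to #{}.

With that change in place, here's the new profile: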

****** Process <4607.108.0>    -- 100.00 % of profiled time *** 
FUNCTION                        CALLS        %      TIME  [uS / CALLS]
--------                        -----  -------      ----  [----------]
string:strip/2                     73     0.00        72  [      0.99]
string:strip_right/2              511     0.00       235  [      0.46]
lists:nth/2                      7156     0.01      1491  [      0.21]
inet:getaddr/2                 155698     0.36     38228  [      0.25]
inet:getaddr/3                 155698     0.44     46384  [      0.30]
inet:stop_timer/1              155698     0.57     59962  [      0.39]
gen_fsm:handle_msg/7           201749     0.58     61383  [      0.30]
prim_inet:ip4_to_bytes/1       155698     0.64     67445  [      0.43]
inet_udp:send/4                155698     0.66     69141  [      0.44]
inet:getaddr_tm/3              155698     0.69     72662  [      0.47]
inet:start_timer/1             155698     0.72     75633  [      0.49]
erlang:port_get_data/1         155698     0.74     77830  [      0.50]
gen_fsm:decode_msg/9           201749     0.76     80614  [      0.40]
inet_udp:getserv/1             155698     0.94     98777  [      0.63]
prim_inet:sendto/4             155698     0.97    101835  [      0.65]
inet_udp:getaddr/1             155698     1.03    108303  [      0.70]
prim_inet:ip_to_bytes/1        155698     1.04    109417  [      0.70]
gen_fsm:from/1                 201749     1.05    111053  [      0.55]
gen_fsm:cancel_timer/1         149892     1.17    123410  [      0.82]
inet:getaddrs_tm/3             155698     1.19    125094  [      0.80]
inet_db:lookup_socket/1        155698     1.19    125186  [      0.80]
gen_fsm:start_timer/2          149892     1.30    136702  [      0.91]
erlang:port_command/2          155698     1.34    141111  [      0.91]
gen_fsm:loop/7                 201750     2.77    291656  [      1.45]
erlang:cancel_timer/1          149892     3.61    380837  [      2.54]
gen_fsm:reply/2                155735     3.65    384793  [      2.47]
erlang:start_timer/3           149892     3.77    397570  [      2.65]
lists:foldl/3                   93041     5.66    596900  [      6.42]
gen_udp:send/4                 155698     8.87    935129  [      6.01]
gen_fsm:dispatch/4             201749    12.09   1274461  [      6.32]
erts_internal:port_command/3   155698    42.22   4452155  [     28.59]
----------------------------  -------  -------  --------  [----------]
Total:                        4511696  100.00%  10545469  [      2.34]
ok

There are no more entries for map look-ups; the native maps are so much faster than the orddict that they don't show up in the statistics gathering.[1] Previously, gen_fsm:dispatch/4 used only about 3% of the CPU; now it's almost 12%, and erts_internal:port_command/3 is up to 42%. This change has definitely made an improvement: since eprof reports relative percentages, the roughly 20% of CPU we saved is redistributed among the remaining functions, but the ACNET process now takes about 20% less time to do the same job.


1 You could argue that, since maps are native, they won't have function calls to trace, so you can't really tell how much time was saved. This is partially true; some map manipulation is inlined, so its CPU costs are hidden. However, the call to orddict:find/2 was replaced with a call to maps:find/2, and I would expect that function call to show up in the stats if it used a significant amount of time.