Bug #16222

modbus front-ends show over 1s delay in reading response since 2/28/17, possibly correlated with bigsave

Added by Dennis Nicklaus over 3 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Category: -
Target version: -
Start date: 04/17/2017
Due date:
% Done: 0%
Estimated time:
Duration:
Description

The two main modbus front-ends, clx41 and clx57, now show occasional instances of very long read delays, like this:

{rd_bad_perf,<0.16571.20>}  2017-02-28 15:42:48  2017-02-28 15:43:52  3  reply sent 1007 mS after event time for request {reading_request, [{device_request,282093,12,...

That's the first one I find in my email history of logs. It's generally only one or two delayed replies, most often 1006 ms, always with a lengthy list of devices.

On clx57, I didn't add the new connection to cmts1ssa8 until 3/9, but it was restarted several times on 3/1.

One guess is that this may be correlated with one or more connections not being active. For instance, at CMTF, the cryomodule was removed at the end of February and all the SSAs were turned off. Maybe some of the connections at clx41 were off at these same times? We should handle this better and return an immediate error status if the remote PLC isn't connected, but maybe that isn't working properly.

The first such large delay I see on clx41 is on 3/8/17.

{rd_bad_perf,<0.9738.3>}  2017-03-08 16:49:39  2017-03-08 16:49:39  1  reply sent 540 mS after event time for request {reading_request, [{device_request,284566,4,

Clx41 is more inconsistent, but the most common times still recur: 12:36, 18:36, 22:36.


h2. Table of clx57 log delays
{rd_bad_perf,<0.11310.0>}  2017-03-02 06:36:14  2017-03-02 06:36:14  1  reply sent 820 mS after event time for request {reading_request, [{device_request,277585,12, <<33,4,0,0,0,0,48,0>>, 2,0,0,false},
{rd_bad_perf,<0.26481.0>}  2017-03-03 06:36:20  2017-03-03 06:36:20  1  reply sent 792 mS after event time for request {reading_request, [{device_request,277585,12, <<33,4,0,0,0,0,48,0>>, 2,0,0,false},
{rd_bad_perf,<0.3859.1>}  2017-03-03 22:35:57  2017-03-03 22:35:57  1  reply sent 1004 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
{rd_bad_perf,<0.16499.1>}  2017-03-04 18:36:06  2017-03-04 18:36:06  1  reply sent 682 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false}, {
{rd_bad_perf,<0.27889.1>}  2017-03-05 12:36:10  2017-03-05 12:36:10  1  reply sent 660 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
{rd_bad_perf,<0.13008.2>}  2017-03-06 16:34:05  2017-03-06 16:34:37  2  reply sent 1008 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
[others in the next couple seconds]
{rd_bad_perf,<0.2347.0>}  2017-03-07 12:36:07  2017-03-07 12:36:07  1  reply sent 1006 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
{rd_bad_perf,<0.23832.0>}  2017-03-08 22:36:18  2017-03-08 22:36:18  1  reply sent 1006 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
...
{rd_bad_perf,<0.19735.1>}  2017-04-15 22:36:06  2017-04-15 22:36:06  1  reply sent 1006 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
{rd_bad_perf,<0.28155.1>}  2017-04-16 12:36:15  2017-04-16 12:36:15  1  reply sent 1007 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},
{rd_bad_perf,<0.1415.2>}  2017-04-16 22:36:09  2017-04-16 22:36:10  2  reply sent 1006 mS after event time for request {reading_request, [{device_request,282093,12, <<25,6,0,0,0,0,2,0>>, 4,0,0,false},


h2. CLX41
{rd_bad_perf,<0.6494.1>}  2017-04-08 12:36:13  2017-04-08 12:36:13  1  reply sent 455 mS after event time for request {reading_request, [{device_request,284594,12, <<57,3,0,0,0,0,3,3>>,
{rd_bad_perf,<0.9852.1>}  2017-04-08 18:36:15  2017-04-08 18:36:15  1  reply sent 1006 mS after event time for request {reading_request, [{device_request,284594,12,
{rd_bad_perf,<0.12103.1>}  2017-04-08 22:36:17  2017-04-08 22:36:17  1  reply sent 810 mS after event time for request {reading_request, [{device_request,284594,12, <<57,3,0,0,0,0,3,3>>,
{rd_bad_perf,<0.6606.2>}  2017-04-10 22:36:22  2017-04-10 22:36:22  1  reply sent 282 mS after event time for request {reading_request, [{device_request,284594,12, <<57,3,0,0,0,0,3,3>>,
{rd_bad_perf,<0.31608.0>}  2017-04-14 18:36:15  2017-04-14 18:36:15  1  reply sent 717 mS after event time for request {reading_request, [{device_request,284566,4, <<57,1,0,0,0,0,1,1>>,


Related issues

Related to Erlang Front-end Framework - Bug #14805: erl-modbus device occasionally see long delays (Assigned, 12/12/2016)

Related to Erlang Front-end Framework - Support #16183: cmts1ssa8 has many timeouts -> 57 -84 errors (Closed, 04/12/2017)

History

#1 Updated by Richard Neswold over 3 years ago

  • Related to Bug #14805: erl-modbus device occasionally see long delays added

#2 Updated by Dennis Nicklaus over 3 years ago

  • Status changed from New to Assigned
A few thoughts:
Zeroing in on the ~1s response time, my attention is drawn to two places.
  1. In the modbus code, the only place we really use 1s is the interval for sampling the PLC segments. That should be quite asynchronous to the acnet-interfacing modbus_driver.erl code and shouldn't really affect it. The actual TCP read timeouts are 200ms. The timeout for making a new connection is 2s. We don't specify a gen_tcp:send() timeout, so by default it is infinity. All of those should be independent of the _driver anyway.
  2. The other place 1s appears is in attr_reader:processReadAttributes, as the overall timeout limit on a whole read list.
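For reference, the timeout layering described above can be sketched roughly like this (module and function names are illustrative, not the actual modbus.erl code):

```erlang
%% Illustrative sketch of the timeout layers discussed above.
-define(CONNECT_TIMEOUT, 2000).   % 2 s to establish a TCP connection
-define(RECV_TIMEOUT,     200).   % 200 ms per TCP read
-define(READLIST_TIMEOUT, 1000).  % attr_reader's limit on a whole read list

connect(Host, Port) ->
    %% The send timeout is a socket option ({send_timeout, T}) we don't
    %% set, so it defaults to infinity.
    gen_tcp:connect(Host, Port, [binary, {active, false}], ?CONNECT_TIMEOUT).

recv(Sock, Len) ->
    gen_tcp:recv(Sock, Len, ?RECV_TIMEOUT).
```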

Potential action items:

  • I believe the attr_reader limit is sort of arbitrary (although mostly appropriate). Perhaps changing it to something like 1100ms would be a good idea just to make it stick out a bit more. Would having it greater than our commonly used 1Hz read rate cause other bad problems? Maybe make it 900?
  • Next, we have to remember (I think) that this happens during a big save, which apparently taxes the frontend more or in ways that we do not really understand. clx57 doesn't really have that many devices (~400-ish?) and while they are pretty much all modbus, the problem isn't necessarily with modbus.
  • If it is the attr_reader limit, we should see that code's error (57 -107) in the errors log. So I looked for it, and there they are, matching up with our email alerts. http://www-bd.fnal.gov/errors/servlet/ErrorHistory?type=day&date=sun&class=mooc&error=-27335

Here is a very small sample:

...
T:7RSDC2  Reading        57 -107  Modbus R&K SSA   CLX57E   c00s00m00  GETS1           DCE21    12:36  GETSR           DUE33    12:36           2
T:7RSDCV  Reading        57 -107  Modbus R&K SSA   CLX57E   c00s00m00  GETS1           DCE21    12:36  GETSR           DUE33    12:36           2
T:7RSMCS  Status         57 -107  Modbus R&K SSA   CLX57E   c00s00m00  GETS1           DCE21    12:36  BigSave         DCE29    12:38           3
T:7RSPID  Reading        57 -107  Modbus R&K SSA   CLX57E   c00s00m00  GETS1           DCE21    12:36  GETSR           DUE33    12:36           2
...

Not necessarily in response to the BigSave's request, but certainly simultaneous with it.
The attr_reader-driver reading response collection (mentioned above) is the only place that 57 -107 error is generated.

So I guess the real question is, "Why does a big save stress out the modbus drivers so much?" Maybe we need to review and/or profile that code to find where code that we think is doing nothing turns out to be pretty slow.

#3 Updated by Richard Neswold over 3 years ago

Dennis Nicklaus wrote:

> I believe the attr_reader limit is sort of arbitrary (although mostly appropriate). Perhaps changing it to something like 1100ms would be a good idea just to make it stick out a bit more. Would having it greater than our commonly used 1Hz read rate cause other bad problems? Maybe make it 900?

Interesting: setting the timeout to something unusual and recognizable. I think 900ms would be better, even if it results in more timeout errors.

> Next, we have to remember (I think) that this happens during a big save, which apparently taxes the frontend more or in ways that we do not really understand. clx57 doesn't really have that many devices (~400-ish?) and while they are pretty much all modbus, the problem isn't necessarily with modbus.

Does reading all of these devices result in ~400 MODBUS requests? The framework, obviously, has no support to merge MODBUS registers into a bank (assuming you can read a bank of registers -- I'm not too familiar with MODBUS.) Maybe make an array device that spans the other devices and mark the single devices as not BigSave-able?

> So I guess the real question is, "Why does a big save stress out the modbus drivers so much?" Maybe we need to review and/or profile that code to find where code that we think is doing nothing turns out to be pretty slow.

Good idea. Big Saves should simply read all the devices once. We probably need to write a simple DPM client to read all the devices on these front-ends constantly.

#4 Updated by Dennis Nicklaus over 3 years ago

Richard Neswold wrote:

> Does reading all of these devices result in ~400 MODBUS requests?

No. The TCP request goes out asynchronously and is collected by a separate process. Each sampling collects a segment of up to 125 registers all at once. ACNET reads are served from the pooled results of these responses, which are already held in the main driver process.
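For context, a MODBUS "read holding registers" request (function code 3) can ask for at most 125 registers in one round trip, which is where the per-segment limit comes from. A hypothetical frame builder, not the actual driver code:

```erlang
%% Sketch of a MODBUS/TCP "read holding registers" request (function 3).
%% TxId is a transaction id echoed back in the reply; UnitId addresses
%% the device. The protocol caps Count at 125 registers per request.
build_read_request(TxId, UnitId, Addr, Count) when Count =< 125 ->
    PDU = <<3:8, Addr:16, Count:16>>,  % function, start address, quantity
    Len = byte_size(PDU) + 1,          % MBAP length field: unit id + PDU
    <<TxId:16, 0:16, Len:16, UnitId:8, PDU/binary>>.
```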

> Good idea. Big Saves should simply read all the devices once. We probably need to write a simple DPM client to read all the devices on these front-ends constantly.

Agreed. That's a good way to make our own "big save".
I did look: a slow (735 ms) reply on clx41 on 4/17 had 300 devices in the delayed 1 Hz request.
On clx57 on 4/16, a response which took 1007 ms at 12:36 had 351 devices in the list. That same list took only 72 ms (which still generates a "delayed response" warning alarm) at 18:36, and it is read thousands of times a day with no delayed response. So in the 1007 ms case, we don't think reading the whole list actually takes 1007 ms. Maybe the handler only gets called after the 1000 ms timeout expires and the remaining 7 ms is spent quickly generating 351 timeouts, or it's somewhere in between. The 351 isn't too surprising, since I know of at least 320 devices being datalogged at 1 Hz. SQL tells me there are 729 devices total on the node (most with just a single property).

#5 Updated by Dennis Nicklaus over 3 years ago

OK, new theory. When making my last post I noticed that clx57 hasn't had any big delayed-response warnings since 15:00 on 4/17, including this (4/18) morning. And when starting to experiment with reading all the devices, I also noticed that all 8 CMTF SSAs have been turned on. One of my original conjectures was that having the remote "PLC" off might be part of the problem, and this lone data point supports that. I will check emails again tomorrow.

Looking at the _driver code, one thing that is synchronous with the TCP-managing process is the reading function for the "status" attribute, which calls something that results in a gen_server:call into the TCP server. When the remote PLC end is disconnected, the TCP process (modbus.erl) can spend 2 s (the connection timeout) out of every 30 s (the retry interval) just blocked waiting on the TCP connection. The gen_server:call() default timeout is 5 s.
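To illustrate the hazard (a hypothetical sketch, not the actual modbus.erl code): if the status read does a gen_server:call into the connection process, then any time that process is blocked inside gen_tcp:connect/4, the caller stalls with it.

```erlang
%% Hypothetical sketch of the synchronous pattern described above.
%% While the server process is blocked for up to 2 s inside
%% gen_tcp:connect/4, this call blocks right along with it
%% (gen_server:call/2 defaults to a 5 s timeout).
status(Pid) ->
    gen_server:call(Pid, get_status).

%% A non-blocking alternative in the spirit of the eventual fix:
%% have the connection process publish its status into an ETS table
%% it owns, so readers never touch the (possibly blocked) process.
status_cached(Table) ->
    case ets:lookup(Table, status) of
        [{status, S}] -> S;
        []            -> disconnected
    end.
```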

In the instance of the CMTS SSAs, there are Acnet devices for the "status" attribute (T:nRSMCS, 1<=n<=8), but they are not data logged, so they will generally only get read on a bigsave, or someone else intentionally reading them.

So if I blame the delays on the connection device and then the attr_reader max list timeout, how is it that the delay maxes out at 1s+epsilon instead of somewhere up to 2s (the connection timeout)?

I also get notification of the connections having been remade in our alarms email:

{"cmts1ssa6",modbus_connected}  2017-04-17 15:00:20  2017-04-17 15:00:20  1  Modbus Connection (Re-)Established {"cmts1ssa6",502}
{"cmts1ssa7",modbus_connected}  2017-04-17 15:00:22  2017-04-17 15:00:22  1  Modbus Connection (Re-)Established {"cmts1ssa7",502}
{"cmts1ssa1",modbus_connected}  2017-04-17 15:00:43  2017-04-17 15:00:43  1  Modbus Connection (Re-)Established {"cmts1ssa1",502}
{"cmts1ssa2",modbus_connected}  2017-04-17 15:00:44  2017-04-17 15:00:44  1  Modbus Connection (Re-)Established {"cmts1ssa2",502}
{"cmts1ssa3",modbus_connected}  2017-04-17 15:00:46  2017-04-17 15:00:46  1  Modbus Connection (Re-)Established {"cmts1ssa3",502}
{"cmts1ssa4",modbus_connected}  2017-04-17 15:00:48  2017-04-17 15:00:48  1  Modbus Connection (Re-)Established {"cmts1ssa4",502}
{"cmts1ssa5",modbus_connected}  2017-04-17 15:00:50  2017-04-17 15:00:50  1  Modbus Connection (Re-)Established {"cmts1ssa5",502}

#6 Updated by Richard Neswold over 3 years ago

Dennis Nicklaus wrote:

> So if I blame the delays on the connection device and then the attr_reader max list timeout, how is it that the delay maxes out at 1s+epsilon instead of somewhere up to 2s (the connection timeout)?

This line sets the upper limit of waiting for all the readings:

source:attr_reader.erl@c559fb7c#L100

It appears that upper bound is 1000ms.

#7 Updated by Dennis Nicklaus over 3 years ago

Richard Neswold wrote:

> Dennis Nicklaus wrote:
>
> > So if I blame the delays on the connection device and then the attr_reader max list timeout, how is it that the delay maxes out at 1s+epsilon instead of somewhere up to 2s (the connection timeout)?
>
> This line sets the upper limit of waiting for all the readings:
>
> source:attr_reader.erl@c559fb7c#L100
>
> It appears that upper bound is 1000ms.

Yeah, what I now realize I was forgetting yesterday when I wrote that is which process generates which timeout.
That 1000 ms timeout is not in the context of the driver process at all; it is just waiting on responses/messages from the driver process. So the driver process can be hung forever and that 1000 ms timeout will still fire.
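In other words, the collector's deadline fires in the collector's own process regardless of what the driver is doing, roughly like this (a sketch in the spirit of attr_reader, not its actual code):

```erlang
%% Sketch of collecting N replies under one overall deadline. The
%% 'after' clause runs in *this* process, so it fires even if the
%% driver process is wedged and never replies.
collect(N) ->
    Deadline = erlang:monotonic_time(millisecond) + 1000,
    collect(N, Deadline, []).

collect(0, _Deadline, Acc) ->
    {ok, lists:reverse(Acc)};
collect(N, Deadline, Acc) ->
    Left = Deadline - erlang:monotonic_time(millisecond),
    receive
        {reading, R} -> collect(N - 1, Deadline, [R | Acc])
    after max(Left, 0) ->
        {timeout, lists:reverse(Acc)}   % analogous to the 57 -107 path
    end.
```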

#8 Updated by Dennis Nicklaus over 3 years ago

Dennis Nicklaus wrote:

> and this lonely one datapoint supports that. I will check emails again tomorrow.

Confirmed by the overnight alarm emails.

#9 Updated by Dennis Nicklaus over 3 years ago

  • Status changed from Assigned to Work in progress
Now testing on clx57
Relevant code changes:
  • dev-modbus|3dd9f9a1 (change modbus code so we aren't doing a gen_server:call() for connection status)
  • 1a034100 (change reading list timeout to 900)

#10 Updated by Richard Neswold over 3 years ago

  • Related to Support #16183: cmts1ssa8 has many timeouts -> 57 -84 errrors added

#11 Updated by Dennis Nicklaus over 3 years ago

  • Status changed from Work in progress to Closed

I proved that the above commits solve the problem by turning off SSAs 1-5 overnight so that they would go into disconnected (and retrying-to-connect) status. We got no really long rd_bad_perf alarms at any of the big-save times in the last 24 hours (on either clx57 or clx41).


