Bug #14346

erlang front-end lookup request fails after full acnet restart (seen on clx42)

Added by Dennis Nicklaus almost 3 years ago. Updated 11 months ago.

Status: Feedback
Priority: Normal
Category: Erlang client for LOOKUP service
Start date: 11/01/2016
Due date:
% Done: 0%
Estimated time:
Duration:

Description

Sometimes, when all of a console's acnet infrastructure is restarted, lookup:get_epics_map(Nodename) fails when it should return a list of #lookup_epicsentry_struct records. "Fails" here means it returns an empty list, I'm pretty sure. Most recently seen on CLX42 when it was restarted on 11/1/16 at 1:14. Jim S. says CLX42 was fully restarted because there was a new version of acnetd. I have seen this before in similar circumstances. It looks like a race condition in which something in the environment is not yet fully initialized (maybe acnetd doesn't have its node tables yet?). The lookup request would have been issued at 1:14:19 on 11/1.

History

#1 Updated by Dennis Nicklaus almost 3 years ago

From clx42: /var/log/acnet

Nov  1 01:14:13 clx42 acnetd[7008]: received a termination signal
Nov  1 01:14:13 clx42 acnetd[7008]: process was asked to terminate
Nov  1 01:14:15 clx42 updown: error receiving command ack (command = 1) -- Connection refused
Nov  1 01:14:16 clx42 acnetd[19466]: ACNET (Linux-x86_64) services are now active on host 131.225.120.42:6801
Nov  1 01:14:16 clx42 nodesd: reading ACNET node table from database
Nov  1 01:14:17 clx42 nodesd: failed DNS lookup on m40tor.fnal.gov
Nov  1 01:14:17 clx42 nodesd: failed DNS lookup on midcct.fnal.gov
Nov  1 01:14:18 clx42 nodesd: successfully read 1144 ACNET nodes
Nov  1 01:14:18 clx42 dpm: connected to acnetd('DPM   ' @ '      ') port = 55887
Nov  1 01:14:18 clx42 dpm: The dpm service executable NEVER uses the new DPM
Nov  1 01:14:18 clx42 dpm: dpm_dpminit
Nov  1 01:14:18 clx42 pld: connected to acnetd('PLD   ' @ '      ') port = 46185
Nov  1 01:14:19 clx42 dpm: connected to acnetd('STATES' @ '      ') port = 55887
Nov  1 01:14:20 clx42 PK0001_21: connected to acnetd('PK0001' @ '      ') port = 59361

#2 Updated by Dennis Nicklaus almost 3 years ago

A correction to the 1:14:19 time I gave for the lookup request: 1:14:19 is the timestamp of a message that gets logged just after the lookup request is sent. The last log message before the lookup request still had the 1:14:18 timestamp. The Erlang front-end was started at 1:14:18.

#3 Updated by Dennis Nicklaus almost 3 years ago

I modified the CA device driver to retry if it gets an empty result set for epics names.
commit:dev-ca|ebdd390d
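
For readers without access to that commit, the retry idea can be sketched roughly as follows. This is not the code from the commit; the module name, function name, attempt count, and delay are all hypothetical, and the lookup call is passed in as a fun so the sketch does not assume anything about the lookup library beyond "an empty list may mean failure".

```erlang
-module(epics_retry_sketch).
-export([get_map_with_retry/3]).

%% Hypothetical helper: calls Fetch() up to Attempts times, sleeping
%% DelayMs milliseconds between tries, and returns the first non-empty
%% result. If every try comes back empty, the last result is returned.
get_map_with_retry(Fetch, Attempts, DelayMs)
  when is_function(Fetch, 0), is_integer(Attempts), Attempts > 0 ->
    case Fetch() of
        [] when Attempts > 1 ->
            %% Empty result and tries remain: wait, then ask again.
            timer:sleep(DelayMs),
            get_map_with_retry(Fetch, Attempts - 1, DelayMs);
        Result ->
            %% Non-empty result, or no tries left.
            Result
    end.
```

A driver might call it as epics_retry_sketch:get_map_with_retry(fun() -> lookup:get_epics_map(Nodename) end, 5, 1000). Note that a retry like this cannot tell a node that really has no EPICS devices from a failed lookup; it only delays the empty answer.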

#4 Updated by Richard Neswold almost 2 years ago

Did this fix the problem?

#5 Updated by Richard Neswold about 1 year ago

  • Category set to ACSys/FE Framework

Set category field.

#6 Updated by Richard Neswold 11 months ago

  • Description updated (diff)
  • Assignee deleted (Dennis Nicklaus)
  • Target version set to ACSys/FE v1.7

"maybe acnetd doesn't have its node tables yet?"

acnetd doesn't daemonize until it gets its node table loaded. This means any scripts that start acnetd won't proceed until acnetd is fully ready to go. Now, if a bad node table was downloaded, all bets are off...

Dennis, is this still a problem? Or can we close out this issue?

#7 Updated by Richard Neswold 11 months ago

The main problem is that lookup:get_epics_map/1 doesn't do a good job of reporting errors; a caller can't tell whether an empty list means there aren't any EPICS devices associated with the node or that an error occurred that prevented the lookup. Dennis, how many front-ends use this service? Should this issue get transferred to the erl-lookup project? Should we change the function's return to {ok,_} / {error,_} values? Or throw an exception?
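
To make the {ok,_} / {error,_} option concrete, here is a small caller-side sketch. It assumes a return shape that the library does not have today; the module name, function name, and result atoms are hypothetical and only illustrate what a caller could distinguish if the return value were tagged.

```erlang
-module(epics_map_result_sketch).
-export([describe/1]).

%% Hypothetical caller-side handling of a tagged lookup result.
%% With tagged tuples, "no EPICS devices" and "lookup failed" are
%% different values instead of both being [].
describe({ok, []}) ->
    no_epics_devices;
describe({ok, Entries}) when is_list(Entries) ->
    {devices, length(Entries)};
describe({error, Reason}) ->
    {lookup_failed, Reason}.
```

With this shape, a front-end started right after an acnet restart could retry only on {error, _} and accept {ok, []} as a valid answer.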

#8 Updated by Richard Neswold 11 months ago

  • Project changed from Erlang Front-end Framework to LOOKUP
  • Category changed from ACSys/FE Framework to Erlang client for LOOKUP service
  • Status changed from New to Feedback
  • Assignee set to Richard Neswold
  • Target version deleted (ACSys/FE v1.7)

Moved this to the LOOKUP service project because this error would have been diagnosed earlier if the lookup library reported a better error.

Modification: erl-client|00d9248e

This modification reports all errors to the log, although it still returns an empty list. I did this so clients wouldn't have to be changed but they'd still get the error messages. I think a longer-term fix would be to return an error (or throw an exception) in the client API.


