Bug #11275

Milestone #10477: ANUB Startup

anub is randomly crashing once every few days

Added by John Diamond almost 5 years ago. Updated almost 5 years ago.

Status: Closed
Priority: Normal
Assignee:
Category: -
Start date: 12/21/2015
Due date:
% Done: 100%
Estimated time:
Spent time:
Duration:

Description

I found that anub had crashed sometime between Friday night and Sunday morning this weekend. The same thing happened earlier in the week. The cause could be a memory leak: I received an FTP_OUT_OF_MEM error when I tried to plot.

History

#1 Updated by John Diamond almost 5 years ago

Noticed that both the cycle device task and beam energy loss task crashed on a data access exception.

#2 Updated by John Diamond almost 5 years ago

Changed the priority of the Cycle Device task from 5 to 65 to match the BEL task.

#3 Updated by John Diamond almost 5 years ago

Suspect that there might be a race condition in ::attach / ::detach so I added a mutex for the EvtTrigTclkEventGenerator listener table.
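
For reference, a minimal sketch of what guarding a listener table with a mutex can look like. This is not the actual EvtTrigTclkEventGenerator code, which is not shown in this ticket: the `Listener`, `ListenerTable`, and `CountingListener` names are hypothetical, and `std::mutex` is an assumption (the front-end may use a different locking primitive).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical listener interface, for illustration only.
class Listener {
public:
    virtual ~Listener() = default;
    virtual void onEvent(int event) = 0;
};

// Hypothetical listener table. Every access to listeners_ holds mutex_,
// so attach/detach from one task cannot corrupt a traversal in another.
class ListenerTable {
public:
    void attach(Listener* l) {
        assert(l != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        // Ignore duplicate registrations.
        if (std::find(listeners_.begin(), listeners_.end(), l) == listeners_.end())
            listeners_.push_back(l);
    }

    void detach(Listener* l) {
        std::lock_guard<std::mutex> lock(mutex_);
        listeners_.erase(std::remove(listeners_.begin(), listeners_.end(), l),
                         listeners_.end());
    }

    void notify(int event) {
        // Copy the table under the lock, then call out without holding it,
        // so a callback that calls attach/detach does not deadlock.
        std::vector<Listener*> snapshot;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            snapshot = listeners_;
        }
        for (Listener* l : snapshot)
            l->onEvent(event);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return listeners_.size();
    }

private:
    mutable std::mutex mutex_;
    std::vector<Listener*> listeners_;
};

// Simple listener that counts the events it receives.
struct CountingListener : Listener {
    int count = 0;
    void onEvent(int) override { ++count; }
};
```

One caveat with the snapshot approach: a listener detached while a notify is in progress may still receive that one event, so it must stay alive until the notify completes.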

#4 Updated by John Diamond almost 5 years ago

  • Status changed from New to Assigned

Problems persist. Despite adding a mutex to the EvtTrigTclkEventGenerator listener table and running the Cycle Device & BEL tasks at the same priority, we still get a data access exception in the BEL Task after a couple of hours of operation. Suspect that we could have an errant iterator in the BEL average calculation code. Disabled that method for now and we'll see if the problem persists.

#5 Updated by John Diamond almost 5 years ago

Implemented CD's TRACE facility for logging error messages, see #11021.

#7 Updated by John Diamond almost 5 years ago

anub has been stable for over 36 hours now, so I am assuming that the source of the data exception was in BELMachine::_updateAvgs().
Used the method linked above to de-queue the BEL samples > 1 hour old in ::_updateAvgs(). Re-compiled and deployed to anub. Will wait another 36 hours to see if we solved the problem.
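
The linked method is not reproduced in this ticket, so the following is only a sketch of one common way to drop samples older than an hour without holding an iterator across an erase. The `Sample` and `RollingAverage` names are hypothetical, it assumes samples arrive in time order, and it is not the actual BELMachine::_updateAvgs() code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

using TimePoint = std::chrono::steady_clock::time_point;

// Hypothetical sample record: a timestamp and a measured value.
struct Sample {
    TimePoint when;
    double value;
};

// Hypothetical rolling average over a fixed time window (1 hour by default).
class RollingAverage {
public:
    explicit RollingAverage(std::chrono::seconds window = std::chrono::hours(1))
        : window_(window) {}

    void add(TimePoint when, double value) {
        // Assumption: samples are appended in chronological order.
        assert(samples_.empty() || when >= samples_.back().when);
        samples_.push_back({when, value});
    }

    // Drops samples older than the window, then averages what is left.
    // Returns 0.0 when no samples remain.
    double update(TimePoint now) {
        // Pop from the front instead of erasing through an iterator, so no
        // iterator is ever left pointing at a removed element.
        while (!samples_.empty() && now - samples_.front().when > window_)
            samples_.pop_front();

        if (samples_.empty())
            return 0.0;

        double sum = 0.0;
        for (const Sample& s : samples_)
            sum += s.value;
        return sum / static_cast<double>(samples_.size());
    }

    std::size_t size() const { return samples_.size(); }

private:
    std::chrono::seconds window_;
    std::deque<Sample> samples_;
};
```

If the samples are shared between tasks, the same container would also need a lock around `add` and `update`; that is left out here to keep the sketch short.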

#8 Updated by John Diamond almost 5 years ago

  • Status changed from Assigned to Closed
  • % Done changed from 0 to 100

Stable now for 2 days, closing.


