Bug #15978

mblm40 exhausts heap memory

Added by Dennis Nicklaus over 2 years ago. Updated over 1 year ago.

Status: Accepted
Priority: Normal
Assignee:
Start date: 03/27/2017
Due date:
% Done: 0%
Estimated time:
Duration:

Description

On 3/27, the MCR called and said that MI BLMs 10, 40, and 60N were not responding.
Because the console was hung, I couldn't do much investigation, but by logging into mblm40's serial port/crate controller I could see a steady stream of these messages:

Msgrecv: REQ Alloc error 0x0
0xea35cf0 (tIpRcv): memPartAlloc: block too big - 68 in partition 0xea0dcd0.

On 10, even though D31 showed it hung and most device readbacks (e.g. parameter page) got no response, we were able to FTP a device (I:LI110).

Also, during the many minutes we watched before rebooting, 10 and 60N resumed responding to D31 node polls and then stopped again at least once.

History

#1 Updated by Richard Neswold over 2 years ago

Since it was task tIpRcv, the memory was exhausted in the ACNET memory partition. So there was a resource problem in the ACNET connections. I don't know what services besides MOOC are in those front-ends but, if there are any nonstandard services, they could be the cause of the leaks (the VxWorks ACNET library requires clients to properly lock and delete buffers after they're used).
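To illustrate the kind of leak being described, here is a minimal sketch in C. It does not use the real ACNET library API; `pool_t`, `pool_alloc`, `pool_free`, `pool_free_count`, and `handle_request` are all hypothetical names, and the pool sizes are arbitrary (the 68-byte block size only echoes the request size in the log above). The point is that a client has to release its buffer on every path, including error paths, or a fixed-size partition eventually runs dry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size pool standing in for a memory partition. */
#define POOL_BLOCKS 4
#define BLOCK_SIZE  68   /* arbitrary; echoes the request size in the log */

static_assert(POOL_BLOCKS > 0, "pool must hold at least one block");

typedef struct {
    unsigned char data[POOL_BLOCKS][BLOCK_SIZE];
    int in_use[POOL_BLOCKS];
} pool_t;

/* Return a free block, or NULL when the pool is exhausted. */
void *pool_alloc(pool_t *p)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return p->data[i];
        }
    }
    return NULL;
}

/* Mark the block that buf points to as free; unknown pointers are ignored. */
void pool_free(pool_t *p, void *buf)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (buf == (void *)p->data[i]) {
            p->in_use[i] = 0;
            return;
        }
    }
}

/* Count the blocks that are currently free. */
int pool_free_count(const pool_t *p)
{
    int n = 0;
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!p->in_use[i]) {
            n++;
        }
    }
    return n;
}

/* Hypothetical request handler: the buffer is released on every path. */
int handle_request(pool_t *p, int request_ok)
{
    void *buf = pool_alloc(p);
    if (buf == NULL) {
        return -1;                      /* pool exhausted */
    }
    int status = request_ok ? 0 : -2;   /* the error path still reaches the free */
    pool_free(p, buf);
    return status;
}
```

If `handle_request` returned early on the error path without calling `pool_free`, each failed request would permanently consume one block, and after `POOL_BLOCKS` failures every later allocation would fail.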

I would think if there were excessive requests, their resources would get automatically cleaned when the requesters go away. Do these front-ends communicate with other front-ends? For instance, if a different front-end tries to read devices from these front-ends and it gets rebooted, the request won't get cleaned and would possibly leak the resources.

#2 Updated by Dennis Nicklaus over 2 years ago

I looked at the Setting method of the HV card. There is no bounds checking on the value of "chan". A bad SSDN somewhere could potentially be causing problems.
(I mentioned this a week ago in an email and thought I would add it here to track it.)
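For reference, the sort of range check I have in mind would look roughly like the sketch below. This is not the actual HV card code; `hv_set_channel`, `hv_setpoints`, `HV_NUM_CHANNELS`, and the status codes are hypothetical names and values chosen only for illustration, since the real channel count and error codes aren't recorded in this issue.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical values: the real channel count and status codes are not
   given in this issue. */
#define HV_NUM_CHANNELS  8
#define HV_OK            0
#define HV_ERR_BAD_CHAN  (-1)

static_assert(HV_NUM_CHANNELS > 0, "need at least one channel");

/* Hypothetical per-channel setpoint table. */
uint16_t hv_setpoints[HV_NUM_CHANNELS];

/* Reject an out-of-range channel before it is used as an array index. */
int hv_set_channel(int chan, uint16_t value)
{
    if (chan < 0 || chan >= HV_NUM_CHANNELS) {
        return HV_ERR_BAD_CHAN;
    }
    hv_setpoints[chan] = value;
    return HV_OK;
}
```

With a check like this, a bad "chan" decoded from an SSDN is refused with an error status instead of writing past the end of the table.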

#3 Updated by Richard Neswold over 1 year ago

Dennis, you should reassign this to Jimmy. I'd do it, but it won't let me since I'm not the creator of the issue.

#4 Updated by Dennis Nicklaus over 1 year ago

  • Assignee changed from Charles Briegel to Jianming You

#5 Updated by Dennis Nicklaus over 1 year ago

Just an update to my comment above (No. 2 -- chan range checking). I remember talking with Charlie about this shortly after making the observation, and he said he was going through the code and correcting it, although he didn't really believe that it was the cause of any of this bad behaviour. I don't know whether any changes he made for that got committed.

#6 Updated by Jianming You over 1 year ago

  • Status changed from New to Accepted

