mblm40 exhausts heap memory
On 3/27, the MCR called and said that MI BLMs 10, 40, and 60N were not responding.
Because the console was hung, I couldn't do much investigation, but by logging into mblm40's serial port/crate controller I could see a steady stream of these messages:
Msgrecv: REQ Alloc error 0x0
0xea35cf0 (tIpRcv): memPartAlloc: block too big - 68 in partition 0xea0dcd0.
On 10, even though D31 showed it hung and most device readbacks (e.g. parameter page) got no response, we were able to FTP a device (I:LI110).
And, over the course of the many minutes we were watching before rebooting, 10 and 60N did start responding to D31 node polls and then stop again at least once.
#1 Updated by Richard Neswold over 2 years ago
Since it was task tIpRcv, the memory was exhausted in the ACNET memory partition. So there was a resource problem in the ACNET connections. I don't know what services besides MOOC are in those front-ends but, if there are any nonstandard services, they could be the cause of the leaks (the VxWorks ACNET library requires the clients to properly lock and delete buffers after they're used).
I would think if there were excessive requests, their resources would get automatically cleaned when the requesters go away. Do these front-ends communicate with other front-ends? For instance, if a different front-end tries to read devices from these front-ends and it gets rebooted, the request won't get cleaned and would possibly leak the resources.
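To illustrate the failure mode described above, here is a minimal sketch in plain C of a fixed-size pool in which each block is tagged with the requester it was allocated for. It is not the ACNET library or the VxWorks memPartLib API; every name in it (Pool, pool_alloc, pool_reclaim_owner, the owner id) is hypothetical, and the pool size is deliberately tiny. It only shows that blocks held for a requester that never cancels will eventually make even a 68-byte allocation fail, and that reclaiming by owner frees them again.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POOL_BLOCKS 4   /* hypothetical: tiny pool so exhaustion is easy to see */
#define BLOCK_SIZE  68  /* same size as the failing request in the log; otherwise arbitrary */

typedef struct {
    unsigned char data[BLOCK_SIZE];
    int in_use;         /* nonzero while the block is allocated */
    int owner;          /* hypothetical id of the requester the block serves */
} Block;

typedef struct {
    Block blocks[POOL_BLOCKS];
} Pool;

/* Mark every block as free. */
void pool_init(Pool *p)
{
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Hand out one block for the given requester, or NULL if the pool is exhausted. */
void *pool_alloc(Pool *p, int owner)
{
    assert(p != NULL);
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (!p->blocks[i].in_use) {
            p->blocks[i].in_use = 1;
            p->blocks[i].owner = owner;
            return p->blocks[i].data;
        }
    }
    return NULL;
}

/* Release every block held for a requester that has gone away.
 * Returns the number of blocks reclaimed. */
size_t pool_reclaim_owner(Pool *p, int owner)
{
    size_t reclaimed = 0;
    assert(p != NULL);
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (p->blocks[i].in_use && p->blocks[i].owner == owner) {
            p->blocks[i].in_use = 0;
            reclaimed++;
        }
    }
    return reclaimed;
}

/* Count the blocks currently available. */
size_t pool_free_count(const Pool *p)
{
    size_t n = 0;
    assert(p != NULL);
    for (size_t i = 0; i < POOL_BLOCKS; i++) {
        if (!p->blocks[i].in_use) {
            n++;
        }
    }
    return n;
}
```

In this sketch, if requester 7 takes all four blocks and then disappears without cancelling, pool_alloc returns NULL for everyone else until something calls pool_reclaim_owner(&pool, 7). Whether the real front-end code has an equivalent cleanup path for requests from a rebooted front-end is exactly the open question here.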
#5 Updated by Dennis Nicklaus over 1 year ago
Just an update to my comment above (No. 2 -- chan range checking). I remember talking with Charlie about this shortly after making the observation, and he said he was going through the code and correcting it, although he didn't really believe that it was the cause of any of this bad behaviour. I don't know whether any changes he made for that got committed.