Bug #14368

acnetd is frequently reporting QUEUE_FULL on idle clients

Added by Richard Neswold almost 3 years ago. Updated almost 3 years ago.

Status:
Closed
Priority:
Normal
Assignee:
-
Category:
-
Target version:
-
Start date:
11/02/2016
Due date:
% Done:

0%

Estimated time:
Duration:

Description

Charlie King and I noticed (in ACNET logs) that we're occasionally getting ACNET_QUEUEFUL status when communicating with new DPM. Doing a tcpdump on the DPM node shows a low to moderate rate of ACNET requests being sent to DPM. DPM itself has low CPU usage. Yet, when acnetd generates a report, we see DPM has close to 256 pending requests! Something is not adding up.


Related issues

Related to DPM - Bug #13703: Inconsistent behavior with D105 | Assigned | 08/29/2016

History

#1 Updated by Richard Neswold almost 3 years ago

  • Status changed from Assigned to Feedback
  • Assignee deleted (Charles King)

When acnetd forwards a request to a client, it expects the client to acknowledge it. acnetd keeps track of how many requests haven't been acknowledged as a way of measuring how busy the client is. If the count reaches 256 un-ACKed requests, acnetd drops the current request and returns a QUEUE FULL status. Charlie King and I performed some diagnostics and discovered that acnetd doesn't handle a race condition correctly.

The way it's supposed to work is:

  1. acnetd receives a request.
  2. If the client's "pending request" count is less than 256, it bumps the counter and sends the request data to the client.
  3. The client receives the request and sends an acknowledgement to acnetd.
  4. acnetd decrements the counter.
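
The four steps above can be sketched as a small model of the bookkeeping. This is only an illustration, not acnetd's code: the class `ClientHandle`, its method names, and the returned strings are hypothetical; only the limit of 256 comes from this ticket.

```python
QUEUE_LIMIT = 256  # limit quoted in this ticket


class ClientHandle:
    """Hypothetical stand-in for acnetd's per-client bookkeeping."""

    def __init__(self):
        self.pending = 0  # requests forwarded but not yet ACKed

    def forward_request(self):
        """Steps 1-2: forward the request if the client has room."""
        if self.pending >= QUEUE_LIMIT:
            return "QUEUE_FULL"  # request is dropped
        self.pending += 1
        return "SENT"

    def handle_ack(self):
        """Steps 3-4: the client acknowledged one request."""
        if self.pending > 0:
            self.pending -= 1


handle = ClientHandle()
results = [handle.forward_request() for _ in range(257)]
print(results[255], results[256])  # SENT QUEUE_FULL

handle.handle_ack()
print(handle.forward_request())    # SENT
```

As long as every forwarded request is eventually ACKed, the counter returns to zero on an idle client.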

There is a race condition which exposed a bug:

  1. acnetd receives a request.
  2. If the client's "pending request" count is less than 256, it bumps the counter and sends the request data to the client.
  3. Before the client receives the request and sends an ACK, a CANCEL arrives for the request.
  4. acnetd frees up resources for the request and sends the client a CANCEL packet.
  5. The client sends the ACK, but acnetd responds with a "REPLY ID not found" error.

The problem is that, in step 4, the count of pending requests doesn't get decremented. This means that any service receiving requests that are quickly followed by a cancel has a chance of its counter getting out of sync. New DPM, with its service discovery protocol, experiences a lot of this type of traffic. The pending request counter on the DPMD handle slowly accumulates. Within 24 hours, it's near the limit of 256 and acnetd starts throttling request traffic to it, resulting in QUEUE FULL being sent to applications.
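
A rough simulation of the leak, under stated assumptions: the class `Handle`, the `decrement_on_cancel` flag, and the status strings are hypothetical, and the "fixed" branch (decrement on CANCEL when the request was never ACKed) is a guess at the shape of the fix, which this ticket does not describe in detail.

```python
QUEUE_LIMIT = 256  # limit quoted in this ticket


class Handle:
    """Hypothetical model of one client handle inside acnetd."""

    def __init__(self, decrement_on_cancel):
        self.decrement_on_cancel = decrement_on_cancel
        self.pending = 0   # forwarded but un-ACKed requests
        self.active = {}   # request id -> True once ACKed

    def forward_request(self, req_id):
        if self.pending >= QUEUE_LIMIT:
            return "QUEUE_FULL"
        self.pending += 1
        self.active[req_id] = False
        return "SENT"

    def cancel(self, req_id):
        # Step 4 of the race: resources for the request are freed.
        acked = self.active.pop(req_id, None)
        if acked is False and self.decrement_on_cancel:
            self.pending -= 1  # assumed fix: release the un-ACKed slot

    def handle_ack(self, req_id):
        if req_id not in self.active:
            return "REPLY_ID_NOT_FOUND"  # step 5: ACK arrives too late
        if not self.active[req_id]:
            self.active[req_id] = True
            self.pending -= 1
        return "OK"


def run_race(handle, count):
    """Send `count` requests, each cancelled before its ACK arrives."""
    rejected = 0
    for req_id in range(count):
        if handle.forward_request(req_id) == "QUEUE_FULL":
            rejected += 1
        handle.cancel(req_id)
        handle.handle_ack(req_id)
    return rejected


leaky = Handle(decrement_on_cancel=False)
fixed = Handle(decrement_on_cancel=True)
print(run_race(leaky, 300), leaky.pending)  # 44 256
print(run_race(fixed, 300), fixed.pending)  # 0 0
```

In the leaky variant every cancelled-before-ACK request permanently consumes one slot, so an otherwise idle client ends up pinned at the limit.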

We've been running the new, fixed version on firus-gate along with new DPM. It's handling hundreds of DRF2 requests and its pending count stays at zero. We will release this new version in the morning.

#2 Updated by Richard Neswold almost 3 years ago

  • Related to Bug #13703: Inconsistent behavior with D105 added

#3 Updated by Richard Neswold almost 3 years ago

  • Status changed from Feedback to Closed

Charlie committed the changes and released a new version of acnetd. Jim Smedinghoff will restart the DPM nodes to use the new, fixed version.
