ACNET Large Packets

A group of Control Department members¹ has devised a way to exchange messages of any size using the ACNET protocol. Our primary goal was to confine the visible change to the ACNET API, where clients can now specify buffers larger than 64K. The network protocol cannot change. This way, we can migrate different ACNET source code at different rates and the corresponding nodes can still communicate (albeit without large packets). We believe we've developed a protocol that meets this specification and can transmit large packets efficiently.

This Wiki article will focus on our design but we also have an article on how this impacts clients using the local acnetd command API.

Introduction

ACNET has, historically, been a network transport based on datagrams. Migrating to UDP payloads brought the routing abilities of TCP/IP to ACNET, but still kept the datagram limitations, including payload size. ACNET tasks trying to send packets larger than the datagram size limit used their own protocols. Worse still, our primary data acquisition protocols, RETDAT and GETS32, make no attempt at supporting large packets, so applications using our standard DAQ libraries can’t receive large packets. We wish to design a method to remove this restriction with minimal impact on ACNET tasks (ideally, no change would have to be made to current clients).

Architecture

All ACNET services that support large packets will advertise a task handle of LNGMSG. Clients do not communicate directly with the LNGMSG handles; only the LNGMSG tasks communicate with each other.

When an ACNET client sends data with its request, reply, or USM, the ACNET service determines whether it can be sent via traditional means or if it needs to be handled by the LNGMSG task. The receiving task will simply see the data (large or small) and has no idea whether it was received directly through its handle or pieced together by its local LNGMSG task.

LNGMSG Communications

The LNGMSG tasks communicate with USMs. The tasks should do whatever validation they can on each packet to make sure the communication state remains uncorrupted.

The following table shows the layout used to transmit a segment of a large message. This message is sent in the payload of a USM sent from one LNGMSG task to another. All fields in the LNGMSG protocol header are in network byte order.

Section                  Offset  Data         Comment
Long Message Header      00:     00 0n        Typecode (n is 0 or 1)
                         02:     xx xx        xfer ID
                         04:     oo oo oo oo  offset of current segment
                         08:     ss ss ss ss  full size of data
ACNET Header of          0c:     ff ff        flags
embedded payload         0e:     ss ss        ACNET status
                         10:     st           server trunk
                         11:     sn           server node
                         12:     ct           client trunk
                         13:     cn           client node
                         14:     xx xx xx xx  RAD50 server handle
                         18:     ii ii        client ID
                         1a:     ii ii        request ID
                         1c:     nn nn        segment length + 18
One segment of large     1e:     ??           start of N bytes of data
message payload
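As a rough illustration of the layout above, the 12-byte long message header can be packed and parsed with Python's struct module. This is only a sketch: the helper names are hypothetical, and the embedded ACNET header is treated as 18 opaque bytes because this article only specifies network byte order for the LNGMSG header fields.

```python
import struct

# Typecode, xfer ID, offset, full size, all in network (big-endian) order.
LNGMSG_HEADER = struct.Struct(">HHII")
ACNET_HEADER_LEN = 18  # offsets 0x0c through 0x1d in the table


def pack_segment(typecode, xfer_id, offset, total_size, acnet_header, data):
    """Build the USM payload for one segment (hypothetical helper)."""
    if typecode not in (0, 1):
        raise ValueError("segment typecode must be 0 or 1")
    if len(acnet_header) != ACNET_HEADER_LEN:
        raise ValueError("embedded ACNET header must be 18 bytes")
    header = LNGMSG_HEADER.pack(typecode, xfer_id, offset, total_size)
    return header + acnet_header + data


def unpack_segment(packet):
    """Split a segment into (typecode, xfer_id, offset, total, acnet_header, data)."""
    data_start = LNGMSG_HEADER.size + ACNET_HEADER_LEN  # 0x1e
    if len(packet) < data_start:
        raise ValueError("packet too short for a LNGMSG segment")
    typecode, xfer_id, offset, total = LNGMSG_HEADER.unpack_from(packet, 0)
    acnet_header = packet[LNGMSG_HEADER.size:data_start]
    return typecode, xfer_id, offset, total, acnet_header, packet[data_start:]
```

The length check in unpack_segment is one example of the per-packet validation mentioned earlier; a real task would likely check more.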

The large data will be broken into smaller packets and sent in ascending offset order. We will do measurements with various packet lengths to empirically determine the optimal size. Each payload will be prepended with the ACNET header of the original request with the slight modification that the message length field will get replaced with the length of the data in the current fragment added to the length of the ACNET header. The type code field will normally be 0, in which case the packet is the next segment of data. If the type code field is 1, the content is the same as in type code 0, but the sender also wants a RESUME message (shown in the following table) from the receiver to determine how to proceed with the transfer.

Offset  Data         Comment
00:     00 02        RESUME typecode
02:     xx xx        xfer ID
04:     oo oo oo oo  offset of current segment
08:     ss ss ss ss  full size of data
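A RESUME message has the same four fields as the long message header, with typecode 2 and no embedded payload. A minimal sketch of building and checking one follows; the function names are hypothetical and the sanity checks are only examples of what a task might validate.

```python
import struct

RESUME_TYPECODE = 2
RESUME = struct.Struct(">HHII")  # typecode, xfer ID, offset, full size


def pack_resume(xfer_id, offset, total_size):
    """Build a RESUME message carrying the offset the receiver expects next."""
    return RESUME.pack(RESUME_TYPECODE, xfer_id, offset, total_size)


def parse_resume(packet):
    """Return (xfer_id, offset, total_size) or raise on a malformed message."""
    typecode, xfer_id, offset, total_size = RESUME.unpack(packet)
    if typecode != RESUME_TYPECODE:
        raise ValueError("not a RESUME message")
    if offset > total_size:
        raise ValueError("offset lies beyond the end of the data")
    return xfer_id, offset, total_size
```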

The xferId field is a value generated by the sending node. The sending node’s address and the xferId make a unique pair.

The offset and size fields are 32 bits. The size field will always contain the size of the full reply (once it has been pieced back together) and doesn’t include any of the ACNET header data. The offset will start at zero and, in each subsequent packet, will equal the previous offset added to the size of the data in the previous packet. The sender will transmit the packets as fast as it can and, after sending a group of segments, will ask the receiver where to proceed.
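The offset rule and the rewritten length field can be illustrated with a small generator. This is a sketch under assumptions: the name split_message is hypothetical, and the segment size is a parameter because the optimal value has not been measured yet.

```python
ACNET_HEADER_LEN = 18


def split_message(data, segment_size):
    """Yield (offset, chunk, length_field) in ascending offset order."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    for offset in range(0, len(data), segment_size):
        chunk = data[offset:offset + segment_size]
        # The embedded ACNET header's length field covers the header plus
        # the data in this fragment only, not the full message.
        yield offset, chunk, len(chunk) + ACNET_HEADER_LEN
```

Each offset equals the previous offset plus the previous chunk's length, so the final offset plus the final chunk's length equals the full size of the data.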

The receiving LNGMSG task monitors incoming USMs and does the following:

  • If the offset is zero, then a new reply is arriving. The receiver can use the size field to pre-allocate a buffer to hold the rest of the incoming data. After saving the data in the buffer, it sets the next expected offset to be equal to the size of data that was just received.
  • If the offset is non-zero, it checks to see if the offset and transfer ID match a reply that is in progress. If a match is found, the data is appended to the buffer and the next expected offset is updated.
  • After appending the data, if the packet also asked for a response (typecode 1 in the long message header), the task will send a RESUME message (see the RESUME layout above) with the current expected offset.
  • If the offset is non-zero and a reply to a transfer ID is in progress but the offset is too high (i.e. a packet was dropped), the task ignores all type code 0 packets until the next type code 1 packet. When it arrives, a RESUME message is sent to the sender with the offset of the missing data.
  • When the transfer is complete, the last packet will also require a response. The receiver returns the expected offset (which at this point will be the size of the data) or a previous offset, if a packet was dropped.
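The steps above can be sketched as a small state machine. This is an illustration, not the LNGMSG implementation: the class and method names are hypothetical, packets are assumed to be already parsed into fields, and freeing resources is left out.

```python
class Receiver:
    """Sketch of the receiving side; transfers are keyed by (sender, xfer_id)."""

    def __init__(self):
        self.transfers = {}

    def on_segment(self, sender, typecode, xfer_id, offset, total, data):
        """Handle one segment. Returns the offset for a RESUME reply, or None."""
        key = (sender, xfer_id)
        if offset == 0:
            # New message: pre-allocate the buffer from the size field.
            self.transfers[key] = {"buf": bytearray(total), "expected": 0, "total": total}
        state = self.transfers.get(key)
        if state is None:
            # Unknown transfer, so the first packet was lost: ask for offset 0.
            return 0 if typecode == 1 else None
        if offset == state["expected"]:
            end = min(offset + len(data), state["total"])
            state["buf"][offset:end] = data[:end - offset]
            state["expected"] = end
        # Any other offset means a drop or a duplicate; its data is ignored.
        return state["expected"] if typecode == 1 else None

    def message(self, sender, xfer_id):
        """Return the reassembled bytes once complete, else None."""
        state = self.transfers.get((sender, xfer_id))
        if state is not None and state["expected"] == state["total"]:
            return bytes(state["buf"])
        return None
```

Returning the expected offset for every typecode 1 packet covers both the normal case and the dropped-packet case, since the expected offset simply stops advancing once a gap appears.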

Of course, the LNGMSG task needs to correctly free resources after the packet is sent or if the request gets canceled during the transfer.

Due to historic design choices, ACNET constrains packet lengths to even sizes. We presume that large packets may be used to transfer binary data generated by third-party libraries, so they won’t follow this convention. The size field should show the actual size, odd or not, and the receiver can drop the last byte of the last packet if it exceeds the size of the data.
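A minimal sketch of the odd-size handling, assuming the sender pads the final segment with a single zero byte (the pad value is an assumption; only the even-length constraint comes from ACNET):

```python
def pad_even(chunk):
    """Pad a segment to an even length so it satisfies ACNET's constraint."""
    return chunk + b"\x00" if len(chunk) % 2 else chunk


def trim_to_size(buffer, total_size):
    """Drop any trailing pad byte so the result matches the size field."""
    return buffer[:total_size]
```

For example, a 3-byte message travels as 4 bytes and is cut back to 3 on arrival because the size field still says 3.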

Alternate Algorithm

Our primary algorithm for large data transfers is a “sliding window”, where the sender hangs on to the data sent until the receiver confirms its delivery.
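On the sending side, the sliding window amounts to remembering how much has been confirmed and rewinding when a RESUME message names an earlier offset. The sketch below assumes the whole message is held in memory; the class and its methods are hypothetical.

```python
class SenderWindow:
    """Tracks confirmed and in-flight offsets for one transfer."""

    def __init__(self, data):
        self.data = data
        self.confirmed = 0    # every byte before this offset was acknowledged
        self.next_offset = 0  # where the next segment will start

    def next_segment(self, size):
        """Return (offset, chunk) for the next segment and advance."""
        offset = self.next_offset
        chunk = self.data[offset:offset + size]
        self.next_offset = offset + len(chunk)
        return offset, chunk

    def on_resume(self, offset):
        """Apply a RESUME offset. Returns False if it is stale or invalid."""
        if offset < self.confirmed or offset > len(self.data):
            return False
        self.confirmed = offset
        self.next_offset = offset  # resend everything after the confirmed part
        return True

    def done(self):
        return self.confirmed == len(self.data)
```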

These protocol messages, it turns out, can also be used to implement a “bit-map” algorithm. In this algorithm, the sender sends every segment of the large packet and only asks for an acknowledgement after the last packet has been sent. The receiver builds its copy of the large packet with the segments it receives and keeps track of the holes. When the last packet arrives (with the ACK request), the receiver can go back and specify the earliest hole. With each filler sent, the sender always asks for a reply. The receiver replies with each hole’s offset until they are all filled.
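For the bit-map variant, the receiver needs to remember which byte ranges have arrived and report the earliest hole. One possible sketch keeps a sorted list of merged ranges; the class name and the range representation are assumptions, not part of the protocol.

```python
class HoleTracker:
    """Records received byte ranges and reports the earliest gap."""

    def __init__(self, total):
        self.total = total
        self.ranges = []  # sorted, non-overlapping (start, end) pairs, end exclusive

    def add(self, offset, length):
        """Record a received segment, merging it with touching ranges."""
        start, end = offset, min(offset + length, self.total)
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:
                merged.append((s, e))  # disjoint: keep as is
            else:
                start, end = min(s, start), max(e, end)  # overlap or adjacency
        merged.append((start, end))
        self.ranges = sorted(merged)

    def earliest_hole(self):
        """Offset of the first missing byte, or total when nothing is missing."""
        if not self.ranges or self.ranges[0][0] > 0:
            return 0
        return self.ranges[0][1]
```

The value from earliest_hole is what the receiver would place in each RESUME message until it equals the full size.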

This protocol doesn’t allow either participant to enforce an algorithm for an exchange. Mismatched algorithms will still transfer the data successfully, just slightly less efficiently.

Example Exchanges

A couple of examples may help illustrate how the protocol works. In the following sections, the notation is

ACNET_TYPE(ACNET header fields){payload}

where ACNET_TYPE is the type of ACNET message (USM, REQ, or REP) and is immediately followed by a subset of header fields (not all fields are interesting for these examples).

In this first example, a 100 Kbyte USM is sent successfully to the server. The sending node asks for a resume message in the third packet. Note the example follows the recommendation to use type code 1 in the first packet.

USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:1, xferId:1, offset:0, total:100K, USM (from:CLI@N1, to:SER@N2){25KB of data}}
USM (from:LNGMSG@N2, to:LNGMSG@N1)
  {tc:2, xferId:1, offset:25K, total:100K}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:0, xferId:1, offset:25K, total:100K, USM (from:CLI@N1, to:SER@N2){25KB of data}}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:1, xferId:1, offset:50K, total:100K, USM (from:CLI@N1, to:SER@N2){25KB of data}}
USM (from:LNGMSG@N2, to:LNGMSG@N1)
  {tc:2, xferId:1, offset:75K, total:100K}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:1, xferId:1, offset:75K, total:100K, USM (from:CLI@N1, to:SER@N2){25KB of data}}
USM (from:LNGMSG@N2, to:LNGMSG@N1)
  {tc:2, xferId:1, offset:100K, total:100K}

This next example is a request containing a 100K payload and where the second packet was dropped. The RESUME message recovers the data by requesting the sender restart at the missing offset.

USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:1, xferId:2, offset:0, total:100K, REQ (from:CLI@N1, to:SER@N2, msgId:1){25KB of data}}
USM (from:LNGMSG@N2, to:LNGMSG@N1)
  {tc:2, xferId:2, offset:25K, total:100K}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:0, xferId:2, offset:25K, total:100K, REQ (from:CLI@N1, to:SER@N2, msgId:1){25KB of data}}
  (receiver doesn’t get the packet)
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:1, xferId:2, offset:50K, total:100K, REQ (from:CLI@N1, to:SER@N2, msgId:1){25KB of data}}
USM (from:LNGMSG@N2, to:LNGMSG@N1)
  {tc:2, xferId:2, offset:25K, total:100K}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:0, xferId:2, offset:25K, total:100K, REQ (from:CLI@N1, to:SER@N2, msgId:1){25KB of data}}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:0, xferId:2, offset:50K, total:100K, REQ (from:CLI@N1, to:SER@N2, msgId:1){25KB of data}}
USM (from:LNGMSG@N1, to:LNGMSG@N2)
  {tc:1, xferId:2, offset:75K, total:100K, REQ (from:CLI@N1, to:SER@N2, msgId:1){25KB of data}}
USM (from:LNGMSG@N2, to:LNGMSG@N1)
  {tc:2, xferId:2, offset:100K, total:100K}

Recommendations

  • The first segment should use typecode 1, asking the receiver for a resume message. By doing this, part of the payload gets sent in addition to checking whether the receiver supports large messages (a timeout indicates no support).
  • The last packet of the message must use typecode 1 to make sure the entire message was received.
  • The sender may vary the interval between ACK requests to adapt to network conditions. For instance, the sender might begin the transfer with an interval of 4 packets before asking for an ACK. If there isn’t an error, then 8 packets can be sent before the next ACK. If an error occurred, the sender reduces the interval of ACKs.
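The adaptive interval in the last recommendation could look like the following sketch. Doubling matches the 4-then-8 example above; halving on error and the minimum and maximum bounds are assumptions chosen for illustration.

```python
def next_ack_interval(current, had_error, minimum=1, maximum=64):
    """Return how many packets to send before the next ACK request."""
    if had_error:
        # Back off after a loss so less data has to be resent next time.
        return max(minimum, current // 2)
    # Grow the interval while the network is behaving.
    return min(maximum, current * 2)
```

Starting at 4, an error-free group moves the interval to 8; a dropped packet in that group brings it back to 4.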

Error Conditions

This section tries to address some of the error conditions that can arise and how they should be handled.

A reply packet gets dropped.

The next packet will have an offset higher than expected. The receiver stops collecting packets and waits for the next request for a resume message. When asked, the receiver will reply with the next expected offset. The transmission will continue from that offset. If the first packet was dropped, then the offset will be 0.

The last packet is dropped.

The sender will set a short time-out waiting for the end-response. If it doesn’t receive it, it resends the last packet.

The sender doesn’t resume at the requested offset.

After sending a resume message, the receiver should eventually see the transfer restart at the requested offset. If it doesn’t within a reasonable time (what is reasonable?), the receiver resends the resume message.

The ACNET requestor doesn’t support large messages.

The first packet should use type code 1. If the receiver doesn’t reply, it doesn’t support large packets.

What if the sender stops unexpectedly? If the transfer never completes, the receiver will have orphaned resources.

We should give the sender a generous timeout in which to keep the transfer progressing (5 seconds? 10 seconds?). If this timeout is reached, we free up resources associated with the transfer. If the sender eventually continues the transfer, the receiver will make it start from the beginning.
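One way to implement this cleanup is to timestamp each transfer whenever it makes progress and periodically purge the stale ones. The sketch below is hypothetical throughout: the class name, the 5-second default (one of the two values floated above), and the injectable clock used to make it testable.

```python
import time

TRANSFER_TIMEOUT = 5.0  # seconds; an assumed value, still to be decided


class TransferTable:
    """Tracks when each (sender, xfer_id) transfer last made progress."""

    def __init__(self, timeout=TRANSFER_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}

    def touch(self, key):
        """Call whenever a segment for this transfer is accepted."""
        self.last_seen[key] = self.clock()

    def purge_stale(self):
        """Forget transfers idle longer than the timeout; return their keys."""
        now = self.clock()
        stale = [k for k, t in self.last_seen.items() if now - t > self.timeout]
        for key in stale:
            del self.last_seen[key]
        return stale
```

After a purge, a late segment for that transfer looks like an unknown transfer, so the receiver's next RESUME message would carry offset 0 and the sender starts from the beginning.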

Unresolved Details

How does this work with multicasts?

At first glance, it seems that multicast requests and USMs should be prevented from participating in large messages. But thinking more about it, multicasting large data could be a killer feature of ACNET. Multicast USMs would probably be better handled by dropping the USM if any packet is dropped. For multicast requests, any of the receivers could ask for a retransmission. All others would see offsets less than their expected offset and would have to patiently wait for the stream to catch back up. Or the sender can send the repeats to the few nodes that needed them. This needs much more discussion.


¹ Charlie Briegel, Brian Hendricks, Charlie King, Rich Neswold, Jim Patrick, Mike Sliczniak