Bug #24879

Network mask for HWR and SSR1 (other?)

Added by Brian Chase about 2 months ago.

Status: New
Priority: Immediate
Assignee: -
Start date: 08/28/2020
Due date:
% Done: 0%
Estimated time:
Duration:

Description

It has been determined that we have the wrong network mask and we have been screwing up the network ourselves. This needs to be fixed immediately.

I put back in the network level mitigations. (two "mac address-table static b8ca.3ab7.63c1 vlan 142 interface xx" lines in s-ad-c-6506-nml & s-ad-c-2960x-cmtf200.)

It looks like that has stopped the flooding that I can see on the cacti
plots I have up.

No need to stop the studies for now.

Thanks,

- Tim

On Wed, 26 Aug 2020, Brian E Chase wrote:

Tim,

I don't know for sure. I am trying to contact someone at CMTS.

Brian

On Aug 26, 2020, at 4:48 PM, Tim Zingelman wrote:

Which console is he working from? We can put in a network level mitigation if we know that. It does not appear to be cns75pc.

Thanks,

- Tim

On Wed, 26 Aug 2020, Brian E Chase wrote:

Tim,

Good detective work. Philip is doing studies right now. Should I ask him to stop until we get this resolved?

Brian

On Aug 26, 2020, at 3:37 PM, Tim Zingelman wrote:

We are now seeing the disruptive network traffic again.

Looking in detail at the network packets, llrf-pip2it-socfpga-hwr4 is talking directly to cns75pc via MAC address rather than using the router as it should. When cns75pc talks back, it properly uses the router, so the network does not see a return path and starts to flood the network with the data from llrf-pip2it-socfpga-hwr4.

Our best guess is that llrf-pip2it-socfpga-hwr4 has a netmask set to 255.255.0.0 (0xffff0000) rather than the correct 255.255.255.0 (0xffffff00).

Could you please check on that?

Thanks,

- Tim
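
Note: the netmask guess above can be checked with a few lines of Python. This is only a sketch using the standard `ipaddress` module. The address of llrf-pip2it-socfpga-hwr4 does not appear in this thread, so the address of llrf-pip2it-socfpga-hwr2 (131.225.118.88) is used as a stand-in, and the helper name `is_on_link` is made up for illustration.

```python
import ipaddress

def is_on_link(src_ip: str, netmask: str, dst_ip: str) -> bool:
    """True if a host at src_ip with this netmask considers dst_ip to be
    on its own subnet, i.e. it would try to reach it directly rather
    than hand the packet to the router."""
    local_net = ipaddress.ip_network(f"{src_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(dst_ip) in local_net

SRC = "131.225.118.88"   # stand-in: hwr2 address from the thread (assumption)
DST = "131.225.142.147"  # cns75pc

print(is_on_link(SRC, "255.255.0.0", DST))    # True: wrong mask, sends directly
print(is_on_link(SRC, "255.255.255.0", DST))  # False: correct mask, uses the router
```

With the /16 mask the device believes cns75pc is a neighbor on its own subnet; with the /24 mask it does not, which matches the behavior described in the packet capture.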

On Tue, 25 Aug 2020, Tim Zingelman wrote:

Yes, please do start the displays back up.

We have removed our network level hack and are monitoring the network to verify that your temporary fix is effective.

Thank you!

- Tim

On Tue, 25 Aug 2020, Brian E Chase wrote:

Shri,
If you can start them up again then Tim can validate that the problem is cleared up on the network.
Brian

On Aug 25, 2020, at 2:16 PM, Shrividhyaa Sankar Raman wrote:
Hello Again,
I have made a temporary fix for this issue. I shall work on making the necessary changes to support this for the long run soon.
Best Regards,
Shri
________________________
From: Dennis J. Nicklaus
Sent: Tuesday, August 25, 2020 12:38 PM
To: Shrividhyaa Sankar Raman; Brian E Chase; Timothy E Zingelman
Cc: Daniel W Klepec; John E. Dusatko; Philip Varghese
Subject: Re: network traffic problem with llrf-pip2it-socfpga-* devices
If it is the labview displays, then just switching off the current labview displays isn't going to solve the problem long-term at all.
The problem will re-appear anytime the Labview displays are run as comfort displays with no interaction.
The Labview will need to be changed so that it periodically sends some no-op to the LLRF instrument.
On the other hand, if the problem is that the LLRF device still thinks it has a labview connection and is sending data but in reality the labview display has gone away, then that's a problem that will need to be addressed in the LLRF code.
Dennis

On 8/25/20 12:24 PM, Shrividhyaa Sankar Raman wrote:
Hello,
I believe the LabVIEW connection to these 3 devices must be on. Switching off the LabVIEW connection should reduce the traffic on this network.
Best,
Shri
________________________
From: Dennis J. Nicklaus
Sent: Tuesday, August 25, 2020 12:20 PM
To: Brian E Chase; Timothy E Zingelman; Shrividhyaa Sankar Raman
Cc: Daniel W Klepec; John E. Dusatko; Philip Varghese
Subject: Re: network traffic problem with llrf-pip2it-socfpga-* devices
No, it isn't the Erlang frontend. That's on clx58, not 75. (And those three aren't connected to the acnet/erlang frontend just yet.)
Maybe labview running through cns75?

On 8/25/20 12:12 PM, Brian E Chase wrote:
Hi Tim,
I am including a list of possible suspects in this thread. I'm guessing that Shrividhyaa and Dennis are involved in this network communication. The LLRF nodes are all out at CMTF and we are bringing up ACNET for these nodes and I expect that cns75pc (131.225.142.147) is where the ERLANG front-end is running.
Brian

On Aug 25, 2020, at 12:03 PM, Tim Zingelman wrote:
Hello,
Recently we've been seeing traffic on the network from llrf-pip2it-socfpga-ssr1 (131.225.118.91), llrf-pip2it-socfpga-hwr2 (131.225.118.88) & llrf-pip2it-socfpga-hwr3 (131.225.118.89) to cns75pc (131.225.142.147).
If there is someone else who should be included in this discussion, please reply-all and include them into the email thread.
This network traffic is one-way, meaning that the llrf-pip2it-socfpga-* devices send data to cns75pc and cns75pc does not send regular responses.
When this happens, the network soon starts wondering if cns75pc is still there or if it moved, and by industry standard behavior starts to send all the data to every port on the switch, hoping to get a response from cns75pc. This traffic overwhelms the devices on the other network ports, who are not expecting or wanting to see it.
Yesterday this caused the UCDHIN clock front end to fail sending the 20Hz clock multicasts, and disrupted other work as well.
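Note: the flooding behavior described above can be illustrated with a toy model of a learning switch. This is only a sketch, not how any particular switch is implemented. The class name `LearningSwitch`, the port numbers, the station names, and the 300-second aging time are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class LearningSwitch:
    """Toy layer-2 switch: learns source addresses, floods unknown destinations."""
    ports: tuple
    aging_s: float = 300.0                      # assumed aging time, in seconds
    table: dict = field(default_factory=dict)   # mac -> (port, last_seen)

    def forward(self, src, dst, in_port, now):
        """Return the list of ports a frame is sent out of."""
        # Learn (or refresh) where the sender lives.
        self.table[src] = (in_port, now)
        entry = self.table.get(dst)
        if entry is not None and now - entry[1] <= self.aging_s:
            return [entry[0]]                   # known destination: one port only
        # Unknown or aged-out destination: flood every other port.
        self.table.pop(dst, None)
        return [p for p in self.ports if p != in_port]

sw = LearningSwitch(ports=(1, 2, 3, 4))
sw.forward("console", "llrf", in_port=2, now=0.0)          # console speaks once
print(sw.forward("llrf", "console", in_port=1, now=10.0))   # [2]
# The console stays silent, so its table entry ages out and the stream floods.
print(sw.forward("llrf", "console", in_port=1, now=400.0))  # [2, 3, 4]
```

As long as the destination sends a frame at least once per aging interval, its table entry stays fresh and the stream goes out of a single port.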
The best solution is for the llrf-pip2it-socfpga-* code to be adjusted, so when it is sending streaming data to a destination, to also send something requiring a response at least once per minute. This could be a simple ICMP ping packet. When things are normal, this will keep the network doing direct routing of the traffic. In addition however, the llrf-pip2it-socfpga-* should, if the ping response does not come back, stop the data stream. Otherwise if the destination fails or is powered off, the traffic flooding will still happen, and disrupt the network.
A short-term, halfway solution would be to modify the receiving software running on the cnsXXpc to send a periodic message to the stream sources (again, as simple as an ICMP ping). While everything is running normally, this would again keep the network doing direct routing of the traffic. It has the big disadvantage that if the program on the PC is stopped or crashes without signaling the stream sources to stop, the flooding will occur again.
We currently have a very specific hack in place to prevent flooding from only the llrf-pip2it-socfpga-* devices which are attached to s-ad-c-2960x-cmtf200 sending only to cns75pc. If any other console is used, or if any of the llrf-pip2it-socfpga-* devices are on another switch, then this hack will fail to help and the network will again be disrupted.
Please feel free to call if you want clarification.
Thanks,

- Tim
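
Note: the sender-side fix proposed in the message above (probe the destination at least once per minute, and stop streaming if the probe goes unanswered) could look roughly like the sketch below. It is not the LLRF code. `KeepaliveGate`, `probe`, and `stream` are hypothetical names, and the probe is passed in as a plain function so that an ICMP ping or any other request that requires a reply could be plugged in.

```python
import time
from typing import Callable, Iterable

class KeepaliveGate:
    """Allows sending only while the destination keeps answering probes."""

    def __init__(self, probe: Callable[[], bool], interval_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe              # returns True if the destination replied
        self._interval_s = interval_s
        self._clock = clock
        self._next_probe = clock()       # probe before the very first send
        self.active = True

    def may_send(self) -> bool:
        if not self.active:
            return False                 # stays stopped until restarted explicitly
        now = self._clock()
        if now >= self._next_probe:
            if not self._probe():
                self.active = False      # no reply: stop the data stream
                return False
            self._next_probe = now + self._interval_s
        return True

def stream(gate: KeepaliveGate, samples: Iterable, send: Callable) -> int:
    """Send samples until they run out or the gate closes; return the count sent."""
    sent = 0
    for sample in samples:
        if not gate.may_send():
            break
        send(sample)
        sent += 1
    return sent
```

The gate deliberately stays closed after one failed probe, so a destination that has crashed or been powered off cannot cause renewed flooding until someone restarts the stream.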


