Expert spill server recovery

If any of the monitoring processes are pink in the Spill Server monitor, and/or the 1 Hz trigger has stopped incrementing in the trigger scalars (or the NuMI trigger when beam is KNOWN to be on), the backbone must be restarted.

On novadaq-near-master.fnal.gov:

/home/novadaq/DAQOperationsTools/bin/startBeamSpillBackBoneND.sh -z <partitionnumber>

Only execute the script from near-master; it starts the backbone across both detectors.

The backbone does not use DDS messaging, so it does not matter what partition is specified with setup_online.

The start script will kill any parts of the backbone that may be running before restarting them. It should not be executed unless some parts of the chain are known to be down.

The script will print some messages that look like errors. These are a byproduct of daemonizing the processes and can be ignored.

The monitor should go green in 5-10 seconds followed by the counting of the trigger scalars. Further description of the spill server is below.

NuMI Beamline and General Spill Triggering Info

A high-level description of how the NuMI beam trigger information is generated and how it should be interpreted can be found here:

NuMI Spills for NOvA

NOvA Spill Server

The spill server is broken up into three general pieces in order to transmit information between the different DAQ networks. The three components are:

  • Spill Server

This application must run on a TDU and is responsible for reading the accelerator signals.

  • Spill Forwarder (can run on an x86 machine)

This application receives spill information from the Spill Server and forwards it to another destination or destinations. Spill Forwarders can be chained together to bridge different firewalls.

  • Spill Receiver (can run on an x86 machine)

This application is a termination point for the XMLRPC communications chain. Typically it will receive input from a spill forwarder. The spill receiver unpacks the information that it receives and repackages it as a DDS message that is then broadcast on the network so that the global trigger can pick it up.

Design

The spill server system is designed with a server -> repeater -> receiver method.

The server is located on a TDU and sends information to a repeater agent using XMLRPC for data transport. The repeater can then re-transmit the information to one or more receivers, again using XMLRPC for data transport. In the last hop, a receiver translates the information from XMLRPC into a DDS message that can be sent to the global trigger (GT) process.

It is the responsibility of the GT to then broadcast the actual trigger decisions to the buffer nodes and data logger systems.
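
The fan-out behaviour of the design above can be pictured with a small in-memory model. This is only an illustrative sketch: the class names (Destination, Forwarder, Receiver), the spill layout (timestamp, spill type) and the destination names are hypothetical, and no XMLRPC or DDS transport is involved.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumed spill layout for illustration only: (timestamp, spill_type).
Spill = tuple


@dataclass
class Destination:
    """One hop target of a forwarder (hypothetical model, not the real config)."""
    name: str
    deliver: Callable[[Spill], None]
    enabled: bool = True


@dataclass
class Forwarder:
    """Re-transmits every spill it receives to all enabled destinations."""
    destinations: List[Destination] = field(default_factory=list)

    def receive(self, spill: Spill) -> None:
        for dest in self.destinations:
            if dest.enabled:
                dest.deliver(spill)


class Receiver:
    """Terminal hop. The real receiver would publish a DDS message here."""

    def __init__(self) -> None:
        self.published: List[Spill] = []

    def receive(self, spill: Spill) -> None:
        self.published.append(spill)


# Chain: server -> ND forwarder -> (ND receiver, FD forwarder -> FD receiver)
nd_receiver = Receiver()
fd_receiver = Receiver()
fd_forwarder = Forwarder([Destination("fd-receiver", fd_receiver.receive)])
nd_forwarder = Forwarder([
    Destination("nd-receiver", nd_receiver.receive),
    Destination("fd-forwarder", fd_forwarder.receive),
    Destination("spare", lambda spill: None, enabled=False),
])

# The "server" hands one spill to the first forwarder in the chain.
nd_forwarder.receive((7065512128001383, "NuMI"))
```

After the call, both nd_receiver.published and fd_receiver.published hold the same spill, which mirrors how chained forwarders bridge the two detector networks.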

Near Detector

For the near detector, all of the communication is between the TDU and novadaq-ctrl-trigger.fnal.gov. The diagram below shows the data flow and location of the applications.

Far Detector

For the far detector, the communications need to start at Fermilab in the MINOS surface building with the TDU. From here the information needs to traverse the near detector (private) network in order to then be transmitted out of the Fermilab campus and into the Ash River campus over the wide area network connection (here the pseudo-wire that has been established between Ash River and FNAL).

To do this the spill repeater sends the output to multiple destinations. The first destination is the normal near detector path which allows for the operation of the near detector. The second destination is a repeater agent at the Ash River site that allows for the information to come in, and if need be, jump from the public to the private DAQ network. From here the design is the same with a spill receiver handling the translation into DDS and the GT broadcasting the trigger information.

TDU Initialization

To get the master TDU to read out correctly the following procedure must be followed:

  • Load the kernel module

This must be done as the root user after any reboot of the PowerPC (i.e. novadaq cannot do this):

insmod /nova/novadaq/releases/development/tdu_kernel_module/tdu_kernel.ko

Or, if running in production mode, use the script tdu_kernel_load, which sets up the devices to be accessed by a non-root user.

IF THIS HANGS, it will hang the entire PowerPC side of the TDU. What is happening is that one of the clocks is not set up correctly and the kernel module is blocked. To fix this, go to the ARM interface and write "0x22" to address "0x0" (the control register); this will start the clock. If that fails to un-jam the TDU, then from the TDUControl GUI first reboot the FPGA (from the expert menus on the firmware tab) and then reboot the TDU (from the file menu).

Next, you will need to configure the TDU to decode and publish accelerator events. To do this:

  1. Enable Accelerator Event Decoding
  2. Enable the mask of events to decode (0xffff is all events)
  3. Enable interrupts from the FPGA

The sequence will look like:

tduControl set 0x0  0x0400
tduControl set 0x23 0xffff
tduControl set 0x24 0x0016

If this works then you can cat the interrupt table:

[root@tdu-03:~]$  cat /proc/interrupts 
           CPU0       
 16:      22590   IPIC   Level     serial
 18:     565362   IPIC   Level     mpc8xxx_spi
 21:          0   IPIC   Level     i2c-mpc
 22:          0   IPIC   Level     i2c-mpc
 23:      24471   IPIC   Edge      tdu_event
 32:      84681   IPIC   Level     eth0_g0_tx
 33:     283794   IPIC   Level     eth0_g0_rx
 34:          0   IPIC   Level     eth0_g0_er

The entry for tdu_event should be incrementing with each accelerator signal that is decoded.
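
One way to check that the counter is incrementing is to read the interrupt table twice and compare the tdu_event counts. The helper below is a minimal sketch, assuming the single-CPU layout shown above (the count is the second whitespace-separated field); the function names are hypothetical.

```python
def tdu_event_count(interrupts_text, name="tdu_event"):
    """Return the CPU0 count for the interrupt whose last field is `name`.

    Assumes the single-CPU layout shown above: "IRQ:  count  ...  name".
    Returns None if no such line is present.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == name:
            return int(fields[1])
    return None


def is_incrementing(before_text, after_text):
    """True if the tdu_event count grew between two snapshots."""
    before = tdu_event_count(before_text)
    after = tdu_event_count(after_text)
    return before is not None and after is not None and after > before


# Two snapshots in the format shown above; the second count is made up.
snapshot_1 = """\
           CPU0
 16:      22590   IPIC   Level     serial
 23:      24471   IPIC   Edge      tdu_event
"""
snapshot_2 = snapshot_1.replace("24471", "24480")

print(is_incrementing(snapshot_1, snapshot_2))
```

On the TDU itself, the two snapshots would come from reading /proc/interrupts a few seconds apart.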

If the interrupts do not start up then there is a problem. Try the following steps:

  1. Reboot the FPGA from the TDUControl GUI interface
  2. Reboot the whole ARM board from the TDUControl GUI interface (under the file menu)

If this doesn't work then:

  1. Send a "scrub" to the master system
    1. Scrubs only work in firmwares post v2.13 due to a confusion in the lines
  2. It is also possible to send a scrub to the slave chain
    1. This should not be required if you are just trying to get the master to respond correctly

To scrub the master:

tduControl set 0x09 0x01

Or to scrub the slaves:

tduControl set 0x09 0x02

You can also try resetting the firmware in the master to its defaults:

tduControl set 0x09 0x20

And repeat the reboot of the FPGA and ARM.

MINOS Surface Building Master TDUs

There are three MTDUs located in a rack in the back room of the MINOS surface building. All TDUs can be accessed from the Near Detector cluster (generally novadaq-near-master.fnal.gov).

  • The top TDU in the rack is tdu-near-master-arm-03 (or for the ppc tdu-near-master-ppc-03). This unit provides the clock for timing chain 1 of the ND and also runs the backup spill server for monitoring purposes.
  • The middle TDU is tdu-near-master-arm-01 (tdu-near-master-ppc-01) which runs the primary spill server. It is not connected to any detector timing chain.
  • The bottom TDU is tdu-near-master-arm-02 (tdu-near-master-ppc-02) which provides the clock for timing chain 2 of the ND. This is the primary timing chain that is being used by default for runs.

Spill Server Startup

Each component of the spill server system needs a configuration file that specifies what ports it will run on and connect to.

Configuration

There are several configuration files which dictate which machines will host the spill-server-related applications and how they will talk to one another. Below is a list with the location and description of the relevant configurations:

NssTDUApp (the spill server)

novadaq-near-master:/nova/config/NearDet/NssTDUAppConfig-ND.xml -- Specifies which spill types are enabled and where the spills should be sent (to the ND Spill Forwarder)

NssSpillForwarder

novadaq-near-master:/nova/config/NearDet/NssSpillForwarderConfig.xml -- NearDet: a list of spill types to forward and their destinations
novadaq-far-master:/nova/config/FarDet/NssSpillForwarderConfig-AshRiver.xml -- FarDet version of above.

Spill Server Monitor

novadaq-near-master:/nova/config/NearDet/appmgr/Partition91/HostList.xml -- List of relevant hosts whose processes we want to report in the spill server monitor
novadaq-far-master:/nova/config/FarDet/appmgr/Partition91/HostList.xml -- FarDet version of above.
novadaq-near-master:/nova/config/NearDet/appmgr/Partition91/ProcessList.xml -- List of processes to watch
novadaq-far-master:/nova/config/FarDet/appmgr/Partition91/ProcessList.xml -- FarDet version of above.

Getting Spill Server Running at Ash River

In order to get spills transmitted up to Ash River, a specific sequence of steps needs to be followed to set up the chain of forwarders and receivers. Note that the instructions below are unlikely to be necessary during a typical shift. For shift instructions, see the next section first.

1. On novadaq-near-trigger.fnal.gov a NssSpillForwarder needs to be started. To do this, first source the setup script. Use partition zero, even though this is not explicitly required for most of the startup process.

setup_online -z 0

Then execute the program NssSpillForwarder. This program looks for an XML configuration file that is by default located in NovaSpillServer/config/NssSpillForwarderConfig.xml. Other configuration files can be specified on the command line with the -c option. This file specifies the spill server port to listen to, and the destination host and port to forward spills to.

NssSpillForwarder

The program will report the following write errors (error 111 is the Linux errno for "connection refused") until all destination receivers are started (i.e. both far and near detector receivers):
Forwarding spills received on port 7890.
Found 3 destinations.
Destination novadaq-ctrl-trigger.fnal.gov:7891 will receive spills NuMI, BNBtclk, AccelOneHztclk
Destination novadaq-far-trigger.fnal.gov:7891 will receive spills NuMI, BNBtclk, AccelOneHztclk
Destination localhost:56789 is disabled.
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512128001383 4
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512128001383 4
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512192001384 4
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512192001384 4
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512256001385 4
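
When the forwarder output is long, a short script can summarize it. The sketch below assumes only the line formats visible in the output above; the function name and the summary layout are hypothetical.

```python
import re

# Patterns derived from the sample output above (assumed stable).
DEST_RE = re.compile(r"^Destination (\S+):(\d+) will receive spills (.+)$")
FORWARDED_RE = re.compile(r"^Spill forwarded: (\d+) (\d+)$")
DISABLED_PREFIX = "Destination "
DISABLED_SUFFIX = " is disabled."


def summarize_forwarder_log(text):
    """Count forwarded spills and write errors, and list the destinations."""
    summary = {"destinations": {}, "disabled": [], "forwarded": 0, "write_errors": 0}
    for raw in text.splitlines():
        line = raw.strip()
        match = DEST_RE.match(line)
        if match:
            host, port, spills = match.groups()
            summary["destinations"][(host, int(port))] = [
                s.strip() for s in spills.split(",")
            ]
        elif line.startswith(DISABLED_PREFIX) and line.endswith(DISABLED_SUFFIX):
            summary["disabled"].append(
                line[len(DISABLED_PREFIX):-len(DISABLED_SUFFIX)]
            )
        elif FORWARDED_RE.match(line):
            summary["forwarded"] += 1
        elif "XmlRpcClient::writeRequest: write error" in line:
            summary["write_errors"] += 1
    return summary


sample_log = """\
Forwarding spills received on port 7890.
Found 3 destinations.
Destination novadaq-ctrl-trigger.fnal.gov:7891 will receive spills NuMI, BNBtclk, AccelOneHztclk
Destination localhost:56789 is disabled.
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512128001383 4
Error in XmlRpcClient::writeRequest: write error (error 111).
Spill forwarded: 7065512128001383 4
"""

print(summarize_forwarder_log(sample_log))
```

A write-error count that keeps growing after all receivers have been started suggests that at least one destination is still not listening.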

2. Next the spill server needs to be launched. This can only be done from a specific TDU, for example:

ssh root@tdu-near-master-ppc-01

Next the software must be set up correctly for the TDU:
setup_online -z 0 --xcompile --opt

Now the spill server can be started, but it also must be passed a configuration file. These files are also located in NovaSpillServer/config. For TDUs in the MINOS surface building execute
NssTDUApp -c srt://NovaSpillServer/config/NssTDUAppConfig-Minos.xml

3. The last step to get spills sent to the global trigger at the near detector is to set up a spill receiver. This is also done from novadaq-near-trigger.fnal.gov. First source the appropriate setup:

setup_online -z partitionNumber

Where partitionNumber is the desired partition to run on. Finally, the receiver can be started with
NssSpillReceiver -p Port

The Port specified must match the destination port in the NssSpillForwarderConfig file, since this is the port on which the receiver is listening. At this point a spill chain is set up on the near detector and triggers should be received if the global trigger is listening. To check that spills are being received properly, the eavesdropper script can be run. This should be done on the same machine as the receiver is running on (i.e. novadaq-near-trigger.fnal.gov) and in the same partition.
> setup_online -z partitionNumber
> NssSpillReceiverEavesDropper -p -1

Here the number after the -p option to the EavesDropper must be set to -1 to indicate the NULL_PARTITION. If this script is run and no clock signals are coming in, something has gone wrong.
The above steps can be executed individually or from a script:
/home/novadaq/DAQOperationsTools/bin/startBeamSpillBackBoneND.sh -z 1

This is typically run on novadaq-near-master.fnal.gov after sourcing the appropriate environment.

4. In order to pass the spills up to the Ash River detector, a forwarder must be set up first. This is done from novadaq-far-trigger.fnal.gov. First source the setup in the same style as for the near detector forwarder. Next, start the forwarder, being careful to pass the far detector configuration file:

setup_online -z partitionNumber
NssSpillForwarder -c srt://NovaSpillServer/config/NssSpillForwarderConfig-AshRiver.xml

The important thing is that the source port in this configuration file matches the destination port in the near detector configuration file. This allows the far detector forwarder to listen to the forwarder on the near detector. Make sure the destination host is also novadaq-far-trigger.fnal.gov, and note that the destination port is the one the far detector receiver should be listening on.

5. The final step is to start a receiver at the far detector. This is again done on novadaq-far-trigger.fnal.gov. The setup is first sourced:

setup_online -z partitionNumber

It is important to know that the far detector partition does not have to match the near detector partition. The receiver is then started:
NssSpillReceiver -p Port 

Where partitionNumber is the far detector partition and Port is the destination port specified in the configuration file for the far detector forwarder.
At this point spills should be up and running at the far detector and if the global trigger is listening then counts should be seen. To check and make sure spills are being received the eavesdropper can also be run on novadaq-far-trigger.fnal.gov.

During normal operations the spill server should not require any direct action by a shifter. When a run begins on either the near or far detector, the DAQApplicationManager will start a NssSpillReceiver on the same partition as the run. The backbone of the spill server chain should already be running. The backbone consists of:

  • a spill server running on a master TDU at the MINOS surface building (currently tdu-near-master-ppc-01),
  • a NssSpillForwarder running at the near detector on novadaq-near-trigger.fnal.gov to forward spills both to the near detector receivers and to the far detector,
  • a second NssSpillForwarder running on novadaq-far-trigger.fnal.gov to forward spills to the far detector receivers.

Both forwarders are set up to send spills to 5 ports corresponding to partitions 0-4. By current convention, partition 0 looks for spills on port 7892, with the other partitions using increments from that base port. The forwarders always transmit these messages regardless of whether a receiver is listening, so the backbone can be left up indefinitely.
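
The port convention can be written down as a one-line mapping. The sketch below assumes that "increments from the base port" means one port per partition (base port plus partition number); that step size is an assumption and should be checked against the forwarder configuration files. The function name is hypothetical.

```python
BASE_PORT = 7892       # port used by partition 0, per the convention above
PARTITIONS = range(5)  # partitions 0-4


def receiver_port(partition):
    """Return the port a receiver on `partition` would listen on.

    ASSUMPTION: ports increase by exactly one per partition.
    """
    if partition not in PARTITIONS:
        raise ValueError(f"partition must be in 0-4, got {partition!r}")
    return BASE_PORT + partition


for p in PARTITIONS:
    print(p, receiver_port(p))
```

Under that assumption, partitions 0 through 4 map to ports 7892 through 7896.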

A spill receiver can be stopped and started in the following way (this must be done on novadaq-near-trigger.fnal.gov for the near detector and novadaq-far-trigger.fnal.gov for the far detector):

setup_online -z partitionNumber
stopNssSpillReceiver.sh -z partitionNumber
startNssSpillReceiver.sh -z partitionNumber

The above actions should happen automatically during the process of starting a run. If the spill server process remains pink in the application manager and there are no triggers being seen, this can be tried manually. After this is done the NssSpillReceiverEavesdropper should be run as described above in order to verify that spills are being seen.

There is a Spill Server Monitor that should always be running in the cr01 VNC sessions on both detectors. There are start and stop icons on the desktop. This monitor checks that each step of the spill server backbone is running and reflects this with a green status box. If a step in the spill server fails, the box will appear pink or red. If this happens, the shifter should check whether the trigger scalars have also stopped (both the NuMI triggers and the 1 Hz accelerator triggers). If so, the steps above should be followed to restart the backbone. If triggers are still being sent, then the issue is with the monitor and not the server, and an expert should be called. The run does not need to be stopped in order to restart the spill server, and the monitor can also remain running. The monitor also shows the status of the SNEWS trigger system, which is independent of the spill server. A screen shot of the ND spill server monitor is below.

The FD monitor tunnels back to the ND cluster and reflects the same status for the NssTDUApp and ND spill forwarder. The only difference is that the FD monitor also checks the spill forwarder on the FD. A screen shot is below.

Spill Server Auto Restarts

Message Facility at the Far Detector watches the spill server log files. If the spill server goes down, it will stop writing to the log files and be detected by Message Facility. When MF detects this, it executes the script ~/DAQOperationsTools/bin/autoRestartSpillServerBackbone.sh which restarts all components of the timing system.

At the moment there is no notification delivered to the shifter, although we may put one in place. MsgViewer on nova-cr-01 can be used to determine when an automatic restart was issued by filtering by the application AutoRestart:

This information can also be found in the Message Facility log files. For example:

[novadaq@novadaq-far-master ~]$ cat /daqlogs/FarDet/Partition1/MessageFacility/novadaq-far-msglogger/msg_archive_20160406_160112.log | grep -A 2 AutoRestart

%MSG-i AutoRestart:  09-Apr-2016 20:28:19 CDT novadaq-far-msglogger.fnal.gov  (192.168.136.19)
msglogger SpillServer SpillServer  --:0
MessageFacility automatically restarted the spill server backbone.
--
%MSG-w AutoRestart:  09-Apr-2016 20:28:19 CDT novadaq-far-msglogger.fnal.gov  (192.168.136.19)
msglogger SpillServer SpillServer  --:0
MessageFacility attempted to automatically restart the spill server backbone too soon. Will try again soon.
--
%MSG-i AutoRestart:  13-Apr-2016 10:02:20 CDT novadaq-far-msglogger.fnal.gov  (192.168.136.19)
msglogger SpillServer SpillServer  --:0
MessageFacility automatically restarted the spill server backbone.
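
For a longer archive it can be easier to extract the restart times programmatically. This is a minimal sketch that assumes the three-line record layout shown above (header line, source line, message line) and treats the timestamps as naive local times, ignoring the time zone label; the function name is hypothetical.

```python
import re
from datetime import datetime

# Header format taken from the sample above: "%MSG-<severity> AutoRestart:  <date> <time> <tz> ..."
HEADER_RE = re.compile(
    r"^%MSG-(\w) AutoRestart:\s+(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})"
)


def auto_restarts(log_text):
    """Return (timestamp, severity, message) for each AutoRestart record.

    Assumes the message text sits two lines below the header, as in the
    grep output above. Month names are parsed with %b, which expects
    English abbreviations under the default C locale.
    """
    lines = log_text.splitlines()
    events = []
    for i, line in enumerate(lines):
        match = HEADER_RE.match(line)
        if not match:
            continue
        severity, stamp = match.groups()
        when = datetime.strptime(stamp, "%d-%b-%Y %H:%M:%S")
        message = lines[i + 2].strip() if i + 2 < len(lines) else ""
        events.append((when, severity, message))
    return events


sample = """\
%MSG-i AutoRestart:  09-Apr-2016 20:28:19 CDT novadaq-far-msglogger.fnal.gov  (192.168.136.19)
msglogger SpillServer SpillServer  --:0
MessageFacility automatically restarted the spill server backbone.
--
%MSG-w AutoRestart:  09-Apr-2016 20:28:19 CDT novadaq-far-msglogger.fnal.gov  (192.168.136.19)
msglogger SpillServer SpillServer  --:0
MessageFacility attempted to automatically restart the spill server backbone too soon. Will try again soon.
"""

for when, severity, message in auto_restarts(sample):
    print(when.isoformat(), severity, message)
```

In the sample above, the records tagged "i" report completed restarts, while the one tagged "w" reports an attempt that was deferred.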

Viewing Accelerator Event Data

When the TDUs are initialized and configured correctly, they will start decoding beam data that comes in over the MIBS, TCLK or AUX lines on the front of the TDU. Events on these lines are decoded in real-time and time stamped with a NOvA timebase timestamp.

To verify that this process is happening correctly, the following diagnostic utilities can be started independent of the rest of the beam spill system. To start the "stand alone" version of the spill server (which allows for collection of data independent of the rest of the DAQ), do the following:

  • Login to the TDU as root
  • Verify that the tdu_kernel module has been loaded
  • Setup the NOvA software environment:
    setup_online --xcompile --opt
    
  • Verify that the "event enable" bit is set in the control register:
    • Run tduRegDump and verify that bit 11 is set in register 0x0 (0x0400 in the example output):
      > tduRegDump
      ------------------------------------------------------
      Register        Value    Description
      ------------------------------------------------------
      0x0000        0x0400    Control
      0x0001        0x0048    Status
      0x0002        0x0000    TDU Delay Value
      0x0003        0x0000    Side Output Delay
      0x0004        0x0000    Top Output Delay
      ...
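
The bit check can also be scripted. In the dump above the control register reads 0x0400, which is the same value written to register 0x0 when enabling event decoding; that value has only bit index 10 set when counting from zero, so "bit 11" is presumably counted from one. The sketch below takes 0x0400 as the mask on that basis; the function names are hypothetical and the dump layout is assumed to match the output above.

```python
# Mask inferred from the value 0x0400 in the dump above (ASSUMPTION: this
# is the "event enable" bit, i.e. bit index 10 counting from zero).
EVENT_ENABLE_MASK = 0x0400


def parse_reg_dump(text):
    """Map register address -> value from tduRegDump-style output.

    Only lines whose first two fields are hex literals are used; rulers,
    headers and "..." lines are skipped.
    """
    regs = {}
    for line in text.splitlines():
        fields = line.split()
        if (
            len(fields) >= 2
            and fields[0].startswith("0x")
            and fields[1].startswith("0x")
        ):
            regs[int(fields[0], 16)] = int(fields[1], 16)
    return regs


def event_decoding_enabled(regs):
    """True if the assumed event-enable bit is set in the control register."""
    return bool(regs.get(0x0, 0) & EVENT_ENABLE_MASK)


dump = """\
------------------------------------------------------
Register        Value    Description
------------------------------------------------------
0x0000        0x0400    Control
0x0001        0x0048    Status
0x0002        0x0000    TDU Delay Value
...
"""

registers = parse_reg_dump(dump)
print(event_decoding_enabled(registers))
```

Testing against the mask rather than comparing the whole register means the check still works if other control bits are set at the same time.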
      

Then, if there is no spill server currently running, you can start the stand-alone version of the spill server (SpillServerApp-Standalone) and you will get a list of the accelerator signals that are being read out.

Additional Information

Spill Server Hardware (including Booster spill information)
Dummy spills (great to get started with testing)
Performance tests (it is fast enough)
Status and ToDo List
Starting the Spill Server on the test stand

Notes