Troubleshooting for PPD

IFBeam DB Architecture

The IFBeam DB architecture is described in the IFBeam DB Architecture document.

At a very high level, the system consists of 3 components:

  • Redundant Data Collector
  • Database
  • Data access interface

Also, there are several NuMI and BNB line monitoring tools built around the system and supported by the SCD.

This document describes troubleshooting procedures for the Data Collector.

IFBeam DB Data Collector

As described in the IFBeam DB Architecture document, the Data Collector is a redundant and autonomous system. Currently it consists of 3 identical instances of the Data Collector. Each Data Collector, in turn, consists of the Invoker and multiple Bundle Collectors. The function of the Invoker is to start the Bundle Collectors and keep them running. Each Bundle Collector receives data from AcNet in real time and stores it in "flat" files on the local disk. Bundle Collectors are completely independent, except that they may share the disk area where they store received data. A Bundle Collector receives the list of devices to read from the Database, but once received, the list is stored on the local disk so that the Collector can continue running even if the database with the device list is unavailable.

Normally, flat files produced by the Collector are transferred to the dbweb3 computer and then stored in the Database. If the Database or the network is unavailable, data remains buffered on the Collector's local disk and is stored in the Database later. There is sufficient disk space on the collector computers to store several days' worth of data in case of long-term unavailability of the Database.
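The buffering behaviour can be illustrated with a minimal Python sketch. It is not the actual Collector code: the function name flush_buffered_files, the "*.dat" file pattern, and the store callback are hypothetical, and the sketch assumes that file names sort in the order the files were written.

```python
from pathlib import Path
from typing import Callable


def flush_buffered_files(spool_dir: Path, store: Callable[[Path], None]) -> int:
    """Try to store every buffered flat file, in file-name order.

    A file is deleted only after `store` returns without raising, so the
    data stays on the local disk while the database or network is down.
    Returns the number of files stored in this pass.
    """
    stored = 0
    for path in sorted(spool_dir.glob("*.dat")):  # "*.dat" is an assumed pattern
        try:
            store(path)
        except OSError:
            # Destination unavailable: keep this file and all later ones,
            # and try again on the next pass.
            break
        path.unlink()
        stored += 1
    return stored
```

Calling the function periodically gives the behaviour described above: nothing is lost during an outage, and the backlog drains once the destination is reachable again.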

Because we run 3 identical instances of the Data Collector, we run 3 instances of each Bundle Collector. They are completely redundant, i.e. as long as at least one collector for the bundle is running, the system will receive all the information from the devices of the bundle.

IFBeam DB monitoring tools are web applications running on 2 redundant web servers, dbweb3 and dbweb4. Every page has its "official" dbweb0-based URL. dbweb0 is an HTTP redirector, which automatically redirects the client to whichever server is currently available.

Tools

The IFBeam DB web interface includes 3 pages useful for monitoring and troubleshooting:

The System Status page shows the status of every Bundle Collector, including whether it is sending "heartbeats", when it last closed its output file, and how many events were written into that file. Normally, files are closed after 100 events or when events stop coming. So for A9, 8F and 1D events, the number of events per file should be 100. For rarer events, that number can be lower. Files should have been closed "recently", where "recently" depends on the frequency of the particular event.
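The file-closing rule explains the numbers shown on the page, and can be sketched as follows. This is an illustration only: the class FileRoller and the idle timeout value are assumptions, since the text gives only the 100-event limit and does not say how long a Collector waits before deciding that events have stopped coming.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAX_EVENTS_PER_FILE = 100  # from the text above
IDLE_TIMEOUT_S = 10.0      # assumed value; the real timeout is not given here


@dataclass
class FileRoller:
    """Tracks when an output file would be closed."""
    events_in_file: int = 0
    last_event_time: float | None = None
    closed_files: list[int] = field(default_factory=list)  # events per closed file

    def on_event(self, now: float) -> None:
        """Record one event; close the file once it holds 100 events."""
        self.events_in_file += 1
        self.last_event_time = now
        if self.events_in_file >= MAX_EVENTS_PER_FILE:
            self._close()

    def on_tick(self, now: float) -> None:
        """Close a partly filled file if events have stopped coming."""
        if (
            self.events_in_file > 0
            and self.last_event_time is not None
            and now - self.last_event_time >= IDLE_TIMEOUT_S
        ):
            self._close()

    def _close(self) -> None:
        self.closed_files.append(self.events_in_file)
        self.events_in_file = 0
```

With a frequent event every file reaches 100 events; with a rare event the idle rule closes the file early, which is why a lower count on the System Status page is not by itself a sign of trouble.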

The page also shows timestamps of the latest events directly at the Collector, in the real-time database, and in the historical database. Normally, the latency of the Collector timestamp should be ~1 second, the real-time database is delayed by a couple of seconds compared to the Collector, and the historical database is delayed by ~1 minute, although this can grow to several hours. A longer than usual latency in the database does not mean that the system is losing data. The data is buffered in the flat files and will eventually be delivered to the database.
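A latency check of this kind could look like the sketch below. The limits in ASSUMED_LIMITS are illustrative values chosen loosely from the typical latencies quoted above; they are not the thresholds used by the System Status page, and the source names are hypothetical labels.

```python
from datetime import datetime, timedelta, timezone

# Assumed alert limits (illustrative only).
ASSUMED_LIMITS = {
    "collector": timedelta(seconds=10),
    "realtime_db": timedelta(seconds=30),
    "historical_db": timedelta(hours=6),
}


def stale_sources(now, latest, limits=ASSUMED_LIMITS):
    """Return the names of sources whose latest event is older than its limit.

    `latest` maps a source name to the timestamp of its latest event.
    """
    return sorted(name for name, ts in latest.items() if now - ts > limits[name])


now = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latest = {
    "collector": now - timedelta(seconds=1),
    "realtime_db": now - timedelta(minutes=5),
    "historical_db": now - timedelta(hours=2),
}
print(stale_sources(now, latest))  # ['realtime_db']
```

In this example the historical database is 2 hours behind but is not flagged, in line with the note above that a long historical latency alone does not mean data is being lost.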

Another useful part of the System Status page is the comparison of A9 event counts from the Database and from AD's Event Logger. Unless there is some Database latency, these numbers must match.
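The comparison amounts to checking two sets of counts against each other. A minimal sketch, assuming the counts are available as dictionaries keyed by some interval label (the function name and the data layout are hypothetical, not the page's actual implementation):

```python
from __future__ import annotations


def count_mismatches(
    db_counts: dict[str, int], logger_counts: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Map each interval to (database count, Event Logger count) where they differ.

    An interval missing from one side is treated as a count of 0.
    """
    labels = db_counts.keys() | logger_counts.keys()
    return {
        label: (db_counts.get(label, 0), logger_counts.get(label, 0))
        for label in sorted(labels)
        if db_counts.get(label, 0) != logger_counts.get(label, 0)
    }
```

A mismatch only in the most recent intervals is what Database latency would produce; a mismatch that persists in older intervals deserves a closer look.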

The home page shows the PID and current status of all individual Bundle Collectors. We are also adding buttons to that page to restart the Bundle Collectors.

The dashboard shows charts of event frequencies, some device measurements and some database statistics.

Troubleshooting

A Bundle Collector is considered to be down if

  • its status is not "running" and
  • the timestamp of its last closed file is too far in the past

If a Bundle Collector is down, it may indicate a problem. However, because of the redundancy, no action is required if only 1 or 2 of the 3 redundant Bundle Collectors are down, although one may want to try to restart the collector(s) which are down.

If all redundant collectors for the same bundle are down, then they need to be restarted.
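The decision rules above can be summarized in a short sketch. It is only an illustration: the CollectorStatus class, the function names and the max_file_age parameter are hypothetical, and the sketch treats a last closed file that is too old as the sign of a stalled collector, which is an interpretation of the criterion above. The acceptable file age depends on the event frequency and is not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CollectorStatus:
    status: str                  # e.g. "running"
    last_file_closed: datetime   # when the last output file was closed


def is_down(c: CollectorStatus, now: datetime, max_file_age: timedelta) -> bool:
    """Down means: not "running" AND the last closed file is too old."""
    return c.status != "running" and now - c.last_file_closed > max_file_age


def triage(collectors: list[CollectorStatus], now: datetime,
           max_file_age: timedelta) -> str:
    """Decide what to do for the redundant collectors of one bundle."""
    if not collectors:
        raise ValueError("at least one collector is required")
    down = [is_down(c, now, max_file_age) for c in collectors]
    if all(down):
        return "restart required"
    if any(down):
        return "restart optional"
    return "no action"
```

The function is applied per bundle, to the (normally 3) redundant collectors of that bundle, because the redundancy is per bundle and not per Data Collector instance.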

If the collectors are running but the real-time database shows significant latency, this may indicate a database problem.

If the IFBeam DB monitoring pages do not load, switch back to the official dbweb0 URL for the page and/or reload the page. If that does not help, both redundant web servers, dbweb3 and dbweb4, are down, and an incident needs to be created for FEF.
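The "try the dbweb0 URL and reload" step can also be scripted. The sketch below uses only the Python standard library; the function name, the number of attempts and the timeout are assumptions, and the caller has to supply the official dbweb0 URL of the page, which is not reproduced here.

```python
import urllib.error
import urllib.request


def page_is_reachable(url, attempts=2, timeout=10.0,
                      opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200 within `attempts` tries.

    urlopen follows HTTP redirects, so a request to the redirector ends up
    at whichever web server the redirector points to.
    """
    for _ in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, HTTP error status, etc.: try again.
            continue
    return False
```

If the function returns False for the official URL, the situation matches the last case above and an incident should be created. The opener parameter exists only so that the function can be exercised without network access.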