Plan for DAQ machine failure

This page is intended to lay out the plan for a failure of any of the DAQ machines. The most worrying case is EVB, because it currently hosts the home areas for all the other machines (although this plan includes some steps to mitigate that). The page also includes action items that should be kept up to date to ensure the plan can work. Each section contains a plan, in bold, describing what to do if that machine fails, including which machine we will use as a replacement and a list of things we will need to consider in making that replacement. In general, the SLAM team needs to change the name of the assigned substitute online machine to the original name of the failed one. There are also notes, which are intended to document the thought process behind each plan and remind us of other things we may need to take into account.

If EVB fails

Plan to get running again with replacement EVB

  1. SLAM team moves evb and NFS mount of home areas to ubdaq-prod-evb2 (or backup replacement machine, if ubdaq-prod-evb2 is unavailable -- see notes below).
  2. SLAM team configures the replacement machine to have hostname ubdaq-prod-evb and the same IP address as the failed evb (so that it looks like the same machine from the point of view of the DAQ software).
  3. DAQ expert restores backups of /home and /uboonenew areas from local backups (see notes below) if necessary.
    • Note that this is only necessary if the replacement machine is not ubdaq-prod-evb2; the backups already keep ubdaq-prod-evb2 as a mirror of ubdaq-prod-evb.
  4. DAQ expert restores the evb cron jobs from backups (this will be necessary no matter which machine is being used as the replacement; see the restore sketch after this list).
  5. Online Monitor expert and SlowMon expert will need to check that the systems can run in the new configuration. Note that the SlowMon should not ever lose access to the database because the postgres server is running locally (but it may be affected by the change in home areas).
  6. The SlowMon expert will need to make sure that the slow monitoring and Ganglia are reading from the correct machine for all event-builder-related metrics. See here for a discussion of how to do that.
  7. DAQ experts will need to update the file /home/uboonedaq/runinfo.txt by hand (it usually gets written at the start of every run, but since we are planning to do the backup only once per day it could be up to 24 hours out of date). Check this elog entry to find out what needs to be in the file - all information can be found in the run config database and elog entries for run starts.
  8. The replacement evb may not immediately have a Kerberos ticket, which the DAQ needs to run. The ticket is generated every 10 minutes by a cron job under user uboonedaq on evb (see details of all cron jobs on MicroBooNE here). Note that the ticket is different for different machines in the cluster (the ticket on evb has "ubdaq_prod_evb" in its name), so even if you substitute a different online machine you will need a new ticket. You can either wait 10 minutes to get the correct ticket or run the commands from the cron job yourself as uboonedaq (see the sketch after this list).
  9. DAQ experts should check with SLAM which ethernet port is being used on the replacement evb (usually eth1 or eth2) and change config/dds/prod/ospl-uboonedaq.xml if necessary (see the notes below for details, and the interface-check sketch after this list).
  10. Then the DAQ should just work (ha!). DAQ experts would need to be available to make sure that's true.
  11. Finally, DM experts would need to relaunch the daemons on the new EVB machine to drain data. This may require changing some of the PUBS jobs by hand.
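
For step 4, assuming the evb crontab was saved to a backup file with crontab -l (the backup path below is hypothetical), the restore as user uboonedaq would look something like:

    # as uboonedaq on the replacement evb
    crontab -l                                      # check what, if anything, is currently installed
    crontab /home/uboonedaq/backups/evb.crontab     # install the backed-up evb crontab
    crontab -l                                      # confirm the evb jobs are now in place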
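
For step 8, rather than waiting for the cron job, the ticket can be checked and renewed by hand. The keytab path and principal below are placeholders; the real command should be copied from the cron job entry itself:

    # as uboonedaq on the replacement evb
    klist                                            # check whether a valid ticket already exists
    # re-run the cron job's command by hand (keytab path and principal are placeholders):
    kinit -k -t /path/to/ubdaq_prod_evb.keytab <principal-from-the-cron-job>
    klist                                            # confirm the new ticket is in place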
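
For step 9, which interface the replacement machine actually uses on the DAQ network can be confirmed with standard Linux tools before touching config/dds/prod/ospl-uboonedaq.xml (the seb address is a placeholder):

    ip addr show                 # list the interfaces and their addresses; find the one on the DAQ subnet
    ip route get <IP-of-a-seb>   # the 'dev ethN' in the output is the interface used to reach the sebs

If that interface differs from the one named in ospl-uboonedaq.xml, update the file to match.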

Plan for switching back to original EVB after it has been repaired

  1. SlowMon experts should follow the power-outage procedure before the replacement evb (i.e. the machine that is currently running as evb) is taken down.
  2. SLAM will bring the original evb online and change the hostnames and IP addresses back to what they were before the replacement.
  3. SLAM should then reboot all online machines, including (especially!) the machine that was being used as the replacement evb and is no longer needed, to ensure that it doesn't have any DAQ processes still running.

If a SEB fails

Plan to get running again with replacement SEB

  1. If a SEB fails, we wipe the test stand SEB (uboonedaq-seb01) and mirror the old SEB there, then use that as the new SEB. (If we buy a new SEB, then that will be the first option for a replacement, and the test stand SEB will be the second option, to use if two SEBs fail).
  2. If the failed machine is seb10, the DAQ experts will have to physically connect both the pulser and the ASIC configuration to the mirrored machine. They will also need to change the machine address in uboonedaq/projects/scripts/setup_daq.sh in the DAQ code (see the sketch after this list).
  3. If seb10 is powered down (including to replace it, but also if the rack is powered down to replace another machine), there may be issues with the GPS timing when it comes back up. As documented in Kathryn's slides from the DAQ school, this can usually be fixed by rebooting seb10.
  4. Changing machines may require some changes in the Ganglia settings to ensure the Ganglia metrics and slowmon are reading from the correct machine. See here for a discussion of things we may need to consider.
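
For step 2, the exact place where the seb10 address appears in setup_daq.sh is not documented here, so as a sketch (the grep pattern is an assumption) it can be located and then edited by hand:

    # in a checkout of the DAQ code
    grep -n "seb10" uboonedaq/projects/scripts/setup_daq.sh
    # edit the matching line(s) so the address points at the replacement machine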

If near1 fails

Plan to get running again with replacement near1

  • If near1 fails, the first option would be to use the old evb as a replacement (once the new evb has arrived). If that is not available, we move the online monitor to run on near2.
  • We can run stably like this for a number of days but will need to ensure we can drain data from evb without near1:
    • With the software trigger, we have storage for 5-6 days of operation. We should be able to react, reconfigure the system, and move everything within that time.
    • Transfer all PUBS projects that were running on near1 to evb and ramp them up, and the data will drain down “normally”. The caveat is that in normal operations we only have 4-5 minutes of data that has not been drained, so it can be drained quickly by hand. If near1 was out for a long time and we had built up a lot of files on evb, this could lead to too much CPU and network activity on evb, which we would need to monitor.
    • It's probably better to slowly transition things over, although that will be painful and annoying.
    • We also need to keep an eye on the power distribution while doing so, making sure it doesn’t go over the limit (12 A in these PDUs). The PDUs are in EPICS; the standard display for shifters is the ComputerStatus page (in CSS).
    • Relaunch the daemons on the new EVB machine. The daemons are set up so that if we swap out machines, we relaunch them on whatever machine becomes evb and the software will adapt.
    • (This was tested and found to be successful during a 4-hour downtime on April 5th 2018)
  • There is usually about 5 minutes' worth of data on near1 at any given time before it gets transferred out. Getting data off the failed machine will depend on the failure mode: maybe replace the RAID card, or maybe move the disks to a RAID array of the same style somewhere else. That could be hours' worth of work and may not be worth it for 5 minutes of data.

If near2 fails

  • Near2 runs the webserver for the online monitoring lizard, and does odd jobs for SN monitoring. It is unique in that it is the only machine besides ws01/ws02 that is accessible from outside the DAQ network (although it is accessible only from inside the FNAL network). This also means near2 SHOULD be getting more patches.

If it fails, SN monitoring is simply shut down.

Lizard can be run from near1 directly. The problem with this is that the firewall prevents access from outside the DAQ cluster, so shifters would have to use an SSH tunnel to connect to the webserver (see the sketch below) and would need a valid ws01 account to get through the gateway. To get around this, we could relax the firewall rules. (I would prefer it if near1 were exposed to FNAL, but this idea was shot down in early DAQ deployment as being insecure.)
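
A minimal sketch of such a tunnel, assuming Lizard is served on port 8080 of near1 (the port number and the exact hostnames are assumptions):

    # from a machine on the FNAL network, with a valid ws01 account
    ssh -L 8080:ubdaq-prod-near1:8080 <username>@ubdaq-prod-ws01
    # then point a browser at http://localhost:8080 on the local machine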

If smc fails

Plan to get running again with smc2

  • To tell the DAQ whether to look for the database on smc or smc2 (or elsewhere), you need to edit the file /home/uboonedaq/.sqlaccess/prod_conf.sh: change the IP addresses given for DBTOOL_READER_HOST and DBTOOL_WRITER_HOST (see the sketch after this list). Once the file has been changed, you need to log out and log back in again, because the file is sourced when logging in as uboonedaq.
  • Running the database on smc2 is fine in the short term, but in the long term we need more than one database machine: get smc repaired as soon as possible, copy everything from smc2 back onto it, and bring smc back up and running again.
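
As a sketch of what the edit would look like, assuming prod_conf.sh sets these variables as shell exports (the file format and the IP value are assumptions; only the two host variables named above should need to change):

    # /home/uboonedaq/.sqlaccess/prod_conf.sh -- point the DAQ at the database on smc2
    export DBTOOL_READER_HOST=<smc2-IP-address>
    export DBTOOL_WRITER_HOST=<smc2-IP-address>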

If ws01 or ws02 fails

Plan to get running again with replacement gateway machine

  • If one of the gateways fails, we take one of the test stand machines and convert it into a gateway.
  • In the long run we would like to buy one spare gateway machine.

If two machines fail

If evb and near1/near2 fail

  • First machine: Use the old evb
  • Second machine: Use the other of near1/near2
  • Third machine: Use the test stand evb

Notes:

  • near2 doesn’t have as much capability as near1. Kirby suggests we may need to scale back a lot of operations tasks (e.g. maybe the supernova stream) so that we could get up and running. This would not be a long-term solution.

If two sebs fail

  • First machine: Use a new spare seb (if available)
  • Second machine (or first, if a new spare is not available): Use the test stand seb
  • Third machine (or second, if a new spare is not available): Use near2 (this is not an ideal solution, for many reasons)

Changing network names

If any machine fails and we need to replace it, the name of the new machine would be different from that of the old machine. That will cause problems because:
  1. the DAQ scripts may have hardcoded addresses (we definitely know that RunConsoleDAQ.py uses ssh to launch processes on the various machines)
  2. shifters need to ssh into the online machines.

Because of this, when putting in the replacement machine SLAM will configure its hostname and IP address to match the machine it is replacing.
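
As a quick check that the renaming has worked from the DAQ side, the old name should resolve to the replacement machine and be reachable over ssh (using the evb name from above as the example):

    getent hosts ubdaq-prod-evb     # should return the IP address assigned to the replacement
    ssh ubdaq-prod-evb hostname     # should log into the replacement and report the evb hostname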