Nathaniel Tagg, 02/02/2018 12:43 PM


Plan for DAQ machine failure

This page is intended to document the plan for failure of any of the DAQ machines. The most worrying case is evb, because it currently hosts the home areas for all other machines (although this plan includes some steps to mitigate that). The page also lists action items that need to be completed to make sure the plan will work.

Changing network names

If any machine fails and we need to replace it, the name of the new machine would be different from that of the old machine. That could cause problems because:
  1. the DAQ scripts may have hardcoded addresses (we definitely know that RunConsoleDAQ.py uses ssh to launch processes on the various machines)
  2. shifters need to ssh into the online machines.
We are therefore considering the following options:
  • Leave the network names as they are. Change the names in the DAQ software and write new shifter instructions. This is non-ideal because it could confuse shifters and might result in the DAQ expert being called every 8 hours.
  • Is it possible to make aliases? If every machine had an alias, we wouldn't need to make any network changes when moving machines, just change the hostname. We agreed that this is definitely a good idea at least for evb, and probably a good idea for all machines.
    • The private network is easy - we can give every machine an alias and change it any time.
    • The public network (used for ssh-ing in) is more difficult. The machine name you request has to match something that the system has a kerberos keytab for. Would it be possible to put all of the kerberos keys into all of the keytabs and make the machines totally interchangeable (or at least between test stand and production)? Bonnie will look into this.
  • Action item (DAQ team): Work out where in the software we would need to change addresses if necessary. Make a list so we can find and change them quickly if needed. One place we know about already: the dispatcher. Do we use the private or public network addresses? A sketch of a possible scan for hardcoded names is given after this list.
  • Action item (RunCos): Work out where in the shifter instructions we would need to change addresses if necessary. Make a list so we can find and change them quickly if needed.
  • Action item (SLAM team): Look into whether it's possible to assign aliases for all machines over the public network
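
As a starting point for the DAQ-team action item above, here is a minimal sketch of a scan for hardcoded machine names in the DAQ software tree. The hostname list, the file extensions and the search root are assumptions and would need to be adjusted to the real installation; it is only meant to help build the list of places to change.

<pre>
#!/usr/bin/env python3
# Sketch: scan the DAQ software tree for hardcoded machine names.
# The hostname list, file extensions and search root are assumptions;
# adjust them to the real installation before relying on the output.
import os
import re

HOSTNAMES = ["evb", "near1", "near2", "smc", "smc2"]   # add the seb names as appropriate
SEARCH_ROOT = "/home/uboonedaq/daq"                    # hypothetical checkout location
EXTENSIONS = (".py", ".sh", ".fcl", ".xml", ".cfg")
PATTERN = re.compile(r"\b(%s)\b" % "|".join(HOSTNAMES))

for dirpath, dirnames, filenames in os.walk(SEARCH_ROOT):
    for name in filenames:
        if not name.endswith(EXTENSIONS):
            continue
        path = os.path.join(dirpath, name)
        try:
            with open(path, errors="ignore") as f:
                for lineno, line in enumerate(f, 1):
                    if PATTERN.search(line):
                        print("%s:%d: %s" % (path, lineno, line.rstrip()))
        except OSError:
            pass
</pre>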

If EVB fails

1. Home areas

  • A major problem currently is that the home and products areas for all machines are hosted on evb and NFS mounted on the other machines. This has many benefits (for example the software and data products are stored in these directories so are accessible on all machines without needing to compile), but means that if evb fails the home area for all machines is gone. (Actually, the home area lives on the RAID disks which would still be fine, but we can't access them. Physically moving the disks into another machine is risky and generally agreed to not be a great idea).
  • evb is backed up by TiBS, so we can recover the home areas. However, it takes about 4-6 hours to restore the products area and home area onto a new machine.
    • Action item (DAQ team): Can (or should) we remove some files, e.g. DAQ log files, from the home area to make it smaller and quicker to restore?
    • Action item (Glen): Come up with a subset of products we need
  • Bonnie also suggests moving the pieces we need to run the postgres server to a local disk on smc (with an identical copy on smc2). That solves the problem of keeping the database for the slow monitor running, and we can leave the home areas NFS mounted for all users for everything else. Action item (SLAM team): copy the pieces needed to run the postgres server to a local disk on smc and smc2. Is there anything else that should be included in this, or are we happy for everything else to be backed up on uboonedaq-evb? This should be done during the downtime on Tuesday 6th Feb.
  • In addition, set up a local backup of /home and /uboonenew (the products area) on uboonedaq-evb (the test stand evb machine).
    • If evb were to go down it's fairly simple to point the other machines at uboonedaq-evb for their home and products areas instead of looking for the NFS mount.
    • There already is a uboonenew area on uboonedaq-evb, which in theory should be synced with the production evb already (although it may not be).
    • The backup should be updated regularly using rsync (after the first copy it shouldn't be too much network traffic)
    • If we do this, we should also install a 10 Gb network card in uboonedaq-evb (it currently doesn't have one). There is one in near2 that is not connected; we could use that.
    • Action item (SLAM team): Set up a copy of the home and products areas on uboonedaq-evb. Do this during the downtime on Tuesday 6th Feb?
    • Action item (DAQ team): Check for spare 10 Gb network cards.
    • Action item (SLAM team): Move the 10 Gb network card from near2 to uboonedaq-evb.
    • Action item (who?): Arrange for a cron job or similar that does regular backups via rsync. How often should "regular" be? A sketch of such a backup script is given below.
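
For the rsync/cron action item above, a minimal sketch of a backup script that could run on uboonedaq-evb (for example hourly from cron). The /home and /uboonenew source areas are the ones named in the plan; the destination directory and rsync options are assumptions for whoever picks up the action item.

<pre>
#!/usr/bin/env python3
# Sketch of the regular rsync backup of the home and products areas from evb
# onto uboonedaq-evb. The source areas are the ones named in the plan above;
# the destination layout and rsync options are assumptions.
import os
import subprocess
import sys

SOURCES = ["evb:/home/", "evb:/uboonenew/"]       # areas currently hosted on evb
DEST_ROOT = "/backup/evb"                         # hypothetical local area on uboonedaq-evb
RSYNC_OPTS = ["-a", "--delete", "--numeric-ids"]  # mirror, preserving ownership and deletions

for src in SOURCES:
    dest = "%s/%s/" % (DEST_ROOT, src.split(":")[1].strip("/"))
    os.makedirs(dest, exist_ok=True)              # make sure the destination directory exists
    cmd = ["rsync"] + RSYNC_OPTS + [src, dest]
    print("running:", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        sys.exit("rsync failed for %s" % src)
</pre>

The first pass will be slow, but subsequent runs only transfer changes, so running this from cron should be cheap in network terms (consistent with the note above).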

2. Make sure the slowmon is running

  • Plan: If evb goes down, the SLAM team will reconfigure smc to look for its home directories on uboonedaq-evb. It should never lose access to the database because the postgres server will be running locally.

Notes:

  • Slowmon is not very dependent on evb, except for the home directory and postgres server.
  • In general, the slowmon is pretty portable: it's been moved in the past when smc has failed, and we have smc2 as a hot spare, which should be exactly identical to smc.
  • The argument against having an entirely local home area for smc is that it makes the slowmon less easily portable. Having the postgres server locally and a backup of the home areas on uboonedaq-evb seems like a better solution.
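
Related to keeping the postgres server local on smc/smc2, a trivial check (to be run on smc and smc2 themselves) that the server answers without any NFS dependence. The port is the postgres default and is an assumption about the slow-monitor setup.

<pre>
#!/usr/bin/env python3
# Sketch: verify that the slow-monitor postgres server answers locally, so the
# slowmon keeps its database even if evb and the NFS home areas are gone.
# 5432 is the postgres default port and is assumed here.
import socket

HOST, PORT = "localhost", 5432
try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print("postgres is listening on %s:%d" % (HOST, PORT))
except OSError as err:
    print("cannot reach postgres on %s:%d -> %s" % (HOST, PORT, err))
</pre>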

3. Put in a replacement evb machine

  • Plan:
    • If evb goes down, we (who? DAQ or SLAM team?) will move the event builder to near1 and the online monitor to near2. This gives us a reasonable running state.
    • Moving the online monitor to near2 will require changes in the DAQ code (see the required changes here) and the config file (compare config files 681 and 691 to see the required changes).
    • Still need to figure out how to drain data from near1 in this configuration
    • Still need to figure out how to change the network name of the new machine (or change ssh calls and other calls in the DAQ code)

Notes:

  • We decided that the replacement evb machine should be near1.
  • We need to work out a solution to the problem of changing network names in this case -- see above.
  • We would mount the home areas from uboonedaq-evb, but still be missing a /data area. We would have to create a /data area (/data/uboonedaq/rawdata, /data/uboonedaq/metadata) on near1 and then mount it on all the other machines, to ensure the configuration is the same as it currently is.
  • Why near1, and not uboonedaq-evb?
    • We would need to make sure the configuration is exactly the same as prod-evb. Yun-Tse thinks the ganglia setup might be different. Also, uboonedaq-evb is useful as a testing machine and therefore we probably can't guarantee the configuration.
    • We might need to physically move the machine, or stretch wires over to it, because it's in the teststand rack. That might require help from other people that are only available during work hours, and definitely would take some amount of time.
    • We are already planning to use uboonedaq-evb as the fileshare/backup. Using near1 as the new event builder is easier and quicker to do without reconfiguring things.
  • Near1 is currently running the online monitor and serving as a staging area for writing data to tape.
  • In the week of 22nd January 2018 the online monitoring was moved to run on near2 (because near1 failed). This seemed to work fairly well - near2 handled the load - although it is a short- to medium-term solution only.
  • The main thing we don't know at this stage is how to arrange the data transfers from near1 out if near1 was serving as the evb. Action item (RunCos/DAQ team): follow up with Data Management experts about this and formulate a plan.
    • Is there a network bandwidth problem for getting data out to tape from the same machine that we're running the event builder on? We think it probably can be done. The reason the data is currently pushed out to near1 instead of sending it directly from evb to tape is to reduce the load on evb, because there are some processes needed to check the metadata before sending files off.
    • We have some space on near1 so we could run and save data on there for a while (as was done on evb when near1 failed) if near1 was being used as the event builder.
    • One option might be to use uboonedaq-evb to stage the data, since it will now have a 10 Gb card (see above).
  • Another thing to think about: the Ganglia server currently runs on evb, so we would also need to switch that to near1. The easiest way to ensure this is to run the Ganglia web application on both near1 and evb all the time, and also store the graphs on near1. That way it would be very easy to switch to looking at the Ganglia plots through near1. This solution will increase network traffic.
  • Action item (who?): install the Ganglia web server on near1 in addition to evb.
  • Cron jobs under user names and under root will also need to be backed up and switched.
  • Action item (DAQ team): work out how to back up cron jobs (copy to home area?)
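
For the cron-job action item above, a minimal sketch that dumps the crontabs of the relevant accounts into the home area so they could be restored on a replacement machine. The account list and output directory are assumptions; it needs to run as root to read other users' crontabs.

<pre>
#!/usr/bin/env python3
# Sketch for the "back up cron jobs" action item: save each account's crontab
# to a per-host directory in the home area. The account list and output
# location are assumptions; run as root so other users' crontabs are readable.
import os
import socket
import subprocess

USERS = ["root", "uboonedaq"]   # hypothetical list of accounts with cron jobs
OUTDIR = os.path.expanduser("~/cron_backups/%s" % socket.gethostname())
os.makedirs(OUTDIR, exist_ok=True)

for user in USERS:
    result = subprocess.run(["crontab", "-l", "-u", user],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            universal_newlines=True)
    if result.returncode == 0:
        with open(os.path.join(OUTDIR, user + ".crontab"), "w") as f:
            f.write(result.stdout)
    else:
        print("no crontab saved for %s: %s" % (user, result.stderr.strip()))
</pre>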

4. Other concerns for running the DAQ in the new configuration

  • The run config database lives in smc and is backed up on smc2 as well. To lose the database we'd need to lose smc and smc2, which seems unlikely. With what we've got now we should be able to connect to the run config database.
  • The log files currently get written to the home area. If evb were to fail presumably these files wouldn't be lost, but we would need to think of a way to get them off the RAID disk on evb (because presumably some of them would not be in the TiBS backup).
  • The fhicl file gets printed to uboonenew/config (at the beginning of each run?)
  • There is a file uboonedaq/runinfo.txt which contains information about the current run (see here). It is read and overwritten at the beginning of each run.
  • More timestamps during each run are written to /data/uboonedaq/metadata and /data/uboonedaq/rawdata
  • Action item (DAQ team): If evb were to fail, the most recent fhicl file, timestamp, and runinfo.txt information would be lost (everything since the last rsync update to uboonedaq-evb or TiBS backup). We need a way to deal with this! Put the rsync at the beginning of every run? A sketch of such a begin-run sync is given below.
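
One way to implement the "rsync at the beginning of every run" idea is a small hook that run control calls at begin-run, pushing the latest run bookkeeping files to uboonedaq-evb. The exact source paths and the destination layout are assumptions based on the locations listed above.

<pre>
#!/usr/bin/env python3
# Sketch of a begin-run hook that pushes the latest fhicl files, runinfo.txt
# and run timestamps to uboonedaq-evb, so they are not lost if evb dies before
# the next scheduled backup. Paths and destination layout are assumptions.
import subprocess

DEST = "uboonedaq-evb:/backup/evb/run_state/"
SOURCES = [
    "/home/uboonedaq/uboonenew/config",      # printed fhicl files (path assumed)
    "/home/uboonedaq/uboonedaq/runinfo.txt", # current-run info file (path assumed)
    "/data/uboonedaq/metadata",              # per-run timestamps
]

def sync_run_state():
    """Mirror the current run bookkeeping files to the backup host."""
    for src in SOURCES:
        subprocess.run(["rsync", "-a", src, DEST], check=False)

if __name__ == "__main__":
    sync_run_state()   # intended to be called by run control at begin-run
</pre>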

If a SEB fails

  • Plan: If a SEB fails, we wipe the test stand SEB and mirror the old SEB there, then use that as the new SEB.

Notes:

  • One issue is that the machine is in the teststand - we would need cables going from the switchbox over to the test stand to keep it there. Preferred solution: take the machine out and put it in the place of the failed SEB.
  • Mirroring the machine leaves the issue of PCIe cards - we would probably need to swap the cards from the failed SEB to the new one.
    • We will make sure we are using the same kernel on the test stand SEB, but may have to recompile WinDriver (a simple kernel check is sketched after this list).
    • The worst case scenario would be recompiling and reinstalling, but that's manageable. We just need to remember that we might need to do that! If we forgot, the DAQ would start but crash during configuration.
  • There has been some discussion about whether SEB 10 would cause a problem but we concluded that it wouldn't. SEB 10 is a 3U machine with space for 4 PCIe cards, but Wes and Yun-Tse think it only actually uses three. The spare machine is a 2U machine that has space for 3 PCIe cards (for NU stream, SN stream, and controller) so should work fine.
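
A minimal sketch of the kernel check mentioned above: compare the kernel on a production SEB with the test stand SEB, since a mismatch means WinDriver has to be rebuilt before the PCIe cards will work. Both hostnames are placeholders.

<pre>
#!/usr/bin/env python3
# Sketch: compare kernel versions between a production SEB and the test stand
# SEB. A mismatch means WinDriver must be recompiled on the spare before it
# can replace a failed SEB. Hostnames are placeholders.
import subprocess

PRODUCTION_SEB = "seb01"          # any healthy production SEB (name assumed)
TESTSTAND_SEB = "teststand-seb"   # hypothetical test stand SEB name

def kernel_of(host):
    """Return the kernel release of a host, queried over ssh."""
    out = subprocess.run(["ssh", host, "uname", "-r"],
                         stdout=subprocess.PIPE, universal_newlines=True)
    return out.stdout.strip()

prod, test = kernel_of(PRODUCTION_SEB), kernel_of(TESTSTAND_SEB)
print("production SEB kernel:", prod)
print("test stand SEB kernel:", test)
if prod != test:
    print("kernel mismatch: plan to rebuild WinDriver on the spare")
</pre>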

If near1 fails

  • Plan: if near1 fails, we move the online monitor to run on near2. We can run stably like this for a number of days but we still need to figure out a plan for draining data from evb without near1.

Notes:

  • We have already had near1 fail, so we're good at this!
  • Near1 is currently running the online monitor and serving as a staging area for writing data to tape.
  • In the week of 22nd January 2018 the online monitoring was moved to run on near2 (because near1 failed). This seemed to work fairly well - near2 handled the load - although it is a short- to medium-term solution only.
  • Moving the online monitor to near2 requires changes in the DAQ code (see the required changes here) and the config file (compare config files 681 and 691).
  • We had to disable the data management projects on near2.
  • The main thing we don't know at this stage is how to arrange the data transfers without near1. There is no other machine with a 10 Gb card to serve the data transfers. We have some space on evb, so we could run and save data there for a while (as we did in January 2018, for 6 days), but it's not a long-term solution; a rough estimate of how long evb could buffer data is sketched after this list. Action item (RunCos/DAQ team): follow up with Data Management experts about this and formulate a plan. This is similar to (but not exactly the same as) the plan we need for if evb dies and we replace it with near1.
  • We could consider using uboonedaq-evb to take over some data management projects, since we will be adding a 10 Gb card there.
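
As referenced in the list above, a back-of-the-envelope way to see how long evb could buffer data without near1 is the free space on the data area divided by the daily data rate. The rate below is a placeholder, not a measured number.

<pre>
#!/usr/bin/env python3
# Sketch: rough estimate of how many days evb could buffer raw data locally if
# near1 is unavailable. Run on evb; the daily data rate is a placeholder and
# should be replaced by the measured value.
import shutil

DATA_AREA = "/data/uboonedaq/rawdata"
DAILY_RATE_TB = 1.0   # placeholder TB/day written to disk

free_tb = shutil.disk_usage(DATA_AREA).free / 1e12
print("free space in %s: %.1f TB" % (DATA_AREA, free_tb))
print("roughly %.1f days of buffering at %.1f TB/day"
      % (free_tb / DAILY_RATE_TB, DAILY_RATE_TB))
</pre>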

Is the near1 RAID backed up? The online monitor archive there (/datalocal/om) is non-critical, but it would be nice not to lose it.

OM notes (Nathaniel)

  • The run control script needs to change which machine launches the online monitor (Yun-Tse did this)
  • The Lizard needs to change where it looks for data. Change the symlink Lizard/server/serve_hists.cgi to point to do_serve_hists.cgi (a sketch of this is given below). Ensure the correct (new) location for the OM files is in Lizard/config/config.pl and that it matches what is in the near1 online-monitor section of the run config (was /datalocal/omtemp)
  • When recovering, copy files from /datalocal/omtemp back to near1:/datalocal/om
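
A minimal sketch of the symlink change described above, assuming it is run from the Lizard/server directory on the machine now hosting the Lizard; the old CGI (or link) is kept as a .bak copy. Remember to also check Lizard/config/config.pl as noted above.

<pre>
#!/usr/bin/env python3
# Sketch of the Lizard symlink change: make serve_hists.cgi point at
# do_serve_hists.cgi. Assumes it is run from the Lizard/server directory;
# the previous file or link is kept as a .bak copy.
import os

LINK, TARGET = "serve_hists.cgi", "do_serve_hists.cgi"

if os.path.lexists(LINK):
    os.rename(LINK, LINK + ".bak")   # keep whatever was there before
os.symlink(TARGET, LINK)
print("%s -> %s" % (LINK, os.readlink(LINK)))
</pre>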

If near2 fails (revised by Nathaniel)

  • Near2 runs the webserver for the online monitoring Lizard (is this correct?), and does odd jobs for SN monitoring. It is unique in that it is the only machine besides ws01/ws02 that is accessible from outside the DAQ network (although it is accessible only from inside the FNAL network). This also means near2 SHOULD be getting more patches.

If it fails, SN monitoring is simply shut down.

Lizard can be run from near1 directly. The problem with this is that the firewall prevents access from outside the DAQ cluster, so shifters would have to use an SSH tunnel to connect to the webserver (as sketched below)... and would have to have a valid ws01 account to get through the gateway. To get around this, we could relax the firewall rules. (I would prefer if near1 were exposed to FNAL, but this idea was shot down in early DAQ deployment as being insecure.)
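
For reference, this is the kind of SSH tunnel a shifter would need in that configuration, going through the ws01 gateway to reach a webserver on near1. The gateway address, account name and webserver port are all assumptions and would need to be replaced with the real values.

<pre>
#!/usr/bin/env python3
# Sketch of the SSH tunnel a shifter would use to reach a Lizard webserver on
# near1 from outside the DAQ cluster, hopping through ws01. The gateway login
# and the webserver port (assumed 80) are placeholders.
import subprocess

GATEWAY = "shifter@ws01"    # hypothetical gateway account and host
TUNNEL = "8080:near1:80"    # local port 8080 -> near1's webserver port

# While this runs, the monitor would be reachable at http://localhost:8080/
subprocess.run(["ssh", "-N", "-L", TUNNEL, GATEWAY])
</pre>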