Case Study - RAID Rebuild on EVB

The Problem

  • Around 2.20pm the shifter started reporting DAQ crashes, with major DAQ alarms such as sebXX/memfree and sebXX/Circular_Buffer_Occupancy_sebapp. The DAQ remained extremely unstable from then on.
  • There were also reports of the slow monitoring being very laggy, and many invalid-status slow monitoring alarms (meaning that the monitoring could not connect to the devices it was trying to read).
  • Andy (DAQ expert) was called around 6.30pm as the problems had persisted all afternoon and it was becoming almost impossible to run.

Andy’s Investigation

  • First thing Andy did: ask the shifter to start a new run. The shifter said that he already had (multiple times, because of the DAQ instability) and it hadn’t helped.
  • The DAQ expert suspected that the DAQ might not be able to handle the trigger rate, but there was no sign of the trigger rate having gone up. He made a new config file with a lower EXT trigger rate (Strobe1/frequency set to 0.99 Hz instead of 6.99 Hz) to see if that stabilised things (not a solution, but if the DAQ does run more stably at the lower rate, that tells us something).
  • The network experts were informed, because there were also problems with the slow monitoring. They investigated and found no issues, which seems to rule out the network as the cause.
  • Next thing: look at ganglia metrics. Started with evb because the DAQ alarms were coming up for all SEBs, not just one in particular (implies that there’s a problem with a central machine). (Note: get to these plots by going to http://ubdaq-prod-evb.fnal.gov:8080/gweb/ — with SOCKS proxy on to ws01 — and choosing ubdaq-prod-evb-priv in the drop-down box that says “--Choose a Node”). Look through all plots and see if anything stands out.
  • Noticed something strange in event build time: http://dbweb5.fnal.gov:8080/ECL/uboone/E/show?e=62143
    • Event build time is roughly the inverse of the data rate we can accept. The normal average event build time is at most ~15 milliseconds, meaning that the maximum rate we can run at is ~67 Hz. Since around 2pm, the average build time had jumped up to ~100-200 milliseconds, giving a maximum rate of only ~5-10 Hz (see the worked arithmetic after this list). That’s a problem, because the standard run plan config file has an EXT rate of 6.99 Hz and the standard BNB rate is 5 Hz.
  • Also checked the SWtrigger rate vs raw plots.
    • These give the rate of raw triggers issued by the detector divided by the rate of software triggers. This is roughly 1/(passing fraction), i.e. the average number of raw events per event that passes the software trigger (see the short sketch after this list). An increase in this metric indicates that more raw triggers are being issued (and could be indicative of e.g. a hot PMT). Nothing strange was seen in this case.
  • Another good check: does the raw trigger rate agree with expectations?
    • Checked EXT, BNB, and NuMI trigger rates. Nothing strange was seen here, again ruling out the possibility that we suddenly have lots of extra triggers.
    • Note that the expected EXT rate is given in the name of the config file (with current naming system)
  • So: what can cause the event build time to increase?
    • Probably not caused by a larger number of triggers/PMT noise, because of the two checks above.
    • What about the amount of data per trigger going up? This could happen because of a large amount of TPC noise, which would make the data compression work less well. However, in this case we’d need a factor of ~10 increase in TPC noise to explain the increase in event build time. If that were happening, nonsense would be coming out of the detector and it would be very obvious (lots of other alarms, crazy things in the online monitoring). Since that wasn’t being reported by the shifter, the DAQ expert concluded this wasn’t the problem.
    • Could be affected by the status of the evb itself. “Event building” pretty much just means copying data from the SEBs to the EVB, building the event, and writing it to disk. So check the ganglia metrics corresponding to all three (a sketch of pulling these metrics programmatically is given after this list):
      • Network metrics: show packets and bytes sent and received. These were flat with no drastic change, indicating that the evb was not receiving a ton more data from the detector, or sending out a lot more data to be written.
      • Load metrics: these showed an increase in the load on the evb starting around 2pm. Note that the load metrics need to be interpreted alongside the trigger rates (is the load higher because we’re asking it to write more data by increasing the trigger rate? If the trigger rate hasn’t changed, it points to something on the evb itself).
      • CPU metrics: again, an increase around 2pm was seen on some of these metrics (particularly CPU_wio).
    • All of this indicates that there is something extra happening on evb, which is causing the event build time to increase.
  • While this investigation was ongoing, the run seemed to be slightly more stable with the EXT frequency at 0.99 Hz (the beam was off, so this is equal to the total data-taking rate). Since the DAQ expert estimated that, with current event build times, we could sustain a rate of ~5 Hz, later in the night the runco requested that the config be changed to run at 4 Hz overnight.
  • At this point, someone informed the DAQ expert that a rebuild of the evb:/data RAID array had been started after a disk failure that morning (this is the array that holds /data, /home, and /uboone-rework on the evb, so when we take data we are writing to the /data directory on this array). This could explain the problem - if the RAID array is being rebuilt, that would explain “something extra happening on evb”.
  • A hypothesis was made: the RAID rebuild and the DAQ were both trying to access the same disks, and that was slowing everything down. This was tested by stopping the DAQ (since the beam was off) to see whether the RAID rebuild speed increased. It did - from ~1% per hour to ~5% per hour! So it seems that this was the problem (a sketch for monitoring the rebuild rate is given below).
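
Worked Examples

The arithmetic behind the event build time check, as a minimal sketch. The ~15 ms and ~100-200 ms build times are the ones quoted above; the rest is illustration only:

```python
# Maximum sustainable trigger rate is roughly the inverse of the average
# event build time (the EVB can only accept events as fast as it can build them).
def max_rate_hz(avg_build_time_ms):
    return 1000.0 / avg_build_time_ms

print(f"normal (~15 ms/event):          ~{max_rate_hz(15.0):.0f} Hz")
print(f"incident, low  (~100 ms/event): ~{max_rate_hz(100.0):.0f} Hz")
print(f"incident, high (~200 ms/event): ~{max_rate_hz(200.0):.0f} Hz")
# ~67 Hz normally, but only ~5-10 Hz during the incident: below the standard
# 6.99 Hz EXT + 5 Hz BNB configuration, which is why runs were crashing.
```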
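
The “SWtrigger rate vs raw” check is just a ratio; a short sketch with made-up rates (the numbers below are purely illustrative, not from this incident):

```python
# raw/SW trigger ratio ~ 1/(software-trigger passing fraction), i.e. the average
# number of raw triggers per event that passes the software trigger.
def raw_over_sw(raw_rate_hz, sw_rate_hz):
    return raw_rate_hz / sw_rate_hz

baseline = raw_over_sw(raw_rate_hz=10.0, sw_rate_hz=1.0)   # hypothetical baseline
current  = raw_over_sw(raw_rate_hz=10.2, sw_rate_hz=1.0)   # hypothetical current value
print(f"baseline raw/SW ratio: {baseline:.1f}, current: {current:.1f}")
# A clear jump in this ratio would point to extra raw triggers (e.g. a hot PMT);
# in this case study the ratio was flat, so extra triggers were ruled out.
```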
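
A sketch of pulling the same node metrics programmatically rather than through the gweb plots. Assumptions: gmond’s default XML report port (8649) is reachable from wherever this runs; the host name is the one used on this page; the metric names (load_one, cpu_wio, bytes_in/out, pkts_in/out) are standard ganglia metrics rather than anything confirmed for our setup:

```python
# Poll a gmond daemon for a few host metrics (network, load, CPU wait-IO).
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "ubdaq-prod-evb-priv"   # assumed reachable from your vantage point
GMOND_PORT = 8649                    # ganglia default tcp_accept_channel
METRICS = ("load_one", "cpu_wio", "bytes_in", "bytes_out", "pkts_in", "pkts_out")

def fetch_gmond_xml(host, port, timeout=10):
    """Read the full XML dump that gmond sends on connect, then closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as s:
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def host_metrics(xml_bytes, wanted):
    """Return {host: {metric: (value, units)}} for the requested metric names."""
    root = ET.fromstring(xml_bytes)
    out = {}
    for host in root.iter("HOST"):
        vals = {m.get("NAME"): (m.get("VAL"), m.get("UNITS"))
                for m in host.iter("METRIC") if m.get("NAME") in wanted}
        if vals:
            out[host.get("NAME")] = vals
    return out

if __name__ == "__main__":
    xml_bytes = fetch_gmond_xml(GMOND_HOST, GMOND_PORT)
    for host, vals in host_metrics(xml_bytes, METRICS).items():
        print(host)
        for name, (val, units) in sorted(vals.items()):
            print(f"  {name:10s} {val} {units}")
```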
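
Finally, a sketch of watching the rebuild rate used to test the hypothesis. This assumes the array is Linux software RAID (mdadm), so progress appears in /proc/mdstat; if it sits behind a hardware controller, the vendor CLI would be needed instead and the parsing below would not apply:

```python
# Estimate RAID rebuild speed in %/hour by sampling /proc/mdstat twice.
# An active rebuild shows a line like:
#   [>....]  recovery =  3.1% (...) finish=...min speed=...K/sec
import re
import time

PROGRESS_RE = re.compile(r"(?:recovery|resync)\s*=\s*([\d.]+)%")

def rebuild_percent(path="/proc/mdstat"):
    with open(path) as f:
        match = PROGRESS_RE.search(f.read())
    return float(match.group(1)) if match else None

if __name__ == "__main__":
    start_pct, start_time = rebuild_percent(), time.time()
    time.sleep(600)                              # sample again after 10 minutes
    end_pct = rebuild_percent()
    if start_pct is None or end_pct is None:
        print("no rebuild/resync in progress")
    else:
        pct_per_hour = (end_pct - start_pct) / ((time.time() - start_time) / 3600.0)
        print(f"rebuild progressing at ~{pct_per_hour:.1f} %/hour")
```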