New Expert's Guide to Diagnosing DAQ problems

To paraphrase Wes's slides from the DAQ school: DO NOT PANIC! If you do panic, DON'T LET ANYONE ELSE PANIC. Don't let the shifter know you're panicking - it won't help! Rather than say "I have no idea what's going on", say "I'm going to log in and take a look and call you back soon".

Remember that you can always call other DAQ experts for backup - we are a team! Readout and Data Management experts might also be able to help you troubleshoot, so don't be afraid to call them (or ask the shifter to contact them).

Having said that, here are some general tips that you might find useful. If you have no idea where to start, following the steps here should help give you an idea of what the problem is.

General comments/advice for troubleshooting:

  • Never be short or upset about being called. Remember that the shifter, the run co-ordinators, and the other experts are your friends! Ask their names. Ask them to help you, and thank them for doing so.
  • Always, always elog everything you do. Remind shifters to put plots and comments in the elog, and put plots and comments in the elog yourself! Use the elog to communicate necessary information to relevant experts, and to keep a record of what you investigated and what you saw. It might come in useful for future-you if the same problem comes up again.
  • If we all use the elog properly it can be a great resource - you can search for key phrases to see if the problem has occurred before, and what experts said about it then.
  • Always kindly ask what state the detector is in:
    • Is there beam? From BNB and NuMI?
    • Is there activity on the platform? (That can cause DAQ instability, but it normally sorts itself out once people leave the platform)
    • Is anything in alarm? What exactly is in alarm? Are there any acknowledged alarms? How long has it been in alarm?
    • Does it look like we are getting data? At what rate?

Where to start looking for problems

  • “DAQ problems” are almost never actually a problem with the DAQ itself. Stopping and restarting the run kills and restarts every DAQ process, and therefore gets rid of whatever buggy DAQ process was causing problems. So if a run restart doesn’t fix the problem, it’s not an issue with the DAQ processes.
  • On the other hand, symptoms of the problem will often show up in the DAQ, and so the shifter will call you. That makes our job to diagnose the problem and route it to the other relevant experts if necessary.
  • Often it is a problem with the readout electronics. However, we can’t just go accusing the readout electronics of having problems without investigating and showing that they do.
    • Have a look at the trigger rates and trigger fractions in the ganglia metrics (see the section below on how to use the ganglia metrics and what to check). Make sure that they match what you expect from past metrics and from the config files. Problems in the readout electronics could cause more, fewer, or bad triggers to be sent, so they may show up here.
    • Also check the log files for signs of data corruption.
    • If you do find a problem that points towards the readout electronics, call (or have the shifter call) the readout electronics expert.
  • If you can’t pin the problem on the readout electronics, the next most likely culprits to check are the disks, the network, or extra processes running on the EVB.
    • Again, look at the ganglia metrics: check out the load metrics, CPU metrics, and network metrics.
    • If there seems to be a problem with the network, ask the runco to contact the network people.
    • If there seems to be a problem with the disks, that’s an issue for the SLAM team - again, ask the runco to contact them.

Things to check on ganglia (a starting point if you don’t know where to begin):

  • Note: get to the ganglia metrics by going to http://ubdaq-prod-evb.fnal.gov:8080/gweb/ (with a SOCKS proxy to ws01 enabled; see this page for more instructions) and choosing, for example, ubdaq-prod-evb-priv in the drop-down box that says “--Choose a Node”. (A sketch for pulling metric values directly follows this list.)
  • Some ganglia metrics are reported to SlowMon. If the DAQ is in alarm, the alarm is probably coming from one of those metrics - you can check it on SlowMon if you prefer.
  • Every SEB has its own circular buffer. If you’re getting errors related to a single SEB only, then check the ganglia metrics for that SEB. If you’re getting errors related to multiple different SEBs then it’s more likely to be related to one of the central computers. In this case, the most likely culprit is the EVB, but if you see nothing on the EVB you may also want to check the SMC.
  • Failing that, the only SEB that’s “special” is SEB 10, because it has the trigger stream and PMTs. If you can’t see anything obvious on EVB or SMC, you could try looking at SEB 10.
  • A good place to start: does the raw trigger rate agree with expectations?
    • Check EXT, BNB, and NuMI trigger rates.
    • Note that the expected EXT rate is given in the name of the config file (with the current naming system).
  • Another thing to check: the SWtrigger rate vs raw plots.
    • These give the rate of raw triggers issued by the detector divided by the rate of triggers passing the software trigger - roughly 1/(passing fraction), or the average number of raw triggers per event that passes the software trigger (for example, a raw rate ten times the software-trigger rate corresponds to a passing fraction of 10%). An increase in this metric means more raw triggers are being issued relative to those passing, and could be indicative of e.g. a hot PMT.
  • The system metrics are invaluable!
    • Network traffic: direct way to see data flowing
    • Memory usage: see if data is getting stalled on a process. Look for "Use", not "Cache" (though "Cache" should really be small too)
  • Some of the most useful SEB metrics are:
    • NU-DMAReadRate
    • NU-DMA-CircularBufferOccupancy
    • WriteFragmentDataRate
    • FragmentOutboundQueueDepth
  • Some of the most useful EVB metrics are:
    • Various TrigRate metrics
    • WriteDataRate
    • WriteEventRate
    • RemainingEventStoreCapacity
  • If all else fails: look through all the plots you can see and check for any recent changes (or changes at times that could be related to the incident you’re investigating)
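
If you would rather pull numbers than eyeball the web plots, a stock gmond daemon also serves its current metric values as XML over TCP (port 8649 by default). The following is a minimal sketch only, assuming that port is actually reachable from wherever you are logged in and that the DAQ metrics appear under the names listed above; treat the host name and port as things to verify rather than a documented interface.

    import socket
    import xml.etree.ElementTree as ET

    # The SEB metrics listed above; whether gmond reports them under exactly
    # these names is an assumption to check against the web interface.
    WATCH = {
        "NU-DMAReadRate",
        "NU-DMA-CircularBufferOccupancy",
        "WriteFragmentDataRate",
        "FragmentOutboundQueueDepth",
    }

    def fetch_gmond_xml(host, port=8649, timeout=10):
        """Read the full XML dump that a default gmond tcp_accept_channel serves."""
        chunks = []
        with socket.create_connection((host, port), timeout=timeout) as sock:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return ET.fromstring(b"".join(chunks))

    def print_watched_metrics(host):
        tree = fetch_gmond_xml(host)
        # gmond nests GRID/CLUSTER/HOST/METRIC elements in its XML dump.
        for host_el in tree.iter("HOST"):
            for metric in host_el.iter("METRIC"):
                if metric.get("NAME") in WATCH:
                    print(host_el.get("NAME"), metric.get("NAME"),
                          metric.get("VAL"), metric.get("UNITS"))

    if __name__ == "__main__":
        # Hypothetical node name for illustration; substitute whichever SEB or
        # EVB node (e.g. the -priv names from the ganglia drop-down) you care about.
        print_watched_metrics("ubdaq-prod-seb10-priv")

If the XML port turns out not to be open from where you are, just stick with the web interface described above.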

The DAQ Logs

DAQ logs for each of the processes are written to ~uboonedaq/daqlogs/uboonedaq on each of the machines (the EVB and the SEBs). These logs get moved to ~uboonedaq/daqlogs/uboonedaq/old at the start of a new run. The logs contain a lot of detail (everything that gets printed out by any of the DAQ processes as they run!) and can be a very useful tool if you need to dig into a problem that you can't diagnose elsewhere (a small search sketch follows the lists below).

  • In the SEB logs...
    • Search for "ProcessingFragmentsStart" to get the start of a run
    • You should see "StopRunRequest" when the SEB receives a stop signal
    • You should see SEB fragments getting sent!
  • In the EVB log...
    • Search for "ProcessingFragmentsStart" to get the start of a run
    • You should see the EVB receiving fragments, and fragments being inserted into EventStore
    • There will be a printout of the global header for each event written to disk
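
For digging through a long log, something as simple as the sketch below (or plain grep) will do: it just prints every line containing one of the run markers above. The directory layout is taken from this page but the file naming is not otherwise checked, so treat the default path as an assumption and pass explicit file paths if in doubt.

    import os
    import sys

    # Run markers mentioned above: run start (SEB and EVB logs) and the stop
    # request seen in the SEB logs.
    MARKERS = ("ProcessingFragmentsStart", "StopRunRequest")

    def scan_log(path, markers=MARKERS):
        """Print every line of the log that contains one of the marker strings."""
        with open(path, errors="replace") as log:
            for lineno, line in enumerate(log, start=1):
                if any(marker in line for marker in markers):
                    print(f"{path}:{lineno}: {line.rstrip()}")

    if __name__ == "__main__":
        # Pass one or more log files on the command line; by default, look at
        # whatever sits in the current-run log directory named on this page.
        default_dir = os.path.expanduser("~uboonedaq/daqlogs/uboonedaq")
        paths = sys.argv[1:] or [
            os.path.join(default_dir, name)
            for name in sorted(os.listdir(default_dir))
            if os.path.isfile(os.path.join(default_dir, name))
        ]
        for path in paths:
            scan_log(path)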