Tips to Make Life Easier

N.B. Not all of this will make sense if you haven't already read https://cdcvs.fnal.gov/redmine/projects/lbne-daq/wiki/Running_DAQ_Interface

How to run the calibration module (e.g. restart the trigger or change the trigger rate)

Instructions for controlling the LCM can be found here:

http://docs.dunescience.org:8080/cgi-bin/RetrieveFile?docid=621&filename=LCM_App_Note.pdf&version=1

Known Problems

Last updated Mar-7-2016

If the system is being run to the point of backpressure (indicated by an average event rate less than the normal 100 events/sec), about 20% of the time you'll see the asynchronous read error on start

Specifically, this can happen if (A) there's backpressure, and (B) the start transition follows a stop transition, as opposed to an initialize transition. What you'll see is the following:

Fri Mar 04 16:05:19 -0600 2016: %MSG-e TpcRceReceiver3:  BoardReader-lbnedaq1-5308 MF-online
Fri Mar 04 16:05:19 -0600 2016: Got error on aysnchronous read: system:14
Fri Mar 04 16:05:19 -0600 2016: RECV: state 1 mslice state 2 uslice 0 uslice size 16002764 mslice size 16002764 addr 0x7f58cd0e2f00 next recv size 18446744073693548852

...and then the DAQ will crash. The reason this occurs is that, if there was backpressure in the previous run, there may be some data from the previous run still coming from upstream; this data effectively appears as junk to the current run and causes an error. The best way to minimize the chance of this happening is to keep backpressure to a minimum.

SSP #8 is a spare, and shouldn't be used in DAQ running

Writing more than ~100 MB/s is likely to cause backpressure

During studies performed in December 2015, not all options were explored, but it looks like we can achieve a throughput of at most 100-110 MB/s before backpressure becomes an issue. When the hardware trigger rate is set to 1 Hz, each RCE will generate 2.5 MB/s. At 10 Hz, this becomes 25 MB/s, assuming ten microslices per trigger. The upshot is that if the hardware trigger is set to 10 Hz, you shouldn't expect to be able to run more than a couple of RCEs before backpressure occurs.
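The arithmetic above can be sketched as a pair of small shell functions. This is only a back-of-the-envelope estimate: the function names are made up for illustration, the 2.5 MB/s per RCE per Hz figure is taken from the paragraph above (ten microslices per trigger), and the default 100 MB/s ceiling is the low end of the range observed in the December 2015 studies, not a guaranteed limit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the DAQ software.
# Assumption: each RCE generates 2.5 MB/s per Hz of hardware trigger rate.

# Print the estimated aggregate RCE data rate in MB/s.
estimate_rce_throughput() {
    local trigger_hz="$1" n_rces="$2"
    awk -v hz="$trigger_hz" -v n="$n_rces" 'BEGIN { printf "%.1f\n", 2.5 * hz * n }'
}

# Succeed (exit 0) if the estimate exceeds the ceiling (default 100 MB/s).
would_backpressure() {
    local trigger_hz="$1" n_rces="$2" limit_mb_s="${3:-100}"
    awk -v hz="$trigger_hz" -v n="$n_rces" -v lim="$limit_mb_s" \
        'BEGIN { exit !(2.5 * hz * n > lim) }'
}

# Example: two RCEs at 10 Hz
#   estimate_rce_throughput 10 2    # prints 50.0
#   would_backpressure 10 5 && echo "expect backpressure"
```

Treat a "no backpressure" answer with caution: the sum counts RCE data only, so anything else being written at the same time eats into the same budget.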

Some warnings aren't actually warnings

During initialization of every run, for every component you can expect to see a message like the following:

This is nothing to be concerned about - although a nuisance, it's unavoidable given the version of the art package lbne-artdaq currently depends on, and doesn't mean anything. Another message you don't need to be concerned about is this one:

Resource-intensive programs might be running on the same hosts that the artdaq processes are running on

Not a problem per se, but something to be aware of: there's no guarantee that the artdaq processes which are processing data and writing it to disk are the only programs on their host taking up a significant amount of resources. One example of this is "cntrlGui", a GUI used to monitor the RCEs, which will sometimes run on lbnedaq3 (and possibly lbnedaq1 as well). This program is capable of using 10 MB/s of network bandwidth, which could have an effect on the throughput of the DAQ system. You can see that it's running by looking at the "Bytes Sent" and "Bytes Received" ganglia plots for a given host.

If DAQInterface throws an exception, you'll need to kill and relaunch it

Details in the official instructions, here

Signs that a run is working

No error messages, and no more than a couple of warnings per minute

Self-explanatory.

Look for the diskwriting aggregator to be processing an average of 100 events/s.

As of this writing (Feb-4-2016), all the fragment generators under normal conditions will send an average of about 100 millislices a second to the eventbuilders. After these millislices are assembled into complete events in the eventbuilders, they're then sent to the diskwriting aggregator (as opposed to the online monitoring aggregator). In the event of backpressure, typically fewer than 100 events/s will pass through the aggregator; you'll see rates such as 70 events/s, for example. To see this, in the MessageViewer window, under "Categories" (left side of the window), click on "Aggregator" so it's highlighted, then click "Set Filters"; this will display messages only from the aggregators. To return to seeing all messages, click on "Reset ALL".

Info on ganglia plots

Useful plots showing various measures of the performance of artdaq processes over the last hour can be found here. Plots covering the last two hours, four hours, and day are also available. What to look for is that the diskwriting aggregator is writing 100 events/s (as of Feb-4-2016), that neither eventbuilder is suffering from a steadily increasing number of incomplete events, and that the RCE and/or SSP fragment generators included in the run are generating 100 fragments/s.

These plots are a subset of http://lbne35t-gateway01.fnal.gov/ganglia/ ; the full site includes other plots, and may be useful for expert analysis.

If you wish to look at ganglia plots offsite, you'll need a VPN (Virtual Private Network); see http://ncs.fnal.gov/nvs/vpn.php for more information.

Please note: in some cases, if a process becomes unresponsive or dies, a given quantity associated with it (e.g., the fragment rate in a boardreader) will appear "constant" in the plot. This is because ganglia will just keep plotting the most recently sent value, which may in fact have been sent a while ago. In this sense, it's possible to miss a malfunctioning process if you rely only on the ganglia plot.

The only way to stop a stale value from being plotted is to use the process in a new run, where it can start sending new reports.
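One rough way to catch a flat-lined quantity is to check whether its most recent samples are all exactly identical, since a live rate normally fluctuates a little. The sketch below assumes you already have the samples as plain text, one value per line, oldest first; how you get them out of ganglia is not covered here, and the function name looks_stale is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of ganglia or the DAQ software.
# Reads one sample per line on stdin (oldest first) and succeeds (exit 0)
# if at least N samples are available and the last N are all identical.
looks_stale() {
    local n="${1:-10}"
    tail -n "$n" | awk -v n="$n" '
        NR == 1      { first = $1 }    # remember the first of the last N samples
        $1 != first  { differs = 1 }   # any change means the value is still updating
        END          { exit !(NR >= n && !differs) }'
}

# Example, with a made-up file of fragment-rate samples:
#   looks_stale 10 < fragment_rate_samples.txt && echo "value may be stale"
```

A positive result is only a hint: a quantity that is genuinely constant will trigger it too, so check the process itself before concluding it has died.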

Scripts which make life easier

All are located in lbne35t-gateway01:/data/lbnedaq/bin:

show_recent_runs.sh <N>

This will summarize the details of the last <N> runs. For example, as I write this, if I run show_recent_runs.sh 5, I see the following:

Run 4082 (Oct 29 16:27) : rces_and_ssps                 : rce00 rce01 rce02 rce04 rce05 rce06 rce07 
Run 4083 (Oct 29 16:44) : rces_and_ssps                 : rce00 rce01 rce02 rce04 rce05 rce06 rce07 
Run 4084 (Oct 29 16:56) : rces_and_ssps                 : rce00 rce01 rce02 rce04 rce05 rce06 rce07 
Run 4085 (Oct 29 22:13) : rces_and_ssps_nodiskwrite     : ssp01 
Run 4086 (Oct 29 22:32) : rces_and_ssps_nodiskwrite     : rce00 rce01 

As you can see, each row tells you the run number, run start time, configuration chosen, and components chosen.

show_all_logfiles_for_run.sh <run number> (optional second argument)

This script can be used in two ways. If you only supply it with the run number argument (i.e., the first argument), it will list the logfiles written to during the course of the run. If you supply ANY additional, second argument ("1", "ozymandius", etc.) then the "less" pager will be run on each file in turn, allowing you to examine their contents. To skip to the next file, hit "q". When you hit "q" on the last file, the program exits.

show_warnings_for_run.sh <run number>

This script will actually show you both warning AND error messages printed to the logfile(s) for the supplied run.
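If you only have a logfile to hand and want a similar view, something like the following gets you part of the way. It relies on the severity tags visible in the log excerpts on this page ("%MSG-e" for errors, "%MSG-w" for warnings); the function name is made up, and this is not a description of how show_warnings_for_run.sh itself works.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not one of the scripts in /data/lbnedaq/bin.
# Prints the header line of every warning (%MSG-w) and error (%MSG-e)
# message found in the given files, or on stdin if no files are given.
filter_warnings_and_errors() {
    grep -E '%MSG-[ew] ' "$@"
}

# Example, with a placeholder filename:
#   filter_warnings_and_errors some_logfile.log
```

Note that this prints only the header line of each message (timestamp, severity, label and process), not the message text on the lines that follow it.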

show_specific_message.sh <last N runs> <component name> <message label> <message>

A somewhat complicated but powerful script, this will tell you which of the last N runs that contained the component <component name> printed out "<message>" with the label "<message label>". E.g., if I want to know which of the last 100 runs that included the penn fragment generator yielded the following type of message:

Wed Jan 20 19:31:21 -0600 2016: %MSG-w PennDataReceiver:  BoardReader-lbnedaq1-5321 MF-online
Wed Jan 20 19:31:21 -0600 2016: Incomplete ReceiveMicroslicePayload received for microslice 2 (got 264 bytes, expected 416)
Wed Jan 20 19:31:21 -0600 2016: %MSG

...then I would execute

show_specific_message.sh 100 penn01 PennDataReceiver "Incomplete ReceiveMicroslicePayload received" 

If I don't care about the message label, I could replace it with a "*" or a - (note the quotes surrounding the asterisk; no quotes are necessary for the dash). So, e.g.,

show_specific_message.sh 100 penn01 - "Incomplete ReceiveMicroslicePayload received" 

In the output, only those runs which contained the component(s) are listed; those runs with an "X" next to them had the specified message appear. At the end, a summary is printed, which looks something like the following:

In the last 582 runs:

413 contained the desired component(s)

42 of those contained the desired message

Message rate was 42/413 = 10.2% +/- 1.6%
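The numbers in the summary can be reproduced by hand. The quoted uncertainty is consistent with a simple counting error, sqrt(42)/413 = 1.6%; whether the script actually computes it this way is an assumption, and the function name below is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of show_specific_message.sh.
# Prints the fraction of runs containing the message, with a sqrt(k)/n
# counting uncertainty (an assumed formula that matches the example above).
message_rate() {
    local hits="$1" total="$2"
    awk -v k="$hits" -v n="$total" 'BEGIN {
        if (n <= 0) exit 1    # no runs contained the component: rate undefined
        printf "%.1f%% +/- %.1f%%\n", 100 * k / n, 100 * sqrt(k) / n
    }'
}

# Example:
#   message_rate 42 413    # prints 10.2% +/- 1.6%
```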

launch_messageviewer

Taking no arguments, this script will simply launch a MessageViewer window. This is useful if, e.g., you're not in ROC West (or more specifically, not in view of the terminal from which DAQInterface was launched) but want to see the lbne-artdaq output in real time as it appears. HOWEVER, make sure you're in contact with the shifter when you launch this, as it will cause messages to STOP being sent to the shifter's MessageViewer window until you kill your MessageViewer window!

Miscellaneous Advice

Don't be surprised by external stop-starts

A script called "auto_file_close.sh" is launched whenever DAQInterface is launched; when diskwriting is on, this script will periodically issue automatic stop-starts to the DAQ in order to close an output ROOT file and start a new one. The frequency with which this occurs is set in the DAQInterface configuration file.

If you see an error, don't expect anything afterwards to work

A message appearing in bright red in the MessageViewer window won't necessarily end the run. However, if it doesn't, you might as well end the run yourself; you can't expect anything to work correctly after the error appears. Work is currently ongoing to reduce this burden on the shifter by automating shutdowns.

If a failure occurs during initialization, look for the error message BEFORE the one beginning with "Unknown exception creating a CommandableFragmentGenerator of type"...

During commissioning, a common failure mode is when an error occurs in the constructor of one of the lbne-artdaq fragment generators. When this happens, you'll reliably see a very long error message along the lines of "Unknown exception creating a CommandableFragmentGenerator of type "<fragment generator>" with parameter set "<fhicl parameters>" ", where <fragment generator> here is a placeholder for the actual name of the fragment generator that threw the exception and <fhicl parameters> is a placeholder for a (typically) very long string containing the full FHiCL document used in an attempt to initialize the fragment generator. While unavoidable, as this message is produced by artdaq, and artdaq isn't directly aware of the contents of the experiment-specific fragment generator code, it tends not to be as informative as the error message produced in the fragment generator's constructor itself, which appears before the "Unknown exception creating a CommandableFragmentGenerator" error message. When troubleshooting and posting in the e-log, please use this first error message from the fragment generator, not the "Unknown exception" error message.

It's possible to look at the DAQ's output if someone else is running it

Just run

tail -f /data/lbnedaq/daqlogs/daqinterface/DI.log

Don't let MessageViewer windows pile up

Whenever DAQInterface is launched, a MessageViewer window is launched alongside it. Over a period of several hours, in which DAQInterface may be killed and relaunched several times, several MessageViewer windows may be created. Each window takes up system resources, and if windows aren't killed, gateway01 will eventually slow down. If a window is no longer needed (i.e., it's not in active use and you're not using its messages to debug a prior run), then kill it when you get a chance.

A Calculator to Determine Whether a Given Set of Parameters Will Overwhelm the DAQ

Can be found here. Make sure the "Sheet1" tab at the bottom of the page is selected (as opposed to the "Sheet2" tab), change the relevant values (most often, this would be the "Trigger rate") and then check the resulting "Max Overwhelm Factor", making sure that it's comfortably less than 1.0. Note that a certain rudimentary skill level with spreadsheets is needed.