Figuring out a PMT problem quickly from the readout-expert side
There have been very frequent DAQ crashes since the power came back yesterday after the shutdown.
I watched DAQ experts and shifters struggle to debug the problem until midnight, then take a few more hours to finally reach the hypothesis that it is a PMT issue (it is).
I woke up around 5 AM, took a look at a log, and reached the same conclusion (before seeing their gchat about PMTs) in about 10 minutes, without trying more DAQ runs and configs.
Our knowledge can be useful for debugging an issue like this, so I note here how I took a look.
If you do not know this, FYI, DAQ logs can be found here:
) currently on-going run … /home/uboonedaq/daqlogs/uboonedaq/
) “recent” but old runs … /home/uboonedaq/daqlogs/uboonedaq/old
Log files are named like seb-X-RUN-DATES*.log.
You’ll find .out.log and .err.log extensions; they come from the stdout and stderr streams respectively.
I copied a set of example logs here:
0th: realize the size of the file
It is 145 MB!
I explain below, but this file records 16-bit data words in ASCII format from an event where the problem occurred.
If you do:
wc -w /data/kterao/log_example/run_11328/seb-10-uboonedaq-11328-05.16.2017-00.04.47.err.log
You’ll find there are 26 million words, the majority of which are 16-bit data words.
That tells you immediately: “this is way too much data given our PMT rate”.
Let us overestimate our data rate … 3 FEMs = 48 x 3 channels (not all are connected to a PMT, but I want to overestimate here), 5 kHz cosmic-ray rate, 40 words/cosmic, 6.4 ms readout per event, plus 1500 words/channel of overhead.
That’s 48 * 3 * (5 * 6.4 * 40 + 1500) = 400k words. Not 26 million, i.e. 26000k words.
So that alone should alarm you that the data rate from the PMTs is crazy (though you should check the log file contents anyway).
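The overestimate above can be reproduced in a few lines (all numbers are taken straight from this note):

```python
# Rough overestimate of expected PMT data words per event,
# using the assumptions stated in the text.
fems = 3                  # number of PMT FEMs
channels_per_fem = 48     # overestimate: count all channels, connected or not
cosmic_rate_khz = 5       # cosmic-ray rate, kHz
readout_ms = 6.4          # readout window per event, ms
words_per_cosmic = 40     # data words per cosmic pulse
overhead_words = 1500     # extra words per channel

words_per_channel = cosmic_rate_khz * readout_ms * words_per_cosmic + overhead_words
total_words = fems * channels_per_fem * words_per_channel
print(int(total_words))  # ~400k words, nowhere near the 26 million seen in the log
```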
1st: tail err.log files all at once
In the output, you will see a hex dump for any seb that crashed without exception handling.
In this particular case, you will see that seb10 is the only one that crashed.
NOTE: seeing a hex dump means that the seb crashed, but it does not mean that the seb itself had the problem. This can also happen when data could not be transferred to the evb, a process which has nothing to do with the readout.
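If you prefer to do the "tail all at once" step programmatically, here is a minimal Python sketch; the glob pattern is my assumption based on the seb-X-RUN-DATES*.log naming scheme above:

```python
# Sketch: print the last few lines of every seb err.log at once.
# The glob pattern is an assumption based on the log naming scheme.
import glob
from collections import deque

def tail(path, n=10):
    """Return the last n lines of a text file without holding it all in memory."""
    with open(path, 'r', errors='replace') as f:
        return list(deque(f, maxlen=n))

for log in sorted(glob.glob('/home/uboonedaq/daqlogs/uboonedaq/seb-*.err.log')):
    print('==>', log, '<==')
    print(''.join(tail(log)))
```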
2nd: if this is a seb-10 error, it is most likely due to corrupted data. Take a look.
You’ll see something like this in the content:
Caught exception in ub_PMT_WindowDataCreatorHelperClass::populateChannelDataVector() Message: datatypes_exception Message: Junk data: Bad PMT Window Header channel number: 59
Object gov::fnal::uboone::datatypes::ub_MarkedRawChannelData<gov::fnal::uboone::datatypes::ub_PMT_WindowHeader_v6, gov::fnal::uboone::datatypes::empty_trailer> const*.
Object gov::fnal::uboone::datatypes::ub_PMT_WindowHeader_v6 const*.
Channel[3b], Discriminator[5, UNKNOWN], Marker9, RAW[0x9a3b]
Frame1, Sample[17f1a], Markers[3, 2], RAW[0xb937, 0xaf1a]
This is an error report from uboonedaq_datatypes software package.
It briefly tells you what was attempted and what failed.
You can almost always assume that what was attempted is data decoding, because that is the only thing datatypes does.
Take a careful look: it tells you that it attempted to decode RAW[0x9a3b] … and found Channel[3b] and Discriminator[5, UNKNOWN] … channel 3b = 59, and we only have 48 channels per FEM, so that’s unreal.
At this point you wonder “is it the channel header corrupted? or this is just totally random data?”.
That’s what you need to get to, at this point.
3rd: look into the neighboring data words
So you know the problem occurred at data word 0x9a3b … so we can search for this word, look at the neighboring data words, and see if they make sense.
The rest of err.log (which is HUGE) contains the full set of data transferred.
So you can search for 0x9a3b… and you should find its first appearance at line 4316 of this log file.
So let’s look at the neighbors… but you have to know some patterns of hex words to debug this problem.
a) PMT channel data starts with 9XYY, where X encodes the discriminator ID and YY the channel number… valid discriminator IDs are 2, 4, or 8.
b) PMT channel data ends with bXXX, where XXX is the ADC value of the last word (so you can look for a leading “b” to find the last data word).
c) PMT channel data ADC words look like aXXX, where XXX is a 12-bit ADC value, except for d) below.
d) PMT channel data ADC words look like 8XXX, where XXX is a 12-bit ADC value; this happens every 8 words for the purpose of periodic clock marking.
To be clear, d) ensures that the ADC words are in sync with the clock, and if you see anything deviating from the “every 8 words” rule, that means data corruption, either due to the clock being inherently out of sync or memory being overwritten due to a buffer overflow.
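The patterns a)–d) above can be written down as a tiny word classifier. This is a sketch based only on the nibble layout described in this note, not on the actual uboonedaq_datatypes decoder:

```python
# Classify a 16-bit PMT data word per the patterns a)-d) described above.
# Sketch only; the real uboonedaq_datatypes decoder may differ in detail.
def classify_pmt_word(w):
    marker = (w >> 12) & 0xF       # top nibble: 9=header, b=trailer, a/8=ADC
    if marker == 0x9:
        disc = (w >> 8) & 0xF      # X nibble: discriminator ID (2, 4, or 8)
        chan = w & 0xFF            # YY byte: channel number (0-47 per FEM)
        ok = disc in (2, 4, 8) and chan < 48
        return ('header' if ok else 'corrupted header', disc, chan)
    if marker == 0xB:
        return ('trailer (last ADC word)', None, w & 0xFFF)
    if marker == 0xA:
        return ('ADC word', None, w & 0xFFF)
    if marker == 0x8:
        return ('ADC word + clock marker (every 8th word)', None, w & 0xFFF)
    return ('unknown', None, None)

print(classify_pmt_word(0x921b))  # valid header: disc 2 (cosmic), channel 0x1b = 27
print(classify_pmt_word(0x9a3b))  # corrupted: disc 0xa is invalid, channel 59 > 47
```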
Now let’s get back to 0x9a3b, which has 2 immediate issues: the discriminator ID is invalid, and the channel number is unreal.
So this might not even be a channel header.
In fact, if you count backward from this word, you find 87fd exactly 8 words apart.
This suggests this 9a3b is a corrupted ADC word which was supposed to be a time marker.
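The every-8-words rule also makes this kind of corruption easy to test mechanically. A sketch, using toy data rather than the real log:

```python
# Sketch: check that 8XXX clock-marker words sit at a fixed phase modulo 8.
# More than one phase in the result means the stream lost sync somewhere.
def clock_marker_phases(words):
    return {i % 8 for i, w in enumerate(words) if (w >> 12) == 0x8}

# Toy stream: clock markers at positions 0 and 8 (in sync),
# then a copy with a spurious marker injected at position 13.
good = [0x8000] + [0xa000] * 7 + [0x8000] + [0xa000] * 7
bad = good[:13] + [0x8000] + good[14:]
print(clock_marker_phases(good))  # {0} -> single phase, in sync
print(clock_marker_phases(bad))   # {0, 5} -> two phases, corruption
```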
Now, where’s the real header? If you search for “ 9” (note the space before the 9), you can easily spot channel headers in the file.
I attach the highlighted (by emacs search) png to this message.
You immediately notice a few things:
) There are a lot of 0x921b … which is cosmic discriminator (2), channel 27 (0x1b).
) There is another corrupted word, 97df, which also sits where a time marker is supposed to be (so this word is not a real header; it should have been 87df).
The first point is consistent with the high-PMT-rate assumption we got from this log file’s size.
And, by this point, it is clear as hell that the data is really corrupted. So do not dare blame the decoding algorithm.
We can also count the # of cosmic discriminator words (approximately) from the command line with a one-liner:
[kterao@ubdaq-prod-ws01 ~]$ python -c "print len([ w for w in open('/data/kterao/log_example/run_11328/seb-10-uboonedaq-11328-05.16.2017-00.04.47.err.log','r').read().split() if w.startswith('92') and len(w)==4])"
… so 212k cosmic discriminator firings! That’s way too much.
You can change "w.startswith('92')" in the above command to "w.startswith('921b')" to search for the specific channel 27 as well.
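The one-liner above can also be expanded into a per-channel breakdown (same file, same 4-character 92YY filter; the function name is mine):

```python
# Count cosmic-discriminator header words (92YY) per channel in an err.log dump.
# Expanded from the one-liner above.
from collections import Counter

def count_cosmic_headers(path):
    counts = Counter()
    with open(path, 'r', errors='replace') as f:
        for w in f.read().split():
            if len(w) == 4 and w.startswith('92'):
                try:
                    counts[int(w[2:], 16)] += 1  # YY byte = channel number
                except ValueError:
                    pass  # a '92'-prefixed word that is not valid hex
    return counts
```

`count_cosmic_headers(path).most_common(5)` then shows which channels dominate; for this run, channel 27 should top the list.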
I hope this posting is helpful; these 4 steps show how I briefly take a look at a log and understand what is causing the data corruption.