Project

General

Profile

Bug #10695

Data Logger: crash at run rollover

Added by Keith Matera about 4 years ago. Updated about 4 years ago.

Status:
New
Priority:
Normal
Assignee:
-
Category:
-
Start date:
10/29/2015
Due date:
% Done:

0%

Estimated time:
component:
base
Duration:

Description

Data logger will occasionally crash at run rollover. The most recent incidence of this was on the FD @ 10/28/2015 21:41:15 .

Logfile at /daqlogs/FarDet/Partition1/DataLogger/novadaq-far-datadisk-04/ndl_20151028_154736.log .

It appears that some debug statements were put in place. Who added these? Information pasted below:

21:41:14.588218 28 Debug test 1
21:41:14.588227 28 Debug test 2
21:41:14.590506 28 RunStream::addEvent Adding event of size 34883 to /data-a/FarDet/fardet_r00020972_s63_DDMoon.raw
21:41:14.590949 6 DB2EP::get_db: List is empty!
21:41:14.615514 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDSun.raw
21:41:14.615707 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDMoon.raw
21:41:14.615787 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDContained.raw
21:41:14.615873 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_t00.raw
21:41:14.615949 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_t05.raw
21:41:14.616036 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_t02.raw
21:41:14.616112 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_t04.raw
21:41:14.616188 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDenergy.raw
21:41:14.616268 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDnumu.raw
21:41:14.616341 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDfastmono.raw
21:41:14.616414 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDcalmu.raw
21:41:14.616480 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDslowmono.raw
21:41:14.616556 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDsnews.raw
21:41:14.616635 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_DDupmu.raw
21:41:14.616713 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00020973_s00_unknown.raw
21:41:14.616729 28 Debug test 3
21:41:14.616737 28 Debug test 4
21:41:14.616743 28 Debug test 5
21:41:14.616752 28 Debug test 6
21:41:14.616758 28 Debug test 7
\*** glibc detected \*** NovaDataLoggerapp: double free or corruption (fasttop): 0x0000000001f23f90 \*** (Backslashes mine, to prevent formatting)

======= Backtrace: =========
/lib64/libc.so.6[0x3844875e66]
/nova/ups/boost/v1_45_0/Linux64bit+2.6-2.12-gcc45/lib/libboost_regex-mt.so.1.45.0(ZdlPv+0x1f)[0x7f8e6264f0f7]
/nova/ups/boost/v1_45_0/Linux64bit+2.6-2.12-gcc45/lib/libboost_regex-mt.so.1.45.0(_ZN9
_gnu_cxx13new_allocatorIcE10deallocateEPcm+0x20)[0x7f8e62635eea]
/nova/ups/boost/v1_45_0/Linux64bit+2.6-2.12-gcc45/lib/libboost_regex-mt.so.1.45.0(_ZNSs4_Rep10_M_destroyERKSaIcE+0x4d)[0x7f8e626354b3]
...

History

#1 Updated by Keith Matera about 4 years ago

This problem is now our single largest source of downtime, averaging a bit more than 1 crash per day with error message "Error in DataLogger due to DataLogger not responsive". I looked into the DataLogger and RC logfiles, and found that these crashes are accompanied by a large number of size 0 data files on datadisk-04.

I present here an example.

12 seconds before the crash at the start of Run 21299, the DataLogger logfile's last output ( in /daqlogs/FarDet/Partition1/DataLogger/novadaq-far-datadisk-04/ndl_20151124_143125.log ) was:

23:45:47.927298 28 Debug test 1
23:45:47.927306 3 BEFORE Adding run_subrun (run 21298, subrun 63) to EraseContainerList sz=0
23:45:47.927313 3 AFTER Adding run_subrun (run 21298, subrun 63) to EraseContainerList sz=1
23:45:47.927320 28 Debug test 2
23:45:47.927504 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDupmu.raw
23:45:47.927603 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDsnews.raw
23:45:47.927679 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDslowmono.raw
23:45:47.927748 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDcalmu.raw
23:45:47.927816 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDfastmono.raw
23:45:47.927882 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDnumu.raw
23:45:47.927948 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDenergy.raw
23:45:47.928019 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_t04.raw
23:45:47.960319 28 RunStream::addEvent Adding event of size 16811 to /data-a/FarDet/fardet_r00021299_s00_DDfastmono.raw
23:45:47.960473 28 RunStream::addEvent done
23:45:47.967463 28 RunStream::addEvent Adding event of size 13687 to /data-a/FarDet/fardet_r00021299_s00_DDfastmono.raw
23:45:47.967580 28 RunStream::addEvent done
23:45:47.973048 28 RunStream::addEvent Adding event of size 22123 to /data-a/FarDet/fardet_r00021299_s00_DDenergy.raw
23:45:47.973147 28 RunStream::addEvent done
23:45:47.981610 28 RunStream::addEvent Adding event of size 169047 to /data-a/FarDet/fardet_r00021298_s63_t02.raw
23:45:47.982178 28 RunStream::addEvent done
23:45:47.999827 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_t02.raw
23:45:48.000046 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_t05.raw
23:45:48.000141 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_t00.raw
23:45:48.000232 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDContained.raw
23:45:48.000324 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDMoon.raw
23:45:48.000429 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_DDSun.raw
23:45:48.000525 28 RunStream::Opened stream = /data-a/FarDet/fardet_r00021299_s00_unknown.raw
23:45:48.000543 28 Debug test 3
23:45:48.000552 28 Debug test 4
23:45:48.000559 28 Debug test 5
23:45:48.000577 28 Debug test 6
23:45:48.000584 28 Debug test 7

All files opened here, except for fardet_r00021299_s00_DDenergy.raw and FarDet/fardet_r00021299_s00_DDfastmono.raw are of size 0, currently sitting on datadisk04.

The code surrounding the opening of these files, in NovaDataLogger::RunStream::init( ), reads:

/// Now, open the run stream file
_FileStream = fopen (_FileName.c_str(), "wb");
_isFileOpen = true;
TRACE ( LogRunDebugInfo, "RunStream::Opened stream = %s\n",_FileName.c_str());

/// Write Out the Run Header
size_t size = run_header.sizeofdata();
fwrite (run_header.getBuffer() , 4 , size , _FileStream);
TRACE ( LogInfo, "RunStream::Write buffer = %s\n",_FileName.c_str());

This might indicate that either the size of the run_header is somehow set equal to 0, or that the attempt to write the header to file fails. I do not know enough about the system to pursue the question further at this time.

At the same time, in RC, we see (quoting /daqlogs/FarDet/Partition1/RunControl/novadaq-far-runcontrol/rcServer_20151124_142849.log ):

23:45:48.145531 2 updateRunInfo nevents 0 run 21299 subrun 0
Connecting with: novadaq-far-db-03.fnal.gov:5432:nova_prod:novadaq
Connecting with: novadaq-far-db-03.fnal.gov:5432:nova_prod:novadaq
Connecting with: novadaq-far-db-03.fnal.gov:5432:nova_prod:novadaq
Connecting with: novadaq-far-db-03.fnal.gov:5432:nova_prod:novadaq
Created 75615
Created 75616
23:45:49.596611 2 Checking run status
23:45:49.596779 2 updateRunProgress 0 0
23:45:51.597611 2 Checking run status
23:45:51.598171 2 updateRunProgress 0 0
23:45:53.597724 2 Checking run status
23:45:53.597901 2 updateRunProgress 0 0
23:45:55.597761 2 Checking run status
23:45:55.597953 2 updateRunProgress 0 1
23:45:57.597747 2 Checking run status
23:45:57.598323 2 updateRunProgress 0 1
23:45:59.596908 2 Checking run status
23:45:59.597064 2 updateRunProgress 0 1
23:46:00.539422 0 Error in DataLogger due to DataLogger not responsive
23:46:00.539802 2 error -1 Error in DataLogger due to DataLogger not responsive
23:46:00.540074 2 warning 34, Error in DataLogger due to DataLogger not responsive
23:46:00.540103 1 34, Error in DataLogger due to DataLogger not responsive
23:46:00.540235 0 Error in DataLogger due to DataLogger not responsive Stopping run.
23:46:00.540397 2 error -1 Error in DataLogger due to DataLogger not responsive Stopping run.
23:46:00.541694 2 Making transitition: BeginningRunToRunning

About 12 seconds after the run rollover is documented in the datalogger log file, the RC log file reports the Error in DataLogger due to DataLogger not responsive error message that acts as the harbinger of crash. Does anyone on this watch list have some ideas / some time to pursue this problem further, this week?

This is one of the last great Labors standing between us and a greatly reduced downtime. I'm going to forward this information to the DAQ support mailing list, as well, in the hopes of drawing out some new information.



Also available in: Atom PDF