Feature #4810

Investigate why we get backpressure of more than 5 seconds at LNGS

Added by Kurt Biery about 6 years ago. Updated over 5 years ago.

Status: Closed
Priority: High
Target version: v1_01_00
Start date: 10/21/2013
Due date: 04/25/2014
% Done: 50%
Duration: 187 days

Description

Occasionally, we see instances where EventBuilders fail to enqueue events in the EventStore within the 5 second timeout.

What is happening that causes backpressure for more than 5 seconds? Is this related to starting to write a new file?
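
As a quick way to see how long a single file flush can stall, the small standalone C program below (not part of artdaq; the data path is a placeholder) writes a few hundred MB into the page cache and then times one fsync() call. Run against the same RAID volume while it is busy, a multi-second result here would correspond to the kind of backpressure described above.

/* Minimal diagnostic sketch (not part of artdaq): time how long a single
 * fsync() blocks. The path below is a placeholder. Build with, e.g.,
 *   gcc -O2 fsync_probe.c -o fsync_probe
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/data/test_fsync_probe.dat";  /* hypothetical path */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[1 << 20];
    memset(buf, 0, sizeof buf);
    for (int i = 0; i < 256; ++i)            /* write ~256 MB into the page cache */
        if (write(fd, buf, sizeof buf) < 0) { perror("write"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) != 0) perror("fsync");     /* this is the call that can stall */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("fsync of ~256 MB took %.2f s\n", secs);

    close(fd);
    unlink(path);
    return 0;
}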

dsag_4_9_2014.txt (229 KB) - Ron Rechenmacher, 04/08/2014 08:25 PM

Related issues

Related to artdaq - Support #5971: Incorporate control of the Linux system fsync calls into the system (New, 04/21/2014 to 05/30/2014)

Related to artdaq - Support #5866: [DS-50] Add the ability to preload a fake_fsync library in pmt.rb (Closed, 04/09/2014 to 04/18/2014)

Related to ds50daq - Support #6006: Modify the startSystemWithVeto script to specify the fake_fsync preload (Closed, 04/23/2014)

History

#1 Updated by Kurt Biery over 5 years ago

  • Due date changed from 10/31/2013 to 04/25/2014
  • Status changed from New to Assigned
  • Assignee set to Ron Rechenmacher
  • Priority changed from Normal to High
  • Target version set to v1_01_00

This problem has become more noticeable, and a higher priority, in recent weeks. In particular, it seems that every Saturday morning the problem is so bad that it takes tens of seconds to close one file and open another. (I will add some data on the observed behavior soon.)

Ron has done a considerable amount of investigation on the nature of the problem and why it happens on Saturday mornings.

#2 Updated by Kurt Biery over 5 years ago

One way to look at the times when the problem has occurred is to look at runs that have a low efficiency (live time over run duration) from the database:

[biery@dsfr6 ~]$ psql -h localhost -d darkside50 -p 5438 -U rea -c "select (livetime_v1495 / runtime_daq) as efficiency, extract(epoch from date_trunc('minute', start)), run, start from run.tpc_info where (type ~ 'normal.*' or type ~ 'veto*' or type ~ 'g2.*') and (livetime_v1495 / runtime_daq) < 0.9 order by run desc"
efficiency | date_part | run | start
-------------------+-------------+------+----------------------------
0.710064138378845 | 1396683480 | 7827 | 2014-04-05 09:38:01.403905
0.738014575873516 | 1396675920 | 7825 | 2014-04-05 07:32:03.288586
0.213184376997869 | 1396075740 | 7734 | 2014-03-29 07:49:50.27952
0.126946555905994 | 1396072080 | 7733 | 2014-03-29 06:48:52.742446
0.141136445625753 | 1396069020 | 7732 | 2014-03-29 05:57:28.741935
0.186399599950017 | 1396066680 | 7731 | 2014-03-29 05:18:57.642239
0.244041272622786 | 1396066380 | 7730 | 2014-03-29 05:13:30.234515
0.27927515581948 | 1396066320 | 7729 | 2014-03-29 05:12:53.246633
0.255072569632858 | 1396066320 | 7728 | 2014-03-29 05:12:17.361164
0.253350724029984 | 1396066260 | 7727 | 2014-03-29 05:11:40.843243
0.265099540172886 | 1396066260 | 7726 | 2014-03-29 05:11:02.2464
0.196721439838295 | 1396063200 | 7725 | 2014-03-29 04:20:24.924793
0.166297614187295 | 1396060800 | 7724 | 2014-03-29 03:40:00.934487
0.090205057603154 | 1396060740 | 7723 | 2014-03-29 03:39:24.822345
0.180049190635211 | 1396059060 | 7721 | 2014-03-29 03:11:11.354275
0.610416957245335 | 1395825060 | 7694 | 2014-03-26 10:11:25.996981
0.797184271409756 | 1395755880 | 7687 | 2014-03-25 14:58:44.161702
0.841135912492899 | 1395753780 | 7685 | 2014-03-25 14:23:44.615023
0.820956442092353 | 1395742440 | 7683 | 2014-03-25 11:14:12.117052
0.789778637622701 | 1395740880 | 7682 | 2014-03-25 10:48:44.650355
0.837845655903464 | 1395720000 | 7680 | 2014-03-25 05:00:11.765858
0.820936217451014 | 1395713220 | 7679 | 2014-03-25 03:07:22.740724
0.821252177173141 | 1395702960 | 7678 | 2014-03-25 00:16:30.095208
0.841532350628892 | 1395700920 | 7677 | 2014-03-24 23:42:06.501689
0.783902200906849 | 1395695880 | 7676 | 2014-03-24 22:18:46.625809
0.856619684691326 | 1395694260 | 7675 | 2014-03-24 21:51:10.201193
0.821693892238881 | 1395682440 | 7673 | 2014-03-24 18:34:32.647286
0.802233533802846 | 1395670980 | 7672 | 2014-03-24 15:23:42.138878
0.855522290631215 | 1395668160 | 7671 | 2014-03-24 14:36:30.670304
0.763144581198865 | 1395662880 | 7670 | 2014-03-24 13:08:13.624713
0.73516888517581 | 1395661260 | 7669 | 2014-03-24 12:41:55.134248
0.830426121180538 | 1395658560 | 7668 | 2014-03-24 11:56:38.705325
0.805613187874712 | 1395656940 | 7667 | 2014-03-24 11:29:01.08032
0.7390269265112 | 1395653040 | 7666 | 2014-03-24 10:24:16.735282
0.214452907074285 | 1394867940 | 7553 | 2014-03-15 08:19:36.441549
0.775543339114305 | 1394866920 | 7552 | 2014-03-15 08:02:34.948108
0.713433565975914 | 1394865000 | 7551 | 2014-03-15 07:30:54.558135
0.342244775861206 | 1394859660 | 7550 | 2014-03-15 06:01:18.324355
0.665468956752526 | 1394269800 | 7493 | 2014-03-08 10:10:05.034168
0.887182175023288 | 1394267520 | 7492 | 2014-03-08 09:32:24.413115
0.465138971104069 | 1394260980 | 7491 | 2014-03-08 07:43:47.337943
0.361183102415155 | 1393664880 | 7438 | 2014-03-01 10:08:28.951097
0.242630070509939 | 1393656780 | 7437 | 2014-03-01 07:53:51.898473
0.805820184292489 | 1393075680 | 7386 | 2014-02-22 14:28:58.986283
0.77170107635837 | 1393073940 | 7385 | 2014-02-22 13:59:46.377499
0.800386318340643 | 1393061400 | 7383 | 2014-02-22 10:30:33.87924
0.820985442438136 | 1392718860 | 7382 | 2014-02-18 11:21:40.663302
0.878784942695275 | 1392650280 | 7377 | 2014-02-17 16:18:39.178722
0.744967096173832 | 1392478620 | 7344 | 2014-02-15 16:37:28.517067
0.871845787306139 | 1391919000 | 7271 | 2014-02-09 05:10:03.512266
0.860118581451607 | 1391908320 | 7269 | 2014-02-09 02:12:34.804529

I haven't yet investigated whether the low efficiencies on 24-Mar and 25-Mar are due to problems in switching files.

The other way to check for occurrences of the problem is to examine the artdaq process log files.

#3 Updated by Kurt Biery over 5 years ago

From looking at the logfiles, it seems that the majority of the low-efficiency runs on 24-Mar and 25-Mar were due to this problem.

#4 Updated by Ron Rechenmacher over 5 years ago

The smoking gun for why the problem occurs on Saturday mornings is, from the RAID controller log, the fact that the
weekly "Consistency Check" starts on Saturdays at 3:00 AM. For the April 5, 2014 occurrence, the check finished at
11:13:43 (logged as 05-04-2014 11:13:43). I believe we do not need to change that configuration. This incident just
highlights the non-optimal system configuration and the way we are writing (including closing) files. We are testing
the behaviour of the system without "fsync" and will know more about the correct way to change the system after we
gain experience with no fsyncs.
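
For reference, the "no fsync" test works by preloading a library that turns fsync into a no-op. The actual fake_fsync library is not included in this ticket, so the following is only a minimal sketch of what such a shim could look like, built as a shared object and activated with LD_PRELOAD:

/* Sketch of a fake_fsync preload shim (illustration only; the real
 * fake_fsync library used by DarkSide-50 is not shown in this ticket).
 * Build and enable with:
 *   gcc -shared -fPIC -o libfake_fsync.so fake_fsync.c
 *   LD_PRELOAD=$PWD/libfake_fsync.so <daq process>
 */
#include <unistd.h>

int fsync(int fd)
{
    (void)fd;   /* pretend the data is already safely on disk */
    return 0;
}

int fdatasync(int fd)
{
    (void)fd;
    return 0;
}

With the shim preloaded, closing a file no longer waits for the RAID to flush; the data still reaches disk through normal kernel writeback, which is why the write-cache tuning discussed below still matters.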

#5 Updated by Kurt Biery over 5 years ago

Based on discussions with Ron, Alessandro, Stephen, and other members of the DAQ team, we've decided on the following plan:
  • keep the existing fake_fsync preload in place
  • modify the artdaq system startup script (pmt.rb) to support optional enabling of the preload (this will be slightly more robust than the current situation, in which the needed environment variable is simply set in the startSystem TMUX window) [Kurt will do this]
  • modify the system configuration on the dsag computer (the Aggregator computer) to reduce the amount of disk write cache; see the sketch after this list [Ron will do this]
  • talk to the ROOT development team about making the 'fsync' call configurable (default on, as it is now, but allowing us to turn it off) [Ron has already started this discussion]
  • modify the scripts that move data off the dsag computer (to the ds50fs computer at LNGS) so that they don't do this on Saturday mornings (say 02:30 to 12:00 local time) [Alessandro will do this]
  • make no changes to the existing RAID system Consistency Check setup
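
Regarding the disk write cache item above: on Linux this amounts to lowering the vm.dirty_* thresholds so the kernel starts writeback earlier and less dirty data can pile up ahead of an fsync or file close. The exact values chosen for dsag are not recorded in this ticket; the sketch below (equivalent to running sysctl -w as root) uses placeholder numbers purely for illustration.

/* Illustrative only: lower the kernel dirty-page thresholds so writeback
 * starts sooner. The numbers are placeholders, not the values used on dsag.
 * Must be run as root; equivalent to
 *   sysctl -w vm.dirty_background_ratio=1 vm.dirty_ratio=5
 */
#include <stdio.h>

static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    fclose(f);
    return 0;
}

int main(void)
{
    /* start background writeback when 1% of RAM is dirty ... */
    write_sysctl("/proc/sys/vm/dirty_background_ratio", "1");
    /* ... and throttle writers once 5% of RAM is dirty */
    write_sysctl("/proc/sys/vm/dirty_ratio", "5");
    return 0;
}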

#6 Updated by Kurt Biery over 5 years ago

For reference, we are summarizing various details and observations regarding this issue on the artdaq Wiki pages (https://cdcvs.fnal.gov/redmine/projects/artdaq/wiki/DarkSide-50_disk_writing_lessons).

#7 Updated by Kurt Biery over 5 years ago

  • Status changed from Assigned to Resolved

The identified work-arounds have been implemented (reducing the allowed dirty fraction in the disk cache and skipping the fsync call), and issues have been filed to formalize the disabling of the fsync call.

#8 Updated by Kurt Biery over 5 years ago

  • Status changed from Resolved to Closed

