Feature #18058

Add to TFileService a mechanism to switch file during processing

Added by Antonino Sergi over 2 years ago. Updated about 2 years ago.

Status: Closed
Priority: Urgent
Assignee:
Category: Application
Target version:
Start date: 10/27/2017
Due date:
% Done: 100%
Estimated time: 16.00 h
Spent time:
Scope: External
Experiment: DUNE
SSI Package: art
Duration:

Description

I would say that the general use case is simply a trigger (event number periodicity, a flag in the data, a signal coming from some control unit) that makes TFileService clone the current objects, save them in the open file, close the file, open a new one, and perhaps reset the histograms and related objects.

A higher level description is that we have an online monitor made of a chain of producers and analyzers, and we want to switch to a new file every now and then instead of waiting for the run to finish to have one single file properly closed.
Our current approach in the data logger is to switch to a new file every 1000 events, and RootOutput allows it easily. A similar, or even more generic, behaviour would be desirable also outside of the data path, i.e. for online monitoring histograms and event display.

We need this feature rather urgently for the ProtoDUNE ColdBox tests, and we are willing to help implement it.

TFileDirectory.cc (2.4 KB), Kyle Knoepfel, 11/02/2017 09:06 AM
CMakeLists.txt (232 Bytes), Kyle Knoepfel, 11/02/2017 09:06 AM
TFileDirectory.h (3.3 KB), Kyle Knoepfel, 11/02/2017 09:06 AM
TFileService.h (3.24 KB), Kyle Knoepfel, 11/02/2017 09:06 AM
TFileService_service.cc (5.99 KB), Kyle Knoepfel, 11/02/2017 09:06 AM
TFileDirectory.p (1.54 KB), Kyle Knoepfel, 11/08/2017 04:32 PM

History

#1 Updated by Kyle Knoepfel over 2 years ago

Tonino, are you able to discuss this issue by video conference? If so, I propose sometime Monday morning (FNAL time)--perhaps 9 am--I believe that would be 3 pm CERN time once you take into account daylight savings ending this weekend in Europe but not the US.

#2 Updated by Antonino Sergi over 2 years ago

Hi Kyle,
since the usual DAQ meeting has been cancelled, 10 am would be better, if that's OK for you.

#3 Updated by Kyle Knoepfel over 2 years ago

10 am should be okay.


Connection information

Join from PC, Mac, Linux, iOS or Android: https://fnal.zoom.us/j/6766408743

Or iPhone one-tap :
US: +16699006833,,6766408743# or +16465588656,,6766408743#
Or Telephone:
Dial(for higher quality, dial a number based on your current location):
US: +1 669 900 6833 or +1 646 558 8656
Meeting ID: 676 640 8743
International numbers available: https://fnal.zoom.us/zoomconference?m=n6YR0dydyhMd-ezhEBk0Imqh2uEuXhSI

Or an H.323/SIP room system:
H.323:
162.255.37.11 (US West)
162.255.36.11 (US East)
221.122.88.195 (China)
115.114.131.7 (India)
213.19.144.110 (EMEA)
202.177.207.158 (Australia)
209.9.211.110 (Hong Kong)
64.211.144.160 (Brazil)
69.174.57.160 (Canada)

Meeting ID: 676 640 8743

SIP:

#4 Updated by Antonino Sergi over 2 years ago

The meeting I had in 10 minutes was cancelled, so if you prefer 9am I'm available. I'm already connected

#5 Updated by Kyle Knoepfel over 2 years ago

  • Priority changed from Normal to Urgent

#6 Updated by Rob Kutschke over 2 years ago

I won't be able to make the meeting but here is my input.

Please think about the use case that only some modules in the job want this behaviour. Is it practical to support that or is this an all or none proposition?

I recommend that you minimize the scope of work done by art and maximize the scope of work done by user code that is registered with art and is called by art at the appropriate times - this gives end users maximum flexibility.

Here is one example of minimizing work within art. There can be many triggers for the close-file-open-new-file sequence. A few are straightforward to define and can be implemented within art: assert the trigger every N events, every Run, every Subrun, every input file, every output file on the output module with a specified modulelabel. If people want more complex triggers they should go into user written registered callbacks, not into art infrastructure. An example of this might be writing output files and resetting histograms when some particular histogram reaches a certain number of entries or when a statistically significant anomaly is discovered.

A second example of minimizing work within art. I think that art's job should be limited to opening and closing files; and poking user written callbacks to do the rest. I think it's far too much work to develop a configuration language that explains to art which of the possible actions it should take on which histograms in which modules.

As much as possible things like making copies of histograms in the new file or resetting histograms etc should be done by callbacks to user code. If possible the trigger should also be asserted by registering callbacks to user code.
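
The split suggested above, a few simple built-in triggers plus arbitrary user-registered ones, could be sketched roughly as follows. This is only an illustration: Progress, Trigger, everyNEvents and TriggerSet are hypothetical names invented for this sketch and are not part of art.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical snapshot of job progress; not an art type.
struct Progress {
  std::uint64_t eventsInFile;
  std::uint64_t histogramEntries;
};

// A trigger answers the question "should the current file be closed now?"
using Trigger = std::function<bool(Progress const&)>;

// A simple trigger of the kind the framework could provide itself:
// fire once N events have gone into the current file.
inline Trigger everyNEvents(std::uint64_t n)
{
  return [n](Progress const& p) { return n != 0 && p.eventsInFile >= n; };
}

// Holds framework-provided and user-written triggers side by side.
class TriggerSet {
public:
  void add(Trigger t) { triggers_.push_back(std::move(t)); }

  // True if any registered trigger asks for a file switch.
  bool fired(Progress const& p) const
  {
    for (auto const& t : triggers_) {
      if (t(p)) {
        return true;
      }
    }
    return false;
  }

private:
  std::vector<Trigger> triggers_;
};
```

A user-written trigger, such as "switch once a histogram has 50 entries", is then just another callable passed to add(), so the framework does not need to know anything about histograms.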

#7 Updated by Kyle Knoepfel over 2 years ago

  • Status changed from New to Assigned
  • Assignee set to Kyle Knoepfel

We believe we can provide an implementation that places reasonable demands on the user without introducing unreasonable entanglements within art.

#8 Updated by Kyle Knoepfel over 2 years ago

  • Category set to Application
  • Estimated time set to 16.00 h
  • SSI Package art added

#9 Updated by Kyle Knoepfel over 2 years ago

  • % Done changed from 0 to 90

This feature has been largely committed with commit art:5c502b4c. Some remaining issues may need to be discussed with the stakeholders and with the art team.

Documentation regarding how it can be enabled is forthcoming.

#10 Updated by Kyle Knoepfel over 2 years ago

Please see the attached files that include versions of TFileService and TFileDirectory, which support the file-switching behavior you seek. In addition, there is a CMakeLists.txt file that gives you the appropriate library dependencies, assuming you're using mrb/cetbuildtools.

There will likely be some adjustments required whenever the updated TFileService is released with art 2.10.00. For that reason, I encourage you to rename your copy of the service to something else. This will avoid unintentional collisions with the officially-released version.

We spent considerable time trying to find a way to allow TFileService to seamlessly switch to a new output file without users needing to change their C++ source code. Unfortunately, since TFileService exposes various cached objects and allows users to cache pointers to ROOT objects, this was not feasible given ROOT's memory management. Even if it were possible to retain the pointer values for everything associated with a TFileService file, the directory pointer cached by each TObject would be incorrect once a new file is opened. Adjusting that pointer is difficult, as is creating a new TFile object at the same memory address as the old file without causing memory corruption.

For that reason, we have added a new interface to TFileService that allows users to specify which actions should be performed after a file switch has occurred. This gives users the opportunity to update their pointers at the right time, but it still conceals the file-switching mechanics that are necessary for the framework to handle. See below for documentation.
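
To illustrate why a handle cached before a file switch cannot simply be reused afterwards, here is a toy analogy that does not use ROOT at all. ToyFile and ToyHist are hypothetical stand-ins, and std::weak_ptr is used only so that the staleness can be observed safely; ROOT itself hands out raw pointers to objects owned by the directory, so a real stale pointer could not be checked this way.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy histogram; stands in for a ROOT object owned by a file.
struct ToyHist {
  std::string name;
  int entries;
};

// Toy file that owns every object created through it.
class ToyFile {
public:
  explicit ToyFile(std::string name) : name_{std::move(name)} {}

  std::shared_ptr<ToyHist> make(std::string const& histName)
  {
    owned_.push_back(std::make_shared<ToyHist>(ToyHist{histName, 0}));
    return owned_.back();
  }

  std::string const& name() const { return name_; }

private:
  std::string name_;
  std::vector<std::shared_ptr<ToyHist>> owned_;
};

// Returns true if a handle cached before the switch is valid before it
// and stale after it.
inline bool cachedHandleIsStaleAfterSwitch()
{
  auto file = std::make_unique<ToyFile>("hists_1.root");
  std::weak_ptr<ToyHist> cached = file->make("test1");
  bool const validBefore = !cached.expired();

  // "Switch files": the old file goes away together with its objects.
  file = std::make_unique<ToyFile>("hists_2.root");

  return validBefore && cached.expired();
}
```

In this analogy the only safe thing a module can do after the switch is to ask the new file for a fresh object, which is the job a file-switch callback takes on.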


Job configuration

To configure the TFileService to switch to a new output file for a given condition, the optional fileProperties table can be specified, just as it can be specified for any RootOutput module:

## Any parameters prefaced with '#' are optional.

TFileService: {

   closeFileFast: true  # default

   fileName: <string>

   tmpDir: "<parent-path-of-filename>"  # default

 # fileProperties: {
 #
 #    maxEvents: 4294967295  # default
 #
 #    maxSubRuns: 4294967295  # default
 #
 #    maxRuns: 4294967295  # default
 #
 #    maxInputFiles: 4294967295  # default
 #
 #    ## Maximum size of file (in KiB)
 #
 #    maxSize: 2130706432  # default
 #
 #    ## Maximum age of output file (in seconds)
 #
 #    maxAge: 4294967295  # default
 #
 #    ## The 'granularity' parameter specifies the level at which
 #    ## an output file may be closed, and thereby the granularity
 #    ## of the file.  The following values are possible:
 #    ##
 #    ##     Value        Meaning
 #    ##    =======================================================
 #    ##    "Event"       Allow file switch at next Event
 #    ##    "SubRun"      Allow file switch at next SubRun
 #    ##    "Run"         Allow file switch at next Run
 #    ##    "InputFile"   Allow file switch at next InputFile
 #    ##    "Job"         File closes at the end of Job
 #    ##
 #    ## For example, if a granularity of "SubRun" is specified, but the
 #    ## output-module has reached the maximum events written to disk (as
 #    ## specified by the 'maxEvents' parameter), the output module will NOT
 #    ## switch to a new file until a new SubRun has been reached (or
 #    ## there are no more Events/SubRuns/Runs to process).
 #
 #    granularity: "Event"  # default
 # }
}

The comment needs to be updated so that it does not refer to the "output module", but to a "service" instead.
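
The interplay between the limits and the 'granularity' parameter described in the configuration comment can be sketched as a small decision function. This is a toy model, not art's implementation: only maxEvents and maxSubRuns are modelled, all type and function names are hypothetical, and the rule that a coarser boundary (for example a new Run) also permits a switch at "SubRun" granularity is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Order matters: finer granularities come first.
enum class Granularity { Event, SubRun, Run, InputFile, Job };

// The kind of boundary the job has just reached.
enum class Boundary { Event, SubRun, Run, InputFile };

// Hypothetical subset of the fileProperties table.
struct FileProperties {
  std::uint64_t maxEvents;
  std::uint64_t maxSubRuns;
  Granularity granularity;
};

// What has gone into the currently open file so far.
struct FileCounters {
  std::uint64_t events;
  std::uint64_t subRuns;
};

// True once any configured limit has been reached.
inline bool limitReached(FileProperties const& p, FileCounters const& c)
{
  return c.events >= p.maxEvents || c.subRuns >= p.maxSubRuns;
}

// Assumption: a boundary permits a switch if it is at least as coarse
// as the configured granularity; "Job" never switches mid-job.
inline bool boundaryAllowsSwitch(Granularity g, Boundary b)
{
  if (g == Granularity::Job) {
    return false;
  }
  return static_cast<int>(b) >= static_cast<int>(g);
}

// A switch needs both a reached limit and a permitted boundary.
inline bool shouldSwitch(FileProperties const& p,
                         FileCounters const& c,
                         Boundary b)
{
  return limitReached(p, c) && boundaryAllowsSwitch(p.granularity, b);
}
```

With maxEvents set to 10 and a granularity of "SubRun", reaching the tenth event does not by itself cause a switch in this model; the switch is deferred until the next SubRun boundary, matching the example given in the comment.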

N.B. If the fileProperties parameter is specified, then the only modules using TFileService that can be used in the job are those that implement the changes described in the next section.

Usage in modules

If you want your module to be usable whenever a user enables TFileService file-switching, then it must provide a file-switch callback. For example:

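TestTFileService::TestTFileService(Parameters const& p)
  : EDAnalyzer{p}
{
  ServiceHandle<TFileService> fs;
  // Ask TFileService to call setRootObjects() after every file switch.
  fs->registerFileSwitchCallback(this, &TestTFileService::setRootObjects);
  // Callbacks are not invoked at construction, so book the objects
  // for the first file explicitly.
  setRootObjects();
}

void
TestTFileService::setRootObjects()
{
  ServiceHandle<TFileService> fs;
  // Refresh the cached pointer with a newly made histogram.
  h1_ = fs->make<TH1F>("test1", "test histogram #1", 100, 0., 100.);
}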

In this example, we have provided the 'void setRootObjects()' function as a callback to the TFileService. After a file switch has occurred, the callback is invoked, pointing h1_ at a new histogram.

N.B. The callbacks are invoked only after a file switch has occurred--they are not invoked after the constructor of the module has been called. This is why setRootObjects is called in the constructor as well as being provided as a callback to the TFileService.
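
For readers curious how a registration call of the form (this, &Module::method) can work, here is a self-contained toy model. ToyFileService and ToyModule are hypothetical; the sketch only mimics the shape of the registerFileSwitchCallback call shown above and makes no claim about how TFileService stores or invokes callbacks internally.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy stand-in for the service; not art's TFileService.
class ToyFileService {
public:
  // Store a bound member function to be called after each file switch.
  template <typename T>
  void registerFileSwitchCallback(T* module, void (T::*method)())
  {
    callbacks_.push_back([module, method] { (module->*method)(); });
  }

  // Pretend to close the current file and open the next one,
  // then notify every registered module.
  void switchFile()
  {
    ++fileIndex_;
    for (auto const& cb : callbacks_) {
      cb();
    }
  }

  int fileIndex() const { return fileIndex_; }

private:
  int fileIndex_{0};
  std::vector<std::function<void()>> callbacks_;
};

// Toy module that re-books its "histogram" on every file switch.
class ToyModule {
public:
  explicit ToyModule(ToyFileService& fs) : fs_(fs)
  {
    fs_.registerFileSwitchCallback(this, &ToyModule::setRootObjects);
    setRootObjects();  // callbacks do not fire at construction
  }

  std::string const& histogramLocation() const { return location_; }
  int timesBooked() const { return timesBooked_; }

private:
  void setRootObjects()
  {
    location_ = "file" + std::to_string(fs_.fileIndex()) + "/test1";
    ++timesBooked_;
  }

  ToyFileService& fs_;
  std::string location_;
  int timesBooked_{0};
};
```

Constructing a ToyModule books the object once for the first file; each later call to switchFile() books it again for the new file, mirroring the two calls to setRootObjects in the example above.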

If the job is configured to switch TFileService files, and one of the modules does not register a callback, you will see an error like:

---- Configuration BEGIN
  A TFileService error occured while attempting to make a directory or ROOT object.
  File-switching has been enabled for TFileService.  All modules must register
  a callback function to be invoked whenever a file switch occurs.  The callback
  must ensure that any pointers to ROOT objects have been updated.

    No callback has been registered for module 'a1'.

  Contact artists@fnal.gov for guidance.
---- Configuration END
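
The check behind an error of this kind can be pictured with the following toy guard. It is a sketch under assumptions: ToyCallbackGuard and its member functions are hypothetical, and the only behaviour taken from the message above is that the failure is raised when a module without a registered callback tries to make an object while file switching is enabled.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Toy guard; not part of art.
class ToyCallbackGuard {
public:
  explicit ToyCallbackGuard(bool fileSwitchingEnabled)
    : enabled_{fileSwitchingEnabled}
  {}

  // Record that the module with this label has registered a callback.
  void noteCallbackRegistered(std::string const& moduleLabel)
  {
    registered_.insert(moduleLabel);
  }

  // Called before a module makes a directory or object.
  // Throws if switching is enabled and the module has no callback.
  void checkBeforeMake(std::string const& moduleLabel) const
  {
    if (enabled_ && registered_.count(moduleLabel) == 0) {
      throw std::runtime_error(
        "No callback has been registered for module '" + moduleLabel + "'.");
    }
  }

private:
  bool enabled_;
  std::set<std::string> registered_;
};
```

When file switching is disabled the guard never throws, so modules that do not register a callback keep working in jobs that do not use the fileProperties table.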

An unfortunate consequence

Because art does not directly interact with any objects created via the TFileService, art cannot be smart about when to open a ROOT file--art must be cautious. To that end, a ROOT file is created (a) whenever the TFileService is constructed, and (b) immediately after a file has been closed during a file switch. This means that in some cases, it is possible to get an extra file that is not necessarily desired. For example:

  • TFileService is configured to switch after 10 events
  • The job is configured to process only 10 events
  • After the 10th event is processed, TFileService closes one file and opens a new one
  • The endSubRun and endRun calls are made to all the modules
  • The newly opened file is then closed, even though the histograms may be empty if none of them were filled during endSubRun or endRun.
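
The sequence above can be reproduced with a few lines of toy bookkeeping. The simulateJob function is hypothetical and models only the two rules stated in this section: a file is opened when the service is constructed, and another is opened immediately after each switch.

```cpp
#include <cassert>
#include <vector>

// Returns, for each file opened during the job, the number of events
// recorded in it. Assumes maxEventsPerFile > 0.
inline std::vector<int> simulateJob(int nEvents, int maxEventsPerFile)
{
  std::vector<int> files;
  files.push_back(0);  // (a) a file is opened when the service is constructed

  for (int i = 0; i < nEvents; ++i) {
    ++files.back();
    if (files.back() >= maxEventsPerFile) {
      files.push_back(0);  // (b) a new file is opened right after the switch
    }
  }
  return files;  // the last file is closed at the end of the job
}
```

Calling simulateJob(10, 10) yields two files with 10 and 0 events, which is the extra, possibly empty, file described above; simulateJob(25, 10) yields three files with 10, 10 and 5 events.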

This consequence is largely unavoidable. It may be possible to improve the situation, but there is no apparent way to do so right now without imposing additional restrictions on TFileService usage.

Future directions

There was a request for the art team to be cognizant of user-defined conditions that would trigger a file-switch. Although we have not exposed any interface to allow that at this time, it would be straightforward to add the feature in a non-breaking way in the future.

#11 Updated by Kyle Knoepfel over 2 years ago

  • Target version set to 2.10.00

#12 Updated by Kyle Knoepfel over 2 years ago

  • Status changed from Assigned to Resolved
  • % Done changed from 90 to 100

#13 Updated by Kyle Knoepfel over 2 years ago

The current directory was not being set properly when the callbacks were invoked. This has been fixed--please see the attached patch file.

#14 Updated by Kyle Knoepfel about 2 years ago

  • Status changed from Resolved to Closed

