Idea #22046

How to handle dataset for intermediate stages

Added by Vito Di Benedetto 11 months ago. Updated 9 months ago.

Status: Closed
Priority: Normal
Assignee:
Target version:
Start date: 03/04/2019
Due date:
% Done: 0%
Estimated time:
Experiment: -
Stakeholders:
Duration:

Description

When submitting a POMS campaign, POMS defines a dataset to track intermediate files, i.e. the output of one stage that will be used as input for the next stage.
The experiment may not want this dataset declared in SAM.
In LArBatch/project.py the user can choose which stage outputs to declare.

Is there a way to handle these intermediate datasets without using the experiment's SAM DB?

History

#1 Updated by Stephen White 11 months ago

  • Assignee set to Marc Mengel

We probably need a discussion about this.

#2 Updated by Marc Mengel 11 months ago

Tracking these intermediate filesets in SAM was a design goal, recommended by the data management experts. Doing otherwise would be a step backwards. We want to track everything in SAM.

#3 Updated by Yuyi Guo 11 months ago

Then can we put the needed metadata into SAM so we can run test jobs now? Vito, could you list the metadata you had in order to run the test job for uboone?

#4 Updated by Marc Mengel 11 months ago

So we need to respond to the user requesting this with our best Jedi Mind Tricks, wave our hand and say: "you do want to track your files in SAM..."

#5 Updated by Vito Di Benedetto 11 months ago

Looking at a uBooNE analysis job submitted using LArBatch/project.py, the metadata used to declare a file looks like the following:

{
 "file_name": "prodgenie_bnb_nu_cosmic_uboone_1_20181024T213932_gen0_bc5589fe-58db-4d29-813c-083a4f841ed4.root",
 "file_id": 342810167,
 "create_date": "2018-10-25T15:48:31+00:00",
 "user": "vito",
 "file_size": 250648,
 "checksum": [
  "enstore:3685744324" 
 ],
 "content_status": "good",
 "file_type": "mc",
 "file_format": "artroot",
 "group": "uboone",
 "data_tier": "generated",
 "application": {
  "family": "art",
  "name": "geniegen",
  "version": "v07_07_00" 
 },
 "event_count": 3,
 "first_event": 3,
 "last_event": 5,
 "start_time": "2018-10-24T21:39:00+00:00",
 "end_time": "2018-10-24T21:39:31+00:00",
 "fcl.name": "prodgenie_bnb_nu_cosmic_uboone.fcl",
 "fcl.version": "v07_07_00",
 "mc.pot": 1259320000000000.0,
 "ub_project.name": "prodgenie_bnb_nu_cosmic_uboone",
 "ub_project.stage": "gen",
 "ub_project.version": "v07_07_00",
 "runs": [
  [
   1,
   2,
   "physics" 
  ]
 ]
}

whereas the metadata we get from a POMS submission is the following:

{
"file_name": "prodgenie_bnb_nu_cosmic_uboone_v06_26_01_sim_pass_2_16807529_0_vito-16807529-0-fnpc7026.fnal.gov_1551474943_337_0.root",
"create_date": "2019-03-01T21:15:43:+00:00",
"user": "vito",
"file_size": 672645113,
"checksum": [
 "enstore:3581730972" 
],
"content_status": "good",
"file_type": "mc",
"file_format": "artroot",
"group": "uboone",
"data_tier": "detector-simulated",
"application": {
"version": "v06_26_01",
"name": "demo",
"family": "demo" 
},
"event_count": 5,
"first_event": 1,
"last_event": 5,
"start_time": "2019-03-01T21:01:18",
"end_time": "2019-03-01T21:15:13",
"runs": [
 [
  1,
  0,
  "physics" 
 ]
],
"dataset.tag": "poms_depends_441597_1",
"parents": ["prodgenie_bnb_nu_cosmic_uboone_v06_26_01_sim_pass_1_16807529_0.root"],
"data_stream": "out1"
}

Comparing the two sets of metadata, we are missing:

 "fcl.name": "prodgenie_bnb_nu_cosmic_uboone.fcl",
 "fcl.version": "v07_07_00",
 "mc.pot": 1259320000000000.0,
 "ub_project.name": "prodgenie_bnb_nu_cosmic_uboone",
 "ub_project.stage": "gen",
 "ub_project.version": "v07_07_00" 

I think we can add those through the directive

add_metadata x=y

Then we have in our metadata the extra field

"data_stream": "out1" 

that needs to be removed.
In general the experiment may need to deal with something like:

"data_stream": "out<N>" 

where <N> could be any digit.

Would it be possible in POMS/fife_utils to have a mechanism to remove/filter out undesired metadata before declaring files?
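
The comparison in this comment can also be reproduced mechanically. The following is a minimal Python sketch, not part of POMS or fife_utils, that takes the top-level keys of the two metadata records quoted above and reports which ones appear in only one of them. The variable names are illustrative.

```python
# Top-level keys of the record declared through LArBatch/project.py (first example).
larbatch_keys = {
    "file_name", "file_id", "create_date", "user", "file_size", "checksum",
    "content_status", "file_type", "file_format", "group", "data_tier",
    "application", "event_count", "first_event", "last_event", "start_time",
    "end_time", "fcl.name", "fcl.version", "mc.pot", "ub_project.name",
    "ub_project.stage", "ub_project.version", "runs",
}

# Top-level keys of the record produced by the POMS submission (second example).
poms_keys = {
    "file_name", "create_date", "user", "file_size", "checksum",
    "content_status", "file_type", "file_format", "group", "data_tier",
    "application", "event_count", "first_event", "last_event", "start_time",
    "end_time", "runs", "dataset.tag", "parents", "data_stream",
}

# Fields present in the LArBatch record but absent from the POMS one.
missing = sorted(larbatch_keys - poms_keys)
# Fields present in the POMS record but absent from the LArBatch one.
extra = sorted(poms_keys - larbatch_keys)

print("missing from POMS:", missing)
print("extra in POMS:", extra)
```

Besides the six fields listed above, the difference also contains file_id on the LArBatch side, and dataset.tag and parents next to data_stream on the POMS side.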

#6 Updated by Marc Mengel 11 months ago

Use

[job_output]
...
filter_metadata = out1

to exclude it.

#7 Updated by Marc Mengel 11 months ago

..er sorry, that should be:

[job_output]
...
filter_metadata = data_stream

#8 Updated by Vito Di Benedetto 11 months ago

Marc Mengel wrote:

..er sorry, that should be:

[job_output]
...
filter_metadata = data_stream

Great! I'll use that.

Then, in case I need to override a metadata field, can I use

add_metadata x=y

and does this metadata assignment happen as the last step of the metadata manipulation?

#9 Updated by Marc Mengel 11 months ago

Yes. We build the base metadata (file size, checksums, etc.), then merge in the output of the metadata extractor, then remove the filter_metadata fields, and finally add any add_metadata ones.
source:libexec/fife_wrap#L916
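
The ordering described in this comment can be sketched in Python as follows. This is an illustration only, not the actual fife_wrap code: the function name build_declared_metadata and its arguments are hypothetical, the sample values are taken from the records quoted earlier in the thread, and add_metadata values are kept as plain strings for simplicity.

```python
def build_declared_metadata(base, extracted, filter_keys, add_pairs):
    """Apply the four steps in the order described in the comment above."""
    md = dict(base)                 # 1. base metadata (file size, checksum, ...)
    md.update(extracted)            # 2. merge in the metadata extractor output
    for key in filter_keys:         # 3. drop every filter_metadata field
        md.pop(key, None)
    for pair in add_pairs:          # 4. apply add_metadata "x=y" entries last
        key, _, value = pair.partition("=")
        md[key] = value
    return md


result = build_declared_metadata(
    base={"file_name": "example.root", "file_size": 672645113},
    extracted={"data_stream": "out1", "data_tier": "detector-simulated"},
    filter_keys=["data_stream"],
    add_pairs=["ub_project.stage=gen", "data_tier=generated"],
)
print(result)
```

Because the add_metadata step runs last, an entry such as data_tier=generated overrides the value that came from the extractor, which is the override behaviour asked about in the previous comment.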

#10 Updated by Yuyi Guo 11 months ago

I am wondering: are the differences below fixed? If not, how does the user know what extra metadata to add each time?

"fcl.name": "prodgenie_bnb_nu_cosmic_uboone.fcl",
"fcl.version": "v07_07_00",
"mc.pot": 1259320000000000.0,
"ub_project.name": "prodgenie_bnb_nu_cosmic_uboone",
"ub_project.stage": "gen",
"ub_project.version": "v07_07_00"

#11 Updated by Marc Mengel 11 months ago

"fcl.name": "prodgenie_bnb_nu_cosmic_uboone.fcl",

This is usually passed in as a [globals] entry, so it could be %(fclfile)s or however it is passed in.

"fcl.version": "v07_07_00",

Ditto here, for %(version)s

"mc.pot": 1259320000000000.0,

That needs to come from a calibration database or something(?)

"ub_project.name": "prodgenie_bnb_nu_cosmic_uboone",

I may need to add a way in the fife_wrap wrapper to pass the project name
out to the script.

"ub_project.stage": "gen",

we can set that in our stages, too.

"ub_project.version": "v07_07_00"

use %(version)s again...

#12 Updated by Marc Mengel 11 months ago

So what I came up with was:

job_output.add_metadata_3 = fcl.name=%(basename)s.fcl
job_output.add_metadata_4 = fcl.version=%(release)s
job_output.add_metadata_5 = mc.pot=%(pot)s
job_output.add_metadata_6 = ub_project.name=\$SAM_PROJECT
job_output.add_metadata_7 = ub_project.stage=%(basename)s
job_output.add_metadata_8 = ub_project.version=%(release)s

where you have to pass -Oglobal.pot=12345 on the launch line for the moment, and my stages set global.basename=stage_name.
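
As a rough illustration of how the %(name)s placeholders in the entries above resolve, here is a small Python sketch. It assumes the substitution behaves like Python's %-style mapping interpolation, which is an assumption about the launcher and not something confirmed in this thread. The dictionary global_values stands in for the global settings; "gen" as the stage basename is a made-up example, while the release and pot values are the ones quoted earlier. The $SAM_PROJECT entry is left out because it is not a %(name)s substitution.

```python
# Hypothetical stand-in for the global settings available at launch time.
global_values = {
    "basename": "gen",        # assumed stage name, set by the stage
    "release": "v07_07_00",   # release version used in the thread's example
    "pot": "12345",           # as passed with -Oglobal.pot=12345
}

# The add_metadata values from the comment above that use %(name)s placeholders.
templates = [
    "fcl.name=%(basename)s.fcl",
    "fcl.version=%(release)s",
    "mc.pot=%(pot)s",
    "ub_project.stage=%(basename)s",
    "ub_project.version=%(release)s",
]

# Expand every placeholder against the mapping.
expanded = [template % global_values for template in templates]

for entry in expanded:
    print(entry)
```

With these values the first entry becomes fcl.name=gen.fcl, which shows why the stage basename has to match the fcl file name for that field to come out right.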

#13 Updated by Marc Mengel 11 months ago

  • Status changed from New to Feedback

#14 Updated by Vito Di Benedetto 11 months ago

What was suggested, with some tweaks, works for me.

#15 Updated by Stephen White 9 months ago

  • Status changed from Feedback to Closed
