
NOvA 2015-03-31

Matthew Tamsett

Most jobs are single input/single output.

We tend to run all of phase 1, all of phase 2, etc.

Art gives you per-module usage; it would be nice to sum that up
over the whole project, etc.

So ideally I could define the whole workflow, have it all launched
automatically, and watch the status...

Robert: A status page showing files, etc.?
Yes, that would be useful.

Diesburg: anything you are really married to that you can't give up? SAM.

Sometimes we rerun jobs manually just to get the error message... What are the requirements on triaging?

A way to get log files easily, through a web page, would help you identify pathologies.
Not sure if our jobs generate logs that are really parse-able that way.

Error code summaries -- are they useful? Maybe we need a translation from the experiments.

Mike D: For this to work properly, we need to provide something to the experiments to handle error messages...
Andrew: One could have a standard Art module for this.
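
As a rough illustration of the kind of translation Mike D and Andrew are describing, the production system could carry a per-experiment table mapping exit codes to human-readable summaries. Every code and description in this sketch is invented; the real values would have to come from the experiment or from Art.

    # Hypothetical per-experiment exit-code translation table.  All codes and
    # descriptions here are invented for illustration.
    ERROR_TRANSLATIONS = {
        'nova': {
            0:  'success',
            1:  'generic failure in the job script',
            65: 'art: problem opening the input file',
            84: 'art: configuration (fcl) error',
        },
    }

    def describe_exit_code(experiment, code):
        """Return a human-readable summary for an exit code, or a fallback."""
        table = ERROR_TRANSLATIONS.get(experiment, {})
        return table.get(code, 'unknown exit code %d' % code)

    print(describe_exit_code('nova', 65))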

Matt: There needs to be some connection between the task producing a file and the files in the FTS.
Robert: we could add a metadata task-id field for this, and then the FTS could report it.

Other Notes

Matthew's Awesome Production Site: http://nusoft.fnal.gov/nova/production/datasets/overview.html

Have a way to communicate the results of jobs. A report would be good, or maybe send out emails?

Matthew's site just checks whether the files have shown up (it uses the web API). It does not check the status of the jobs.

Matthew has cron jobs that query these things and display them on the site statically.
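
A minimal sketch of what such a cron-driven check might look like with the SAM web client Python API; the dataset name is a placeholder and the client method names are from memory, so they should be checked against the samweb documentation.

    # Sketch: count how many files of a dataset have shown up in SAM.
    import samweb_client

    samweb = samweb_client.SAMWebClient(experiment='nova')

    dataset = 'prod_reco_example_dataset'   # placeholder dataset definition name
    n_files = samweb.countFiles(defname=dataset)
    print('%s: %d files declared so far' % (dataset, n_files))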

The first-analysis page is done by hand by Matthew; it is not automated at all.

None of what is on the site is tied in with actual running jobs.

No tie-in with the submission system.

Draining dataset failure: a file that is not an ancestor of any output file (it has no descendants) means the file was not processed.

A SAM project takes a snapshot so you know exactly how many files there are. From there you can determine the number of output files. Then you can determine job success.
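
A hedged sketch of how that bookkeeping could be queried: compare the snapshot (input) file count against the input files that already have descendants in the output dataset. Dataset names are placeholders, and the dimension syntax (minus / isparentof) is from memory, so it should be verified against the SAM documentation.

    # Sketch: how many files from the input dataset were never processed,
    # i.e. are not a parent of anything in the output dataset.
    import samweb_client

    samweb = samweb_client.SAMWebClient(experiment='nova')

    input_def = 'prod_raw_example_input'     # placeholder
    output_def = 'prod_reco_example_output'  # placeholder

    total = samweb.countFiles(defname=input_def)
    unprocessed = samweb.countFiles(
        dimensions='defname: %s minus isparentof:( defname: %s )'
                   % (input_def, output_def))
    print('%d of %d input files have no output descendant yet' % (unprocessed, total))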

Track where a job ran:
What facility?
-Fermilab?
-Somewhere else?
What node?
-Hostname of the worker node it ran on (the SAM project can tell you this already).

Additional things to keep track of:
/proc/cpuinfo
/proc/meminfo
Memory footprints of jobs.
Efficiency rates.
Output file sizes.
Run times.
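
A minimal sketch of what a job wrapper could record on the worker node to cover the items above. The /proc files are standard Linux; the output file name and record layout are illustrative only.

    # Sketch of per-job bookkeeping: where the job ran, what the hardware
    # looks like, and how long the payload took.
    import json, socket, time

    def read_proc(path):
        """Return the contents of a /proc file, or '' if it cannot be read."""
        try:
            with open(path) as f:
                return f.read()
        except IOError:
            return ''

    record = {
        'hostname': socket.gethostname(),
        'cpuinfo':  read_proc('/proc/cpuinfo'),
        'meminfo':  read_proc('/proc/meminfo'),
    }

    start = time.time()
    # ... run the actual job payload here ...
    record['wall_time_sec'] = time.time() - start

    with open('job_record.json', 'w') as out:
        json.dump(record, out)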

Oftentimes someone will want to know how long it will take to reprocess a dataset. This is why it's a good idea to keep track of timing.
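
With per-file timing in hand, the estimate is simple arithmetic; the numbers below are invented purely for illustration.

    # Back-of-the-envelope reprocessing estimate (all numbers invented).
    n_files          = 50000    # files in the dataset
    avg_sec_per_file = 600.0    # average processing time per file, from history
    concurrent_jobs  = 1000     # slots we can typically hold

    wall_hours = n_files * avg_sec_per_file / concurrent_jobs / 3600.0
    print('estimated wall-clock time: %.1f hours' % wall_hours)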

Have a history of the real configuration that ran the job:
FHiCL (.fcl) files
the actual scripts that were run (we need to store tags; experiment code lives in repositories)

Should be able to reproduce a job exactly 10 years from now.
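
One possible shape for such a configuration record, stored alongside each job; every field name and value here is a placeholder, not an existing NOvA convention.

    # Sketch of a per-job configuration record for reproducibility.
    import json

    config_record = {
        'fcl_file':      'reco_job.fcl',     # the FHiCL configuration used
        'code_tag':      'S15-03-01',        # tag of the experiment code in its repository
        'release':       'novasoft vX.Y',    # software release the job ran against
        'submit_script': 'submit_reco.sh',   # the actual script that was run
    }

    with open('job_config_record.json', 'w') as out:
        json.dump(config_record, out, indent=2)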

Start small, and then add features to this system.

The big bottleneck is that if one job fails, then everything else has to wait.

The open time / close time of the file is what we can track in the job. The job has to do additional tracking itself if you want more.
Getting CPU usage means the job scripts would have to be instrumented by the experiment.
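
A small sketch of what that instrumentation could look like from a Python job wrapper, using only the standard library; the payload command is a placeholder.

    # Sketch: run the payload and capture wall time, CPU time, efficiency and
    # peak memory of the child processes.
    import resource, subprocess, time

    start = time.time()
    subprocess.call(['nova', '-c', 'reco_job.fcl'])   # placeholder payload command
    wall = time.time() - start

    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = usage.ru_utime + usage.ru_stime
    print('wall %.0f s, cpu %.0f s, efficiency %.0f%%' % (wall, cpu, 100.0 * cpu / wall))
    print('peak RSS of children: %d kB' % usage.ru_maxrss)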

Error handling:
Production group needs to run things interactively to find errors.
Retry button for transient issues.
Get log files easily via a web interface.

ART spits out error codes.

NOvA doesn't do error codes in their scripts.

Add a task-id field to the file metadata, so when the FTS picks the file up we know what stage it is in. (Robert)
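
A hedged sketch of how that could look: one extra field in the file's SAM metadata carrying the task id, so the FTS (and anything downstream) can tell which production stage produced the file. The field name, the other metadata values, and the declaration call are assumptions, not an existing convention.

    # Sketch: declare a file to SAM with a hypothetical task-id metadata field.
    import samweb_client

    samweb = samweb_client.SAMWebClient(experiment='nova')

    metadata = {
        'file_name':   'reco_r00012345_s01.root',   # placeholder
        'file_type':   'detector',                  # placeholder
        'data_tier':   'reconstructed',             # placeholder
        'file_size':   123456789,
        'event_count': 1000,
        'nova.production_task_id': '42',            # hypothetical task-id field
    }

    samweb.declareFile(md=metadata)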