G-22Mtg2015-04-20

  • Experiments:
    • Monitoring
      • What do they want to see?
    • Jobs
      • List of types of production job types
        • for now all MC
        • some substantial runs
        • big ones in one pass
        • may revisit some interesting ones, with more stuff turned on
        • may have different MC types running simultaneously
      • magnetic field stuff? no real compute there?
      • will be reco; large data reduction, like NOvA
      • How launched
      • Success criteria
    • Workflow
      • Info in request to approvers
        • as long as there is a comment, and physics group
    • Metrics
      • what reports/metrics would you want from system?
        • probably usual stuff
        • would want to see monitoring factored by category of job
        • similarly for reporting/accounting
        • efficiency / utilization (by type, etc.)
      • might run special MCs by ourselves
      • Can we use monitoring for personal stuff?
        • if I can tag jobs with more than just a name, and group them.

Hooks for monitoring would be good beyond just the production system.
Do you want a hook to specify your own special things, or do you just want to
be able to turn on the same stuff as production for your own jobs?
Possibly both could be useful.

Phased rollout is good.

log file management!

Other Notes

This meeting was with Adam Lyon.

They will have several production scripts running in parallel.

What would go in a request:
--Number of events.
--FHiCL file.
--Comment field stating what you are trying to do.
--...
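
A minimal sketch of what such a request record might look like, assuming the
fields listed above plus the physics group mentioned under Workflow. The class
name, field names, and validation rules are hypothetical and only illustrate
the idea; the real request would carry whatever else ends up on the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionRequest:
    """Hypothetical request record; names are illustrative only."""
    num_events: int      # number of events to generate
    config_file: str     # the FHiCL (.fcl) configuration file
    comment: str         # what the requester is trying to do
    physics_group: str   # group the request is made on behalf of


def validation_errors(req):
    """Return a list of problems an approver would want fixed first."""
    errors = []
    if req.num_events <= 0:
        errors.append("num_events must be positive")
    if not req.config_file.strip():
        errors.append("config_file is required")
    if not req.comment.strip():
        errors.append("comment is required")
    if not req.physics_group.strip():
        errors.append("physics_group is required")
    return errors
```

An empty list from validation_errors() would mean the request has the minimum
an approver asked for: a comment and a physics group.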

Monitoring:
Category of MC jobs.
If you have 3 simultaneous runs, then it would be nice to see the status for each of the 3.
CPU time vs. wall time.
We already have fifemon, which is already generic.
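
As a rough illustration of monitoring factored by category, the sketch below
computes CPU time over wall time per job category. It is not fifemon's
interface; the job records, their keys, and the function name are assumptions
made up for this example.

```python
from collections import defaultdict


def efficiency_by_category(jobs):
    """Sum CPU and wall seconds per category and return CPU/wall ratios.

    jobs: iterable of dicts with hypothetical keys
          'category', 'cpu_seconds', 'wall_seconds'.
    """
    cpu = defaultdict(float)
    wall = defaultdict(float)
    for job in jobs:
        cpu[job["category"]] += job["cpu_seconds"]
        wall[job["category"]] += job["wall_seconds"]
    # A category with no wall time yet is reported as 0.0 rather than dividing by zero.
    return {cat: (cpu[cat] / wall[cat] if wall[cat] > 0 else 0.0) for cat in wall}
```

With three simultaneous runs tagged with three different categories, the
returned dict would have one entry per run, which is the per-run view asked
for above.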

Hooks to access data. //Outside of scope. (Robert)

Bookkeeping is an important part that everyone does for themselves.
Log files get stashed away someplace. Adam suggested putting the log files in a dropbox.

Different styles of running:
--Keepup processing.
--Process everything I have over there.
--I want to generate n events.

Condor can report a 0 exit status even when the job actually hit an error.
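
One way to guard against this is to treat the exit status as necessary but not
sufficient. The sketch below is only an assumption about how a success check
could be layered: the error markers and the function name are hypothetical,
and real success criteria would come from the experiment.

```python
import re

# Hypothetical error markers; real ones depend on the experiment's code and logs.
ERROR_PATTERNS = [
    re.compile(r"\bFATAL\b"),
    re.compile(r"Segmentation fault"),
    re.compile(r"Traceback \(most recent call last\)"),
]


def job_succeeded(exit_status, log_text, expected_outputs_present):
    """Return True only if the exit status, outputs, and log all look clean."""
    if exit_status != 0:
        return False
    if not expected_outputs_present:
        # A 0 exit status with missing output files still counts as a failure.
        return False
    return not any(p.search(log_text) for p in ERROR_PATTERNS)
```

A check like this does not say whether a failure was a service fault or a code
fault; it only flags jobs whose 0 exit status should not be trusted.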

Telling a service fault from a code fault just comes with experience.