Mu2e2015-04-07 » History » Version 2

Michael Gheith, 04/08/2015 11:52 AM

1 1 Marc Mengel
h1. Mu2e2015-04-07

** Monitoring
*** What do they want to see?
****
****
** Jobs
*** Types of production jobs
**** N stages of MC
**** Event mixing -- maybe
**** MC reco
**** eventually -- real reco?
**** eventually -- calibration?
***** probably won't be production group for initial phase
***** eventually
**** Get to where they're running smoothly, then hand off to production
*** How launched
**** scripts wrapped around jobsub-client
**** So far none are SAM project based; they expect to use SAM projects relatively soon (weeks)
*** Success criteria categories
**** success reported by script
**** success post-hoc
**** data integrity
**** arbitrary user-provided logfile check
***** maybe a script per job / per project / per campaign
** Workflow
*** Info in request to approvers
****
****
** Metrics
*** What reports/metrics would you want from the system?
****

Data disks full?
98% of time is spent diagnosing/triaging problems.
Can the division spend time on reliability to reduce the above?
Error codes are largely from Art, if internal.
May have a period where production jobs are SAM based and other work isn't.

Once condor_q showed empty, logfiles were scanned for completion codes; one failure mode was that the
same job could complete multiple times (condor resubmit?).
This happened more often than expected... much discussion.
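
The logfile scan described above could flag repeated completions along these lines. This is only a sketch: the record format (a job id paired with an exit code) and the function name @summarize_completions@ are assumptions for illustration, not the experiment's actual check script.

```python
from collections import defaultdict


def summarize_completions(records):
    """Group (job_id, exit_code) completion records by job.

    Returns (duplicates, failed):
      duplicates: jobs that reported completion more than once
      failed:     jobs whose every recorded exit code was nonzero
    """
    codes_by_job = defaultdict(list)
    for job_id, exit_code in records:
        codes_by_job[job_id].append(exit_code)

    duplicates = {job: codes for job, codes in codes_by_job.items()
                  if len(codes) > 1}
    failed = sorted(job for job, codes in codes_by_job.items()
                    if all(code != 0 for code in codes))
    return duplicates, failed


# Hypothetical records, as if parsed from per-job logfiles.
records = [("1234.0", 0), ("1234.1", 0), ("1234.1", 0), ("1234.2", 1)]
duplicates, failed = summarize_completions(records)
```

Keeping every exit code per job, rather than only the last one, makes it possible to see whether a resubmitted job succeeded twice or failed once and then succeeded.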

In this upcoming phase, if our success rate is in the 90s (percent), we need not do anything.

Idea of black hole nodes: cvmfs errors, bus errors, etc. eating jobs.
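
One way to spot black hole nodes is to count failures per worker node and flag nodes that fail nearly everything they run. The sketch below assumes job results are available as (node name, succeeded) pairs; the thresholds and names are illustrative, not agreed values.

```python
from collections import Counter


def find_black_hole_nodes(job_results, min_jobs=5, failure_threshold=0.8):
    """Return nodes that ran at least min_jobs jobs and failed
    at least failure_threshold of them.

    job_results: iterable of (node_name, succeeded) pairs.
    """
    total = Counter()
    failures = Counter()
    for node, succeeded in job_results:
        total[node] += 1
        if not succeeded:
            failures[node] += 1
    return sorted(node for node, count in total.items()
                  if count >= min_jobs
                  and failures[node] / count >= failure_threshold)


# Hypothetical results: nodeA eats every job, nodeB is mostly healthy,
# nodeC has too few jobs to judge.
results = ([("nodeA", False)] * 6
           + [("nodeB", True)] * 5 + [("nodeB", False)]
           + [("nodeC", False)] * 2)
suspects = find_black_hole_nodes(results)
```

The @min_jobs@ floor avoids flagging a node on the strength of one or two unlucky jobs.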

Provide tools to check things, etc., and we'll call them.
cvmfs up-to-date checks in the jobsub wrapper?
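
A wrapper-side freshness check might look like the sketch below. It assumes the CVMFS client exposes the mounted repository revision as the extended attribute @user.revision@ (what @attr -g revision@ reads); that attribute name should be verified against the deployed client. It also assumes the minimum acceptable revision is supplied by whoever submits the job. Both function names are hypothetical.

```python
import os


def cvmfs_revision(repo_path, read_xattr=None):
    """Return the repository revision as an int, or None if unreadable.

    read_xattr is injectable for testing; by default os.getxattr is used
    where the platform provides it (Linux).
    """
    reader = read_xattr or getattr(os, "getxattr", None)
    if reader is None:
        return None
    try:
        raw = reader(repo_path, "user.revision")  # assumed attribute name
        text = raw.decode() if isinstance(raw, bytes) else str(raw)
        return int(text.strip())
    except (OSError, ValueError):
        return None


def cvmfs_is_current(repo_path, min_revision, read_xattr=None):
    """True only if the revision can be read and is at least min_revision."""
    revision = cvmfs_revision(repo_path, read_xattr)
    return revision is not None and revision >= min_revision
```

A node where the attribute cannot be read at all is treated as not current, which also covers the case of CVMFS not being mounted.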

Monitoring

What I'd want to see is something like progress bars, percent complete vs. time, etc.
for each campaign.

Concatenation/merge stage projects?

Merging -- we care about when we run a grid cluster; MC generation within one cluster gets a
unique run number and subruns are a cluster number.   Subruns are not split across files.
Bookkeeping corners we haven't explored -- would like as much as possible to have subruns
made contiguous and in order in a merge phase.
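
The ordering wish above can be sketched as a merge planner that sorts inputs by (run, subrun) and only groups files whose subruns are consecutive within one run. The one-subrun-per-file layout, the batch size limit, and the name @plan_merges@ are assumptions made for the example.

```python
def plan_merges(files, max_per_merge=3):
    """Plan merge batches with subruns contiguous and in order.

    files: iterable of (run, subrun, filename), one subrun per file
    (an assumption for this sketch). Returns a list of filename lists.
    """
    batches = []
    current = []
    for run, subrun, name in sorted(files):
        contiguous = bool(current) and (current[-1][0] == run
                                        and current[-1][1] + 1 == subrun)
        # Start a new batch on a gap, a run change, or a full batch.
        if current and (not contiguous or len(current) >= max_per_merge):
            batches.append(current)
            current = []
        current.append((run, subrun, name))
    if current:
        batches.append(current)
    return [[name for _, _, name in batch] for batch in batches]


# Hypothetical inputs: run 1 is missing subrun 3, run 2 has one file.
files = [(1, 2, "c"), (1, 0, "a"), (1, 1, "b"), (1, 4, "d"), (2, 0, "e")]
plan = plan_merges(files)
```

A gap in the subrun sequence closes the current batch, so a missing subrun shows up as a short batch instead of being silently merged over.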

h2. Other Notes

Analysis computing may want to use this production system, but the scope is just for the production group for now.

Normal operation procedure:
Sit down with them (OPG) and define specs.  Rob is skeptical of the request form; he feels there will need to be human contact.

They have their own script wrapped around jobsub-client.

If their script writes log files, we need to tell the experiment where to write them, so we can analyze them later.

They use a check script to check a bunch of things.  After that, they use their cleanup script.

Write log information on the worker node, then use ifdh to send the logs to BlueArc.

Have generic base tests for success that will work for all experiments.
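
A set of generic base tests could be a list of small check functions applied to every job, with room for experiment-specific checks on top. The sketch below is illustrative only: the job record fields (@exit_code@, @output_sizes@, @log_text@) and the error pattern are assumptions, not an agreed schema.

```python
import re


def exit_code_ok(job):
    """The job's wrapper script reported success."""
    return job.get("exit_code") == 0


def outputs_present(job):
    """At least one output file exists and none is empty."""
    sizes = job.get("output_sizes", {})
    return bool(sizes) and all(size > 0 for size in sizes.values())


def log_has_no_errors(job, pattern=r"\b(FATAL|Segmentation fault|Bus error)\b"):
    """The log text contains none of the known fatal markers."""
    return re.search(pattern, job.get("log_text", "")) is None


BASE_CHECKS = [exit_code_ok, outputs_present, log_has_no_errors]


def evaluate(job, extra_checks=()):
    """Return names of failed checks; an empty list means success."""
    checks = (*BASE_CHECKS, *extra_checks)
    return [check.__name__ for check in checks if not check(job)]


# Hypothetical job records.
good_job = {"exit_code": 0, "output_sizes": {"out.art": 1024},
            "log_text": "done"}
bad_job = {"exit_code": 1, "output_sizes": {"out.art": 1024},
           "log_text": "Bus error"}
good_result = evaluate(good_job)
bad_result = evaluate(bad_job)
```

The @extra_checks@ argument is where a per-job, per-project, or per-campaign script could plug in without changing the base tests.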

Stage 1: 50% complete
Stage 2: 25% complete  //depends on stage 1 output
etc...
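
The per-stage view above could be rendered as text progress bars. This is a minimal sketch that assumes each stage is described by a name plus completed and total job counts. It does not model the dependency of stage 2 on stage 1 output, and the function names are hypothetical.

```python
def progress_bar(fraction, width=20):
    """Render a fraction in [0, 1] as a fixed-width text bar."""
    fraction = min(max(fraction, 0.0), 1.0)
    filled = int(round(fraction * width))
    return "[" + "#" * filled + "." * (width - filled) + "] " + f"{fraction:4.0%}"


def campaign_report(stages):
    """stages: list of (name, jobs_done, jobs_total) tuples."""
    lines = []
    for name, done, total in stages:
        fraction = done / total if total else 0.0
        lines.append(f"{name:<10}{progress_bar(fraction)}")
    return "\n".join(lines)


# Hypothetical campaign matching the example percentages.
report = campaign_report([("Stage 1", 50, 100), ("Stage 2", 25, 100)])
print(report)
```

Recording the same fractions with a timestamp at each update would give the percent complete vs. time view as well.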

Give the jobs the ability to communicate with GlideinWMS.  Does the node have CVMFS?  Store this data.

IFDH stages the files in and out.

Mu2e has 3 flavors of jobs.

Perhaps have the jobs broadcast their logs, via HTTP, to a dedicated log server.  Maybe just the tail of the log files?  This log server would contain a database, which could be queried to get relevant information...
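
Sending only the tail of a log over HTTP might look like the following sketch. The server URL, the JSON payload layout, and the function names are all hypothetical, since no log server exists yet. The request is built but not sent.

```python
import json
import os
import tempfile
import urllib.request
from collections import deque

LOG_SERVER_URL = "http://logserver.example.org/api/logs"  # hypothetical endpoint


def tail_lines(path, n=50):
    """Return the last n lines of a text file."""
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=n))


def build_log_request(url, job_id, lines):
    """Build (but do not send) a JSON POST carrying the log tail."""
    payload = json.dumps({"job_id": job_id, "tail": "".join(lines)})
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Demo with a throwaway five-line log file.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("".join(f"line {i}\n" for i in range(1, 6)))
demo_tail = tail_lines(tmp.name, n=2)
demo_request = build_log_request(LOG_SERVER_URL, "1234.0", demo_tail)
os.unlink(tmp.name)
# To actually send: urllib.request.urlopen(demo_request, timeout=10)
```

Keeping the payload small and setting a timeout matters on a worker node, where a slow log server should not hold up the job.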

Log stuff should go to dCache, and eventually to tape.  (Mengel)