Marc Mengel, 04/07/2015 05:10 PM

h1. Mu2e2015-04-07

** Monitoring
*** What do they want to see?
****
****
** Jobs
*** List of production job types
**** N stages of MC
**** Event mixing -- maybe
**** MC reco
**** eventually -- real reco?
**** eventually -- calibration?
***** probably won't be production group for initial phase
***** eventually
**** Get to where they're running smoothly, then hand off to production
*** How launched
**** scripts wrapped around jobsub-client
**** So far none are SAM project based; relatively soon (weeks) to use SAM projects...
*** Success criteria categories
**** success reported by script
**** success post-hoc
**** data integrity
**** arbitrary user-provided logfile check..
***** maybe script per job / per project / per campaign
** Workflow
*** Info in request to approvers
****
****
** Metrics
*** what reports/metrics would you want from system?
****

Data disks full?
98% of time in diagnosing/triage of problems
Can the division spend time on reliability to reduce above?
error codes largely Art -- if internal.
May have a period where production jobs are SAM based, and other work isn't.

Once condor_q showed empty, scanned logfiles for completion codes, and one failure was that the
same job could complete multiple times (condor resubmit?).
This happened more often than expected... much discussion.
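
A minimal sketch of the post-hoc logfile scan described above, flagging both failed jobs and jobs that reported completion more than once. The log line format (@job=<id> exit=<code>@) and the function names are hypothetical, made up for illustration; they are not a real jobsub or condor format.

```python
import re
from collections import defaultdict

# Hypothetical completion line, e.g. "job=1234.5 exit=0".
# This is an assumed format, not what jobsub or condor actually writes.
COMPLETION_RE = re.compile(r"job=(?P<job>\S+)\s+exit=(?P<code>-?\d+)")


def scan_completions(lines):
    """Map each job id to the list of exit codes it reported, in log order."""
    codes = defaultdict(list)
    for line in lines:
        match = COMPLETION_RE.search(line)
        if match:
            codes[match.group("job")].append(int(match.group("code")))
    return dict(codes)


def summarize(codes):
    """Return (jobs with any nonzero exit code, jobs that completed more than once)."""
    failed = sorted(job for job, c in codes.items() if any(x != 0 for x in c))
    repeated = sorted(job for job, c in codes.items() if len(c) > 1)
    return failed, repeated
```

Keeping every exit code per job, rather than only the last one, is what makes the repeated completions visible.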

In this upcoming phase, if our success rate is in the 90s, we need not do anything.

Idea of black hole nodes: cvmfs errors, bus errors, etc. eating jobs.
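
One possible way to spot black hole nodes, sketched under assumptions: given (node, succeeded) records collected from job logs, flag nodes that ran at least a minimum number of jobs and failed most of them. The record layout, the function name, and both thresholds are illustrative choices, not values from the discussion.

```python
from collections import Counter


def find_black_holes(records, min_jobs=5, max_failure_rate=0.8):
    """records: iterable of (node_name, succeeded) pairs.

    Returns the sorted names of nodes that ran at least min_jobs jobs
    and failed at least max_failure_rate of them. Thresholds are assumed.
    """
    total = Counter()
    failed = Counter()
    for node, succeeded in records:
        total[node] += 1
        if not succeeded:
            failed[node] += 1
    return sorted(
        node for node in total
        if total[node] >= min_jobs and failed[node] / total[node] >= max_failure_rate
    )
```

The minimum job count keeps a node with one or two unlucky failures from being flagged.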

Provide tools to check things, etc. and we'll call them.
cvmfs up-to-date checks in jobsub wrapper?


Monitoring

Thing I'd want to see is sort of progress bars, percent complete vs time, etc.
on each campaign.
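
A rough sketch of the percent-complete-vs-time idea for a single campaign, assuming we have a completion timestamp per finished job and know the total number of jobs in the campaign. The function names and the text rendering are hypothetical.

```python
def percent_complete_series(completion_times, total_jobs):
    """Return (time, percent complete) points, one per finished job, in time order."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return [
        (t, 100.0 * done / total_jobs)
        for done, t in enumerate(sorted(completion_times), start=1)
    ]


def progress_bar(percent, width=20):
    """Render a percentage as a fixed-width text progress bar."""
    filled = int(width * percent / 100)
    return "[" + "#" * filled + "." * (width - filled) + "] {:.1f}%".format(percent)
```

For example, @progress_bar(50.0, width=10)@ gives @[#####.....] 50.0%@.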

Concatenation/merge stage projects?

Merging -- we care about when we run a grid cluster; MC generation within one cluster gets a
unique run number and subruns are a cluster number.  Subruns not split across files.
Bookkeeping corners we haven't explored -- would like as much as possible to have subruns
made contiguous and in order in a merge phase.
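
A small sketch of the ordering and contiguity check a merge phase might do, assuming each input file holds exactly one subrun (consistent with subruns not being split across files) and is described by a (run, subrun, filename) tuple. The tuple layout and function names are assumptions for illustration.

```python
from itertools import groupby


def merge_order(files):
    """Sort (run, subrun, filename) tuples by run, then subrun."""
    return sorted(files, key=lambda f: (f[0], f[1]))


def missing_subruns(files):
    """Return {run: [missing subrun numbers]} for runs whose subruns are not contiguous."""
    gaps = {}
    for run, group in groupby(merge_order(files), key=lambda f: f[0]):
        subruns = [f[1] for f in group]
        # Every subrun between the first and last seen is expected to be present.
        expected = set(range(subruns[0], subruns[-1] + 1))
        missing = sorted(expected - set(subruns))
        if missing:
            gaps[run] = missing
    return gaps
```

A merge job could refuse to start, or just warn, whenever @missing_subruns@ returns a non-empty result.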