Summary of performance issues

Issues prior to the deployment of 'cp1' are discussed in the HISTORY section.

APP performance

The following table summarizes all recorded performance issues
in the APP area (measured from /grid/data),
as monitored from minos27.

SLOW is the count of samples under 5 MBytes/second.
PPmil is the number of slow samples per 1000 samples.
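The two definitions above can be sketched as follows. This is a minimal illustration, not the monitoring script itself: the function names and the example rates are hypothetical, and truncating PPmil to an integer is an assumption inferred from the table rows (2009/06 has 303 slow of 9497 samples, about 31.9 per 1000, listed as 31).

```python
SLOW_THRESHOLD_MB_S = 5.0  # threshold stated in the text above


def count_slow(rates_mb_s):
    """Count samples whose transfer rate is under the threshold."""
    return sum(1 for rate in rates_mb_s if rate < SLOW_THRESHOLD_MB_S)


def ppmil(slow, samples):
    """Slow samples per 1000 samples, truncated to an integer (assumed)."""
    if samples == 0:
        return 0
    return (slow * 1000) // samples


# Hypothetical sample rates in MBytes/second, for illustration only.
example_rates = [42.0, 3.5, 17.2, 4.9, 60.1]
print(count_slow(example_rates))  # two of the five rates are under 5
print(ppmil(126, 9438))           # SAMPLES and SLOW from the 2009/05 row
```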

YEAR/MO SAMPLES SLOW PPmil COMMENT
2009/05 9438 126 13 2009-04-30 cp1 deployed
2009/06 9497 303 31
2009/07 43731 1593 36
2009/08 42811 3203 74 2009-08-21 cpn deployed
                      2009-08-25 d0ora2 retired
2009/09 42259 1451 34 2009-09-24 D0 project disks isolated
2009/10 43843 317 7
2009/11 30992 346 11 2009-11-22 minos files on new disks
2009/12 19920 11 0
2009 242491 7350 30 down 3 % up 97 %
2010/01 19915 2 0
2010/02 17646 2 0
2010/03 19826 2 0
2010/04 19398 0 0
2010/05 20035 1 0
2010/06 19394 0 0
2010/07 19732 0 0
2010/08 19988 0 0
2010/09 19227 0 0
2010/10 19983 0 0
2010/11 18630 0 0
2010/12 19957 0 0
2010 233731 7 0 down .003 % up 99.997 %
2011/01 19792 0 0
2011/02 18034 2 0
2011/03 19949 1 0
2011/04 19324 2 0
2011/05 6742 5 0
2011/06 18991 240 12 Minerva startup
2011/07 19800 126 6
2011/08 19978 22 1
2011/09 19327 3 0
2011/10 19173 0 0
2011/11 19334 28 1
2011 190444 429 2 down .2 % up 99.8 %
2012/01 19831 23 1
2012/02 18684 0 0
2012/03 20964 3 0
2012/04 19298 9 0
2012/05 19855 1 0
2012/06 19332 2 0
2012/07 19942 71 3
2012/08 19971 32 1
2012/09 19334 24 1
2012/10 13718 43 3
2012/11 15124 250 16 Mu2e and LBNE startup
2012/12 1616 0 0
2012 207669 458 2 down .2 % up 99.8 %
2013/01 19939 47 2
2013/02 18052 0 0
2013/03 19795 62 3 DATA/APP heads separated Mar 21
2013/04 19343 0 0
2013/05 19993 0 0
2013/06 19321 3 0
2013/07 19992 0 0
2013/08 20000 0 0
2013/09 19350 0 0
2013/10 19993 3 0
2013/11 19379 1 0
2013/12 19999 0 0
2013 235156 116 0 down .05 % up 99.95 %
2014/01 19980 1 0
2014/02 18042 1 0
2014/03 19924 3 0
2014/04 19330 0 0
2014/05 19958 3 0
2014/06 19070 18 0
2014/07 19964 1 0
2014/08 19953 0 0
2014/09 19323 1 0
2014/10 6877 0 0
2014/11 17739 2 0
2014/12 19987 6 0
2014 220147 36 0 down .02 % up 99.98 %
2015/01 19980 6 0
2015/02 18057 0 0
2015/03 10115 0 0
2015 48152 6 0
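
The yearly rows in the table above end with a "down"/"up" percentage. Reading "down" as the share of slow samples and "up" as the remainder is an interpretation that matches the table's arithmetic; it is not stated in the text. A minimal sketch under that assumption, with a hypothetical function name:

```python
def down_up_percent(slow, samples):
    """Return (down, up) percentages for a yearly total row.

    Assumes 'down' means the percentage of slow samples and
    'up' means everything else.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    down = 100.0 * slow / samples
    return down, 100.0 - down


# Yearly totals for 2009 taken from the table above.
down, up = down_up_percent(7350, 242491)
print(f"down {down:.2f} % up {up:.2f} %")
```

Rounded to whole percent, the 2009 totals give the "down 3 % up 97 %" shown in the table.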

DATA performance

The following table summarizes /minos/data slowdowns
as monitored on minos25.

SLOW is the count of samples under 5 MBytes/second.
PPmil is the number of slow samples per 1000 samples.

YEAR/MO SAMPLES SLOW PPmil COMMENT
2010/01 19988 4 0
2010/02 17309 0 0
2010/03 19924 3 0
2010/04 19303 39 2
2010/05 19927 29 1
2010/06 19324 3 0
2010/07 19953 0 0
2010/08 19921 0 0
2010/09 24001 0 0
2010/10 29356 2 0
2010/11 18573 0 0
2010/12 19894 0 0
2010 247473 80 0
2011/01 18263 1 0
2011/02 17849 1 0
2011/03 19739 15 0
2011/04 19086 18 0
2011/05 19707 3 0
2011/06 18974 266 14
2011/07 19646 158 8
2011/08 19795 28 1
2011/09 19223 25 1
2011/10 19360 0 0
2011/11 19272 27 1
2011/12 19905 0 0
2011 230819 542 2
2012/01 19891 32 1
2012/02 18581 2 0
2012/03 20326 43 2
2012/04 19161 21 1
2012/05 19425 133 6
2012/06 19207 0 0
2012/07 19778 100 5
2012/08 19899 90 4
2012/09 19148 62 3
2012/10 19552 89 4
2012/11 19056 297 15
2012/12 19850 1 0
2012 233874 870 3
2013/01 19822 52 2
2013/02 17958 1 0
2013/03 19511 121 6 DATA/APP heads separated Mar 21
2013/04 19136 105 5
2013/05 19837 1 0
2013/06 19211 25 1
2013/07 19835 38 1
2013/08 19843 180 9
2013/09 19202 34 1
2013/10 19907 34 1
2013/11 19281 39 2
2013/12 19910 5 0
2013 233453 635 2
2014/01 19842 56 2
2014/02 17824 60 3
2014/03 19685 200 10
2014/04 19068 58 3
2014/05 19786 18 0
2014/06 19224 41 2
2014/07 19895 30 1
2014/08 19592 143 7
2014/09 18843 116 6 average 80/month 2014/01-09
2014/10 6899 9 1
2014/11 17712 11 0
2014/12 19900 33 1
2014 218270 775 3
2015/01 19802 9 0
2015/02 17989 19 1
2015/03 10063 6 0 average 15/month since 2014/10
2015 47856 34 0
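
The two "average .../month" comments in the table above can be checked against the monthly SLOW counts. The sketch below copies those counts from the table; the variable names are illustrative, and 2015/03 is treated as a full month even though it has only about half the usual number of samples.

```python
from statistics import mean

# Monthly SLOW counts from the DATA table, 2014/01 through 2014/09.
slow_2014_01_09 = [56, 60, 200, 58, 18, 41, 30, 143, 116]

# Monthly SLOW counts from 2014/10 through 2015/03.
slow_since_2014_10 = [9, 11, 33, 9, 19, 6]

avg_before = mean(slow_2014_01_09)
avg_after = mean(slow_since_2014_10)

print(f"2014/01-09:    {avg_before:.1f} slow samples/month")
print(f"since 2014/10: {avg_after:.1f} slow samples/month")
```

The computed means are about 80.2 and 14.5, in line with the 80 and 15 quoted in the table comments.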

Summary of Service Desk tickets

DATE TICKET ISSUE COMMENTS
2009 05 29 INC000000002734 slowdown shut down all Minos processing
                           detected Mon/Tue/Thu/Sat 18:15 to 00:15 pattern (d0ora2)
2009 06 16 INC000000003924 blocked Bluearc export to diagnose overload no effect
2009 06 25 INC000000004598 slowdown due to D0 overload
2009 08 05 INC000000007362 /grid/data overload due to D0 prj_root access, separated Aug 25
                           due to d0ora2 Oracle RMAN, retired 2009 Aug 25
                           Noted 4 hour Sep 18 slowdown due to Minos
2010 03 03 INC000000027410 dd Bluearc messages on Minos servers no harm, never understood, gone by 2013
2010 03 08 INC000000027885 /minos/app slowdowns March 2 12:00 to 21:00 CST
                           March 6 13:00 to 17:15 CST.
                           March 9 05:46 to 06:34 CST.
                           minos/ahimmil not using cpn
                           Fermigrid internal monitoring scripts part of cause
2010 10 21 INC000000058062 CMS Bluearc slow faulty fiber fixed 10
2011 05 10 INC000000082064 high activity no specific cause found, monitoring was down
2011 06 27 INC000000089660 Jun 27 16:00 to Jun 28 04:39 heavy activity, no user identified
2011 07 06 INC000000090650 slowdown slow 15:40 to 19:50
                           no user identified. Adjusted Bluearc parameters Jul 7
2012 01 31 INC000000199724 NOvA gpvm slow 01/31 21-22:00 CST due to NOvA Tim Kutnink
                           slow 01-31 11:20-12:54
2012 05 26 INC000000256200 gpsn01 failures microboone grid jobs chmod'ing 200K files each
2012 07 16 INC000000282852 gpsn batch overloading Bluearc microboone script chmod'ing 200K files
                           same script, new user.
                           Searched for and destroyed the script
2012 07 27 INC000000289055 mu2e nodes slow due to bad NOvA script, corrected
2012 09 03 INC000000303757 NOvA slow
           INC000000303766 gm2gpvm01 slow
           INC000000304102 minervagpvm* slow
                           mu2e writing unregulated 40 GB files in error
2012 10 10 INC000000322355 NOvA data slow avalanche of expired locks
                           repaired in cpn v1.2
2012 10 17 INC000000319829 NOvA slow builds generally Oct 9-12 busy network, bad cable, NOvA overloads
2012 11 18 INC000000340726 mu2e slowdown
2012 11 21 INC000000340726 mu2e slowdown
2012 11 23 INC000000341585 lock failures network switch failed
2012 11 29 INC000000343888 moderate slowdown seen in monitoring
2012 11 30 2 INC000000344411 compilation on gm2gpvm
             INC000000344416 lbnegpvm02 excessively slow
             INC000000344427 CPN locks incorrect (main mu2e ticket)
             INC000000344438 CDF project area inaccessible
             INC000000344449 virtual machines ok?
             INC000000344473 mnv bluearc slowness
             INC000000344534 condor_q not working on uboonegpvmXX
             INC000000344709 gpsn01 condor hung again
                             mu2e not using cpn
2012 12 21 INC000000354424 minerva uneven I/O not enough details to address
2013 01 08 INC000000358973 mu2e slowdown 2 minutes, no obvious cause
2013 01 14 INC000000361604 fermigrid dismounts short outages
                           due to TB Minerva logs
                           deployed new jobsub to fix
2013 01 16 INC000000362099 novagpvm02
           INC000000362101 argoneut
           INC000000362113 mu2e
           INC000000362159 nova
           INC000000362188 fermigrid
                           LBNE not using cpn