PLANS¶
The 'perf' file sets a lower limit on performance,
as determined by the content of the PERF file.
In 2012 the PERF file contains a fixed number.
It would be filled by an external process
measuring overall file system performance.
We have never implemented an automatic process to set PERF.
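A minimal sketch of what such a process might look like follows. The test file, the choice of a 100 MB timed read with dd, and the assumption that PERF holds an integer rate in MB/s are all illustrative assumptions, not part of the current product.

# hypothetical PERF updater - never implemented, shown only as a sketch
TESTFILE=/grid/data/ifmon/CPNTEST/FILES/20G      # existing test file used elsewhere on this page
SECS=$( { time -p dd if=${TESTFILE} of=/dev/null bs=1M count=100 ; } 2>&1 | awk '/^real/ {print $2}' )
RATE=`echo "100 / ${SECS}" | bc`                 # assumed units : MB/s for a 100 MB read
for L in /grid/data/*/LOCK ; do
  echo ${RATE} > ${L}/PERF                       # one PERF per group, as in the status output below
done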
The 'glimit' file sets a global lock limit for all groups under /grid/data/.
Because we did not originally share file systems,
the lock script did not enforce this limit.
In 2012 we have several separate shared file systems.
If we activate glimit, we will need to build in knowledge of which groups share a file system.
We may want a separate global limit for each Bluearc head.
Again, knowledge of which file systems live on which heads would be needed.
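A sketch of the kind of check the lock script would need follows. The /grid/data/glimit path and the simple file count are assumptions for illustration, and the sketch deliberately ignores the sharing and head information discussed above.

# hypothetical global limit check - not in the current lock script
GLIMIT=`cat /grid/data/glimit`                          # assumed location of the global limit file
NLOCKS=`find /grid/data/*/LOCK/LOCKS -type f | wc -l`   # locks currently held, summed over all groups
if [ ${NLOCKS} -ge ${GLIMIT} ] ; then
  echo "global limit ${GLIMIT} reached ( ${NLOCKS} locks held ), would queue"
fi
# a real implementation would count only groups sharing a file system,
# or sharing a Bluearc head, which is the missing information noted above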
DEPLOYMENT¶
- Create the new version in UPS
As products@if01 :
. ~ifmon/shrc/kreymer
. /grid/fermiapp/products/common/etc/setups.sh
cd /grid/fermiapp/products/common/prd/cpn
OVER=v1.6
NVER=v1.7
OVERU=`echo ${OVER} | tr . _`
NVERU=`echo ${NVER} | tr . _`
cp -vax ${OVERU} ${NVERU}
ups declare cpn ${NVER} -f NULL \
 -r /grid/fermiapp/products/common/prd/cpn/${NVERU}/NULL \
 -m cpn.table
setup cpn ${NVER}
cd $CPN_DIR/bin
ADM=/afs/fnal.gov/files/expwww/numi/html/computing/admin/bluearc
diff ${ADM}/lock lock
cp ${ADM}/lock lock
etc.
- Test the new version as noted below.
- Upload to UPD
As kreymer@minos27 :
unset SETUPS_DIR UPS_DIR SETUP_UPS
. /grid/fermiapp/products/common/etc/setups.sh
setup upd
UPDADD=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/updadd
NVER=v1.7
${UPDADD} NULL cpn ${NVER}
upd list -aK+ cpn
- Get approval via Change Management, e.g. CHG000000005680
- Notify CS Liaisons at cs-liaison@fnal.gov
- Deploy with
As ifmon@if01 :
. /grid/fermiapp/products/common/etc/setups.sh
export CPN_LOCK_GROUP=gpcf
setup cpn
DAY=`date +%Y%m%d`
LLOG=/grid/data/ifmon/CPNTEST/log
OLDCPN="v1r6"
NEWCPN="v1r7"
lock statusall | tee ${LLOG}/${DAY}-STATS-${OLDCPN}.log
ls -l /grid/data/*/LOCK/LOCKS | tee ${LLOG}/${DAY}-LOCKS-${OLDCPN}.log
ls -l /grid/data/*/LOCK/QUEUE | tee ${LLOG}/${DAY}-QUEUE-${OLDCPN}.log
As products@if01 :
. /grid/fermiapp/products/common/etc/setups.sh
OLDCPN="v1.6"
NEWCPN="v1.7"
ups list -aK+ cpn
ups declare -c cpn ${NEWCPN}
date
ups list -aK+ cpn
In the same ifmon@if01 session :
lock statusall | tee ${LLOG}/${DAY}-STATS-${NEWCPN}.log
ls -l /grid/data/*/LOCK/LOCKS | tee ${LLOG}/${DAY}-LOCKS-${NEWCPN}.log
ls -l /grid/data/*/LOCK/QUEUE | tee ${LLOG}/${DAY}-QUEUE-${NEWCPN}.log
- Monitor with lock statusall
- Fallback is
ups declare -c cpn ${OLDCPN}
TESTING¶
ALTERNATE LOCKS¶
By default, the locks are managed from an area
/grid/data/${GROUP}/LOCK
where ${GROUP} is the group name taken from the output of the id command.
You can override the lock area, for example
export CPN_LOCK_BASE=/grid/data/ifmon/test
You can override the group, for example
export CPN_LOCK_GROUP=ifmon
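For example, putting the two overrides together and checking with the status command used in the tests below ( the id -gn form is just one way to display the default group ) :

id -gn                                     # default group, which the lock script would otherwise use
export CPN_LOCK_BASE=/grid/data/ifmon/test
export CPN_LOCK_GROUP=ifmon
lock status                                # now reports on /grid/data/ifmon/test/ifmon/LOCK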
VALIDATION¶
CHECKLIST¶
- lock
- LOCKING - obsolete
- single queued
- single queued interrupted
- 20 files
- copy 1
- copy interrupted
- copy 20
- copy 20 holding
- copy 20 large
- 20 sleep 5 min
Use the ifmon account for local tests, with a private LOCK area¶
Set up the desired version of cpn :
unset SETUP_UPS SETUPS_DIR UPS_DIR . /grid/fermiapp/products/common/etc/setups.sh setup cpn
Status¶
Test the statusv and statusallv commands in the public context.
This also exercises the status and statusall commands.
export CPN_LOCK_BASE=/grid/data
export CPN_LOCK_GROUP=nova
lock statusv nova
LOCK STATUS Thu Dec 6 20:05:14 UTC 2012
LOCKS 0 of 10 ( 0 stale )
QUEUE 0       ( 0 stale )
Control files
limit perf PERF rate stale staleq wait small
   10    3   50    1    10    600    5

lock statusallv
LOCK summary Thu Dec 6 20:06:15 UTC 2012
argoneut   LOCKS 0/20 (   0) QUEUE 0 (   0)      0
d0         LOCKS 0/20 (   0) QUEUE 0 (   0)      3
e875       LOCKS 0/ 5 (   0) QUEUE 0 (   0)      3
e938       LOCKS 4/20 (   0) QUEUE 6 (   0)     10
fnalgrid   LOCKS 0/20 (   0) QUEUE 0 (   0)    341
gm2        LOCKS 0/ 5 (   0) QUEUE 0 (   0)      0
gpcf       LOCKS 0/20 ( 188) QUEUE 0 ( 268) 180772
lbne       LOCKS 0/20 (   0) QUEUE 0 (   0)      0
marslbne   LOCKS 0/20 (   0) QUEUE 0 (   0)  11418
marsmu2e   LOCKS 0/ 5 (   0) QUEUE 0 (   0)      0
microboone LOCKS 0/20 (   0) QUEUE 0 (   0)    226
mu2e       LOCKS 0/ 5 (   0) QUEUE 0 (   0)      0
mu2epro    LOCKS 0/ 5 (   1) QUEUE 0 (   0)   4588
nova       LOCKS 0/10 (   0) QUEUE 0 (   0)      0
t-962      LOCKS 0/20 (   0) QUEUE 0 (   0)      0
Control files
GROUP      limit perf PERF rate stale staleq wait   small
argoneut      20    3   50    1    10    600    5 1000000
d0            20    3   50    1    10    600    5
e875           5    3   50    1    10    600    5 1000000
e938          20    3   50    1    10    600    5
fnalgrid      20    3   50   10    10    600    5 1000000
gm2            5    5   50    1     3      3    5 1000000
gpcf          20    3   50    1    10    600    5 1000000
lbne          20    3   50    1    10    600    5 1000000
marslbne      20    3   50    1    10    600    5 1000000
marsmu2e       5    3   50    1    10    600    5
microboone    20    3   50    1    10    600    5 1000000
mu2e           5    3   50    1    10    600    5
mu2epro        5    3   50    1    10    600    5 1000000
nova          10    3   50    1    10    600    5
t-962         20    3   50    1    10    600    5 1000000
Move to private test LOCKs¶
export CPN_LOCK_BASE=/nusoft/app/home
export CPN_LOCK_GROUP=ifmon
LOCK=${CPN_LOCK_BASE}/${CPN_LOCK_GROUP}/LOCK
LIMIT=${LOCK}/limit
WAIT=${LOCK}/wait
Simple lock and release¶
lock ; lock free
LOCK - Tue Nov 6 18:44:18 UTC 2012 lock /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:44:18.1.gpsn01.16146.ifmon.ifmon
LOCK - Tue Nov 6 12:44:21 CST 2012 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:44:18.1.gpsn01.16146.ifmon.ifmon
ls ${LOCK}/LOG | tail -1
20121106.18:44:21.1.gpsn01.16146.ifmon.ifmon.1.0
lock clean
Single lock delayed by ${LOCK}/DO/LOCKING ( need two windows )¶
- OBSOLETE, NOT USING LOCKING FILE 2014+ ***
FORCE QUEUING
WINDOW 1
touch ${LOCK}/DO/LOCKING

TRY TO LOCK/FREE
WINDOW 2
lock ; lock free
LOCKING already in progress, usleep 300000
LOCKING already in progress, usleep 9400000
LOCKING already in progress, usleep 6100000
LOCKING already in progress, usleep 800000
LOCKING already in progress, usleep 000000
LOCKING already in progress, usleep 800000
LOCKING already in progress, usleep 1600000
LOCKING already in progress, usleep 9000000

RELEASE THE LOCKING FILE
WINDOW 1
rm ${LOCK}/DO/LOCKING

SEE THE LOCK BE TAKEN
WINDOW 2
LOCK - Mon Mar 18 23:07:50 UTC 2013 lock  /nusoft/app/home/ifmon/LOCK/LOCKS/20130318.23:07:50.28.minos27.7828.ifmon.ifmon
LOCK - Mon Mar 18 23:07:50 UTC 2013 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20130318.23:07:50.28.minos27.7828.ifmon.ifmon
Single queued lock ( need two windows )¶
FORCE QUEUING
WINDOW 1
cat ${LIMIT} ; echo 0 > ${LIMIT} ; cat ${LIMIT}
5
0
cat ${WAIT} ; echo 200 > ${WAIT} ; cat ${WAIT}
5
200

TRY TO LOCK/FREE
WINDOW 2
lock ; lock free
LOCK - Tue Nov 6 12:57:53 CST 2012 LOCKS/LIMIT/QUEUE 0/0/1 sleeping 5
LOCK - Tue Nov 6 12:57:53 CST 2012 queue 20121106.18:57:53.if01.16146.ifmon.ifmon

VERIFY THE QUEUE
WINDOW 1
ls -l ${LOCK}/QUEUE   # verify that the queue file is touched once a minute
sleep 90
ls -l ${LOCK}/QUEUE

ALLOW THE LOCK AND WAKE UP WITH UDP PACKET
WINDOW 1
cat ${WAIT} ; echo 5 > ${WAIT} ; cat ${WAIT}
200
5
cat ${LIMIT} ; echo 5 > ${LIMIT} ; cat ${LIMIT}
0
5
Send a UDP packet to wake up the sleeping process at port PID+2000
echo "wakeup" > /dev/udp/if01.fnal.gov/18146

WINDOW 2
LOCK - Tue Nov 6 18:58:18 UTC 2012 lock  /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:58:18.25.if01.16146.ifmon.ifmon
LOCK - Tue Nov 6 12:59:00 CST 2012 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:58:18.25.if01.16146.ifmon.ifmon
Single queued lock, interrupted ( need two windows )¶
FORCE QUEUING
WINDOW 1
cat ${LIMIT} ; echo 0 > ${LIMIT} ; cat ${LIMIT}
5
0

START THE LOCK
WINDOW 2
lock
LOCK - Tue Nov 6 12:57:53 CST 2012 LOCKS/LIMIT/QUEUE 0/0/1 sleeping 5
LOCK - Tue Nov 6 12:57:53 CST 2012 queue 20121106.18:57:53.gpsn01.16146.ifmon.ifmon

VERIFY THE QUEUE
WINDOW 1
ls -l ${LOCK}/QUEUE   # verify that the queue file is touched once a minute

INTERRUPT THE LOCK
WINDOW 2
^C
see the sleep 60 lock tickler subprocess disappear after a minute
ps xf ; sleep 70 ; ps xf
...
 5502 pts/2  S  0:00 perl -MIO::Socket -e $s=IO::Socket::INET->new(LocalPort=>4015,Proto=>'udp'); $dg='x'; while($dg
 5499 pts/2  S  0:00 /bin/sh /grid/fermiapp/products/common/prd/cpn/v1_4/NULL/bin/lock
 5500 pts/2  S  0:00  \_ sleep 60
...
 5502 pts/2  S  0:00 perl -MIO::Socket -e $s=IO::Socket::INET->new(LocalPort=>4015,Proto=>'udp'); $dg='x'; while($dg

VERIFY THE QUEUE IS NOT KEPT ALIVE, THEN CLEAN UP
WINDOW 1
ls -l ${LOCK}/QUEUE   # verify that the queue file is no longer touched once a minute
lock clean            # after 2 minutes

RESTORE NORMAL LOCKS
WINDOW 1
cat ${LIMIT} ; echo 5 > ${LIMIT} ; cat ${LIMIT}
0
5
Send a UDP packet to wake up the sleeping process at port PID+2000
echo "wakeup" > /dev/udp/if01.fnal.gov/18146
Make 20 files for subsequent testing¶
N20='00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19'
for N in ${N20} ; do
dd if=/dev/urandom of=/var/tmp/CPN${N} bs=1M count=11
done
ls -l /var/tmp/CPN*
Copy 1 file¶
cpn /var/tmp/CPN01 /dev/null
LOCK - Tue Nov 6 19:14:55 UTC 2012 lock  /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.19:14:55.0.gpsn01.10885.ifmon.ifmon
LOCK - Tue Nov 6 13:14:55 CST 2012 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.19:14:55.0.gpsn01.10885.ifmon.ifmon
Copy a file, interrupted¶
Verify that the lock keepalive process goes away, and the lock expires.
cpn /grid/data/ifmon/CPNTEST/FILES/20G /dev/null
LOCK - Fri Mar 1 21:05:44 UTC 2013 lock /nusoft/app/home/ifmon/LOCK/LOCKS/20130301.21:05:44.0.if01.10411.ifmon.ifmon
^C
ls -l ${LOCK}/LOCKS # verify that the lock file is no longer touched once a minute
lock status ; sleep 130 ; lock status # should go stale after 2 minutes
lock clean # clean this up
Copy 20 files, moderately¶
for N in ${N20} ; do
  sleep 1
  { cpn /var/tmp/CPN${N} /var/tmp/CPX${N}; } > /dev/null 2>&1 &
done
ls -1 ${LOCK}/LOG | tail -20
lock clean
Copy 20 files, holding then releasing the locks to produce a high demand.¶
echo 0 > ${LIMIT}
for N in ${N20} ; do
  usleep 100000
  { cpn /var/tmp/CPN${N} /var/tmp/CPX${N}; } > /dev/null 2>&1 &
done
lock status
echo 5 > ${LIMIT} ; watch -n 1 lock status
...
ls -1 ${LOCK}/LOG | tail -20
lock clean
Copy flood of 20 large files from Bluearc¶
echo 0 > ${LIMIT}
for N in ${N20} ; do
  usleep 100000
  { cpn /grid/data/ifmon/CPNTEST/FILES/file${N} /dev/null; } > /dev/null 2>&1 &
done
lock status
echo 5 > ${LIMIT} ; watch -n 1 lock status
...
ls -1 ${LOCK}/LOG | tail -20
lock clean
Run 20 sleep commands via cpn, 5 minutes per file, with 2 minute stale limit¶
See that the peak lock count does not go over the limit
( next to last number in the LOG file names ).
See that the LOCK and QUEUE files have timestamps within 1 minute of the current time.
export PATH=${PATH}:/nusoft/app/home/ifmon/CPNTEST
echo 0 > ${LIMIT}
cat ${LOCK}/stale    # should be 2
for N in ${N20} ; do
  sleep 2
  { slown 300; } > /dev/null 2>&1 &
done
echo 5 > ${LIMIT}
while true ; do lock status ; ls -l ${LOCK}/LOCKS ; ls -l ${LOCK}/QUEUE ; sleep 60 ; done
ls -1 ${LOCK}/LOG | tail -20
20121106.21:23:52.15.300.gpsn01.5316.ifmon.ifmon.5.15
20121106.21:23:52.15.300.gpsn01.5371.ifmon.ifmon.4.15
20121106.21:23:53.15.300.gpsn01.5542.ifmon.ifmon.2.14
20121106.21:23:53.16.300.gpsn01.5419.ifmon.ifmon.3.15
20121106.21:23:53.16.300.gpsn01.5477.ifmon.ifmon.4.14
20121106.21:28:53.316.300.gpsn01.4961.ifmon.ifmon.5.10
20121106.21:28:53.316.300.gpsn01.4990.ifmon.ifmon.4.10
20121106.21:28:57.320.300.gpsn01.5076.ifmon.ifmon.5.8
20121106.21:28:57.320.300.gpsn01.5130.ifmon.ifmon.5.7
20121106.21:28:57.320.300.gpsn01.5256.ifmon.ifmon.5.6
20121106.21:33:53.616.300.gpsn01.5203.ifmon.ifmon.5.5
20121106.21:33:54.616.300.gpsn01.5820.ifmon.ifmon.5.4
20121106.21:33:57.619.300.gpsn01.5604.ifmon.ifmon.5.3
20121106.21:33:57.619.300.gpsn01.5659.ifmon.ifmon.5.2
20121106.21:33:57.619.300.gpsn01.5715.ifmon.ifmon.5.1
20121106.21:38:54.915.301.gpsn01.5769.ifmon.ifmon.5.0
20121106.21:38:54.916.300.gpsn01.5877.ifmon.ifmon.4.0
20121106.21:38:57.919.300.gpsn01.5930.ifmon.ifmon.3.0
20121106.21:38:57.919.300.gpsn01.5985.ifmon.ifmon.3.0
20121106.21:38:58.920.300.gpsn01.6039.ifmon.ifmon.1.0
lock clean
Run 20 jobs holding locks for 5 minutes each on Fermigrid¶
The slurk grid job uses the new lurk script to wait for a UDP packet,
then holds a lock for 5 minutes.
flurk is run by hand to send UDP packets to the lurkers, so they all start at once.
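The wake-up mechanism itself is the same UDP trick used in the queued-lock tests above and can be exercised by hand; the port number below is only an example, not the PID+2000 port a real lock process would listen on :

# listener side, like lurk or the lock script's own perl listener : block until a datagram arrives
perl -MIO::Socket -e '$s=IO::Socket::INET->new(LocalPort=>4015,Proto=>"udp"); $s->recv($dg,64);'
# sender side, what flurk does for each lurker
echo "wakeup" > /dev/udp/if01.fnal.gov/4015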
Make links to the current version in /nusoft/app/home/ifmon/CPNTEST
As ifmon@if01 :
/nusoft/app/home/ifmon/CPNTEST/testlinks <release>
Making links to /nusoft/app/home/nusoft/CPNTEST from cpn v1.4
WAS
lrwxrwxrwx 1 ifmon ifmon 57 Mar 1 14:44 lock -> /grid/fermiapp/products/common/prd/cpn/v1_2/NULL/bin/lock
lrwxrwxrwx 1 ifmon ifmon 56 Mar 1 14:44 slown -> /grid/fermiapp/products/common/prd/cpn/v1_2/NULL/bin/cpn
NEW
lrwxrwxrwx 1 ifmon ifmon 57 Mar 1 14:44 lock -> /grid/fermiapp/products/common/prd/cpn/v1_4/NULL/bin/lock
lrwxrwxrwx 1 ifmon ifmon 56 Mar 1 14:44 slown -> /grid/fermiapp/products/common/prd/cpn/v1_4/NULL/bin/cpn
As kreymer@minos50 :
unset SETUPS_DIR UPS_DIR SETUP_UPS
. /grid/fermiapp/minos/scripts/jobha.sh
setup cpn
export CPN_LOCK_BASE=/nusoft/app/home
export CPN_LOCK_GROUP=ifmon
lock status
lock statusall
jobsub -N 10 -g file:///nusoft/app/home/ifmon/CPNTEST/slurk ; date -u
Watch for them to run on Fermigrid, and register in LURK
watch -n 60 'jobsub_q.py --group minos | grep kreymer'   # see them run
ls -l /minos/data/users/kreymer/LURK                      # see them LURKing
Free the lurking processes and see them run
/nusoft/app/home/ifmon/CPNTEST/flurk
LOCK=/nusoft/app/home/ifmon/LOCK
watch -n 10 "lock status ; ls -l ${LOCK}/LOCKS ; ls -l ${LOCK}/QUEUE "
Check the log and clean up.
The peak lock count ( next to last number in log lines ) should be 5
As ifmon@if01 :
ls -1 ${LOCK}/LOG
lock clean
YM=`date +%Y%m`
tail -15 ${LOCK}/LOGS/${YM}.log