Project

General

Profile

PLANS

The 'perf' file sets a lower limit on performance,
as determined by the content of the PERF file.
As of 2012 the PERF file contains a fixed number.
It was intended to be filled by an external process
measuring overall file system performance,
but we have never implemented an automatic process to set PERF.
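If such an automatic process were ever implemented, its core might look like the sketch below: convert a timed read of a test file into an MB/s figure and write that into PERF. The perf_rate helper, the test file path, and the PERF location are all assumptions for illustration, not part of the current product.

```shell
# Hypothetical PERF updater (never deployed): convert a timed read of a
# test file into an integer MB/s figure for the PERF control file.
perf_rate() {   # perf_rate <bytes> <seconds>  ->  MB/s
    BYTES=$1 ; SECS=$2
    [ "${SECS}" -eq 0 ] && SECS=1          # guard against divide-by-zero
    echo $(( BYTES / 1024 / 1024 / SECS ))
}

# Example wiring (paths are assumptions):
#   TESTFILE=/grid/data/ifmon/CPNTEST/FILES/file00
#   START=$(date +%s)
#   dd if=${TESTFILE} of=/dev/null bs=1M 2>/dev/null
#   SECS=$(( $(date +%s) - START ))
#   perf_rate $(stat -c %s ${TESTFILE}) ${SECS} > ${LOCK}/PERF
```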

The 'glimit' file sets a global lock limit for all groups under /grid/data/.
Because we did not originally share file systems,
the lock script did not enforce this limit.
As of 2012 we have several separate shared file systems,
so if we activate glimit, we will need to build in sharing information.

We may want a separate global limit for Bluearc heads.
Again, knowledge of which heads serve which groups would be needed.
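A global limit would amount to comparing the glimit value against the total lock count across all groups. A minimal sketch of the counting half, assuming the standard /grid/data/${GROUP}/LOCK/LOCKS layout (the total_locks helper is illustrative, not part of the lock script):

```shell
# Count active lock files across every group under a shared base;
# this is the quantity a glimit enforcement would compare to the limit.
total_locks() {   # total_locks <base-dir>
    find "$1"/*/LOCK/LOCKS -maxdepth 1 -type f 2>/dev/null | wc -l
}

# total_locks /grid/data    # compare the result to the glimit value
```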

DEPLOYMENT

  • Create version in UPS
        products@if01
    
    . ~ifmon/shrc/kreymer
    . /grid/fermiapp/products/common/etc/setups.sh
    
    cd /grid/fermiapp/products/common/prd/cpn
    
    OVER=v1.6
    NVER=v1.7
    
    OVERU=`echo ${OVER} | tr . _`
    NVERU=`echo ${NVER} | tr . _`
    
    cp -vax ${OVER} ${NVER}
    
    ups declare cpn ${NVER} -f NULL \
      -r /grid/fermiapp/products/common/prd/cpn/${NVERU}/NULL \
      -m cpn.table
    
    setup cpn ${NVER}
    cd $CPN_DIR/bin
    
    ADM=/afs/fnal.gov/files/expwww/numi/html/computing/admin/bluearc
    
    diff ${ADM}/lock lock
    cp   ${ADM}/lock lock
    
       ETC
    
  • Test the new version as noted below.
  • Upload to UPD
        kreymer@minos27
    
    unset SETUPS_DIR UPS_DIR SETUP_UPS
    . /grid/fermiapp/products/common/etc/setups.sh
    setup upd
    
    UPDADD=/afs/fnal.gov/files/home/room1/kreymer/minos/scripts/updadd
    
    NVER=v1.7
    
    ${UPDADD} NULL cpn ${NVER}
    
    upd list -aK+ cpn
    
  • Get approval via Change Management, e.g. CHG000000005680
  • Notify CS Liaisons at
  • Deploy with
    As ifmon@if01 :
    . /grid/fermiapp/products/common/etc/setups.sh
    export CPN_LOCK_GROUP=gpcf
    setup cpn
    
    DAY=`date +%Y%m%d`
    LLOG=/grid/data/ifmon/CPNTEST/log
    OLDCPN="v1r6" 
    NEWCPN="v1r7" 
    
    lock statusall                | tee ${LLOG}/${DAY}-STATS-${OLDCPN}.log
    ls -l /grid/data/*/LOCK/LOCKS | tee ${LLOG}/${DAY}-LOCKS-${OLDCPN}.log
    ls -l /grid/data/*/LOCK/QUEUE | tee ${LLOG}/${DAY}-QUEUE-${OLDCPN}.log
    

    As products@if01
    . /grid/fermiapp/products/common/etc/setups.sh
    
    OLDCPN="v1.6" 
    NEWCPN="v1.7" 
    ups list -aK+ cpn
    ups declare -c cpn ${NEWCPN}
    date
    ups list -aK+ cpn
    

    In the same ifmon@if01 session
    lock statusall                | tee ${LLOG}/${DAY}-STATS-${NEWCPN}.log
    ls -l /grid/data/*/LOCK/LOCKS | tee ${LLOG}/${DAY}-LOCKS-${NEWCPN}.log
    ls -l /grid/data/*/LOCK/QUEUE | tee ${LLOG}/${DAY}-QUEUE-${NEWCPN}.log
    
  • Monitor with lock statusall
  • Fallback is
    ups declare -c cpn ${OLDCPN} 

TESTING

ALTERNATE LOCKS

By default, the locks are managed from an area

/grid/data/${GROUP}/LOCK

where ${GROUP} is taken from the output of the id command.

You can override the lock area like

export CPN_LOCK_BASE=/grid/data/ifmon/test

You can override the group like

export CPN_LOCK_GROUP=ifmon
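Putting the two overrides together, the lock area cpn will use can be previewed before running anything. This sketch assumes the primary group from id -gn is what the script uses when CPN_LOCK_GROUP is unset:

```shell
# Preview the lock area: CPN_LOCK_BASE defaults to /grid/data, and
# CPN_LOCK_GROUP defaults to the group from the id command.
BASE=${CPN_LOCK_BASE:-/grid/data}
GROUP=${CPN_LOCK_GROUP:-$(id -gn)}
echo "lock area: ${BASE}/${GROUP}/LOCK"
```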

VALIDATION

CHECKLIST

  • lock
  • LOCKING - obsolete
  • single queued
  • single queued interrupted
  • 20 files
  • copy 1
  • copy interrupted
  • copy 20
  • copy 20 holding
  • copy 20 large
  • 20 sleep 5 min

Use the ifmon account for local tests, with a private LOCK area

setup the desired version of cpn

unset SETUP_UPS SETUPS_DIR UPS_DIR
. /grid/fermiapp/products/common/etc/setups.sh
setup cpn

Status

Test the statusv and statusallv commands in the public context.
This also tests the status and statusv commands.

export CPN_LOCK_BASE=/grid/data
export CPN_LOCK_GROUP=nova


lock statusv

 nova LOCK STATUS Thu Dec  6 20:05:14 UTC 2012

    LOCKS  0 of 10 ( 0 stale )

    QUEUE  0 ( 0 stale) 

   Control files
   limit   perf   PERF   rate  stale staleq   wait    small
      10      3     50      1     10    600      5         

lock statusallv

    LOCK summary Thu Dec  6 20:06:15 UTC 2012

        argoneut LOCKS  0/20 (    0) QUEUE      0 (    0)        0
              d0 LOCKS  0/20 (    0) QUEUE      0 (    0)        3
            e875 LOCKS  0/ 5 (    0) QUEUE      0 (    0)        3
            e938 LOCKS  4/20 (    0) QUEUE      6 (    0)       10
        fnalgrid LOCKS  0/20 (    0) QUEUE      0 (    0)      341
             gm2 LOCKS  0/ 5 (    0) QUEUE      0 (    0)        0
            gpcf LOCKS  0/20 (  188) QUEUE      0 (  268)   180772
            lbne LOCKS  0/20 (    0) QUEUE      0 (    0)        0
        marslbne LOCKS  0/20 (    0) QUEUE      0 (    0)    11418
        marsmu2e LOCKS  0/ 5 (    0) QUEUE      0 (    0)        0
      microboone LOCKS  0/20 (    0) QUEUE      0 (    0)      226
            mu2e LOCKS  0/ 5 (    0) QUEUE      0 (    0)        0
         mu2epro LOCKS  0/ 5 (    1) QUEUE      0 (    0)     4588
            nova LOCKS  0/10 (    0) QUEUE      0 (    0)        0
           t-962 LOCKS  0/20 (    0) QUEUE      0 (    0)        0

   Control files
          GROUP  limit   perf   PERF   rate  stale staleq   wait    small
       argoneut     20      3     50      1     10    600      5  1000000
             d0     20      3     50      1     10    600      5         
           e875      5      3     50      1     10    600      5  1000000
           e938     20      3     50      1     10    600      5         
       fnalgrid     20      3     50     10     10    600      5  1000000
            gm2      5      5     50      1      3      3      5  1000000
           gpcf     20      3     50      1     10    600      5  1000000
           lbne     20      3     50      1     10    600      5  1000000
       marslbne     20      3     50      1     10    600      5  1000000
       marsmu2e      5      3     50      1     10    600      5         
     microboone     20      3     50      1     10    600      5  1000000
           mu2e      5      3     50      1     10    600      5         
        mu2epro      5      3     50      1     10    600      5  1000000
           nova     10      3     50      1     10    600      5         
          t-962     20      3     50      1     10    600      5  1000000

Move to private test LOCKs

export CPN_LOCK_BASE=/nusoft/app/home
export CPN_LOCK_GROUP=ifmon
LOCK=${CPN_LOCK_BASE}/${CPN_LOCK_GROUP}/LOCK
LIMIT=${LOCK}/limit
WAIT=${LOCK}/wait

Simple lock and release

lock ; lock free
LOCK - Tue Nov 6 18:44:18 UTC 2012 lock /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:44:18.1.gpsn01.16146.ifmon.ifmon
LOCK - Tue Nov 6 12:44:21 CST 2012 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:44:18.1.gpsn01.16146.ifmon.ifmon
ls ${LOCK}/LOG | tail -1
20121106.18:44:21.1.gpsn01.16146.ifmon.ifmon.1.0
lock clean

Single lock delayed by ${LOCK}/DO/LOCKING ( need two windows )

  • OBSOLETE, NOT USING LOCKING FILE 2014+ ***

FORCE QUEUING

    WINDOW 1

touch ${LOCK}/DO/LOCKING

TRY TO LOCK/FREE

    WINDOW 2

lock ; lock free
LOCKING already in progress, usleep 300000 
LOCKING already in progress, usleep 9400000 
LOCKING already in progress, usleep 6100000 
LOCKING already in progress, usleep 800000 
LOCKING already in progress, usleep 000000 
LOCKING already in progress, usleep 800000 
LOCKING already in progress, usleep 1600000 
LOCKING already in progress, usleep 9000000 

RELEASE THE LOCKING FILE

    WINDOW 1

rm ${LOCK}/DO/LOCKING

SEE THE LOCK BE TAKEN

    WINDOW 2

LOCK - Mon Mar 18 23:07:50 UTC 2013 lock  /nusoft/app/home/ifmon/LOCK/LOCKS/20130318.23:07:50.28.minos27.7828.ifmon.ifmon
LOCK - Mon Mar 18 23:07:50 UTC 2013 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20130318.23:07:50.28.minos27.7828.ifmon.ifmon

Single queued lock ( need two windows )


FORCE QUEUING

    WINDOW 1
cat ${LIMIT} ; echo   0 > ${LIMIT} ; cat ${LIMIT}
5
0
cat ${WAIT}  ; echo 200 > ${WAIT}  ; cat ${WAIT}
5
200

TRY TO LOCK/FREE

    WINDOW 2
lock ; lock free
LOCK - Tue Nov  6 12:57:53 CST 2012 LOCKS/LIMIT/QUEUE 0/0/1 sleeping 5
LOCK - Tue Nov  6 12:57:53 CST 2012 queue 20121106.18:57:53.if01.16146.ifmon.ifmon

VERIFY THE QUEUE

    WINDOW 1
ls -l ${LOCK}/QUEUE # verify that the queue file is touched once a minute
sleep 90
ls -l ${LOCK}/QUEUE

ALLOW THE LOCK AND WAKE UP WITH UDP PACKET

    WINDOW 1
cat ${WAIT}  ; echo 5 > ${WAIT}  ; cat ${WAIT}
200
5
cat ${LIMIT} ; echo 5 > ${LIMIT} ; cat ${LIMIT}
0
5

Send a UDP packet to wake up the sleeping process at port PID+2000

echo "wakeup" > /dev/udp/if01.fnal.gov/18146
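The port in the command above is the queued process's PID plus 2000 (the PID 16146 appears in the QUEUE file name, hence 18146). For a different PID the port can be computed in the shell:

```shell
# The wakeup port is the queued lock process's PID plus 2000.
PID=16146                 # taken from the QUEUE file name
PORT=$(( PID + 2000 ))
# echo "wakeup" > /dev/udp/if01.fnal.gov/${PORT}
echo ${PORT}
```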

    WINDOW 2
LOCK - Tue Nov  6 18:58:18 UTC 2012 lock  /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:58:18.25.if01.16146.ifmon.ifmon
LOCK - Tue Nov  6 12:59:00 CST 2012 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.18:58:18.25.if01.16146.ifmon.ifmon

Single queued lock, interrupted ( need two windows )


FORCE QUEUING

    WINDOW 1
cat ${LIMIT} ; echo   0 > ${LIMIT} ; cat ${LIMIT}
5
0

START THE LOCK

    WINDOW 2
lock
LOCK - Tue Nov  6 12:57:53 CST 2012 LOCKS/LIMIT/QUEUE 0/0/1 sleeping 5
LOCK - Tue Nov  6 12:57:53 CST 2012 queue 20121106.18:57:53.gpsn01.16146.ifmon.ifmon

VERIFY THE QUEUE

    WINDOW 1
ls -l ${LOCK}/QUEUE # verify that the queue file is touched once a minute

INTERRUPT THE LOCK

    WINDOW 2

^C

   see the sleep 60 lock tickler subprocess disappear after a minute

ps xf ; sleep 70 ; ps xf
...
 5502 pts/2    S      0:00 perl -MIO::Socket -e $s=IO::Socket::INET->new(LocalPort=>4015,Proto=>'udp'); $dg='x'; while($dg 
 5499 pts/2    S      0:00 /bin/sh /grid/fermiapp/products/common/prd/cpn/v1_4/NULL/bin/lock
 5500 pts/2    S      0:00  \_ sleep 60

...
 5502 pts/2    S      0:00 perl -MIO::Socket -e $s=IO::Socket::INET->new(LocalPort=>4015,Proto=>'udp'); $dg='x'; while($dg 

VERIFY THE QUEUE IS NOT KEPT ALIVE, THEN CLEAN UP

    WINDOW 1
ls -l ${LOCK}/QUEUE # verify that the queue file is no longer touched once a minute

lock clean  # after 2 minutes

RESTORE NORMAL LOCKS

    WINDOW 1

cat ${LIMIT} ; echo 5 > ${LIMIT} ; cat ${LIMIT}
0
5

Send a UDP packet to wake up the sleeping process at port PID+2000

echo "wakeup" > /dev/udp/if01.fnal.gov/18146

Make 20 files for subsequent testing

N20='00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19'
for N in ${N20} ; do
dd if=/dev/urandom of=/var/tmp/CPN${N} bs=1M count=11
done
ls -l /var/tmp/CPN*

Copy 1 file

cpn /var/tmp/CPN01 /dev/null
LOCK - Tue Nov 6 19:14:55 UTC 2012 lock /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.19:14:55.0.gpsn01.10885.ifmon.ifmon
LOCK - Tue Nov  6 13:14:55 CST 2012 freed /nusoft/app/home/ifmon/LOCK/LOCKS/20121106.19:14:55.0.gpsn01.10885.ifmon.ifmon

Copy a file, interrupted

Verify that the lock keepalive process goes away, and the lock expires.

cpn /grid/data/ifmon/CPNTEST/FILES/20G /dev/null
LOCK - Fri Mar 1 21:05:44 UTC 2013 lock /nusoft/app/home/ifmon/LOCK/LOCKS/20130301.21:05:44.0.if01.10411.ifmon.ifmon
^C

ls -l ${LOCK}/LOCKS # verify that the lock file is no longer touched once a minute

lock status ; sleep 130 ; lock status # should go stale after 2 minutes

lock clean # clean this up

Copy 20 files, moderately

for N in ${N20} ; do sleep 1
{ cpn /var/tmp/CPN${N} /var/tmp/CPX${N}; } > /dev/null 2>&1 &
done
ls -1 ${LOCK}/LOG | tail -20
lock clean

Copy 20 files, holding then releasing the locks to produce a high demand.

echo 0 > ${LIMIT}

for N in ${N20} ; do usleep 100000
{ cpn /var/tmp/CPN${N} /var/tmp/CPX${N}; } > /dev/null 2>&1 &
done

lock status

echo 5 > ${LIMIT} ; watch -n 1 lock status
... 

ls -1 ${LOCK}/LOG | tail -20
lock clean

Copy flood of 20 large files from Bluearc

echo 0 > ${LIMIT}

for N in ${N20} ; do usleep 100000
{ cpn /grid/data/ifmon/CPNTEST/FILES/file${N} /dev/null; } > /dev/null 2>&1 &
done

lock status

echo 5 > ${LIMIT} ; watch -n 1 lock status
... 

ls -1 ${LOCK}/LOG | tail -20

lock clean

Run 20 sleep commands via cpn, 5 minutes per file, with 2 minute stale limit

See that the peak lock count does not go over the limit
( next to last number in the LOG file names ).
See that the LOCK and QUEUE files have timestamps within 1 minute of the current time.
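The peak lock count can be pulled out of the LOG file names mechanically. A small helper, assuming the dot-separated name format shown in the listings in this section, where the peak count is the next-to-last field:

```shell
# peak_locks reads LOG file names on stdin and reports the largest
# next-to-last dot-separated field, i.e. the peak lock count seen.
peak_locks() {
    awk -F. '{ print $(NF-1) }' | sort -n | tail -1
}

# Usage against the live LOG area:
#   ls -1 ${LOCK}/LOG | tail -20 | peak_locks   # should not exceed limit
```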

export PATH=${PATH}:/nusoft/app/home/ifmon/CPNTEST
echo 0 > ${LIMIT}

cat ${LOCK}/stale # should be 2

for N in ${N20} ; do sleep 2
{ slown 300; } > /dev/null 2>&1 &
done

echo 5 > ${LIMIT}

while true ; do lock status ; ls -l ${LOCK}/LOCKS ; ls -l ${LOCK}/QUEUE ; sleep 60 ; done

ls -1 ${LOCK}/LOG | tail -20

20121106.21:23:52.15.300.gpsn01.5316.ifmon.ifmon.5.15
20121106.21:23:52.15.300.gpsn01.5371.ifmon.ifmon.4.15
20121106.21:23:53.15.300.gpsn01.5542.ifmon.ifmon.2.14
20121106.21:23:53.16.300.gpsn01.5419.ifmon.ifmon.3.15
20121106.21:23:53.16.300.gpsn01.5477.ifmon.ifmon.4.14
20121106.21:28:53.316.300.gpsn01.4961.ifmon.ifmon.5.10
20121106.21:28:53.316.300.gpsn01.4990.ifmon.ifmon.4.10
20121106.21:28:57.320.300.gpsn01.5076.ifmon.ifmon.5.8
20121106.21:28:57.320.300.gpsn01.5130.ifmon.ifmon.5.7
20121106.21:28:57.320.300.gpsn01.5256.ifmon.ifmon.5.6
20121106.21:33:53.616.300.gpsn01.5203.ifmon.ifmon.5.5
20121106.21:33:54.616.300.gpsn01.5820.ifmon.ifmon.5.4
20121106.21:33:57.619.300.gpsn01.5604.ifmon.ifmon.5.3
20121106.21:33:57.619.300.gpsn01.5659.ifmon.ifmon.5.2
20121106.21:33:57.619.300.gpsn01.5715.ifmon.ifmon.5.1
20121106.21:38:54.915.301.gpsn01.5769.ifmon.ifmon.5.0
20121106.21:38:54.916.300.gpsn01.5877.ifmon.ifmon.4.0
20121106.21:38:57.919.300.gpsn01.5930.ifmon.ifmon.3.0
20121106.21:38:57.919.300.gpsn01.5985.ifmon.ifmon.3.0
20121106.21:38:58.920.300.gpsn01.6039.ifmon.ifmon.1.0

lock clean

Run 20 jobs holding locks for 5 minutes each on Fermigrid

The slurk grid job uses the new lurk script to wait for a UDP packet,
then holds a lock for 5 minutes.

flurk is run by hand to send UDP packets to the lurkers, so they all start at once.

Make links to the current version in /nusoft/app/home/ifmon/CPNTEST

    ifmon@if01

/nusoft/app/home/ifmon/CPNTEST/testlinks <release>

Making links to /nusoft/app/home/nusoft/CPNTEST from cpn v1.4
WAS
lrwxrwxrwx 1 ifmon ifmon 57 Mar  1 14:44 lock -> /grid/fermiapp/products/common/prd/cpn/v1_2/NULL/bin/lock
lrwxrwxrwx 1 ifmon ifmon 56 Mar  1 14:44 slown -> /grid/fermiapp/products/common/prd/cpn/v1_2/NULL/bin/cpn
NEW
lrwxrwxrwx 1 ifmon ifmon 57 Mar  1 14:44 lock -> /grid/fermiapp/products/common/prd/cpn/v1_4/NULL/bin/lock
lrwxrwxrwx 1 ifmon ifmon 56 Mar  1 14:44 slown -> /grid/fermiapp/products/common/prd/cpn/v1_4/NULL/bin/cpn


    kreymer@minos50

unset SETUPS_DIR UPS_DIR SETUP_UPS
. /grid/fermiapp/minos/scripts/jobha.sh
setup cpn 

export CPN_LOCK_BASE=/nusoft/app/home
export CPN_LOCK_GROUP=ifmon

lock status
lock statusall

jobsub -N 10 -g file:///nusoft/app/home/ifmon/CPNTEST/slurk ; date -u

Watch for them to run on Fermigrid, and register in LURK

watch -n 60 'jobsub_q.py --group minos  | grep kreymer' # see them run
ls -l /minos/data/users/kreymer/LURK  # see them LURKing

Free the lurking processes and see them run

/nusoft/app/home/ifmon/CPNTEST/flurk

LOCK=/nusoft/app/home/ifmon/LOCK

watch -n 10 "lock status ; ls -l ${LOCK}/LOCKS ; ls -l ${LOCK}/QUEUE " 

Check the log and clean up.
The peak lock count ( next to last number in log lines ) should be 5


    ifmon@if01

ls -1 ${LOCK}/LOG
lock clean

YM=`date +%Y%m`
tail -15 ${LOCK}/LOGS/${YM}.log