ADMINISTRATION » History » Version 21

Michael Diesburg, 09/15/2016 12:34 PM

h1. ADMINISTRATION

h2. FILES

Working directories and files are under /grid/data/${GROUP}/LOCK

On the grid, the username does not reflect the identity of the
person who submitted the job.
So the lock script gets the identity from the grid proxy.

/LOCKS - active lock files
  The lock files are empty, with names containing
  date, time queued,               host, pid, user, identity

/QUEUE - locks pending, empty files with names containing
  date,                            host, pid, user, identity

/LOG   - empty files with names reflecting completed locks
  date, time queued, time locked,  host, pid, user, identity

/LOGS  - monthly text summaries built from LOG file names.

/STALE - record of locks that have timed out
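
The naming scheme can be illustrated by splitting a lock file name back into its fields. This is only a sketch: the function name parse_lock_name is hypothetical, and the field order (date, time of day, time queued, host, pid, user, identity, separated by dots) is inferred from the sample lock name shown in the INITIALIZATION test, not from the lock script itself. It assumes the host name contains no dots.

```shell
# Hypothetical helper, not part of cpn: split a LOCKS file name into fields.
# Assumed layout: YYYYMMDD.HH:MM:SS.queued.host.pid.user.identity
parse_lock_name() {
    lock_name=$(basename "$1")
    # IFS is changed only for this read, so the caller's IFS is untouched.
    IFS=. read -r day clock queued host pid user identity <<EOF
${lock_name}
EOF
    printf 'date=%s time=%s queued=%s host=%s pid=%s user=%s identity=%s\n' \
        "${day}" "${clock}" "${queued}" "${host}" "${pid}" "${user}" "${identity}"
}
```

For example, parse_lock_name /grid/data/t-1034/LOCK/LOCKS/20140122.16:37:07.0.minos27.7436.kreymer.kreymer reports host minos27, pid 7436, and user and identity kreymer.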

glimit - global activity limit, including all user groups
         set this near the actual Bluearc capacity
         this is not implemented as of 2012-11-06

limit  - local activity limit, for the users' own group
         set this well under Bluearc capacity

perf   - performance MB/sec required in PERF before locking

PERF   - actual MB/sec performance, measured by external agent
         ( No agents implemented as of 2010-08-02 )

rate   - net retry rate target, in retries per second

small  - MBytes: files smaller than this are not locked by cpn.

wait   - minimum time to wait before retrying, regardless of the load.
         Since wait is a lower bound, the time delay before retrying a lock is the larger of
* wait
* (number of queued locks)/rate
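
The retry delay can be sketched as a small calculation. This is an illustration, not the lock script's own code: the function name retry_delay is hypothetical, and it assumes that wait acts as a floor (as "minimum time to wait" suggests), that wait is in seconds, and that rate is greater than zero.

```shell
# Hypothetical helper: retry delay in seconds.
#   $1 = wait   (minimum delay, seconds)
#   $2 = number of queued locks
#   $3 = rate   (target retries per second, assumed > 0)
retry_delay() {
    awk -v w="$1" -v q="$2" -v r="$3" 'BEGIN {
        d = q / r            # delay that keeps the net retry rate near the target
        if (d < w) d = w     # never retry sooner than wait
        print d
    }'
}
```

With wait=5 and rate=10, a queue of 100 locks gives a 10 second delay, while a queue of 10 locks falls back to the 5 second floor.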

h2. MAINTENANCE

Lock files should be owned by some appropriate group account, like mindata.

That account should occasionally remove expired locks and queue entries,
and concatenate LOG entries into monthly summary files.
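
The monthly concatenation can be sketched as follows. This is not the real lockclean script, only an illustration of the idea: the function name summarize_logs is hypothetical, and it assumes that LOG file names start with YYYYMMDD, that the summaries are named LOGS/YYYYMM.log, and that a LOG entry can be removed once its name has been recorded.

```shell
# Hypothetical sketch, NOT the real lockclean:
# fold the names of the empty LOG files into monthly summaries under LOGS.
#   $1 = LOCK directory, e.g. /grid/data/<group>/LOCK
summarize_logs() {
    lockdir=$1
    mkdir -p "${lockdir}/LOGS"
    for entry in "${lockdir}"/LOG/*; do
        [ -e "${entry}" ] || continue                  # LOG is empty, nothing to do
        name=$(basename "${entry}")
        month=$(printf '%s' "${name}" | cut -c1-6)     # YYYYMM from the YYYYMMDD prefix
        printf '%s\n' "${name}" >> "${lockdir}/LOGS/${month}.log"
        rm -f "${entry}"                               # assumed safe once recorded
    done
}
```

Each line of a summary file is then one completed lock, which is why line counts of the LOGS files measure activity.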

You can start the lockclean script manually; once running, it does this cleanup hourly.
But be careful, interactive logins on gpsn01 are in group gpcf.
Use 'sg' to set the proper group first.

<pre>
    GRP=<mygroup>
    nohup sg ${GRP} -c /grid/fermiapp/common/tools/lockclean &
</pre>

There should be a crontab entry for each account, like

<pre>
@reboot sg ${GRP} -c /grid/fermiapp/common/tools/lockclean
</pre>

|_. Group   |_. Account@Host         |_. crontab |
| annie     | ifmon@ifmongpvm02       | @reboot sleep 300; sg annie      /grid/fermiapp/common/tools/lockclean |
| cdms      | ifmon@ifmongpvm02       | @reboot sleep 300; sg scdms      /grid/fermiapp/common/tools/lockclean |
| coupp     | ifmon@ifmongpvm02       | @reboot sleep 300 ; sg coupp  -c /grid/fermiapp/common/tools/lockclean |
| darkside  | ifmon@ifmongpvm02       | @reboot sleep 300 ; sg darkside -c /grid/fermiapp/common/tools/lockclean |
| des       | desdata@ifmongpvm02     | @reboot sleep 300 ; sg des    -c /grid/fermiapp/common/tools/lockclean |
| dune      | ifmon@ifmongpvm02       | @reboot sleep 300 ;              /grid/fermiapp/common/tools/lockclean |
| e875      | mindata@minos27         | @reboot sleep 300 ;              /grid/fermiapp/common/tools/lockclean |
| e898      | ifmon@ifmongpvm02       | @reboot sleep 300 ; sg e898   -c /grid/fermiapp/common/tools/lockclean |
| e938      | minervadat@if02         | @reboot sleep 300 ;              /grid/fermiapp/common/tools/lockclean |
| gm2       | gm2dat@ifmongpvm02      | @reboot sleep 300 ; sg gm2    -c /grid/fermiapp/common/tools/lockclean |
| gpcf      | ifmon@ifmongpvm02       | @reboot sleep 300 ; sg gpcf   -c /grid/fermiapp/common/tools/lockclean |
| lar1nd    | ifmon@ifmongpvm02       | @reboot sleep 300 ; sg lar1nd -c /grid/fermiapp/common/tools/lockclean |
| marslbne  | marslbne@ifmongpvm02    | @reboot sleep 300 ;              /grid/fermiapp/common/tools/lockclean |
| marsmu2e  | marsmu2e@detsim         | @reboot sleep 300 ;              /grid/fermiapp/common/tools/lockclean |
| mu2e      | ifmon@ifmongpvm02       | @reboot sleep 300 ;              /grid/fermiapp/common/tools/lockclean |
| mu2epro   | mu2epro@mu2egpvm01      | RETIRED FEB 4 2013                                                     |
| nova      | novadata@gpcf028        | @reboot sleep 300;               /grid/fermiapp/common/tools/lockclean |
| numix     | ifmon@ifmongpvm02       | @reboot sleep 300; sg numix -c   /grid/fermiapp/common/tools/lockclean |
| sbnd      | ifmon@ifmongpvm02       | @reboot sleep 300; sg sbnd    -c /grid/fermiapp/common/tools/lockclean |
| sdss      | ifmon@ifmongpvm02       | @reboot sleep 300; sg sdss       /grid/fermiapp/common/tools/lockclean |
| t-962     | argoneut@argoneutgpvm01 | @reboot sleep 300;               /grid/fermiapp/common/tools/lockclean |
| t-1034    | ifmon@ifmongpvm02       | @reboot sleep 300 ; sg t-1034 -c /grid/fermiapp/common/tools/lockclean |
| uboone    | uboone@uboonegpvm01     | @reboot sleep 300;               /grid/fermiapp/common/tools/lockclean |

h1. USAGE

Get an idea of activity by counting lines in log files.

For example, for Minos,

<pre>
  $ wc -l /grid/data/e875/LOCK/LOGS/*.log
    9124 /grid/data/e875/LOCK/LOGS/200908.log
  140794 /grid/data/e875/LOCK/LOGS/200909.log
  181895 /grid/data/e875/LOCK/LOGS/200910.log
  196327 /grid/data/e875/LOCK/LOGS/200911.log
  125084 /grid/data/e875/LOCK/LOGS/200912.log
  272598 /grid/data/e875/LOCK/LOGS/201001.log
  284000 /grid/data/e875/LOCK/LOGS/201002.log
  275479 /grid/data/e875/LOCK/LOGS/201003.log
  354725 /grid/data/e875/LOCK/LOGS/201004.log
 1840026 total

  $ wc -l /grid/data/e875/LOCK/STALE/LOCKS/*.log
  $ wc -l /grid/data/e875/LOCK/STALE/QUEUE/*.log
</pre>

h2. INITIALIZATION

New LOCK areas should be created by the ifmon account.
We can use root@if-admin-minos to create this area.

<pre>
# as ifmon@gpsn01

GROU=<name of the group>
sg ${GROU}                # starts a new shell in that group
GROU=<name of the group>  # set again: unexported variables are not inherited by the new shell

. /grid/fermiapp/products/common/etc/setups.sh
ups tailor cpn          # will echo the commands proposed
ups tailor cpn -O write # will execute the commands

cd /grid/data/${GROU}/LOCK
chmod 777 DO LOCKS LOG QUEUE
</pre>

Update gpsn01:ifmon:crontab.gpsn01,
adding an entry for the new group,
and start the lockclean process manually.

<pre>
crontab crontab.gpsn01

nohup sg ${GROU} -c /grid/fermiapp/common/tools/lockclean &
</pre>

Test the locks as a normal user.

<pre>
GROU=<name of the group>

. /grid/fermiapp/products/common/etc/setups.sh

setup cpn

export CPN_LOCK_GROUP=${GROU}

lock
LOCK - Wed Jan 22 16:37:07 UTC 2014 lock  /grid/data/t-1034/LOCK/LOCKS/20140122.16:37:07.0.minos27.7436.kreymer.kreymer

lock free
LOCK - Wed Jan 22 16:37:09 UTC 2014 freed /grid/data/t-1034/LOCK/LOCKS/20140122.16:37:07.0.minos27.7436.kreymer.kreymer

cpn /usr/bin/crash /dev/null
LOCK - Wed Jan 22 16:37:44 UTC 2014 lock  /grid/data/t-1034/LOCK/LOCKS/20140122.16:37:44.1.minos27.30874.kreymer.kreymer
LOCK - Wed Jan 22 16:37:44 UTC 2014 freed /grid/data/t-1034/LOCK/LOCKS/20140122.16:37:44.1.minos27.30874.kreymer.kreymer
</pre>

h3. If /grid/data/${GROU} does not exist, or needs to be linked to an alternate name, see the following notes

<pre>
DATA=${GROU}  # directory to be locked, usually ${GROU}/data, but not always

# Determine what the gid is for the area.
stat /${DATA}/data | grep Gid

# as root@if-admin-minos

IFUID=45438  # ifmon UID
GRGID=....   # group GID from /${DATA}/data
GROU=<name of the group>

mkdir -p                /grid/data/${GROU}  # if necessary
ls -ld                  /grid/data/${GROU}  # note original owner
chown ${IFUID}:${GRGID} /grid/data/${GROU}  # set ownership while tailoring cpn
</pre>

REX will verify that the group id name is the same on Fermigrid nodes and in the Lab GID registry, at
http://www-giduid.fnal.gov/cd/FUE/uidgid/gid_id.lis

WARNING - if the group name contains a special character like -,
that group name will not be used on Fermigrid worker nodes.
You need to find out what the group name is there,
and make a symlink in /grid/data for compatibility.

<pre>
GRID=<name of Fermigrid group>

ln -s /grid/data/${GROU} /grid/data/${GRID}
</pre>
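
A quick check for whether a group name is likely to need such a symlink can be sketched as below. The function name needs_grid_alias is hypothetical, and treating every character other than letters, digits and underscore as "special" is an assumption; the notes above only name - as an example. The Fermigrid group name itself still has to be looked up by hand.

```shell
# Hypothetical helper: succeed (return 0) if the group name contains a
# character outside letters, digits and underscore, so that a differently
# named Fermigrid group, and a symlink in /grid/data, is likely needed.
needs_grid_alias() {
    case "$1" in
        *[!A-Za-z0-9_]*) return 0 ;;   # special character found
        *)               return 1 ;;   # plain name, no alias expected
    esac
}
```

For example, needs_grid_alias t-1034 succeeds, while needs_grid_alias mu2e does not.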