ADMINISTRATION » History » Version 10

Arthur Kreymer, 05/17/2013 10:54 AM

h1. ADMINISTRATION

h2. FILES

Working directories and files are under /grid/data/${GROUP}/LOCK

On the grid, the username does not reflect the identity of the
person who submitted the job, so the lock script gets the identity
from the grid proxy.

/LOCKS - active lock files
  The lock files are empty, with names containing
  date, time queued,               host, pid, user, identity

/QUEUE - pending locks; empty files with names containing
  date,                            host, pid, user, identity

/LOG   - empty files with names reflecting completed locks
  date, time queued, time locked,  host, pid, user, identity

/LOGS  - monthly text summaries built from LOG file names

/STALE - record of locks that have timed out
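
Because each lock is an empty file, the current load can be read off by counting directory entries. The following is a minimal sketch, assuming one file per lock directly inside LOCKS and QUEUE; the function names are illustrative and are not part of cpn.

```shell
# Count the regular files directly inside directory $1.
count_entries() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Summarize a LOCK directory ($1), e.g. /grid/data/${GROUP}/LOCK.
lock_status() {
    echo "active=$(count_entries "$1/LOCKS") queued=$(count_entries "$1/QUEUE")"
}
```

Calling 'lock_status /grid/data/e875/LOCK' prints one line of the form active=N queued=M.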

glimit - global activity limit, including all user groups
         set this near the actual BlueArc capacity
         this is not implemented as of 2012-11-06

limit  - local activity limit, for the users' own group
         set this well under BlueArc capacity

perf   - performance in MB/sec required in PERF before locking

PERF   - actual MB/sec performance, measured by an external agent
         ( No agents implemented as of 2010-08-02 )

rate   - net retry rate target, in retries per second

small  - size in MBytes; files smaller than this are not locked by cpn

wait   - minimum time to wait before retrying, regardless of the load.
         The time delay before retrying a lock is the larger of
* wait
* (number of queued locks)/rate
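
The retry delay can be sketched as below. Since 'wait' is described as a minimum, this sketch takes the larger of the two values; that reading is an assumption, as are the function name, the integer rate, and the rounding up to whole seconds. The second term comes from the rate target: if N queued clients each retry every N/rate seconds, the combined retry rate stays near 'rate' per second.

```shell
# Print the retry delay in seconds.
#   $1 = wait   (minimum delay, in seconds)
#   $2 = number of queued locks
#   $3 = rate   (target retries per second, assumed a positive integer)
retry_delay() {
    min_wait=$1
    queued=$2
    rate=$3
    # Delay that keeps the combined retry rate near the target,
    # rounded up to a whole second.
    load_delay=$(( (queued + rate - 1) / rate ))
    if [ "$load_delay" -gt "$min_wait" ]; then
        echo "$load_delay"
    else
        echo "$min_wait"
    fi
}
```

With wait=5 and rate=10, a queue of 20 locks gives the floor of 5 seconds, while a queue of 100 locks gives 10 seconds.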

h2. MAINTENANCE

Lock files should be owned by an appropriate group account, like mindata.

That account should occasionally remove expired locks and queue entries,
and concatenate LOG entries into monthly summary files.

You can run the lockclean script manually, which will do this hourly.
But be careful: interactive logins on gpsn01 are in group gpcf.
Use 'sg' to set the proper group first:
<pre>
    GRP=<mygroup>
    nohup sg ${GRP} -c /grid/fermiapp/common/tools/lockclean &
</pre>
There should be a crontab entry for each account, like
<pre>
@reboot sg ${GRP} -c /grid/fermiapp/common/tools/lockclean
</pre>
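
Because an interactive login may land in a different group than the one that owns the lock files, it can help to confirm the effective group before starting lockclean by hand. A small sketch; the helper name is hypothetical and is not part of lockclean.

```shell
# Succeed only if the current effective group name is $1.
in_group() {
    [ "$(id -gn)" = "$1" ]
}
```

For example, 'in_group des || echo "wrong group"' warns when the shell is not running in group des, and 'sg ${GRP} -c "id -gn"' shows the group a command started through sg would run under.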

| Group    | Account@Host            | crontab |
| des      | desdata@gpsn01          | @reboot sleep 300 ; sg des -c /grid/fermiapp/common/tools/lockclean |
| e875     | mindata@minos27         | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |
| e938     | minervadat@if02         | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |
| gm2      | gm2dat@gpsn01           | @reboot sleep 300 ; sg gm2 -c /grid/fermiapp/common/tools/lockclean |
| gpcf     | ifmon@gpsn01            | @reboot sleep 300 ; sg gpcf -c /grid/fermiapp/common/tools/lockclean |
| lbne     | lbnedata@lbnegpvm01     | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |
| marslbne | marslbne@lbnegpvm01     | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |
| marsmu2e | marsmu2e@detsim         | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |
| mu2e     | mu2e@mu2egpvm01         | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |
| mu2epro  | mu2epro@mu2egpvm01      | RETIRED FEB 4 2013 |
| t-962    | argoneut@argoneutgpvm01 | @reboot sleep 300 ; /grid/fermiapp/common/tools/lockclean |
| uboone   | uboone@uboonegpvm01     | @reboot sleep 300 ; /grid/fermiapp/common/tools/lockclean |
| nova     | novadata@gpcf028        | @reboot sleep 300 ; /grid/fermiapp/minos/scripts/lockclean |

h1. USAGE

Get an idea of activity by counting lines in the monthly log files.

For example, for Minos:

  $ wc -l /grid/data/e875/LOCK/LOGS/*.log
    9124 /grid/data/e875/LOCK/LOGS/200908.log
  140794 /grid/data/e875/LOCK/LOGS/200909.log
  181895 /grid/data/e875/LOCK/LOGS/200910.log
  196327 /grid/data/e875/LOCK/LOGS/200911.log
  125084 /grid/data/e875/LOCK/LOGS/200912.log
  272598 /grid/data/e875/LOCK/LOGS/201001.log
  284000 /grid/data/e875/LOCK/LOGS/201002.log
  275479 /grid/data/e875/LOCK/LOGS/201003.log
  354725 /grid/data/e875/LOCK/LOGS/201004.log
 1840026 total

Timed-out locks and queue entries can be counted the same way:

  $ wc -l /grid/data/e875/LOCK/STALE/LOCKS/*.log
  $ wc -l /grid/data/e875/LOCK/STALE/QUEUE/*.log
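
For a compact month-by-month view, the same count can be printed without the full paths. A minimal sketch, assuming the monthly summaries are named YYYYMM.log as in the listing above; the function name is illustrative.

```shell
# Print "YYYYMM count" for every monthly summary file in directory $1.
monthly_counts() {
    for f in "$1"/*.log; do
        # Skip the unexpanded pattern when the directory has no .log files.
        [ -f "$f" ] || continue
        printf '%s %s\n' "$(basename "$f" .log)" "$(wc -l < "$f" | tr -d ' ')"
    done
}
```

It would be called as 'monthly_counts /grid/data/e875/LOCK/LOGS', or on the STALE/LOCKS and STALE/QUEUE directories.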

h2. INITIALIZATION

To start up a new group's LOCKs,
the group should give REX DH people access to the account,
and open a ServiceNow ticket to have the files set up.
The .k5login should include
<pre>
dbox@FNAL.GOV
illingwo@FNAL.GOV
kreymer@FNAL.GOV
lyon@FNAL.GOV
mengel@FNAL.GOV
rs@FNAL.GOV
votava@FNAL.GOV
</pre>

REX will verify that the group ID name is the same
on FermiGrid nodes and in the Lab GID registry, at
http://www-giduid.fnal.gov/cd/FUE/uidgid/gid_id.lis
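
On a given node, the numeric GID behind a group name can be looked up for comparison with the registry listing. This is a sketch only: the helper name is hypothetical, it reads a file in /etc/group format, and it makes no assumption about the format of the registry listing itself, which is compared by hand.

```shell
# Print the numeric GID for group name $1.
#   $2 = group database in /etc/group format (defaults to /etc/group)
gid_of() {
    awk -F: -v name="$1" '$1 == name { print $3 }' "${2:-/etc/group}"
}
```

On nodes that take their groups from a network directory rather than the local file, 'getent group ${GRP}' prints the same colon-separated fields.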

REX will then log in to the account
and use 'ups tailor cpn' to create the default files
( available from cpn v1.3 onward ).

  ups tailor cpn          - echoes the proposed commands
  ups tailor cpn -O write - executes the commands