ADMINISTRATION » History » Version 6
Arthur Kreymer, 02/15/2013 11:54 AM
1 | 1 | Arthur Kreymer | h1. ADMINISTRATION |
---|---|---|---|
2 | 1 | Arthur Kreymer | |
3 | 2 | Arthur Kreymer | h2. FILES |
4 | 2 | Arthur Kreymer | |
5 | 1 | Arthur Kreymer | Working directories and files are under /grid/data/${GROUP}/LOCK |
6 | 1 | Arthur Kreymer | |
7 | 1 | Arthur Kreymer | On the grid, the username does not reflect the identity of the |
8 | 1 | Arthur Kreymer | person who submitted the job. |
9 | 1 | Arthur Kreymer | So the lock script gets the identity from the grid proxy. |
10 | 1 | Arthur Kreymer | |
11 | 1 | Arthur Kreymer | /LOCKS - active lock files |
12 | 1 | Arthur Kreymer | The lock files are empty, with names containing |
13 | 1 | Arthur Kreymer | date, time queued, host, pid, user, identity |
14 | 1 | Arthur Kreymer | |
15 | 1 | Arthur Kreymer | /QUEUE - locks pending, empty files containing |
16 | 1 | Arthur Kreymer | date, host, pid, user, identity |
17 | 1 | Arthur Kreymer | |
18 | 1 | Arthur Kreymer | /LOG - empty files with names reflecting completed locks |
19 | 1 | Arthur Kreymer | date, time queued, time locked, host, pid, user, identity |
20 | 1 | Arthur Kreymer | |
21 | 1 | Arthur Kreymer | /LOGS - monthly text summaries built from LOG file names. |
22 | 1 | Arthur Kreymer | |
23 | 1 | Arthur Kreymer | /STALE - record of locks that have timed out |
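Because every active or pending lock is one empty file, current activity can be gauged by counting the files in LOCKS and QUEUE. The following is a minimal sketch, not part of the lock script; the function name @count_entries@ is hypothetical, and GROUP is assumed to hold the group name (for example e875).

```shell
# Hypothetical helper: count the regular files directly inside a directory.
# Subdirectories are not counted and are not descended into.
count_entries() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Usage (GROUP is assumed to be set, e.g. GROUP=e875):
#   count_entries "/grid/data/${GROUP}/LOCK/LOCKS"   # active locks
#   count_entries "/grid/data/${GROUP}/LOCK/QUEUE"   # pending locks
```

The @-maxdepth@ option is supported by GNU and BSD find, which is assumed here.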
24 | 1 | Arthur Kreymer | |
25 | 1 | Arthur Kreymer | glimit - global activity limit, including all user groups |
26 | 1 | Arthur Kreymer | set this near the actual Bluearc capacity |
27 | 1 | Arthur Kreymer | this is not implemented as of 2012-11-06 |
28 | 1 | Arthur Kreymer | |
29 | 1 | Arthur Kreymer | limit - local activity limit, for the users' own group |
30 | 1 | Arthur Kreymer | set this well under Bluearc capacity |
31 | 1 | Arthur Kreymer | |
32 | 1 | Arthur Kreymer | perf - performance MB/sec required in PERF before locking |
33 | 1 | Arthur Kreymer | |
34 | 1 | Arthur Kreymer | PERF - actual MB/sec performance, measured by external agent |
35 | 1 | Arthur Kreymer | ( No agents implemented as of 2010-08-02 ) |
36 | 1 | Arthur Kreymer | |
37 | 1 | Arthur Kreymer | rate - net retry rate target, in retries per second |
38 | 1 | Arthur Kreymer | |
39 | 1 | Arthur Kreymer | small - MBytes: files smaller than this are not locked by cpn. |
40 | 1 | Arthur Kreymer | |
41 | 1 | Arthur Kreymer | wait - minimum time to wait before retrying, regardless of the load. |
42 | 1 | Arthur Kreymer | the time delay before retrying a lock is the larger of |
43 | 1 | Arthur Kreymer | * wait |
44 | 1 | Arthur Kreymer | * (number of queued locks)/rate |
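The retry delay can be sketched as a small shell function. This is an illustration only, not the code used by the lock script: the function name @retry_delay@ is hypothetical, the values are passed as arguments rather than read from the wait and rate files, and rate is assumed to be a positive integer. Since wait is described as a minimum time, the sketch treats it as a floor on the delay; that reading is an assumption.

```shell
# Hypothetical helper: retry delay in whole seconds.
#   $1 = wait   (minimum seconds between retries)
#   $2 = queued (number of queued locks)
#   $3 = rate   (target retries per second; assumed a positive integer)
retry_delay() {
    wait_min=$1
    queued=$2
    rate=$3
    delay=$(( queued / rate ))      # integer division, rounds down
    # Treat wait as a floor on the delay (assumption stated above).
    if [ "$delay" -lt "$wait_min" ]; then
        delay=$wait_min
    fi
    echo "$delay"
}

# Usage:
#   retry_delay 5 100 2    # 100 queued locks at 2 retries/sec
#   retry_delay 5 4 2      # short queue, so the floor applies
```

Scaling the delay with the queue length keeps the combined retry rate of all waiting jobs near the target, however many jobs are queued.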
45 | 1 | Arthur Kreymer | |
46 | 4 | Arthur Kreymer | h2. MAINTENANCE |
47 | 1 | Arthur Kreymer | |
48 | 1 | Arthur Kreymer | Lock files should be owned by an appropriate group account, such as mindata. |
49 | 1 | Arthur Kreymer | |
50 | 1 | Arthur Kreymer | That account should occasionally remove expired locks and queue entries, |
51 | 1 | Arthur Kreymer | and concatenate LOG entries into monthly summary files. |
52 | 1 | Arthur Kreymer | |
53 | 4 | Arthur Kreymer | You can start the lockclean script manually; it then performs this cleanup hourly. |
54 | 4 | Arthur Kreymer | Be careful: interactive logins on gpsn01 are in group gpcf. |
55 | 4 | Arthur Kreymer | Use 'sg' to set the proper group first: |
56 | 4 | Arthur Kreymer | <pre> |
57 | 1 | Arthur Kreymer | set nohup ; /grid/fermiapp/common/tools/lockclean & |
58 | 4 | Arthur Kreymer | </pre> |
59 | 4 | Arthur Kreymer | Each account should have a crontab entry such as: |
60 | 4 | Arthur Kreymer | <pre> |
61 | 4 | Arthur Kreymer | @reboot sg <mygroup> -c /grid/fermiapp/common/tools/lockclean |
62 | 4 | Arthur Kreymer | </pre> |
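To confirm that an account has such an entry, its crontab listing can be filtered for an @reboot line that runs lockclean. This is a minimal sketch; the function name @has_lockclean_entry@ is hypothetical, and it only checks that a matching line exists, not that the group given to sg is correct.

```shell
# Hypothetical helper: reads a crontab listing on stdin and succeeds
# (exit status 0) if it has an uncommented @reboot entry that runs lockclean.
has_lockclean_entry() {
    grep -Eq '^[[:space:]]*@reboot[[:space:]].*lockclean'
}

# Usage, logged in as the account in question:
#   crontab -l | has_lockclean_entry && echo "lockclean entry present"
```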
63 | 4 | Arthur Kreymer | |
64 | 6 | Arthur Kreymer | | Group | Account@Host | crontab | |
65 | 6 | Arthur Kreymer | | des | des@gpsn01 | | |
66 | 6 | Arthur Kreymer | | e875 | mindata@minos27 | | |
67 | 6 | Arthur Kreymer | | e938 | minervadat@if02 | | |
68 | 6 | Arthur Kreymer | | gpcf | ifmon@gpsn01 | @reboot sg gpcf -c /grid/fermiapp/common/tools/lockclean | |
69 | 6 | Arthur Kreymer | | lbne | lbnedata@lbnegpvm01 | | |
70 | 6 | Arthur Kreymer | | marslbne | | | |
71 | 6 | Arthur Kreymer | | marsmu2e | marsmu2e@detsim | | |
72 | 6 | Arthur Kreymer | | mu2e | mu2e@mu2egpvm01 | | |
73 | 6 | Arthur Kreymer | | mu2epro | mu2epro@mu2egpvm01 | | |
74 | 6 | Arthur Kreymer | | t-962 | argoneut@argoneutgpvm01 | @reboot /grid/fermiapp/common/tools/lockclean | |
75 | 6 | Arthur Kreymer | | uboone | uboone@uboonegpvm01 | | |
76 | 6 | Arthur Kreymer | | nova | novadata@gpcf028 | | |
77 | 4 | Arthur Kreymer | |
78 | 4 | Arthur Kreymer | |
79 | 4 | Arthur Kreymer | h1. USAGE |
80 | 1 | Arthur Kreymer | |
81 | 1 | Arthur Kreymer | Get an idea of activity by counting lines in log files. |
82 | 1 | Arthur Kreymer | |
83 | 1 | Arthur Kreymer | For example, for Minos, |
84 | 1 | Arthur Kreymer | |
85 | 1 | Arthur Kreymer | $ wc -l /grid/data/e875/LOCK/LOGS/*.log |
86 | 1 | Arthur Kreymer | 9124 /grid/data/e875/LOCK/LOGS/200908.log |
87 | 1 | Arthur Kreymer | 140794 /grid/data/e875/LOCK/LOGS/200909.log |
88 | 1 | Arthur Kreymer | 181895 /grid/data/e875/LOCK/LOGS/200910.log |
89 | 1 | Arthur Kreymer | 196327 /grid/data/e875/LOCK/LOGS/200911.log |
90 | 1 | Arthur Kreymer | 125084 /grid/data/e875/LOCK/LOGS/200912.log |
91 | 1 | Arthur Kreymer | 272598 /grid/data/e875/LOCK/LOGS/201001.log |
92 | 1 | Arthur Kreymer | 284000 /grid/data/e875/LOCK/LOGS/201002.log |
93 | 1 | Arthur Kreymer | 275479 /grid/data/e875/LOCK/LOGS/201003.log |
94 | 1 | Arthur Kreymer | 354725 /grid/data/e875/LOCK/LOGS/201004.log |
95 | 1 | Arthur Kreymer | 1840026 total |
96 | 1 | Arthur Kreymer | |
97 | 1 | Arthur Kreymer | |
98 | 1 | Arthur Kreymer | $ wc -l /grid/data/e875/LOCK/STALE/LOCKS/*.log |
99 | 1 | Arthur Kreymer | $ wc -l /grid/data/e875/LOCK/STALE/QUEUE/*.log |
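The monthly counts can be rolled up into yearly totals by piping the @wc -l@ output through a small filter. This is a sketch only; the function name @yearly_totals@ is hypothetical, and it assumes the log files are named YYYYMM.log as in the listing above.

```shell
# Hypothetical helper: reads `wc -l` output on stdin and prints one
# "year total" line per year, assuming files are named YYYYMM.log.
# The trailing "total" line printed by wc is ignored.
yearly_totals() {
    awk '
        $2 ~ /[0-9][0-9][0-9][0-9][0-9][0-9]\.log$/ {
            n = split($2, parts, "/")        # last path component is YYYYMM.log
            year = substr(parts[n], 1, 4)
            sum[year] += $1
        }
        END { for (y in sum) print y, sum[y] }
    ' | sort
}

# Usage:
#   wc -l /grid/data/e875/LOCK/LOGS/*.log | yearly_totals
```

Applied to the Minos listing above, this gives 653224 for 2009 and 1186802 for 2010, which add up to the reported total of 1840026.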
100 | 2 | Arthur Kreymer | |
101 | 2 | Arthur Kreymer | h2. INITIALIZATION |
102 | 2 | Arthur Kreymer | |
103 | 2 | Arthur Kreymer | To set up LOCK files for a new group, |
104 | 3 | Arthur Kreymer | the group should give the REX DH staff access to the account |
105 | 2 | Arthur Kreymer | and open a ServiceNow ticket to have the files set up. |
106 | 2 | Arthur Kreymer | The account's .k5login should include |
107 | 2 | Arthur Kreymer | <pre> |
108 | 2 | Arthur Kreymer | dbox@FNAL.GOV |
109 | 2 | Arthur Kreymer | illingwo@FNAL.GOV |
110 | 2 | Arthur Kreymer | kreymer@FNAL.GOV |
111 | 2 | Arthur Kreymer | lyon@FNAL.GOV |
112 | 2 | Arthur Kreymer | mengel@FNAL.GOV |
113 | 3 | Arthur Kreymer | rs@FNAL.GOV |
114 | 2 | Arthur Kreymer | votava@FNAL.GOV |
115 | 1 | Arthur Kreymer | </pre> |
116 | 2 | Arthur Kreymer | |
117 | 3 | Arthur Kreymer | REX will verify that the group id name is the same |
118 | 3 | Arthur Kreymer | on Fermigrid nodes and in the Lab GID registry, at |
119 | 3 | Arthur Kreymer | http://www-giduid.fnal.gov/cd/FUE/uidgid/gid_id.lis |
120 | 2 | Arthur Kreymer | |
121 | 3 | Arthur Kreymer | REX will then log in to the account |
122 | 3 | Arthur Kreymer | and use 'ups tailor cpn' to create the default files. |
123 | 3 | Arthur Kreymer | ( Available from cpn v1.3 onward ) |
124 | 2 | Arthur Kreymer | |
125 | 3 | Arthur Kreymer | ups tailor cpn - will echo the commands proposed |
126 | 3 | Arthur Kreymer | ups tailor cpn -O write - will execute the commands |