DM - Expert Documentation » History » Version 52

Lu Ren, 08/21/2018 11:59 PM

h1. DM - Expert Documentation
h2. Documentation!
... is finally initiated by David Caratelli :) 
For super-duper experts: if you are interested, the PUBS base framework documentation is on "DocDB 5400":
Keep it updated! Also attach the latest version to this wiki.
h2. Getting the PUBS daemon running
*%{color:blue}The daemons should be restarted every Monday, Wednesday, Friday, and Sunday.%*
Details are on this page. [[Starting the PUBS online daemon]]
h2. Moving all projects to a single online machine
Details are on this page. [[Running all PUBS projects on single server]]
h2. Building up the PUBS online testbed
Details are on this page. [[Building up the PUBS online testbed]]
h2. Mapping project names to names on the GUI
How do I find the project name (database table name) given the name of a specific box on the monitoring GUI? See [[Project GUI Map]].
h2. Querying DB for errors. [[DB Query]]
h2. Changing the Database Configuration for Online PUBS [[Online PUBS Database Reconfig]]
h2. Correcting Errors in PUBS
* Errors in *Metadata Generation* From Incomplete Files [[Correcting Failed Metadata Generation]]
* Failed *Near1 Binary Transfers* [[Correcting Failed Near1 Binary Transfer]]
> If files transferred from EVB to Near1 fail to transfer to the FTS dropbox, errors will appear in the Near1 Binary Transfer box on PUBS.
* Errors in *Registering File Metadata*, and crontab entries for Kerberos tickets and grid proxies [[Correcting Failed Metadata Registration]]
> Use this when there are SSL problems registering file metadata into the SAM database, or when crontab entries are missing.
h2. Expired Certificate on Near1 "Request OSG Production Service Certificate":
h2. Running out of Disk Space on ubdaq-prod-evb?
Useful info: there are ~33 TB of disk space in /data/ on the evb machine. PUBS will try to clear data in /data/uboonedaq/TestRuns/ until the disk usage reaches 40%.
If the disk is filling up, there are several things one should do:
0) Identify who is using up the disk space. Options:
--> a) /data/uboonedaq/rawdata/  -> this is where data from "official" runs goes. Files here are seen (and should be eventually removed) by PUBS.
--> b) /data/uboonedaq/TestRuns/ -> this is disk-space DAQ people use to test things. It is not seen by PUBS and needs to be removed by hand in order to be cleared.
--> c) /data/uboonedaq/lukhanin/ -> test-space for Gennadiy. Also needs to be removed manually in order to free up space.
--> d) /data/OTHER/              -> data used by someone else.
If most of the space is not being used by /data/uboonedaq/rawdata/, we need to free space manually. If it is urgent to free up space (i.e. data-taking should not be interrupted and the disk will fill up soon), you are authorized to clear /data/uboonedaq/TestRuns/. Contact anyone else who is using a considerable amount of space and ask them to quickly remove the contents of their /data/ folder.
If /data/uboonedaq/rawdata/ is using up a significant amount of space, the problem is probably PUBS' fault.
1) Identify the cause of the problem. Why is disk space not being freed? Possible causes:
--> a) clear_binary_evb is having issues.
--> b) clear_binary_evb does not find any new files to clear. This indicates a possible problem with one of the projects that clear_binary_evb depends on. A possible cause could be poor network speed to drain data out of the evb machine.
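
The check in step 0 can be sketched as a small shell helper. This is only a sketch, not an official PUBS tool: the function name report_usage is made up for illustration, and the example paths are the ones listed under step 0.

```shell
#!/bin/bash
# Hypothetical helper (not part of PUBS): print the size in KB of each
# directory given as an argument, largest first.
report_usage() {
    # du -sk prints one summary line per directory, in 1024-byte blocks.
    # Errors from unreadable subdirectories are discarded.
    du -sk "$@" 2>/dev/null | sort -rn
}

# Example call on the evb machine (paths taken from step 0):
# df -h /data/
# report_usage /data/uboonedaq/rawdata/ /data/uboonedaq/TestRuns/ /data/uboonedaq/lukhanin/
```

The first line of the output is the largest consumer, which tells you whether to look at PUBS (rawdata) or to contact the owner of a test area.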
Questions? "Ask Kirby"
h2. Running out of Disk Space on /datalocal/ @ near1?
If the disk usage on /datalocal/ is above 95%, stop the "mv_binary_evb" project immediately. Notify the PUBS team that you did this, then start addressing the disk-space issue.
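
The 95% check can be sketched in shell as follows. The function names disk_usage_percent and is_over_limit are hypothetical, and the sketch only reports the condition; stopping "mv_binary_evb" is still done by hand through the usual PUBS procedure.

```shell
#!/bin/bash
# Hypothetical helpers (not part of PUBS).

# Print the used percentage (integer, no % sign) of the filesystem holding $1.
# Assumes the filesystem and mount point names contain no spaces.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) if usage $1 is strictly above limit $2.
is_over_limit() {
    [ "$1" -gt "$2" ]
}

# Example on near1:
# if is_over_limit "$(disk_usage_percent /datalocal/)" 95; then
#     echo "/datalocal/ is above 95%: stop mv_binary_evb and notify the PUBS team"
# fi
```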
h2. [[What to do if dCache/enstore go down (no access to pnfs area)]]
h2. Running out of Disk Space on sebXX? uB_DataMgmt_PCXX_seb06_data/disk_occ
This is a supernova stream related issue. The supernova PUBS projects are located on ws02. Please restart the daemon on ws02. The error will clear after ~15 min.
Further notes: this particular error can happen when one of the PUBS projects for the SNS (supernova stream) gets stuck. We are using a cpulimiter to keep the load down on the ws02 machine. Sometimes the cpulimiter can hang one of the projects and halt the registration of new incoming SNS files. Restarting the daemon will kill and refresh the PUBS projects.
h2. A collaborator has asked me, the DM expert, to prevent the deletion of one or more SN runs.
To prevent the deletion of one or more runs in the SN stream, log in as uboonepro. Head over to the SN PUBS script directory, /home/uboonepro/pubs/dstream_online/snova. There you will find "frozen_runs.txt". In this file, insert *newline-separated* run numbers. The monitoring script reads this ASCII text file and will not delete files belonging to the runs listed in it.
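
Editing the file by hand is fine. As a sketch of the same step, the helper below appends a run number only if it is not already listed. The function name freeze_run and the run number 12345 are made up for illustration, and the sketch assumes the file holds exactly one run number per line and ends with a newline.

```shell
#!/bin/bash
# Hypothetical helper (not part of PUBS): add run number $2 to the
# frozen-runs file $1, one run number per line, skipping duplicates.
freeze_run() {
    local file=$1 run=$2
    # Accept only plain decimal run numbers.
    case $run in
        ''|*[!0-9]*) echo "not a run number: $run" >&2; return 1 ;;
    esac
    # -x matches the whole line, -F treats the run number as a fixed string.
    if ! grep -qxF "$run" "$file" 2>/dev/null; then
        echo "$run" >> "$file"
    fi
}

# Example, as uboonepro in /home/uboonepro/pubs/dstream_online/snova:
# freeze_run frozen_runs.txt 12345
```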
h2. The daemon on ws02 has mysteriously died.
DM experts are currently debugging an issue related to the daemon on ws02 being killed by the kernel. If you are a DM expert on shift and you find that the ws02 daemon has mysteriously died, please execute the following command to copy the log files to a safe location, then restart the daemon.
<pre>
mkdir -p /data/uboonepro/ws02_daemon_failures/`date +%D`; cp /home/uboonepro/pubs/log/* /data/uboonepro/ws02_daemon_failures/`date +%D`/
</pre>
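
The same copy can be sketched as a function that evaluates the date only once, so the mkdir and the cp cannot end up pointing at different directories if the command runs across midnight. The function name archive_logs is hypothetical. It keeps the date +%D format from the command above, which prints MM/DD/YY and therefore creates nested MM/DD/YY directories.

```shell
#!/bin/bash
# Hypothetical helper (not part of PUBS): copy every file in log directory $1
# into a dated subdirectory of $2 and print the destination path.
archive_logs() {
    local src=$1 dest_root=$2 stamp dest
    # Evaluate the date once so mkdir and cp use the same directory.
    # date +%D prints MM/DD/YY, so the destination is dest_root/MM/DD/YY.
    stamp=$(date +%D)
    dest="$dest_root/$stamp"
    mkdir -p "$dest" && cp "$src"/* "$dest"/ && echo "$dest"
}

# Example on ws02:
# archive_logs /home/uboonepro/pubs/log /data/uboonepro/ws02_daemon_failures
```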