DM - Expert Documentation » History » Version 53

Lu Ren, 08/22/2018 12:01 AM

h1. DM - Expert Documentation

{{>toc}}

h2. Documentation!

... was finally started by David Caratelli :) https://www.overleaf.com/3384459hxbyns#/9541217/

For super-duper experts who are interested, the PUBS base framework documentation is on "DocDB 5400":http://microboone-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=5400

Keep it updated! Also attach the latest version to this wiki.

h2. Starting the PUBS daemon

*%{color:blue}The daemons should be restarted every Monday, Wednesday, Friday, and Sunday.%*

Details are on this page. [[Starting the PUBS online daemon]]

h2. Moving all projects to a single online machine

Details are on this page. [[Running all PUBS projects on single server]]

h2. Building up the PUBS online testbed

Details are on this page. [[Building up the PUBS online testbed]]

h2. Mapping project names to names on the GUI

How do I find the project name (database table name) given the name of a specific box on the monitoring GUI? See [[Project GUI Map]].

h2. Querying DB for errors. [[DB Query]]

h2. Changing the Database Configuration for Online PUBS [[Online PUBS Database Reconfig]]

h2. Correcting Errors in PUBS

* Errors in *Metadata Generation* from incomplete files: [[Correcting Failed Metadata Generation]]

* Failed *Near1 Binary Transfers*: [[Correcting Failed Near1 Binary Transfer]]

> If files transferred from EVB to Near1 fail to transfer to the FTS dropbox, errors will appear in the Near1 Binary Transfer box on PUBS.

* Errors in *Registering File Metadata*, and crontab entries for Kerberos tickets and grid proxies: [[Correcting Failed Metadata Registration]]

> Use this page when there are SSL problems registering file metadata into the SAM database, or when crontab entries are missing.

h2. Expired Certificate on Near1 "Request OSG Production Service Certificate":https://cdcvs.fnal.gov/redmine/projects/uboonecode/wiki/CSR

h2. Running out of Disk Space on ubdaq-prod-evb?

Useful info: there are ~33 TB of disk space in /data/ on the evb machine. PUBS will try to clear data in /data/uboonedaq/TestRuns/ until disk usage reaches 40%.

If the disk is filling up, there are several things to do:
0) Identify who is using up the disk space. Options:
--> a) /data/uboonedaq/rawdata/  -> this is where data from "official" runs goes. Files here are seen (and should eventually be removed) by PUBS.
--> b) /data/uboonedaq/TestRuns/ -> this is disk space DAQ people use to test things. It is not seen by PUBS and must be cleared by hand.
--> c) /data/uboonedaq/lukhanin/ -> test space for Gennadiy. It also must be removed manually to free up space.
--> d) /data/OTHER/              -> data used by someone else.
If most of the space is not being used by /data/uboonedaq/rawdata/, we need to free space manually. If it is urgent to free up space (i.e. data-taking should not be interrupted and the disk will fill up soon), you are authorized to clear /data/uboonedaq/TestRuns/. Contact anyone else who is using a considerable amount of space and ask them to quickly remove the contents of their /data/ folder.
If /data/uboonedaq/rawdata/ is using up a significant amount of space, the problem is probably PUBS' fault.
1) Identify the cause of the problem. Why is disk space not being freed? Possible causes:
--> a) clear_binary_evb is having issues.
--> b) clear_binary_evb does not find any new files to clear. This indicates a possible problem with one of the projects that clear_binary_evb depends on. One possible cause is a network too slow to drain data off the evb machine.

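Step 0 above (finding out who is using the disk) can be sketched as a small shell helper. This is only a minimal sketch, not an official PUBS tool: the function name disk_usage_by_dir is hypothetical, and it relies on nothing beyond the standard du and sort commands. It prints each immediate subdirectory of a data area with its size in kilobytes, largest first.

```shell
# Hypothetical helper (not part of PUBS): print the size in KB of each
# immediate subdirectory of a data area, largest first.
# Argument 1: directory to inspect (defaults to /data/uboonedaq).
disk_usage_by_dir() {
    du -sk "${1:-/data/uboonedaq}"/*/ 2>/dev/null | sort -rn
}
```

For example, running disk_usage_by_dir /data/uboonedaq shows whether rawdata, TestRuns, or another directory dominates, and df -h /data shows the overall usage of the filesystem holding /data/.
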
Questions? "Ask Kirby":mailto:kirby@fnal.gov

h2. Running out of Disk Space on /datalocal/ @ near1?

If the disk usage on /datalocal/ is above 95%, immediately stop the "mv_binary_evb" project. Notify the PUBS team that you did this, then start addressing the disk-space issue.

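The 95% check described above can be sketched in shell. This is a minimal sketch under stated assumptions: the function names usage_percent, is_over_limit, and check_disk are hypothetical, and the script only prints a reminder. It does not stop mv_binary_evb itself, since the command for that is documented elsewhere.

```shell
# Hypothetical helpers (not part of PUBS) for checking disk usage against a limit.

# Print the usage of the filesystem holding directory $1 as an integer percent.
usage_percent() {
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage percent $1 is strictly above limit $2.
is_over_limit() {
    [ "$1" -gt "$2" ]
}

# Print a reminder if directory $1 is above limit $2 (default 95).
check_disk() {
    pct=$(usage_percent "$1")
    [ -n "$pct" ] || return 1
    if is_over_limit "$pct" "${2:-95}"; then
        echo "$1 is at ${pct}%: stop mv_binary_evb and notify the PUBS team"
    else
        echo "$1 is at ${pct}%: OK"
    fi
}
```

On near1 this would be called as check_disk /datalocal. The comparison is strict, matching "above 95%" in the text.
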
h2. [[What to do if dCache/enstore go down (no access to pnfs area)]]

h2. Running out of Disk Space on sebXX? uB_DataMgmt_PCXX_seb06_data/disk_occ

This is a supernova stream related issue. The supernova PUBS projects are located on ws02. Please restart the daemon on ws02. The error will clear after ~15 min.

Further notes: this particular error can happen when one of the PUBS projects for the SNS (supernova stream) gets stuck. We are using a cpulimiter to keep the load down on the ws02 machine. Sometimes the cpulimiter can hang one of the projects and cause registration of new incoming SNS files to halt. Restarting the daemon will kill and refresh the PUBS projects.

h2. A collaborator has asked me, the DM expert, to prevent the deletion of one or more SN runs.

To prevent the deletion of one or more runs in the SN stream, log in as uboonepro. Head over to the SN PUBS script directory, /home/uboonepro/pubs/dstream_online/snova. There you will find "frozen_runs.txt". In this file insert *newline-separated* run numbers. The monitoring script reads this ASCII text file and prevents the deletion of files from the runs listed in it.

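Adding a run to frozen_runs.txt can be sketched as a shell helper. This is a minimal sketch, not an existing PUBS script: the function name freeze_run is hypothetical, and it assumes the file holds nothing but run numbers, one per line, as described above. The duplicate check and the trailing-newline guard are conveniences added here, not requirements stated by the monitoring script.

```shell
# Hypothetical helper (not part of PUBS): add a run number to a frozen-runs
# file, one run per line, skipping runs that are already listed.
# Argument 1: run number. Argument 2: file (defaults to the path given above).
freeze_run() {
    run="$1"
    file="${2:-/home/uboonepro/pubs/dstream_online/snova/frozen_runs.txt}"
    # Accept only non-empty strings made of digits.
    case "$run" in
        ''|*[!0-9]*) echo "not a run number: '$run'" >&2; return 1 ;;
    esac
    touch "$file" || return 1
    # If the file does not end with a newline, add one so lines do not merge.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    # Append only if the run is not already on a line of its own.
    grep -qx "$run" "$file" || echo "$run" >> "$file"
}
```

As uboonepro, freeze_run followed by a run number would append that run to the default file. Calling it twice with the same run leaves the file unchanged the second time.
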
h2. The daemon on ws02 has mysteriously died.

DM experts are currently debugging an issue in which the daemon on ws02 is killed by the kernel. If you are a DM expert on shift and you find that the ws02 daemon has mysteriously died, please execute the following command to copy the log files to a safe location, then restart the daemon.
<pre>
# Copy the ws02 daemon logs into a dated directory; `date +%D` prints mm/dd/yy, so mkdir -p creates nested month/day/year directories.
d=/data/uboonepro/ws02_daemon_failures/$(date +%D); mkdir -p "$d" && cp /home/uboonepro/pubs/log/ubdaq-prod-ws02.fnal.gov/* "$d"/
</pre>
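Since the daemon is reported as being killed by the kernel, it may also be worth saving any kernel-log lines that mention the out-of-memory killer alongside the daemon logs. This is only a guess at one possible cause, not a diagnosis from the DM experts. The function name filter_kernel_kills is hypothetical, and the patterns are the usual Linux messages of the form "Out of memory: ..." and "Killed process ...".

```shell
# Hypothetical helper: keep only kernel-log lines that look like OOM-killer
# messages ("Out of memory: ..." or "Killed process ..."). Reads stdin.
filter_kernel_kills() {
    grep -i -E 'out of memory|killed process'
}
```

A typical use would be dmesg | filter_kernel_kills, with the output redirected into the same dated directory as the copied logs. On some systems dmesg requires root.
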