DM - Expert Documentation » History » Version 54

Lu Ren, 08/22/2018 12:01 AM

{{>toc}}

h1. DM - Expert Documentation

h2. Documentation!

... is finally initiated by David Caratelli :) https://www.overleaf.com/3384459hxbyns#/9541217/

For super-duper experts who are interested, the PUBS base framework documentation is on "DocDB 5400":http://microboone-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=5400

Keep it updated! Also attach the latest version to this Wiki.

h2. Starting the PUBS daemon

*%{color:blue}The daemons should be restarted every Monday, Wednesday, Friday, and Sunday.%*

Details are on this page: [[Starting the PUBS online daemon]]

h2. Moving all projects to a single online machine

Details are on this page: [[Running all PUBS projects on single server]]

h2. Building up the PUBS online testbed

Details are on this page: [[Building up the PUBS online testbed]]

h2. Mapping project names to names on the GUI

How do I find the project name (database table name) given the name of a specific box on the monitoring GUI? See [[Project GUI Map]]

h2. Querying DB for errors. [[DB Query]]

h2. Changing the Database Configuration for Online PUBS [[Online PUBS Database Reconfig]]

h2. Correcting Errors in PUBS

* Errors in *Metadata Generation* from incomplete files: [[Correcting Failed Metadata Generation]]

* Failed *Near1 Binary Transfers*: [[Correcting Failed Near1 Binary Transfer]]

> If files transferred from EVB to Near1 fail to transfer to the FTS dropbox, errors will appear in the Near1 Binary Transfer box on PUBS.

* Errors in *Registering File Metadata*, and crontab entries for Kerberos tickets and grid proxies: [[Correcting Failed Metadata Registration]]

> Use this when there are SSL problems registering file metadata into the SAM database, or when crontab entries are missing.

h2. Expired Certificate on Near1 "Request OSG Production Service Certificate":https://cdcvs.fnal.gov/redmine/projects/uboonecode/wiki/CSR

h2. Running out of Disk Space on ubdaq-prod-evb?

Useful info: there are ~33 TB of disk space in /data/ on the evb machine. PUBS will try to clear data in /data/uboonedaq/TestRuns/ until the 40% disk-usage threshold is reached.

If this is the case, there are several things one should do:
0) Identify who is using up the disk space. Options:
--> a) /data/uboonedaq/rawdata/  -> this is where data from "official" runs goes. Files here are seen (and should eventually be removed) by PUBS.
--> b) /data/uboonedaq/TestRuns/ -> this is disk space DAQ people use to test things. It is not seen by PUBS and must be removed by hand.
--> c) /data/uboonedaq/lukhanin/ -> test space for Gennadiy. It also needs to be removed manually to free up space.
--> d) /data/OTHER/              -> data used by someone else.
If most of the space is not being used by /data/uboonedaq/rawdata/, we need to free space manually. If it is urgent to free up space (i.e. data-taking should not be interrupted and the disk will fill up soon), you are authorized to clear /data/uboonedaq/TestRuns/. Contact anyone else who is using a considerable amount of space and ask them to quickly remove the contents of their /data/ folder.
If /data/uboonedaq/rawdata/ is using up a significant amount of space, the problem is probably PUBS' fault.
1) Identify the cause of the problem. Why is disk space not being freed? Possible causes:
--> a) clear_binary_evb is having issues.
--> b) clear_binary_evb does not find any new files to clear. This indicates a possible problem with one of the projects that clear_binary_evb depends on. One possible cause is poor network speed when draining data out of the evb machine.
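
Step 0 above (finding out who is using the disk space) can be sketched with standard tools. This is only a minimal sketch, not part of PUBS: @largest_dirs@ is a hypothetical helper name, and it simply wraps @du@ and @sort@ to show which subdirectories of a given path hold the most data.

```shell
# Hypothetical helper (not part of PUBS): list the immediate subdirectories
# of the directory given as "$1", largest first. Sizes are in KiB.
largest_dirs() {
    # du -sk prints "<size-in-KiB> <path>" for each subdirectory;
    # sort -rn orders the lines numerically, largest first.
    du -sk "$1"/*/ 2>/dev/null | sort -rn
}
```

For example, @largest_dirs /data/uboonedaq@ would show whether rawdata/, TestRuns/, or another directory dominates, and @df -h /data@ gives the overall usage of the disk.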

Questions? "Ask Kirby":mailto:kirby@fnal.gov

h2. Running out of Disk Space on /datalocal/ @ near1 ?

If the disk usage at /datalocal/ is above 95%, immediately stop the "mv_binary_evb" project. Notify the PUBS team that you have done so, and start addressing the disk-space issue.
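
The 95% check can be sketched as a small shell test. This is a hedged sketch, not an existing PUBS tool: @disk_above@ is a hypothetical function name, and the script only prints a reminder, since how the project is stopped is covered elsewhere.

```shell
# Hypothetical helper: succeed (exit status 0) when the filesystem that
# holds "$1" is more than "$2" percent full.
disk_above() {
    # df -P prints one POSIX-format line per filesystem; field 5 of the
    # second line is the used capacity, e.g. "87%".
    used=$(df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    [ "$used" -gt "$2" ]
}

# The path and the 95% limit are the ones quoted in the text above.
if [ -d /datalocal ] && disk_above /datalocal 95; then
    echo "/datalocal is above 95%: stop mv_binary_evb and notify the PUBS team"
fi
```

Keeping the threshold as an argument makes the same function usable for other disks and limits.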

h2. [[What to do if dCache/enstore go down (no access to pnfs area)]]

h2. Running out of Disk Space on sebXX? uB_DataMgmt_PCXX_seb06_data/disk_occ

This is a supernova stream related issue. The supernova PUBS projects are located on ws02. Please restart the daemon on ws02. The error will clear after ~15 min.

Further notes: this particular error can happen when one of the PUBS projects for the SNS (supernova stream) gets stuck. We are using a cpulimiter to keep the load down on the ws02 machine. Sometimes the cpulimiter can hang one of the projects and cause new incoming SNS file registration to halt. Restarting the daemon will kill and refresh the PUBS projects.

h2. Collaborator has asked me, the DM expert, to prevent the deletion of one or more SN runs.

To prevent the deletion of one or more runs in the SN stream, log in as uboonepro. Go to the SN PUBS script directory, /home/uboonepro/pubs/dstream_online/snova. There you will find "frozen_runs.txt". In this file, insert *newline-separated* run numbers. The monitoring script reads this ASCII text file and prevents the deletion of files belonging to the runs listed in it.
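
Adding a run to the file can be sketched as follows. This is a minimal sketch under the assumption that the file holds exactly one run number per line, as described above; @freeze_run@ is a hypothetical helper name, not part of the PUBS scripts.

```shell
# Hypothetical helper: append run number "$2" to the frozen-runs file "$1",
# one run per line, skipping runs that are already listed.
freeze_run() {
    file=$1
    run=$2
    touch "$file"
    # If the file is non-empty and does not end with a newline, add one so
    # the new run number starts on its own line.
    if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    # -x matches whole lines only, -F treats the run number as a literal string.
    grep -qxF "$run" "$file" || printf '%s\n' "$run" >> "$file"
}
```

A call would look like @freeze_run /home/uboonepro/pubs/dstream_online/snova/frozen_runs.txt 12345@, where 12345 is a made-up run number.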

h2. The daemon on ws02 has mysteriously died.

DM experts are currently debugging an issue in which the daemon on ws02 is killed by the kernel. If you are a DM expert on shift and find that the ws02 daemon has mysteriously died, please execute the following command to copy the log files to a safe location, and then restart the daemon.
<pre>
# Copy the ws02 daemon logs to a dated directory (date +%D prints MM/DD/YY, so the path is nested).
dest=/data/uboonepro/ws02_daemon_failures/$(date +%D)
mkdir -p "$dest" && cp /home/uboonepro/pubs/log/ubdaq-prod-ws02.fnal.gov/* "$dest"/
</pre>