DM - Expert Documentation » History » Version 20

Version 19 (Michael Kirby, 01/04/2017 06:27 PM) → Version 20/72 (David Caratelli, 01/21/2017 09:14 AM)

{{>toc}}

h1. DM - Expert Documentation

h2. Documentation!

... is finally initiated by David Caratelli :)
https://www.overleaf.com/3384459hxbyns#/9541217/

For super-duper experts who are interested, the PUBS base framework documentation is on "DocDB 5400":http://microboone-docdb.fnal.gov:8080/cgi-bin/ShowDocument?docid=5400

Keep it updated! Also attach the latest version to this Wiki.

h2. Starting the PUBS daemon

Details are on this page. [[Starting the PUBS online daemon]]

h2. Building up the PUBS online testbed

Details are on this page. [[Building up the PUBS online testbed]]

h2. Querying DB for errors. [[DB Query]]

h2. Project Debugging Home-Page (list of projects and debug info for each one). [[Project Debug]]

h2. Correcting Errors in Metadata Generation From Incomplete Files

This is a problem with incomplete files that have fewer than one event. Detailed instructions here: [[Correcting Failed Metadata Generation]]

h2. Correcting Failed Near1 Binary Transfers

If files transferred from EVB to Near1 fail to transfer to the FTS dropbox, errors will appear in the Near1 Binary Transfer box on PUBS. Detailed instructions here: [[Correcting Failed Near1 Binary Transfer]]

h2. Running out of Disk Space on ubdaq-prod-evb ?

Useful info: there is ~33 TB of disk space in /data/ on the evb machine. PUBS will try to clear data in /data/uboonedaq/TestRuns/ until the disk usage reaches 40%.

If this is the case there are several things one should do:
0) Identify who is using up the disk space. Options:
--> a) /data/uboonedaq/rawdata/ -> this is where data from "official" runs goes. Files here are seen (and should eventually be removed) by PUBS.
--> b) /data/uboonedaq/TestRuns/ -> this is disk-space DAQ people use to test things. It is not seen by PUBS and needs to be removed by hand in order to be cleared.
--> c) /data/uboonedaq/lukhanin/ -> test-space for Gennadiy. Also needs to be removed manually in order to free up space.
--> d) /data/OTHER/ -> data used by someone else.
If most of the space is not being used by /data/uboonedaq/rawdata/ we need to free space manually. If it is urgent to free up space (i.e. data-taking should not be interrupted and the disk will fill up rather soon) you are authorized to clear /data/uboonedaq/TestRuns/. Contact any other person who is using up a considerable amount of space and ask them to quickly remove contents in their /data/ folder.
If /data/uboonedaq/rawdata/ is using up a significant amount of space, the problem is probably PUBS' fault.
1) Identify the cause of the problem. Why is disk space not being freed? Possible causes:
--> a) clear_binary_evb is having issues.
--> b) clear_binary_evb does not find any new files to clear. This indicates a possible problem with one of the projects that clear_binary_evb depends on. A possible cause could be poor network speed to drain data out of the evb machine.
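
For step 0), a minimal sketch of how one might rank the candidate directories by size. The function name @rank_by_usage@ is hypothetical and not part of PUBS; it only wraps the standard @du@ and @sort@ tools.

```shell
#!/bin/sh
# Rank the given directories by disk usage, largest first.
# Output: one line per directory, "<size in KiB><TAB><path>".
# NOTE: rank_by_usage is an illustrative helper, not a PUBS command.
rank_by_usage() {
    # du -sk prints one summary line per argument in 1024-byte units.
    # Errors (e.g. unreadable subdirectories) are hidden to keep the output clean.
    du -sk "$@" 2>/dev/null | sort -rn
}

# Example, using the paths listed above:
# rank_by_usage /data/uboonedaq/rawdata /data/uboonedaq/TestRuns /data/uboonedaq/lukhanin
```

The directory on the first output line is the largest consumer and the first place to look.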

Questions? "Ask Kirby":mailto:kirby@fnal.gov

h2. Running out of Disk Space on /datalocal/ @ near1 ?

If the disk usage on /datalocal/ is above 95%, stop the "mv_binary_evb" project as an immediate action. Notify the PUBS team that you just did this and start addressing the disk-space issue.
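
The 95% check can be sketched as a small shell helper. The function names @usage_percent@ and @above_limit@ are hypothetical; the sketch assumes the POSIX @df -P@ output format, where the fifth column is the capacity in percent.

```shell
#!/bin/sh
# Print the usage percentage (integer, no % sign) of the filesystem holding $1.
# NOTE: usage_percent and above_limit are illustrative helpers, not PUBS commands.
usage_percent() {
    # df -P prints a header plus one line per filesystem; column 5 looks like "87%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) when usage of $1 is strictly above $2 percent.
above_limit() {
    [ "$(usage_percent "$1")" -gt "$2" ]
}

# Example:
# if above_limit /datalocal 95; then
#     echo "stop mv_binary_evb and notify the PUBS team"
# fi
```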

h2. What to do if dcache/enstore go down (no access to pnfs area)

This means we should not attempt to transfer a binary file to the dcache area.
There are 2 actions to be taken: one before the beginning of the downtime and one at the end.

h3. *%{color:blue}First action%*: ~30 minutes before the scheduled downtime begins

This requires 2 operations that can be done in 1 step.

a) Change the RESOURCE parameter "BYPASS" value from "False" to "True" for the evb=>dropbox transfer project
b) Disable the near1=>dropbox transfer project

How does this work? Step a) changes the destination of a file transfer from dropbox to near1.
This way a file produced at evb by the DAQ is moved to the near1 disk space, keeping the evb area available for more data taking.
On the other hand, since dcache is unavailable, we want to disable the project that constantly cleans up the near1 area by draining files into dcache.
That is what step b) does.

As of the date of this writing, relevant project names for a) and b) are:
Project name for a) ... prod_transfer_binary_evb2dropbox_evb
Project name for b) ... prod_transfer_binary_near12dropbox_near1

*How can we do this in 1 step?*
0) Log into either evb or near1 as uboonepro, then <pre>source $HOME/pubs/config/setup_uboonepro_online.sh
cfg_dump_project current.cfg</pre>

1) Edit project configuration. The easiest way is to dump the currently running configuration, alter, and upload. <pre>alias vi="emacs -nw"
vi current.cfg</pre>

2) Upload the project configuration: <pre>$PUB_TOP_DIR/sbin/register_project current.cfg</pre> At the command prompt, type "y" if you agree with the modification.
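
Before uploading in step 2), it can help to review exactly which lines were changed. The following is a minimal sketch, assuming you keep an untouched copy of the dump next to the edited file; the helper name @show_cfg_changes@ and the backup name @current.cfg.orig@ are hypothetical and not part of PUBS.

```shell
#!/bin/sh
# Show what changed between a pristine dump ($1) and the edited file ($2).
# Exit status: 0 if the files differ, 1 if they are identical, 2 on error.
# NOTE: show_cfg_changes is an illustrative helper, not a PUBS command.
show_cfg_changes() {
    status=0
    # diff exits 0 when identical, 1 when different, 2 on trouble.
    diff -u "$1" "$2" || status=$?
    case $status in
        0) echo "no changes between $1 and $2" >&2; return 1 ;;
        1) return 0 ;;
        *) return 2 ;;
    esac
}

# Intended use (cfg_dump_project is the command from step 0):
#   cfg_dump_project current.cfg
#   cp current.cfg current.cfg.orig
#   ... edit current.cfg ...
#   show_cfg_changes current.cfg.orig current.cfg
```

Only the BYPASS value and the enable flag of the two projects named above should show up in the diff. Remember to discard the backup copy together with current.cfg.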

*How to confirm the effect is in place?*
Confirm b) took place on the GUI (check that the "Binary Transfer [Near1]" project color became gray).
Then take a look at a log file:
<pre>tail -n3000 $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2dropbox_evb.log</pre>
This log file usually shows lines like this:
<pre>[ INFO ] transfer (L: 147) >> {transfer_file} Start transfer_file @ 2016-01-21 07:02:03
...
[ INFO ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00708.ubdaq @ 2016-01-21 07:08:46
[ INFO ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00707.ubdaq @ 2016-01-21 07:08:47
[ INFO ] transfer (L: 275) >> {process_files} Waiting for 6/100 process to finish...
[ INFO ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00706.ubdaq @ 2016-01-21 07:09:19
[ INFO ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00705.ubdaq @ 2016-01-21 07:09:19
[ INFO ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00704.ubdaq @ 2016-01-21 07:09:19
[ INFO ] transfer (L: 240) >> {transfer_file} Finished copy (4541, 31) @ 2016-01-21 07:09:41
[ INFO ] transfer (L: 240) >> {transfer_file} Finished copy (4541, 30) @ 2016-01-21 07:09:41
[ INFO ] transfer (L: 240) >> {transfer_file} Finished copy (4541, 29) @ 2016-01-21 07:09:41
...
[ INFO ] transfer (L: 245) >> {transfer_file} All finished @ 2016-01-21 07:09:42</pre>
However, with a) in place, you should see lines like this:
<pre>[ INFO ] transfer (L: 147) >> {transfer_file} Start transfer_file @ 2016-01-21 07:21:19
[ INFO ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4625, subrun=0 ...
[ INFO ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4513, subrun=114 ...
...
[ INFO ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4511, subrun=2404 ...
[ INFO ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4511, subrun=2403 ...
[ INFO ] transfer (L: 245) >> {transfer_file} All finished @ 2016-01-21 07:21:31</pre>
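
Rather than reading the log by eye, one could count the bypass messages directly. This is a minimal sketch based only on the message text shown above; the helper name @count_bypass_lines@ is hypothetical.

```shell
#!/bin/sh
# Count "Configured to bypass transfer" messages in the last 3000 lines of log $1.
# NOTE: count_bypass_lines is an illustrative helper, not a PUBS command.
count_bypass_lines() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 even when the count is zero.
    tail -n 3000 "$1" | grep -c 'Configured to bypass transfer' || true
}

# Example:
# count_bypass_lines $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2dropbox_evb.log
```

A count of zero suggests the BYPASS change has not taken effect yet (or that no files were processed in the inspected window).
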
You may also check another log file:
<pre>tail -n3000 $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2near1_near1.log</pre>
which usually looks like this:
<pre>[ INFO ] mv_assembler_daq_files (L: 94 ) >> {process_newruns} Starting a parallel (5) transfer process for 50 runs...
[ INFO ] mv_assembler_daq_files (L: 176) >> {process_newruns} Finished all @ 2016-01-21 07:09:50</pre>
However, with a) in place, this project starts draining files from evb to near1, and you should see a log like this:
<pre>[ INFO ] mv_assembler_daq_files (L: 94 ) >> {process_newruns} Starting a parallel (5) transfer process for 50 runs...
[ INFO ] mv_assembler_daq_files (L: 128) >> {process_newruns} processing new run: run=4618, subrun=1245 ...
[ INFO ] mv_assembler_daq_files (L: 128) >> {process_newruns} processing new run: run=4537, subrun=603 ...
...
[ INFO ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_21_2_53_56-0004618-01245.ubdaq @ 2016-01-21 07:10:03
[ INFO ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00603.ubdaq @ 2016-01-21 07:10:03
[ INFO ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00602.ubdaq @ 2016-01-21 07:10:03
[ INFO ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00601.ubdaq @ 2016-01-21 07:10:03
[ INFO ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00600.ubdaq @ 2016-01-21 07:10:03
[ INFO ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00599.ubdaq @ 2016-01-21 07:10:03
[ INFO ] mv_assembler_daq_files (L: 215) >> {process_files} Waiting for 6/50 process to finish...
...
[ INFO ] mv_assembler_daq_files (L: 176) >> {process_newruns} Finished all @ 2016-01-21 07:14:14
[ INFO ] mv_assembler_daq_files (L: 321) >> {validate} validated run: run=4618, subrun=1245 ...
[ INFO ] mv_assembler_daq_files (L: 321) >> {validate} validated run: run=4537, subrun=603 ...
[ INFO ] mv_assembler_daq_files (L: 321) >> {validate} validated run: run=4537, subrun=602 ...
...</pre>
as expected.
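
To see which files have already been drained to near1, one could pull the (run, subrun) pairs out of the "validated run" lines. This is a minimal sketch that assumes the log line format shown above; the helper name @list_validated@ is hypothetical.

```shell
#!/bin/sh
# Print "<run> <subrun>" for every "validated run" line in log file $1.
# NOTE: list_validated is an illustrative helper, not a PUBS command.
list_validated() {
    # -n with the p flag prints only lines where the substitution matched.
    sed -n 's/.*{validate} validated run: run=\([0-9]*\), subrun=\([0-9]*\).*/\1 \2/p' "$1"
}

# Example:
# list_validated $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2near1_near1.log
```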

*IMPORTANT*
Make sure to discard current.cfg to avoid confusing others and yourself in the future.

h3. *%{color:blue}Second action%*: at the end of downtime

You basically have to revert what you have done.

a) Change the RESOURCE parameter "BYPASS" value from "True" to "False" for the evb=>dropbox transfer project
b) Enable the near1=>dropbox transfer project

NOTE: You cannot necessarily validate that you have done this correctly by expecting reversed behavior in the logs as described above. The prod_transfer_binary_evb2near1_near1.log behavior will not change until the backlog of files has been copied from evb. The point is, even though you have just undone the BYPASS change, all the files previously marked but not yet transferred will still be transferred in the BYPASSed manner.

Refer to the previous sub-section for how to do this and how to validate your action.
Remember to discard current.cfg.