DM - Expert Documentation » History » Version 36

Afroditi Papadopoulou, 07/20/2018 01:18 PM


DM - Expert Documentation

Documentation!

... is finally initiated by David Caratelli :)
https://www.overleaf.com/3384459hxbyns#/9541217/

For experts who want more detail, the PUBS base framework documentation is in DocDB 5400.

Keep it updated! Also attach the latest version to this wiki.

Starting the PUBS daemon

The daemons should be restarted every Monday, Wednesday, Friday, and Sunday.

Details are on this page: Starting the PUBS online daemon

Moving all projects to a single online machine

Details are on this page: Running all PUBS projects on single server

Building up the PUBS online testbed

Details are on this page: Building up the PUBS online testbed

Mapping project name to names on GUI.

How do I find the project name (database table name) given the name of a specific box on the monitoring GUI? See: Project GUI Map

Querying DB for errors. DB Query

Project Debugging Home-Page (list of projects and debug info for each one). Project Debug

Changing the Database Configuration for Online PUBS: Online PUBS Database Reconfig

Correcting Errors in Metadata Generation From Incomplete Files

This is a problem with incomplete files that contain less than one event. Detailed instructions here: Correcting Failed Metadata Generation

Correcting Errors Registering File Metadata and crontab entries for kerberos tickets and grid proxies

When there are SSL problems registering file metadata into the SAM database or missing crontab entries: Correcting Failed Metadata Registration

Correcting Failed Near1 Binary Transfers

If files transferred from EVB to Near1 fail to transfer to the FTS dropbox, errors will appear in the Near1 Binary Transfer box on PUBS. Detailed instructions here: Correcting Failed Near1 Binary Transfer

Expired Certificate on Near1

https://cdcvs.fnal.gov/redmine/projects/uboonecode/wiki/CSR

Running out of Disk Space on ubdaq-prod-evb ?

Useful info: there are ~33 TB of disk space in /data/ on the evb machine. PUBS will try to clear data in /data/uboonedaq/TestRuns/ until the disk usage reaches 40% or /data/uboonedaq/TestRuns/ is empty.

If this is the case, there are several things one should do:
0) Identify who is using up the disk space. Options:
--> a) /data/uboonedaq/rawdata/: this is where data from "official" runs goes. Files here are seen (and should eventually be removed) by PUBS.
--> b) /data/uboonedaq/TestRuns/: this is disk space DAQ people use to test things. It is not seen by PUBS and must be removed by hand in order to be cleared.
--> c) /data/uboonedaq/lukhanin/: test space for Gennadiy. It also needs to be removed manually in order to free up space.
--> d) /data/OTHER/: data used by someone else.
If most of the space is not being used by /data/uboonedaq/rawdata/, we need to free space manually. If it is urgent to free up space (i.e. data-taking should not be interrupted and the disk will fill up rather soon), you are authorized to clear /data/uboonedaq/TestRuns/. Contact any other person who is using a considerable amount of space and ask them to quickly remove the contents of their /data/ folder.
If /data/uboonedaq/rawdata/ is using up a significant amount of space, the problem is probably PUBS' fault.
1) Identify the cause of the problem. Why is disk space not being freed? Possible causes:
--> a) clear_binary_evb is having issues.
--> b) clear_binary_evb does not find any new files to clear. This indicates a possible problem with one of the projects that clear_binary_evb depends on. A possible cause could be poor network speed when draining data out of the evb machine.
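
Step 0) above (finding who is using the disk) can be sketched with standard tools. This is a minimal sketch, not part of PUBS: the function name dir_usage is hypothetical, and it only relies on du and sort. It lists each immediate subdirectory of a directory with its size in kilobytes, largest first.

```shell
# Hypothetical helper (not part of PUBS): print the size of each immediate
# subdirectory of a directory, largest first.
# Sizes are in kilobytes (du -k) so that the numeric sort is unambiguous.
dir_usage() {
    top="$1"
    # -k: report kilobytes; -s: one summary line per argument.
    # Unreadable subdirectories are silently skipped via 2>/dev/null.
    du -ks "$top"/*/ 2>/dev/null | sort -rn
}

# Example invocations on the evb machine (paths taken from the list above):
#   dir_usage /data/uboonedaq
#   dir_usage /data
```

Running it first on /data and then on /data/uboonedaq should show whether rawdata, TestRuns, or another folder dominates.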

Questions? Ask Kirby

Running out of Disk Space on /datalocal/ @ near1 ?

If the disk usage @ /datalocal/ is above 95%, stop the "mv_binary_evb" project as an immediate action. Notify the PUBS team that you just did this, then start addressing the disk-space issue.
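
The 95% check can be done by eye with df, or scripted. The following is a minimal sketch with hypothetical function names (usage_percent, over_threshold); it only reads the filesystem usage and does not stop any PUBS project.

```shell
# Hypothetical helpers (not part of PUBS).

# Print the usage percentage (integer, no % sign) of the filesystem holding $1.
usage_percent() {
    # -P forces the POSIX one-line-per-filesystem format; field 5 is the
    # used-capacity percentage on the second output line.
    df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage is strictly above the threshold (default 95).
over_threshold() {
    pct=$(usage_percent "$1") || return 2
    [ "$pct" -gt "${2:-95}" ]
}

# Example on near1:
#   over_threshold /datalocal 95 && echo "above 95%: stop mv_binary_evb"
```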

What to do if dcache/enstore go down (no access to pnfs area)

This means we should not attempt to transfer binary files to the dcache area.
Two actions are needed: one before the downtime begins and one at its end.

NOTE: There are two projects registered in PUBS as "prod_transfer_binary_evb2dropbox_XXX". One has XXX= "evb" while the other has XXX="near1". Do NOT enable the project "prod_transfer_binary_evb2dropbox_near1".

First action: ~30 minutes before the scheduled start of the downtime

This requires 2 operations that can be done in 1 step.

a) Change the RESOURCE parameter "BYPASS" value from "False" to "True" for evb=>dropbox transfer project

Under

PROJECT_BEGIN
NAME prod_transfer_binary_evb2dropbox_evb

b) Disable near1=>dropbox transfer project

How does this work? Step a) changes the destination of the file transfer from the dropbox to near1.
This way a file produced at evb by the DAQ is moved to near1 disk space, which keeps the evb area available for more data taking.
On the other hand, since dcache is unavailable, we want to disable the project that constantly cleans up the near1 area by draining files into dcache.
That is what step b) does.

As of the date of this writing, relevant project names for a) and b) are:
Project name for a) ... prod_transfer_binary_evb2dropbox_evb
Project name for b) ... prod_transfer_binary_near12dropbox_near1

How can we do this in 1 step?
0) Log into either evb or near1 as uboonepro, then

source $HOME/pubs/config/setup_uboonepro_online.sh 
cfg_dump_project current_${USER}.cfg

1) Edit project configuration. The easiest way is to dump the currently running configuration, alter, and upload.

alias vi="emacs -nw" 
vi current_${USER}.cfg

2) Upload project configuration

$PUB_TOP_DIR/sbin/register_project current_${USER}.cfg
At the command prompt, type "y" to confirm the modification.
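
Because the dumped file contains every project, it can help to review exactly what was changed before uploading. The sketch below is an optional extra and not part of the PUBS tools: backup_cfg and show_cfg_changes are hypothetical helper names, and only cp and diff are used. Take the copy right after cfg_dump_project, edit, then inspect the diff before running register_project.

```shell
# Hypothetical helpers (not part of PUBS).

# Keep a pristine copy of the dumped configuration next to the file to edit.
backup_cfg() {
    cp -p "$1" "$1.orig"
}

# Show what changed between the pristine copy and the edited file.
# diff exits with status 1 when the files differ, which is the expected case.
show_cfg_changes() {
    diff -u "$1.orig" "$1"
}

# Example, after sourcing the setup script and dumping the configuration:
#   backup_cfg current_${USER}.cfg
#   vi current_${USER}.cfg
#   show_cfg_changes current_${USER}.cfg
```

The diff should contain only the BYPASS change for prod_transfer_binary_evb2dropbox_evb and the disabling of the near1=>dropbox project. Remember to discard the .orig copy together with the edited file.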

How to confirm the effect is in place?
Confirm b) took place on the GUI (check that the "Binary Transfer [Near1]" project color became gray).
Then take a look at a log file:

tail -n3000 $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2dropbox_evb.log

This log file usually shows lines like this:
[ INFO    ] transfer (L: 147) >> {transfer_file} Start transfer_file @ 2016-01-21 07:02:03
...
[ INFO    ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00708.ubdaq @ 2016-01-21 07:08:46
[ INFO    ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00707.ubdaq @ 2016-01-21 07:08:47
[ INFO    ] transfer (L: 275) >> {process_files} Waiting for 6/100 process to finish...
[ INFO    ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00706.ubdaq @ 2016-01-21 07:09:19
[ INFO    ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00705.ubdaq @ 2016-01-21 07:09:19
[ INFO    ] transfer (L: 256) >> {process_files} Transferring /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00704.ubdaq @ 2016-01-21 07:09:19
[ INFO    ] transfer (L: 240) >> {transfer_file} Finished copy (4541, 31) @ 2016-01-21 07:09:41
[ INFO    ] transfer (L: 240) >> {transfer_file} Finished copy (4541, 30) @ 2016-01-21 07:09:41
[ INFO    ] transfer (L: 240) >> {transfer_file} Finished copy (4541, 29) @ 2016-01-21 07:09:41
...
[ INFO    ] transfer (L: 245) >> {transfer_file} All finished @ 2016-01-21 07:09:42

However with a) in place you should see lines like this:
[ INFO    ] transfer (L: 147) >> {transfer_file} Start transfer_file @ 2016-01-21 07:21:19
[ INFO    ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4625, subrun=0 ...
[ INFO    ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4513, subrun=114 ...
...
[ INFO    ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4511, subrun=2404 ...
[ INFO    ] transfer (L: 176) >> {transfer_file} Configured to bypass transfer: run=4511, subrun=2403 ...
[ INFO    ] transfer (L: 245) >> {transfer_file} All finished @ 2016-01-21 07:21:31

Also you may check another log file:
tail -n3000 $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2near1_near1.log

which usually looks like this:
[ INFO    ] mv_assembler_daq_files (L: 94 ) >> {process_newruns} Starting a parallel (5) transfer process for 50 runs...
[ INFO    ] mv_assembler_daq_files (L: 176) >> {process_newruns} Finished all @ 2016-01-21 07:09:50

however with a) in place this project starts draining files from evb to near1, and you should see a log like this:
[ INFO    ] mv_assembler_daq_files (L: 94 ) >> {process_newruns} Starting a parallel (5) transfer process for 50 runs...
[ INFO    ] mv_assembler_daq_files (L: 128) >> {process_newruns} processing new run: run=4618, subrun=1245 ...
[ INFO    ] mv_assembler_daq_files (L: 128) >> {process_newruns} processing new run: run=4537, subrun=603 ...
...
[ INFO    ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_21_2_53_56-0004618-01245.ubdaq @ 2016-01-21 07:10:03
[ INFO    ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00603.ubdaq @ 2016-01-21 07:10:03
[ INFO    ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00602.ubdaq @ 2016-01-21 07:10:03
[ INFO    ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00601.ubdaq @ 2016-01-21 07:10:03
[ INFO    ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00600.ubdaq @ 2016-01-21 07:10:03
[ INFO    ] mv_assembler_daq_files (L: 196) >> {process_files} Copying /data/uboonedaq/rawdata/PhysicsRun-2016_1_16_19_8_22-0004537-00599.ubdaq @ 2016-01-21 07:10:03
[ INFO    ] mv_assembler_daq_files (L: 215) >> {process_files} Waiting for 6/50 process to finish...
...
[ INFO    ] mv_assembler_daq_files (L: 176) >> {process_newruns} Finished all @ 2016-01-21 07:14:14
[ INFO    ] mv_assembler_daq_files (L: 321) >> {validate} validated run: run=4618, subrun=1245 ...
[ INFO    ] mv_assembler_daq_files (L: 321) >> {validate} validated run: run=4537, subrun=603 ...
[ INFO    ] mv_assembler_daq_files (L: 321) >> {validate} validated run: run=4537, subrun=602 ...
...

as expected.
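
Instead of reading the tail of each log by eye, the two signatures quoted above can be counted. This is a minimal sketch using a hypothetical helper name (count_recent); the search strings are copied from the log excerpts above, and the 3000-line window matches the tail commands used earlier.

```shell
# Hypothetical helper (not part of PUBS): count lines containing a fixed
# string among the last N lines (default 3000) of a log file.
count_recent() {
    log="$1"
    pattern="$2"
    lines="${3:-3000}"
    # -c: print the count; -F: treat the pattern as a literal string.
    # grep exits non-zero on zero matches but still prints 0, hence "|| true".
    tail -n "$lines" "$log" | grep -cF -- "$pattern" || true
}

# Examples (log paths as given above):
#   count_recent $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2dropbox_evb.log "Configured to bypass transfer"
#   count_recent $PUB_TOP_DIR/log/ubdaq-prod-near1.fnal.gov/prod_transfer_binary_evb2near1_near1.log "{process_files} Copying"
```

With a) in place, both counts should be non-zero once the projects have run at least once after the configuration upload.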

IMPORTANT
Make sure to discard current_${USER}.cfg to avoid confusing others and yourself in the future.

Second action: at the end of downtime

You basically have to revert what you have done.

a) Change the RESOURCE parameter "BYPASS" value from "True" to "False" for evb=>dropbox transfer project
b) Enable near1=>dropbox transfer project

NOTE: You cannot necessarily validate that you have done this correctly by expecting the reverse of the log behavior described above. The prod_transfer_binary_evb2near1_near1.log behavior will not change until the backlog of files has been copied from evb. Even though you have just undone the BYPASS change, all files previously marked but not yet transferred will still be transferred in the BYPASSed manner.

Refer to the previous sub-section for how to do this and how to validate your action.
Remember to discard current_${USER}.cfg.

Running out of Disk Space on sebXX? uB_DataMgmt_PCXX_seb06_data/disk_occ

This is a supernova stream related issue. The supernova PUBS projects are located on ws02. Please restart the daemon on ws02. The error will be cleared after ~15 min.

Further notes: this particular error can happen when one of the PUBS projects for the SNS (supernova stream) gets stuck. We are using a cpulimiter to keep the load down on the ws02 machine. Sometimes the cpulimiter can hang one of the projects and cause new incoming SNS file registration to halt. Restarting the daemon will kill and refresh the PUBS projects.

A collaborator has asked me, the DM expert, to prevent the deletion of one or more SN runs.

To prevent the deletion of one or more runs in the SN stream, log in as uboonepro. Head over to the SN PUBS script directory, located at /home/uboonepro/pubs/dstream_online/snova. Here you will find "frozen_runs.txt". In this file, insert the run numbers, one per line. The monitoring script reads this ASCII text file and prevents the deletion of files belonging to the runs listed in it.
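
Adding runs by hand in an editor is fine. If several runs need to be frozen, a small helper can avoid duplicates and typos. This is a minimal sketch: freeze_run is a hypothetical name, and it assumes only what is stated above, namely that the file holds one run number per line. Whether the monitoring script tolerates anything else (comments, leading zeros) is not known here, so the helper accepts plain digits only.

```shell
# Hypothetical helper (not part of PUBS): append a run number to the
# frozen-runs file unless it is already listed.
freeze_run() {
    file="$1"
    run="$2"
    # Accept only plain non-negative integers.
    case "$run" in
        ''|*[!0-9]*) echo "not a run number: $run" >&2; return 1 ;;
    esac
    # -x: match the whole line; -F: literal string; -q: no output.
    if ! grep -qxF -- "$run" "$file" 2>/dev/null; then
        printf '%s\n' "$run" >> "$file"
    fi
}

# Example, as uboonepro:
#   freeze_run /home/uboonepro/pubs/dstream_online/snova/frozen_runs.txt 4537
```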

The daemon on ws02 has mysteriously died.

DM experts are currently debugging an issue in which the daemon on ws02 is killed by the kernel. If you are a DM expert on shift and you find that the ws02 daemon has mysteriously died, please execute the following command to copy the log files to a safe location, then restart the daemon.

DEST="/data/uboonepro/ws02_daemon_failures/$(date +%D)"; mkdir -p "$DEST" && cp /home/uboonepro/pubs/log/ubdaq-prod-ws02.fnal.gov/* "$DEST"/