DM - Expert Documentation » History » Version 48

Lu Ren, 08/21/2018 11:48 PM


DM - Expert Documentation

Documentation!

... is finally initiated by David Caratelli :)
https://www.overleaf.com/3384459hxbyns#/9541217/

For super-duper experts who are interested, the PUBS base framework documentation is on DocDB 5400.

Keep updated! Also attach the latest version to this Wiki.

Starting the PUBS daemon

The daemons should be restarted every Monday, Wednesday, Friday, and Sunday.

Details are on this page: Starting the PUBS online daemon

Moving all projects to a single online machine

Details are on this page: Running all PUBS projects on single server

Building up the PUBS online testbed

Details are on this page: Building up the PUBS online testbed

Mapping project name to names on GUI.

How do I find the project name (database table name) given the name of a specific box on the monitoring GUI? See: Project GUI Map

Querying the DB for errors: DB Query

Project debugging home page (list of projects and debug info for each one): Project Debug

Changing the database configuration for online PUBS: Online PUBS Database Reconfig

Correcting Errors in PUBS

This is a problem with incomplete files that have fewer than one event.

If files transferred from EVB to Near1 fail to transfer to the FTS dropbox, errors will appear in the Near1 Binary Transfer box on PUBS.

When there are SSL problems registering file metadata into the SAM database or missing crontab entries.

Expired Certificate on Near1

https://cdcvs.fnal.gov/redmine/projects/uboonecode/wiki/CSR

Running out of Disk Space on ubdaq-prod-evb ?

Useful info: there are ~33 TB of disk space in /data/ on the evb machine. PUBS will try to clear data in /data/uboonedaq/TestRuns/ until disk usage reaches 40%.

If this is the case, there are several things one should do:
0) Identify who is using up the disk space. Options:
--> a) /data/uboonedaq/rawdata/ > this is where data from "official" runs goes. Files here are seen (and should eventually be removed) by PUBS.
--> b) /data/uboonedaq/TestRuns/ > this is disk space DAQ people use to test things. It is not seen by PUBS and must be removed by hand in order to be cleared.
--> c) /data/uboonedaq/lukhanin/ > test space for Gennadiy. It also needs to be removed manually in order to free up space.
--> d) /data/OTHER/ > data used by someone else.
If most of the space is not being used by /data/uboonedaq/rawdata/, we need to free space manually. If it is urgent to free up space (i.e. data-taking should not be interrupted and the disk will fill up soon), you are authorized to clear /data/uboonedaq/TestRuns/. Contact anyone else who is using a considerable amount of space and ask them to quickly remove contents from their /data/ folder.
If /data/uboonedaq/rawdata/ is using up a significant amount of space, the problem is probably PUBS' fault.
1) Identify the cause of the problem. Why is disk space not being freed? Possible causes:
--> a) clear_binary_evb is having issues.
--> b) clear_binary_evb does not find any new files to clear. This indicates a possible problem with one of the projects that clear_binary_evb depends on. A possible cause could be poor network speed when draining data out of the evb machine.
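
Step 0 above (finding out who is using the space) can be sketched roughly as follows. This is only an illustration, assuming a POSIX shell with df, du and sort available on the evb machine; the function name report_usage is hypothetical and not part of PUBS. The directory paths are the ones listed above.

```shell
# Sketch only: report_usage is a hypothetical helper, not a PUBS tool.
# Print the size in KB of each existing directory given as an argument,
# largest first. Paths that do not exist are silently skipped.
report_usage() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            du -sk "$d" 2>/dev/null
        fi
    done | sort -rn
}

# Overall fill level of the /data/ partition.
df -h /data/ 2>/dev/null || echo "/data/ not found on this machine"

# Per-directory totals for the usual suspects named on this page.
report_usage /data/uboonedaq/rawdata/ /data/uboonedaq/TestRuns/ /data/uboonedaq/lukhanin/
```

If rawdata tops the list, continue with step 1; otherwise the space has to be freed by hand as described above.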

Questions? Ask Kirby

Running out of Disk Space on /datalocal/ @ near1 ?

If the disk usage on /datalocal/ is above 95%, stop the "mv_binary_evb" project as an immediate action. Notify the PUBS team that you just did this, then start addressing the disk-space issue.
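
A minimal sketch of the 95% check, assuming a POSIX shell with df and awk on near1. The function names usage_percent and over_threshold are hypothetical and not part of PUBS. The sketch only reports the condition; stopping the project and notifying the team are still done by hand.

```shell
# Sketch only: both helpers are hypothetical, not PUBS tools.
# Print the usage percentage (integer, without the % sign) of the
# filesystem that holds the path $1. Relies on the fixed column layout
# of "df -P", where field 5 of the second line is the capacity.
usage_percent() {
    df -P "$1" | awk 'NR==2 { sub("%", "", $5); print $5 }'
}

# Succeed when the usage of $1 is strictly above the threshold $2.
over_threshold() {
    [ "$(usage_percent "$1")" -gt "$2" ]
}

if [ -d /datalocal/ ] && over_threshold /datalocal/ 95; then
    echo "/datalocal/ is above 95%: stop mv_binary_evb and notify the PUBS team"
fi
```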

What to do if dCache/enstore go down (no access to pnfs area)

Running out of Disk Space on sebXX? uB_DataMgmt_PCXX_seb06_data/disk_occ

This is a supernova-stream-related issue. The supernova PUBS projects are located on ws02. Please restart the daemon on ws02. The error will clear after ~15 min.

Further notes: this particular error can happen when one of the PUBS projects for the SNS (supernova stream) gets stuck. We use a cpulimiter to keep the load down on the ws02 machine. Sometimes the cpulimiter can hang one of the projects, which halts registration of new incoming SNS files. Restarting the daemon will kill and refresh the PUBS projects.

A collaborator has asked me, the DM expert, to prevent the deletion of one or more SN runs.

To prevent the deletion of one or more runs in the SN stream, log in as uboonepro. Head over to the SN PUBS script directory, located at /home/uboonepro/pubs/dstream_online/snova. There you will find "frozen_runs.txt". In this file, insert newline-separated run numbers. The monitoring script reads this ASCII text file and prevents the deletion of files for the runs listed in it.
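
Editing frozen_runs.txt by hand works fine. As an illustration of the "one run number per line" format, here is a small sketch, assuming a POSIX shell; the helper name freeze_runs and the run numbers in the usage comment are hypothetical. It appends run numbers while skipping duplicates and anything that is not a plain integer.

```shell
# Sketch only: freeze_runs is a hypothetical helper, not a PUBS tool.
# Usage: freeze_runs <list-file> <run> [<run> ...]
freeze_runs() {
    list="$1"; shift
    touch "$list"
    # If the file is non-empty and lacks a trailing newline, add one so the
    # next run number starts on its own line.
    if [ -s "$list" ] && [ -n "$(tail -c 1 "$list")" ]; then
        echo >> "$list"
    fi
    for run in "$@"; do
        case "$run" in
            ''|*[!0-9]*)
                echo "skipping non-numeric run: $run" >&2
                continue
                ;;
        esac
        # Append only if this exact run number is not already listed.
        grep -qxF "$run" "$list" || echo "$run" >> "$list"
    done
}

# Example, run as uboonepro (the run numbers are made up):
# freeze_runs /home/uboonepro/pubs/dstream_online/snova/frozen_runs.txt 12345 12346
```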

The daemon on ws02 has mysteriously died.

DM experts are currently debugging an issue in which the daemon on ws02 is killed by the kernel. If you are a DM expert on shift and you find that the ws02 daemon has mysteriously died, please execute the following command to copy the log files to a safe location, then restart the daemon.

mkdir -p /data/uboonepro/ws02_daemon_failures/`date +%D`; cp /home/uboonepro/pubs/log/ubdaq-prod-ws02.fnal.gov/* /data/uboonepro/ws02_daemon_failures/`date +%D`/
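
Note that `date +%D` prints mm/dd/yy, so the command above creates nested month/day/year directories, and it evaluates the date twice. A possible variant, sketched here under the assumption of a POSIX shell, captures the date once so that mkdir and cp always target the same directory, while keeping the same layout. The function name backup_logs is hypothetical.

```shell
# Sketch only: backup_logs is a hypothetical helper, not a PUBS tool.
# Usage: backup_logs <log-dir> <destination-root>
backup_logs() {
    src="$1"
    dest_root="$2"
    # Evaluate the date once. %D gives mm/dd/yy, so the destination is a
    # nested mm/dd/yy directory, matching the one-line command above.
    stamp=$(date +%D)
    dest="$dest_root/$stamp"
    mkdir -p "$dest" && cp "$src"/* "$dest"/
}

# Example with the paths from this page:
# backup_logs /home/uboonepro/pubs/log/ubdaq-prod-ws02.fnal.gov /data/uboonepro/ws02_daemon_failures
```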