Correcting Failed Metadata Generation

These instructions cover errors generating metadata for DAQ files that contain only a fragment of an event. This happens occasionally when the DAQ fails on the first event of a subrun. These files are not usable and can be discarded without concern. These instructions show you how to verify that this is the problem and how to remove the file both from /data/uboonedaq/rawdata/ and from the PUBS GUI.

Note that this error is different from the error with registering properly generated metadata that is covered here: https://cdcvs.fnal.gov/redmine/projects/uboone-operations/wiki/Notes_for_documentation

Launch the PUBS GUI

If you aren't in the control room, first launch the PUBS GUI so that you can monitor the errors being corrected and see the number of queued files drop by the number of error states you correct.

[uboonepro@ubdaq-prod-evb ~]$ cd pubs
[uboonepro@ubdaq-prod-evb pubs]$ source config/setup_uboonepro_online.sh
Setting up PUBS for uboonepro account...
Setting up PUBS for ubdaq-prod machines...
[uboonepro@ubdaq-prod-evb pubs]$ python pub_mongui/mongui.py &

Make sure to click off the "Use Relative Counters" option in the top left. The "Binary Metadata" box should have one or more files in the error state. Those are the files you're going after.

Track down the error using PSQL or the PUBS logs

The easiest way to do this is to query the database on the smc machine for subruns with status 120. Log into ubdaq-prod-smc, set up PUBS, and then connect to the procdb database:

ssh uboonepro@ubdaq-prod-ws01.fnal.gov
ssh uboonepro@ubdaq-prod-smc.fnal.gov
cd pubs
source config/setup_uboonepro_online.sh
psql -d procdb

Then you need to log in with the uboonepro password, which can be found in ~uboonepro/.sql_access/uboonepro_prod_conf.sh.
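If you need to look the password up, viewing that file from the uboonepro account should be enough (this assumes it is a readable shell-style config; the variable holding the password may be named differently):

cat ~uboonepro/.sql_access/uboonepro_prod_conf.sh

Once logged in, you can issue the query below to get the list of subruns with failed metadata generation.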

procdb=> select * from prod_binary_metadata_near1 where status =120;
  run  | subrun | seq | projectver | status | data
-------+--------+-----+------------+--------+------
 11218 |      9 |   0 |          0 |    120 |
 11284 |     16 |   0 |          0 |    120 |
 11334 |     14 |   0 |          0 |    120 |
 11338 |     20 |   0 |          0 |    120 |
 11349 |     26 |   0 |          0 |    120 |
(5 rows)
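If you just want to cross-check against the GUI, a count of the same table should match the number of files shown in the error state (here, the five subruns listed above):

procdb=> select count(*) from prod_binary_metadata_near1 where status = 120;
 count
-------
     5
(1 row)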

Alternatively, you can log into the uboonepro account on either ubdaq-prod-evb or ubdaq-prod-near1 and go to the log area:

[uboonepro@ubdaq-prod-evb ~]$ cd ~/pubs/log/ubdaq-prod-near1.fnal.gov/
[uboonepro@ubdaq-prod-evb ubdaq-prod-near1.fnal.gov]$ emacs prod_binary_metadata_near1.log &

Now look for ERROR in the log file. You should see something like this:

[ ERROR   ] get_metadata_no_hang (L: 582) >> {process_ubdaq_files} End GPS timestamp not found in cout...
[ ERROR   ] get_metadata_no_hang (L: 585) >> {process_ubdaq_files} End NTP timestamp not found in cout...
[ ERROR   ] get_metadata_no_hang (L: 601) >> {process_ubdaq_files} Found invalid format in decoded data...
[ ERROR   ] get_metadata_no_hang (L: 602) >> {process_ubdaq_files} 1st event:

Bad trailer. Proceeding to try to find your chosen event, nevertheless.
eventRecord.size is 528
Object gov::fnal::uboone::datatypes::ub_GlobalHeader const*.
 Software Info:
  daq_version_label=v6_21_09
  daq_version_qualifiers=
 Event Info:
  run_number=10551
  subrun_number=262
  event_number=13100

Make sure to write down the run and subrun numbers (run_number=10551 and subrun_number=262 in the example above). Look for all errors like this in the file. The logs should be reset every week, so the errors shouldn't be too hard to find. You could also use grep, but then you don't get the date of when the error happened from the adjacent lines.
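If you do use grep, adding a few lines of leading context keeps the nearby timestamp lines visible (five lines of context here is just a guess at how far back the timestamp appears):

grep -B 5 ERROR prod_binary_metadata_near1.log | less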

Resetting and deleting the PUBS status for those files

First, you should temporarily stop the PUBS daemons on EVB and NEAR1.
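Before continuing, you can verify on both machines that nothing PUBS-related is still running (a rough check that assumes the daemons show up as python processes owned by uboonepro; the actual process names may differ):

ps aux | grep -i pubs | grep -v grep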

You will now run a script that clears the PUBS values for these worthless files so that they are no longer in the error state in Binary Metadata and no longer queued in the other projects. The exception is the "Binary Data Deletion" project, which is set to status "1" so that the file is deleted from /data/uboonedaq/rawdata/.

PLEASE NOTE THIS WILL DELETE THE ONLY COPY OF THE RAW DATA FILE YOU ARE TRYING TO FIX! It is not uncommon to have a file that is useless; just make sure you know that deletion is one of the consequences. A file containing only one event that is incomplete (e.g. missing the PMT readout, missing one SEB, etc.) is useless.
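If you want to double-check which file is about to be removed, you can list it by run number first (this assumes the run number appears in the raw file name; adjust the pattern to the actual naming convention):

ls -lh /data/uboonedaq/rawdata/ | grep 9102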

cd ~uboonepro/pubs/dstream_online/
#The command below will set the RUN and SUBRUN status to completed for all projects and delete that file from the DAQ.
./fix_failed_metadata_gen_files.sh 9102 0 #This is the RUN and SUBRUN numbers of the file you are fixing

Do this for all of the run/subruns that you found in the prod_binary_metadata_near1.log file. The errors in Binary Metadata should clear, and the number of queued files should drop by the number of times you ran the fix_failed_metadata_gen_files.sh script.
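You can also confirm from the database side that nothing is left in the failed state by rerunning the same query as before on ubdaq-prod-smc; it should return no rows:

procdb=> select * from prod_binary_metadata_near1 where status = 120;
  run  | subrun | seq | projectver | status | data
-------+--------+-----+------------+--------+------
(0 rows)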

Another problem that has shown up is that some of the runs in the list of failed metadata generation subruns (listed with the command "select * from prod_binary_metadata_near1 where status > 1;") had status 14 (all of the known statuses can be found in /home/uboonepro/pubs/dstream_online/ds_online_constants.py). Again, the following commands

cd ~uboonepro/pubs/dstream_online/
./fix_failed_metadata_gen_files.sh 9102 0 #This is the RUN and SUBRUN numbers of the file you are fixing

can be used to fix them.
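If you are unsure what a particular status code means, a quick (if approximate) way to look it up is to grep the constants file mentioned above; the constants defined in that file are the authoritative list:

grep -n 14 /home/uboonepro/pubs/dstream_online/ds_online_constants.py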