Correcting Failed Near1 Binary Transfer
These instructions are intended to help experts determine whether an error in the Near1 Binary Transfer box of the PUBS GUI reflects a legitimate failure to archive data files, or whether the status of files mislabeled due to "hiccups" simply needs to be corrected. I say hiccups because at this time it is uncertain what is causing files to be marked as being in the error state.
Description of the Problem
In the PUBS workflow, the EVB Binary Transfer moves each subrun file from the RAID array ubdaq-prod-evb:/data to the FTS dropbox on the uboonegpvm01:/pnfs/uboone/scratch volume. If that transfer reports an error, the subrun file is instead transferred to the RAID array ubdaq-prod-near1:/datalocal/. The /datalocal volume is designed as a staging area in case the network, uboonegpvm01, or any other part of the connection to the dCache scratch pool is down; files placed there are staged to the dropbox at a later point.
Occasionally, however, the binary file transfer succeeds in sending the file to the dropbox but still times out or returns an error, which sets up a race condition. Because an error is returned from the transfer, the file is also transferred to /datalocal/. When the transfer from /datalocal to the dropbox is later attempted, it fails, because files in dCache cannot be overwritten. This second failure is what registers as an error in the Near1 Binary Transfer project. Since the file is already located in the dropbox, it should still be transferred safely to tape. If you confirm that the file is already in the dropbox with its metadata file, or that it is already in SAM and on tape, then you know this is what happened, and the file can be safely deleted from /datalocal and the subrun status reset in PUBS.
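The dropbox check described above could be sketched as a small shell helper. This is only an illustration: the function name, its arguments, and the ".json" metadata suffix are assumptions made for the sketch and are not taken from the PUBS scripts or configuration.

```shell
# Sketch only. The directory, file name, and ".json" metadata suffix are
# assumptions for illustration, not values taken from PUBS.

# Return 0 if both the data file and its metadata file exist in the dropbox.
in_dropbox_with_metadata() {
    local dropbox_dir="$1"            # dropbox directory to inspect (hypothetical argument)
    local file_name="$2"              # name of the subrun binary file
    local meta_suffix="${3:-.json}"   # assumed suffix of the metadata file

    [ -f "${dropbox_dir}/${file_name}" ] &&
        [ -f "${dropbox_dir}/${file_name}${meta_suffix}" ]
}
```

A non-zero return does not by itself mean the transfer failed; the file may already have left the dropbox, in which case the SAM lookup is the check to rely on.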
Diagnosing the Problem
The easiest way to do this is to query the procdb database on the smc host for subruns with status 160. Log into ubdaq-prod-smc, set up PUBS, and then connect to the procdb database:
ssh email@example.com   # this is the gateway machine to get through the DAQ firewall
ssh firstname.lastname@example.org   # this is the SlowMonCon database host ("smc")
cd pubs
source config/setup_uboonepro_online.sh   # this is the script that sets up the PUBS environment for the shell
psql -d procdb
Once you've logged in (the uboonepro password can be found in ~uboonepro/.sql_access/uboonepro_prod_conf.sh), you can list the subruns with errors using this command:
procdb=> select * from prod_transfer_binary_near12dropbox_near1 where status=160;
  run  | subrun | seq | projectver | status | data
-------+--------+-----+------------+--------+------
 11112 |    202 |   0 |          0 |    160 |
 11112 |    240 |   0 |          0 |    160 |
(2 rows)
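If there are many failed subruns, it may be convenient to reduce the query output to plain run/subrun pairs. The following is a minimal sketch that assumes psql's default aligned table layout, as in the listing above; the function name is hypothetical.

```shell
# Read psql's aligned table output on stdin and print "run subrun" for each
# data row. The prompt, header, separator, and "(N rows)" lines are skipped
# because their first two fields are not plain integers.
list_failed_subruns() {
    awk -F'|' '
        {
            run = $1; subrun = $2
            gsub(/[ \t]/, "", run)
            gsub(/[ \t]/, "", subrun)
            if (run ~ /^[0-9]+$/ && subrun ~ /^[0-9]+$/) print run, subrun
        }'
}

# Possible usage from the shell on the smc host:
#   psql -d procdb -c "select * from prod_transfer_binary_near12dropbox_near1 where status=160;" | list_failed_subruns
```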
The most important thing to do is locate the file on /pnfs/
Once you have a run.subrun that you know is in error (see the database query above), log in to uboonegpvm0X.fnal.gov, set up your favorite version of uboonecode, and then try to locate that run.subrun with SAM. (NOTE: the embedded "samweb list-files" command must be enclosed in backquotes, as shown!)
source /cvmfs/uboone.opensciencegrid.org/products/setup_uboone.sh
setup uboonecode v06_26_01_20 -q e10:prof
samweb locate-file `samweb list-files "file_format binary% and run_number = 9603.442"`
This should give something like this:
If you see the final part in parentheses (in this example, (3264@vpr863)), then you know the file has been correctly transferred, including its metadata. If the file isn't registered with SAM, the command will give an error, since the embedded "samweb list-files" command will have failed.
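The check for the parenthesised suffix can also be scripted. The sketch below assumes that the suffix always has the form "(number@label)" at the end of a line, which is generalised from the single example (3264@vpr863); the function name is hypothetical.

```shell
# Return 0 if any line on stdin ends with a parenthesised "<number>@<label>"
# suffix such as "(3264@vpr863)". The pattern is an assumption based on that
# one example and may need adjusting.
has_location_suffix() {
    grep -Eq '\([0-9]+@[A-Za-z0-9]+\)[[:space:]]*$'
}

# Possible usage, after setting up uboonecode:
#   samweb locate-file `samweb list-files "file_format binary% and run_number = 9603.442"` \
#       | has_location_suffix && echo "file is located"
```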
At this point, you can fix the status of this file for the project:
Set up PUBS as uboonepro, and go to the dstream_online folder:
cd dstream_online #this assumes that you are already in the ~uboonepro/pubs/ directory
Now run ./fix_failed_near1_binary_transfer.sh with the run and subrun of each failed file:
./fix_failed_near1_binary_transfer.sh 9603 442
This corrects both the transfer failure and the verification project (we've already done the verification, so that should be set to success). The final clean-up project will then take care of itself and actually delete the file on /datalocal/.
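When several subruns need the same correction, the script can be called in a loop. This is a minimal sketch: the function name and the FIX_SCRIPT override are assumptions added for illustration, and each run/subrun pair should still be verified in SAM before it is passed in.

```shell
# Read "run subrun" pairs on stdin (one pair per line) and call the fix
# script once per pair. FIX_SCRIPT is a hypothetical override that allows a
# dry run, e.g. FIX_SCRIPT=echo.
fix_all_failed() {
    local fix_script="${FIX_SCRIPT:-./fix_failed_near1_binary_transfer.sh}"
    local run subrun
    while read -r run subrun; do
        # Skip blank or incomplete lines.
        [ -n "$run" ] && [ -n "$subrun" ] || continue
        # Redirect stdin so the script cannot consume the remaining pairs.
        "$fix_script" "$run" "$subrun" </dev/null
    done
}

# Possible usage from the dstream_online folder:
#   printf '9603 442\n' | fix_all_failed
```

Running it first with FIX_SCRIPT=echo prints the arguments that would be passed, without changing any status in PUBS.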