Running data blinding scripts

cron jobs for blinding data

A cron job installed on uboonegpvm01 runs at 6:00 am every morning to blind data written during the previous 24 hours.

#Run the blinding scripts once a day at 6:00 am (contact Kirby if there is a need to change things)
0 6 * * * cd /uboone/app/home/uboonepro/cron;./blind_data.sh > cron_blind_data.out 2> cron_blind_data.err

The blind_data.sh script in /uboone/app/home/uboonepro/cron does several things:

  • rotates the logs of the blinding script in /uboone/app/users/uboonepro/pubs_devel/log/uboonegpvm07.fnal.gov/ - the last 5 logs are kept
  • executes the blind_uboone_data.py script four times, once each for binary, swizzled, reco, and anatree
  • note that the run limits on those commands are set to 9000 - 100000, so the lower limit may eventually need to be raised in order to speed up the query
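
The steps above can be sketched in Python as follows. This is only an illustration, not the contents of blind_data.sh: the function names rotate_logs and build_commands and the ".1" to ".5" suffix scheme for rotated logs are assumptions. The four file types, the run limits, and the count of 5 kept logs come from the list above.

```python
from pathlib import Path

# The four file types that blind_data.sh passes to blind_uboone_data.py.
FILE_TYPES = ["binary", "swizzled", "reco", "anatree"]


def rotate_logs(log_path, keep=5):
    """Rename log -> log.1 -> log.2 ..., keeping at most `keep` old copies.

    The numeric suffix scheme is hypothetical; the real script may differ.
    """
    log_path = Path(log_path)
    # Drop the oldest copy so the shift below never exceeds `keep` files.
    oldest = log_path.with_name(f"{log_path.name}.{keep}")
    if oldest.exists():
        oldest.unlink()
    # Shift the remaining copies up by one, oldest first.
    for i in range(keep - 1, 0, -1):
        src = log_path.with_name(f"{log_path.name}.{i}")
        if src.exists():
            src.rename(log_path.with_name(f"{log_path.name}.{i + 1}"))
    # The current log becomes copy number 1.
    if log_path.exists():
        log_path.rename(log_path.with_name(f"{log_path.name}.1"))


def build_commands(lower=9000, upper=100000):
    """Return one blind_uboone_data.py command line per file type."""
    return [
        ["./blind_uboone_data.py", file_type, str(lower), str(upper)]
        for file_type in FILE_TYPES
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; running requires the
    # uboonepro account and the offline PUBS environment.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Raising the lower argument of build_commands corresponds to the change suggested in the last bullet.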

The log files in the /uboone/app/home/uboonepro/cron area can probably be deleted about once a month.

The Data Blinding Python script in PUBS

The script for blinding MicroBooNE BNB data is located in the offline production PUBS git repository (pubs/dstream_prod/blind_uboone_data.py). This script can only be run from the uboonepro account and requires that the offline PUBS environment be set up.

$> ssh uboonepro@uboonegpvm01.fnal.gov
$> cd /uboone/app/users/uboonepro/pubs_devel/dstream_prod/
$> source /uboone/app/users/uboonepro/pubs_devel/config/setup_uboonepro_offline.sh
$> ./blind_uboone_data.py binary 5000 5100 0

Usage of the script

Here is the usage message when you don't give any arguments.

[uboonepro@uboonegpvm01 dstream_prod]$ ./blind_uboone_data.py

Missing the needed input variables for blinding of data!!!

Usage: blind_uboone_data.py <binary|swizzled|reco|anatree|test> <lower_run_limit> [upper_run_limit] [fraction_for_unblind] [reprocess_old_files]
You have to give a lower limit to the files you want to blind, and possibly an upper limit.
The fraction_for_unblind is an optional integer percentage to leave unblind.
Files marked with ub_blinding.processed: true with only be processed if you set [reprocess_old_files] to true.
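
As a reading aid, the argument layout in the usage message can be sketched like this. It is not the parsing code in blind_uboone_data.py: the function name parse_args, the returned dictionary keys, the 0 to 100 range check, and the use of None for a missing upper limit are all assumptions made for illustration.

```python
# File-type keywords listed in the usage message.
VALID_TYPES = ("binary", "swizzled", "reco", "anatree", "test")


def parse_args(argv):
    """Parse arguments in the order given by the usage message.

    argv excludes the script name. Raises ValueError on bad input.
    """
    if len(argv) < 2:
        raise ValueError("need <file_type> and <lower_run_limit>")
    file_type = argv[0]
    if file_type not in VALID_TYPES:
        raise ValueError(f"unknown file type: {file_type}")
    lower = int(argv[1])
    # Assumption: None stands for "no upper limit given"; how the real
    # script treats a missing upper limit is not stated in the usage text.
    upper = int(argv[2]) if len(argv) > 2 else None
    # Optional integer percentage of files to leave unblinded.
    fraction = int(argv[3]) if len(argv) > 3 else 0
    if not 0 <= fraction <= 100:
        raise ValueError("fraction_for_unblind must be between 0 and 100")
    # Only the literal string "true" switches reprocessing on (assumed).
    reprocess = len(argv) > 4 and argv[4].lower() == "true"
    return {
        "file_type": file_type,
        "lower": lower,
        "upper": upper,
        "fraction_for_unblind": fraction,
        "reprocess_old_files": reprocess,
    }


if __name__ == "__main__":
    print(parse_args(["binary", "5000", "5100", "0"]))
```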

Note on usage:

- you have to have a valid certificate in the environment in order to make modifications to the file metadata in SAM. Sourcing setup_uboonepro_offline.sh takes care of this.
- you have to choose a set of files to process:

  • binary: raw files from the DAQ (only files older than 48 hours will be blinded; this applies to any file starting with "Phy*" that contains BNB events)
  • swizzled: the raw swizzled files output from processing (only bnb and bnbunbiased files)
  • reco: the output of reco2 stage of processing (only bnb and bnbunbiased files)
  • anatree: flat ntuple files output from anatree processing (only bnb and bnbunbiased files)

- you have to give a lower and upper run limit that are used to select files from SAM to blind or not. The upper and lower limits are inclusive, so if you want to potentially blind only one run, you can set both to that run number (e.g. ./blind_uboone_data.py binary 9100 9100 0)
- the fraction_for_unblind is an integer percentage of files that you want to have randomly left open for analysis. For example, setting this value to "1" means that 1% (a fraction of 0.01) of all files would not be blinded. Once a file has been processed, its metadata is set so that it won't be considered again for the random "open" dataset. The default fraction is 0%.
- you can override the "ub_blinding.processed" metadata by setting the "reprocess_old_files" value to "true"; all files will then be considered for being marked unreadable, regardless of previous processing
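
The rules in these notes can be summarized in a small sketch. It is a hypothetical model of the per-file decision, not the logic in blind_uboone_data.py: the function name decide, its return values, and the way the random draw is made are assumptions. The inclusive run range, the integer percentage, and the processed-flag override follow the notes above.

```python
import random


def decide(run, already_processed, lower, upper,
           fraction_for_unblind=0, reprocess_old_files=False, rng=random):
    """Return "skip", "open", or "blind" for one file.

    run: run number of the file
    already_processed: value of the ub_blinding.processed metadata
    rng: any object with a random() method returning a float in [0, 1)
    """
    # Both run limits are inclusive.
    if not (lower <= run <= upper):
        return "skip"
    # Processed files are left alone unless reprocessing is requested.
    if already_processed and not reprocess_old_files:
        return "skip"
    # Leave roughly fraction_for_unblind percent of files open.
    # With the default of 0 this branch is never taken.
    if rng.random() * 100 < fraction_for_unblind:
        return "open"
    return "blind"


if __name__ == "__main__":
    rng = random.Random(12345)  # fixed seed so the sketch is repeatable
    results = [decide(9100, False, 9100, 9100, 1, rng=rng) for _ in range(1000)]
    print(results.count("open"), "of", len(results), "files left open")
```

With fraction_for_unblind set to 0 every eligible file is blinded, and with 100 every eligible file is left open.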

crontab running these files

Once we've processed all of the files, we should set up cron jobs to run each night.

Initial blinding progress

binary: 3420-10000 (this is done and doesn't need to be re-run for open trigger data)
swizzled: 4000-10000 (this had a problem with run 3985 and needs to be re-run after open trigger data)
reco: 4000-10000 (this had a problem with run 3985 and needs to be re-run after open trigger data)
anatree: 3420-10000 (this had a problem with run 3985 and needs to be re-run after open trigger data)