03/13/2014 Minutes


Jon, Raphael, Susan, Nick, Dominick, Kanika, Andrew, Chris, Gavin, Michael, Alex, Eric


Raphael to produce gsimple for Rock files, with preference given to FHC then RHC.
Nate producing Rock overlay file for Jim.


MC: Completed for full set of FD CRY cosmics. Datasets declared and being used by Chris and subsequently Luke V.
SAM datasets are available for the pclist and pcliststop files
(a nobad variant was added, since there are two corrupt files in the original dataset),

each containing 10145 files (200 events/spill) out of an original dataset of 10168.
The files are also available on bluearc local disk:

The database failures were caused by a small glitch in the DB: the web server that serves the validity tables was down for a few minutes.
There is also a minor problem with the retry functionality, which isn't implemented correctly.
We haven't yet put a large enough load on the DB to expose this; if we overwhelm the system, we won't have the retry.
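Until the client-side retry is fixed, the intended behavior can be sketched as a small wrapper with exponential backoff. This is a hedged illustration, not the actual DB client code; `with_retries` and `flaky_query` are names invented here for the example.

```python
import random
import time

def with_retries(fn, max_tries=5, base_delay=0.5):
    """Call fn(), retrying on failure with exponential backoff plus jitter.

    Retries on any exception for simplicity; a real client would narrow
    this to connection/timeout errors from the database layer.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_tries:
                raise  # retries exhausted; surface the failure
            # Back off exponentially so a loaded server gets breathing room.
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)

# Example: a "query" that fails twice (e.g. validity-table server down)
# before succeeding on the third attempt.
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("validity-table web server unavailable")
    return "ok"
```

With this wrapper, the two transient failures above are absorbed and the caller only sees the successful result.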
Jobs for 3980 files from runs 13150-13350 were submitted on March 7. 3805 of those jobs ran successfully and produced output in the dropbox mentioned above. The failures in that set may have been caused by intermittent database connection problems. The failures seem to be somewhat back-loaded in terms of job run time.

Another 3990 jobs for all files processed from run 13350 onward were submitted on March 9. The vast majority of those jobs failed due to widespread database connection problems. Only 54 pairs of pchits/pchitsstop files arrived in the dropbox.

The cause of the database problems is unknown, but it may be related to a backport (late in the day on March 7) to the S14-03-06 tag that changed the database address from prod to dev. On March 10, the tag was patched again to point to a web-cache port (8081) rather than the default port (8084).

The jobs for runs 13350 and onward (3990 total) were resubmitted on the morning of March 11. With a concurrency maximum of ~1100, the jobs have been proceeding without any database troubles. Since output from some of these files has already been recognized by FTS, they could not be put directly in the dropbox mentioned above. Instead, they are being placed in a temporary dropbox, namely: /nova/prod/pchits/data_not_dropbox/.
Dominick will now send the files to the official dropbox so that they get picked up by the FTS.

FTS VM overwhelmed

The FTS VM handles all file transfers and queries, and it was getting clogged up dumping metadata from files; sam_metadata_dumper can tie up a core for hours. Increasing the number of CPU processes is one solution. We want to make sam_metadata_dumper use xrootd to read files, bypassing Bluearc. The xrootd issue looks resolved.


It is possible to take a snapshot of a dataset and then define a new dataset using the snapshot ID.
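A sketch of that workflow with the samweb CLI; the definition names and the snapshot ID here are placeholders, not names from these minutes.

```shell
# Take a snapshot of an existing dataset definition; samweb prints the
# snapshot ID (an integer) for the file list as it stands right now.
samweb -e nova take-snapshot my_definition

# Freeze that file list as a new dataset by defining it in terms of the
# snapshot ID (replace 12345 with the printed ID).
samweb -e nova create-definition my_definition_frozen "snapshot_id 12345"
```

Unlike the original definition, the frozen dataset will not grow as new files matching the query are declared.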
Joao has proceeded to add a goodruns metadata field to raw2root files.
I now see (I somehow missed it earlier) dq.isgoodrun: true being set for good subruns.

We can query SAM for a list of Data files with this field set:

% samweb -e nova list-files "dq.isgoodrun true and data_tier pclist and calibration.base_release S14-03-06" 


Filenames have a descriptor field that helps one decipher what a file refers to. Several different patterns are possible, which makes it difficult to build a dynamic file-renaming scheme. The plan is to implement a simulated.label field that is correct for the specific type of generation one has done, and to build datasets from it. The renaming script works on regular expressions.
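As a toy illustration of the regular-expression approach, the sketch below parses one hypothetical filename pattern; the actual descriptor patterns vary (which is exactly why a single scheme is hard), so the pattern and field names here are assumptions, not the real NOvA naming convention.

```python
import re

# Hypothetical pattern: <detector>_r<run>_s<subrun>_<descriptor>.root
# Real files follow several different patterns, so a production script
# would need one expression per pattern (or a fallback chain).
PATTERN = re.compile(
    r"(?P<detector>[a-z]+)_r(?P<run>\d+)_s(?P<subrun>\d+)_(?P<descriptor>[\w-]+)\.root$"
)

def parse_filename(name):
    """Return the named fields as a dict, or None if the pattern fails."""
    m = PATTERN.search(name)
    return m.groupdict() if m else None
```

A renaming script would then rebuild the filename from the captured fields plus the metadata label, instead of string-slicing at fixed offsets.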

OASIS server issues

Everyone's certificates had expired, and getting a new certificate wasn't sufficient: the wrong person was flagged as the VO manager.
A ticket was opened; we have confirmation that the update is done, but it looks like the files haven't propagated. Raphael is working with Scott Teige to understand why. There are plans to change the update process to a system similar to novacfs, where it is clear when the update is done. Maybe 5 hours.


Adam to add checks for IsZombie and corruption to filter out bad output files. This won't catch files that have the TStreamer/TBasket errors, nor would a simple eventdump: Chris reports that these can fail after X events (X > 10), and eventdump by default checks only the first ten events. Perhaps we should run eventdump over all events in every file; verify whether this is viable without adding significant processing time.
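A toy illustration of why a first-ten-events check misses this class of corruption. The list-based "reader" below is a stand-in invented for the example, not real ROOT/eventdump I/O; corrupt events are modeled as `None`.

```python
def scan(events, max_events=None):
    """Return True if every inspected event is readable.

    max_events=10 mirrors eventdump's default of checking only the first
    ten events; max_events=None scans the whole file.
    """
    to_check = events if max_events is None else events[:max_events]
    return all(e is not None for e in to_check)

# A file whose corruption begins at event 15: the default-style check of
# the first ten events passes, while a full scan catches the bad events.
events = [{"id": i} for i in range(15)] + [None, None]
```

The trade-off flagged in the minutes is exactly the cost of the full scan: reading every event in every file adds processing time proportional to file size.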

Outstanding Issues

We went through the issue tracker list. See the issue tracker for any updates.