15 June 2015

Paul, Susan, Joseph, David, Matthew, Gavin, Craig, Chris, Kirk, Jonathan, Bruno, Alex, Ruth, Neha, Ken

SW Tags (David/Jonathan)

  • Tag scheduled for tomorrow; any changes should go in tonight. We want to include the new version of ART, v01_14_02.
  • Kanika has backported an MRE bug fix to S15-05-04a.

Run History Discussion (all)

Over the weekend, Evan discovered that an ideal diblock mask was used for all FD top-up data. This was because the processing was done before the masks were inserted into the database. There is a FCL parameter to force a crash if the mask is not found in the DB, but it was not correctly set in the production tags. A fix was committed to development on May 28 (r14772), but was never backported to the production tags. It should be backported.

This will require respinning all of the FD top-up data. We will grab the handful of pre-top-up data while we are at it. Matthew will run Kirk’s script to identify a list of problematic files, and Satish will retire them. Gavin is working on getting privileges for Satish to do this, and will send him detailed instructions for retiring the files.

Archiving (Pavan)

  • /nova/data/mc/S12-11-16 — deleted, freeing up 8.7 TB. Was this archived first? Pavan not attending, so couldn’t ask for sure. Satish will follow up. But we should be archiving as a general rule (and that is the plan). Update: this directory had already been archived, so only deletion was needed.
  • /nova/ana/users/cerretan archived and deleted, freeing up 2.2 TB

Susan commented that /nova/prod needs attention soon. It’s currently at 80%. This is where all the production CAFs get saved, so usage is large and growing. There was a proposal to not write production CAFs to blue-arc anymore. This is probably a good idea, but we shouldn’t do it for first analysis. We will look for files that can be archived or deleted w/o worry first.

Simulation (Ruth/Paul)

Ruth — For running offsite, tickets have been filed for the issues w/ SMU and Harvard. There are two issues. One looks like a CVMFS issue, or possibly a certificate issue. At SMU, it looks like the SL5 machines have a 90% failure rate. In the meantime, other sites, such as Michigan, have been getting better traction. About 13k/50k files have been done. It will be good to get more jobs running in parallel to speed up the process.

Paul — The nonswap Birks-Chou generation is now running and not crashing, after about 1 hour. He will start the flux swap and alternate intensity jobs shortly.

Reco/PIDPart (Joseph)

The reconstruction of cosmics data is in progress. It should finish w/in a few hours. About 3-4% of files have failed reconstruction due to the BPF crash. If this is a random crash, we can probably live w/ it, but concern was expressed that the crashes are probably biased towards samples with more active diblocks. This needs to be followed up w/ the conveners.

Matthew should run Kirk’s script to verify that there are no issues w/ diblock masking in the cosmics samples.

The queries for the draining datasets take an excessively long time. Recent fixes have helped other queries, but not in this case. For his processing, Joseph will use simplified queries that do not check for ALL children of a file, merely one.
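To illustrate the trade-off in the simplified queries, here is a minimal sketch on a toy in-memory catalog. It does not use SAM or the real dataset definitions; the function names, the file names, and the tier names are all hypothetical and chosen only for illustration.

```python
def draining_strict(children_by_parent, expected_tiers):
    """Parents missing at least one expected child tier (the full check)."""
    expected = set(expected_tiers)
    return sorted(parent for parent, tiers in children_by_parent.items()
                  if not expected <= set(tiers))


def draining_simplified(children_by_parent):
    """Parents with no child at all (the cheaper, one-child check)."""
    return sorted(parent for parent, tiers in children_by_parent.items()
                  if not tiers)


# Hypothetical catalog: parent file -> set of child tiers already produced.
catalog = {
    "file_a.raw": {"reco", "pid"},
    "file_b.raw": {"reco"},
    "file_c.raw": set(),
}

print(draining_strict(catalog, ["reco", "pid"]))
print(draining_simplified(catalog))
```

In this toy model the simplified check is cheaper but does not flag a parent such as file_b.raw, which has one child but is missing another.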

LEM (Chris)

  • FD cosmics in progress. Two more days to finish.
  • MRE in the queue next. MRE script needs bug fix, and backport.

Mix/CAF (Bruno)

  • Working on FD cosmics data as it comes through. 1200 through, another 7k today. Will keep up as they come in
  • The FD numi data is basically done, but will need to rerun because of the RunHistory/Diblock mask issue.

There were a handful of ND files with copy failures. Apparently the existing files needed to be retired first. Gavin did the retirement, but the files still couldn’t be declared to SAM when Bruno respun them; they are still failing. Bruno will follow up offline.

Calibration Keepup (Paola)

  • Calibration keepup encountered several submission issues of the sort that were plaguing others over the weekend.
  • Some jobs are still running
  • Several jobs lost connection w/ the samweb station. No logs available in some cases.
  • Several jobs failed with error codes that point at IFDH. This seems like a problem at FZU. An OSG ticket has been opened.
  • To ensure that all data is being processed, calibration keepup now goes back 60 days. This is needed because a dataset naming collision (now fixed) was preventing some files from being processed.
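The 60-day lookback can be sketched as a simple date window. This is only an illustration of the idea, not the actual keepup script; the function names are hypothetical.

```python
from datetime import date, timedelta


def lookback_window(today, days=60):
    """Return (start, end) of the window ending today."""
    return today - timedelta(days=days), today


def in_window(file_date, today, days=60):
    """True if a file's date falls inside the lookback window."""
    start, end = lookback_window(today, days)
    return start <= file_date <= end


start, end = lookback_window(date(2015, 6, 15))
print(start, end)
```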

Raw2root keepup (Paola)

This is running fine.

Reco keepup (Vito)

This is generally running fine. There were some problems:

  • crash in CalHit associated w/ DB connection failures, later resolved
  • BPF memory exhaustion failure

Priorities in Jobsub (Satish)

  • See docdb 13513
  • Is the half-life for priority set properly? A half-life of one day seems rather short
  • One line of questioning: what do these priorities mean for jobs competing across experiments?
  • Some discussion on the test queues. Maybe we can just edit so that one job always runs immediately. The general sense was we should let SCD develop their approach for test queues, and see if it works for us.
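For reference on the half-life question, here is a minimal sketch of how a half-life discounts past usage. It assumes a plain exponential-decay model and is not the actual jobsub accounting; the function name and the numbers are illustrative only.

```python
def decayed_usage(usage, elapsed_hours, half_life_hours=24.0):
    """Past usage still counted after elapsed_hours under exponential decay."""
    return usage * 0.5 ** (elapsed_hours / half_life_hours)


# Compare how quickly the same usage is forgotten for two half-lives.
for half_life in (24.0, 72.0):
    remaining = decayed_usage(100.0, 48.0, half_life)
    print(f"half-life {half_life:.0f} h: {remaining:.1f} of 100 units remain after 48 h")
```

Under this model, a one-day half-life means heavy use two days ago counts at only a quarter of its original weight, which is why a short half-life favors users who submitted heavily in the recent past.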

Other business:

We have a request to process the special numi run data. Satish to follow up on exactly what is needed for these files. We will need to develop a way to use metadata to track these files and distinguish them from conventional ND data files.