8 June 2015

Gavin, Paul, Joseph, Matthew, Kirk, Ryan, Chris, Ruth, Michael, Craig, Susan, Bruno, Satish

News (Matthew/Satish)

  • The big push now is for alternate samples; Kanika is getting MRCC done while we converge on plans for the other samples.
  • FD cosmics: Tests over the weekend indicate that, depending on the number of jobs submitted, the run-time w/o filtering would be 2.5 weeks for the pre-topup dataset alone. Tests to evaluate the run-time w/ filters are in progress, but filtered running looks to be the way we will go. The possibility of using the alternate cosmics rejection was raised, as it offers a substantial (factor of three) reduction in processing time, but there was reluctance to change the production configuration at this late date. There may be duplicate runs/subruns in the datasets she circulated. This should be investigated, but given the long run-times per file, it’s likely that we will still need to run in filtered mode.
  • Satish has committed the requisite changes to the FCL. These will go into a new hotfix tag. Satish will forward the necessary information to Jonathan and David.
  • We also need to generate alternate Birks-Chou and alternate intensity samples, and need to converge on the re-use of genie events for these. This will also need a new hotfix tag.
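
The check for duplicate runs/subruns mentioned above could be sketched as follows. This is only a minimal illustration: it assumes the (run, subrun) pairs have already been extracted from the dataset by some other means, and the function name and example values are hypothetical, not taken from any real dataset or existing script.

```python
from collections import Counter


def find_duplicate_subruns(pairs):
    """Return the (run, subrun) pairs that occur more than once, sorted."""
    counts = Counter(pairs)
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical (run, subrun) pairs, made up for illustration only.
example_pairs = [(1000, 1), (1000, 2), (1000, 1), (1001, 0)]
duplicates = find_duplicate_subruns(example_pairs)
```

An empty result would mean that every run/subrun appears exactly once in the list that was checked.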

Reuse of Genie Events (Gavin):

We want to reuse genie events when generating systematic samples, to reduce the impact of statistical fluctuations. Gavin has worked on the generation scripts in the past, and agreed to look into it. It turns out this was not in the scripts that Gavin handed off to Nate. The mechanism is in Nate’s scripts but it looks messy, so Gavin doesn’t want to touch it. There was some indication that Nate’s code for this may be jumping through unnecessary hoops, and it would be better to just put together something new. Gavin will investigate a bit further and update on prospects soon.

Update: As long as we don’t need to redo the overlay step (Ryan is ok with this), this looks do-able, at least for Birks-Chou.

Putting MR Brem in Production (Chris)

MR Brem is not fundamentally different from MRCC, which Kanika is already running using the production infrastructure. But the current mechanism for running the MR Brem jobs means managing files on BlueArc, which is cumbersome. It would be good to get this done using the production infrastructure as well. There was some concern about a high failure rate for MR Brem jobs, which needs to be understood and resolved before we go ahead with this. This was an older issue, so it may in fact already be resolved.

We may need new data tiers for this. They may already exist.

Update: Satish investigated after the meeting, and they don’t appear to exist.

Tags (Jonathan/David)

  • expecting S15-05-04b for FD cosmics; just reco changes from Satish; Satish to notify Jonathan, David when ready.
  • expecting FA14-10-03x.c for systematics samples: just Ryan’s changes for systematics samples; this release is in the process of being created. Gavin’s changes for genie event reuse may also need to go in, but they are all in submission and processing scripts that don’t need to be tied to a release. In any case, we should have that ready for today.
  • The backports to the S15-05-22 series have not propagated to CVMFS yet. (The backports are to mixing fcl files that were never used, but they should be updated to ensure that any reprocessing will yield results consistent w/ jobs that have already run.)
  • It is difficult for users to know when a release has been pushed out to CVMFS. The scripts will be configured to create a file whose existence will confirm that the publish has completed successfully.
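
The marker-file idea in the last item could look roughly like the sketch below. The marker name and both function names are assumptions made for illustration; the actual publish scripts and the file they will create are not specified in these notes.

```python
from pathlib import Path

# Hypothetical marker name; the real scripts may use something else.
MARKER_NAME = ".publish_complete"


def mark_publish_complete(release_dir):
    """Create the marker file as the very last step of a successful publish."""
    marker = Path(release_dir) / MARKER_NAME
    marker.touch()
    return marker


def is_published(release_dir):
    """Return True if the marker file exists in the release directory."""
    return (Path(release_dir) / MARKER_NAME).is_file()
```

Writing the marker only after everything else has finished is what makes its existence meaningful: a user who sees the file can assume the release is complete rather than partially published.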

Simulation (Ruth/Paul)

  • Paul has now submitted the RHC (ideal conditions) jobs (600 files and 600 jobs each for nonswap, flux swap and tau).
  • ND Cry jobs from Ruth are running, but slowly. About 8k/50k are finished. She will submit more jobs to dedicated offsite locations (Harvard, SMU, OSC) to see if she can get more nodes, and will probe the sites if her jobs do not run.

Reco/PIDPart (Joseph)

Joseph has finished reco and pidpart for top-up and is dealing with stragglers now. He will start on the SAM definitions for FD cosmics right away, and will submit jobs as soon as the go-ahead is received.

LEM (Chris)

  • Chris is completely caught up on FD data.
  • He is also working on MRCC for Kanika and on MR Brem.
  • There is one MRCC lemsum file w/ a complicated history. Gavin got it retired, but for some reason SAM still knows about it. Gavin and Chris will investigate.
  • Chris is working on decafs. This seems to be working fine, but does need a restart of FTS. The restart script is supposed to ssh to nova-offline. That works from offsite, but not from onsite. Chris to follow up w/ Neha.

Mix/CAF (Bruno)

  • The ND data with S15-05-22a is done. There are 5 missing files in normal geometry and 12 missing in staggered geometry. He is investigating.
  • He is starting on FD top-up data, but the draining dataset definition times out. A service desk ticket revealed that the native query, run directly on the SAM db, took seven hours. An update to the schema to address the problem is in progress, but there is no ETA. In the meantime, the DH group simply returned to Bruno a list of files to process. Because the list is so long, the command line tools to create the SAM definition fail. So Bruno broke things up into smaller datasets and submitted against those. That seems to be working, and he expects the jobs to be done by the end of the day. For draining definitions, Chris suggested that rather than providing a list of files, Computing can just supply a snapshot that Bruno can use.
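
Breaking a long file list into smaller datasets, as described in the last item, amounts to simple chunking. The sketch below is only an illustration: the function name, the chunk size and the example file names are hypothetical, and it does not call any SAM tool; each chunk would still have to be turned into its own definition by whatever means is in use.

```python
def split_into_chunks(files, chunk_size):
    """Split a list of file names into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]


# Hypothetical file names, made up for illustration only.
example_files = ["file_a.root", "file_b.root", "file_c.root",
                 "file_d.root", "file_e.root"]
chunks = split_into_chunks(example_files, 2)
```

The chunk size would be chosen so that each resulting list stays short enough for the command line tools to handle.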

ND genie RW (Gavin):

It’s done, except for a few stragglers. Consistently getting 1.5k nodes. Need to make sure we use the 22a tag for future re-CAFing.

Raw2Root keep-up status (Vito)

Reco of 19 files crashed w/ a memory issue, which looks like a breakpoint failure. There was a suggestion to see whether a new tag can address this, but there was reluctance to change reco versions so soon (the current tag is S15-05-07a). We will soon start back processing of FD data. That may be an appropriate time to switch to a new tag.

Calibration keep-up status (Paola)

Calibration keeps running w/o major problems. There have been some submission errors, but they are generally transient. Jobs are being submitted to offsite locations. We will switch to targeting jobs at specific offsite locations in order to probe the health of the different sites, get problems addressed, and report on site failures.

The job submission errors are generally some kind of authentication failure. They have been seen by the OPOS group for other experiments as well. As far as can be determined, the issue is a server-side threading problem. It is difficult to test on the preproduction server, b/c the problem is load dependent, and only the production server has a sufficient number of submissions for the problem to arise.

New FHiCL Validation Scheme

Kyle Knoepfel presented a new scheme for FHiCL validation that is coming down the pike (see slides attached to agenda). It looks promising, but will require a lot of recoding on our part to take advantage of it. ART will remain backwards compatible with the old means of reading FHiCL files.