28-March-2016

Attending: Satish, Felipe, Paola, Chris, Enrique, Joe, Paul S, Kanika, Paul R

Computing issues (All)

Running at Wisconsin

On Sunday, a black hole node developed at Wisconsin. Enrique will file a GOC ticket. Satish will cut a new version of NGU that removes Wisconsin from the list of recommended sites. There was some discussion of the best way to implement this: basing it on the head of NGU, or basing it on the CURRENT version. The former is technically easier, and there has been some running with TEST, so we are reasonably confident that it won’t cause any issues.

SAM slowness

Especially on Sunday, and seemingly continuing into today, SAM has been very slow, often to the point of being unusable. This extended to dataset counting queries, project starting and even dataset deletion. There was some skepticism about the notion that the dCache issues could be causing all of these problems. The issue of frequently dying SAM projects was also raised, although there is no solution in hand for that at present. Satish will follow up with Rob, and Kanika will update her ticket to increase the priority and

Tags (Paul)

Last week Paul cut S16-03-25, as a regular biweekly snapshot. He also cut R16-03-03-prod2reco.b, needed for MRE; and S16-01-07.a to fix some database issues for reco-keepup. He is also preparing a new release based on prod2genie that includes a new version of nutools (1.24-02) needed to test the new version of geant. He is waiting for new builds of novadaq and novaddt to finish it. Nightly builds seem to be in good shape.

Dataset Definitions (Joe)

Joe is working on creating the cosmic dataset definitions right now.

He’s also been through the definitions on the website. Most definitions are ok, except for the ND numi ones. They are currently good to use, but are potentially unsafe. Joe has produced more robust definitions, which Satish will add to the website, replacing the current ones. Joe also found some batch-specific definitions. Satish will clean these up.

Joe and Enrique will follow up on getting the calib-func-X definitions created and validated.

Chris requested that the Birks sample definitions be announced to the collaboration. Satish will do this.

ND Numi cleanup (Satish)

This is done.

Samples

ND Data (Bruno)

Bruno was not able to attend today. ND numi processing is mostly done, but there are more missing files than can be explained by the epoch boundary problem. He will need to investigate when he returns from holiday.

Calib Func X (Enrique)

This went well, but Satish and Alex killed the straggling jobs on Friday to make room for MRE. FD nonstop has 270/6k files left to drain, which is acceptable. FD flux swap is missing 900 files, which is too many. Enrique will investigate and resubmit some cleanup jobs. ND is close enough.
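As a rough check on the FD nonstop numbers above, the fraction of files still to drain can be worked out as in the sketch below. It assumes "6k" means exactly 6000 files, and the function name is purely illustrative (it is not part of any production tool). No fraction is computed for FD flux swap because the notes do not give its total.

```python
def remaining_fraction(files_left: int, files_total: int) -> float:
    """Return the fraction of a sample's files that still need to drain."""
    if files_total <= 0:
        raise ValueError("files_total must be positive")
    return files_left / files_total


# FD nonstop: 270 files left out of "6k" (taken as exactly 6000; assumption).
fd_nonstop = remaining_fraction(270, 6000)
print(f"FD nonstop remaining: {fd_nonstop:.1%}")
```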

AWS ND nogenieRW (Paul/OPOS)

There were about 57k files to process, of which about 3/4 are done. Paola is about to submit a recovery dataset. Files are being copied back to FTS by Marc Mengel. The recovery jobs will use the declareFiles and declareLocation options. Paola will first run some small test jobs setting nova.special to aws_test, so that they don’t pollute the standard definitions.
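For a sense of the scale of the recovery dataset, the approximate counts implied by these numbers can be sketched as follows. Both inputs are the rounded figures quoted in the notes ("about 57k" and "about 3/4"), so the results are estimates only; the variable names are illustrative.

```python
# Rounded figures from the notes; treat the results as estimates.
total_files = 57_000      # "about 57k files to process"
done_fraction = 0.75      # "about 3/4 are done"

done_files = round(total_files * done_fraction)
remaining_files = total_files - done_files

print(f"approx. done: {done_files}, approx. left for recovery: {remaining_files}")
```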

MRE (Kanika)

To review, MRE happens in 3 steps:
  • generation of artdaq files
  • reco+pid
  • caf-making

The DAQ step is mostly done, but recovery jobs are not going well, plagued by the SAM issues seen over the weekend. Reco+pid had good success on Saturday, but ran into many problems starting on Sunday morning because of the SAM issue. This also prevented Kanika from starting up SAM projects to get additional jobs running. She did manage to get MRE data running. LEMServe is at 50% load because copy-backs are slow (which in turn is due to the dCache issues).

Kanika will make an ECL entry with the MRE definitions for collaboration use. They were made by hand, not with Joe’s defgen script.

Kanika will continue running the MRE CAF making over the course of the week.

There was some discussion of what the optimal strategy was for this and the other remaining samples. Satish and Alex will discuss it this afternoon.

We will soon need to start generating MRE systematics. Alex R will be handling this. DAQ generation can start now, but LEM evaluation will have to wait for other samples.

FD cosmic nue (Felipe)

Felipe has been submitting jobs (~85k) over the weekend. Many clusters seem to have finished successfully, but Felipe is waiting on output or draining datasets to confirm they are ok. He did run into the dying project problem as well, which hampered progress. Felipe will send a detailed summary today.

Raw2root keepup (Paola)

This is running smoothly, although on Thursday around 500 jobs crashed because of copy-out problems. The issue seems to be resolved now and jobs are completing normally again.

Reco keepup (Qiulan)

This has generally been proceeding smoothly, with one failure. The file seemed to run fine interactively, but cannot be processed on the grid. Qiulan will send an email reminder about this issue.