Attending: Satish, Joe, Enrique, Justin, Fernanda, Chris, Gavin, Paul R, Paola, Paul Sail, Tapasi, Andres-Felipe
- Alex has proposed that we reorganize how we handle production requests. Rather than having individual people responsible for individual stages of processing (e.g. Enrique for MC reco), one person will be responsible for the entire chain for a given sample. The transition period will be somewhat painful, but it should broaden the expertise within the group. In the meantime, everyone is encouraged to review the generation documents at:
- The priority for novapro has been enhanced: the novapro priority factor has been reduced from one million, which is the standard for regular users, to one. As a reminder, the lower your priority factor, the more resources you get.
- We have found that using redmine to track production processing requests in detail has not worked out as well as expected. Hence, we will continue to use redmine for tracking of requests by the collaboration, and to keep some sense of how things are going overall, but not for the details. Alex and Satish are exploring other options for this purpose.
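As a rough illustration of why a lower priority factor yields more resources, the sketch below assumes an HTCondor-style fair-share model, in which a user's effective priority is the real (usage-based) priority multiplied by the priority factor, and resources are shared in inverse proportion to the effective priority. The function name, the user names, and the real-priority value of 0.5 are hypothetical, and the actual scheduler configuration may differ.

```python
def fair_shares(users):
    """Return each user's approximate share of the pool.

    users maps a name to (real_priority, priority_factor).
    Assumed model: effective priority = real priority * priority factor,
    and each share is proportional to 1 / effective priority.
    """
    weights = {
        name: 1.0 / (real_priority * priority_factor)
        for name, (real_priority, priority_factor) in users.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


if __name__ == "__main__":
    # Hypothetical example: identical usage history, different factors.
    shares = fair_shares({
        "novapro": (0.5, 1.0),
        "regular_user": (0.5, 1_000_000.0),
    })
    for name, share in shares.items():
        print(f"{name}: {share:.6f}")
```

Under these assumptions, two accounts with the same usage history competing for the same slots would see almost all of them go to the account with the smaller factor.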
Tags and Releases
- The nightly build broke over the weekend because we ran out of space on /nova/app. The problem has been fixed.
- Three tags have been made over the weekend:
+ New snapshot S15-11-06, which will be used for keepup in the future. Update: It appears that this tag is broken, so this will have to be addressed before keepup can resume.
+ FA14-10-043x-br and FA14-10-043x.d for horn off and new ND position MC. But this will need a back port (as mentioned during processing requests below).
+ R15-11-03-br and R15-11-03-ana.a for first analysis. Analyzers should commit using svn_branch_commit to make sure that updates are also propagated to development.
+ Upcoming tags: A new build for the daq. New builds for new versions of ART. There was some discussion of using feature branches for the new versions of art. Kanika has suggested that for art v1.17.03 we simply put the change into development. For art v1.18, which incorporates root v6, we will need to be more careful.
- Thanks to Kanika, Jonathan and Gavin for all their help over the weekend in sorting these multiple issues out and cutting extra releases.
- Jobs at the OSC cluster are still not working.
- For the first time since Enrique took on this role, he has managed to get jobs started at Harvard. However, the jobs all crashed. This is because kerberos is not installed on the Harvard nodes (which it is not required to be). There was some discussion of whether we should expect Harvard to install kerberos or if we need to edit our scripts to use alternate authentication methods.
- SMU and Fermilab have been working hard to get an instance of FTS running at SMU. They are close, but don’t have it quite yet, and have uncovered some bugs in the process that need fixing.
- Overall there are seven or eight sites that are not starting jobs.
- Tapasi observed that her jobs are all failing at SMU. She should ping Enrique to understand why (that said, her jobs are pretty much all done now).
- Chris asked if we could prioritize test jobs so that they also run first. We will investigate.
DB Code Crash Report
Ten files have been identified that need edits as outlined in the slides. The commits are ready to go, and Sijith has been given the go-ahead to make them. It was suggested that he make commits package by package to ensure they can be more easily backed out if needed.
Sijith asked whether we want to also make changes to code that writes to the database. This is generally outside the focus of the task he was requested to perform, but it seems like a good idea for him to investigate with the authors of the code to determine if this is worth doing.
Sijith will also remove the unused BadDataFilter_module.
Metadata Check for Unclosed Files
Adam looked at a couple of files identified by Fernanda as problematic, and compared them to good files. Her initial list of bad metadata parameters looks correct.
There is a checksum parameter present, but he can’t tell if it is correct. There was some question as to what the online.runsize parameter means. There was some speculation, but no clear answer. Satish is working on a grid script to correct the metadata, but it is not finished yet.
OPOS Grid Monitoring Report
- OPOS has started to monitor NOvA grid usage and production systems (e.g. FTS).
- They have identified a bug in the grafana monitoring tools and gotten it fixed.
- They have identified problematic jobs and are working with the submitting users to help resolve them, and commented that users have no way of telling why jobs have become held.
- Memory limits will be enforced more strictly.
- They noticed 151 abandoned jobs (some from late July). They raised the question of how to identify such jobs reliably. Ideally we can automatically kill them. The existence of an associated sam project was considered, but not all jobs use sam. At least very old jobs could be purged automatically.
- They are monitoring ten FTS instances and have identified error files. There was some discussion of the best way to deal with such files. For now, the current model of informing production when the number of such files is excessive, and letting us choose how to proceed, seems best.
- Feedback is welcomed. Satish made the suggestion that a tool (command line would be fine) to determine the status of a file in FTS would be very useful. OPOS will explore the possibility.
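The abandoned-job discussion above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Job` record, the `flag_abandoned` function, the 60-day age cutoff, and the "purge"/"review" labels are all hypothetical and do not correspond to any existing OPOS or NOvA tool. Very old jobs are marked for automatic purging, while the sam project check is used only as a secondary hint, since not all jobs use sam.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class Job:
    """Hypothetical job record; the field names are assumptions."""
    job_id: str
    submitted: datetime
    sam_project: Optional[str] = None   # None for jobs that do not use sam
    sam_project_active: bool = False    # only meaningful if sam_project is set


def flag_abandoned(jobs: List[Job], now: datetime,
                   max_age: timedelta = timedelta(days=60)) -> Dict[str, str]:
    """Map job_id to 'purge' (very old) or 'review' (sam project ended)."""
    flagged = {}
    for job in jobs:
        if now - job.submitted > max_age:
            # Very old jobs are the one case considered safe to purge.
            flagged[job.job_id] = "purge"
        elif job.sam_project is not None and not job.sam_project_active:
            # The sam project is gone but the job remains: ask a human.
            flagged[job.job_id] = "review"
    return flagged


if __name__ == "__main__":
    now = datetime(2015, 11, 9)
    jobs = [
        Job("old-job", datetime(2015, 7, 28)),
        Job("fresh-job", datetime(2015, 11, 8)),
        Job("orphan-job", datetime(2015, 11, 1), sam_project="example_project"),
    ]
    for job_id, action in flag_abandoned(jobs, now).items():
        print(job_id, action)
```

Keeping the age rule separate from the sam rule means that jobs without a sam project are still covered by the age cutoff, which was the only criterion everyone seemed comfortable automating.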
- ND Top-Up MC: The generation is still in progress. It had been going smoothly up through yesterday, at which point a large number of jobs became held. Currently there are about 22k/35k files left to generate.
- Mini Production: ND CRY generation is now complete. Paul will send around datasets, at which point Satish will file a request with OPOS to process the sample through the calibration chain. Paul now needs to generate the FD CRY sample. We are still waiting on the fix from Robert Hatcher to enable generation of genie with 2p2h.
- Shifted ND position MC: Paul ran tests, but they failed. He has coordinated with Gavin, who now understands the issue. He will commit the fix for the problem, which is a pattern mismatch in GenieGen.fcl.
We will make an exception in this case, and back port the fix to FA14-10-03x.d (which has not been used for anything yet).
- Horn-off MC: Enrique will work on this, but it requires the same updates needed for the shifted ND position MC. It is on hold until those issues are resolved.
- Ideal conditions: Tapasi has completed generation of all the extra ideal conditions MC. There were some intermittent issues with jobs not being able to find flux files at SMU.
- Raw2root: This is running smoothly. Some jobs were failing on Saturday, and an incident has been opened.
- NuMIReco Keepup: This hasn’t been restarted yet, but will be resumed today. This is an urgent matter.
- BNB Reco Keepup: This hasn’t been restarted yet, but will be resumed today. This is not as urgent a matter.
- Calibration: This has been proceeding, but has been plagued by some problems (BadChannels, RemoveBeamSpills…). These should be addressed with the new keepup tag.