Paul R, Enrique, Justin, Qiulan, Alex, Paul S, Felipe, Tapasi, Kirk, Biao, Joseph
We will have a number of new processing requests coming in soon. This will include a number of updates to the simulation from Adam, as well as a new version of genie that supports generation of the 2p2h process. Today at the sim meeting, we will decide what changes will go into the tag that will be used for generating the new samples. We will first generate new calibration inputs, including new cosmic ray MC and new data calibration inputs. In parallel, we will also need to generate new genie MC. The calibration group will produce a new set of calibrations, after which we will reconstruct the new genie MC using those new calibrations.
Satish will be on vacation Sunday Sep 27 - Saturday Oct 3, during which he will be completely unreachable. He will also be on vacation Friday Oct 9 - Monday Oct 12, during which he will be in sporadic contact.
Chris has made some commits to address the 1.5 us timing offset bug. These changes will break reconstruction of files not produced with the same commits. A warning should be sent to nova_offline, but it was felt that Chris should send the message when he gets back from traveling, as he is the one with the relevant expertise here.
SW and Tags (Paul S)
There have been no problems with the nightly builds. Paul cut a new tag last week, and updated redmine with the list of changes that went into this release. It was rather extensive, as it had been some time since the last tag. A new tag is slated for this Friday. Paul added the new ShowerE package from Fernanda to the nightly build.
Filtering and Merging (Kirk)
Kirk presented some studies on the impact of using CosVeto to reduce file sizes and speed up reco processing. This will hopefully allow us to reduce the computing requirements needed for production, in particular for far detector processing. At present, CosVeto uses tracks from CosmicTracker. There is potential to use WindowTracker instead, but first the cuts would need to be tuned.
Signal loss is around 0.2% for both numu and nue, and the NC loss is <1%, whereas 65% (old) to 90% (new) rejection is observed in the cosmic samples. Significant reductions in file size and processing time are observed when CosVeto is turned on at all, but the differences between old and new seem rather minimal. This appears to be correlated with the number of hits in slices surviving the cuts, and to arise because a substantial fraction of the file size comes from CalHits, which is run before CosVeto. In both cases the reduction in file size does not seem to be enough to allow the scale of file merging we had envisioned.
Justin will investigate the contributions to file size and report on his findings next week. He should investigate both where the big contributions are coming from, and whether there is any redundancy within the files that could be trimmed.
There was some discussion of how the CosVeto cuts were optimized, and whether this could be revisited. It is possible, but Kirk is skeptical that it will be fruitful.
Alex raised the possibility of reducing file size by throwing away information (hits) in slices that do not survive CosVeto. This may be technically challenging to do, and care must be taken to ensure that the utility of the output files is not compromised.
Offsite Status Report (Enrique)
Enrique’s tests have had the most success at SU-OG, Caltech, FZU, MIT and (fraction of jobs that run to completion). Omaha and Michigan have also done well, but not as well. However, at some of these sites, jobs have tended to idle for a long time. At quite a few sites, jobs have idled past the 24h time limit (Wisconsin, UCSD, TTU, UChicago, SMU_HPC, OSC, Nebraska). When not directed to a specific site, jobs tend to go to MTW2, but jobs there tend to crash (this is understood to be a result of missing software packages).
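As an illustration only, the two quantities discussed above (fraction of jobs that run to completion, and number of jobs idling past the 24h limit) could be tabulated per site along the following lines. This is a minimal sketch and not Enrique's actual tooling or ranking metric; the record layout (`site`, `completed`, `idle_hours`) and the function name are assumptions made for the example.

```python
from collections import defaultdict

IDLE_LIMIT_HOURS = 24.0  # the 24h idle limit mentioned in the notes


def summarize_sites(jobs):
    """Summarize test jobs per site.

    `jobs` is an iterable of dicts with the (hypothetical) keys
    'site' (str), 'completed' (bool) and 'idle_hours' (float).
    """
    totals = defaultdict(lambda: {"n": 0, "done": 0, "over_idle": 0})
    for job in jobs:
        entry = totals[job["site"]]
        entry["n"] += 1
        entry["done"] += int(job["completed"])
        entry["over_idle"] += int(job["idle_hours"] > IDLE_LIMIT_HOURS)

    # Convert raw counts into a completion fraction per site.
    return {
        site: {
            "completion_fraction": entry["done"] / entry["n"],
            "idled_past_limit": entry["over_idle"],
        }
        for site, entry in totals.items()
    }
```

Sorting the returned dictionary by `completion_fraction` would give a simple ordering of sites, though it would not reproduce the 1-16 score described in the slides.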
Enrique has also assigned a score (from 1-16) to all sites, a ranking metric that is described in his slides.
Alex has commented that job memory requirements will start to be enforced at Fermilab job slots. This means that if you request 2.5 GB of memory, and your job consumes more than that, the job will be killed. We need to characterize how much memory all of our jobs take.
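One possible way to start characterizing memory use is to wrap a job command and read back its peak resident memory afterwards. The sketch below uses only the Python standard library (`subprocess` and the Unix-only `resource` module); the function names are hypothetical and this is not an existing production tool. How the batch system itself measures memory for enforcement may differ from this number.

```python
import resource
import subprocess
import sys


def peak_child_memory_gb(cmd):
    """Run `cmd` (a list of strings) and return (exit code, peak RSS in GB).

    The peak is taken from RUSAGE_CHILDREN, i.e. the largest resident set
    size among child processes this wrapper has waited for so far.
    """
    proc = subprocess.run(cmd, check=False)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    peak_gb = usage.ru_maxrss * bytes_per_unit / 1024 ** 3
    return proc.returncode, peak_gb


def exceeds_request(peak_gb, requested_gb):
    """Return True if the measured peak is above the requested memory."""
    return peak_gb > requested_gb
```

Because `ru_maxrss` is a high-water mark over all children of the calling process, each job type should be measured in a fresh wrapper process. For example, `exceeds_request(peak_gb, 2.5)` would flag a job that would be killed under the 2.5 GB request used as an example above.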
Alex has sent Enrique tools for offsite job monitoring that are used by mu2e. Enrique is investigating them.
First tests on 2p2h (Paul R)
Paul presented his work on getting ready to run the simulation jobs to generate 2p2h events (either as separate jobs or as an additional process). He has received instructions from Robert Hatcher on how to run these jobs, but the requisite splines are not available yet. Paul didn’t have an ETA on those splines, but in the simulation meeting, Robert reported that he expected them very soon.
In the meantime, Paul has been submitting test jobs. The first set of tests crashed with an error that he has identified and fixed. He now needs to submit another round of test jobs.
Respinning Strategy (Paul R)
Paul also presented plans for running the first analysis respin (needed for the NC/Sterile analysis) on the Amazon farm. We will be charged for both CPU cycles and data storage. In principle, these jobs need only to run pidpart, mixing and caf making, so we can reduce costs by producing a single set of jobs that do all three steps at once. It is necessary to first drop existing presel objects. Paul is in contact with Andrew Norman to determine if this can be done at the start of the job.
Redmine Review (all)
The generation of extra ND Monte Carlo is nearly done (only 100 files to go). Enrique is following close behind. Tapasi has not been able to submit jobs yet as she did not have production privileges. Update: Tapasi’s issue seems to be resolved now.
ND Top-up data processing is now complete. Generation of the corresponding MC jobs is commencing now. Paul will send around the base dataset definitions.
Keepup Status (OPOS)
Raw2root keepup (Felipe)
Some errors associated with ifdh timeouts were observed but the errors appear to have been transient. Jobs are running smoothly now.
Calib keepup (Qiulan)
- The "No mask found in DB" error was observed, but it has not recurred since last Monday.
- Some output files failed to transfer, but they transferred successfully on retry.
- Some output files were duplicated. Qiulan believes that some jobs run for so long that the next keepup project processes the same input files again before the FTS declares the outputs.
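The duplication described in the last item could be checked for by grouping output files by the input file they were made from. The following is a minimal sketch under stated assumptions: the `(output_name, input_name)` pairs would have to come from whatever bookkeeping is available, and the function name is hypothetical rather than part of any existing keepup script.

```python
from collections import defaultdict


def find_duplicate_outputs(records):
    """Return {input_name: [output_name, ...]} for inputs with more than one output.

    `records` is an iterable of (output_name, input_name) pairs; this layout
    is an assumption made for the example.
    """
    outputs_by_input = defaultdict(list)
    for output_name, input_name in records:
        outputs_by_input[input_name].append(output_name)

    # Keep only inputs that were processed more than once.
    return {
        input_name: outputs
        for input_name, outputs in outputs_by_input.items()
        if len(outputs) > 1
    }
```

An empty result would mean no input file produced more than one output in the set that was checked.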
This is about 80% complete and should be done in a few days. There were some issues with dCache, now resolved, that slowed things down a bit.
As of this meeting, Tapasi had not received production credentials. Update: Alex prodded the service desk, and now she does have the credentials.