22 June 2015
Satish, David, Paul, Jonathan, Susan, Chris, Joseph, Ruth, Alex, Ryan, Neha, Ken.
Computing Issues (all)
Ruth and Satish have been having problems with jobs crashing at SMU. In Ruth’s case, her jobs cannot find the flux files. For Satish, the jobs have difficulty finding libraries. Ruth will follow up with Gavin, and Satish with the service desk.
The IFBeam DB fix was pushed out everywhere last Wednesday, but we are still seeing short queries. Neha will try to get more information to Satish to sort this out.
SW Tags (David/Jonathan)
We had been planning to cut a new tag to use the new version of ART. Unfortunately, the new version (actually NuTools) has a new package, Database, that collides with a NOvA package. This is preventing us from adopting the new version of ART. Yet another version will soon be released that changes the name of this package to avoid this collision.
The FD Birks-Chou B & C tau jobs still have a few files to go. For ND, Paul is having trouble finding the flux files for jobs running at SMU, which has delayed production of these files. From subsequent email discussions, this is in part due to some changes to art_sam_wrap.sh which need to be pushed to FA14-10-03x.c (they are already in FA14-10-03x.a).
Paul is also working to get the Alternate Intensity files going.
ND CRY: At present, 17k/50k files are done. Ruth had started to submit jobs onsite to increase throughput, but saw a high failure rate. She is investigating the issues now. The job failures at Harvard have been traced to required packages that were not installed. The Harvard admin is on vacation, and will address these issues when he gets back. In the meantime, it’s best to avoid submitting jobs to Harvard.
Topup FD MC
The jobs have been submitted but have not yet finished. After doing the POT math, Ruth has decided on an additional 500 files for each flavor set. She will also add the FZU sites back into her submissions.
Joseph is waiting for file retirement to complete before submitting jobs again. Satish has finished retiring the files from SAM, but because of permission issues he encountered when first getting started, some files have not been deleted from their physical locations. He will attempt to get that resolved today.
BirksB samples: mostly done, except for a few stragglers. Satish will drain the datasets this evening (note that lemsum and pidpart files were not showing up in SAM; see below).
BirksC samples: nonswap and flux swap are well underway, but not yet done. Tau jobs were submitted at the same time, but have not yet started running.
Calibration shifted ND MC files are well underway. Completion is still estimated for later today, except for a few stragglers.
Birks is on hold until the lemsum files show up in SAM. Satish is confident that the files have been produced and copied back, but for some reason they are not showing up in the datasets. Satish will check whether this is an FTS issue. Update: Satish has identified the issue and restarted FTS.
For Calib shifted ND MC, Satish’s definitions don’t work properly because nova.subversion was not set correctly. In the meantime, Chris has put together workaround definitions that enable him to submit jobs anyway. The subversion will be corrected once the jobs are complete.
MRE jobs are on hold in the BG. Chris also has a huge pile of jobs from Nitin. He’d love to get these into production, but it seems now is not the best time to make that effort.
Mix/CAF (Gavin, Bruno)
FD fluxswap and nonswap calib shift MC is done. BirksC, BirksB, and calib shift ND MC will begin when they become available.
Gavin helped Paul debug submissions.
Mixing/cafing of data is done for now. ND data has been drained. There are a number of instances where the lempart file had a location but was not actually present. These files need to be retired, but that has not been done yet.
Raw2Root Keepup (Paola)
Raw2root has been running smoothly since last week. There have been some submission errors, some of which should have been resolved. The jobs themselves have been running without issues.
Calibration Keepup (Paola)
Jobs at FZU were crashing because of copy-back errors. Paola opened a SNOW incident and, for the time being, has stopped submitting to FZU. Jobs at other sites seem to be running OK. About a week ago, some jobs started to fail because the calibration keepup uses an old version of BadChanList. We should update to a new tag for calibration keepup. Satish will follow up for the next keepup request.
Reco Keepup (Vito)
Early this week, reco keepup was running OK, with some transient FTP errors on copy-back. Mid-week, jobs started crashing due to BPF failures (fixed in subsequent tags) and missing diblock mask errors. The latter occurred because the diblock mask had not yet been inserted into the database. FD reco was therefore paused until this is resolved.
For ND, as of Saturday and Sunday, no new files were being seen. Satish commented that he was on shift, and files seemed to be present. Paola commented that raw2root keepup was proceeding smoothly. Paola and Vito will investigate to see where the files were going missing.
Bruno suggested updating the reco tag. Satish will suggest a tag, and Bruno will do some quick tests to verify that it is OK.
Question about FZU: do we want dedicated storage to use as a staging area for data? The consensus was yes. Satish will send an email to Tanya.
Qiulan requested that we update submit_nova_art.py to add MIT to the list of supported sites. OPOS would like to use this for calibration keepup.