Attending: Satish, Susan, Paola, Felipe, Justin, Chris, Paul S, Paul R, Enrique, Joe, Bruno, Tapasi
SW Tags (Paul S):
Development has been upgraded to art 1.17. This works fine for SLF6, but breaks SLF5. Alex sent around a message asking whether anyone is using SLF5 and indicating that we would like to drop support for it, but has gotten no reply. There was some discussion on the list about our use of nutools v1.16 instead of v1.17. Paul has contacted Lynn Garren to understand the differences. Provided they seem innocuous, he will upgrade us to the latest version.
Bruno asked when there will be a new tag, as he is anxious to see reco keepup resume. There are some issues that need resolution before the next tag can be made. Satish will follow up with Alex and Paul to ensure that these are addressed promptly.
Paul will be traveling to the UK for a few weeks. He will still be available through the 20th or 21st, after which he will be on vacation up through the 29th.
Last night's nightly build had issues related to cafmaker and ANG. Paul will follow up with the ANG group.
Offsite Status Report (Enrique)
Jobs at the UChicago and MWT2 clusters have been crashing with errors in our software. As these sites tend to start running our jobs faster than others, this is especially problematic. There was some speculation that this was a CVMFS issue, but it was not clear. Enrique will send an email to the list to get to the bottom of this.
Enrique also commented that Tom has about $10k to spend on computing resources that we would like to direct at NOvA. He’d like feedback on the most effective way to spend those funds. Tom will send an email to Satish and Alex.
Jobs are not starting at OSC. Enrique is in contact with the admins there to sort out the issue.
OPOS Monitoring Report (Felipe)
There are a large number of error files in the raw-data FTS dropboxes. The run coordinators were notified on the 24th, but have not replied. Satish will follow up with the RC team.
There were also a large number of errors in the offline FTS dropboxes. Felipe will send around another notification.
File Size Report (Justin)
Justin has created a new DataQuality package to allow storage of the information from flatdaq needed for DQ evaluation. This would be run at raw2root time, so to take advantage of it, we’d need to change the tag for raw2root and reprocess existing artdaq files. Susan asked why it was necessary to drop the flatdaq object rather than simply never build it. The reason is that if the object is never built, the DQ information never becomes available for the new package to extract. Chris suggested this could also be solved by merging this code into the existing daq2rawdigit package, but commented that the proposed solution would also work fine.
Processing Status Reports
Raw2root keepup (Felipe)
This is running smoothly. OPOS is processing 6500 files/day in 338 jobs. No errors have been seen.
Horn-off data/MC (Enrique/Chris)
The MC CAF jobs have not started yet. They should take three hours to finish.
A significant number of the data jobs have been crashing with memory issues, both offsite and onsite. Chris asked what exactly the error message was. Enrique will send a message to the list. There are 1200 files, about 30-40% of which have been processed so far. Enrique should also send around the dataset definitions.
ND New Position MC (Enrique/Chris)
Several files failed during the reco stage. Enrique will reprocess them today. The majority of the remaining pid/caf files should be done today.
FD New Position MC (Joe/Chris)
This is done, except for the decaf step. Satish will follow up with Bing.
ND Mini-prod calibration (Qiulan/Paola)
The cosmics triggers are complete, except for twelve crashed files. The DDActivity triggers are also nearly complete, although there were a number of crashes. Paola will reprocess these files to see if the problem recurs. ND epoch 3b is on hold.
FD Mini-prod calibration (Qiulan)
All the files should be available now.
ND Mini-prod CRY+Calib (Bruno)
This is nearly complete. Somehow the later tiers had fewer files than expected. Bruno is topping up these samples now, and they should be done shortly.
FD Mini-prod CRY+Calib (Paul R)
Jobs over the weekend were held by condor because of requested memory. Paul resubmitted, and the jobs are running now.
FD Mini-prod GENIE (Tapasi)
Tapasi has been trying to get jobs running since last Thursday. The initial issue was that her jobs could not find the flux files; this was fixed by using the fluxloc scratch options. She observed that her jobs were idling for a long time, although Paul commented that the jobs were probably crashing before establishing a process with SAM, so that, based on the project monitor, it would appear that they had not started. Since Sunday her jobs have been seeing a "no fluxloc option" error. Paul commented that this was likely because the new position MC requires a custom script for copying the flux files to the job node. If the flux files were copied out of the scratch area to the usual area, this wouldn’t be needed. Satish will follow up with Raphael to ensure this happens.
Amazon Running/NC Respins (Paul R):
Last week, Paul S updated CVMFS with the new ifdhc needed for using S3 at Amazon. This part of the process now seems to be working.
However, jobs are now failing in runNovaSAM.py. The error thrown is "Invalid syntax," but the syntax is clearly correct. Paul will attempt to debug this issue.
NOvA is currently running 5k jobs, well over our quota. That might be delaying our onsite jobs. Fermigrid is quite busy, mostly because of us.
Anna commented that we need to move the novapro crontab from gpsn01 to novagpvm10. Since this is happening soon, it is an urgent matter. Right now some of the cron jobs copy over novapro Kerberos tickets, which are needed because of a check performed by submit_nova_art.py. But these tickets themselves cause problems, so Anna proposes modifying submit_nova_art.py to disable this check. Vito will send around a proposed patch for Satish to review. Paola commented that this is probably not the most robust way forward, and there was general agreement, but a better fix would take time to implement properly. Because this needs to be resolved promptly, we will probably use Vito’s patch for now and switch to a more robust method in the future.
A new document from Neha is forthcoming that sets out security rules for the accounts used by OPOS (and others). Satish had some objection to the sixty-minute timeout on the novapro account. Anna will communicate Satish’s concerns.