Matthew, Jonathan, Bruno, Craig, Chris, Joseph, Michael, Jeny, Paola, Alex, Ryan, Gavin, Gabriele, Susan, Nick, Dominick, Ruth, Satish
- Calibration identified a problem with ND files missing hits in the muon catcher. The underlying issue has been resolved - see below for processing information
- Data handling issues (from Gabriele): All the tickets boiled down to 2 issues: (1) INC000000516732, an inefficient query: we are now coding an equivalent query with much higher performance in SAM. (2) INC000000513512, a database inefficiency problem: this was solved. The hardware has tiered storage (with fast and slow disk), and the SAM DB VM was mistakenly using a slow disk. Storage was remapped yesterday (Mar 19) and we observe that several queries are now 4-5 times faster. We need to keep an eye on the batch system efficiency to verify that this was the main culprit for the inefficiency.
- Job submission issues (from Gabriele): job submission does not gracefully handle a large number (thousands) of concurrent job submissions (INC000000513510 & INC000000513517). The problem seems related to the OS configuration of our submission machine. We increased the number of file descriptors on the submission server by an order of magnitude on Mar 18. The system is fully functional now. We are planning to test the system with a large number of concurrent submissions. This test will give us some confidence that this solution solved the issue, although, since this seems related to some race conditions, it may not be definitive proof. We'll monitor production closely in case this issue recurs.
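As a quick illustration of the file-descriptor limit mentioned in the job submission item, the sketch below reads the open-file limits seen by the current process on a Unix-like host. The helper names and the `fds_per_submission` figure are hypothetical; the notes do not say how many descriptors one submission actually consumes, so this is only a rough sanity check and not a description of the real server configuration.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_hold(concurrent, fds_per_submission=4, soft_limit=None):
    """Rough check: would `concurrent` submissions fit under the soft limit?

    fds_per_submission is a hypothetical estimate, not a measured value.
    """
    if soft_limit is None:
        soft_limit = fd_limits()[0]
    if soft_limit == resource.RLIM_INFINITY:
        return True  # no limit configured
    return concurrent * fds_per_submission < soft_limit


if __name__ == "__main__":
    soft, hard = fd_limits()
    print(f"soft={soft} hard={hard}")
```

Calling `can_hold(1000, soft_limit=1024)` versus `can_hold(1000, soft_limit=10240)` shows how a tenfold increase in the limit changes the answer under these assumed numbers.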
The issue as outlined in an email from Gabriele:
The NOvA software distribution is very large, and publication and distribution to the Stratum 1s takes several hours (36 h last time). While publication happens at the shared OSG repository (OASIS), other VOs cannot use the system. We [CD in discussion with Craig and Andrew] agreed that NOvA will start using the Fermilab local repository: being dedicated, it has less impact on the OSG community. We also agreed that NOvA will reduce the size of the distribution by removing flux files from OASIS. Flux files in the local repository (if already installed) will not be removed for now, but should not be used unless previously agreed. Flux files will be kept at Fermilab on the BlueArc. Future campaigns using old software distributions will get flux files from BA. [NOvA] will validate results running from the BlueArc by Apr 14, so that NOvA can remove flux files from OASIS (the date is tied to an upgrade of cvmfs at OSG but can be pushed 2 weeks if the deadline is missed). For the production next week, we agreed that NOvA will still use flux files from OASIS at OSG dedicated resources, after an initial slow pre-stage to the remote squid caches, happening today. Future production campaigns with new software distributions will use a new Genie Helper that uses ifdh for transferring the flux files from dCache. These campaigns will be able to run everywhere on OSG. Ruth and Gavin will test the new distribution method after the current production run. Marc Mengel will help tune the flux file delivery.
Two important issues are raised above: firstly, that NOvA must migrate to a new CVMFS repository; and secondly, that we need to fundamentally alter how we handle flux files.
We need to be able to make files in old releases for our first analysis. Ryan pointed out that this will involve running thousands of jobs in the coming few months. As a result we need a plan to be able to remake these old files. Any such plan will need to be scalable. After some discussion it appears that there are a few options:
- Put the flux files on BlueArc and use CPN (or on dCache and use IFDH) to retrieve them during grid jobs. The option of putting them on dCache should avoid having to worry about locks and would be more scalable.
- Declare the files to SAM and have them fetched through IFDH.
Of these, the first option is easiest. To implement this, Gavin will need to change a few lines in the scripts that he and Ruth are currently working on. They'll then have to test scalability. It was pointed out that this fix will not work for offsite jobs (the second one should).
The plan is for Gavin and Ruth to test this option in parallel with producing files. They will aim to have concluded before April 14th (the deadline discussed above). We can push back on this deadline, but would like to avoid doing so. Following this, Gavin, Ruth and the experts will help to validate the Genie Helper function that will provide our future mechanism for handling flux files.
During this discussion it was pointed out that if we want to continue using the current model of flux files based on CVMFS in the near future, then we should pre-stage datasets to the remote sites that we want to use. This is apparently as simple as a wget command; however, it should be done by hand.
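To make the wget-based pre-staging concrete, here is a minimal sketch that only builds the command lines, one per file, so they can be reviewed and then run by hand. Everything specific in it is an assumption: the example URLs and proxy host are placeholders, and the idea that fetching each file once through a site's squid proxy is enough to warm that cache is our reading of the discussion, not a verified procedure.

```python
import shlex


def prestage_commands(file_urls, squid_proxy):
    """Build one wget command per URL, routed through the given proxy.

    The download itself is discarded (-O /dev/null); the only purpose is
    to pull the file through the proxy. URLs and proxy are placeholders.
    """
    commands = []
    for url in file_urls:
        commands.append([
            "wget", "-q",
            "-e", "use_proxy=on",
            "-e", f"http_proxy={squid_proxy}",
            "-O", "/dev/null",
            url,
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    urls = ["http://repo.example.org/flux/file_0001.root"]
    for cmd in prestage_commands(urls, "http://squid.example.org:3128"):
        print(shlex.join(cmd))
```

Printing the commands instead of executing them keeps the step manual, in line with the note that pre-staging should be done by hand.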
As an aside, the reason for having to remove old flux files from OASIS is that after the upgrade, there will be no space for our flux files on the new server.
Andrew started work on this last week. Jonathan has begun to catch up on what's been done since his return. He reports that it appears that some, but not all, software has been published so far. Jonathan is in contact with Andrew about this, and is currently waiting to hear back from him. It was pointed out that the partial publish may be related to a real problem. Jonathan will open a ticket about this after this meeting.
Good runs list status
The good runs lists are final as far as experts in DQ are aware. Ryan pointed out that we can always easily cull runs found to be bad after the fact, but that it is harder to add runs back in.
Bad channels status
The bad channels processing needs to be re-run once all the bugs are tracked down. This should be completed before the end of the week, all being well.
The good news is that Ruth has managed to make the first set of validation files. She started with the rock secondaries, ran 400 jobs and got 180/400 files back; another ~200 jobs looked good but seemed to hang and then die. She is not sure what happened here yet. Gavin and Ruth will work on this.
During the above discussion Ryan pointed out that we should be able to use the existing rock secondaries. This had been overlooked by Matthew. As a result, the new plan is for Ruth to re-assess whether we have enough existing files and, if not, to top them up, and then to proceed to generating validation files. The plan is to submit two 10% samples, one onsite and one offsite, to ensure the files are made as quickly as possible. We will then ask simulation to validate these files before proceeding to generate the full set.
There was some discussion about whether the problem with Ruth's hanging jobs was the same as a problem reported by Jeny (see below). Ruth will investigate and post to the ticket if applicable.
This was started on Friday March 13th and took a week to complete. In total 151k files were produced. Paola reported lots of delays related to issues discussed above. 14 files failed in the end due to incomplete metadata: online.runnumber is not filled. Paola reported these on the ticket. The plan is for her to retry once more then we will forward any remaining issues on to the raw format experts (the Indiana group was suggested here).
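A small pre-declaration check could flag files whose `online.runnumber` is unfilled before they reach the ticket stage. The sketch below assumes the metadata for each file is available as a flat Python dictionary, which is an assumption about the format. Only `online.runnumber` comes from the notes; the function name and the sample values are hypothetical.

```python
# Only online.runnumber is taken from the notes; extend as needed.
REQUIRED_FIELDS = ("online.runnumber",)


def missing_fields(metadata, required=REQUIRED_FIELDS):
    """Return the required keys that are absent, None, or empty strings.

    A value of 0 counts as filled, since it may be a legitimate number.
    """
    return [key for key in required if metadata.get(key) in (None, "")]


if __name__ == "__main__":
    # Hypothetical metadata records, for illustration only.
    good = {"online.runnumber": 12345}
    bad = {"online.runnumber": None}
    print(missing_fields(good), missing_fields(bad))
```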
The ND raw2root ticket was opened on Friday. Paola set the projects going on the same day and they all completed that day. FTS wasn’t getting staggered files as no FTS configuration was set up for them. This configuration was added and all files are SAM available now.
All ND triggers are complete with around a 1% error rate. All the failing files fail with error code 65 - bad channels fails - Paola will put these on the ticket and we will communicate this to the experts. The MC is nearly done, only 250 files remain to be run, with no errors yet.
Done. Draining is done. Match with FD pre-shutdown.
Reco keep-up for new artdaq
This has been disabled since Friday. Jeny needs to retire the old files and delete them; Matthew should explicitly request this on the ticket.
CosmicVeto for nu_e
Nu_e have requested that the cosmic veto object be written into numi files but not cut on. Kanika will add this functionality.
We still can’t start samweb projects with the NOvA production role, but we can use jobsub itself to start the project. Dominick would like to know why the GRIDUSER environment variable has been changed. Is it a jobsub bug? Satish thinks that he should be able to get something that works soon.
- Script for moving error files out of the dropbox. Would like it to be put into the crontab. Matthew and Jonathan to do this.
- rdcol error in cmap error. This should be fixed. Do we have a solution? Matthew to find what is written down and poke it in Chris’s direction.
- raw2root keepup is missing subrun zeros with bad event times. Is a fix in place? Matthew can open a ticket on that on Wednesday.
- Dominick asked how to build someone's own software as part of the nightly build. Chris replied that if this is a UPS product then we do it as part of our externals as normal.
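Returning to the first item in this list, the dropbox clean-up script could look roughly like the sketch below. It is not the existing script. The `*.err` pattern and the directory arguments are hypothetical placeholders, since the notes do not say how error files are named or where they should go.

```python
import shutil
from pathlib import Path


def move_error_files(dropbox, error_dir, pattern="*.err"):
    """Move files matching `pattern` from `dropbox` into `error_dir`.

    Returns the list of destination paths. The pattern is a placeholder;
    the real naming convention for error files is not given in the notes.
    """
    dropbox, error_dir = Path(dropbox), Path(error_dir)
    error_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(dropbox.glob(pattern)):
        if path.is_file():
            target = error_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

A function like this can be wrapped in a short command-line script and scheduled from the crontab at whatever interval suits the dropbox turnover.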