Attending: Alex, Adam, Satish, Vlad, Tapasi, Paul Rojas, Cristiana and Bruno
Report on training
- Vlad: after only 2 days of training, he has submitted his first jobs and monitored them. He had issues with a SAM project, but these were not caused by anything he did: Alex reports that there seem to have been widespread samweb stability issues, so he's going to open a ticket. On the samweb issues, Vlad reports that it sometimes took him up to 5 minutes to create a definition. He has also worked on creating definitions, following Enrique's instructions, and has added an ECL entry about this.
- Adam: he submitted LEM jobs with success. In fact, with too much success, to the point of hitting LEMServer a bit too hard. After adjusting the values, and with Paul's instructions, he has managed to submit jobs, and about 600 files have already been processed. Jobs are taking around 3 hours, which is a bit more than expected but not significantly so. However, these jobs seem to be using too much memory, so we need to investigate this further. Adam will work on creating definitions for these, updating the tag to prod2reco.d.
- Cristiana: she didn't have the chance to overlap much with Joseph. She will start working on making submit_nova_art.py capable of working with definitions created by sam for users.
General status of production
- We need to check the infrastructure for creating decafs, in particular for the unblinded cafs, before we start running on FD data. If this doesn't converge, we will start running without producing decafs and produce them later on.
- The production of concatenated files is on hold, as there still seem to be issues regarding the POT. Paul thinks this actually has to do with him not using the right location, so he's running a new test and will report on the outcome.
- ND calibration with Y-slope is almost complete
- Paul committed a fix to a fcl in novaproduction. This is to include the override in the LEM release, so that it is seen by LEMServer in the appropriate tag. We're going to need a new version of novaproduction soon, but for the time being he's just running with a local fcl. Alex requests that these jobs be tested with a lower memory request. If it works, it will become the new standard configuration.
- There are still O(20000) nogenierw files that need to be transferred from Amazon (up to a total of 60000). The draining jobs will need to be run several times to take this into account.
Declaring files on the nodes
Alex suggests that it would be very good to implement this feature, which has already been tested in Amazon, in FermiGrid too. But we will need a volunteer to pick up this project. Declaring the metadata should be the easier part and will make a significant improvement to our workflow (this being the most costly operation for the FTSs). Declaring the location will be more difficult, but it should be feasible too. A good way of testing this is by listing datasets and looking for files with a virtual location. This could also be an opportunity to train Adam on SAM queries.
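The check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the file names and the input format (a mapping from file name to its list of known locations, obtained beforehand from a SAM query) are all hypothetical, and a file is treated as having a virtual location when that list is empty.

```python
def files_without_physical_location(locations_by_file):
    """Return the sorted names of files whose location list is empty.

    Assumption: `locations_by_file` maps each file name to a list of
    location strings already fetched from SAM. An empty list is taken
    to mean the file only has a virtual location.
    """
    return sorted(
        name for name, locations in locations_by_file.items() if not locations
    )


# Hypothetical example data, not real files or paths.
example = {
    "file_a.root": ["tape:/hypothetical/path"],
    "file_b.root": [],
}

missing = files_without_physical_location(example)
print(missing)
```

Running this over every file of a dataset after the jobs finish would give the list of files whose location still needs to be declared.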
During the collaboration meeting, Dan Hershey showed that early runs had issues, in particular a mismatch in the number of good runs between data and Monte Carlo. Although low statistics may explain this, the whole good runs system seems too unstable. In particular, the fact that the good runs may change over time may lead to inconsistencies and adds all sorts of difficulties for production. Alex thinks a better system would be to query good runs from a database in real time, so he's setting up a meeting with the experts to work on this. If it works, good runs conditions will check an independent database instead of using the dq.isgoodrun flag. If this doesn't work or it's too difficult to implement, we might push this step to analysers.
Paul has tagged prod2reco.d, so we can produce deCAFs that are consistent with the new nue fiducial volume and avoid the LEM errors. Ideally, next deCAF respins will come from the analysis groups.
Satish has finished a script to produce these