Jonathan, Paul R, Enrique, Biao, Vito, Felipe, Qiulan, Chris, Joseph, Jon, Alex, Paul S
The raw2root version for ND has been updated to S15-08-12. This is updated for keep-up, and we are also back-processing old files with this tag. This has introduced a version incompatibility with the production of calibration input files. In principle we could go back to an older release, but Satish understands that the calibration group will be updating their procedures soon anyway, so he has requested that OPOS pause ND calibration keep-up.
We had our first novasoft meeting yesterday. At that meeting it was decided that we should pursue the migration to git, but cautiously. There are some authentication issues to sort out before any changes are made, but once we do, it is likely that we will use the production script packages as a test-bed. So this group may be among the first to encounter any problems with the migration. In the longer term, we also anticipate that we will require more validation before tagging a new release. This will require some cultural changes on our part, and also some changes to our workflow.
This could also change our current biweekly tagging schedule. The original motivation behind this somewhat arbitrary schedule was to give developers a warning and regular expectation of when tags would be coming.
At yesterday's liaisons meeting, Alex was informed that priority groups have now been rolled out. However, only priorities have been set, not quotas. Alex proposes that OPOS use the standard production group priority group until the quotas have been set. Alex is awaiting an email announcement indicating that we can start using them. Some code changes on our end may be needed to take advantage of this. Satish will follow up with Alex on the details.
SCD was receptive to the changes to sam4users that we decided to request at the production workshop. They are starting to work on them, and anticipate a new version on the timescale of a month.
SW tags (Paul Sail)¶
Paul has been keeping an eye on our nightly builds. A problem with the builds occurred over the weekend, but with help from Gavin and Jonathan, Paul was able to identify and contact the relevant experts, who fixed the problem. He also pushed a new version of NovaGridUtils to CVMFS (v01.42).
Paul will be cutting a new tag tomorrow. For now we will stick to a biweekly tag schedule, but as discussed above this may change in the future. Normally a notification of a new tag is sent out in advance, but in this case we will not bother, as we have fallen behind our regular tagging schedule, and we'd like a new one rather soon. Future warnings of a new tag will be given out a week ahead of time.
Database Tagging Procedures (Jon)¶
At the production workshop, some concern had been expressed about the infinite lifetime of db query caches that were implemented to deal with the very large number of queries coming from the big FA production run. The concern was that because the caches have an infinite lifetime, updates to the database would not be picked up by subsequent queries, if the query results had already been cached.
To deal with this, Jon proposes that we maintain infinite lifetime for queries with a tag, but that keepup queries would run without a tag. This means that keepup jobs will use the latest and greatest results from the database. In the meantime, big production runs will need to use a new tag for bad channels and all other tables. This will require that production inform offline (and the DB table maintainers) that new tags are needed before a new campaign begins. Thereafter, adding new data will require new tags to be generated. Jon notes that a new run of the bad channel evaluation will be needed at some point anyway.
This does have the potential to induce delays in the schedule for production processing. As noted, it also requires good communication between the production group and the DB table maintainers. Finally, it is to be expected that early in the next campaign, we may well face a large number of database connection failures as many queries are made before the cache has been sufficiently built up. It was commented that we need to be sure that jobs really do crash in all of these cases. Jon also commented that it would be good to identify this particular issue as a consequence of the early running and not raise alarms unnecessarily.
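The proposed policy (tagged queries cached indefinitely, untagged keepup queries always going to the database) can be illustrated with a minimal sketch. This is not the actual database interface: the class `QueryCache`, the `fetch` callback, and the table and tag names used with it are all hypothetical and exist only to show the intended behaviour.

```python
from typing import Callable, Dict, Optional, Tuple


class QueryCache:
    """Toy model of the proposal (hypothetical, not the real DB code).

    Queries made with a tag are cached with no expiry; queries made
    without a tag bypass the cache and always reach the database.
    """

    def __init__(self, fetch: Callable[[str, Optional[str]], object]):
        # fetch(table, tag) stands in for the real database call.
        self._fetch = fetch
        self._store: Dict[Tuple[str, str], object] = {}

    def query(self, table: str, tag: Optional[str] = None) -> object:
        if tag is None:
            # Keepup-style query: never cached, always the latest result.
            return self._fetch(table, None)
        key = (table, tag)
        if key not in self._store:
            # First tagged query fills the cache; it is never invalidated.
            self._store[key] = self._fetch(table, tag)
        return self._store[key]
```

In this picture, a production campaign that needs newer database contents has to request a new tag, because results already cached under an existing tag are never refreshed.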
Offsite Status Report (Enrique)¶
Enrique has been using MC checkout to test offsite jobs, and has been tracking job success rate, time between submission and termination, and idle time. He has found many jobs terminated after idling for one day without ever actually running, but the reasons for this are unclear. His jobs did request 2.5 GB of memory, which could explain some of the problem, but his earlier jobs only required 2 GB, and encountered similar issues. A long-term goal for the ART is to reduce its memory footprint so that we can request only 2 GB of memory, but that will be a large effort that takes quite a while to resolve. To help get to the bottom of some of these issues, Enrique will be in touch with the offsite admins. He will also be compiling a list of resources available to jobs (memory, disk space, etc.) at different sites.
Retirement Script Updates (Gavin)¶
This is done. It was observed that the syntax is somewhat arcane, but this may be OK as the retirement script is an expert tool. It was also observed that as this is a Python script, it should be easy to make it somewhat more user-friendly.
- ND extra MC: Of the 20k requested files, 5.5k are left to go, so this is a little over 70% done.
- ND Rock Nu: These jobs have started running. So far only 50/3700 files have finished. Runtimes are roughly 300 minutes/job. Aggressive use of offsite resources has been recommended.
- ND topup MC: Paul has generated the fcls. Before submitting jobs he first needs to fix a metadata bug.
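As a rough cross-check of the numbers quoted in this list, the arithmetic can be written out as below. The function names are illustrative only, and the estimate of remaining processing time assumes one output file per job, which the minutes do not state.

```python
def fraction_done(total_files: int, remaining_files: int) -> float:
    """Fraction of requested files already produced."""
    return (total_files - remaining_files) / total_files


def remaining_job_hours(total_files: int, done_files: int,
                        minutes_per_job: float) -> float:
    """Job-hours still needed, ASSUMING one file per job."""
    return (total_files - done_files) * minutes_per_job / 60.0


# ND extra MC: 20k requested, 5.5k left to go.
nd_extra_mc_done = fraction_done(20_000, 5_500)

# ND Rock Nu: 50 of 3700 files finished, roughly 300 minutes per job.
rock_nu_hours_left = remaining_job_hours(3_700, 50, 300)

print(f"ND extra MC done: {nd_extra_mc_done:.1%}")
print(f"ND Rock Nu job-hours left: {rock_nu_hours_left:.0f}")
```

Under that one-file-per-job assumption, the Rock Nu sample still needs on the order of 18,000 job-hours, which is consistent with the recommendation to use offsite resources aggressively.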
The reconstruction of the extra ND Monte Carlo had been paused because the inputs had the wrong metadata, so that the inputs did not show up in the draining datasets. This has now been resolved, and Enrique started submitting jobs again this morning. However, the pidpart files were not showing up in the pidpart definition. Update: Enrique has investigated this, and determined that the pidpart files were generated with S15-05-04, whereas the definitions assume S15-05-04a. This is because of the complicated history of the FA processing that required a respin of the pidpart files. He will now have to launch pidpart jobs with S15-05-04a.
- ND extras MC
- ND Top-up data
Both the extra ND MC and the stray file from the ND top-up data are in progress, but files are not showing up in SAM. Further investigation is needed.
No inputs are available at present, so there is no report.
calib keepup (Qiulan)¶
Keepup has been proceeding relatively smoothly. Three classes of error have been observed.
- No diblock mask found in db. This was because the masks had not been inserted in the database at the time of the error. Updates have resumed, and the errors cleared.
- A missing branch error due to the updated version of art used in ND raw2root compared to the version of art used for calibration jobs. ND calibration has been paused for the moment (see the news section).
- Two failures associated with the ifdh cp command. A new version of ifdhc has been rolled out that should be robust against the underlying problem (apparently having to do with an overloaded disk cache). But it was pointed out that it is not really possible to change ifdh versions for an existing tag, so we cannot simply update versions. The problem should be transient, and Qiulan will attempt to rerun the failed jobs.
raw2root keepup (Felipe)¶
This has been proceeding smoothly, and no errors have been observed.
raw2root backprocessing (Vito)¶
This has been proceeding smoothly. All the ND files have been processed. Three files have crashed -- they are all corrupt files that had been noted during previous keep-up tickets.
The FD NuMI stream had already been done and required no back processing. Of the FD cosmics, only two files are left. One file was corrupt; the other exhibited a strange error, noted in the ticket. Satish will follow up. The “others” stream contains 1.4 M files, and Vito has processed 10k of these. This is proceeding slowly because the files are not in the cache and need to be retrieved from tape.