Attending: Bruno, Satish, Paola, Joseph, Felipe, Alex R., Adam, Chris, Alex, Tapasi, Enrique, Jose
Overview and top-up
Jaroslav froze the data today, covering the final ND and FD runs. It is now in the hands of data quality. Jose is running the bad channels validation. It happens only for good runs, so there’s a bit of a race condition. It takes 1 day for FD, around 11 am. Kanika asks why bad runs are not validated. Bad channels are expected to be validated today or tomorrow, v14. ND will happen at the same time. Good runs are unknown as of now.
Still using dq.isgoodrun metadata; the database didn’t converge in time.
Second step is making the release, once we know the version of the database. Kanika also wants to add info to the deCAFs, so that should be in the next prod2reco.e tag. Plan to start work on Wednesday morning. Probably ready by next Monday.
Kanika was respinning CAFs; some of them failed, and the last file is taking a very long time. SAM losing track of the job seems like the most likely reason. Advice is to kill these jobs and drain. SAM instabilities are still occurring with high frequency. Snapshots are free from most of the issues with SAM.
Good runs stalled last week, but it seems to be working now.
Failures to open files in pnfs: are rare, but they keep hitting people. It’s a big issue when running big jobs on full datasets. Alex is going to push SCD for this to be fixed.
Enrique observes jobs taking too much memory offsite. That’s not a problem because we optimise and then drain, but the problem is maxConcurrent counts held jobs as running jobs.
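The accounting problem Enrique describes can be illustrated with a simplified model. This is only a sketch of the arithmetic, not the batch system's actual implementation; the function name and the job counts are hypothetical.

```python
def free_slots(max_concurrent, running, held, count_held=True):
    """Slots still available under a maxConcurrent-style limit.

    Simplified, hypothetical model: when count_held is True, held jobs
    occupy slots just like running jobs do.
    """
    occupied = running + (held if count_held else 0)
    return max(0, max_concurrent - occupied)


# Illustrative numbers only (not taken from the real submissions).
with_held = free_slots(4000, running=3000, held=1000, count_held=True)
without_held = free_slots(4000, running=3000, held=1000, count_held=False)
print(with_held, without_held)
```

In this model, jobs held for exceeding memory block new jobs from starting even though they are not doing any work.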
Tapasi has mostly completed this sample with 6% errors. She will work on the definitions and we’ll either re-run or drain. Tapasi sees differences in the extra metadata. Fluxswaps appear to use the old FCL file with pidpart files, so they are using the FirstAna version and don’t have CVN or others. We’re probably going to run again, but perhaps from the reco. Do we do a smaller sample or do we delay FD data? We’re going to need all the definitions and we’ll have to start re-processing. We’re going to need to retire the latest files. Tapasi and Joseph will work on the definitions, and Tapasi will send more jobs today.
Maintaining ~4000 jobs running offsite, with 4 submissions, 1 per epoch. The progress is slow but steady. Not many problems or errors. Probably about 1 more week to finish the whole thing. Samples are huge, but the progress is good. The 3b sample, the one running at SMU, is the one performing worst. Apparently there are nodes at SMU that are causing trouble, so Enrique will resubmit elsewhere. Enrique sees very long transfer periods, but it seems to be just because of the sheer number of files. Good news is that we’re not killing dCache. Alex is trying to get NOvA to access the LHC connection, which should make a difference. Excellent demonstration that we can run reconstruction offsite, except for LEMServer. Load balancing seems like the biggest issue.
All processes except for nue veto are done. There are a few failures, still not clear why. 100% efficiency is not expected, Alex points out. The caf_respins.inc memory limit seems small (800 MB). Kanika will resubmit after the veto cosmics are done. It’s possible that the problem was in declaring, which is easy to check by seeing whether their location is virtual. Respinning decafs may not be possible due to metadata and filename not being consistently filled.
Bruno will check the logs to see why a fraction of the jobs fail without an output.
Keep-up is running without major problems; raw2root files were stuck in the FTS last week waiting for an enstore location. On reco, there were a few error files, and OPOS are debugging. Stalled files are FD, which is a very high priority thing to resolve, so Paola will take care of this today. All data from today will have been processed by tomorrow noon.
The setup has pretty much converged into a final stage, so Satish is working on migrating more and more datasets into the new system. Next ones will be Kanika’s respins of the CAFs. Satish has a new version, not in the repository, which will replace the other one (because it overwrites names). ETA for completing migration is this week.
Multiple location dropbox
Satish has everything except for support from Amazon. Satish needs the actual S3 location; unfortunately each one uses its own, so perhaps there’s not a case for including Amazon in this system.
Defgen and metadata
Joseph committed a new version which should work for pretty much every sample. There are examples and options for backwards compatibility. There seems to be a lot of special-casing and details. The first thing is (no)rock and (no)genierw, which should probably be in the filename to make it easier to use. Essentially all MC has genierw except for special cases, so this should go into special. Same for no-rock. RHC needs work if we want to read it from the raw2root, but in principle it should be trivial to do it from run ranges. Tools might not be ready to prepare a scheme for this yet, but we want to start thinking about how we are going to approach it.
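As a rough illustration of the run-range idea for RHC, a lookup could be as simple as the sketch below. The run ranges, the function name and the "rhc"/"fhc" labels are all hypothetical placeholders, not real NOvA run numbers or the defgen interface.

```python
from bisect import bisect_right

# HYPOTHETICAL inclusive run ranges, for illustration only.
# Must be sorted by start run and non-overlapping.
RHC_RUN_RANGES = [(1000, 1999), (3000, 3499)]


def horn_current(run, rhc_ranges=RHC_RUN_RANGES):
    """Return 'rhc' if run falls inside one of the ranges, else 'fhc'."""
    starts = [lo for lo, _ in rhc_ranges]
    # Index of the last range whose start is <= run (or -1 if none).
    i = bisect_right(starts, run) - 1
    if i >= 0 and run <= rhc_ranges[i][1]:
        return "rhc"
    return "fhc"


print(horn_current(1500), horn_current(2500))
```

A table like this would have to be maintained by hand as new running periods are added, which is the trade-off against reading the value from the raw2root metadata.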