Notes taken by M.Tamsett & C.Group
Matthew, Jonathan, Alex, Craig, Eric, Adam, Kanika, Nate, Gavin, Ryan, Jeny, Paula, Chris, Satish & Dominick
Matthew spoke to Michael Gheith, who was receptive to the suggested extensions of the SAM monitoring web page. The two major suggestions were the ability to include history from before the start time of the daemon that generates the web page, and the ability to filter on more metadata and/or predefined definitions.
Action Item: Matthew will compile a list of definitions and useful metadata and send it to Michael for inclusion in the next iteration of the web page.
An alpha version of a web page to monitor the results of production tests has been deployed here. The website is mostly complete and only needs some final bugs ironing out with the back end and response to failure modes before it is ready for actual usage.
Action Item: Matthew will finish this off.
It was mentioned by Craig that it would also be good to collect statistics from run jobs and have these available in some comprehensible form. Eric has tools for collecting these from his jobs and Dominick has been working on something similar as well.
Action Item: Matthew, Eric and Dominick should meet and come up with a plan for this.
RAW 2 ROOT issues (Nate, Adam, Jeny & Paula)
Raw2root broke earlier this week but seems to be back up again now. It was suggested that this was likely a problem with jobsub and DAGMan that SCD caught and fixed. Jeny & Paula submitted their first full-fledged jobs on Tuesday and these were done by Wednesday - this is suspiciously slow w.r.t. Adam's jobs.
Action Item: A second round of testing will be run, then Jeny, Paula & Adam will meet next week for the hand-over.
pnfs down (Nate, Gavin)
“Catastrophic” - Gavin wrote a script to move some files, which broke things. Happily this wasn't a disaster in the end and all was recovered. Better protection against this sort of thing is needed in future. It turns out that the permissions were set wrong. Changes will be made to prevent this happening in future. Nate says that it still would have been impossible to actually kill the data. It was scary, but thankfully not catastrophic.
Action Item: Craig will follow up with Andrew and his group.
FPE's and WDA fix (Dominick)
See Dominick's slides (docdb-11873). FPEs are completely gone (within statistical errors)! This new-found stability allowed the identification of a new problem in Kalman track merge. The fix to libwda memory management has exposed a timeout problem between the web server and the (load-balanced) database. Other modules would fail “silently” with this bug, as there is no good way to know the number of rows in advance. DB experts are aware of this issue.
Metadata issues (Gavin, Adam)
Problems were introduced by the new ART version, which adds new and altered metadata parameters and then auto-fills them. For example: first event & last event (which should be ints) are filled as triples; ARTROOT was added as a file format (tier?) where we were using ROOT; extra fields are not given to SAM; the parents block is filled with full paths, which is not appropriate for SAM (it stomps on what Adam builds); and fcl parentage is not currently propagated correctly.
The ARTists probably need to talk to Robert Illingworth to check that the changes are workable with SAM, or give us an option to opt out and not be stomped on.
Nate talked to Chris Green about this. They thought they were filling parameters first, so that our stuff would overwrite theirs; this is not the case. The parentage stuff will be fixed in the next version of art. See: art issue 6823
Nate mentioned that a plugin type exists which can be used to insert metadata. Nate will send this to Adam.
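As a rough illustration of the clean-up that would be needed on our side if the auto-filled values cannot be switched off, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the key names (`first_event`, `last_event`, `file_format`, `parents`), the guess that the event number is the last element of the auto-filled triple, and the helper `normalize_metadata` itself. None of these are taken from art or SAM.

```python
import os


def normalize_metadata(md):
    """Return a patched copy of a metadata dict (hypothetical key names)."""
    out = dict(md)

    # Assumption: the triple is (run, subrun, event); keep only the event number.
    for key in ("first_event", "last_event"):
        value = out.get(key)
        if isinstance(value, (list, tuple)):
            out[key] = int(value[-1])

    # Map the newly introduced ARTROOT format name back to ROOT.
    if out.get("file_format") == "ARTROOT":
        out["file_format"] = "ROOT"

    # Reduce full paths in the parents block to bare file names.
    if "parents" in out:
        out["parents"] = [os.path.basename(p) for p in out["parents"]]

    return out
```

The function returns a copy rather than editing in place, so the auto-filled values remain available for comparison.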
Simulation tag progress - geo, noise, run conditions (Nate, Ryan)
Geo is finalised, noise models nearly there. Adam has a few more changes that should be committed today. He'll look at full gain ND data ASAP. Near detector and noise model still pending.
Are the new geometries fully validated? Maybe not? Should request a sim meeting report.
Nate's timing hooks are done, and the functionality has been incorporated into MakeSimFcl. The official new empty event module requested from the ARTists will not be ready in time for us. Art release turnaround is a topic for the conveners' meeting. These SIM hooks are sufficient for the SIM generation. We don’t need to wait for Joao.
Disk space discussion (Craig, Pavan)
If we blow away /nova/data/novaroot we get 57 TB free! The consensus was to remove all the FD/CAF files and to check that the NDOS stuff is indeed in SAM before wiping that.
Action Item: Gavin & Pavan will wipe the aforementioned data and check if the NDOS stuff made it to tape.
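The "check before wiping" step could be sketched as below. This is an illustration only: `wipe_if_archived` and the `is_on_tape` callback are hypothetical names, and how to actually ask SAM whether a file has a tape location is deliberately left out. The sketch defaults to a dry run, so nothing is deleted unless `dry_run=False` is passed explicitly.

```python
import os


def wipe_if_archived(root, is_on_tape, dry_run=True):
    """Walk root and remove only files that is_on_tape(filename) confirms.

    Returns (removed, kept) as lists of full paths. In a dry run the
    'removed' list shows what would be deleted, but no file is touched.
    """
    removed, kept = [], []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if is_on_tape(name):
                if not dry_run:
                    os.remove(path)
                removed.append(path)
            else:
                # Not confirmed as archived: leave it alone and report it.
                kept.append(path)
    return removed, kept
```

Reviewing the `kept` list from a dry run first gives a list of files that still need to be declared or copied before the real wipe.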
Craig is following up with Zukai who dumped 21 TB on the file over the weekend. Need to come up with a good reason for this or to help him. Quotas? Need to provide him with a system to do this.
It was mentioned that BlueArc disk space will be a problem when first analysis starts, as everyone will dump files there.
Remade ND overlay status (Gavin)
He is through reco to PID. Ready to point Chris to the pidpart files as they’re done. CAFs ready tomorrow.
Misaligned reco jobs & Simultaneous GENIE+CRY jobs (Nate)
He made FHiCL files for both of these yesterday - but they were not ready to submit before grid draining for today. Found some FTS issues… They aren’t being picked up by SAM again.
Action Item: Once FTS issues are ironed out (Gavin, Dominick & Nate) Nate will submit these jobs.
MC generation hand over (Nate, Eric)
Nate spoke to Eric; this is in good shape.
- Eric updated his site ping job and has been running it to check that all of our sites are still awake. He also committed it. The plan going forward is that Nate will make a cron job of it. His samlog parser is in the repo.
- Gavin asked if we had names next to running the RECO/PID/CAF stages. Action Item: Matthew will find people for this.
- No meeting next Monday.