Attending: Satish, Shih-kai, Chris, Joe, Alex, Justin, Paul S, Sijith, Enrique, Paola
- We now have a fourth FTS instance (novasampgpvm01), and Satish has updated the configurations to appropriately balance loads between all four instances.
- The FTS version has been updated to v5_0_0 so that it can send data to fifemon.
- The FTS VMs now mount pnfs via NFS v4, so that we can read text files using existing code. This was needed to extract metadata for FCL files, and would also be needed for reading JSON files for metadata.
SW Tags (Paul S)
- Alex and Paul have finished putting together the branch creation and commit/merge scripts. Paul needs to complete some testing. They should be ready to go by the end of the day. Satish will also test the branch commit/merge script.
- Paul is just about ready to make a new snapshot, but is still waiting for a few extra inputs from the beam people.
- The required commits for FA14-10-03x.d are in place, but Paul cannot make it compile. He seems to be missing some external packages, and is attempting to get this sorted out.
- We have a couple of patch releases planned, based on S15-10-10a. The first is to incorporate 2p2h by default. The second is to include new calibration constants, once they are ready. The changes needed for 2p2h are reportedly already in place, but there was some confusion about this. Satish will follow up with Alex.
- The nightly build has had some issues, but is working now.
Offsite Update (Enrique)
- SMU had a hardware problem last week. This was responsible for a large number of jobs failing at SMU, which Alex had noted (these were Paul’s generation jobs). The problem has been addressed now.
- Enrique has been corresponding with Alex Sousa to understand why jobs have not been starting at OSC. There has also been some discussion with the local admins to get to the bottom of the issue.
- Enrique is working on putting together a website to summarize offsite grid performance.
- Paola observes that offsite jobs often take more than 24h to start. However, SAM projects time out after 24 hours without activity, meaning that when the jobs do start to run, they will be unable to get any files. Several possible solutions were discussed: increasing the timeout period to 48 hours; providing a switch to control the timeout period for the start-project command; providing some kind of keep-alive feature so that we can ping a project from outside of the jobs and keep it from timing out. Satish also suggested that Paola communicate with Enrique to understand why jobs take so long to start and see if the problem can be fixed.
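The timeout arithmetic behind the proposals above can be sketched as follows. This is only an illustration under stated assumptions: it models the 24-hour inactivity rule with plain timestamps and does not call SAM; the function names, the `margin` parameter, and the idea of scheduling pings at a fixed interval are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List


def project_expired(last_activity: datetime, now: datetime,
                    timeout: timedelta = timedelta(hours=24)) -> bool:
    """Return True if the project has been inactive for longer than `timeout`."""
    return now - last_activity > timeout


def keep_alive_times(project_start: datetime, first_job_start: datetime,
                     timeout: timedelta, margin: timedelta) -> List[datetime]:
    """List the times at which an external ping would be needed so that the
    project never goes a full `timeout` without activity before the first
    job starts. Pings are spaced `timeout - margin` apart (hypothetical policy).
    """
    interval = timeout - margin
    if interval <= timedelta(0):
        raise ValueError("margin must be smaller than timeout")
    times = []
    t = project_start + interval
    while t < first_job_start:
        times.append(t)
        t += interval
    return times
```

For a job that starts 30 hours after the project, a 24-hour timeout expires before the job can request files, a 48-hour timeout does not, and a single ping at 23 hours would also keep the project alive.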
Extra ND Monte Carlo (Satish)
This is largely complete. There are some oddities in the file counts that resulted from duplicate files produced in an early stage of the generation whose descendants were incompletely retired. Given that we expect to be respinning this soon anyway, it doesn’t seem worth it to chase down every last inconsistency.
FA NC Respins
- Preparation for running on Amazon: This is ready to go. Paul is waiting for confirmation from Gabriel that the required back-end changes have been made before he starts submitting to Amazon. Paul made some local changes to his submission code to allow him to take advantage of the Amazon resources. He will investigate whether any of these should be fed back into the standard tools, and follow up with Satish.
- FD Genie MC: Tapasi has completed these jobs.
- We still have a number of other samples to process (ND MC, FD data, FD top-up MC), but we will hold off on starting until we have validation of the FD MC samples from the NC/sterile group.
ND Top-up Monte Carlo (Paul R)
Paul submitted jobs over the weekend, but they ran at SMU and crashed. Paul has excluded SMU from his list of sites to run at, and jobs are now running fine. About 3k jobs have finished already.
Preparation for Mini-Production (Satish)
Paul has created the FCL files for CRY production and submitted his jobs. They seem to have run to completion, but the resulting output has not been copied back. It appears that the jobs are not even attempting to copy the results back. He continues to investigate. There were also some other, seemingly unrelated errors that need to be corrected, and he is investigating these as well.
Preparation for Horn-off MC (Satish)
Raphael has finished generating the flugg files and they are now stored on the /pnfs/numix area. However, generation of the gsimple files is stuck when it tries to access those files. He is trying to understand the problem now.
Extra Ideal Conditions FD MC (Tapasi)
This is running smoothly, except that jobs are frequently idle for more than 24h. These were onsite jobs. At present there are only 37 jobs running and 300 idle. Offsite jobs are running fine, so this is likely because, with the recent increase in usage, the novapro priority is suffering. We may need to raise the priority of novapro.
There have been a number of jobs crashing: some because of issues with removebeamspills, some with BadChannel and some with a crash in PathLengthInCell. Paola is investigating these crashes with Satish and the relevant experts.
Paola is planning to make a job status report at the next Production meeting.
There are 27k error files in FTS. Chris says it should be easy to modify the error file cleanup script to pick them up (as it is, the error files are too recent to be picked up by the script, but we want them out of the way now). Satish will take care of this.
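The kind of change Chris describes can be sketched as below. The minutes do not say how the actual cleanup script selects files, so this assumes it filters on modification time with a minimum-age threshold; the function name and the `min_age_seconds` parameter are hypothetical. The sketch only lists candidates and does not delete anything.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def select_error_files(directory: str, min_age_seconds: float,
                       now: Optional[float] = None) -> List[Path]:
    """Return the regular files in `directory` whose modification time is at
    least `min_age_seconds` in the past, sorted by path."""
    now = time.time() if now is None else now
    selected = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and now - path.stat().st_mtime >= min_age_seconds:
            selected.append(path)
    return selected
```

Under this assumption, lowering the threshold (for example to `0`) for a one-off run would let the script pick up the recent error files as well.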