Matthew, Craig, Chris, Joseph, Alex, Michael G., Ruth, Satish, Jonathan, Susan, Jeny, Paola, Dominick.
- The first analysis branch has been retired for now, and Jonathan has removed it from the repository. It can be resurrected at any time with a simple command.
- FTS log cleaning: a misplaced “s” was preventing this from working as expected. Paola fixed this, and it now appears to be working.
- Metadata change, to quote Robert Illingworth: “I’ve run the update on Simulated.firstRun and Simulated.firstSubRun [to change them to ints]. Simulated.cycle was already stored as a numeric type. Most of the other parameters in your original list appear to be things that don't benefit from being stored as numbers - they have lots of identical values and I'm guessing don't get > or < type queries.” It was also noted that a date type exists that we could use if we choose.
- Breakpoint fitter has been added to the production scripts. The nightly tests have shown that it is working well, but is quite verbose. Michael Baird has reduced its verbosity. It also adds ~10% to the size of NuMI data files, and a lot less to other file types.
- The old jobsub should have been removed for analysis jobs this morning. It was noted that the priority system is still missing from jobsub client.
- No objections were raised to rolling all the GPVMs forward to SLF6.
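As a side note on the metadata change above, the following is a minimal sketch of why > or < queries benefit from numeric storage. It assumes that values stored as text are compared lexicographically, which is how Python compares strings and may not be how SAM behaves. The run numbers are hypothetical.

```python
# Hypothetical run numbers, held once as text and once as integers.
runs_as_text = ["9", "10", "100", "11"]
runs_as_int = [int(r) for r in runs_as_text]

# Text comparison is character by character, so "9" > "10" is True
# because "9" sorts after "1".
text_hits = [r for r in runs_as_text if r > "10"]

# Integer comparison gives the intended numeric ordering.
int_hits = [r for r in runs_as_int if r > 10]

print(text_hits)
print(int_hits)
```

In this sketch the text comparison wrongly selects run 9, while the integer comparison selects only runs above 10. Parameters with many identical values that are only ever matched exactly would not gain anything from the conversion.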
Downtime related issue discussion - what is broken, have solutions been identified (all)¶
Since the downtime on Thursday, at least two major issues have been hampering production:
- Slow SAM queries (up to an hour). Some even time out with a 503 error.
- Proxy issues preventing submission with jobsub_client.
A few tickets have been opened about these. We encourage everyone to open further tickets to hammer home how big a problem this is to us. Currently production has almost ground to a halt because of these issues. As a further action item, Craig will communicate our distress to CD. Further points noted in this discussion:
- The jobsub_rm command now needs an additional option to delete jobs that belong to nova pro: --role=Production must be specified. Satish noted that it would have been nice to have been warned of interface changes in advance.
- Dominick noted that the symptoms may point to a server configuration change. He urged everyone to open tickets so CD can grasp the underlying issues.
Are we ready to close the Epoch (Matthew et al)¶
We talked a few weeks ago about closing a data Epoch on March 1st. However, my feeling is that we might be better off waiting another two weeks to a month so that the open issues (geometry, N_DCMs, sag, hadronic deficit, etc.) can be closed. No objections were raised to this. The plan is to revisit this decision in two weeks.
Simulation handover (Ruth)¶
Ruth has been working hard on Nate's scripts and has alternately been bitten by the above problems and by some small issues with the scripts themselves. Gavin volunteered to help her make progress. The immediate plan here is to make a small (10%) sample with an altered ND geometry to test that everything works.
Ongoing Reco (Satish)¶
The two remaining ND hadronic variation samples are with Satish. He reports no progress due to the above issues. He also noted that he had submitted his jobs offsite to OSC and SMU, but these failed due to a typo in the OS specification. He will try again when he is able.
Ongoing PID/CAF (Gavin)¶
LEM is done for the three samples (two hadronic variation samples and the new baseline). Gavin reports that he has mixed two of the samples. He then needs to CAF these, and to mix and CAF the third sample. His plan is to force these through with the old jobsub!
Paola reports that the majority of jobs are done, but that she is having lots of trouble related to the downtime. The FD cosmic sample has the most files pending. She also reported that processing time has gone up by a lot: 40 mins to 770 mins for ND files.
The plan is for Paola to close the open-ended data dataset for now and finish processing these with the current tag. Keep-up may then be migrated to a newer tag with some speed fixes. She will also report datasets as they are completed, and Matthew will pass these on piecemeal to calibration.
Reco keep-up (Jeny)¶
ND is done. FD jobs are still running, and some are still pending due to the downtime issues. Jeny is also preparing a test for the keep-up cron jobs, but will await resolution of the jobsub client issues.
Jobsub_client with shared accounts (Jeny)¶
Jeny has run some tests using jobsub_client and shared accounts, which succeeded. This is a bit at odds with Paola's understanding that proxy-based jobs, which are required for the cron jobs to work, are currently broken. Again, this is closely related to the downtime issues reported above.
Dominick noted that jobsub_client is still missing a key feature in the form of priority tiers.
FTS crontab and restarts (Gavin)¶
Gavin noted that after the reboot of the SAM GPVMs the FTS did not restart correctly due to an outdated crontab. He fixed this with a manual restart.
Matthew and Jonathan will audit the novapro account next week and put in place a system of use, as well as version control for key scripts, crontabs, etc. Dominick noted that we should also look at the bin directory in the nova_pro home.