Attending: Satish, Joe, Alex, Art, Paul Sail, Vito, Paul Rojas, Justin, Paola, Gavin, Enrique, Chris
Alex noted that an explanation with a straightforward test for the calibration issue has been posited, involving changes to an arbitrary parameter. The tests should be completed sometime today. If this bears out, the rest of the calibration process can also begin, and we should be able to start reconstruction soon.
Art commented that there are several cron jobs running on gpsn01 that he would like to stop. Details will follow in an email message, but to summarize these are jobs to fetch kerberos tickets and production proxies. To the best of anyone’s knowledge these tokens are no longer needed. Ceasing these cron jobs should remove the last obstacle to shutting down gpsn01, a task that has been planned for some time. Art will turn off the cron jobs tomorrow, and see if anyone notices.
On Thursday the 21st (shortly before the collaboration meeting) dCache will be down for most of the day, as well as other systems. Alex has already sent out one notification, and will send out another as we get closer.
SW Tags (Paul S)
The latest tag (S16-01-07, the stormtrooper release) was cut last Friday, straight off of development. At present, Paul is working on cutting a new miniprod branch and tag to support MEC. The only change is to the externals. He is having some trouble finding the correct version of genie to use, and will consult with Robert Hatcher to sort this out.
A new snapshot is due out to upgrade the version of nutools to v1.19, and also for a minor art version bump. This involves lots of updates, so it will take some effort. There was some discussion of whether we will soon want to update the version of genie in the development trunk, anticipating the prod2 campaign. The principal objection was that before anyone has looked at the events out of genie 2.10.2 it's difficult to be sure that this is the version we want. Tests that Xuebing and others are running should provide insight here. Once we settle on the genie version, we will have to send a request to Lynn to get a version of nutools built against the correct version of genie.
Rationalizing MC Generation (Paul/Gavin)
The --cvmfs and FCL generation options are obsolete and will be removed. At present, most of what has been accomplished is the migration of NovaGridUtils. The integration with submit_nova_art.py is now proceeding. Chris was concerned that having make_sim_fcl ksu to novapro might break the script for regular users. Since this is usually the easiest way for such users to generate custom MC, this ability should be preserved. In particular, Jeremy recently asked questions along these lines, and it would be good to have a pat answer for users like him. Paul commented that the novapro account will only be used when writing FCL files to the dropbox.
The integration of submit_nova_art.py is the most difficult task, and might need some modifications to submit_nova_art.py to allow for the fact that these jobs take FCL files as inputs, not art files. Satish commented that the appropriate thing is probably to modify runNovaSAM.py to deal with this case gracefully. At present, generation jobs just use art_sam_wrap.sh. Dominick commented that editing runNovaSAM.py could involve some rather deep changes, and that modifying submit_nova_art.py to simply use art_sam_wrap.sh in these cases is probably easier.
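The routing question above can be sketched as a small dispatch helper. This is only an illustration of the choice being discussed; the function and the way the real submit_nova_art.py selects a wrapper are assumptions, and only the script names come from the minutes:

```python
import os

# Names taken from the discussion above; the dispatch logic itself is hypothetical.
ART_WRAPPER = "art_sam_wrap.sh"   # plain wrapper currently used by generation jobs
RUNNOVASAM = "runNovaSAM.py"      # wrapper for jobs that take art files as input

def choose_wrapper(input_file):
    """Route FCL-input (generation) jobs to the plain wrapper and
    art-file-input jobs to runNovaSAM.py."""
    if os.path.splitext(input_file)[1] == ".fcl":
        return ART_WRAPPER
    return RUNNOVASAM
```

Dispatching on the input type this way is the "easier" option Dominick described, since it leaves runNovaSAM.py untouched.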
Removal of scripts from old locations still needs to happen. Gavin will help with additional work, although the precise division of labor is still unclear.
Satish commented that the modifications to make_sim_fcl should make the Production package version part of the FCL file names.
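A minimal sketch of what Satish's suggestion could look like. The naming scheme, the example version string, and the zero-padding are all assumptions for illustration, not the actual make_sim_fcl convention:

```python
def sim_fcl_name(base, production_version, index):
    """Embed the Production package version in the generated FCL file name,
    so files in the dropbox can be traced back to the tag that made them."""
    return f"{base}_{production_version}_{index:06d}.fcl"
```

With a hypothetical tag, `sim_fcl_name("nd_genie", "v05_04_00", 7)` yields `nd_genie_v05_04_00_000007.fcl`, making the producing version visible at a glance.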
Flatdaq Dropping (Justin)
In his last report, Justin had difficulties getting his jobs to finish because the jobs were exceeding their memory limits. He has since adopted the strategy of running out of development instead of the old S15-05-22 tag. This appears to have addressed the crash. He has reproduced the file size savings and they are in line with previous expectations, but he doesn't yet have a full suite of numbers. Justin will send them around once he has them.
Full chain Jobs (Satish)
Satish has a merged FCL file ready, but testing has been impeded by a missing package (fann) on CVMFS. Paul had installed this package, and it’s unclear why it’s gone now. Paul will install the package again, at which point Satish will retry his tests.
Stashcache Updates (Joe)
Joe has been on vacation until today, so has no news to report. He will resume looking at this now.
The developer was on vacation until this week, so there is little news. There is a draft checklist that is nearly ready to circulate. After iterating on it a bit more, Alex will circulate it to the group for comments. After making any changes from that feedback, we will start using it to track production activities.
Analysis Skimming (Alex)
The idea here is to skim out slices we don't care about, primarily using CosVeto, reducing file size and processing time. The idea is to put the skim as early in the chain as possible to reduce its complexity. The principal issue is how to handle Michel electron reconstruction. Chris asked if the plan was to just run this on FD data, which it is. Generally speaking, however, this would introduce differences in how data and MC are processed, so it might not be the wisest choice. Chris also commented that it would be good to have some subsample processed without the skimming cuts to allow us to study the impact of those cuts.
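The skim described above amounts to an early filter on slices. The sketch below is purely illustrative: the slice representation and the `cos_veto` flag are invented stand-ins, not the actual CosVeto interface:

```python
def skim_slices(slices):
    """Drop slices flagged by the cosmic veto; only surviving slices
    are passed to the downstream reconstruction stages."""
    return [s for s in slices if not s["cos_veto"]]

# Hypothetical example: two of three slices are cosmic-like and skimmed out.
slices = [
    {"id": 1, "cos_veto": True},
    {"id": 2, "cos_veto": False},
    {"id": 3, "cos_veto": True},
]
kept = skim_slices(slices)
```

Running the cut this early is what buys the file-size and processing-time reduction, at the cost of the data/MC and Michel-reconstruction concerns noted above.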
Memory Usage (Alex)
Fermigrid is moving to partitionable slots with 2 GB of memory each, meaning that if our jobs need to request more than 2 GB, the resource cost charged to nova will double. To avoid this we would like to ensure our memory usage stays below 2 GB. A great deal of progress has been achieved by Chris optimizing how AttenCache loads its data. The reco+pid jobs are barely under the 2 GB threshold, so more gains are probably needed, especially in light of the full chain jobs, which will probably consume more memory.
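One way to watch the 2 GB budget from inside a job is to check the process's peak resident set size. This is a generic sketch, not part of any nova tooling; the headroom value is an arbitrary assumption:

```python
import resource
import sys

SLOT_MB = 2048  # one partitionable Fermigrid slot

def peak_rss_mb():
    """Peak resident set size of this process, in MB.
    ru_maxrss is reported in KB on Linux and in bytes on macOS."""
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss //= 1024  # bytes -> KB
    return rss / 1024.0  # KB -> MB

def fits_in_one_slot(headroom_mb=100):
    """True if peak usage stays below the slot size minus some headroom."""
    return peak_rss_mb() < SLOT_MB - headroom_mb
```

Logging this at the end of a job would show how close the reco+pid chain is running to the threshold.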
Offsite Status Report (Enrique)
In response to Alex's question, Enrique confirmed that at sites where nova has dedicated resources, our jobs always start. Alex observed that the number of idling jobs recently decreased significantly and plateaued at about 200 jobs. The timing and size of the fall-off, as well as the number of residual jobs, is consistent with a cleanup of old jobs that Enrique performed.
There are issues with the SMU and OSC clusters, which Enrique is following up on. We currently have no access to the UChicago and MWT2 sites. Enrique is checking with SCD to see if a new ticket is needed to follow up. Getting jobs to run at MIT is difficult, given how heavily used it is.
Raw2root Keepup (Vito)
This is running smoothly.
Reco Keepup (Paola)
A couple of jobs failed with transfer problems. Some jobs are failing as they exceed job memory limits. The suggestion is to request more memory for these jobs. At present only 2 GB of memory is requested, but for reco jobs in this tag it should be at least 3 GB. At some point we will switch to a more recent tag, and gain the benefit of reduced memory usage.
Miniprod ND Genie (Bruno)
Miniprod ND MEC (Gavin)
Gavin is waiting for the tag from Paul S to proceed.
Amazon Status (Paul R)
Before the winter holidays, Paul had gotten single jobs to run. More files have since been moved to Amazon, allowing an expanded number of jobs. In the last few days, Paul submitted 10 jobs with 15 files each. Half finished successfully; half crashed or didn't start. The jobs that crashed evidently failed when copying files over from S3. This appears to be an issue of an incorrect file location, but Paul isn't sure. He has sent the error message to the appropriate people, and will follow up. The next step is to submit ~100 jobs, and providing that works, a full scale set of jobs.