SamWiseMinutes3 10 14
Dominick, Chris, Michael, Kanika, Craig, Adam, Andrew, Jan, Gavin, …
— Dominick: Run PCHits to get files ready for calibration studies.
— Adam: Modifications to the metadata module are needed for CAF metadata
— Multiple art jobs per grid section —
—> strange python issue.
— I/O problem on BlueArc
— Investigate 10 hour metadata job…
— Database issue – can we understand what failed.
— Nate: working on the overlays
— Gavin: Validation files for reconstruction could be made soon (cosmic and small batch GENIE)…
— Is upgrading the log-file parsing script.
— log file – all files in condor tmp
— production script makes a log of all projects — dumps metadata
— IFDH_fetch can be used on the grid nodes.
— Our UPS setup script uses a variable called $file but never clears or resets it. Yuck.
— IFDH_CP took 5 hours to complete an easy copy — 30 minute increments. Uses grid ftp door.
— ran a lot Friday night into Saturday – 4000 files (13.. Period late Feb)
—> Drop box and populating SAM — FTS is a slow step.
— Another batch yesterday – production database broke down.
—> Many reported database failures. Missing stuff varied. Can’t reproduce today.
—> Chris: also noticed the problems — not validity-based tables.
—> Successful connections were fast (<1 s), but failed ones took a long time.
— New setup scripts last week to move to “prod” from “dev” (backports were made a day after the tag was released, late Friday)
—> Currently one subRun per job.
— FTS: 150 files per hour (30 s per file). Is this reasonable?
— FTS can declare files much faster than that. Pending: 3400 files (waiting for tape). New: 6400 files (extracting metadata). Waiting for NSTORE tape levels.
— Is BlueArc the slowdown? If I/O off the BlueArc is the problem, we may need a different intermediate staging location.
— Could be stuck extracting the metadata
— For Eric it took 4 days for FTS to catch up.
— For drift we need everything; for attenuation and absolute we just need a good time period.
— Chris will look at MC files today — bottleneck for everyone tuning.
— Dominick will spend a little time getting the multiple sections to run.
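
A quick check of the FTS numbers noted above: 150 files per hour works out to 24 s per file, while 30 s per file would be 120 files per hour, so the two quoted figures do not quite agree. The sketch below estimates how long the backlog would take at either rate. It is only a back-of-the-envelope calculation: the helper names are illustrative (not part of FTS), the rate is assumed to stay constant, and the 3400 tape-pending files are assumed to pass through the same step as the 6400 new ones.

```python
def seconds_per_file(files_per_hour):
    """Convert a throughput in files/hour to seconds spent per file."""
    return 3600.0 / files_per_hour


def hours_to_clear(n_files, files_per_hour):
    """Hours needed to process n_files at a constant files/hour rate."""
    return n_files / float(files_per_hour)


# Numbers taken from the minutes.
NEW_FILES = 6400        # files still extracting metadata
PENDING_TAPE = 3400     # files waiting for tape
RATE_QUOTED = 150.0     # files per hour, as quoted
RATE_FROM_30S = 3600.0 / 30.0   # files per hour implied by "30 s per file"

if __name__ == "__main__":
    print("150 files/hour = %.1f s per file" % seconds_per_file(RATE_QUOTED))
    print("30 s per file  = %.1f files/hour" % RATE_FROM_30S)
    backlog = NEW_FILES + PENDING_TAPE
    for rate in (RATE_QUOTED, RATE_FROM_30S):
        hours = hours_to_clear(backlog, rate)
        print("%d files at %.0f files/hour: %.1f hours (%.1f days)"
              % (backlog, rate, hours, hours / 24.0))
```

Under these assumptions the full backlog of 9800 files takes roughly 2.7 to 3.4 days, which is the same order as the 4 days it took FTS to catch up for Eric.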
Gavin FD MC – PC hits:
— Passed to calibration group
— Ran 10,000 jobs, 2E+06 spills (4.0E+06 spills is the goal for a full production)
— Ran over the weekend: jobs ran Friday, all files in SAM, on site (CVMFS problems).
—> 2000 jobs perfect, 10 errors (“fatal root error”). 8000 jobs in 2nd dataset, only 20 missing out of 8000 after one recovery.
Dominick – about 5% failure from the 4000 jobs that ran.
Nate: The Python script is the limiting factor for running multiple files per job section:
— python script isn’t working properly on grid nodes.
— "os module not found"…
— Suggest running “which python” and dumping the Python environment variables.
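
One possible way to collect the diagnostics suggested above is a small script run at the start of the grid job. This is a minimal sketch, not the production script: the function names are made up for illustration, it assumes Python 3 (for shutil.which), and it imports only sys up front because sys is built into the interpreter and should load even when the standard library path is broken, which is what an "os module not found" error usually points to.

```python
import sys  # built into the interpreter; available even if the stdlib path is wrong


def interpreter_report():
    """Describe the interpreter that is actually running this script."""
    return {
        "executable": sys.executable,
        "version": sys.version.split()[0],
        "prefix": sys.prefix,
        "path": list(sys.path),
    }


def environment_report(environ):
    """Pick out PYTHON* variables plus PATH and LD_LIBRARY_PATH from a mapping."""
    wanted = ("PATH", "LD_LIBRARY_PATH")
    return {
        name: value
        for name, value in environ.items()
        if name.startswith("PYTHON") or name in wanted
    }


def main():
    for key, value in interpreter_report().items():
        print("%s = %s" % (key, value))
    try:
        # Deferred so that a failure here is reported instead of killing the script.
        import os
        import shutil
    except ImportError as exc:
        print("standard library import failed: %s" % exc)
        return 1
    # Equivalent of "which python": first match on the current PATH, or None.
    print("which python = %s" % shutil.which("python"))
    for name, value in sorted(environment_report(os.environ).items()):
        print("%s = %s" % (name, value))
    return 0


if __name__ == "__main__":
    main()
```

Comparing the output from an interactive node with the output from a grid node should show whether the job picks up a different interpreter or a different PYTHONPATH or PYTHONHOME.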