SamWiseMinutes3 31 14
Here are my notes from the meeting today:
Attendance: Nate, Gavin, Michael, Eric, Andrew, Craig, Chris, Susan, Nick, Dominick, Kanika, Ruth, Adam
—> In better shape than I thought – thanks!
—> Still needs work: Everyone should spend one hour updating the wiki ASAP.
— we have some validation files for SIM group.
— they may be able to present these at collaboration meeting
— status should also be shown at the collaboration meeting.
— Gavin is trying to reproduce the files Chris made for official version.
— Luke needs PCstop files – Gavin helped him get started with SAM for a look at a small dataset.
—> Absolute energy calibration.
—> No point in reconstruction without calibration.
—> Chris has a sufficient data set now.
Ruth’s jobs — stalled, need reconstruction and CAF.
—> Ruth is okay to not have them for the meeting.
—> Would be happy with truth-only files.
Possible goal: get calibration jobs running again by the end of the day.
—> Need a new dropbox on DCACHE for calibration (needs 4096 subdirectories; Adam).
—> Need a sorting script working
—> FTS on cache should be mostly caught up by the end of the day.
—> Dominick: generating 700-1000 files per hour could be possible.
— Production stalled due to FTS backlog
— Can’t restart production until the backlog clears.
— Blue Arc should clear at the current rate in about 30 hours. DCACHE drop boxes should clear in about 10 hours.
— dcache dropboxes should be mostly clear by the end of today — can we start writing to them again?
— dcache dropboxes require sorting – nothing likes 40,000 files in one directory.
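One way to spread a flat dropbox over 4096 subdirectories is to key each file on three hexadecimal digits derived from its name (16^3 = 4096). The sketch below is only an illustration: the function names, the use of CRC32, and the three-hex-digit directory layout are assumptions, not the actual sorting script or the layout FTS expects.

```python
import shutil
import zlib
from pathlib import Path


def bucket_for(filename: str) -> str:
    """Return a three-hex-digit bucket name (000..fff) for a file name."""
    checksum = zlib.crc32(filename.encode("utf-8"))
    return format(checksum % 4096, "03x")  # 16**3 = 4096 possible buckets


def sort_dropbox(dropbox: Path) -> int:
    """Move each regular file at the top level of `dropbox` into its bucket.

    Returns the number of files moved. Existing subdirectories are left alone.
    """
    moved = 0
    # Take a snapshot of the listing so we do not iterate while modifying.
    for entry in list(dropbox.iterdir()):
        if not entry.is_file():
            continue
        target_dir = dropbox / bucket_for(entry.name)
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(entry), str(target_dir / entry.name))
        moved += 1
    return moved
```

With 40,000 files hashed evenly over 4096 buckets, each subdirectory would hold roughly ten files. A real script would also need to avoid moving files that are still being written.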
Disk space issues:
— Craig will find a “disk space manager” to complete the cleanup of BlueArc
—> first prod then data disk.
—> This will be an entry level task for getting someone plugged into production efforts.
— Current status (df): /nova/prod, size 100T, used 89T, available 12T, 89% used
—> FTS dropbox still has 3TB
—> /nova/prod/data still has
—> Everything else is in MC area
— Goal for disk space before starting up jobs again?
— Moving one old MC sample?
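If a disk space goal is agreed on, a check like the following could gate restarting jobs. This is a minimal sketch: the function names and the threshold parameter are hypothetical, since the notes leave the goal as an open question. The fraction computed here may differ slightly from the percentage that df reports.

```python
import shutil


def used_fraction(path: str) -> float:
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # Guard against pseudo-filesystems that report zero size.
    return usage.used / usage.total


def ok_to_restart(path: str, max_used_fraction: float) -> bool:
    """True if disk usage is at or below the agreed (hypothetical) threshold."""
    return used_fraction(path) <= max_used_fraction
```

For example, `ok_to_restart("/nova/prod", 0.80)` would return False at the 89% usage quoted above, where 0.80 is only a placeholder value.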
Is FTS down? I can’t access the monitoring page.
— DCACHE each chewing through 4500 files
— BlueArc — 7200 files still in that area ( 250/hour )
— Tape system is so backlogged —
— Restore queues are so huge that they are affecting our write queues.
— Number of files is so large, that
— Adam submitted a project for all raw data since August. 338,000 files.
—> Things from late Dec onward were on DCACHE, went quickly.
—> Now, the rest of the files are hogging the queue.
— SAM system is doing the right thing, ganging them together for a full tape
— There was a SAM bug that required repeated access to the tapes.
—> They are helping us manually.
— Move some of the overhead from the FTS onto the grid nodes.
Declaring files to SAM is slow due to the tape system overload.
We have also allocated more resources for FTS.
For FTS, fewer, larger files are better.
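As a rough check on the BlueArc backlog figures above (7200 files at about 250 per hour), the clearing time can be estimated as below. The function name is hypothetical, and the estimate assumes a constant rate with no new files arriving.

```python
def hours_to_clear(files_remaining: int, files_per_hour: float) -> float:
    """Estimate hours needed to drain a backlog at a constant transfer rate."""
    if files_per_hour <= 0:
        raise ValueError("files_per_hour must be positive")
    return files_remaining / files_per_hour


print(hours_to_clear(7200, 250))  # 28.8
```

That is consistent with the "about 30 hours" quoted for BlueArc.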
Challenges concatenating files:
— Bookkeeping
— Many places in the code may assume that there is only one run and run section.
— Can we concatenate files at Ash River before shipping them to Fermilab?
— The metadata tool may not work on concatenated files
— Raw data parser will also need to be checked (C++)
— Raw2root processing has a memory leak.