FermiGrid Bluearc Unmount Task Force¶
Charge¶
To write a proposal, including a straw man schedule, for a staged unmounting
of Bluearc disks from the Fermigrid worker nodes:
- to improve application portability to general OSG sites
- to reduce Bluearc server overloads affecting Fermigrid
The proposal needs to include a feasibility study of which file systems
can be unmounted, and the impact on both the user groups and the service providers
(e.g., grid and cloud services, data management, network storage).
Additionally, it should include an analysis of the frequency of the problems
over the last 6 months and the metrics we need to demonstrate
that this approach would make offline computing significantly more robust.
Task Force Members¶
Who | Username | Role |
Gerard Altayo | gerard1 | Fermigrid |
Dmitry Litvintsev | litvinse | DMS |
Marc Mengel | mengel | SCS/SDP/DM |
Andy Romero | romero | SNS |
Marco Slyz | mslyz | Aux Files |
Steve Timm | timm | Fermigrid |
Matt Tamsett | tamsett | NOvA analysis |
Arthur Kreymer | kreymer | SCS/SDP/DM and chair |
Input to this process is welcome from all interested parties.
We will work directly with CS Liaisons.
Working Process¶
We see this task force as sharply focused on preparing the proposal and strawman schedule,
and providing pointers to supporting documentation.
The primary product of the Task Force is the Plan, official as of V1.0,
http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5522
That documentation should already largely exist.
We are collecting pointers here,
improving documents slightly as needed,
and giving an executive summary.
Since there is a nearly complete overlap with FIFE membership,
we have had short meetings immediately after FIFE weekly meetings as needed.
Most of the work is being done in smaller discussions
between appropriate small groups of interested people,
especially those with conflicting opinions.
We have opened Redmine Issues to track detailed technical discussions.
- Overall milestones: https://cdcvs.fnal.gov/redmine/issues/7100
- Alternate data mount points: https://cdcvs.fnal.gov/redmine/issues/7123
- Move /grid/data to Data head: https://cdcvs.fnal.gov/redmine/issues/7121
- Locks based on Bluearc disk: https://cdcvs.fnal.gov/redmine/issues/7122
- GridFTP scaling: https://cdcvs.fnal.gov/redmine/issues/7124
Background¶
Bluearc is a proprietary,
high-performance NFS/CIFS/FTP server used
widely for core computing services at Fermilab. See performance
details in this WIKI.
Small projects not needing dedicated space work under the /grid directories.
Larger projects have their own app and data volumes.
- /grid/fermiapp - 2 TB
- shared software, and small client software
- /grid/data - 27 TB
- data files for smaller clients
- legacy files before /<project>/data areas existed
- /<project>/app - 41 TB
- software
- /<project>/data ( and similar ) - 1600 TB
- non-executable data files
App areas are typically accessed directly, and have not been a performance problem.
/grid/fermiapp supports smaller clients,
and the small parts of large
client software appropriate for this area,
such as tools to establish a
working environment for users.
The data areas are intended for 'project' files,
which are relatively
volatile and are not archived to tape with the usual Fermilab
DCache/Enstore system. Bluearc has quotas,
and files are not removed
without user intervention. Until the 2014 deployment of DCache
scratch pools,
Bluearc was the primary resource for Intensity Frontier
clients' non-archived files. Many of these files are being moved to
DCache now,
initially concentrating on managed production files.
There remains a need for Bluearc style storage,
with quota,
persistence,
and low overhead for cases needing many small files.
Bluearc files are served by 'heads'. These can become overloaded
when too many simultaneous unique transfers are attempted.
We have reduced the rate of overloads and their impact by placing
data and app areas on separate heads,
and by requiring use of a software layer called ifdhc
to regulate access to data files.
Use of ifdhc or its predecessor cpn has been required since 2009.
See the performance documentation in this WIKI.
The only allowed direct access to data has been to Auxiliary highly
shared files,
for which alternate access methods are now available.
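To make the regulated-access pattern concrete, here is a minimal sketch of staging a Bluearc input through ifdh cp instead of reading the NFS path directly. It assumes ifdhc is already set up in the job environment; the experiment name and file paths are placeholders, not real project areas.

```bash
#!/bin/bash
# Hypothetical example: stage a Bluearc input file with ifdh cp
# instead of reading the NFS mount directly (paths are placeholders).
set -e

INPUT=/myexp/data/users/alice/run123/events.root   # Bluearc project area (placeholder)
WORKDIR=${_CONDOR_SCRATCH_DIR:-$(pwd)}             # local scratch on the worker node

# ifdh cp queues behind the lock mechanism before touching the Bluearc head,
# so many simultaneous jobs do not overload the server.
ifdh cp "$INPUT" "$WORKDIR/events.root"

# ... process the local copy here ...
```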
Proposal¶
While we learn to make effective use of newly deployed DCache scratch areas,
and for use cases which are not a good fit for DCache,
we continue to need Bluearc project areas for unmanaged user analysis files.
In order to provide portability to general OSG sites,
and to avoid server overloads from unregulated client access,
we require all access to Bluearc to go through the ifdhc layer.
Any direct access ( cd to a directory, read or write directly ) should fail.
This has been the policy for at least two years,
enforced by manual intervention to cancel offending jobs.
We will remove the risk of such overloads by removing the traditional
/*/data type mount points on Fermigrid GPGrid worker nodes.
This will be done project by project.
New projects should not have the /*/data mounts established.
Due to the lack of appropriate FTP servers, and to minimize risk,
we will initially move data on Fermigrid via hidden NFS mount points.
This lets us proceed immediately while evaluating long term alternatives.
Smaller clients can dismount completely, if their volume is low.
This is all transparent to the users.
Storage services¶
Documents and a summary of the capability and capacity of the services:
- Bluearc
- Low latency robust file handling
- Aggregate 1/2 GByte/second
- DCache
- Moderate latency with file access restrictions for efficiency
- Supports both tape-backed and volatile storage, with small-file support for tape (a scratch copy sketch follows this list)
- Highest throughput, 1 GByte/sec per pool, limit is overall network
- Protocols include dcap, ftp, nfs 4.1, webdav, xrootd
- Self throttling by pools prevents overloads
- GridFTP
- Presently of limited capacity and support.
- Highly shared GBit network
- Best effort maintenance on highly shared Fermicloud VMs
- ifdh locks
- presently via files in /grid/data, should move elsewhere
- the lock rate limit appears to be about 5 per second
- the present limit is 5 locks per group
- matches Bluearc capacity while avoiding head overloads
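As a hedged sketch of the dCache alternative for volatile files, the example below copies a job output into a dCache scratch area through ifdh; the /pnfs path layout and experiment name are placeholders, and the destination directory is assumed to already exist.

```bash
#!/bin/bash
# Hypothetical example: send job output to dCache scratch via ifdh
# instead of to a Bluearc /<project>/data area.
# The /pnfs path and experiment name are placeholders; the destination
# directory is assumed to already exist.
set -e

OUTPUT=hists.root                               # file produced locally by the job
DEST=/pnfs/myexp/scratch/users/alice/run123     # dCache scratch path (placeholder)

ifdh cp "$OUTPUT" "$DEST/hists.root"
```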
Usage¶
See FGBUsage for details.
The core Bluearc usage is:
- APP areas
- contain project and user software releases
- available via NFS mounts and CVMFS
- have not been the source of overloads
- DATA areas
- for volatile project files and temporary production files
- presently there are some highly shared 'Auxiliary' files.
- should always be accessed via ifdh cp
Mounts¶
See FGBMounts for details.
Summary of Fermigrid mounts, space in TB, from 2015/03/18
Project | app (TB) | data (TB) | AUX | comments |
argoneut | 2 | 35 | F | |
cdf | | | | code is not BA; project areas are mounted interactively only |
coupp | 1 | 12 | | |
d0 | 1 | | | project areas mounted on d0grid, clued0 |
des | | sim 24, orchestration 100 | | des20, des21 not BA |
ds50 | 1 | 25 | | |
gm2 | 4 | 21 | | |
ilc | | accelerator 1, ilc4c 4, sid 5, ild 5 | | |
lariat | 2 | 8 | | |
lbne | 2 | 30 | | data2 = 30 |
minos | 7 | 236 | F,L,T | |
minerva | 2 | 240 | F | data2 = 50, data3 = 25 |
mu2e | 1 | 70 | | data2 = 10 |
nova | 10 | 140 | | ana 95, prod 100 |
nusoft | 2 | 25 | F,L | |
e906 | 1 | 3 | | |
uboone | 3 | 52 | F | |
lar1nd | 1 | | F | |
grid | 2 | 27 | | app mount is /grid/fermiapp |
AUX highly shared files include F - beam flux, L - library, T - template
Flux files model neutrino beam characteristics, and are used by all neutrino beam projects in their simulations.
Library files are used in analysis jobs to identify particles by matching to event data. Shared library files can be up to 100 GB and are sometimes loaded into memory for speed; they are mainly used by MINOS and NOvA.
Template file collections are usually under a few GBytes, used in various ways by analysis jobs.
In some cases the most efficient running requires a shared cache on worker nodes. Existing direct access provides this via local worker memory cache and Bluearc head caching. But this does not work on OSG, and it opens the door to overloads when non-shared files are accessed.
We think AUX file tools presently under test will provide good performance and OSG readiness. See the Impact Statement.
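As one hedged illustration of the alternate access methods for AUX files, a shared flux file already staged into dCache could be pulled over xrootd rather than read directly from a Bluearc mount; the door hostname and pnfs path below are placeholders, not the production configuration.

```bash
#!/bin/bash
# Hypothetical example: fetch a highly shared AUX (flux) file from dCache
# over xrootd instead of reading it from a Bluearc mount.
# The xrootd door hostname and pnfs path are placeholders.
set -e

DOOR=root://dcache-door.example.fnal.gov:1094
FLUXFILE=/pnfs/fnal.gov/usr/myexp/persistent/flux/flux_set_001.root

xrdcp "$DOOR/$FLUXFILE" "${_CONDOR_SCRATCH_DIR:-.}/flux_set_001.root"
```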
Some performance issues have been due to access to
/grid/data, /nusoft/data, Dzero project areas, etc.
Other performance issues come from outside Fermigrid
( MiniBooNE farm, etc. ).
Performance¶
See FGBMon for details.
Bluearc performance is monitored in detail both internally and on clients.
We have logs, and in some cases plots and alarms, for the following (a client-side sketch follows this list):
- Server open files
- Server loads and performance metrics
- Fermigrid client open files
- Client data rates for the major file systems and hosts
- Gridftp server availability
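The client-side sketch below (not the production monitoring code) shows one way the "client open files" metric can be sampled on a worker node; the list of mount points is a placeholder.

```bash
#!/bin/bash
# Minimal illustration of the "client open files" metric: count the file
# handles this node holds open under Bluearc mount points.
# The mount point list is a placeholder, not the monitored set.

MOUNTS="/grid/data /nusoft/data /minos/data"

for m in $MOUNTS; do
    n=$(lsof +D "$m" 2>/dev/null | tail -n +2 | wc -l)
    echo "$(hostname) $(date +%s) $m open_files=$n"
done
```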
Issues¶
There are a variety of root causes of overloads.
The classic symptom is that the Bluearc head gets very busy.
At this point the head can detach from the network,
causing NFS mounts to go stale or read-only.
This is usually due to user scripts on Fermigrid not making use of the
ifdhc layer to access Bluearc.
- Accelerator Division / APC has a job structure that writes directly to /grid/app and /grid/data; they need to be onboarded to jobsub/ifdhc, otherwise their jobs will break.
- CDMS was recently flagged for heavy direct Bluearc access; they need to be onboarded as well.
- Some of the MARS groups (marslbne, we think) do, or did, some direct Bluearc writes.
See FGBIssues for details.
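For groups still writing directly to Bluearc, the onboarding change is usually small: write to node-local scratch during the job and copy the result back through ifdh at the end. A hedged sketch follows (placeholder paths and payload command, not any group's actual workflow; the destination directory is assumed to exist).

```bash
#!/bin/bash
# Hypothetical onboarding sketch: rather than writing output directly to a
# Bluearc path such as /grid/data/<group>/..., the job writes to local
# scratch and copies the result back with ifdh cp when it finishes.
# Paths and the payload command are placeholders.
set -e

SCRATCH=${_CONDOR_SCRATCH_DIR:-$(mktemp -d)}
DEST=/grid/data/mygroup/outstage/job_${CLUSTER:-0}   # placeholder destination (assumed to exist)

./run_payload --output "$SCRATCH/result.dat"         # placeholder for the group's real executable

ifdh cp "$SCRATCH/result.dat" "$DEST/result.dat"     # regulated copy back to Bluearc
```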
Impact¶
See FGBDataImpact for details of the impact of Data dismounting.
Summary tables :
Low impact
- no or light use of AUX files
- low data rates, allowing fallback to GridFTP servers
Project | Comments |
argoneut | |
coupp/pico | |
des | /des/orchestration |
ds50 | |
e906 | seaquest |
genie | |
gm2 | |
ilc | accelerator, ilc4c, ilcd, sid |
lariat | |
lbne | possible marslbne direct access issue ? |
mu2e | |
numix | |
nusoft | flux files ? |
uboone | |
High impact
PROJECT | COMMENTS |
d0 | not using ifdh |
minerva | Flux |
minos+ | LEM, Flux, Template |
nova | Flux |
cdms | direct writes to /grid/data -- not onboarded to ifdh or jobsub |
accelerator | direct writes to /grid/data -- not onboarded to ifdh or jobsub |
patriot | direct writes to /grid/data -- not onboarded to ifdh or jobsub |
marsmu2e | direct writes to /mu2e/data/users/outstage/* |
Schedules¶
Strawman Schedule¶
- /*/data
- immediate - stop mounting new project data on Fermigrid
- done Jan 2015 - release ifdhc v1_7_2 supporting alternate mount points
- moot - unmount data from GPWN local batch, very lightly used
- schedule - move projects as appropriate to alternate or no mounts
- Some projects can use ftp fallback immediately, at a modest scale
- /grid/data
- schedule - should be moved to the data head
- Should first move the ifdhc locks to somewhere on the app head, possibly /grid/app
- The SNS group notes that the existing /grid/data is on disk that is vintage 2006; if it is to continue, it must be on different hardware
- /*/app
- Use CVMFS for base releases (a setup sketch follows this list)
- discussing test release support in jobsub_client
- Shared File Access
- Feb - alien-cache
- dcache/xrootd
- ifdhc support is in ifdhc v1_7_2
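As a hedged sketch of what "base releases from CVMFS" looks like from a job's point of view, the lines below assume the fermilab.opensciencegrid.org repository with a UPS products area; the repository layout and setup script name are assumptions, not the official procedure.

```bash
#!/bin/bash
# Hypothetical example: set up experiment software from CVMFS on an OSG
# worker node instead of from a Bluearc /<project>/app mount.
# The repository path and setup script name are assumptions.
set -e

CVMFS_REPO=/cvmfs/fermilab.opensciencegrid.org

if [ -r "$CVMFS_REPO/products/common/etc/setups.sh" ]; then
    source "$CVMFS_REPO/products/common/etc/setups.sh"
    setup ifdhc        # pick up the ifdh client from the CVMFS-hosted UPS products
else
    echo "CVMFS repository not available on this node" >&2
    exit 1
fi
```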