
FermiGrid Bluearc Unmount Task Force

Charge

To write a proposal, including a straw man schedule, for a staged unmounting
of Bluearc disks from the Fermigrid worker nodes.

The motivation is:
  • to improve application portability to general OSG sites
  • to reduce Bluearc server overloads affecting Fermigrid

The proposal needs to include a feasibility study of which file systems
can be unmounted and of the impact on both the user groups and the service providers
(e.g., grid and cloud services, data management, network storage).

Additionally, it should include an analysis of the frequency of these problems
over the last six months, and the metrics we need to demonstrate
that this approach would make offline computing significantly more robust.

Task Force Members

Who                  Email      Role
Gerard Altayo        gerard1    Fermigrid
Dmitry Litvintsev    litvinse   DMS
Marc Mengel          mengel     SCS/SDP/DM
Andy Romero          romero     SNS
Marco Slyz           mslyz      Aux Files
Steve Timm           timm       Fermigrid
Matt Tamsett         tamsett    NOvA analysis
Arthur Kreymer       kreymer    SCS/SDP/DM and chair

Input to this process is welcome from all interested parties.
We will work directly with CS Liaisons.

Working Process

We see this task force as sharply focused on preparing the proposal and strawman schedule,
and providing pointers to supporting documentation.

The primary product of the Task Force is the Plan, official as of V1.0,
http://cd-docdb.fnal.gov/cgi-bin/ShowDocument?docid=5522

That documentation should already largely exist.
We are collecting pointers here,
improving documents slightly as needed,
and giving an executive summary.

Since there is a nearly complete overlap with FIFE membership,
we have had short meetings immediately after FIFE weekly meetings as needed.

Most of the work is being done in smaller discussions
between appropriate small groups of interested people,
especially those with conflicting opinions.

We have opened Redmine Issues to track detailed technical discussions.

Background

Bluearc is a proprietary, high performance NFS/CIFS/FTP server used
widely for core computing services at Fermilab. See the performance
details in this wiki.

There are four major storage areas, the last of which is our primary focus here.
Small projects not needing dedicated space work under the /grid directories.
Larger projects have their own app and data volumes.
  • /grid/fermiapp - 2 TB
    • shared software, and small client software
  • /grid/data - 27 TB
    • data files for smaller clients
    • legacy files before /<project>/data areas existed
  • /<project>/app - 41 TB
    • software
  • /<project>/data ( and similar ) - 1600 TB
    • non-executable data files

App areas are typically accessed directly, and have not been a performance problem.

/grid/fermiapp supports smaller clients, and the small parts of large
client software appropriate for this area, such as tools to establish a
working environment for users.

The data areas are intended for 'project' files, which are relatively
volatile and are not archived to tape with the usual Fermilab
DCache/Enstore system. Bluearc has quotas, and files are not removed
without user intervention. Until the 2014 deployment of DCache
scratch pools, Bluearc was the primary resource for Intensity Frontier
clients' non-archived files. Many of these files are now being moved to
DCache, initially concentrating on managed production files.

There remains a need for Bluearc style storage, with quota, persistence,
and low overhead for cases needing many small files.

Bluearc files are served by 'heads'. These can become overloaded
when too many simultaneous unique transfers are attempted.

We have reduced the rate of overloads and their impact by placing
data and app areas on separate heads, and by requiring use of a
software layer called ifdhc to regulate access to data files.
Use of ifdhc or its predecessor cpn has been required since 2009.
See the performance documentation in this wiki.

The only allowed direct access to data has been to Auxiliary highly
shared files, for which alternate access methods are now available.

Proposal

While we learn to make effective use of newly deployed DCache scratch areas,
and for use cases which are not a good fit for DCache,
we continue to need Bluearc project areas for unmanaged user analysis files.

In order to provide portability to general OSG sites,
and to avoid server overloads from unregulated client access,
we require all access to Bluearc to go through the ifdhc layer.
Any direct access (cd to a directory, read or write directly) should fail.
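
As a concrete illustration of this requirement, the sketch below shows the expected
pattern from a Python job script: stage files with the ifdh cp command instead of
opening a Bluearc path directly. The /minerva/data paths, file names, and the
ifdh_cp helper are hypothetical placeholders, not part of any existing workflow.

# Minimal sketch only: copy files to and from Bluearc with "ifdh cp"
# instead of reading or writing the NFS path directly.
# The /minerva/data paths below are hypothetical placeholders.
import os
import subprocess

def ifdh_cp(src, dst):
    """Run 'ifdh cp src dst' and raise an error if the copy fails."""
    subprocess.run(["ifdh", "cp", src, dst], check=True)

local_dir = os.getcwd()

# Stage the input from a Bluearc data area onto the worker's local disk.
ifdh_cp("/minerva/data/users/someuser/input.root",
        os.path.join(local_dir, "input.root"))

# ... run the job on the local copy ...

# Copy the output back; never open the Bluearc path directly from the job.
ifdh_cp(os.path.join(local_dir, "output.root"),
        "/minerva/data/users/someuser/output.root")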

This has been the policy for at least two years,
enforced by manual intervention to cancel offending jobs.

We will remove the risk of such overloads by removing the traditional
/*/data type mount points on Fermigrid GPGrid worker nodes.
This will be done project by project.
New projects should not have the /*/data mounts established.

Due to the lack of appropriate FTP servers, and to minimize risk,
we will initially move data on Fermigrid via hidden NFS mount points.
This lets us proceed immediately while evaluating long term alternatives.
Smaller clients can dismount completely if their volume is low.
This is all transparent to the users.

Storage Services

Documents and a summary of the capability and capacity of the services

  • Bluearc
    • Low latency robust file handling
    • Aggregate 1/2 GByte/second
  • DCache
    • Moderate latency with file access restrictions for efficiency
    • Supports both tape-backed and volatile storage, with small file support for tapes.
    • Highest throughput, 1 GByte/sec per pool, limit is overall network
    • Protocols include dcap, ftp, nfs 4.1, webdav, xrootd
    • Self throttling by pools prevents overloads
  • GridFTP
    • Presently of limited capacity and support.
    • Highly shared GBit network
    • Best effort maintenance on highly shared Fermicloud VMs
  • ifdh locks
    • presently via lock files in /grid/data; should move elsewhere (see the sketch after this list)
    • rate limit seems about 5/second
    • present limit 5 per group
      • matches Bluearc capacity while avoiding head overloads
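
To make the lock mechanism concrete, below is a minimal conceptual sketch of
file-based concurrency limiting, assuming a shared lock directory and a fixed
limit; it is not the actual ifdhc/cpn implementation, and LOCK_DIR, MAX_LOCKS,
and the helper names are placeholders.

# Conceptual sketch only, not the ifdhc/cpn code: limit concurrent Bluearc
# copies by holding one lock file each in a shared directory.
# LOCK_DIR and MAX_LOCKS are hypothetical placeholders.
import os
import time
import uuid

LOCK_DIR = "/grid/data/ifdh_locks"   # shared NFS directory (placeholder)
MAX_LOCKS = 5                        # matches the '5 per group' limit above

def acquire_lock(poll_seconds=30):
    """Wait until fewer than MAX_LOCKS lock files exist, then create one."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    lock_path = os.path.join(LOCK_DIR, "lock.%s" % uuid.uuid4().hex)
    while True:
        # The real tool must handle the race between this count and the
        # creation below; this sketch ignores it for brevity.
        if len(os.listdir(LOCK_DIR)) < MAX_LOCKS:
            # O_CREAT|O_EXCL makes the creation atomic on the shared file system.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return lock_path
        time.sleep(poll_seconds)

def release_lock(lock_path):
    """Remove our lock file so another transfer can proceed."""
    os.remove(lock_path)

# Usage: hold a lock only for the duration of the copy.
# lock = acquire_lock()
# try:
#     ...  # run 'ifdh cp' here
# finally:
#     release_lock(lock)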

Usage

See FGBUsage for details.

The core Bluearc usage is
  • APP areas
    • contain project and user software releases
    • available via NFS mounts and CVMFS
    • have not been the source of overloads
  • DATA areas
    • for volatile project files and temporary production files
    • presently there are some highly shared 'Auxiliary' files.
    • should always be accessed via ifdh cp

Mounts

See FGBMounts for details.

Summary of Fermigrid mounts, space in TB, from 2015/03/18

Project    app  data  AUX    comments
argoneut     2    35  F
cdf                          code is not BA; project areas are mounted interactively only
coupp        1    12
d0           1               project areas mounted on d0grid, clued0
des             sim 24, orchestration 100; des20, des21 not BA
ds50         1    25
gm2          4    21
ilc             accelerator 1, ilc4c 4, sid 5, ild 5
lariat       2     8
lbne         2    30         data2 = 30
minos        7   236  F,L,T
minerva      2   240  F      data2 = 50, data3 = 25
mu2e         1    70         data2 = 10
nova        10   140         ana 95, prod 100
nusoft       2    25  F,L
e906         1     3
uboone       3    52  F
lar1nd       1        F
grid         2    27         app mount is /grid/fermiapp

AUX highly shared files include F - beam flux, L - library, T - template

Flux files model neutrino beam characteristics, and are used by all neutrino beam projects in their simulations.

Library files are used in analysis jobs to identify particles by matching to event data. Shared files can be up to 100 GB, and are sometimes loaded into memory for speed. They are mainly used by MINOS and NOvA.

Template file collections are usually under a few GBytes, used in various ways by analysis jobs.

In some cases the most efficient running requires a shared cache on worker nodes. Existing direct access provides this via the local worker memory cache and Bluearc head caching. But this does not work on OSG, and it opens the door to overloads when non-shared files are accessed.
We think AUX file tools presently under test will provide good performance and OSG readiness. See the Impact Statement.
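
As a rough illustration of the cache-then-fetch pattern (not the actual AUX file
tools under test), a job can check a node-local cache directory before pulling a
shared file with ifdh cp. The CACHE_DIR path, the fetch_aux helper, and the
example file path are hypothetical.

# Illustrative sketch of node-local caching for highly shared AUX files;
# this is not the AUX file tooling under test.  CACHE_DIR and the source
# path are hypothetical placeholders.
import os
import subprocess

CACHE_DIR = "/scratch/aux_cache"   # node-local disk (placeholder)

def fetch_aux(src):
    """Return a local path for src, copying it with ifdh cp only if not cached."""
    local = os.path.join(CACHE_DIR, os.path.basename(src))
    if not os.path.exists(local):
        os.makedirs(CACHE_DIR, exist_ok=True)
        # Copy to a temporary name first so a partially copied file is never used;
        # concurrent fetches on the same node are not handled in this sketch.
        tmp = local + ".part"
        subprocess.run(["ifdh", "cp", src, tmp], check=True)
        os.rename(tmp, local)
    return local

# flux_file = fetch_aux("/nusoft/data/flux/some_flux_file.root")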

Some performance issues have been due to access to
/grid/data, /nusoft/data, Dzero project areas, etc.

Some performance issues come from outside Fermigrid
(MiniBooNE farm, etc.).

Performance

See FGBMon for details.

Bluearc performance is monitored in detail both internally and on clients.

We have logs and sometimes plots and alarms for
  • Server open files
  • Server loads and performance metrics
  • Fermigrid client open files
  • Client data rates for the major file systems and hosts
  • Gridftp server availability

Issues

There are a variety of root causes of overloads.
The classic symptom is that the Bluearc head gets very busy.
At this point the head can detach from the network,
causing NFS mounts to go stale or read-only.

This is usually due to user scripts on Fermigrid not making use of the
ifdhc layer to access Bluearc.

  • Accelerator Division / APC has a job structure that writes directly to /grid/app and /grid/data; they need to be onboarded to jobsub/ifdhc, otherwise their jobs will break.
  • CDMS was recently flagged for heavy direct Bluearc access; they need to be onboarded as well.
  • Some of the MARS groups (marslbne, we think) do, or did, some direct Bluearc writes.

See FGBIssues for details.

Impact

See FGBDataImpact for details of the impact of Data dismounting.

Summary tables :

Low impact
  • no or light use of Aux files
  • low data rates, allowing fallback to GridFTP servers
    Project      comments
    argoneut
    coupp/pico
    des          /des/orchestration
    ds50
    e906         seaquest
    genie
    gm2
    ilc          accelerator, ilc4c, ilcd, sid
    lariat
    lbne         possible marslbne direct access issue?
    mu2e
    numix
    nusoft       flux files?
    uboone

High impact

Project      comments
d0           not using ifdh
minerva      Flux
minos+       LEM, Flux, Template
nova         Flux
cdms         direct writes to /grid/data -- not onboarded to ifdh or jobsub
accelerator  direct writes to /grid/data -- not onboarded to ifdh or jobsub
patriot      direct writes to /grid/data -- not onboarded to ifdh or jobsub
marsmu2e     direct writes to /mu2e/data/users/outstage/*


Schedules

Strawman Schedule

  • /*/data
    • immediate - stop mounting new project data on Fermigrid
    • done Jan 2015 - release ifdhc v1_7_2 supporting alternate mount points
    • moot - unmount data from GPWN local batch, very lightly used
    • schedule - move projects as appropriate to alternate or no mounts
      • Some projects can use ftp fallback immediately, at a modest scale
  • /grid/data
    • schedule - should be moved to the data head
      • Should first move the ifdhc locks to somewhere on the app head, possibly /grid/app
      • The SNS group notes that the existing /grid/data is on disk that is vintage 2006; if it is to continue, it must be on different hardware
  • /*/app
    • Use CVMFS for base releases
    • discussing test release support in jobsub_client
  • Shared File Access
    • Feb - alien-cache
    • dcache/xrootd
    • ifdhc support is in ifdhc v1_7_2

Data Unmount and Removal Schedule