DataDisks vs Sites

This page collects some ideas on Data Disks versus sites.

Basically, when a job runs someplace and wants to consume a file, it makes sense to prefer file locations that are "closer" -- for some suitable definition of close. For example, if we have a file with replicas at CERN in EOS, at Fermilab in dCache, and at BNL in their dCache, clearly a job at CERN, Fermilab, or BNL would prefer the location at its own site, and most sites in Europe would probably prefer CERN to either U.S. location. A job at some other U.S. site might prefer Fermilab and BNL fairly evenly, as they both hang off ESnet from its perspective, so in that case it may make sense to load-balance between the two.

So to implement this, it makes sense to know the sites, which Data Disks reside at which sites, and some relative cost factor between sites.

Sites are fairly easy to define, but they do have multiple names. When you are a grid job, you have info in your environment like:


and of course these are all names for computing parts at Fermilab; and none of them are quite the right spelling for requesting Fermilab from OSG, and none of them are quite what a human would call them...

Okay, so basically, we need 3 or 4 database tables:

  • DataDisks_to_Sites: just a map that says what datadisks are at what sites
  • Sites: Site name, description, etc.
  • Site Aliases: the various names a site goes by, depending on where you look
  • Site_graph: site to site pairs, with cost(?).
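
As a rough illustration of the four tables above, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names, and the resolve_site helper, are assumptions made for illustration; nothing here is an existing SAM schema.

```python
import sqlite3

# Hypothetical schema: every table and column name is illustrative only.
SCHEMA = """
CREATE TABLE sites (
    site_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    description TEXT
);
CREATE TABLE site_aliases (
    alias   TEXT PRIMARY KEY,
    site_id INTEGER NOT NULL REFERENCES sites(site_id)
);
CREATE TABLE datadisks_to_sites (
    datadisk TEXT PRIMARY KEY,
    site_id  INTEGER NOT NULL REFERENCES sites(site_id)
);
CREATE TABLE site_graph (
    from_site INTEGER NOT NULL REFERENCES sites(site_id),
    to_site   INTEGER NOT NULL REFERENCES sites(site_id),
    cost      REAL NOT NULL,
    PRIMARY KEY (from_site, to_site)
);
"""


def create_schema(conn):
    """Create the four sketch tables on an open sqlite3 connection."""
    conn.executescript(SCHEMA)


def resolve_site(conn, name):
    """Return the canonical site name for a site name or alias, or None."""
    row = conn.execute(
        "SELECT s.name FROM sites s "
        "LEFT JOIN site_aliases a ON a.site_id = s.site_id "
        "WHERE s.name = ? OR a.alias = ? LIMIT 1",
        (name, name),
    ).fetchone()
    return row[0] if row else None
```

Keeping aliases in their own table means a job can look up its site from whatever name it finds in its environment, while the other tables only ever refer to one canonical site row.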

Then we can ask:

  • How many files for a given dataset reside at each site
  • What sites are "close" to sites x, y, z...

and use those answers to pick a site list to send a job to where it will have the files it wants close at hand.
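
The two questions above could be answered along these lines. This is only a sketch over plain dictionaries: the function names, the input shapes, and the convention that a site_graph key is (job site, data site) are all assumptions.

```python
from collections import Counter


def files_per_site(file_locations, disk_to_site):
    """Count how many files of a dataset have at least one replica at each site.

    file_locations: {file_name: [datadisk, ...]}   (hypothetical shape)
    disk_to_site:   {datadisk: site}
    """
    counts = Counter()
    for disks in file_locations.values():
        # A file with two replicas at one site still counts once for that site.
        for site in {disk_to_site[d] for d in disks if d in disk_to_site}:
            counts[site] += 1
    return counts


def close_sites(site_graph, targets, max_cost):
    """Return the target sites plus every site within max_cost of one of them.

    site_graph: {(job_site, data_site): cost}
    """
    result = set(targets)
    for (job_site, data_site), cost in site_graph.items():
        if data_site in targets and cost <= max_cost:
            result.add(job_site)
    return result


def candidate_sites(file_locations, disk_to_site, site_graph, max_cost, top_n=1):
    """Pick the top_n sites holding the most files, then add their close neighbors."""
    counts = files_per_site(file_locations, disk_to_site)
    best = {site for site, _ in counts.most_common(top_n)}
    return sorted(close_sites(site_graph, best, max_cost))
```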

Or we can ask:

  • What data disks are close to sites x, y, z

So we can pick locations to transfer files to in advance of a job launch to those sites, if we want to pre-place files.
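
For pre-placement, one possible reading of "close to sites x, y, z" is "within some cost of every one of those sites". The sketch below assumes that reading, assumes costs are symmetric, and treats a missing site pair as unreachable; the names are illustrative.

```python
import math


def pair_cost(site_graph, a, b):
    """Cost between two sites; 0 for the same site, infinity if the pair is unknown."""
    if a == b:
        return 0.0
    # Assumption: costs are symmetric, so look the pair up in either direction.
    return site_graph.get((a, b), site_graph.get((b, a), math.inf))


def disks_near_sites(disk_to_site, site_graph, job_sites, max_cost):
    """Return data disks within max_cost of every job site, cheapest first.

    job_sites must be non-empty.
    """
    ranked = []
    for disk, disk_site in disk_to_site.items():
        worst = max(pair_cost(site_graph, disk_site, s) for s in job_sites)
        if worst <= max_cost:
            ranked.append((worst, disk))
    return [disk for _, disk in sorted(ranked)]
```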

And finally, we can better ask:

  • Which replicas of files are preferred at site x

to choose delivery locations for consumers of SAM projects.
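
Replica preference for a consumer at a given site might look like the following sketch: order replicas by cost, and shuffle replicas of equal cost so that load is spread between them (the Fermilab/BNL case from the introduction). The function name, the cost callable, and the replica labels used with it are hypothetical.

```python
import random


def preferred_replicas(replica_sites, job_site, cost, rng=random):
    """Order replica locations for a job: cheapest first, shuffled within ties.

    replica_sites: {replica_location: site}
    cost:          callable (job_site, replica_site) -> number
    rng:           object with a shuffle() method, e.g. random.Random(seed)
    """
    by_cost = {}
    for location, site in replica_sites.items():
        by_cost.setdefault(cost(job_site, site), []).append(location)

    ordered = []
    for c in sorted(by_cost):
        group = sorted(by_cost[c])  # deterministic starting order
        rng.shuffle(group)          # load-balance between equal-cost replicas
        ordered.extend(group)
    return ordered
```

Shuffling per request is the simplest way to balance; a real delivery service might instead weight by current load, which this sketch does not attempt.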

h2. (too?) Simple continental cost model

One way to approximate the cost of transferring data to/from a given site is a simple continent based model, where you have 3 levels of cost:

  1. You are at the same site as the SAM data disk, or a co-located site
  2. You are not at the same site, but on the same continent (e.g. North America)
  3. You are on different continents (e.g. North America vs Europe, or North America vs South America ...)

This cost model is tweakable by defining fake continental boundaries, and it lets us populate the model easily at the start. We can also keep it as a fallback for site pairs that have no actual timed/reported costs, if we ever get those.
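
A minimal sketch of the three-level model, with measured costs taking priority when present. The numeric levels 1, 2, 3, the function names, and the input shapes are assumptions; in particular, this assumes any measured costs are expressed on a scale comparable to the three levels.

```python
SAME_SITE = 1
SAME_CONTINENT = 2
DIFFERENT_CONTINENT = 3


def continental_cost(site_a, site_b, site_continent, colocated=frozenset()):
    """Three-level cost between two sites.

    site_continent: {site: continent}; a "continent" may be a fake region,
                    which is how the model gets tweaked.
    colocated:      set of frozenset({site, site}) pairs treated as one site.
    """
    if site_a == site_b or frozenset((site_a, site_b)) in colocated:
        return SAME_SITE
    if site_continent[site_a] == site_continent[site_b]:
        return SAME_CONTINENT
    return DIFFERENT_CONTINENT


def site_cost(site_a, site_b, site_continent, measured=None, colocated=frozenset()):
    """Use a measured cost if we have one, else fall back to the continental model."""
    if measured and (site_a, site_b) in measured:
        return measured[(site_a, site_b)]
    return continental_cost(site_a, site_b, site_continent, colocated)
```

Putting a site into a made-up continent of its own, for example, makes every other site see it at the highest cost level without touching the code.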