h1. Information about job submission to OSG sites

This page captures some of the known quirks about certain sites when submitting jobs there.

h2. What this page is

Most OSG sites will work with the jobsub default requests of 2000 MB of RAM, 35 GB of disk, and 8 hours of run time, but some sites enforce stricter limits. Additionally, some sites only support certain experiments rather than the entire Fermilab VO. Here we list the OSG sites where users can submit jobs, along with all known cases where either the standard jobsub defaults may not work, or the site only supports certain experiments. *Information on this page is provided on a best-effort basis and is subject to change without notice.*

h2. What this page is NOT

This page is *NOT* a status board or health monitor of the OSG sites. Just because your submission fits in with the guidelines here does not mean that your job will start quickly. Nor does it keep track of downtimes at the remote sites. Its sole purpose is to help you avoid submitting jobs with disk/memory/cpu/site combinations that will never work. Limited offsite monitoring is available from https://fifemon.fnal.gov/monitor/dashboard/db/offsite-monitoring

h2. Organization

The following table lists the available OSG sites, their Glidein_site name (what you should put in the --site option), what experiment(s) the site will support, and finally any known limitations on disk, memory, or CPU.
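
For reference, a minimal offsite submission to one of the sites below might look like the following. This is illustrative only: the experiment, site, and script path are placeholders, the resource requests shown are just the jobsub defaults written out explicitly, and your experiment may provide its own submission wrappers.

<pre>
# Illustrative only -- experiment, site, and script path are placeholders.
# The resource requests shown are the jobsub defaults written out explicitly.
jobsub_submit -G <your experiment> \
  --resource-provides=usage_model=OFFSITE \
  --site=BNL \
  --memory=2000MB --disk=35GB --expected-lifetime=8h \
  file:///path/to/my_job_script.sh
</pre>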

h2. Important notes and caveats: READ THEM ALL!

*NOTE 1:* In some cases you may be able to request more than the jobsub defaults and still run fine. If you try a site with requests that exceed the jobsub defaults and the job does not start, running

%{color:red}jobsub_q --better-analyze --jobid=<your job id>%

will often give you useful information about why the job is not starting (e.g. it may recommend lowering the disk or memory request to a certain value). Where we have successfully tested memory requests above 2000 MB, the largest successful value is listed in the table.

*NOTE 2:* Under supported experiments, "All" means all experiments except for CDF, D0, and LSST. It does include DES and DUNE.

*NOTE 3:* The estimated maximum lifetime is just an estimate based on a periodic sampling of glidein lifetimes. It may change from time to time and it does NOT take into account any walltime limitations of the local job queues at the site itself. *It also does not guarantee that there are resources available at any given moment to start a job with the longest possible lifetime.* You can modify your requested lifetime with the --expected-lifetime option.

*NOTE 4:* Take care to convert appropriately when using the --memory switch with units in jobsub_submit. To stay consistent with HTCondor, *1GB = 1024MB in the --memory option*, not 1000 MB. So --memory=2GB is really --memory=2048MB, and so on. Thus, if you are trying to structure your submission to fit within a certain constraint and you are using GB as your units, remember to convert appropriately. All memory numbers on this page are in MB, the default HTCondor memory unit.
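
For example (the second line shows how a GB request can unintentionally exceed a site limit listed in MB):

<pre>
--memory=2GB   ->  2 x 1024 = 2048 MB
--memory=4GB   ->  4 x 1024 = 4096 MB   (would exceed a site whose listed maximum is 4000 MB)
</pre>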

*NOTE 5:* Asking for exactly the estimated maximum time is not a good idea because you can't guarantee that your job will match exactly at the beginning of the glidein lifetime. If you need close to the max time, be sure to ask for slightly under it. Of course, don't ask for the full time if you don't really need it!
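
For example, at a site whose estimated maximum lifetime is listed as 24 h, a request like the following (the value is illustrative) leaves a margin for the glidein having already run for a while before your job matches:

<pre>
# Site's estimated maximum lifetime is 24 h; ask for a bit less (value is illustrative)
jobsub_submit ... --expected-lifetime=22h ...
</pre>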

*NOTE 6:* The "jobsub defaults are OK" phrase does NOT include an SL6-only provision! If you do not specify the @--OS@ option in jobsub, the @DesiredOS@ job classad will not be set, and then you can run in either SL6 or SL7. The default is NOT SL6 only. At this point (June 2020), SL6 is very close to end of life and you should really be migrating to SL7 as soon as possible. We recommend controlling the OS and container by setting the SingularityImage classad as shown here: [[Singularity_jobs]]
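
One common way to do this is shown below; the image path is an example of a widely used SL7 worker-node image, and you should check [[Singularity_jobs]] for the currently recommended images before using it:

<pre>
# The image path below is one commonly used SL7 worker-node image; check the
# Singularity_jobs wiki page for the currently recommended images before using it.
jobsub_submit ... \
  --lines='+SingularityImage="/cvmfs/singularity.opensciencegrid.org/fermilab/fnal-wn-sl7:latest"' \
  file:///path/to/my_job_script.sh
</pre>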

|_. Site Name |_. --site option (sorted) |_. Supported Experiments |_. Known limitations |_. Maximum memory (MB) |_. Estimated maximum job lifetime |
| Brookhaven National Laboratory - ATLAS | BNL | All | jobsub defaults are OK, no multicore | 3000 | 23 h |
| Brookhaven National Laboratory - SDCC | BNL | DUNE only | jobsub defaults are OK, 4 cpu max | 12000 | 47 h |
| CBPF | BR_CBPF | DUNE only | jobsub defaults are OK, 2 cpu max | 6400 | 71 h |
| University of Bristol | Bristol | DUNE only | jobsub defaults are OK, max 8 cpu | 20000 | 71 h |
| Boston University ATLAS T2 | BU | All | jobsub defaults are OK | 2500 | 8-12 h |
| University of Victoria | CA_Victoria | DUNE only | jobsub defaults are OK, still commissioning | 10000 | 46 h |
| Caltech T2 | Caltech | All | jobsub defaults are OK | 4000 | 25 h |
| IN2P3 Computing Center Lyon | CCIN2P3 | DUNE only | jobsub defaults are OK, 4 cpus max | 10000 | 29 h |
| CERN Tier 0 | CERN | DUNE only | defaults OK, 10 cpu max | 20240 | 46 h |
| CIEMAT | CIEMAT | DUNE only | defaults OK, StashCache not available | 2500 | 63 h |
| Clemson | Clemson | All | jobsub defaults are OK | 14000 | 23 h |
| Colorado | Colorado | All | jobsub defaults are OK | 16000 | 46 h |
| Cornell | Cornell | All | jobsub defaults are OK | 2500 | unknown |
| FermiGrid | FermiGrid | All+CDF+LSST+D0 | jobsub defaults are OK | 16000 | 95 h |
| University of Florida HPC | Florida | DUNE only | jobsub defaults are OK | 4000 | 24 h |
| FNAL CMS Tier 1 | FNAL | All | jobsub defaults are OK | 16000 | 24 h |
| "Czech Academy of Sciences":http://monitor.farm.particle.cz/total_overview.php | FZU | DUNE and NOvA only | Do not use if you have an input tarball | 4000 | 47 h |
| University of Washington | Hyak_CE | All | available resources vary widely | Tested to 7000 | 3.5 h |
| JINR HTCondor CE | JINR_CLOUD | DUNE, Mu2e, NOvA | no multicore jobs | 2500 | 46 h |
| JINR Tier 2| JINR_Tier2 | NOvA only | no multicore jobs | 2500 | 46 h |
| Lancaster University | Lancaster | DUNE, g-2, uboone | defaults OK | 16384 | 71 h |
| University of Nebraska (Rhino) | Lincoln | All | jobsub defaults OK, max 8 cpu | 32000 | 23 h |
| University of Liverpool | Liverpool | DUNE and g-2 only | jobsub defaults OK, max 8 CPU | 16384 | 71 h|
| Imperial College London | London | DUNE only | max 4 cpu | 8192 | 47 h |
| Queen Mary University of London | London_QMUL | DUNE only | defaults OK, max 4 cpu per job | 16000 | 46 h |
| University of Manchester | Manchester | DUNE, g-2, uboone | currently SL7 only; single-core only for now | 3500 | 71 h |
| ATLAS Great Lakes Tier 2 (AGLT2) | Michigan | All | currently SL7 only | 2500 | approx. 10 h|
| %{color:red}MIT% | MIT | All + CDF | jobsub defaults are OK; %{color:red}blocked due to very high eviction rate% | 2500 | unknown |
| Midwest Tier2 | MWT2 | All | jobsub defaults are OK; single-core jobs will take a very long time to run if requesting more than 1920 MB of memory | Tested to 7680 | 5 h |
| Red/Sandhills | Nebraska | All | jobsub defaults are OK; some slots are SL6 Docker containers on SL7 hosts | Tested to 8000 | 48 h |
| NIKHEF | NIKHEF | DUNE only | jobsub defaults OK | 4096 | 29 h |
| Notre Dame | NotreDame | All | aim for short jobs due to preemption, max 8 cpu | 32000 | 24 h |
| Crane | Omaha | All | jobsub defaults are OK | 4096 | 47 h |
| Nebraska HTPC | Omaha | All | jobsub defaults are OK, 8 cpu max | 32000 | 23 h|
| Ohio Supercomputing Center | OSC | NOvA only | jobsub defaults are OK | 4096 | 19 h |
| PIC | PIC | DUNE only | defaults OK | 16384 | 59 h |
| INFN Pisa | Pisa | g-2 only | single-core only, multi-core coming | 2500 | 24 h |
| Rutherford Appleton Laboratory T1 | RAL | DUNE only | jobsub defaults are OK, 2 cpus max | 8000 | 59 h |
| Rutherford Appleton Laboratory T2 | SGrid | DUNE only | jobsub defaults are OK, 2 cpus max | 8000 | 59 h |
| University of Edinburgh | SGridECDF | DUNE only | jobsub defaults are OK, StashCache not available | 4000 | 48 h |
| University of Oxford | SGridOxford | DUNE only | jobsub defaults are OK | 16000 | 48 h |
| University of Sheffield | Sheffield | DUNE only | jobsub defaults are OK | 16000 | 48 h |
| Southern Methodist University | SMU_HPC | NOvA only | jobsub defaults are OK | 2500 | 24 h |
| Stampede (TACC) | Stampede | MINOS only | unknown maximum disk | 32000 | unknown |
| Stanford Proclus | HOSTED_STANFORD | All | varies but jobsub defaults should be OK | 16384 | estimated 12h |
| Gina, SURFsara | SURFsara | DUNE only | defaults OK, 4 cpu max | 16000 | 24 h |
| Syracuse | SU-ITS | All | request --disk=9000MB, no multicore yet | 2500 | 46 h |
| %{color:red}Texas Tech% | TTU | All but mu2epro and seaquest | jobsub defaults are OK; %{color:red}down since an OSG software upgrade on 2015/11/20% | unknown | unknown |
| University of Chicago | UChicago | All | linked with MWT2; recommend --memory=1920MB or less per core; many nodes have 3.x kernels, so be sure to set the UPS_OVERRIDE environment variable appropriately | Tested to 7680 | 5 h |
| University of California, San Diego | UCSD | All |jobsub defaults are OK | 4000 | 13 h |
| University of Bern | UNIBE-LHEP | uboone only | Requires special options passed via the jobsub --lines option: --lines='+count=1' --lines='+memory=3700' --lines='+runtimeenvironment = "APPS/HEP/UBOONE-MULTICORE-1.0"' --lines='+runtimeenvironment = "ENV/PROXY"' --lines='+runtimeenvironment = "APPS/HEP/UBOONE-OSG-WN-CLIENT-3.3"' (Note: it is no longer necessary to specify one core per 2000 MB of memory) | 3700 | 48 h |
| Grid Lab of Wisconsin (GLOW) | Wisconsin | All | jobsub defaults are OK | 8000 | 23 h |
| Western Tier2 (SLAC) | WT2 | uboone only | jobsub defaults are OK | 2500 | 10 days |