October 30 2013 glideinWMS Stakeholders' Meeting

Slides: https://indico.fnal.gov/getFile.py/access?contribId=0&resId=0&materialId=slides&confId=7597

Notes from Ruth Pordes:

  • How long will the project support V2 @ Fermilab?
    • As long as the stakeholders need it.
    • Inventory of where V2 FEs and Factories are running: the project will look into this.
    • A V2 front end can run against a V2 or V3 factory, but a V3 front end cannot run against a V2 factory.
    • CERN cloud ops have been using a V3 factory and are now moving to v3_2_1.
    • The OSG GF at UCSD is already on V3. The CERN GF used by mainstream CMS will move to V3 shortly. The OSG GF @GOC move is planned soon, but there is no set timeline yet.
  • What is the overhead of moving from V2 to V3 - CMS?
    • Input from OSG Factory - like any other upgrade except for one conversion tool - Jeff
  • Release date for V3.2.2 - Fermilab?
    • No. Need to clean up.
  • For the future - any stakeholder should feel empowered to give requirements and input to the project at any time; this does not need to be synchronized with the meeting.
  • Use case for reporting to different collectors - Local site admin capabilities to see what is happening to the glidein on their pool - GLOW/Condor. Burt to talk directly to Miron.
  • Improving the monitoring: Goal is to find out why jobs are not running (effectively)
  • And the mail lists
  • Platform support - SL5, SL6, Centos5, Centos6. If we can drop '5', that can simplify some of these issues. Factory and FE – few installations – could propose to drop SL5? Ask Stakeholders to come back before the next quarterly meeting with their plans for Factory and FE SL5 EOL. Burt will come back with a list of the advantages of dropping SL5.
  • Roundtable
    • OSG VO FE – input given through Brian in writing. Follow-up will be done.
      • OSG User Support – difficult to debug the pilots – is this included in the OSG written input? Can this be “high priority”? Burt is doing a proof of concept on this, but it needs real work to either use APFMON for the future or rewrite.
      • GlideinWMS V3 has an “error cluster” function which should give a lot of the information needed.
      • An improved bash function library would make testing, use of the information/class ads, and testing of new applications/sites easier. Burt has something using Vagrant that will bring up a factory/FE/CE using its own CA for testing. Will talk to Mats about whether this is more generally useful.
    • CMS UK – Adam Huffman – How did the project emerge?
      • Status of (console) logs back from the cloud: different for every Cloud implementation. Needs thought. Not on the release schedule yet.
      • Documentation – new people giving feedback.
    • GLOW/Condor - Usability and ease of installation and operation are high priority to address/maintain.
      • condor_who – allows an application on a running machine to connect and get information on what is happening in a glidein.
      • How do we expose as much information as possible about the glideins and the condor pool to the local system administrator?
    • OSG Operations – Rob, Igor, Jeff – Concerns are addressed in the plan (monitoring, scalability).
      • Automated clean up when problems occur.
      • List sent from Jeff to Burt for factory issues.
      • Config file hierarchy work is very useful – organizing the config files better for “site level” attributes across all sites. All tools based on config file format will need changing so this is a big change.
      • Parameterized reference variables – so no repetition of information.
      • External files to include in the config files. ?configd directory? Include structure? – both potentially.
      • Config file code – can this be a contribution?
      • Monitoring – currently either get everything or drill down one entry at a time. Would like to have flags that point to problematic entries in the “everything view”.
    • Fermilab - IF Gabriele - NOvA would like to preferentially schedule jobs on resources “owned” by the VO, then on Fermilab (friend), then on Community Clouds, then commercial clouds. Is there a structure for ranking where the pilots start? This is similar to the use case that OSG is talking about.
      • Can be done at the Condor layer.
      • Done with time-based scheduling in the FE at the moment; CMS pioneered it and is currently using it in production.
      • Can the provisioning step affect which resources are pulled in first, second, etc.?
      • Policies - Is it a strict ordering or a fraction or… ? Allocation based? Miron – be very careful with defining and supporting such policies. Current basic capabilities are fairly immature. These are very hard problems and need to be tackled very carefully.
    • Fermilab Operations – Joe, Steve – Everything is already a Redmine issue. Thank you to the GlideinWMS team for getting NOvA on the cloud. Endorse usability requests.
    • US CMS/CMS Operations – Lothar - aligned with plan for next release shown. New requirements coming through OSG with the resource provisioning needs. Look at understanding how to use monitoring that is available. Sorting out issues across GlideinWMS/Condor/CMS s/w.
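
As an illustration of the NOvA request recorded in the roundtable (VO-owned resources first, then Fermilab, then community clouds, then commercial clouds), the sketch below fills a slot request tier by tier. It assumes a strict ordering, which the notes leave as an open question, and it is not GlideinWMS or HTCondor code: the tier labels, the Resource class, and pick_resources are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical tier labels, most preferred first, following the order
# stated in the notes. Strict ordering is an assumption of this sketch.
TIER_ORDER = ["vo_owned", "fermilab", "community_cloud", "commercial_cloud"]


@dataclass(frozen=True)
class Resource:
    name: str        # hypothetical site label
    tier: str        # one of TIER_ORDER
    free_slots: int  # slots currently available at this resource


def pick_resources(resources, slots_needed):
    """Return (plan, unmet): plan is a list of (name, slots) filled in
    tier order; unmet is the number of slots that could not be placed."""
    plan = []
    remaining = slots_needed
    # sorted() is stable, so resources within one tier keep input order.
    for res in sorted(resources, key=lambda r: TIER_ORDER.index(r.tier)):
        if remaining <= 0:
            break
        take = min(res.free_slots, remaining)
        if take > 0:
            plan.append((res.name, take))
            remaining -= take
    return plan, remaining


if __name__ == "__main__":
    pool = [
        Resource("commercial-a", "commercial_cloud", 100),
        Resource("nova-farm", "vo_owned", 10),
        Resource("fnal", "fermilab", 20),
        Resource("community-a", "community_cloud", 5),
    ]
    print(pick_resources(pool, 32))
```

A fractional or allocation-based policy, as raised in the discussion, would need a different structure than this sorted fill; the sketch only covers the simplest strict-ordering case.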

Again please note: Burt is always available to talk about anything GlideinWMS related …