Project

General

Profile

May-11-2018

Slides: https://indico.fnal.gov/event/17079/


Meeting Notes


Present

Margaret Votava - FNAL Project Sponsor/Scientific Computing Services Associate Head
Parag Mhashilkar - Project Lead
Marco Mambelli - Technical Lead
Dennis Box - Project Member
Lorena Lobato - Project Member
Marco Mascheroni - Project Member/OSG Factory Operations
Burt Holzman - HEPCloud Sponsor Proxy
Tony Tiradani - HEPCloud Technical Lead
Antonio Perez-Calero - CMS L2 Manager
Brian Bockelman - OSG/CMS
Tanya Levshina - Scientific Distributed Computing Solutions Department Head
Steve Timm - HEPCloud Technical Advisor
Joe Boyd - FIFE


Communication

  • CMS strongly wants to be part of the strategy to replace Frontends with the Decision Engine.

Support


Project Management

  • HEPCloud and GlideinWMS communication channel: How does it work? What does HEPCloud need from GlideinWMS to succeed?
    • GlideinWMS has already done some of that work as part of the NERSC, AWS and GCE demonstrations. Both projects are working closely with HTCondor. At this point the project believes that this is adequate.
  • Put these meetings in an Indico category.
  • There is a roadmap showing who is working on which pieces of the big picture, to communicate to stakeholders, i.e. who has been working on which issues.
  • Marco Mambelli has two students for the summer working on two subprojects in GlideinWMS:
    • Improving monitoring
    • Understanding redundant queries to HTCondor and reducing them

Roadmap

  • Some of the tickets that were assigned to the April release have been moved to v3.4, which will be released at the end of May. There are significant changes and Marco wants more time to test them thoroughly.
    • Brian suggested communicating such changes in focus through the stakeholders mailing list in the future.
  • Brian would like the features and support for Factory operations to be more visible in upcoming stakeholders meetings. He also wants this to be captured in Redmine.

Technical

  • Singularity
    • In a recent security fix release, Singularity stopped supporting some of the options that OSG, CMS and GlideinWMS were using in their scripts. These options were not tested by the developers, and support was dropped without any communication from the project. This was caught in the OSG release testing, and GlideinWMS and other VOs provided the necessary fixes to their scripts. Brian was happy that the issue was caught before rolling into production and that the effort to get the fix out was coordinated.
    • Brian found the Singularity meetings organized by Marco to be productive and a step in the right direction. He wants these meetings to be held more frequently until a resolution is reached and the stakeholders sign off on the Singularity support in GlideinWMS.
    • Marco is working with FIFE to test the required Singularity support. Some of the changes will be released in v3.4.1. Marco will schedule a follow-up meeting soon.
    • CMS and OSG have solved the Singularity image selection problem in different ways. GlideinWMS uses a solution similar to OSG's. Can CMS change to a common solution?
    • Tanya reported that FIFE managed to run at Ohio Super Computing (OSC). There were some problems with worker nodes (WN) reporting wrong OS versions for the container. For now, FIFE will drop the request for a specific OS, as this creates confusion, and enable it again in June.
  • Killing glideins and glidein_off tool
    • Brian scripted the functionality in the past and found it useful.
    • GlideinWMS already provides a tool, but it needs to be fixed to support a shared port for the collector and CCBs.
    • Marco Mambelli: We are revisiting all the tools and fixing them as needed
    • Brian: Does CMS want to defrag the glideins or kill them?
    • Antonio: We are waiting on the pilot reload now and will be moving to pilot drain. Currently CMS is identifying the tools and options available before deciding on a possible solution.
  • condor_switchboard
    • Moving condor_switchboard support from the HTCondor team to the GlideinWMS team seems to be a step in the opposite direction. GlideinWMS is now responsible for additional software.
    • During discussions with the HTCondor team we identified a need for privilege separation for certain types of resources. This discussion will continue offline to understand the need and possibly stop using switchboard in the Factory. The GlideinWMS team is already working with the HTCondor team on this. Currently the team has only taken on a thin layer of switchboard from the HTCondor team, for its own use.
  • CMS Report presented by Antonio
    • Queries from the Frontend to the ITB collector may be redundant. Details are in the linked Google doc: https://docs.google.com/document/d/1gt8F4_-ZJih3Cjrbr4J7L3IFZOujxDXTA8ze7EYHFow
      • several queries in the same cycle from the same schedd
      • some unprojected queries
    • CMS wants to explicitly control the number of resources available:
      • Limits on cores. The current limits apply to glideins.
      • Marco will open a ticket and follow up with Antonio.
    • Other requests from CMS are:
      • A tool to drain a fraction of the pool.
      • A debug pilot: a pilot that can feed information back to the Frontend/manager of the pool, e.g. information about the pilot, failures, the node, etc.
  • New requests from Factory operations can be found at #19946
  • Decision Engine
    • Brian: Is there anything we could do to test the Decision Engine?
    • CMS is interested in participating in testing and evolution (Diego would be involved).
    • CMS will probably add effort to the project by providing testing.
    • Diego will be with CMS for 2 years.
    • He is tasked with regular operations and a project with a mid-range vision (which could be about the Decision Engine).