June-28-2017

Slides: https://indico.fnal.gov/conferenceDisplay.py?confId=14775

Meeting Notes


Present

Margaret Votava - FNAL Project Sponsor/Scientific Computing Services Associate Head
Parag Mhashilkar - Project Lead
Marco Mambelli - Technical Lead
Hyunwoo Kim - Project Member
Marco Mascheroni - Project Member
Dennis Box - Project Member
Dave Mason - USCMS Tier1 Facility Manager
James Letts - CMS L2 Manager
Antonio Perez-Calero - CMS L2 Manager
Tony Tiradani - HEP Cloud Technical Lead
Steve Timm - HEP Cloud Technical Advisor
Krista Majewski - CMS Factory Operations
Joe Boyd - FIFE Support
Ken Herner - FIFE Support
Stu Fuess - Scientific Computing Facilities Associate Head/Scientific Computing Division Deputy Head
Jeff Dost - CMS/OSG Factory Operations
Mats Rynge - OSG Liaison
Brian Lin - OSG
Diego Davila - CMS Operations


Communication

  • Hiring for a new developer is in its final stages

Project Management

  • The next stakeholders meeting will be scheduled in the September 2017 timeframe
  • v3.2.20 will tentatively be released in the second half of July
  • v3.2.21 will tentatively be released around the September stakeholders meeting

Technical

  • There has been a long-standing feature request for a tool that answers the question "Why is my job not running?" The project intends to provide a tool that helps users and frontend operators with debugging. This is a hard problem to address, since a glidein request and the glidein itself cross several administrative boundaries during their lifespan.
    • Adding a new site is a long and painful process. Mats would like to see the tool assist with this issue. According to Jeff, this is a factory operations issue: setting up a new site requires some manual steps and communication with the site administrator.
    • According to CMS, users do not interact with the problem, so this tool will be useful to the frontend admins.
    • After some conversation, the general direction for the project is to approach the problem incrementally.
  • Jeff wanted to understand the priority of scaling factories to 600+ entries. With the addition of new VOs and European sites, the number of entries in the OSG factory is expected to go beyond 600.
    • Some changes were implemented (#7799) and released in GlideinWMS v3.2.19. The GlideinWMS team will do more thorough testing of scaling the factory up to 1500 entries (#17067).
  • Jeff wanted to understand the priority of the factory monitoring issues caused by changes in the semantics (glideins vs. slots) on the frontend side (#14559). This is a high-priority issue, as it makes factory debugging harder.
    • This issue will be addressed in the v3.2.20 release.
  • Jeff wanted to understand the support for HTCondor 8.6 in the factory
    • Steve: The HEPCloud factory has been operating with HTCondor 8.6.3 without issues.
    • Parag, Marco: Configuration and errors in the HTCondor logs related to secondary collectors should not impact the factory. Factory collectors do not see the huge number of startd ClassAds seen by a VO collector.
  • Jeff raised a concern about recent releases of GlideinWMS. Since 3.2.14, new features have introduced new bugs and side effects.
    • Parag: We will be focusing more on testing and stability in new releases.
    • Margaret: The project team is short of resources, and the new hire is expected to relieve some of these stress points.
  • James Letts presented the general goals of CMS
    • In recent months, CMS has been focusing on improving scalability, addressing issues noticed with HTCondor (depth-first filling) and GlideinWMS (adapting to the glidein pressure with a reduced number of jobs).
    • v3.2.19 addresses issues with GlideinWMS adapting to the glidein pressure with a reduced number of jobs (#16414)
      • Mats: This should also benefit OSG VOs
    • James also mentioned the issues still pending and those addressed so far in recent releases.
  • Mats reported that, based on his experience, supporting Singularity takes a lot more effort than OSG initially anticipated. The GlideinWMS team should take that into consideration.
    • Parag: Support for Singularity will be available in v3.2.20. We are already benefiting from the work done by Brian Bockelman. We are simplifying the scripts provided by Brian and will be releasing them in v3.2.20. Containerization is outside the scope of GlideinWMS itself.
  • The GlideinWMS team has heard the stakeholders' concerns and complaints listed above and will be paying more attention to them. We will be talking internally within the team and also directly with the respective stakeholders to address them.