November-15-2016

Slides: https://indico.fnal.gov/conferenceDisplay.py?confId=13265

Meeting Notes


Present

Margaret Votava - FNAL Project Sponsor/Scientific Computing Services Associate Head
Parag Mhashilkar - Project Lead
Hyunwoo Kim - Project Member
Marco Mambelli - Project Member
Dennis Box - Project Member
Dave Mason - USCMS Tier1 Facility Manager
Tony Tiradani - HEP Cloud Technical Lead
Brian Bockelman (OSG/CMS) - OSG Liaison for GlideinWMS
Jeff Dost - OSG Factory Operations
Joe Boyd - FIFE Support
Tanya Levshina - Scientific Distributed Computing Solutions Department Head/FIFE Support
Stu Fuess - Scientific Computing Facilities Associate Head/Scientific Computing Division Deputy Head
Steve Timm - HEP Cloud Technical Advisor


Communication

  • Nothing specific

Project Management

  • The next stakeholders meeting is to be scheduled in the February 2017 timeframe

Technical

  • Margaret Votava asked whether CMS is happy with the current status of the project.
    • Dave Mason: There are important scale issues, but they are mostly on the HTCondor side.
    • Parag Mhashilkar: The project will keep an eye on developments on the HTCondor side and adapt appropriately.

Factory Operations

  • Jeff Dost was interested in resurrecting the milestone to aggregate factory monitoring. Jeff has students at SDSC who are currently working on this. Because user communities are moving towards custom monitoring solutions, a previous stakeholders meeting had decided that this milestone was less important.
    • Parag Mhashilkar: We will reinstate the priority, keep tracking this milestone, and integrate contributions from SDSC when available. The project is moving towards making as much monitoring information as possible available through classads to facilitate custom dashboard-style monitoring.
  • Since the recent changes in the semantics of glideins vs. slots in the frontend, factory monitoring has become ambiguous, and there is a scaling issue in the factory monitoring.
    • Request is now tracked through issue #14559
  • Factory operations recently observed that wget is not available on worker nodes at some of the sites and requested that the glidein scripts fall back to curl before giving up.
    • Request is now tracked through issue #14558
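
The wget-to-curl fallback requested above could look roughly like the following. This is a minimal Python sketch for illustration only, not the project's implementation: the function name `build_fetch_command` and its parameters are hypothetical, and the glidein scripts themselves are not shown in these notes.

```python
import shutil


def build_fetch_command(url, dest, which=shutil.which):
    """Return an argv list that downloads url to dest.

    Prefers wget and falls back to curl; raises if neither is on PATH.
    The `which` parameter is injectable so the choice can be tested
    without depending on the tools installed locally.
    """
    if which("wget"):
        # -q: quiet, -O: write the document to the given file
        return ["wget", "-q", "-O", dest, url]
    if which("curl"):
        # -f: fail on HTTP errors, -s/-S: silent but still show errors,
        # -L: follow redirects, -o: write output to the given file
        return ["curl", "-fsSL", "-o", dest, url]
    raise RuntimeError("neither wget nor curl is available on this node")
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, so a failed download surfaces as an exception only after both tools have been considered.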

OSG (Brian Bockelman)

  • Brian mentions that initial packaging of Singularity (a glexec replacement) is available through the OSG upcoming repos. Singularity is a container-based solution that provides isolation at the OS level. It does not yet have the same traceability functionality as glexec.
    • HTCondor 8.5.8 should have the required changes to support Singularity, and a pilot site will be available to try it out. Configuration knobs are currently documented in the HTCondor ticket.
    • Parag Mhashilkar: If it is straightforward to deploy, Marco Mambelli can try to deploy it in one of the test clusters the project uses for testing. This should make it easier to make appropriate changes in GlideinWMS and test them.
  • Brian wants to understand whether the feature "Advertise payload info in the glideins classad on glidein termination" will add information to the glidein job's classad or to the glidein's startd classad.
    • Parag Mhashilkar: We would like to add the info to the job's classad
    • Request is now tracked through issue #13277
  • Brian wants to understand how the info from the feature "Collect performance statistics for the factory and frontend services for health monitoring" will be made available
    • Parag Mhashilkar: Through the classads. Either existing GlideinWMS-specific classads will be selected or a new classad type will be created.
    • Request is now tracked through issue #11851
  • Brian mentions that there were a few issues with held glideins and their recoverability in the past. He wants the project to work with the HTCondor team to iron out those issues.
    • Parag Mhashilkar: Currently, the job's holdreasoncode and holdreasonsubcode for CondorCE and cloud glideins are both 0. This makes it difficult to distinguish transient errors from permanent ones. The factory, however, does have functionality to look up the holdreason to distinguish types of errors. This list can be expanded until HTCondor fills in correct codes and subcodes.
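
The hold-reason lookup described above can be sketched as follows. This is an illustrative Python example under stated assumptions, not the factory's actual code: the function name `classify_hold`, the two pattern lists, and the return labels are all hypothetical placeholders, and a real list would be built from hold reasons observed in operation.

```python
import re

# Hypothetical example patterns, for illustration only.
TRANSIENT_PATTERNS = [r"timed out", r"connection refused", r"temporarily unavailable"]
PERMANENT_PATTERNS = [r"authentication failed", r"not authorized"]


def classify_hold(code, subcode, reason):
    """Classify a held glidein from its hold attributes.

    Returns 'use_codes' when a non-zero code or subcode is present,
    otherwise 'permanent', 'transient', or 'unknown' based on matching
    the free-text hold reason against the pattern lists.
    """
    if code != 0 or subcode != 0:
        # Informative codes exist; interpreting them is outside this sketch.
        return "use_codes"
    text = reason.lower()
    # Check permanent patterns first so a fatal error is never retried
    # just because the message also mentions a transient symptom.
    if any(re.search(p, text) for p in PERMANENT_PATTERNS):
        return "permanent"
    if any(re.search(p, text) for p in TRANSIENT_PATTERNS):
        return "transient"
    return "unknown"
```

Keeping the patterns in plain lists means the lookup can grow one entry at a time, and the early `use_codes` branch leaves room to switch over once HTCondor reports meaningful codes and subcodes.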