Project

General

Profile

May-08-2019

Slides: https://indico.fnal.gov/event/17326/


Present

Margaret Votava - Project Sponsor/SCS Quadrant Head
Parag Mhashilkar - Project Lead
Marco Mambelli - Technical Lead
Dennis Box - Project Member
Lorena Lobato - Project Member
Marco Mascheroni - Project Member/OSG Factory Operations
Burt Holzman - HEPCloud Project Sponsor
Steve Timm - HEPCloud Technical Advisor
Antonio Perez-Calero Yzquierdo - CMS
Brian Lin - OSG Software
Jeff Dost - OSG Factory Operations
Marian Zvada - OSG Operations
Tanya Levshina - FIFE
Ken Herner - FIFE
Joe Boyd - FIFE
Mike Kirby - FIFE
Edgar Hernandez - OSG/GLOW


Communication

  • Stakeholders/Admins should pay close attention to the release notes for changes to the HTCondor configuration. This is extremely important if the admins do not use the configuration shipped with the glideinwms RPMs but instead maintain their own version managed by Puppet/Chef.

Support


Project Management

  • Next Stakeholders meeting on July 10, 2019. https://indico.fnal.gov/event/17328/
  • Burt: Is the list on the roadmap slide ordered? Working with HPC sites that have no network connection is important to HEPCloud.
    • Mambelli: It's not ordered. The work depends on HTCondor providing a mechanism that lets GlideinWMS make HTCondor work at such sites.

Roadmap

https://cdcvs.fnal.gov/redmine/projects/glideinwms/wiki/RoadmapSummary


Technical

  • Pilots not terminating correctly on certain sites
    • The fix in 3.4.5 results in HTCondor terminating correctly, but there are some lingering processes at Purdue, and this is impacting CMS jobs. Marco is working with the Purdue admins.
  • Singularity Discussion
    • The Singularity changes in 3.4.5 are related to the OSG and CMS Singularity wrappers.
  • Shared Port Discussion
    • Edgar: Which HTCondor daemons does the shared port configuration affect?
      • Mambelli: Sometimes the schedd and sometimes the collector. In 3.4.5, we allow Condor to use the shared port based on what you specify in the config. In past versions, the schedd was not using the shared port daemon.
    • Edgar: Does that mean we do not need secondary collectors, or just secondary ports?
      • Mambelli: We do not need secondary ports, but we still need secondary collectors.
      • Antonio: CMS has been using shared port already. Secondary collectors are configured to use shared port. The backup is using secondary ports.
    • A complete migration to shared port needs close coordination with the admins.
  • Singularity Discussion
    • Edgar: Does Singularity work over WAN?
      • Mambelli: We have some setup, but suggestions would be useful. It's not easy because of site firewall issues. Condor-started Singularity will work if using Condor 8.8. For condor_ssh_to_job to work, you need to start Singularity in unprivileged mode, since Condor is not started as root.
      • James: We are encouraging T2 sites in USCMS to move to unprivileged Singularity. Caltech is completely done and Purdue is on the way. This requires Red Hat 7.6.
      • Steve: Will it break other users who have Singularity scripts?
        • Edgar: It should not. To have unprivileged Singularity, you can use the one available in CVMFS.
        • Mambelli: The glidein uses Singularity from CVMFS behind the scenes if needed. The work was done with Condor support. There is a kernel version requirement, and you need to enable the option that allows unprivileged Singularity invocation.
        • Edgar: This is very cool. One way to do it: could the pilot advertise whether it is using privileged or unprivileged Singularity?
  • Python 2 -> Python 3 discussion
    • Edgar: As long as OSG ships glideinwms 3.4 and supports that OSG version, we need to support a Python 2 based version.
    • Brian: OSG 3.5 will drop support for RHEL6. Regular support for the 3.4 version will continue for 6 months once 3.5 is out. OSG will support EL7, which has Python 2.7 as its default.
    • Edgar: Migrating glideinwms from one machine to another is difficult. This also applies to all services. We are also restricted to supporting certain versions; for example, LIGO is starting a run and we can't touch the software until the middle of next year.
  • Brian: What's the support model for glideinwms 3.5 and 3.6?
    • Mambelli: glideinwms 3.5 will go into upcoming and become 3.6.
    • Brian: OSG 3.5 will be available by the end of summer, and that starts the timer for 3.4; glideinwms needs to support the version in 3.4.
    • Parag: The GlideinWMS version support model is similar to HTCondor's. If the factory-frontend communication protocol does not change, we can support older and newer frontends provided the factory is on the latest release. However, older frontends may not have access to newer features.
  • Discontinuing support for GT2/GT4
    • Jeff: There might be some lingering entries in the factory config. These sites have not been working for a while now, and the site admins have been notified. These sites should not be an issue.
  • Discontinuing support for glexec
    • Steve: DUNE is still forced to use glexec at some site in Europe, and they are working with the admins to stop using it. It will be a couple of months before the changes take effect.
  • Antonio: Do you have info on the monitoring?
    • Edgar: We have a prototype solution for the factory sending info to OSG GRACC. We need to have a student spend more time on it and merge the code into glideinwms, and then we need to understand how to do this in the frontend.
    • Mambelli: We had a student who did this last summer and who will be available this summer. The work is on a branch.
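
The version support rule Parag described above can be sketched as a small check. This is only an illustration under stated assumptions: the Release class, the integer "protocol" revision, and the 3.6.0 version number are hypothetical and are not part of GlideinWMS.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Release:
    """A GlideinWMS installation (hypothetical model for illustration)."""
    version: Tuple[int, int, int]
    protocol: int  # hypothetical factory-frontend protocol revision


def factory_can_serve(factory: Release, frontend: Release, latest: Release) -> bool:
    """Supported only if the factory runs the latest release and the
    factory-frontend protocol is the same on both sides."""
    return factory.version == latest.version and factory.protocol == frontend.protocol


def frontend_gets_new_features(factory: Release, frontend: Release) -> bool:
    """A frontend older than the factory may lack features added later."""
    return frontend.version >= factory.version


# Example values are assumptions, not real release data.
latest = Release((3, 6, 0), protocol=1)
factory = Release((3, 6, 0), protocol=1)
old_frontend = Release((3, 4, 5), protocol=1)

print(factory_can_serve(factory, old_frontend, latest))
print(frontend_gets_new_features(factory, old_frontend))
```

In this sketch an older frontend is still served by an up-to-date factory, but it is flagged as possibly missing newer features, which matches the caveat in the discussion.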

ACTION ITEMS

  • Marco Mambelli
    • will send info on the ticket related to "Pilots not terminating at certain sites" - #22509
    • will look at the comments on GlideinWMS release notes made by Brian Lin and get back to him
    • will get back to Edgar on the topic of Singularity over WAN
    • will investigate having the pilot advertise whether it can use privileged or unprivileged Singularity (considered a good idea); a ticket with more details needs to be opened
    • will coordinate monitoring-related tasks with Edgar based on the work done by students