Stakeholders Meeting November-13-2019

Indico: https://indico.fnal.gov/event/17332/
(Slides)

The meeting time was set incorrectly in Indico: 11 am instead of 10 am. We believe this was due to the time zone change. Marco has already corrected it and will fix it for future meetings.


Present

Marco Mambelli - Project Lead (room)
Lorena Lobato - Project Member (room)
Bruno Coimbra - Project Member (room)
Marco Mascheroni - Project Member/OSG Factory Operations
Antonio Perez-Calero Yzquierdo - CMS
Jeff Dost - OSG Factory Operations
Krista - HEPCloud Technical Lead
Marian Zvada - OSG Operations
Edgar Hernandez - OSG/GLOW
Joe Boyd -
Ken Herner - FIFE Technical Lead (room)
Burt Holzman - HEPCloud Project Sponsor (room)
Mike Kirby -
Andrew Norman - HEPCloud Technical Lead (room)
Steve Timm - HEPCloud Technical Advisor
Tony Tiradani - HEPCloud Technical Lead


Updates

  • Lorena is leaving the project - Bruno is joining the project
  • GlideinWMS v3.6 released Sep 25 (OSG production)
  • GlideinWMS v3.6.1 RC yesterday, 12 Nov
  • Project effort reduced to 2.50 FTEs + 1 on-call student (a couple of times per month)

Communication

  • Reminder for the stakeholders to periodically review the tickets/priorities
  • Ideas from the stakeholders meeting for improving communication
  • More support. There is an incompatibility with the HTCondor 8.8 configuration in OSG 3.5 (a file was added that disables GSI authentication and overrides the GWMS configuration). We have fixed this for version 3.6.1

Action items

  • Done: a roadmap overview
  • Marco started working on adding a GPU cluster to the ITB Frontend/Factory in FermiCloud.

Releases

  • GlideinWMS v3.6 released Sep 25 in OSG production (a renaming of v3.5.1)
  • OSG and CMS production Factories are still on v3.4.6
  • 3 releases in the pipeline: v3.6.1 in the production series for OSG 3.4 and 3.5, v3.6.2 in the production series (mid Dec), and v3.7 in OSG upcoming.

Roadmap:

Several stakeholders asked what the different colors in the slides mean.
TODO: remember to explain the meaning of the colors (or add a legend to the slide)

TODO: Burt suggested adding the motivation for every item, to make clear who is requesting each feature.

We are postponing some feature drops (3.7.1 instead of the scheduled 3.6.2)
  • glexec: discussion about who is using it and whether we should keep supporting it. -> It seems no one is (Steve Timm confirmed that no one is using it in DUNE)
    • ACTION: Marco will follow up on this
Roadmap - High priority list (Black: Completed. Blue: in progress)
  • The migration to a single-user Factory brought some additional work. Marco Mascheroni has been improving the Factory tools in support of it
  • Dennis is working on token authentication: having a system that works without x509 certificates and supporting sites with SciTokens.
    • Token authentication for Glideins is almost complete. Support for sites will follow
    • Steve Timm suggested meeting with the responsible people at Fermilab to talk about integrating other token systems.
      • ACTION: Talk to Burt
  • Marco Mambelli is working on Singularity support. More and more VOs are using Singularity (same environment as at Fermilab, or running customized images). Done in collaboration with HTCondor; requested by OSG and CMS
    • HTCondor collaboration: hardening Singularity and expanding use cases.
    • Allow VO scripts. VO needs change faster than the GlideinWMS release cycle; this way VOs would be independent of it
  • Marco Mascheroni is working on CRIC, mainly for CMS (requested by them). It was postponed a little and we hope it will be in the next release
  • Improve modularity and code quality (especially of the Frontend). Part of this will be the migration to Python 3
Roadmap Overview - Graphic
  • Explained the graphic. Colors are assigned to developers according to their focus, but this does not mean a person will work only on that. All developers collaborate with each other
  • We are a bit behind the initial plan for token authentication, but work is still moving forward
  • HPC - We are already working with HPC at some sites, but at others the negotiation process is still ongoing.
    • Discussion between Burt and Andrew. GlideinWMS is waiting on HEPCloud to initiate talks with Argonne. Everyone involved should have a meeting. Andrew is working on this and suggested that Marco should be involved.
    • Discussion between Antonio and Marco about CMS HPC resources in Europe. Will the work on Theta (Argonne) benefit them?
      • Theta has no inbound or outbound connectivity from the worker nodes. The development will help with similar cases, but HPC resources are all one of a kind, so the problem may be different for another resource.
      • Does it depend on the HTCondor configuration to allow the communication? Part of the problem is still being defined and negotiated with the facility managers. It is not clear what they can provide us (edge nodes?)
      • Antonio pointed out that CMS, and he in particular, is very interested in HPC, especially because BSC (Spain) is also focused on HPC
Roadmap – other topics
  • Move main repository to GitHub
  • Move to Python 3
  • Modernize configuration and monitoring
    • Items in green are contributions from summer students. Burt asked the same question: what is the motivation for the summer student proposals? Answer: a combination of internal needs and requests from the OSG point of view; the team will continue the work where the students left off (for example, monitoring)
    • Edgar said that they (OSG) wanted to improve GlideinWMS monitoring as well and had some students contributing to it
    • Some of the features will come in the next releases (3.7)
    • Move to YAML
  • Other items, with lower priority
    • Deploy GlideinWMS in containers
    • Modernize configuration and move the documentation to Jekyll
Completed RC for release 3.6.1 (OSG 3.4 and OSG 3.5)
  • Just mentioned the notes in the slides. Some highlights
    • HTTPS support was requested by security
    • Stop considering held limits was requested by Factory Operators
    • The fact the factory is single user now brought some complications
    • Improved diagnostic logs
  • No questions
  • Planned release 3.6.2 - mid Dec
    • Burt asked about ssh_job. Marco is in touch with Greg to see if it is really working. HTCondor says that it is supposed to work but
      • Marco is still going back and forth with the HTCondor team through all use cases.
      • Burt highlighted that this is high priority from the management point of view. He said that the HTCondor team could give a better response. ACTION: Marco will receive Burt's support to manage this
    • Adding VO scripts that run before the job is invoked. Periodic scripts use startd_cron, which invokes them in the environment of the glidein. Marco will send Edgar the details, as they are working on the same thing. Edgar said that we could do it better than now. Marco agreed
    • Adding shell script checking to CI. Edgar asked what the status is
      • We currently check Python code (futurize, pylint, etc.)
      • We don’t have a shell checker (shell linting) yet. We have been testing one and will add it to the daily reports
    • Marco would like to complete the migration to GitHub by the end of the year.
      • Edgar is concerned about Jenkins supporting Docker.
      • Marco said dockerization has been added to the list. Edgar said that he will be done with the Frontend dockerization in a couple of weeks.
      • Possibility to check whether Jenkins supports containers -> Dennis said that our CI does not support containers well. Edgar insisted on moving to Travis.
Planned release 3.7
  • We will also include the work from the summer students. It is already done, but we wanted it in this version because we don’t want to interfere with the stable production version.
  • It will also be the first release after the health checker CI
Questions
  • Jeff mentioned the policy for job release (from held) discussed in the last meeting. He would like to know whether it will be considered for the next release.
    • The current behavior is to consider it in the pilot behavior. Now we would like the opposite: have a list of error codes that can be recovered from, and kill the job if its error code is not in the list. ACTION: Marco will send Jeff the 3.7 ticket details
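The proposed release policy can be sketched as below. This is only an illustration of the idea discussed, not the GlideinWMS implementation: the function name and the codes in the list are hypothetical placeholders, and the actual list of recoverable codes was not decided in the meeting.

```python
# Placeholder values only: the real set of recoverable hold codes
# still has to be agreed on (see the 3.7 ticket).
RECOVERABLE_HOLD_CODES = frozenset({6, 13})


def action_for_held_glidein(hold_code, recoverable=RECOVERABLE_HOLD_CODES):
    """Decide what to do with a held glidein job.

    Returns "release" if the hold code is in the recoverable list,
    otherwise "remove" (the job is killed instead of being retried).
    """
    return "release" if hold_code in recoverable else "remove"


if __name__ == "__main__":
    for code in (6, 99):
        print(code, action_for_held_glidein(code))
```

Keeping the recoverable codes in an explicit list means that any code not yet reviewed leads to removal by default, which matches the "kill the job if the error code does not match" wording above.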

Developers spotlight

  • Marco Mascheroni
    • Antonio asked whether pilot draining considers the future timestamp or not
      • ACTION: Marco will double-check the future timestamp handling and tell him about it. They may follow up tomorrow
    • Edgar asked to expand bullet number 4 and explain the logic
      • Marco explained. There are limits, such as held and total, that make the Factory stop submitting. There was a specific case where all the jobs went held when opportunistic nodes were reclaimed; the total limit was hit because of the held jobs, and we don’t want the Factory to stop submitting because of that.
      • This applies to all sites. If the held limit is hit, submission will stop. But the total limit will no longer count held jobs.
      • Discussion between Marco and Edgar. Jeff commented that this is the first step, because they have more ideas about it.
  • Lorena
    • Burt asked about HPC workflow project as he doesn’t understand. ACTION: Marco will send the proposal and explain to him
  • Dennis
    • Jeff asked about the meaning of "token authentication is per entry/factory". Dennis confirmed that it is correct: from the Frontend perspective, there will be different tokens for each entry.
    • The tokens come from the user collector (generated). Each entry will request a different token for the pilot to authenticate with the collector. It is confusing because currently the same credential is used for that and for authentication with the CE.
    • The pilot token will be fully automatic: no configuration and no work for Factory operators.
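As a minimal sketch of the held/total limit logic Marco Mascheroni described earlier in this section: the held limit still stops submission, while the total limit no longer counts held jobs. The function and parameter names are hypothetical and do not correspond to actual GlideinWMS configuration attributes.

```python
def can_submit(idle, running, held, max_held, max_total):
    """Return True if the Factory may keep submitting glideins to an entry.

    - Hitting the held limit still stops submission.
    - The total limit is checked against idle + running only,
      so held jobs no longer count toward it.
    """
    if held >= max_held:
        return False
    return idle + running < max_total


if __name__ == "__main__":
    # Many held jobs after opportunistic nodes were reclaimed:
    # submission continues because held jobs are not in the total.
    print(can_submit(idle=10, running=50, held=100, max_held=200, max_total=100))
```

With the earlier behavior, the same example would count 160 jobs against a total limit of 100 and stop submission, which is the case the change is meant to avoid.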

Final Questions/Comments

  • Antonio pointed out slides 8 and 9, related to the milestones. They show some activities that are not assigned (in white).
    • Bruno is ramping up
    • Other developers are fully scheduled for now
    • There will be group effort or someone will be reassigned

The next stakeholders meeting is in two months, on the 8th of January. ACTION: Ask about the CERN closure and whether we need to postpone by one week. Antonio will check.