Stakeholders Meeting September-11-2019

Indico: https://indico.fnal.gov/event/17330/
(slides)

Here are the notes, skipping what is clear from the slides.


Present

Margaret Votava - Project Sponsor/SCS Quadrant Head
Marco Mambelli - Project Lead
Lorena Lobato - Project Member
Marco Mascheroni - Project Member/OSG Factory Operations
Antonio Perez-Calero Yzquierdo - CMS
Jeff Dost - OSG Factory Operations
Krista - HEPCloud Technical Lead
Tony Tiradani - HEPCloud Technical Lead
Thomas - On-call collaborator
Andrew Norman -
Marian Zvada - OSG Operations
Ken Herner - FIFE
Edgar Hernandez - OSG/GLOW
Joe Boyd -
Burt Holzman - HEPCloud Project Sponsor
Steve Timm - HEPCloud Technical Advisor


Updates

Release 3.4.6 was released at the beginning of August
Release 3.5.1 is being released today
Reminder: 3.4.6 is the last version supporting GT2/GT5 and also glexec
4 people are working on the project, plus the on-call collaborator and one summer student


Communications

The Singularity problem was due to testing only internally. As soon as we shared it with the sites there were a lot of improvements. A separate problem was on Marco's side: he shared a patch via email instead of publishing it.
Communication improvements coming out of the stakeholders meeting: especially for the effort related to HEPCloud, we will have a person from HEPCloud as a contact to stay in touch.


Action items

Started discussions about CREAM support in HTCondor, OSG and GlideinWMS
Started discussions, especially with Edgar, about GlideinWMS in containers: delayed but in progress.
Communication effort with Edgar and Thomas to coordinate the monitoring
Publishing Glidein logs is still under discussion


Releases

  • GlideinWMS v3.4.6 released August 8, in OSG testing, eligible for production
  • GlideinWMS v3.5.1 released September 18
  • 3.4.7 (in OSG 3.4) will eventually be merged into 3.5.2. 3.4.6 is in upcoming
  • Roadmap: More details in the Technical part
Completed releases
  • 3.4.6: just went through the notes in the slides
  • 3.5.1: also went through the notes in the slides
    • It is now published whether Singularity is privileged or unprivileged.
    • Streamlined and documented release testing
Future releases
  • 3.5.2: went through the notes
    • Fix Factory monitoring when interacting with Decision Engine for HEPCloud
    • Factory and Frontend monitoring under https
    • Tarball installation is no longer supported
  • 3.6 (OSG upcoming): focused on HTCondor token authentication for glideins
  • 3.7: focused on Python 3

Any questions about the releases so far? -> NO
Brief reminders about features that will be dropped before going through the roadmap


Technical

Roadmap - High priority
  • Token authentication - having a system that works without x509 certificates and supports sites with SciTokens.
  • The steps are the points from the collaboration with HTCondor for GlideinWMS 3.6
  • The last step will let us install the FE and Factory without worrying about authentication
  • Singularity support - More and more VOs are using Singularity (same environment as the Fermilab WNs, or running customized images).
    • Working on improving support
    • Testing more sites with different configurations and environments
    • Troubleshooting
    • Adding new features used by VOs. Earlier, Singularity had been used with a base image on a few sites. Running different images, or running on many more sites, can cause a lot of problems
  • CRIC mainly for CMS
  • Simplify and modernize the code, and broaden and streamline testing, to speed up development and releases
Roadmap - Other topics
  • Move to Python 3
  • Monitoring modernization - you will hear more from Thomas and Leo
  • Collaborate with HTCondor team to support new HPC sites with stricter policies
  • Deploy GlideinWMS in containers
  • Modernize to YAML
  • Re-evaluate upgrade/reconfig mechanisms
  • Move the documentation to Jekyll - the summer student didn’t complete the task
Any questions?
  • Burt asked how we set the priorities: for HEPCloud the top priority is new HPC sites with stricter policies. Andrew agreed: Python 3 is a prominent feature too, but the migration to HPC needs high priority. He sees that operations have been the priority so far, but containers and HPC should be higher.
  • Edgar got a student (hopefully for the rest of the year) to work on containers - the same one who was doing the monitoring work for him earlier
  • Burt is surprised that there is not much push for it from the development side. Edgar said that the FE is still lightweight (it could stay that way for two years), so it is not urgent for them. They are asking where the push for tokens is coming from (OSG).
    • Margaret said that she understands that GlideinWMS has to be one of the early adopters of tokens.
  • Jeff commented that the high-priority roadmap is missing an item about being more careful with changes that can break the connection between FE and Factory. Is there any discussion going on about this?
    • Marco commented that there is: we have actually started to create a testing matrix and roadmap. [1] ACTION
    • Related to Singularity: before 3.5, people here were installing and testing everything in Fermicloud, and only later did it start to be used more often in OSG and elsewhere.
    • About the patch: it was a mistake to send it initially via email (this opened the possibility of saving the file with CR+LF line endings and causing problems). This will not happen again: if future hotfixes are needed, they will be available as downloads in the files section
    • Compatibility issue related to the boolean-string comparison.
      • Jeff is concerned about which criteria are used to decide whether a release is minor or major
      • [1] Burt asked how we make sure that our quality control process is effective. ACTION: send the testing matrix and the related wiki information to the stakeholders in order to get feedback.
    • Marco commented that we created an ITB Frontend and a Factory with the same configuration as the ITB/Production ones.
      • Edgar asked: can you get a GPU CE to add to the Fermicloud FE? Marco said that so far there are only outside sites (for the moment Nebraska). Burt told Marco that there are several GPUs at Fermilab too (ACTION). Tony said there is currently no entry connected to the sites with GPUs (no CE and no Factory entry).
  • When condor_ssh_to_job is available, troubleshooting will also be more efficient
  • [2] More discussion related to monitoring
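As an illustration of the CR+LF problem raised in the patch discussion above, here is a minimal sketch of a check that could be run on a patch file before applying it. The function names (has_crlf, to_lf, check_patch) are hypothetical and are not part of GlideinWMS; the sketch only assumes that the patch is an ordinary file that should have LF line endings.

```python
from pathlib import Path


def has_crlf(data: bytes) -> bool:
    """Return True if the raw bytes contain at least one CR+LF line ending."""
    return b"\r\n" in data


def to_lf(data: bytes) -> bytes:
    """Convert CR+LF line endings to LF, leaving all other bytes untouched."""
    return data.replace(b"\r\n", b"\n")


def check_patch(path: str, fix: bool = False) -> bool:
    """Check a patch file for CR+LF line endings.

    Returns True if the file is clean (or was cleaned because fix=True),
    False if CR+LF endings were found and left in place.
    """
    patch_file = Path(path)
    # Read as bytes so that Python does not silently translate line endings.
    data = patch_file.read_bytes()
    if not has_crlf(data):
        return True
    if fix:
        patch_file.write_bytes(to_lf(data))
        return True
    return False
```

A file saved from an email client can be checked with check_patch("hotfix.patch") (hypothetical file name) and, if needed, normalized with fix=True before it is applied.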
Summer Interns projects
  • Thomas: he was already here last summer and will continue the work on Glidein logs to make them more accessible and readable. Jeff commented that it would be nice to have the opportunity to work together with their USD students, as they are doing something similar.
  • Leo: Summer student from Italy working on improving the Glidein

Developers spotlight

  • Thomas monitoring [2]
    • They will ask who is using this and who is interested. Edgar said everyone wants access to the old logs for better analysis
      • Edgar and Joe commented that these tools raise a lot of security problems. Edgar said that he already requested a testing host, at least for a proof of concept in the USDC cluster, but he has no host yet
      • Antonio asked whether this can run in the Factory. Edgar said yes, but he doesn’t want that because Factory operators are already under a lot of pressure controlling services.
      • Antonio said, for example, that the CERN ITB Frontend could be a good candidate