GWMS Stakeholders Meeting July-10-2019

Slides: https://indico.fnal.gov/event/17328/


Present

Margaret Votava - Project Sponsor/SCS Quadrant Head
Marco Mambelli - Project/Technical Lead
Lorena Lobato - Project Member
Marco Mascheroni - Project Member/OSG Factory Operations
Burt Holzman - HEPCloud Project Sponsor
Steve Timm - HEPCloud Technical Advisor
Tony Tiradani - HEPCloud Technical Lead
Antonio Perez-Calero Yzquierdo - CMS
James Letts - CMS
Brian Lin - OSG Software
Jeff Dost - OSG Factory Operations
Marian Zvada - OSG Operations
Ken Herner - FIFE
Edgar Hernandez - OSG/GLOW


Communication

Parag has left; Marco is now the project leader.

  • Seeking stakeholder input for future GlideinWMS releases: dropping tar file distributions, requiring the HTCondor Python bindings, and making shared ports the default for collectors
  • Asked for feedback related to OSG and software for future versions, features, and old versions.
  • Jeff communicated that they plan to upgrade the OSG factory to 3.4.5
  • Brian Lin commented that CREAM-CE support ends at the end of November/December 2020. He therefore suggested including HTCondor 8.8 in OSG 3.5 (OSG 3.4 has HTCondor 8.6 and GlideinWMS 3.4).
    ACTION: have an offline discussion about that

Releases and Roadmap summary

Releases
  • GlideinWMS v3.4.5-2 available in OSG production (OSG 3.4.31)
  • GlideinWMS v3.5 in OSG upcoming testing – big change, single-user factory
  • GlideinWMS v3.4.6 (production) expected end of July – with several fixes
  • GlideinWMS v3.5.1 (development/upcoming) expected by mid-August
Roadmap summary (more details in the Technical discussion)
  • Dropping TAR files distribution (sent email in June, planned for 3.5.2)
  • Increase use of the HTCondor Python bindings; they will become a requirement (sent email in June, planned for 3.5.2)
  • Make shared ports the default for collectors (3.5.2)
  • Drop support for glexec (still used at some European sites, per Steve Timm; planned for 3.6)
  • Drop separate User collector ports (support only shared port, planned for 3.6)
  • Move away from Python 2 to Python 3 (will be 3.7, no set date). Have a Python 3 version in OSG upcoming by late summer 2019
  • Decision Engine support started in 3.4.4 – Factory supporting multiple Frontend-like services
  • Keep collaborating with HTCondor: use of tokens (security without x509 certificates), support for new HPC sites with stricter policies, blackhole detection, and Singularity invocation
  • Automatic Factory configuration generation via CRIC
  • Monitoring Modernization (Summer project)
  • Move documentation to Jekyll (Summer project)
  • Deploy GlideinWMS in containers

Technical discussion

Marco asked the stakeholders for feedback on the planned changes sent via email and listed in the summary:
  • Increased use of the HTCondor Python bindings, among other operations. He asked for any further feedback, but there was no response.
    ACTION: he will send another email about the plan if he does not receive any complaints
  • No one uses the TAR file distribution.
    • Edgar agreed and asked about using a git checkout in place of the tar installation. We explained that the TAR installation guarantees that all the Python files and paths are correct, so its removal might break that workflow (for example, complaints about python_path or invalid paths). We would have to change some installation scripts and handle the libraries that are managed by RPM; this will be dropped as well. Edgar confirmed that this is perfectly fine.
  • Shared port. Marco will start moving toward making shared port the default.
    • Steve asked whether this configuration is going to be for secondary collectors only. Marco explained how the address is used with the sinful strings. He stated that in the future we’ll drop the use of separate ports, as we have received input that shared ports are more reliable and perform much better.
      ACTION: he will send an email reminder about this, provided there are no complaints
  • Single-user Factory. This is released in GlideinWMS 3.5; apart from the removal of GRAM GT2 and GT5, the Factory now runs as a single user. Lorena and Marco have provided migration scripts (including the HTCondor changes) as well as a backup script to revert the changes in case the administrator wants to go back.
  • glexec.
    • Marco asked whether some European sites are still using glexec, as we would like to discontinue support for it. Last time Steve commented that some experiments, like DUNE, are still forced to use glexec at some sites in Europe and are working with the admins to stop using it.
    • There was a discussion about privacy, retaining the pilot logs, and how readable they can be. Jeff talked about 1-year retention and whether they would keep them that long.
      HTCondor-CE tracks the actual user who is running the job. ACTION: We’ll continue the discussion offline, but Marco wanted to point out that publishing the logs like that might draw complaints.
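
Illustrative sketch (not from the meeting): the shared-port item above turns on how a collector address is written as a sinful string. With separate ports, each collector is reached on its own port. With shared port, several daemons share one port and the target daemon is named in the address parameters. The Python below is a minimal parser for strings of the form <host:port?key=value&...>. The addresses and the socket name collector1 are made-up examples, and the parser is a simplification written for these notes, not HTCondor code.

```python
from urllib.parse import parse_qs


def parse_sinful(sinful):
    """Split a sinful string '<host:port?key=value&...>' into its parts.

    Simplified, hypothetical helper: it only handles the basic form shown here.
    """
    if not (sinful.startswith("<") and sinful.endswith(">")):
        raise ValueError("not a sinful string: %r" % sinful)
    body = sinful[1:-1]
    # Everything after '?' is a list of key=value parameters
    address, _, query = body.partition("?")
    # Split on the last ':' so bracketed IPv6 hosts keep their colons
    host, _, port = address.rpartition(":")
    params = {k: v[0] for k, v in parse_qs(query, keep_blank_values=True).items()}
    return host, int(port), params


# Separate port: the collector is identified by its own port (example values)
print(parse_sinful("<192.0.2.10:9620>"))
# Shared port: one common port, the daemon is selected by the 'sock' parameter
print(parse_sinful("<192.0.2.10:9618?sock=collector1>"))
```

In the second form only one port has to be opened on the collector host, which is one practical reason a site may prefer shared port.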
Other questions from recent releases:
  • Brian Lin confirmed that the current OSG release contains BLAHP fixes that were tested in collaboration with GlideinWMS.
    • There were issues with jobs that kept running after the glidein and condor were killed. The problem was seen in CMS jobs on PBS clusters. A sleep in glidein_startup.sh that delayed signals and the BLAHP wrapper not propagating signals have both been fixed. Still, PBS sends SIGQUIT and SIGKILL one right after the other, which does not give the glidein time to complete the cleanup and send back some files. A discussion with site administrators about possible PBS configurations is still ongoing, but it seems to be a bug in PBS.
Other items in the upcoming releases:
  • Singularity development was moved to 3.4.6
    • Invoke Singularity via HTCondor. HTCondor now allows custom parameters that make this possible. It will also allow condor_ssh_to_job if unprivileged Singularity is used. Marco is still troubleshooting the issue with the HTCondor team (G. Thain).
  • Blackhole prevention
    • Lorena gave Edgar an overview of the system she is implementing: getting stats from the STARTD (a result of the collaboration with the HTCondor team) and comparing them with the parameters set by the Frontend admin in the limits section of the configuration. She confirmed to Edgar that the architecture will rely on the pilots (and not on the FE, as he asked), and the pilots identified as blackholes will advertise that back.
      ACTION: we can set up an offline meeting where Lorena can provide more information and point him to the HTCondor ticket. Brian put it in the chat: https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=6698
  • Monitoring: retirement of the GlideinWMS monitoring pages and move to a better solution (e.g. Grafana).
    • Monitoring pages are moving to HTTPS. The stage will remain under HTTP but the rest will be HTTPS, to avoid complaints, since they are hosted here at Fermilab.
    • Requested the possibility of connecting our summer intern with the UCSD one, as they have a prototype solution for the factory sending info to OSG GRACC.
      ACTION: offline meeting to work with the students on the monitoring
  • Python 3 migration:
    • Marco asked to be kept up to date on the plans and timelines for the software versions in OSG (especially, for example, the move to RHEL7 and the drop of RHEL6). Brian commented that they plan to drop RHEL6 support in OSG 3.5.
    • Edgar commented that LIGO may not update while taking data and may stay on RHEL6 for a long time.
  • The new UCSD DN includes commas (condor JSL authentication), which are a list separator in HTCondor.
    • This causes problems in the classads and configurations, because the comma in GSI_DAEMON_NAME is not escaped. The master cannot connect back to the collector. This affects only the master in the Glidein, as it cannot advertise to the collector of the VO. Edgar commented that this is not only the case for his university; it affects several others.
    • This is not affecting normal operations and there is a ticket open, #22779.
  • Multi-node submission:
    • Marco is working on spawning jobs submitted on multiple nodes. Dirk is planning to use this on HPC. Will be in 3.4.6.
    • Some capability was already there. This will add support for MPI invocation or other system launchers, directly from the Glidein invocation (no additional wrapper script). Separately, we are working on reporting stdout/err independently so that they are not all intermixed in a single file (will be available later).
    • Edgar commented that he implemented something similar a long time ago, with wrapper scripts. With this feature, you can set parameters in the FE and Factory and the glidein will take care of everything, without a custom BLAHP script.
      ACTION: Marco asked whether Edgar can provide a test machine where he can test (currently Marco relies on other people testing it). Offline discussion
  • Marco Mascheroni described the possibility of customizing the pilot start expression, to allow the FE validation scripts of CMS and other VOs to modify the start expression. A new attribute (GLIDEIN_Custom_Start) was added; it has already been tested in the CMS ITB factory and is ready for production. Will be in 3.4.6.
    • Jeff pointed out that it should not only be a new attribute added to the entry in the Factory configuration. He suggested making this option available to Frontend admins as well, set in the Frontend configuration, for example to set the attributes by group, as Antonio was suggesting. Marco Mambelli asked whether it would be only for some specific Frontends, not for all of them. We need to discuss this and gather more input.
      ACTION: We should have a conversation about this at the next GlideinWMS weekly meeting (17th July)
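
Illustrative sketch (not from the meeting): the UCSD DN item above comes down to a value containing a comma being placed in a comma-separated list. The Python below shows the effect with a naive split, and one possible remedy that escapes the comma with a backslash. The DNs are hypothetical, and the backslash convention is an assumption made for this example; it is not a statement of how HTCondor parses or escapes GSI_DAEMON_NAME.

```python
import re

# Hypothetical DNs; the first has a comma inside the organization name
DN = "/DC=org/DC=incommon/C=US/O=University of Example, San Diego/CN=host.example.edu"
OTHER = "/DC=org/CN=other.example.edu"


def naive_split(value):
    """Split a comma-separated list with no awareness of escaping."""
    return [item.strip() for item in value.split(",")]


def escape_commas(dn):
    """Backslash-escape commas so the DN stays a single list item (assumed convention)."""
    return dn.replace(",", "\\,")


def split_unescaped(value):
    """Split only on commas that are not preceded by a backslash, then unescape."""
    parts = re.split(r"(?<!\\),", value)
    return [p.strip().replace("\\,", ",") for p in parts]


# Two DNs are read as three items, so neither piece of the first DN matches
print(naive_split(DN + "," + OTHER))
# With escaping, the two DNs come back intact
print(split_unescaped(escape_commas(DN) + "," + OTHER))
```

The naive case is the failure mode described in the item: a DN cut in two never matches the real certificate subject, so the connection is refused.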
Summer intern projects
  • Kiana: Building the new GlideinWMS website by migrating the documentation to Jekyll.
    • Edgar asked whether we’re using GitHub markup (yes). It will be similar to the OSG documentation.
  • Javier: Python classes + glideinwms; he will improve the testing abilities of the glideins.
    • He will work with Marco Mascheroni and will go through the manual submission.
    • Burt asked about doing this with Docker and Kubernetes. He’s not sure this is a good fit for a high school student, though.
      ACTION: ask Krista whether we can do it and whether we have places at Fermilab to host it.
  • Thomas: He was already here last summer and will continue the work on glidein logs to make them more accessible and readable.
    • Jeff commented that it would be nice to work together with their UCSD students, as they are doing something similar.
      ACTION: Send an email to follow up

Round table

  • Antonio reported a problem with glideins that stop accepting jobs, related to the timelines.
    • Marco replied that this ticket has been assigned to Mascheroni.
    • Basically, we’ll have two timelines: the presence of the file that makes glideins stop accepting new jobs, and the time interval of the deadline.
  • Edgar asked how long it took Javier to set up the GlideinWMS environment. Javier answered about 1 day, which started the discussion.
    • Burt proposed providing a configured FE and Factory in containers to ease the installation.
    • Marco commented that most installation and configuration problems concern HTCondor authentication/authorization. This will change a lot once we have HTCondor tokens (still a couple of months away), and we can then take that into consideration.
    • Edgar is doing this for the FE for BNL; the hard part is maintaining the FE state, that is, for the FE to keep its updates of the glideins, queues, directories, etc.
      ACTION: Have an offline discussion about this

ACTION ITEMS

Summary of action items from above:

Marco Mambelli:
  • Send a reminder about the agreed plans for the HTCondor Python bindings requirement, the drop of tar distributions, and shared ports becoming the default
  • Start the following offline discussions:
    • CREAM support in HTCondor, OSG and GlideinWMS (B.Lin and CMS)
    • GlideinWMS in containers: deployment and state (Edgar, Burt, Krista about possible resources)
    • Publishing the Glidein Logs (Jeff, CMS)
  • Ask Edgar about access to resources to test MPI jobs
  • Will start the collaboration between Edgar and Thomas, to coordinate the monitoring effort
  • Will have a special topic discussion at the GlideinWMS meeting about the GLIDEIN_Custom_Start: should it be also in Factory, Frontend, both? (Jeff, Antonio, James, Marco Mascheroni)
Lorena:
  • Will discuss the blackhole mechanism with Edgar