ST190821 Glidein custom start and Log publishing

Discussion during the August 21 GWMS weekly meeting: Weekly_Meeting_Notes

Attending: Marco Mambelli, Marco Mascheroni, Dennis Box, Lorena Lobato Pardavila, Leonardo Lai, Jeff Dost, Joe Boyd, James Letts, Frank Wuerthwein

We discussed two topics (mainly the first one):
  1. Customizable start expressions and mechanisms to affect job matching: the start expression normally comes from the Frontend; what else is desirable and sound? Factory attributes, site or node attributes (environment variables, files at the node, ...)?
  2. Would it be OK to publish the Glidein logs? Should access be restricted?

Customizable start expressions and mechanisms to affect jobs matching

A bit about the current implementation in GlideinWMS:
  • Why: add custom expression to have some extra checks
  • How: Enabled on the Frontend. Special custom start
  • The same could be done by adding one new Frontend group that sends only to some sites (one or a few). Something similar is done for High IO and GPU, but it would increase the need for special groups and make the number of groups proliferate.
  • Using GLIDEIN_CUSTOMIZE_START
Reasons this could be desired:
  • All users at the site
  • Site requiring something
  • Users want special privileges at a site

In short (summary): customizable start expressions are used by CMS via GLIDEIN_Custom_Start. Sites and the site description in CVMFS can affect job matching at sites. This mechanism is hidden from Factories and Frontends and can bring inconsistencies. Resource characteristics can be better expressed via queues and attributes. On the other hand, this is a faster mechanism to solve emergencies (network problems, job definition errors, ...). Workflow management may not be able to deal with the increasingly specialized resources, hence the need to schedule on a specific resource instead of specifying the job requirements. There is worry about the proliferation of Frontend groups and Factory entries.

Discussion
  • If it is to distinguish special resources and applies to all users using them, then a CE per distinct sub-cluster would push this from the FE configuration to the Factory configuration (new Site, site selection). It would be easily visible. CMS needs to know what the sites are and what's running there. The custom-start is hiding things.
  • Some of these pilots are pilots in a vacuum. In this case, they could have a special startd expression, no need of all this mechanism (they are not requested by Frontend to Factory)
  • There is no need for new CEs. Behind the CE with the resources, a queue or a condor attribute would be OK, anything that the factory can use to steer/distinguish the resources.
  • All this mechanism is behind the back, not advertised. If you have a special resource then traditionally this is presented as a queue (an entry in GWMS Factory)
  • If you want to have something more squishy, policy expression, special privileges that a user has at a site (higher priority), then a mechanism like this may be needed and used. Something like:
    • I need to know who the special people are
    • then I can act
  • HEPCloud runs into the same problem? Not really: all policies consider the whole resource and the information coming from it.
  • If the sites change the policy there could be inconsistency (according to what's known to the Frontend and Factory, jobs can run there, so glideins are requested, but they get wasted because jobs no longer match after the site policy changes)
  • CalTech raised this requirement not to have more CEs: it is easy to see a single CE, there is always a queue or attribute to distinguish different resources
  • CMS uses the subsite concept, those attributes are used for job routing
    • subsite name is set in the validation script, reading the site conf (in CVMFS)
    • To have a different entry we need something that the CE parses
    • This still could be published in the schedd attributes (validation scripts are running before HTCondor is started in the glidein)
  • Questions for CMS and the Factory operators
    • Do we need multiple queues?
    • Do we need to duplicate groups to restrict some jobs to run only on some sites?
    • Is it OK to put VO specific attributes in the Factory configuration?
  • Having the site decide may bring:
    • Proliferation of specialization: resources are too diverse. How do you expose that diversity, and how do you give people the possibility to match that diversity to the job diversity?
  • Explosion of groups or entries may be the wrong solution
  • Sites can use this to police jobs
    • A job does not understand its requirements and is not working at a site, so the site bans that job (or group of jobs).
    • This may be a problem in the job definitions, which do not account for the diverse resources available
  • Is this something that the sites want to change very dynamically? A lot is network-driven? In this case, it may be needed
  • It may be a CMS Operational problem: you want to bar something that you don't see arriving to make the system stable
    • Work behind the factory back because the turnaround time is faster
    • Production: turn off match-making for a specific class of jobs on the fly, without central control.
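
The inconsistency raised in the discussion (glideins requested because the Frontend and Factory believe jobs can run, then wasted because a site-side policy rejects those jobs) can be illustrated with a small model. This is only a sketch: real matching is done by HTCondor ClassAd expressions, and every name below (`frontend_matches`, `glidein_starts`, `wasted_requests`, the job and entry fields, the site names) is hypothetical.

```python
# Illustrative model only: real matching is done by HTCondor ClassAd
# expressions, not by Python functions. All names here are hypothetical.

def frontend_matches(job, entry):
    """What the Frontend and Factory know: the job lists this entry's site."""
    return entry["site"] in job["desired_sites"]


def glidein_starts(job, entry, site_policy):
    """What the running glidein evaluates: the Frontend condition AND the
    site-side custom condition, which Frontend and Factory cannot see."""
    return frontend_matches(job, entry) and site_policy(job)


def wasted_requests(jobs, entry, site_policy):
    """IDs of jobs that trigger a glidein request but can never start there."""
    return [
        job["id"]
        for job in jobs
        if frontend_matches(job, entry)
        and not glidein_starts(job, entry, site_policy)
    ]
```

With a site policy that rejects one class of jobs, every job of that class that the Frontend still counts as runnable at the site shows up in `wasted_requests`; these are the glideins that would start and sit idle.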

We agreed that we should continue this discussion, including Antonio, on September 4th.

Discussion with Antonio Pérez-Calero and GWMS developers on 9/4:
  • Sites may want to join in a very fluid way
  • CMS Emphasis is not on defining the resources but on tracking (monitoring what happened)
  • Would there be concerns if the start expression customization blocks job matching, creating waste and decreasing efficiency?
  • This is not a concern for CMS:
    • Customizable glideins are beyond the pledged resources. It is not so important for central CMS operations.
    • These are accounted separately. Customized pilots are running in a sub-site, are running empty. Efficiency is not considered, like the HLT that has extra uses.
    • Any pledged resource cannot have customizable resources. The efficiency is counted only for the pledged resources.
  • CMS will give this knob only to more expert people
  • It is true that it breaks the provisioning paradigm and the jobs are not well defined, but the potential benefits are there
  • Questions:
    • Do we want those resources to have a specific entry in the Factory?
    • Do we want a specific group in the Frontend?
  • At the moment there is an attribute that enables custom start expressions:
    • it can be in the Factory
    • it can be in the Frontend
    • at the moment it is used only in the Frontend by CMS and customized via a script
      • Glidein_Custom_Start is always there and true
      • Glidein_customize_start is customized in the script when the custom glidein is enabled by the frontend (group)
  • The current mechanism satisfies the need of CMS
  • There are several groups that are copy-paste of others where only some conditions are changed. Before a formal request, CMS will check its configuration.
  • If the script is VO specific, could that script look at a list of sites? Why not an attribute with the list of sites?
  • CMS will consider this change, it would avoid multiple groups
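
The idea of a single attribute carrying the list of sites, instead of one copy-pasted group per site, could look roughly like the following. This is a minimal sketch under stated assumptions: the function name `custom_start_enabled`, the comma-separated format of the attribute, and the site names in the usage note are all hypothetical and are not part of GlideinWMS.

```python
def custom_start_enabled(glidein_site, allowed_sites_attr):
    """Return True if the glidein's site appears in a comma-separated list.

    Both arguments are plain strings. The comma-separated format is an
    assumption made for this sketch, not an existing GlideinWMS convention.
    """
    allowed = {name.strip() for name in allowed_sites_attr.split(",") if name.strip()}
    return glidein_site in allowed
```

A VO-specific script could call `custom_start_enabled("SiteA", "SiteA, SiteB")` once per glidein and enable the customization only when it returns True, so adding or removing a site means editing one attribute rather than duplicating a group.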

Publishing Glidein Logs

Discussion
  • James - it would be extremely useful, especially for ITB. There are GDPR requirements for data coming from servers in the EU (PII such as username, DN, and IP should be scrubbed).
  • Jeff Dost - VOs can already rsync from the Factory. There was a new and improved script for moving data to GROCC. It will be on the new ITB at UCSD; we don't want it on a public server, only some chosen viewers in the USA. There are stricter restrictions on exporting.
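
The scrubbing of PII (username, DN, IP) mentioned above could be sketched as a simple line filter. This is not an existing GlideinWMS tool: the log format is assumed (a `user=<name>` field, slash-separated DNs inside quotes, dotted IPv4 addresses), and the names `scrub_line`, `DN_RE`, `IPV4_RE`, and `USER_RE` are hypothetical. Real logs would need patterns tuned to their actual layout, plus IPv6 handling.

```python
import re

# Assumed formats, for illustration only.
# DN: one or more /KEY=value components, ending at a quote, slash, or newline.
DN_RE = re.compile(r'(?:/[A-Za-z]+=[^/"\n]+)+')
# IPv4: four dot-separated groups of 1 to 3 digits (no range validation).
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
# Username: assumed to appear as a user=<name> key-value field.
USER_RE = re.compile(r"\buser=\S+")


def scrub_line(line):
    """Replace DNs, IPv4 addresses, and user= fields with fixed placeholders."""
    line = DN_RE.sub("<dn>", line)
    line = IPV4_RE.sub("<ip>", line)
    line = USER_RE.sub("user=<user>", line)
    return line
```

Applying `scrub_line` to every line before the logs leave the Factory keeps the structure of each entry readable while removing the three kinds of identifiers named in the discussion.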

Summary

  • Customizable start expressions are used by CMS via GLIDEIN_Custom_Start. Sites and the site description in CVMFS can affect job matching at sites. This mechanism is hidden from Factories and Frontends and can bring inconsistencies. Resource characteristics can be better expressed via queues and attributes. On the other hand, this is a faster mechanism to solve emergencies (network problems, job definition errors, ...). Workflow management may not be able to deal with the increasingly specialized resources, hence the need to schedule on a specific resource instead of specifying the job requirements. There is worry about the proliferation of Frontend groups and Factory entries. CMS is aware that the current mechanism may have inconveniences, but it fits its needs. Using a variable with the list of sites that are OK for CMS could reduce the proliferation of groups.
  • Publishing Glidein Logs would be extremely useful, especially for ITB. There are GDPR requirements for data coming from servers in the EU (PII such as username, DN, and IP should be scrubbed). VOs can already rsync from the Factory, especially from the new ITB at UCSD. It is not public: requests are evaluated individually and there are restrictions on exporting.