Feature #23091

Updated by Marco Mambelli about 1 year ago

High-throughput computing workflows run thousands of jobs on a variety of resources: from commercial and on-prem clouds, to high-performance computing centers, to remote or local clusters. The goal of this project is to provide an additional communication channel to retrieve information from these different resources and increase the reliability of the infrastructure. This will be added to GlideinWMS, a workflow manager leveraging the HTCondor software framework to provision resources for scientific computing. It will benefit all the collaborations using GlideinWMS, including the LHC experiment CMS, all the FIFE experiments at Fermilab, the HEPCloud portal and Open Science Grid.
GlideinWMS project:

This includes the following activities:
# Getting familiar with distributed computing and GlideinWMS
# Survey of the state of the art and evaluation of remote application logging solutions (frameworks, libraries, formats)
# Critical review of the current format of the Glidein stdout/err
# Design a format for an additional logging stream that can be used by glidein_startup and other scripts within the Glidein (text, files forwarding)
# Build a simple system duplicating and transmitting stdout and stderr from the Glideins
# Design a system for many-to-many Glidein logging
** Multiple Glideins sending messages, multiple subscribers may be interested
** Globally Unique Glidein ID (to identify updates of the same files)
** Useful metadata (e.g. factory/entry_set/entry, frontend/group) to identify who could be interested
** Security consideration: authenticated messages, ...
# Development and integration related to distributed computing software for Grids, Clouds, and Supercomputers
# Testing on High-Performance Computers and clouds
# Integration in production
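
As an illustration of the many-to-many logging items above (unique Glidein ID, routing metadata, authenticated messages), here is a minimal Python sketch of one possible message envelope. It is not an existing GlideinWMS format: the field names, the use of a random UUID as the Glidein ID, and the shared-key HMAC-SHA256 signature are all assumptions made for the example. A real design might rely on tokens or certificates instead.

```python
import hashlib
import hmac
import json
import time
import uuid


def new_glidein_id() -> str:
    """Return a globally unique ID (random UUID4; an assumed scheme)."""
    return uuid.uuid4().hex


def make_message(key: bytes, glidein_id: str, seq: int, stream: str,
                 text: str, metadata: dict) -> str:
    """Serialize one log record and attach an HMAC-SHA256 signature."""
    body = {
        "glidein_id": glidein_id,   # identifies updates from the same Glidein
        "seq": seq,                 # lets subscribers order and de-duplicate
        "stream": stream,           # e.g. "stdout" or "stderr"
        "timestamp": time.time(),
        "metadata": metadata,       # hypothetical keys: factory, entry, frontend, group
        "text": text,
    }
    # Canonical serialization so sender and receiver sign the same bytes.
    payload = json.dumps(body, sort_keys=True, separators=(",", ":"))
    signature = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return json.dumps({"payload": payload, "signature": signature})


def verify_message(key: bytes, message: str) -> dict:
    """Check the signature and return the decoded record, or raise ValueError."""
    envelope = json.loads(message)
    expected = hmac.new(key, envelope["payload"].encode("utf-8"),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["signature"]):
        raise ValueError("signature mismatch")
    return json.loads(envelope["payload"])
```

A subscriber would call @verify_message@ on each received message and then filter on the @metadata@ field to decide whether the record is of interest.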

Some shortfalls of the current Glidein logging system:
* Reports only stdout and stderr
* Missing stdout/err for some Glideins (especially killed ones)
* Information only at the end (flush)
* Not reporting to multiple listeners
* Confusing or missing information from indirect and multi-job submissions
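
Two of the shortfalls above (output delivered only at the end, and no support for multiple listeners) can be illustrated with a small "tee" sketch in Python. This is only an assumption about how duplication could work, not the GlideinWMS implementation: the @TeeStream@ class is hypothetical, and in-memory buffers stand in for remote listeners.

```python
import io
import sys


class TeeStream:
    """File-like object that forwards every write to several sinks."""

    def __init__(self, *sinks):
        self.sinks = sinks

    def write(self, text):
        for sink in self.sinks:
            sink.write(text)
            # Flush right away so that little output is lost
            # if the process is killed before a final flush.
            sink.flush()
        return len(text)

    def flush(self):
        for sink in self.sinks:
            sink.flush()


if __name__ == "__main__":
    # Stand-ins for two remote listeners (assumption for the example).
    listener_a = io.StringIO()
    listener_b = io.StringIO()
    tee = TeeStream(sys.stdout, listener_a, listener_b)
    print("glidein started", file=tee)
```

Each sink receives every line as soon as it is written, rather than at the end of the job.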

Consider also providing a critique of the current GlideinWMS software and suggestions to improve it, e.g. adding unit tests, linting, using specific libraries, ... Some of this is mentioned in #20901