Feature #23091

Reliable, flexible and secure logging system for distributed workflows

Added by Marco Mambelli about 1 month ago. Updated 5 days ago.

Status: New
Priority: Normal
Assignee: -
Category: -
Target version: -
Start date: 09/12/2019
Due date: -
% Done: 0%
Estimated time: (Total: 0.00 h)
Stakeholders: -
Duration: -

Description

High throughput computing workflows run thousands of jobs on a variety of different resources: from commercial and on-prem clouds, to high performance computing centers, to remote or local clusters. The goal of this project is to provide an additional communication channel to retrieve information from these different resources and increase the reliability of the infrastructure. This will be added to GlideinWMS, a workflow manager leveraging the HTCondor software framework to provision resources for scientific computing. It will benefit all the collaborations using GlideinWMS, including the LHC experiment CMS, all the FIFE experiments at Fermilab, the HEPCloud portal and Open Science Grid.
GlideinWMS project: https://tinyurl.com/gwmsprj-logging-pdf

This includes the following activities:
  1. Getting familiar with distributed computing and GlideinWMS
  2. Survey of the state of the art and evaluation of remote application logging solutions (frameworks, libraries, formats)
  3. Critical review of the current format of the Glidein stdout/err
  4. Design a format for an additional logging stream that can be used by glidein_startup and other scripts within the Glidein (text, files forwarding)
  5. Build a simple system duplicating and transmitting stdout and stderr from the Glideins
  6. Design a system for many-to-many Glidein logging
    • Multiple Glideins send messages; multiple subscribers may be interested
    • Globally Unique Glidein ID (to identify updates of the same files)
    • Useful metadata (e.g. factory/entry_set/entry, frontend/group) to identify who could be interested
    • Security consideration: authenticated messages, ...
  7. Development and integration related to distributed computing software for Grids, Clouds, and Supercomputers
  8. Testing on High-Performance Computers and clouds
  9. Integration in production
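
As a sketch for items 4 and 6 above, the fragment below composes a single log line carrying a globally unique Glidein ID, routing metadata (factory/entry, frontend group), and an HMAC signature so receivers can authenticate the sender. The field names, the layout, and the GLIDEIN_LOG_SECRET variable are illustrative assumptions, not a final format:

```shell
# Globally unique Glidein ID, stable for the lifetime of this Glidein
glidein_uuid=$(cat /proc/sys/kernel/random/uuid 2>/dev/null || uuidgen)

make_envelope() {
    # $1: severity, $2: message body -> one signed envelope line on stdout
    local payload sig
    payload=$(printf '{"id":"%s","factory":"%s","entry":"%s","group":"%s","severity":"%s","msg":"%s"}' \
        "$glidein_uuid" "$glidein_factory" "$glidein_entry" "$frontend_group" "$1" "$2")
    # Authenticate with a shared secret (distributed out of band)
    sig=$(printf '%s' "$payload" | openssl dgst -sha256 -hmac "$GLIDEIN_LOG_SECRET" | awk '{print $NF}')
    printf '%s %s\n' "$sig" "$payload"
}
```

Subscribers can filter on the factory/entry/group fields and recompute the HMAC to discard forged or corrupted messages.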
Some shortfalls of the current Glidein logging:
  • Reports only stdout and stderr
  • Missing stdout/err for some Glideins (especially killed ones)
  • Information only at the end (flush)
  • Not reporting to multiple listeners
  • Confusing or missing information from indirect and multi-job submissions
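
The "information only at the end" shortfall can be addressed by shipping increments. The sketch below keeps a byte offset per file and forwards only the newly appended bytes on each call; the spool directory stands in for the real transport (e.g. a curl upload) and all names are illustrative:

```shell
ship_increment() {
    # $1: file to follow (e.g. _condor_stdout), $2: spool dir receiving chunks
    local src=$1 spool=$2 state offset size
    state="$spool/$(basename "$src").offset"
    offset=0
    [ -f "$state" ] && offset=$(cat "$state")
    size=$(stat -c %s "$src" 2>/dev/null || stat -f %z "$src")
    if [ "$size" -gt "$offset" ]; then
        # Extract bytes [offset, size) and "transmit" them as one chunk;
        # in production this chunk would be POSTed to the collector instead
        tail -c +"$((offset + 1))" "$src" > "$spool/$(basename "$src").$size"
        echo "$size" > "$state"
    fi
}
```

Called periodically from the Glidein, this delivers partial stdout/stderr even for Glideins that are later killed before flushing.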

Consider also providing a critique of the current GlideinWMS software and suggestions to improve it, e.g. adding unit tests, linting, using specific libraries, ... Some of this is mentioned in #20901


Subtasks

Feature #23117: Additional logging channel (Work in progress, Leonardo Lai)

Feature #23265: Add security mechanisms to the glidein logging channel (Work in progress, Leonardo Lai)

History

#1 Updated by Marco Mambelli about 1 month ago

Some more details about item 2 above (logging solutions)

Some links about remote logging (solutions, discussions):
https://dzone.com/articles/what-is-remote-logging-1
https://www.loggly.com/blog/what-is-remote-logging/
https://www.netgate.com/docs/pfsense/book/monitoring/remote-logging.html
https://www.techjini.com/blog/remote-logging-and-its-importance/
https://bugfender.com/
https://stackify.com/best-log-management-tools/
https://www.owasp.org/index.php/Logging_Cheat_Sheet
https://docs.python.org/3/library/logging.html
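
Several of the links above cover classic syslog-based remote logging; as a point of comparison, a minimal rsyslog forwarding rule (hostname and port are placeholders) is just:

```
# /etc/rsyslog.conf fragment: forward every facility/priority to a central
# collector over TCP (the @@ prefix; a single @ would use UDP instead)
*.* @@loghost.example.org:514
```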

Streaming the information and handling publishers and subscribers are central parts of logging; they are typically handled by streaming platforms and message queues:
https://kafka.apache.org/documentation/#introduction
https://www.rabbitmq.com/
http://zeromq.org/
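
Without a broker, the many-to-many pattern degenerates into each Glidein delivering every message to all registered listeners itself. The toy sketch below shows that fan-out; the listener names and file-based delivery are stand-ins for real subscriber URLs and uploads, which a platform like those above would manage instead:

```shell
# Stand-ins for subscriber endpoints; a broker would manage this list for us
listeners="listener_a listener_b"

publish() {
    # $1: message -> deliver one copy to every registered listener
    local msg=$1 l
    for l in $listeners; do
        # In production: an upload or broker publish call per listener
        echo "$msg" >> "$l.inbox"
    done
}
```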

Here is a fragment for glidein_startup.sh that sends a custom log, using PHP on the receiving Web server:

function send_logs {
   # Forward the Glidein stdout/stderr to the debug Web server when DEBUG_PILOT is set
   debug_pilot_enabled=$(grep '^DEBUG_PILOT' "$glidein_config" | awk '{print $2}')
   if [ -n "$debug_pilot_enabled" ]; then
       pilot_id="${glidein_factory}_${glidein_entry}_${condorg_cluster}_${condorg_subcluster}"
       cp "$PWD/../_condor_stderr" "$PWD/${pilot_id}_condor_stderr"
       cp "$PWD/../_condor_stdout" "$PWD/${pilot_id}_condor_stdout"
       curl -F "file=@$PWD/${pilot_id}_condor_stderr" http://vocms0801.cern.ch/si_stuffs/debug_pilots/fileupload.php
       curl -F "file=@$PWD/${pilot_id}_condor_stdout" http://vocms0801.cern.ch/si_stuffs/debug_pilots/fileupload.php
   fi
}

Here is the PHP script on the frontend:
/var/www/html/si_stuffs/debug_pilots/fileupload.php

<?php
   // Receive an uploaded Glidein log file and store it under a per-site directory
   $timestamp = date('Ymd_His');
   $log_file = "/tmp/debug_pilot_post2.log";
   $filename = basename($_FILES['file']['name']);
   // The site name (e.g. T2_US_XYZ) is embedded in the uploaded file name
   if (!preg_match('/T[0-9]+_[A-Z]+_[A-Z0-9]+/', $filename, $matches)) {
       file_put_contents($log_file, "Rejected (no site name): ".$filename."\n", FILE_APPEND);
       exit;
   }
   $site_name = $matches[0];
   $uploaddir = "/var/www/html/si_stuffs/debug_pilots/uploads/".$site_name."/";
   if (!file_exists($uploaddir)) {
       mkdir($uploaddir, 0755, true);
   }
   // Timestamp the stored copy so repeated uploads from the same Glidein do not collide
   $uploadfile = $uploaddir.$filename."_".$timestamp;
   move_uploaded_file($_FILES['file']['tmp_name'], $uploadfile);
   file_put_contents($log_file, "Uploading: ".$uploadfile."\n", FILE_APPEND);
   // To test from the Glidein side:
   // curl -F file=@$PWD/$LOG_FILE http://vocms0801.cern.ch/si_stuffs/debug_pilots/fileupload.php
?>

#2 Updated by Marco Mambelli about 1 month ago

  • Description updated (diff)

#3 Updated by Leonardo Lai about 1 month ago

  • Start date changed from 08/08/2019 to 08/14/2019
  • Due date set to 08/14/2019

due to changes in a related task: #23117

#4 Updated by Leonardo Lai 5 days ago

  • Start date changed from 08/14/2019 to 09/12/2019
  • Due date set to 09/12/2019

due to changes in a related task: #23117


