Feature #22866

Create dynamic pages serving the Glideins stdout, stderr and included content

Added by Marco Mambelli 5 months ago. Updated 12 days ago.

Status: New
Priority: Normal
Assignee:
Category: GlideinMonitor
Target version:
Start date: 07/04/2019
Due date:
% Done: 0%
Estimated time:
Stakeholders:
Duration:

Description

From preliminary observation, it seems that:
The factory stores Glidein log files in a directory tree starting at /var/log/gwms-factory/client/
The tree has a folder per Frontend user, each with a folder per Factory ID.
These folders have multiple folders named entry_AAA, each with the log files for the Glideins sent to the corresponding AAA entry.
These include HTCondor logs (condor_activity... , one per day) and the stdout and stderr from the Glideins (job.NNN.MM.out and .err, where NNN and MM are the condor cluster and process ID for the job, a counter always incremented for a given schedd, so one entry will have non-consecutive unique numbers)
stdout/err contain structured information including the results of tests, Unix environment, and the HTCondor log files
Tools to extract the condor logs are in GWMS: factory/tools/gwms-logcat.sh, cat_logs.py, cat_MasterLog.py, cat_... , cat_XMLResult.py
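
The layout described above can be sketched as a small path parser. This is only an illustration, assuming the tree is exactly root / Frontend user / Factory ID / entry_AAA / job.NNN.MM.out or .err; the names `GlideinLog` and `parse_glidein_log_path` and the sample folder names are hypothetical and not part of GWMS.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

# Assumed file name pattern: job.<cluster>.<process>.out or .err
JOB_FILE_RE = re.compile(r"^job\.(?P<cluster>\d+)\.(?P<process>\d+)\.(?P<stream>out|err)$")


class GlideinLog(NamedTuple):
    """Fields recovered from the path of one Glidein stdout/stderr file (hypothetical type)."""
    frontend_user: str
    factory_id: str
    entry: str
    cluster: int
    process: int
    stream: str  # "out" or "err"


def parse_glidein_log_path(
    path: str, root: str = "/var/log/gwms-factory/client"
) -> Optional[GlideinLog]:
    """Return the parsed fields, or None if the path is not a Glidein stdout/stderr file."""
    try:
        rel = PurePosixPath(path).relative_to(root)
    except ValueError:
        return None  # not under the log root
    parts = rel.parts
    # Assumption: exactly <frontend user>/<factory ID>/entry_<name>/<file>
    if len(parts) != 4 or not parts[2].startswith("entry_"):
        return None
    match = JOB_FILE_RE.match(parts[3])
    if match is None:
        return None  # e.g. an HTCondor activity log
    return GlideinLog(
        frontend_user=parts[0],
        factory_id=parts[1],
        entry=parts[2][len("entry_"):],
        cluster=int(match["cluster"]),
        process=int(match["process"]),
        stream=match["stream"],
    )
```

A dynamic page could use such a parser to index the files by entry and job ID before serving them.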

The tree should be copied in a new location to avoid crowding the Factory drive and because the factory periodically purges the files.
Compression of the stdout/err files seems effective and should be considered (e.g. tar.gz files with the stdout/err from a job could be created).
The decompression and decoding of the stdout/err files should happen client-side to keep the server load low.
The files should be served in a secure way, e.g. authenticating the users with the x509 certificates, username/password or SSO.
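
As a rough illustration of the compression idea, the sketch below bundles the stdout and stderr of one job into a single tar.gz file in a separate archive location. It uses only the Python standard library; the function name `archive_job_logs` and the one-archive-per-job naming are assumptions made for the example, not a decided design.

```python
import tarfile
from pathlib import Path
from typing import Optional


def archive_job_logs(entry_dir, cluster: int, process: int, dest_dir) -> Optional[Path]:
    """Pack job.<cluster>.<process>.out/.err from entry_dir into one tar.gz in dest_dir.

    Returns the archive path, or None if neither file exists.
    """
    entry_dir = Path(entry_dir)
    dest_dir = Path(dest_dir)
    stem = f"job.{cluster}.{process}"
    # Keep only the stdout/stderr files that are actually present
    members = [entry_dir / f"{stem}.{ext}" for ext in ("out", "err")]
    members = [p for p in members if p.is_file()]
    if not members:
        return None
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{stem}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for p in members:
            # Store bare file names so the archive does not leak the factory paths
            tar.add(p, arcname=p.name)
    return archive
```

The resulting file can be served as-is, leaving decompression to the client as suggested above.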


Related issues

Blocks GlideinWMS - Feature #22848: Improve Glidein monitoring and troubleshooting (New, 06/28/2019)

History

#1 Updated by Marco Mambelli 5 months ago

  • Blocks Feature #22848: Improve Glidein monitoring and troubleshooting added

#2 Updated by Marco Mascheroni 5 months ago

The tree should be copied in a new location to avoid crowding the Factory drive and because the factory periodically purges the files.

By this you mean rsync to another machine, right? Because IMHO rsync to a new location on the same machine just creates overload.

#3 Updated by Marco Mambelli 3 months ago

An initial version of the system is in GitHub: https://github.com/onlineth/FactoryMonitoringIndex
Attached are 2 files provided by Thomas, with a description of the system and a summary of the activity on this ticket:
  • GlideinWMS Monitoring Dashboard - Deployment.pdf
  • Summer 2019 Summarization.pdf

The current version implements most of the initial requests but does not include authentication.
Thomas and Marco also participated in a SpinUp workshop, testing a deployment at NERSC on Spin.

Some next steps:
  • moving to the FNAL git repo:
    # --mirror makes a bare clone, so the directory is FactoryMonitoringIndex.git
    git clone --mirror https://github.com/onlineth/FactoryMonitoringIndex.git
    cd FactoryMonitoringIndex.git
    git remote add new ssh://p-glideinwms@cdcvs.fnal.gov/cvs/projects/glideinmonitor
    git push --mirror new
    
  • add a GlideinMonitor/glideinmonitor repo on github in the GlideinWMS group
  • Fix the one container demo for Spin (all self contained, including a log sample)
    • Fix authentication on the current version
  • Prepare the multi-container demo on Spin (NFS space, web server/app, updater, DB)
  • reorganize the folders in the repo to include Dockerfiles and orchestration

#4 Updated by Marco Mambelli 2 months ago

  • Target version changed from v3_5_x to v3_6_1

#5 Updated by Marco Mambelli about 1 month ago

  • Target version changed from v3_6_1 to v3_6_2

#6 Updated by Marco Mambelli 12 days ago

  • Category set to GlideinMonitor

