GlideinMonitor20190906

Convocation:

Greetings,
I'm looking for the correct place to deploy a Glidein monitoring service developed by a summer student who worked with GlideinWMS this past summer.
The service is functional but is still a prototype, so it would be deployed with limited access, to help with GlideinWMS troubleshooting and to further the development of this project.
This is different and independent from the Factories sending their data to Elasticsearch servers (GRACC, Fermilab). The Glidein logs are files with structured information; this service makes them available to troubleshooters.

https://drive.google.com/file/d/1wpyuoQozQmJ3TSLbQusHt0HeBbl7Iv8Y/view?usp=sharing
https://cdcvs.fnal.gov/redmine/issues/22848

There is no design document that can be shared.

The system comprises a Web server using Flask (it creates a structured file area and serves indexes and log pages), a SQLite DB, file storage on disk, and an ingestion area on disk.
The ingestion area can be populated via rsync by one or more Factories.
A minimal deployment would work with 50 GB of data on disk (spinning disk is OK); 1 TB would allow keeping the logs for a long time.
For the DB I'm expecting O(10) MB.
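
To make the data flow above concrete, here is a minimal sketch of the ingestion step, using only the Python standard library. It is not the prototype's code: the function names (ingest, find_logs), the table layout and the flat directory structure are assumptions made for illustration, and the Flask layer that would serve the indexes is left out.

```python
import shutil
import sqlite3
from pathlib import Path


def ingest(ingest_dir: Path, storage_dir: Path, db_path: Path) -> int:
    """Move files from the ingestion area into storage and index them.

    Hypothetical helper: returns the number of files indexed in this pass.
    """
    storage_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(str(db_path))
    try:
        # Assumed schema: one row per log file, keyed by file name.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS logs ("
            " name TEXT PRIMARY KEY,"
            " path TEXT NOT NULL,"
            " size_bytes INTEGER NOT NULL)"
        )
        count = 0
        for src in sorted(ingest_dir.iterdir()):
            if not src.is_file():
                continue  # skip subdirectories in this simplified sketch
            dest = storage_dir / src.name
            size = src.stat().st_size
            shutil.move(str(src), str(dest))
            conn.execute(
                "INSERT OR REPLACE INTO logs (name, path, size_bytes)"
                " VALUES (?, ?, ?)",
                (src.name, str(dest), size),
            )
            count += 1
        conn.commit()
        return count
    finally:
        conn.close()


def find_logs(db_path: Path, pattern: str) -> list:
    """Return (name, path) rows whose name matches a SQL LIKE pattern."""
    conn = sqlite3.connect(str(db_path))
    try:
        return conn.execute(
            "SELECT name, path FROM logs WHERE name LIKE ? ORDER BY name",
            (pattern,),
        ).fetchall()
    finally:
        conn.close()
```

In this sketch the database holds only metadata (name, path, size) and the log files themselves stay on disk, which is consistent with a small DB next to a much larger file store. A Web front end would call something like find_logs to build its index pages.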

Meeting on 9/6, 11:30am in WH12SE, https://fnal.zoom.us/j/695119889

Attending: Marco, Kevin R, Tony T, Joe B, Krista

Until the project has a design document, it will not be considered for hosting at Fermilab. Tony will provide a template to help in writing one. Expected time to write the document: 1-2 person-weeks.
Even once there is a design document, there are concerns that maintaining the service will cost effort and not be worth it, since no customers have requested it. It also has a Web component, which will raise security concerns and make it difficult to approve and maintain. The preference would be for the Factory to host and maintain GlideinMonitoring.
Marco said that a system like this would have helped in troubleshooting Singularity and other Glidein problems: it gives access to the Glidein logs (currently developers have to ask Factory operators to provide a copy), it makes searching easier, and it unpacks and decodes the Glidein logs. CMS and FactoryOps expressed interest.
Since the major hurdle is access to the log files, Marco should request access to the Factory and use grep.
The Factory's normal practice is to provide a copy of the logs to selected customers rather than allow access; nevertheless, Marco will try to request access as suggested.
Marco should look into getting stakeholder buy-in and official requests for the GlideinWMS project before continuing development or requesting resources at Fermilab.