Gather and aggregate pilot accounting information, especially payload information
Below is the email from Brian formulating the request.
CMS is also discussing this in the CMS Sub. Inf meeting, as reported by Antonio.
CMS / WLCG is starting to attempt cross-checks of our various accounting data. One particular weak spot is the link between the glideinWMS pilots and the payloads. For both of these, we currently use the remote condor_history mechanism to aggregate completed job information into an ElasticSearch database. (Similar to GRACC and FIFEMON).
I’d like to get the various XML data that’s found in the pilot stdout/err into the ClassAd. I think this is relatively easy to do.
Basically, here’s my idea:
- Set LeaveJobInQueue so the job will stay in the “completed” state in the schedd until the attribute HasPostProcessingData is set (or a modest time, such as 12 hours, has passed).
- Will require a small edit to the generated submit file.
- As part of the post-processing of job stdout already done by the factory, decode the XML that records things like “number of payloads”, “payload runtime”, “verification state”, etc. Turn this into ClassAd attribute / value pairs and insert into the job ad.
- Use the transaction-based API in the condor python bindings so this is all set in one large transaction. This minimizes overhead in the schedd compared with many separate condor_qedit calls.
- I think interacting with the schedd is the only “new” part here.
- Once the stdout file is parsed, set HasPostProcessingData to true. If the stdout is missing the data - or is empty - set HasBadStdout or HasEmptyStdout as necessary.
Once we have the mechanism to move data from pilot stdout to pilot ClassAd, we can start re-evaluating which pieces of data we move (for example: we should probably include the validation failure reason). For now, I’d like to focus on the transport mechanism.
With this, I hope to get an estimate of the CPU usage for each pilot job, even on CEs that don’t provide this information from the batch system (CREAM). This is essential to validating the EGI accounting data.
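A minimal sketch of the transport steps listed in the email above (parse the pilot stdout, build attribute/value pairs, set the status flags). The XML layout (`PilotSummary`, `metric`), the `Pilot` attribute prefix and the helper names are hypothetical and only serve the illustration; the real pilot stdout format may differ. The `leave_in_queue` expression is an untested guess at the submit file edit, and the schedd write itself is left out because it needs a live schedd.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: one possible submit-file expression for the "stay in queue until
# HasPostProcessingData is set, or 12 hours have passed" idea. Not tested.
LEAVE_IN_QUEUE_EXPR = (
    "(HasPostProcessingData =!= true) && (time() - CompletionDate < 12 * 3600)"
)

# Hypothetical stdout content; tag and metric names are made up for this sketch.
SAMPLE_STDOUT = """\
some pilot log line
<PilotSummary>
  <metric name="NumPayloads">3</metric>
  <metric name="PayloadRuntime">5400</metric>
  <metric name="VerificationState">OK</metric>
</PilotSummary>
trailing log line
"""


def to_classad_literal(value):
    """Render a text value as a ClassAd literal: number as-is, else quoted string."""
    if re.fullmatch(r"-?\d+(\.\d+)?", value):
        return value
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def stdout_to_classad_attrs(stdout_text, tag="PilotSummary"):
    """Map pilot stdout to a dict of ClassAd attribute name -> expression string."""
    if not stdout_text.strip():
        return {"HasEmptyStdout": "true"}
    start = stdout_text.find("<%s>" % tag)
    end = stdout_text.find("</%s>" % tag)
    if start < 0 or end < start:
        return {"HasBadStdout": "true"}
    try:
        # len("</tag>") is len(tag) + 3
        root = ET.fromstring(stdout_text[start:end + len(tag) + 3])
    except ET.ParseError:
        return {"HasBadStdout": "true"}
    attrs = {}
    for metric in root.iter("metric"):
        name = metric.get("name")
        if name:
            attrs["Pilot" + name] = to_classad_literal((metric.text or "").strip())
    # Set last, so that consumers only see the flag once the data is complete.
    attrs["HasPostProcessingData"] = "true"
    return attrs


if __name__ == "__main__":
    for attr, expr in stdout_to_classad_attrs(SAMPLE_STDOUT).items():
        print(attr, "=", expr)
```

The resulting pairs could then be written to the job ad, for example with `Schedd.edit` from the condor python bindings, grouped in a single transaction where the installed bindings version supports it.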
#1 Updated by Marco Mambelli over 4 years ago
Here is the email with a similar request formulated by Antonio.
This is the email regarding pilot accounting information I mentioned in my previous message. Basically, I think we'd need to know for each pilot how long it ran and how effectively it used its resources (number of payloads executed, fraction of the time dedicated to validation, fraction of the time with resources being idle, etc). We need a way to get that info, if it's already available at the pilot level, and be able to aggregate it in diverse ways for monitoring and accounting purposes.
On Wed, Jul 13, 2016 at 3:22 PM, Antonio Perez-Calero Yzquierdo <email@example.com> wrote:
Dear Parag, all
As I commented in one of our past meetings, I'd like to understand how to obtain information per glidein on what fraction of the time their resources were actively being used. I mean used to run payloads, as opposed to running validation, idle while waiting for a matching job during a negotiation cycle, during the draining phase after retire time, etc.
The motivation for this comes from the CMS need to monitor and account for pilot effects in our use of the resources and then, together with payload information, to try to reproduce and cross-check what sites report.
When I originally asked about it in the meeting, I was not aware of the factory monitoring from pilot logs, such as can be found at
Apparently, it contains information similar or identical to what I am asking about, so I understand that the information does exist per glidein and is reported in the logs sent to the factory after the pilot finishes. So I need to get the logs from the factories, but since I can't access the production factories, are they being replicated somewhere else?
In any case, I would like to understand how some of these metrics are being calculated for multicore partitionable glideins. For example, is wallclock time being weighted by the number of cores when combining single and multicore pilots in the monitoring views? Then, is wastage for a partially empty pilot being weighted by cores? Also, do we have a specific measurement for the idle resources of a pilot after retire time? Is the partitionable parent glidein collecting all the info from the dynamic slots?
In order to get the aggregated view for a certain period for the whole of CMS, or for a specific site having multiple entries, I'd have to get the logs from across the different factories supporting CMS and build the metrics myself, as there is no aggregated view as such provided by the glideinWMS monitoring, right? Brian, do we have a Kibana monitor based on all collected pilot logs already?
Thanks in advance,
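To make the weighting question in the email above concrete, here is a small sketch of one possible convention: every quantity is converted to core-seconds before single-core and multicore pilots are combined. This is an assumption for discussion, not a description of what the glideinWMS monitoring currently does; the `PilotUsage` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotUsage:
    """Hypothetical per-pilot record; field names are made up for this sketch."""
    cores: int              # cores held by the pilot
    wallclock_s: float      # pilot lifetime, in seconds
    payload_core_s: float   # core-seconds spent running payloads
    validation_s: float     # seconds spent in validation (all cores held)


def aggregate(pilots):
    """Combine pilots into core-weighted usage fractions."""
    total = sum(p.cores * p.wallclock_s for p in pilots)
    if total == 0:
        return {"core_seconds": 0.0, "payload": 0.0, "validation": 0.0, "idle": 0.0}
    payload = sum(p.payload_core_s for p in pilots)
    # Validation blocks every core of the pilot, so it is weighted by cores.
    validation = sum(p.cores * p.validation_s for p in pilots)
    # Everything else (waiting for a match, draining after retire time, ...)
    # is lumped together as idle in this sketch.
    idle = total - payload - validation
    return {
        "core_seconds": total,
        "payload": payload / total,
        "validation": validation / total,
        "idle": idle / total,
    }


if __name__ == "__main__":
    pilots = [
        PilotUsage(cores=1, wallclock_s=3600, payload_core_s=2700, validation_s=300),
        PilotUsage(cores=8, wallclock_s=3600, payload_core_s=14400, validation_s=300),
    ]
    print(aggregate(pilots))
```

With this convention a half-empty 8-core pilot counts eight times as much as a single-core pilot with the same lifetime, which is the effect the question about weighting wastage by cores is asking about.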
#2 Updated by Marco Mambelli over 4 years ago
This was also discussed at last Wednesday's GWMS meeting and at Thursday's CMS Sub. Inf meeting. Antonio provided slides, which are also available at:
The attachment is available above, at the end of the ticket description.