Feature #13277

Gather and aggregate pilot accounting information, especially payload information

Added by Marco Mambelli over 4 years ago. Updated almost 4 years ago.

Status: Closed
Priority: High
Category: Factory Monitoring
Target version:
Start date: 07/18/2016
Due date:
% Done: 0%
Estimated time:
Stakeholders: CMS
Duration:

Description

Below is the email from Brian formulating the request.
CMS is also discussing this in the CMS Sub. Inf meeting, as reported by Antonio.

Email:

Hi all,

CMS / WLCG is starting to attempt cross-checks of our various accounting data. One particular weak spot is the link between the glideinWMS pilots and the payloads. For both of these, we currently use the remote condor_history mechanism to aggregate completed job information into an ElasticSearch database (similar to GRACC and FIFEMON).

I’d like to get the various XML data that’s found in the pilot stdout/err into the ClassAd. I think this is relatively easy to do.

Basically, here’s my idea:
- Set LeaveJobInQueue so the job will stay in the “completed” state in the schedd until the attribute HasPostProcessingData is set (or a modest time, such as 12 hours, has passed).
  - Will require a small edit to the generated submit file.
- As part of the post-processing of job stdout already done by the factory, decode the XML that records things like “number of payloads”, “payload runtime”, “verification state”, etc. Turn this into ClassAd attribute / value pairs and insert into the job ad.
  - Use the transaction-based API in the condor python bindings so this is all set in one large transaction. This minimizes overhead in the schedd versus many condor_qedit.
  - I think interacting with the schedd is the only “new” part here.
- Once the stdout file is parsed, set HasPostProcessingData to true. If the stdout is missing the data - or is empty - set HasBadStdout or HasEmptyStdout as necessary.
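
A minimal sketch of the transport step outlined in the list above, under stated assumptions. The XML layout (`<glidein_activity>` with `<metric name=... value=.../>` children), the `Glidein*` attribute names, and the `LEAVE_IN_QUEUE_EXPR` expression are hypothetical and are not taken from the factory code. The `push_attrs` helper assumes a schedd object that behaves like `htcondor.Schedd` in the classic Python bindings (a `transaction()` context manager and `edit(job_spec, attr, value)`); this should be checked against the bindings version actually in use.

```python
import re
import xml.etree.ElementTree as ET

# ASSUMPTION: a possible value for the submit file's leave_in_queue command.
# The job stays in the queue after completion until HasPostProcessingData is
# set or 12 hours have passed. The real expression is not given in the ticket.
LEAVE_IN_QUEUE_EXPR = (
    "(JobStatus == 4) && (HasPostProcessingData =!= true) && "
    "((time() - CompletionDate) < 12 * 3600)"
)

# HYPOTHETICAL mapping from XML metric names to ClassAd attribute names.
METRIC_TO_ATTR = {
    "payloads": "GlideinPayloadCount",
    "payload_runtime": "GlideinPayloadRuntime",
    "validation_state": "GlideinValidationState",
}

ROOT_TAG = "glidein_activity"  # hypothetical root element name


def to_classad_literal(text):
    """Render a raw string as a ClassAd literal (integer, real, or string)."""
    if re.match(r"^-?\d+$", text) or re.match(r"^-?\d+\.\d+$", text):
        return text
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def stdout_to_attrs(stdout_text):
    """Turn pilot stdout into a dict of ClassAd attribute -> expression string."""
    if not stdout_text.strip():
        return {"HasEmptyStdout": "true"}
    start = stdout_text.find("<" + ROOT_TAG)
    end = stdout_text.find("</" + ROOT_TAG + ">")
    if start < 0 or end < 0:
        return {"HasBadStdout": "true"}
    fragment = stdout_text[start:end + len(ROOT_TAG) + 3]
    try:
        root = ET.fromstring(fragment)
    except ET.ParseError:
        return {"HasBadStdout": "true"}
    attrs = {}
    for metric in root.iter("metric"):
        name = METRIC_TO_ATTR.get(metric.get("name"))
        value = metric.get("value")
        if name is not None and value is not None:
            attrs[name] = to_classad_literal(value)
    attrs["HasPostProcessingData"] = "true"
    return attrs


def push_attrs(schedd, job_id, attrs):
    """Write every attribute for one job inside a single schedd transaction.

    ASSUMPTION: `schedd` behaves like htcondor.Schedd in the classic bindings.
    """
    with schedd.transaction():
        for name, value in sorted(attrs.items()):
            schedd.edit([job_id], name, value)
```

With real bindings the call might look like `push_attrs(htcondor.Schedd(), "1234.0", stdout_to_attrs(text))`. In this sketch a bad or empty stdout sets only the corresponding flag, so such a job would leave the queue through the 12-hour timeout; setting HasPostProcessingData in those cases as well would be an equally reasonable choice.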

Once we have the mechanism to move data from pilot stdout to pilot ClassAd, we can start re-evaluating which pieces of data we move (for example: we should probably include the validation failure reason). For now, I’d like to focus on the transport mechanism.

With this, I hope to get an estimate of the CPU usage for each pilot job, even on CEs that don’t provide this information from the batch system (CREAM). This is essential to validating the EGI accounting data.

Thoughts?

Thanks all,

Brian

20160714_SI_APCY.pdf (645 KB) Marco Mambelli, 07/19/2016 09:50 AM

Related issues

Related to GlideinWMS - Feature #11755: Log number of activation/claims per glidein (Closed, 02/17/2016)

History

#1 Updated by Marco Mambelli over 4 years ago

Here is the email with a similar request formulated by Antonio.

Hi Marco,

This is the email regarding pilot accounting information I mentioned in my previous message. Basically, I think we'd need to know for each pilot how long it ran and how effectively it used its resources (number of payloads executed, fraction of the time dedicated to validation, fraction of the time with resources being idle, etc). We need a way to get that info, if it's already available at the pilot level, and be able to aggregate it in diverse ways for monitoring and accounting purposes.

Cheers,

Antonio.

On Wed, Jul 13, 2016 at 3:22 PM, Antonio Perez-Calero Yzquierdo wrote:
Dear Parag, all

As I commented in one of our past meetings, I'd like to understand how to obtain information per glidein on what fraction of the time their resources were actively being used. I mean used to run payloads, as opposed to running validation, idle while waiting for a matching job during a negotiation cycle, during the draining phase after retire time, etc.

The motivation for this comes from CMS's need to monitor and account for pilot effects in our use of the resources and then, together with payload information, to try to reproduce and cross-check what sites report.

When I originally asked about it in the meeting, I was not aware of the factory monitoring from pilot logs, such as can be found at

http://vocms0305.cern.ch/factory/monitor/factoryCompletedStats.html

Apparently it contains information similar or identical to what I am asking about, so I understand that the information does exist per glidein and is reported in the logs sent to the factory after the pilot finishes. So I need to get the logs from the factories, but since I can't access the production factories, are they being replicated somewhere else?

In any case, I would like to understand how some of these metrics are being calculated for multicore partitionable glideins. For example, is wallclock time being weighted by the number of cores when combining single and multicore pilots in the monitoring views? Then, is wastage for a partially empty pilot being weighted by cores? Also, do we have a specific measurement for the idle resources of a pilot after retire time? Is the partitionable parent glidein collecting all the info from the dynamic slots?
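
To make the weighting question above concrete, here is a small sketch of what weighting by cores would mean: usage is counted in core-seconds rather than per pilot. This is an illustration only and not a description of how the glideinWMS monitoring computes its metrics; the field names (`cores`, `wallclock_s`, `busy_core_s`) and the sample numbers are hypothetical.

```python
def usage_summary(pilots):
    """Return (core-weighted wasted fraction, unweighted mean wasted fraction)."""
    total_core_s = 0.0
    busy_core_s = 0.0
    per_pilot_waste = []
    for p in pilots:
        # Capacity of a pilot in core-seconds: cores times wallclock time.
        capacity = p["cores"] * p["wallclock_s"]
        total_core_s += capacity
        busy_core_s += p["busy_core_s"]
        per_pilot_waste.append(1.0 - p["busy_core_s"] / capacity)
    weighted = 1.0 - busy_core_s / total_core_s
    unweighted = sum(per_pilot_waste) / len(per_pilot_waste)
    return weighted, unweighted


# Hypothetical sample: one single-core pilot and one 8-core pilot.
pilots = [
    {"cores": 1, "wallclock_s": 1000, "busy_core_s": 900},
    {"cores": 8, "wallclock_s": 1000, "busy_core_s": 4000},
]
weighted, unweighted = usage_summary(pilots)
```

For this sample the core-weighted wasted fraction is about 0.46 (4100 of 9000 core-seconds), while the unweighted per-pilot average is 0.30, so the two conventions can give noticeably different pictures when single-core and multicore pilots are combined.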

In order to get the aggregated view for a certain period for the whole of CMS, or for a specific site having multiple entries, I'd have to get the logs from across the different factories supporting CMS and build the metrics myself, as there is no aggregated view as such provided by the glideinWMS monitoring, right? Brian, do we already have a Kibana monitor based on all collected pilot logs?

Thanks in advance,

Antonio.

#2 Updated by Marco Mambelli over 4 years ago

This was also discussed at last Wednesday's GWMS meeting and at Thursday's CMS Sub. Inf meeting. Antonio provided the slides, which are also available at:
https://indico.cern.ch/event/547516/contributions/2220402/attachments/1309870/1959449/20160714_SI_APCY.pdf
The attachment is available above, at the end of the ticket description.

#3 Updated by Marco Mambelli over 4 years ago

  • Stakeholders updated (diff)

#4 Updated by Parag Mhashilkar over 4 years ago

  • Priority changed from Normal to High
  • Target version changed from v3_2_x to v3_2_16

#5 Updated by Parag Mhashilkar over 4 years ago

  • Related to Feature #11755: Log number of activation/claims per glidein added

#6 Updated by Parag Mhashilkar over 4 years ago

  • Assignee changed from Marco Mambelli to Marco Mascheroni

#7 Updated by Parag Mhashilkar about 4 years ago

  • Target version changed from v3_2_16 to v3_2_17

#8 Updated by Parag Mhashilkar about 4 years ago

This was discussed during the stakeholders meeting. Having this information in the glidein job's ClassAd, in addition to the other places where it is reported, will make it more useful. This way the information is also stored in the job's history.

#9 Updated by Marco Mascheroni almost 4 years ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mascheroni to Parag Mhashilkar

#10 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from Feedback to Resolved
  • Assignee changed from Parag Mhashilkar to Marco Mascheroni

Merged to branch_v3_2

#11 Updated by Parag Mhashilkar almost 4 years ago

  • Status changed from Resolved to Closed
