Bug #11396

Frontend monitoring does not show number of running jobs correctly

Added by Parag Mhashilkar almost 4 years ago. Updated over 3 years ago.

Status: Closed
Priority: Normal
Category: -
Target version:
Start date: 01/11/2016
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: CMS
Duration:

Description

This was pointed out by James Letts. In CMS they observed that the frontend logs correctly show the number of running jobs; however, the monitoring pages show fewer jobs as running for the same time frame. The difference is usually on the order of 20K. Multicore jobs/glideins could be one possible reason for such a discrepancy.

History

#1 Updated by Marco Mascheroni almost 4 years ago

  • Status changed from New to Feedback

Parag Mhashilkar wrote:

This was pointed out by James Letts. In CMS they observed that the frontend logs correctly show the number of running jobs; however, the monitoring pages show fewer jobs as running for the same time frame. The difference is usually on the order of 20K. Multicore jobs/glideins could be one possible reason for such a discrepancy.

It seems I have found where the missing jobs are. Let me summarize where I am with this before the meeting.

The number of running jobs in the frontend monitor (http://cmsgwms-frontend-global.cern.ch/vofrontend/monitor/frontendStatus.html) is wrong; compare it, for example, with the running jobs shown at http://cms-gwmsmon.cern.ch/totalview

Looking at the code, it seems to me that the running jobs (the green line in the frontend chart above) are counted by this function: https://cdcvs.fnal.gov/redmine/projects/glideinwms/repository/revisions/master/entry/frontend/glideinFrontendLib.py#L461

Checking the frontend logs, there are many factory entries reported with 0 running jobs that nevertheless have running jobs according to condor_status. See, for example, this line from the frontend log:

[2016-01-20 15:02:21,755] DEBUG: glideinFrontendLib:516: Example running glidein ids at (u'vocms0305.cern.ch', u'CMSHTPC_T1_FR_CCIN2P3_cccreamceli01_multicore@v3_2@CMS-CERN', u'fecmsglobal@vocms0305.cern.ch') (total 0)

and this from a condor query at the user collector:

condor_status -const 'GLIDECLIENT_Group=="main" && GLIDEIN_Entry_Name=="CMSHTPC_T1_FR_CCIN2P3_cccreamceli01_multicore" && GLIDECLIENT_ReqNode=="vocms0305.cern.ch"' | wc -l
165
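
The comparison above can be repeated for other entries with a small script. The following is only a rough sketch, not part of the frontend code: the regular expression assumes every log line follows the layout of the single example quoted above, and the helper names (parse_log_line, build_constraint, count_slots) are hypothetical. It uses condor_status -af Name to print one line per matching ad, since piping the default output to wc -l may also count header and summary lines.

```python
import re
import subprocess

# Layout assumed from the one DEBUG line quoted in this ticket.
LOG_RE = re.compile(
    r"Example running glidein ids at \(u'(?P<node>[^']+)', "
    r"u'(?P<request>[^']+)', u'(?P<client>[^']+)'\) \(total (?P<total>\d+)\)"
)


def parse_log_line(line):
    """Return node, entry name and frontend total from a log line, or None."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    # The request name looks like <entry>@<glidein>@<factory>; keep the entry part.
    entry = m.group("request").split("@", 1)[0]
    return {"node": m.group("node"), "entry": entry, "total": int(m.group("total"))}


def build_constraint(entry, node, group="main"):
    """Build the same constraint used in the condor_status query above."""
    return (
        'GLIDECLIENT_Group=="%s" && GLIDEIN_Entry_Name=="%s" '
        '&& GLIDECLIENT_ReqNode=="%s"' % (group, entry, node)
    )


def count_slots(constraint):
    """Count ads matching the constraint (requires condor_status on PATH)."""
    out = subprocess.check_output(
        ["condor_status", "-constraint", constraint, "-af", "Name"]
    )
    return len([l for l in out.decode().splitlines() if l.strip()])
```

Feeding each "Example running glidein ids" line through parse_log_line and comparing its total with count_slots(build_constraint(entry, node)) would list the entries where the frontend count and the collector disagree.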

I still have to understand why they are not counted; I'll follow up.

#2 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_13 to v3_2_14

#3 Updated by Parag Mhashilkar over 3 years ago

  • Target version changed from v3_2_14 to v3_2_15

#4 Updated by Parag Mhashilkar over 3 years ago

James mentioned that after the update to v3.2.13, they haven't noticed this issue. So this is addressed and can be closed.

#5 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Feedback to Resolved

Resolving this ticket. If we notice any issues, we can open another ticket for future versions and reference this one.

#6 Updated by Parag Mhashilkar over 3 years ago

  • Status changed from Resolved to Closed

