Bug #17822

HTCondor QEdit triggered also when advertise_pilot_accounting is not set

Added by Marco Mambelli over 2 years ago. Updated over 2 years ago.

Status:
Closed
Priority:
Urgent
Category:
-
Target version:
Start date:
10/04/2017
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

HTCondor QEdit is triggered even when advertise_pilot_accounting is not set and the job is no longer in the queue. This triggers an HTCondor error.

Here are some notes and messages from 3.2.20.rc2 with some extra logging:

* Unable to query Factory - happens intermittently
Seems to happen when the job is done and no longer in the queue
schedd.edit() fails
update_classads() in glideFactory.py is called for completed jobs that are no longer in the queue, so it fails
Can it be ignored (the job was already edited), or is it an error?

[2017-10-03 15:29:15,010] DEBUG: glideFactoryCredentials:97: updating credential file /var/lib/gwms-factory/client-proxies/user_frontend/glidein_gfactory_instance/credential_fermicloud134-fnal-gov_OSG_gWMSFrontend.main_428803
[2017-10-03 15:29:15,011] DEBUG: glideFactoryCredentials:100: updating using privsep
[2017-10-03 15:29:15,041] INFO: glideFactory:521: Checking EntryGroups [0]
[2017-10-03 15:29:15,041] INFO: glideFactory:581: Aggregate monitoring data
[2017-10-03 15:29:15,124] INFO: glideFactory:587: Starting updating job classads
[2017-10-03 15:29:15,136] ERROR: glideFactory:97: Failed to add monitoring info to the glidein job classads msg1/2: Error querying schedd fermicloud131.fnal.gov in pool default using python bindings: Unable to edit job
[2017-10-03 15:29:15,137] DEBUG: glideFactory:98: MMDB Failure detail (jobinfo, vals): ('fermicloud131.fnal.gov', None) {'018.003': {'glidein_duration': 888, 'condor_duration': 875, 'condor_started': 1, 'activation_claims': 3, 'numjobs': 3}, '018.000': {'glidein_duration': 913, 'condor_duration': 900, 'condor_started': 1, 'activation_claims': 4, 'numjobs': 4}}
<built-in method values of dict object at 0x19a05c0>
[2017-10-03 15:29:15,137] ERROR: glideFactory:99: MMDB Failed to add monitoring info to the glidein job classads: Error querying schedd fermicloud131.fnal.gov in pool default using python bindings: Unable to edit job
Traceback (most recent call last):
  File "/usr/sbin/glideFactory.py", line 95, in update_classads
    values=map(json.dumps, joblist.values()))
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 336, in executeAll
    raise QueryError(err_str)
QueryError: Error querying schedd fermicloud131.fnal.gov in pool default using python bindings: Unable to edit job
[2017-10-03 15:29:15,147] INFO: glideFactory:589: Finishing updating job classads
[2017-10-03 15:29:15,170] INFO: glideFactory:609: Sleep 59.8163831234 secs
[2017-10-03 15:30:15,046] INFO: glideFactory:500: Checking for credentials ['ITB_FC_CE2', 'ITB_FC_CE3']
[2017-10-03 15:30:15,070] DEBUG: glideFactoryCredentials:170: updating credential for frontend
[2017-10-03 15:30:15,070] DEBUG: glideFactoryCredentials:97: updating credential file /var/lib/gwms-factory/client-proxies/user_frontend/glidein_gfactory_instance/credential_fermicloud134-fnal-gov_OSG_gWMSFrontend.main_428803

Unmasking error:
[2017-10-03 17:49:26,526] ERROR: glideFactory:809: Exception occurred spawning the factory:
Traceback (most recent call last):
  File "/usr/sbin/glideFactory.py", line 794, in main
    frontendDescript, entries, restart_attempts, restart_interval)
  File "/usr/sbin/glideFactory.py", line 588, in spawn
    update_classads()
  File "/usr/sbin/glideFactory.py", line 95, in update_classads
    values=map(json.dumps, joblist.values()))
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorMonitor.py", line 332, in executeAll
    schedd.edit([jobid], attr, classad.quote(val))
RuntimeError: Unable to edit job

Editing job 019.004: MONITOR_INFO, "{\"glidein_duration\": 607, \"condor_duration\": 600, \"condor_started\": 1, \"activation_claims\": 30, \"numjobs\": 30}" 
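
One possible way to handle the failure mode shown in the tracebacks is to catch the RuntimeError raised by schedd.edit() for each job and collect the failures, so that one job that has left the queue does not abort the whole update. The sketch below is not the GlideinWMS code: edit_jobs and FakeSchedd are hypothetical names, FakeSchedd only imitates the "Unable to edit job" behavior seen in the log, and json.dumps stands in for classad.quote so the example runs without the HTCondor bindings.

```python
import json


def edit_jobs(schedd, attr, values_by_jobid, quote=json.dumps):
    """Edit `attr` on each job; return {jobid: error message} for failed edits.

    Hypothetical helper. `quote` stands in for classad.quote.
    """
    failed = {}
    for jobid, val in values_by_jobid.items():
        try:
            # Same call shape as in the traceback: one job per edit() call.
            schedd.edit([jobid], attr, quote(val))
        except RuntimeError as err:
            # Assumption: a job that is no longer in the queue raises RuntimeError.
            failed[jobid] = str(err)
    return failed


class FakeSchedd(object):
    """Test stand-in for a schedd object; NOT the real HTCondor API."""

    def __init__(self, queued):
        self.queued = set(queued)
        self.edits = []

    def edit(self, jobids, attr, value):
        for jobid in jobids:
            if jobid not in self.queued:
                raise RuntimeError("Unable to edit job")
        self.edits.append((tuple(jobids), attr, value))


schedd = FakeSchedd(queued=["018.000"])
failed = edit_jobs(schedd, "MONITOR_INFO",
                   {"018.000": "{}", "018.003": "{}"})
print(failed)
```

Whether such failures should be ignored or logged is the open question raised in the notes above; the sketch only separates them from the successful edits so the caller can decide.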

History

#1 Updated by Marco Mambelli over 2 years ago

The test machines are fermicloud131 (factory), fermicloud134 (frontend)

#2 Updated by Marco Mambelli over 2 years ago

  • Status changed from Assigned to Resolved
  • Assignee changed from Marco Mascheroni to Marco Mambelli

An attribute in the data dict was treated as a boolean instead of a string.
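
The issue does not show the actual fix. As an illustration of this class of bug only, the sketch below shows how a string value such as "False" evaluates as true in a Python condition, and one way to interpret it explicitly. The helper name is_true and the set of accepted strings are assumptions, not taken from the GlideinWMS source.

```python
def is_true(value):
    """Interpret a config value that may be a bool or a string like 'False'.

    Hypothetical helper; the accepted strings are an assumption.
    """
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("true", "1", "yes")


# Any non-empty string is truthy, so a plain `if value:` test is misleading.
setting = "False"
print(bool(setting))     # truthiness of the raw string
print(is_true(setting))  # explicit interpretation of the string
```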

#3 Updated by Marco Mambelli over 2 years ago

  • Status changed from Resolved to Feedback
  • Assignee changed from Marco Mambelli to Dennis Box

#4 Updated by Dennis Box over 2 years ago

  • Assignee changed from Dennis Box to Marco Mambelli

#5 Updated by Dennis Box over 2 years ago

Feedback sent via email.

#6 Updated by Marco Mambelli over 2 years ago

  • Status changed from Feedback to Resolved

#7 Updated by Marco Mambelli over 2 years ago

  • Status changed from Resolved to Closed

