condor_q output may have changed in Condor 7.9pre
We should be proactive about this so that we work with 7.9.x out of the box.
I'm running a pre-release of condor 7.9. The specific RPM is: https://koji.hep.caltech.edu/koji/buildinfo?buildID=773
I believe something is breaking the 'condor_q' command the frontend runs.
The output of the debug log:
[2012-05-02T11:46:38-05:00 2165299] Failed to retrieve jobs state from the subprocess:
[2012-05-02T11:47:48-05:00 2165299] Failed to retrieve jobs state from the subprocess:
[2012-05-02T11:49:05-05:00 2165299] Failed to retrieve jobs state from the subprocess:
[2012-05-02T11:50:16-05:00 2165299] Failed to retrieve jobs state from the subprocess:
From the info log:
[2012-05-02T11:53:37-05:00 2165299] Iteration at Wed May 2 11:53:37 2012
[2012-05-02T11:53:37-05:00 2165299] Querying schedd, entry, and glidein status using child processes.
[2012-05-02T11:53:46-05:00 2165299] WARNING: Failed to retrieve jobs state information from the subprocess.
[2012-05-02T11:53:46-05:00 2165299] WARNING: Missing schedd, factory entry, and/or current glidein state information. Unable to calculate required glideins, terminating loop.
[2012-05-02T11:53:46-05:00 2165299] Writing stats
[2012-05-02T11:53:46-05:00 2165299] Sleep
The logging is not very helpful. The failure only happens if there is a job for the group. If no jobs in the queue match the job query expression, it seems to work (i.e., it sees 0 jobs and advertises 0 needed jobs to the factory).
#1 Updated by Douglas Strain over 7 years ago
Derek and I tracked this down. Apparently, in this development version, condor_q returns XML that has "ProcID" instead of "ProcId". This causes an exception in our parsing.
1) I can't imagine that this was done on purpose. Would it be possible for someone at CondorWeek to mention this to the condor team so it can be fixed before a proper release?
2) Derek also suggested that we add more extensive logging to the subprocess code that forks to run condor_q and condor_status. If the forked process fails, no reason is given, and it is a pain to track down.
I think that #2 should be done as part of this ticket.
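To illustrate the "ProcID" vs "ProcId" failure, here is a minimal sketch of a case-insensitive attribute lookup. It is not the frontend's actual parsing code: `get_attr_ci` is a hypothetical helper, and the job dictionaries are made-up stand-ins for one parsed condor_q record.

```python
def get_attr_ci(classad, name):
    """Return the value stored under `name`, ignoring the case of the key.

    `classad` is assumed to be a plain dict built from one parsed job record.
    """
    # Fast path: the exact spelling is present.
    if name in classad:
        return classad[name]
    # Fallback: compare lower-cased keys so "ProcID" satisfies "ProcId".
    wanted = name.lower()
    for key, value in classad.items():
        if key.lower() == wanted:
            return value
    raise KeyError(name)


# Illustrative records only; the attribute values are invented.
job_released = {"ClusterId": 12, "ProcId": 0}   # spelling we parse today
job_79pre = {"ClusterId": 12, "ProcID": 0}      # spelling seen in 7.9pre

proc_released = get_attr_ci(job_released, "ProcId")
proc_79pre = get_attr_ci(job_79pre, "ProcId")
```

A plain `job_79pre["ProcId"]` raises `KeyError`, which is the kind of exception the parsing hit. The fallback scan costs one pass over the keys, and only when the exact spelling is missing.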
#3 Updated by Derek Weitzel over 7 years ago
#4 Updated by Derek Weitzel over 7 years ago
- File 0001-Adding-logging-for-exceptions-when-running-in-the-su.patch added
Adding (untested) patch for some simple logging that would have caught this error.
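For readers without the attachment, the idea can be sketched roughly as follows. This is not the patch itself: `fork_and_collect` and the logger name are hypothetical, and the sketch assumes a POSIX system where `os.fork` is available. The child sends back either its result or its traceback, so the parent can log why the query failed instead of an empty message.

```python
import logging
import os
import pickle
import traceback

log = logging.getLogger("frontend.child")  # hypothetical logger name


def fork_and_collect(fn):
    """Run fn() in a forked child; return its result, or None on failure."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report ("ok", result) or ("error", traceback text).
        os.close(read_fd)
        exit_code = 1
        try:
            try:
                payload = ("ok", fn())
            except Exception:
                payload = ("error", traceback.format_exc())
            with os.fdopen(write_fd, "wb") as pipe:
                pickle.dump(payload, pipe)
            exit_code = 0
        finally:
            # Never fall back into the parent's code path from the child.
            os._exit(exit_code)
    # Parent: read everything the child wrote, then reap it.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    if not data:
        log.warning("Subprocess exited without returning any data")
        return None
    status, value = pickle.loads(data)
    if status == "error":
        # The traceback is the part that was missing from the debug log.
        log.warning("Failed to retrieve jobs state from the subprocess:\n%s", value)
        return None
    return value
```

With something like this in place, a `KeyError: 'ProcId'` raised in the child would appear in the log right after the "Failed to retrieve jobs state" line.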
#7 Updated by Douglas Strain over 7 years ago
Due to problems with the commit attribution and message, I went ahead and re-committed this patch in a new branch: branch_v2plus_2692_try2. Please use this one for reviewing and merging.
New commit numbers are as follows:
commit:56f5abd2fd9401fd79af81b79d64f2223062f912 list2dict: getting rid of a redundant condition taken care of above
commit:fc308e86234487c7909c34f3cb97060c885fe2bb list2dict now can handle case insensitive requests during xml parsing
commit:255a19a23f5507bd3c83e3af45f5150457e3e8f2 Adding logging for exceptions when running in the subprocesses. From ticket #2692