Bug #2963

Frontend v2_6_1 crashing regularly

Added by Igor Sfiligoi over 7 years ago. Updated about 7 years ago.

Status: Closed
Priority: Urgent
Assignee: Douglas Strain
Category: Frontend
Target version:
Start date: 09/14/2012
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

The UCSD T3 team has installed the v2_6_1 FE and has noticed it crashing regularly, every few hours.

The error message was always along the lines of:
[2012-09-13T04:15:40-07:00 22051] WARNING: Exception occurred: ['Traceback (most recent call last):\n', ' File "/home/frontend/glideinWMS/frontend/glideinFrontend.py", line 281, in main\n frontendDescript,groups,restart_attempts,restart_interval)\n', ' File "/home/frontend/glideinWMS/frontend/glideinFrontend.py", line 199, in spawn\n aggregate_stats()\n', ' File "/home/frontend/glideinWMS/frontend/glideinFrontend.py", line 43, in aggregate_stats\n status=glideinFrontendMonitorAggregator.aggregateStatus()\n', ' File "/home/frontend/glideinWMS/frontend/glideinFrontendMonitorAggregator.py", line 266, in aggregateStatus\n if type_attribute in global_fact_totals[fos][fact][attribute].keys():\n', "KeyError: 'MatchedJobs'\n"]

I put a try-except around "glideinFrontendMonitorAggregator.py", line 266, and the FE has not crashed since (12h+).

So at this point I am pretty sure the above was the actual reason for the crashes.
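A workaround of this kind might look roughly like the sketch below. The variable names are taken from the traceback, but the helper function `has_counter` and the nested dictionary layout are assumptions made for illustration, not the actual glideinWMS code.

```python
def has_counter(global_fact_totals, fos, fact, attribute, type_attribute):
    """Return True if the counter exists, False if any level of the lookup is missing.

    Hypothetical helper: it wraps the expression from line 266 in a try-except
    so that a missing key such as 'MatchedJobs' no longer propagates.
    """
    try:
        return type_attribute in global_fact_totals[fos][fact][attribute].keys()
    except KeyError:
        # A missing intermediate key is treated as "counter not present".
        return False


# Assumed layout: the factory entry exists but has no 'MatchedJobs' attribute yet.
totals = {"factories": {"factory_a": {}}}
print(has_counter(totals, "factories", "factory_a", "MatchedJobs", "Idle"))
```

This only hides the symptom; it does not explain why the key is missing.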

History

#1 Updated by Parag Mhashilkar over 7 years ago

  • Assignee changed from Parag Mhashilkar to Douglas Strain

#2 Updated by Douglas Strain over 7 years ago

  • Status changed from New to Feedback
  • Assignee changed from Douglas Strain to Parag Mhashilkar

Moved the creation of the dictionary out of the if statement, so the keys get created in all cases.
Also added a condition to ensure that the dictionary key is not tested in case it still is missing.

commits in branch_v2plus_2963: commit:da64d92 and commit:10693c0

I wasn't able to actually reproduce the error (I think you need more extensive monitoring data than I have), so please review carefully.
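The two changes described above can be sketched as follows. This is a minimal illustration under assumed names and an assumed data layout (one dictionary per group, mapping factory to attribute to counters); it is not the code from the referenced commits.

```python
def aggregate_fixed(groups):
    """Sum per-factory counters across groups (hypothetical data layout)."""
    totals = {}
    for group in groups:
        for fact, attributes in group.items():
            if fact not in totals:
                totals[fact] = {}
            for attribute, counters in attributes.items():
                # Change 1: the attribute dictionary is created in all cases,
                # not only inside the branch that handles a newly seen factory.
                if attribute not in totals[fact]:
                    totals[fact][attribute] = {}
                for type_attribute, value in counters.items():
                    # Change 2: the key is only tested if its parent exists.
                    # Redundant here given change 1, but it mirrors the
                    # defensive condition described in the comment.
                    known = totals[fact].get(attribute)
                    if known is not None and type_attribute in known:
                        known[type_attribute] += value
                    else:
                        totals[fact][attribute][type_attribute] = value
    return totals


group1 = {"factory_a": {}}                            # no matched jobs
group2 = {"factory_a": {"MatchedJobs": {"Idle": 3}}}  # matched jobs present
print(aggregate_fixed([group1, group2]))
```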

#3 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from Parag Mhashilkar to Douglas Strain

#4 Updated by Douglas Strain over 7 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Douglas Strain to Parag Mhashilkar

Ok, I have addressed your concerns about the creation of empty dictionaries. Can you please review?

After you are done, assign it back to me. I am still investigating, since I am concerned about why the key was missing in the first place. I saw that the relevant code only runs if you have multiple groups, so that's why we all missed it in testing.

#5 Updated by Douglas Strain over 7 years ago

Ok, I have a better understanding of this issue. This happens when there are multiple groups and the first group has no matched jobs in this iteration, but the second group does. This probably happened "every few hours" during times when job submission happened on one group but not the other (or maybe jobs matched more slowly on the first group, etc).

This fix does indeed solve the issue in this case.
Please review now. Thanks.
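The failure mode described in this comment can be reproduced with a small stand-in. The function `aggregate_buggy` and the data layout are hypothetical and only mimic the pattern (totals seeded from the first group only); they are not the original aggregator code.

```python
def aggregate_buggy(groups):
    """Sum per-factory counters, seeding the totals from the first group only."""
    totals = {}
    for group in groups:
        for fact, attributes in group.items():
            if fact not in totals:
                # Bug pattern: attribute dictionaries are created only here,
                # from whatever the first group happens to report.
                totals[fact] = {attr: dict(c) for attr, c in attributes.items()}
                continue
            for attribute, counters in attributes.items():
                for type_attribute, value in counters.items():
                    # Raises KeyError if the first group lacked this attribute.
                    if type_attribute in totals[fact][attribute].keys():
                        totals[fact][attribute][type_attribute] += value
                    else:
                        totals[fact][attribute][type_attribute] = value
    return totals


group1 = {"factory_a": {}}                            # first group: no matched jobs
group2 = {"factory_a": {"MatchedJobs": {"Idle": 3}}}  # second group: matched jobs

try:
    aggregate_buggy([group1, group2])
except KeyError as exc:
    print("KeyError:", exc)  # prints: KeyError: 'MatchedJobs'
```

With a single group, or with the two groups in the opposite order, the same function completes without error, which is consistent with the crash appearing only intermittently.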

#6 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Feedback to Assigned
  • Assignee changed from Parag Mhashilkar to Douglas Strain

Fix looks ok to me. Let's change the key lookup to a faster alternative before merging it to branch_v2plus and master.
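The comment does not say which alternative was meant. One plausible reading, stated here as an assumption, is testing membership on the dictionary itself instead of on `.keys()`. In Python 2, `.keys()` builds a list, so the membership test is a linear scan, while `in` on the dictionary is a hash lookup. The counter names below are illustrative.

```python
counters = {"Idle": 3, "Running": 5}  # illustrative data

# Lookup as written on line 266: membership test on the result of keys().
via_keys = "Idle" in counters.keys()

# Alternative: membership test directly on the dictionary (hash lookup).
direct = "Idle" in counters

# Both forms give the same answer; only the cost differs.
assert via_keys == direct
```

In Python 3, `keys()` returns a view rather than a list, so the difference is much smaller there.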

#7 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Assigned to Resolved

Merged the changes to branch_v2plus & master.
On second thought, I kept the key lookup the same for now; this will be addressed throughout the code in the next release.

#8 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Resolved to Closed

