Bug #2952

unexpected behavior and unclear logging when the match expression fails in the frontend

Added by Krista Larson over 7 years ago. Updated almost 7 years ago.

Status:
Closed
Priority:
Normal
Assignee:
Parag Mhashilkar
Category:
-
Target version:
Start date:
10/03/2012
Due date:
% Done:

100%

Estimated time:
(Total: 0.00 h)
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

When I have an invalid match expression, the entire frontend iteration terminates. This is all that appears in the logs, and it does not describe the actual problem:

[2012-09-11T14:25:01-05:00 9819] Jobs found total 28897 idle 11779 (old 0, voms 11163) running 17118
[2012-09-11T14:25:02-05:00 9819] Glideins found total 17127 idle 2 running 17124 limit 100000 curb 90000
[2012-09-11T14:25:02-05:00 9819] Using 1 proxies
[2012-09-11T14:25:02-05:00 9819] Match
[2012-09-11T14:25:02-05:00 9819] Counting subprocess created
[2012-09-11T14:25:02-05:00 9819] WARNING: Failed to retrieve Real state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] WARNING: Failed to retrieve Running state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] WARNING: Failed to retrieve Idle state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] WARNING: Failed to retrieve OldIdle state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] Terminating iteration due to errors

Also, the frontend was using 2 factories and the expression worked fine on one of them. Why didn't the frontend continue to submit to the factory where the match expression worked?


Subtasks

Bug #2980: unexpected behavior and unclear logging when the match expression fails in the frontend -- master (Status: Closed, Assignee: Parag Mhashilkar)

History

#1 Updated by Parag Mhashilkar over 7 years ago

More details: This needs to be addressed at several levels. Consider the following example, which we document:

match_expr='glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
  • We should update the docs and installation instructions to give a more robust match_expr example. Admins build on top of the documented examples by adding complex logical expressions, not realizing that the key foo in glidein["attrs"]["foo"] may not exist. We should change it to something like:
match_expr='glidein["attrs"].has_key("GLIDEIN_Site") and (glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(","))'

Agreed, we have match_attrs, but there are cases where the expr logic is faulty. This is a powerful tool, but we should not expect admins to know all the details.

  • match_expr is Python code that gets evaluated, and we don't log its errors in a meaningful and helpful way. This needs to be fixed.
  • When querying a collector or processing the received info fails, do we want to continue querying another collector? If we get authentication errors querying a collector, we move on to the next one. However, if there is an issue processing the received information, we just stop everything and return. Unless someone can come up with a valid reason, I don't see why we should behave differently here. We should move on to the next collector in this case as well.

#2 Updated by Burt Holzman over 7 years ago

I prefer using dict's get method to has_key.

We should catch any exceptions from the evaluation of the match expression and log them clearly. I think we should have a discussion on what to do if the match expr throws an exception for some subprocesses but not all -- should it be fatal or not?

-B
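To make the "some fail, some do not" case concrete, here is a minimal sketch of evaluating the expression per entry, collecting errors instead of aborting, and leaving the fatal-or-not decision to the caller. It is written in Python 3 and is not the frontend's actual code: the function name count_matches, the sample entry names, and the data layout are assumptions made for illustration only.

```python
def count_matches(match_code, glideins, job):
    """Evaluate a compiled match expression against every glidein entry.

    Returns (matched, errors): the number of entries that matched and a
    list of human-readable messages for entries whose evaluation raised.
    """
    matched = 0
    errors = []
    for name, glidein in glideins.items():
        try:
            if eval(match_code, {"glidein": glidein, "job": job}):
                matched += 1
        except Exception as e:
            # Record which entry failed and why, then keep going.
            errors.append("%s: %s: %s" % (name, type(e).__name__, e))
    return matched, errors


# Hypothetical sample data: one entry has GLIDEIN_Site, the other does not.
expr = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
code = compile(expr, "<match_expr>", "eval")
glideins = {
    "entry_a": {"attrs": {"GLIDEIN_Site": "SiteA"}},
    "entry_b": {"attrs": {}},
}
matched, errors = count_matches(code, glideins, {"DESIRED_Sites": "SiteA,SiteB"})
```

With this shape, the caller can log every message in errors and then choose a policy, for example treating the iteration as failed only when every entry raised.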

#3 Updated by Parag Mhashilkar over 7 years ago

Yes get() is a better option. I was just giving an example.
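For reference, a get()-based variant of the expression could look like the sketch below. It uses Python 3, where has_key no longer exists. The sample dictionaries and the helper name matches are hypothetical and only mirror the layout implied by the match_expr above.

```python
# A get()-based match expression: missing keys yield None or "" instead of
# raising KeyError.
match_expr = (
    'glidein["attrs"].get("GLIDEIN_Site") '
    'in job.get("DESIRED_Sites", "").split(",")'
)


def matches(glidein, job):
    """Evaluate match_expr with the two names it expects in scope."""
    return bool(eval(match_expr, {"glidein": glidein, "job": job}))


# Hypothetical sample data.
glidein_ok = {"attrs": {"GLIDEIN_Site": "SiteA"}}
glidein_missing = {"attrs": {}}
job = {"DESIRED_Sites": "SiteA,SiteB"}
```

Here a glidein without GLIDEIN_Site, or a job without DESIRED_Sites, simply does not match rather than terminating the evaluation.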

#4 Updated by Burt Holzman over 7 years ago

  • Assignee set to Parag Mhashilkar

#5 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from New to Feedback
  • Assignee changed from Parag Mhashilkar to Douglas Strain

After looking into the details, it looks like we may not have an option to continue processing when errors occur during the calculations, unless someone has a better alternative. Apart from this, I have taken care of the rest of the issues mentioned in the ticket.

branch_v2plus: commit:36786c7

#6 Updated by Parag Mhashilkar over 7 years ago

  • Target version changed from v2_7_x to v2_6_2

#7 Updated by Douglas Strain over 7 years ago

Looks good.

My only comment is in frontend/glideinFrontendElement.py "except KeyError, e:". I like how you split this out. However, is KeyError the only thing we can catch? Are there any other common errors? What happens if you just put garbage in the match expression? Can we catch the compilation error (not sure what exception occurs in that case)? [Note: This is more of a "bonus points" comment]

#8 Updated by Douglas Strain over 7 years ago

  • Assignee changed from Douglas Strain to Parag Mhashilkar

#9 Updated by Parag Mhashilkar over 7 years ago

We now catch KeyError and other exceptions as well. I just pulled out KeyError for better logging. If the expression is really bad, reconfig won't compile it and will fail.
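The two layers described here can be sketched as follows: a compile step that rejects garbage up front (in Python that surfaces as SyntaxError), and a runtime step that singles out KeyError for a more specific message. This is an illustrative Python 3 sketch, not the code in commit 36786c7; the function names check_match_expr and describe_failure are hypothetical.

```python
def check_match_expr(expr):
    """Return None if expr compiles as an expression, else an error message."""
    try:
        compile(expr, "<match_expr>", "eval")
    except SyntaxError as e:
        return "invalid match_expr: %s" % e.msg
    return None


def describe_failure(expr, glidein, job):
    """Return None if expr evaluates cleanly, else a descriptive message."""
    try:
        eval(expr, {"glidein": glidein, "job": job})
    except KeyError as e:
        # Pulled out separately: the most common admin mistake is a
        # reference to an attribute that the glidein or job does not have.
        return "match_expr references missing key %s" % e
    except Exception as e:
        return "match_expr failed: %s: %s" % (type(e).__name__, e)
    return None


# Hypothetical sample expression.
expr = 'glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
```

Note that the original code uses the Python 2 form "except KeyError, e:", which Python 3 spells "except KeyError as e:".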

#10 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Feedback to Resolved

Merged to branch_v2plus and created a new ticket for master.

#11 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Resolved to Closed

