unexpected behavior and unclear logging when the match expression fails in the frontend
When I have an invalid match expression, the entire frontend iteration terminates. This is all that appears in the logs, and it does not describe the actual problem:
[2012-09-11T14:25:01-05:00 9819] Jobs found total 28897 idle 11779 (old 0, voms 11163) running 17118
[2012-09-11T14:25:02-05:00 9819] Glideins found total 17127 idle 2 running 17124 limit 100000 curb 90000
[2012-09-11T14:25:02-05:00 9819] Using 1 proxies
[2012-09-11T14:25:02-05:00 9819] Match
[2012-09-11T14:25:02-05:00 9819] Counting subprocess created
[2012-09-11T14:25:02-05:00 9819] WARNING: Failed to retrieve Real state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] WARNING: Failed to retrieve Running state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] WARNING: Failed to retrieve Idle state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] WARNING: Failed to retrieve OldIdle state information from the subprocess.
[2012-09-11T14:25:49-05:00 9819] Terminating iteration due to errors
Also, the frontend was using 2 factories and the expression worked fine on one of them. Why didn't the frontend continue to submit to the factory where the match expression worked?
#1 Updated by Parag Mhashilkar over 8 years ago
More details: This needs to be addressed at several levels. Consider the following example, which we give in the documentation:
match_expr='glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(",")'
- We should update the docs and installation instructions to give a more robust match_expr example. Admins build on top of the documented examples by adding complex logical expressions, not realizing that a key such as glidein["attrs"]["foo"] may not exist. We should change it to something like
match_expr='glidein["attrs"].has_key("GLIDEIN_Site") and (glidein["attrs"]["GLIDEIN_Site"] in job["DESIRED_Sites"].split(","))'
Granted, we have match_attrs, but there are cases where the expression logic is faulty. This is a powerful tool, but we should not expect admins to know all the details.
- match_expr is Python code that gets evaluated, and we don't log its errors in a meaningful and helpful way. This needs to be fixed.
- When querying a collector or processing the received info fails, do we want to continue querying another collector? If we get authentication errors while querying a collector, we move on to the next one. However, if there is an issue processing the received information, we just stop everything and return. Unless someone can come up with a valid reason, I don't see why we should behave differently here. We should move on to the next collector in this case as well.
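The "move on to the next collector" behavior proposed in the last point could look roughly like the sketch below. It is written in modern Python and is not the frontend's actual code: query_all, query_fn and the collector names are hypothetical, and the real code may need to distinguish more failure modes.

```python
import logging

logger = logging.getLogger("frontend_sketch")


def query_all(collectors, query_fn):
    """Query every collector; log and skip the ones that fail.

    query_fn is a hypothetical callable that queries one collector and
    processes its reply, raising an exception on any failure.
    """
    results = {}
    failed = []
    for name in collectors:
        try:
            results[name] = query_fn(name)
        except Exception as err:
            # Same handling for query and processing failures: record the
            # problem with its exception type and continue with the rest.
            logger.warning("Collector %s failed (%s: %s), moving on",
                           name, type(err).__name__, err)
            failed.append(name)
    return results, failed
```

The caller can then decide whether a non-empty failed list should abort the iteration or whether submitting to the remaining factories is acceptable.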
#2 Updated by Burt Holzman over 8 years ago
I prefer using dict's get method to has_key.
We should catch any exceptions from the evaluation of the match expression and log them clearly. I think we should have a discussion on what to do if the match expression throws an exception for some subprocesses but not all -- should it be fatal or not?
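A minimal sketch of both suggestions, in modern Python (where has_key no longer exists). The helper name safe_match and its return convention are assumptions made for illustration, not the frontend's actual interface:

```python
import logging

logger = logging.getLogger("match_sketch")

# get() returns None (or the given default) instead of raising KeyError
# when an attribute is missing from the glidein or the job.
MATCH_EXPR = ('glidein["attrs"].get("GLIDEIN_Site") '
              'in job.get("DESIRED_Sites", "").split(",")')


def safe_match(match_code, job, glidein):
    """Evaluate a compiled match expression.

    Returns (matched, error_message); error_message is None on success.
    """
    try:
        matched = eval(match_code, {}, {"job": job, "glidein": glidein})
        return bool(matched), None
    except Exception as err:
        # Name the exception type so the log points at the real problem.
        msg = "match_expr failed: %s: %s" % (type(err).__name__, err)
        logger.error(msg)
        return False, msg


match_code = compile(MATCH_EXPR, "<match_expr>", "eval")
```

Whether a failed evaluation should count as "no match" (as here) or terminate the iteration is exactly the open question above; the sketch only makes the error visible.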
#5 Updated by Parag Mhashilkar over 8 years ago
- Status changed from New to Feedback
- Assignee changed from Parag Mhashilkar to Douglas Strain
After looking further into the details, it looks like we may not have an option to continue processing when errors occur during the calculations, unless someone has a better alternative. Apart from this, I have taken care of the rest of the issues mentioned in the ticket.
#7 Updated by Douglas Strain over 8 years ago
My only comment is on "except KeyError, e:" in frontend/glideinFrontendElement.py. I like how you split this out. However, is KeyError the only thing we can catch? Are there any other common errors? What happens if you just put garbage in the match expression? Can we catch the compilation error (I'm not sure which exception is raised in that case)? [Note: This is more of a "bonus points" comment]
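On the compilation question: assuming the expression is compiled with Python's built-in compile(), a malformed expression raises SyntaxError at that point. Undefined names and wrong types only surface at evaluation time, as NameError and TypeError. A small sketch in modern Python, with check_match_expr as a hypothetical helper name:

```python
def check_match_expr(expr):
    """Return None if expr compiles as a Python expression, else a message."""
    try:
        compile(expr, "<match_expr>", "eval")
    except SyntaxError as err:
        # err.msg holds the parser's description of the problem.
        return "invalid match_expr: %s" % err.msg
    return None
```

Running such a check once when the configuration is loaded would report a garbage expression before any iteration starts.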