Bug #11487

AWS Probe crashing while many VMs running

Added by Kevin Retzke about 3 years ago. Updated about 3 years ago.

Status: Assigned
Priority: High
Assignee:
Start date: 01/25/2016
Due date:
% Done: 0%
Estimated time:
Duration:

Description

11:11:43 CST Gratia: processing instance i-6dfa2caa
11:11:43 CST Gratia: Creating a Record 2016-01-25T17:11:43Z
11:11:43 CST Gratia: Creating a UsageRecord 2016-01-25T17:11:43Z
11:11:43 CST Gratia: the tags are
11:11:43 CST Gratia: Name: glidein_startup.sh
11:11:43 CST Gratia: getting instance data
11:11:43 CST Gratia: getting spot price
11:11:44 CST Gratia: ERROR: Error getting data for instance i-6dfa2caa from ec2
11:11:44 CST Gratia: 'NoneType' object has no attribute 'get'
11:11:44 CST Error in Gratia probe: 'NoneType' object has no attribute 'get'

Appears to be an uncaught exception when the AWS spot price query fails.

History

#1 Updated by Kevin Retzke about 3 years ago

As a quick fix, and to help with debugging, I modified the probe on fermicloud370 to catch exceptions within the instance loop and print a stack trace. Now a failing instance's record is simply not sent, rather than the failure terminating the entire run.

probes|5c2b0162

[root@fermicloud370 ~]# diff -uw aws-gratia-probe /usr/share/gratia/awsvm/
--- aws-gratia-probe    2016-01-06 13:14:43.000000000 -0600
+++ /usr/share/gratia/awsvm/aws-gratia-probe    2016-01-25 18:37:32.000000000 -0600
@@ -2,7 +2,7 @@
 import gratia.common.Gratia as Gratia
 import gratia.common.GratiaCore as GratiaCore
 import gratia.common.GratiaWrapper as GratiaWrapper
-from gratia.common.Gratia import DebugPrint, Error
+from gratia.common.Gratia import DebugPrint, Error, DebugPrintTraceback
 import boto3;
 from boto3.session import Session
 from pprint import pprint;
@@ -108,6 +108,7 @@
                 owneracct=reservation['OwnerId']
                 instances=reservation['Instances']
                 for instance in instances:
+                    try:
                     DebugPrint(4,"processing instance %s"% instance['InstanceId'])
                     r = Gratia.UsageRecord()
                     # set the defaults
@@ -244,9 +245,11 @@
                     r.ResourceType("AWSVM")
                     r.CpuDuration(0,'system')
                     r.AdditionalInfo("Version","1.0")
-        
+                    except Exception as e:
+                        DebugPrint(1,"ERROR: uncaught exception while processing instance, not sending record")
+                        DebugPrintTraceback()
+                    else:
                     DebugPrint(4,"sending record")
-
                     Gratia.Send(r)
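
Because the diff above was taken with -w, the re-indentation of the loop body is hidden. The resulting control flow can be sketched roughly as follows. This is a minimal standalone illustration, not the probe's actual code: process_all, build_record and send are hypothetical names standing in for the instance loop, the record-building section and Gratia.Send.

```python
import traceback


def process_all(instances, build_record, send):
    """Build and send one record per instance, skipping instances that fail.

    build_record and send are hypothetical stand-ins for the probe's
    record-building code and Gratia.Send.
    """
    sent, failed = 0, 0
    for instance in instances:
        try:
            record = build_record(instance)
        except Exception:
            # Log the stack trace and move on, so that one bad instance
            # does not abort the whole run.
            traceback.print_exc()
            failed += 1
        else:
            # Only reached when build_record raised nothing.
            send(record)
            sent += 1
    return sent, failed
```

Keeping the send in the else branch means that an exception raised while sending is not swallowed by the handler meant for record-building errors.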

#2 Updated by Kevin Retzke about 3 years ago

  • Status changed from New to Assigned
  • Assignee set to Kevin Retzke

#3 Updated by Kevin Retzke about 3 years ago

There appears to be a bug in boto3 causing this exception:

09:13:37 CST Gratia: processing instance i-d8eb381f
09:13:37 CST Gratia: Creating a Record 2016-01-26T15:13:37Z
09:13:37 CST Gratia: Creating a UsageRecord 2016-01-26T15:13:37Z
09:13:37 CST Gratia: no tags
09:13:37 CST Gratia: getting instance data
09:13:37 CST Gratia: getting spot price
09:13:38 CST Gratia: ERROR: Error getting data for instance i-d8eb381f from ec2
09:13:38 CST Gratia: 'NoneType' object has no attribute 'get'
09:13:38 CST Gratia: ERROR: uncaught exception while processing instance, not sending record
09:13:38 CST Gratia: In traceback print (0)
09:13:38 CST Gratia: In traceback print (1)
09:13:38 CST Gratia: Traceback (most recent call last):
  File "/usr/share/gratia/awsvm/aws-gratia-probe", line 209, in process_session
    market_price=ec2_util.spot_price_at_termination(instance['InstanceId'])
  File "/usr/lib/python2.6/site-packages/gratia/awsvm/ec2_util.py", line 130, in spot_price_at_termination
    match = re.match(r"Service initiated \((.*)\)",instance.state_transition_reason)
  File "/usr/lib/python2.6/site-packages/boto3/resources/factory.py", line 214, in property_loader
    return self.meta.data.get(name)
AttributeError: 'NoneType' object has no attribute 'get'

The quick fix is working: each run loses fewer than 10 records (out of ~1500) to this error.
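
A narrower guard could also be placed around the attribute access that fails in the traceback, so that only the spot price lookup is skipped and the rest of the record is still built. The following is a sketch under assumptions, not a tested change to ec2_util.py: termination_reason is a hypothetical helper, and it treats the AttributeError as "no data available" without addressing why boto3 ends up with no data for the instance.

```python
import re


def termination_reason(instance):
    """Return the text inside 'Service initiated (...)', or None.

    instance is any object exposing state_transition_reason, such as a
    boto3 EC2 Instance resource. Hypothetical helper for illustration.
    """
    try:
        reason = instance.state_transition_reason
    except AttributeError:
        # Mirrors the failure in the traceback: the lazy attribute
        # load raised because the resource had no data.
        return None
    if not reason:
        return None
    match = re.match(r"Service initiated \((.*)\)", reason)
    return match.group(1) if match else None
```

A caller would then need to handle a None result, for example by leaving the spot price unset in the record.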


