Bug #21570

Factory crashing with malformed HTCondor log: AttributeError: dirSummaryTimingsOut instance has no attribute 'data'

Added by Marco Mascheroni 8 months ago. Updated 7 months ago.

Status: Closed
Priority: High
Category: -
Target version:
Start date: 12/18/2018
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders: Factory Ops
Duration:

Description

This is the bug that caused #21569. It is still happening right now.

[2018-12-16 14:50:25,175] ERROR: glideFactoryLogParser:289: dirSummarySimple failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLogParser.py", line 287, in get_simple
    obj = dirSummarySimple(self)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLogParser.py", line 238, in __init__
    self.data=copy.deepcopy(obj.data)
AttributeError: dirSummaryTimingsOut instance has no attribute 'data'
[2018-12-16 14:50:25,176] WARNING: fork:56: Forked process '<function forked_check_and_perform_work at 0x18f6410>' failed
[2018-12-16 14:50:25,176] ERROR: fork:57: Forked process '<function forked_check_and_perform_work at 0x18f6410>' failed
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/fork.py", line 53, in fork_in_bg
    out = function_torun(*args)
  File "/usr/sbin/glideFactoryEntryGroup.py", line 216, in forked_check_and_perform_work
    factory_in_downtime, entry, work[entry.name])
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1089, in check_and_perform_work
    params, in_downtime, condorQ)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1493, in unit_work_v3
    frontend_name, client_web, params)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1586, in perform_work_v3
    entry.gflFactoryConfig.log_stats.logSummary(client_log_name, log_stats)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryMonitoring.py", line 681, in logSummary
    self.current_stats_data[client_name][username] = stats[username].get_simple()
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLogParser.py", line 287, in get_simple
    obj = dirSummarySimple(self)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLogParser.py", line 238, in __init__
    self.data=copy.deepcopy(obj.data)
AttributeError: dirSummaryTimingsOut instance has no attribute 'data'

Before that exception I also saw exceptions like this one (only the first time):

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/util.py", line 266, in file_pickle_load
    (fname, expiration, fname_time))
ExpiredFileException: File /var/lib/gwms-factory/work-dir/aggregated_stats_dict.data expired, older then 3600 seconds (file time: 1544949344.35)

It's important to note that we also got a warning about the disk filling up. That might be the real cause of all this: some monitoring files are probably in an inconsistent state right now. I'll investigate further.

Setting 3.4.3 as the target version, since this seems important and it's my top priority right now.

History

#1 Updated by Marco Mascheroni 8 months ago

It seems an HTCondor log record was corrupted (disk full?). See:

[2018-12-18 09:45:40,260] ERROR: glideFactoryEntry:1577: invalid literal for int() with base 10: '6605027029 (6603802'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1574, in perform_work_v3
    log_stats[credential_username + ":" + client_int_name].load()
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorLogParser.py", line 648, in load
    obj.load()
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorLogParser.py", line 87, in load
    self.loadFromLog()
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLogParser.py", line 119, in loadFromLog
    job_id=rawJobId2Nr(el[0])
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/condorLogParser.py", line 1039, in rawJobId2Nr
    return (int(arr[0]), int(arr[1]))
ValueError: invalid literal for int() with base 10: '6605027029 (6603802'

And this is the corrupted record in the HTCondor log (see the job ID of the second 012 event):

...
012 (6602974.000.000) 12/16 00:37:43 Job was held.
        CE job in status 1 put on hold by SYSTEM_PERIODIC_HOLD due to non-existent route in JOB_ROUTER_ENTRIES or route job limit.
        Code 26 Subcode 0
...
012 (6605027029 (6603802.001.000) 12/16 14:11:36 The job's remote status is unknown
...
029 (6604336.007.000) 12/16 14:11:36 The job's remote status is unknown
...
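The failure can be reproduced in isolation. The sketch below is only an illustration: `raw_job_id_to_nr` is a hypothetical stand-in with the same body as the `rawJobId2Nr` quoted in this comment, and the input string is an assumption about what the parser extracts from the corrupted record (the traceback only shows the part before the first dot). It suggests that a length check alone does not catch this case, because the malformed ID still splits into enough fields.

```python
def raw_job_id_to_nr(raw_id):
    """Hypothetical stand-in mirroring rawJobId2Nr: only the field count is checked."""
    arr = raw_id.split(".")
    if len(arr) >= 2:
        return (int(arr[0]), int(arr[1]))
    return (-1, -1)  # invalid


# Assumed raw job ID field taken from the corrupted log record.
corrupted = "6605027029 (6603802.001.000"

try:
    raw_job_id_to_nr(corrupted)
    outcome = "no error"
except ValueError as err:
    # The string has two dots, so the length check passes and int() fails.
    outcome = "ValueError: %s" % err

print(outcome)
```

A well-formed ID such as "6602974.000.000" still converts normally, and an ID with no dot at all falls through to (-1, -1).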

I propose to add a protection here:

def rawJobId2Nr(str):
    """ 
    Convert the log representation into (ClusterId,ProcessId)

    Return (-1,-1) in case of error
    """ 
    arr=str.split(".")
    if len(arr)>=2:
        return (int(arr[0]), int(arr[1]))
    else:
        return (-1, -1) #invalid

Thoughts?

#2 Updated by Marco Mascheroni 8 months ago

Testing:

def rawJobId2Nr(str):
    """ 
    Convert the log representation into (ClusterId,ProcessId)

    Return (-1,-1) in case of error
    """ 
    arr=str.split(".")
    try:
        return (int(arr[0]), int(arr[1]))
    except (KeyError, ValueError):
        return (-1, -1) #invalid

#3 Updated by Marco Mambelli 8 months ago

I don't see where a KeyError would come from. I see an IndexError if there is no dot or empty string, and a ValueError if a field is not an int. split could raise an AttributeError if str is not a string, but I'm not expecting that:

    except (IndexError, ValueError):

I would have chosen 0.0 as the invalid job ID (it is used for file names, ...), but (-1, -1) was already there, so OK. Just change the exceptions.
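
For reference, a minimal sketch of the function with the exception list suggested in this comment. The name `raw_job_id_to_nr` and the parameter name `raw_id` are illustrative (chosen to avoid shadowing the built-in `str`), and the corrupted input is an assumption based on the traceback; this is not necessarily the code that was merged.

```python
def raw_job_id_to_nr(raw_id):
    """
    Convert the log representation into (ClusterId, ProcessId).

    Return (-1, -1) in case of error.
    """
    arr = raw_id.split(".")
    try:
        return (int(arr[0]), int(arr[1]))
    except (IndexError, ValueError):
        # IndexError: fewer than two dot-separated fields.
        # ValueError: a field that is not a valid integer.
        return (-1, -1)  # invalid


# A well-formed ID and the (assumed) corrupted one from the log.
good = raw_job_id_to_nr("6602974.000.000")
bad = raw_job_id_to_nr("6605027029 (6603802.001.000")
print(good, bad)
```

With this version the corrupted record maps to the invalid ID instead of raising, so callers have to decide what to do with (-1, -1) entries.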

#4 Updated by Marco Mascheroni 8 months ago

  • Status changed from New to Feedback
  • Assignee changed from Marco Mascheroni to Marco Mambelli

#5 Updated by Marco Mascheroni 8 months ago

  • Stakeholders updated (diff)

#6 Updated by Marco Mambelli 8 months ago

  • Assignee changed from Marco Mambelli to Marco Mascheroni

ok to merge

#7 Updated by Marco Mascheroni 8 months ago

  • Status changed from Feedback to Resolved

#8 Updated by Marco Mambelli 7 months ago

  • Subject changed from AttributeError: dirSummaryTimingsOut instance has no attribute 'data' to Factory crashing with malformed HTCondor log: AttributeError: dirSummaryTimingsOut instance has no attribute 'data'

#9 Updated by Marco Mambelli 7 months ago

  • Status changed from Resolved to Closed

