Project

General

Profile

Bug #23172

occasional errors in test_glideFactoryDowntimeLib.py

Added by Leonardo Lai about 2 months ago. Updated 12 days ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
Start date:
08/26/2019
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

On August 23rd I noticed that my branch v34/leo triggered an error in a unit test, according to the daily CI report.
The error was located in test_glideFactoryDowntimeLib.py, function test_downtimesfile.
Interestingly, it was only showing in SL7 (SL6 instead passed the test); also, the test was successful both the day before (August 22nd) and the day after (August 24th), without making any change to the portion of code being tested.
Here is an excerpt of the unit tests log from August 23rd (the full log is attached):

2422: TESTING ==========> ./test_glideFactoryDowntimeLib.py
2423: /scratch/workspace/glideinwms_ci/label_exp/RHEL7/label_exp2/swarm/venv-2.7/lib/python2.7/site-packages/htcondor/__init__.py:25: UserWarning: 
2424: Using a null condor_config.
2425: Neither the environment variable CONDOR_CONFIG, /etc/condor/,
2426: /usr/local/etc/, nor ~/condor/ contain a condor_config source.
2427:   warnings.warn(message)
2428: 
2429: Running tests...
2430: ----------------------------------------------------------------------
2431: F....
2432: ======================================================================
2433: ERROR [0.014s]: test_downtimesfile (__main__.TestDowntimes)
2434: ----------------------------------------------------------------------
2435: Traceback (most recent call last):
2436:   File "./test_glideFactoryDowntimeLib.py", line 62, in test_downtimesfile
2437:     check_time=now + 1))
2438: AssertionError: True is not false
2439: 
2440: ----------------------------------------------------------------------
2441: Ran 5 tests in 0.024s
2442: 
2443: FAILED (errors=1)

After some investigations, I've found a potential flaw in the test itself that could be the cause of my issue.
The test code (unittests/test_glideFactoryDowntimeLib.py) is the following:

    def test_downtimesfile(self):
        now = long(time.time())
        self.downtime.startDowntime(
            entry="All",
            comment="unittest downtime",
            create_if_empty=True)
        self.assertTrue(
            self.downtime.checkDowntime(
                entry="All",
                check_time=None))
        self.assertTrue(
            self.downtime.checkDowntime(
                entry="James",
                check_time=None))
        # Test downtime comments
        self.assertEquals(self.downtime.downtime_comment, "unittest downtime")
        self.downtime.endDowntime(entry="All", comment="end unittest downtime")

        # Use now+1 since we just ended the downtime
        # The second counter may not have updated
        self.assertFalse(
            self.downtime.checkDowntime(
                entry="All",
                check_time=now + 1))

The time is partitioned in slots of 1 second each.
Basically, the last assertion is checked at time "now+1", to make sure that endDownTime happened in the previous time interval. However, this is not always true: the variable "now" is only set at the beginning of the function and never updated, so endDownTime could potentially be invoked in the next second (or even later), causing "now+1" to point the same second of endDownTime. In this case, the check would fail because the downtime is considered still in progress, according to glideFactoryDowntimeLib.py:

def checkDowntime(fname,entry="Any",frontend="Any",security_class="Any",check_time=None):
   ...
   if check_time<=time_tuple[1]:
       return (comment, True) # within limit

The error is rarely triggered because the code section between the "now" assignment and the assertion executes quite fast, so most of the times the two events fall in the same time slot: probably this is why it went unnoticed for a long time.
The same issue is also present elsewhere in the same test script.
Fixing it should be trivial, by checking the condition at tick "long(time.time())+1" rather than "now+1".

unit_test#jQVEEqH.log (2.04 MB) unit_test#jQVEEqH.log Leonardo Lai, 08/26/2019 02:04 PM

History

#1 Updated by Leonardo Lai about 2 months ago

A simple fix has been implemented in v34/23172, which consists in updating the "now" variable where necessary.

#2 Updated by Leonardo Lai about 2 months ago

  • Target version set to v3_4_7

#3 Updated by Leonardo Lai about 2 months ago

The CI tests succeed after the fix; it would be good if someone could verify the reasonableness of the changes before merging.

#4 Updated by Leonardo Lai about 2 months ago

  • Status changed from Assigned to Feedback

#5 Updated by Marco Mambelli 12 days ago

  • Target version changed from v3_4_7 to v3_6_1

#6 Updated by Marco Mambelli 12 days ago

  • Status changed from Feedback to Resolved

merged to branch_v3_4



Also available in: Atom PDF