Project

General

Profile

Bug #25291

Duplicate glideclient classads - problem with hash() in Python 3

Added by Marco Mambelli 3 months ago. Updated 2 months ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
Start date:
12/07/2020
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

A behavior seen in the Python 3 version of GlideinWMS is that there are duplicate glideclient classads, with the same credentials and content. The system works but new classads are created instead of updating the old ones.
The name of the classads is a string derived by the Frontend and Group name, preceded by an ID.

After investigating, Bruno and I found out that the ID, that changes and causes the name of the classad to change (therefore to have new classads) is calculated using the Python hash() function.
In Python 3 this function changed using a different algorithm and using a random seed that causes different results for different python processes. The classads are published by FrontendElemets that run in different processes, thus the different ID causing different classads.
https://stackoverflow.com/questions/40137072/why-is-hash-slower-under-python3-4-vs-python2-7
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED

The solution is to use a hash function that remains consistent across processes.
  • PYTHONHASHSEED can be set to 0, to disable the random seed
  • A different hash function can be used, one that is consistent across processes (and Python invocations)

The latter is preferable because the random seed alteration varies the interpreter behavior and was added to improve security.

Possible hash functions could be

Although FNV is faster, given the limited use it is preferable to use something available in the stdlib (limited new code, no new dependency), e.g.:

hashlib.md5(content).hexdigest()[:8]
or
hashlib.md5(content).digest().encode('base64')[:8]

Both return an 8 char string (6 should be sufficient), with limited collisions, the second one even less than the first (given the bigger cardinality of alpha chars vs hex chars).

So the tasks for this ticket are:
  1. decide on a function with consistent results to replace the current use of Python hash()
  2. scan the code to see if there are other uses besides the one in glideinFrontendInterface.py, check if those need to be consistent across processes as well, and replace the hash function
factory/glideFactoryLib.py:    hash_val=str(abs(hash(dn+voms))%1000000)
frontend/glideinFrontendInterface.py:        #logSupport.log.debug("Using hash_str=%s (%d)"%(hash_str,abs(hash(hash_str))%1000000))
frontend/glideinFrontendInterface.py:        return str(abs(hash(hash_str))%1000000)
lib/util.py:    >>> pp(dict( (v,k) for k,v in flattenDict(testData, lift=hash, join=lambda a,b:hash((a,b))) ))
unittests/test_lib_exprParser.py:             'a>>3', 'a/b', 'a/3', 'lambda a,b:hash((a,b))',

History

#1 Updated by Bruno Coimbra 3 months ago

Note from Python 3 documentation:

By default, the hash() values of str and bytes objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
Changing hash values affects the iteration order of sets. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).
See also PYTHONHASHSEED.

#2 Updated by Bruno Coimbra 3 months ago

  • Assignee changed from Bruno Coimbra to Marco Mambelli
  • Status changed from New to Feedback

I implemented the new hash_nc() function to replace the native Python 3 hash(). This function is located in the glideinwms.lib.util module and uses the MD5 implementation of the native Python library hashlib, which ensures consistency across different processes. Additionally, the new function allows chopping the hex digest to create hashes of specific sizes. Note that the hash_nc() function requires a byte string as input and returns a decoded string.

Function implementation:

def hash_nc(data, len=None):
    """Non-cryptographic MD5 hashing function

    Args:
        data (bytes): Data to hash
        len (int, optional): Hash length. Defaults to None.

    Returns:
        str: Hash
    """ 

    out = hashlib.md5(data).hexdigest()
    if len:
        out = out[:len]

    return out

#3 Updated by Bruno Coimbra 3 months ago

The changes are in the v39/25291 branch.

#4 Updated by Bruno Coimbra 3 months ago

1. I'd remove the line from factory/glideFactoryLib.py if it is not used. double check first.

I double checked and the variable was not used indeed. I removed the line from the code.

2. Then the hash function should accept AnyStr, both str or bytes as input and encode if needed inside the function.
Use force_bytes() from defaults.py inside hash_nc()
See sym/pubCrypto.py
Also this way the dependency from defaults.py goes away in the modules using hash_nc()

I modified hash_nc() to use force_bytes() and accept bytes. I removed the unnecessary dependencies of defaults.py.

3. Add a TODO note on an update that should be done for python 3.9 to set usedforsecurity to False to the md5 constructor.
Some implementations may start complaining about md5 if they think it's used for cryptographic purposes.

Done.

4. hexdigit is already a string that can contain letters and digits, why not base64 (a filenamesafe version with no _@, e.g. base64.b64encode(s, altchars=b'-.')) or base32 to avoid any punctuation, both are more compact than hex (20 to 40%). I'd lean towards base32 but I don't feel strongly about this.

I made changes so we now return the hash in base32.

#5 Updated by Marco Mambelli 2 months ago

  • Assignee changed from Marco Mambelli to Bruno Coimbra

OK to merge

#6 Updated by Bruno Coimbra 2 months ago

  • Status changed from Feedback to Resolved

Also available in: Atom PDF