Bug #10877

Frontends with the same identity are mixed up even if they have different names and mappings

Added by Marco Mambelli almost 4 years ago. Updated almost 4 years ago.

Status: New
Priority: Normal
Assignee: Parag Mhashilkar
Category: -
Target version: -
Start date: 11/06/2015
Due date:
% Done: 0%
Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

This becomes important if the same DN is used by multiple groups.

In the Factory configuration frontends are identified in the security section as:

      <frontends>
         <frontend name="hepcloudFE" identity="hepcloudFE@cmssrv280.fnal.gov">
            <security_classes>
               <security_class name="frontend" username="hepcloud_1"/>
            </security_classes>
         </frontend>
         <frontend name="cms" identity="hepcloudFE@cmssrv280.fnal.gov">
            <security_classes>
               <security_class name="cms" username="cms_1"/>
            </security_classes>
         </frontend>
         <frontend name="kisti" identity="hepcloudFE@cmssrv280.fnal.gov">
            <security_classes>
               <security_class name="kisti" username="kisti_1"/>
            </security_classes>
         </frontend>
         <frontend name="nova" identity="hepcloudFE@cmssrv280.fnal.gov">
            <security_classes>
               <security_class name="nova" username="nova_1"/>
            </security_classes>
         </frontend>
         <frontend name="zzztest" identity="hepcloudFE@cmssrv280.fnal.gov">
            <security_classes>
               <security_class name="zzztest_sc" username="pnova_1"/>
            </security_classes>
         </frontend>
      </frontends>

This becomes a nested Python dictionary, saved as frontend.descript:
# File: frontend.descript
#
hepcloudFE      {'ident': u'hepcloudFE@cmssrv280.fnal.gov', 'usermap': {u'frontend': u'hepcloud_1'}}
cms     {'ident': u'hepcloudFE@cmssrv280.fnal.gov', 'usermap': {u'cms': u'cms_1'}}
kisti   {'ident': u'hepcloudFE@cmssrv280.fnal.gov', 'usermap': {u'kisti': u'kisti_1'}}
nova    {'ident': u'hepcloudFE@cmssrv280.fnal.gov', 'usermap': {u'nova': u'nova_1'}}
zzztest         {'ident': u'hepcloudFE@cmssrv280.fnal.gov', 'usermap': {u'zzztest_sc': u'pnova_1'}}

Note that all entries share the same identity.

In glideinFactoryConfig.py (frontend description) there are:
- def get_identity(self, frontend): which returns the identity given the name
- def get_frontend_name(self, identity): which returns the first name found in the dictionary given an identity

If multiple names have the same identity, as above, then this is ambiguous.
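
The ambiguity can be reproduced with a small stand-in for the frontend description. This is only a sketch: the class name FrontendDescriptSketch is hypothetical and the two methods are simplified versions written from the description above, not the actual glideinFactoryConfig.py code. The data is a subset of the frontend.descript entries shown earlier.

```python
class FrontendDescriptSketch:
    """Simplified, hypothetical stand-in for the Factory's frontend description."""

    def __init__(self, data):
        # data maps frontend name -> {'ident': ..., 'usermap': {...}}
        self.data = data

    def get_identity(self, frontend):
        """Return the identity configured for the given frontend name."""
        entry = self.data.get(frontend)
        return entry["ident"] if entry is not None else None

    def get_frontend_name(self, identity):
        """Return the first frontend name whose identity matches."""
        for name, entry in self.data.items():
            if entry["ident"] == identity:
                return name
        return None


IDENT = "hepcloudFE@cmssrv280.fnal.gov"
descript = FrontendDescriptSketch({
    "hepcloudFE": {"ident": IDENT, "usermap": {"frontend": "hepcloud_1"}},
    "cms": {"ident": IDENT, "usermap": {"cms": "cms_1"}},
    "nova": {"ident": IDENT, "usermap": {"nova": "nova_1"}},
})

# Round trip: name -> identity -> name does not return the original name.
identity = descript.get_identity("nova")
resolved = descript.get_frontend_name(identity)
print(resolved)  # not "nova": the first entry with this identity wins
```

Which name wins depends only on the iteration order of the dictionary, so with a shared identity the round trip from name to identity and back is not reliable.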

In check_and_perform_work and unit_work_v3/unit_work_v2 there are these 2 calls:

client_expected_identity = entry.frontendDescript.get_identity(client_security_name)
entry.frontendDescript.get_frontend_name(client_expected_identity)

and the second one is used to get the name of the frontend:

    frontend_name = "%s:%s" % \
        (entry.frontendDescript.get_frontend_name(client_expected_identity),
         credential_security_class)

This may pick the wrong name, causing a KeyError:

[2015-11-06 17:29:40,683] DEBUG: glideFactoryEntryGroup:289: Setting parallel_workers limit dynamically based on the available free memory
[2015-11-06 17:29:40,683] DEBUG: glideFactoryEntryGroup:294: Setting parallel_workers limit of 42
[2015-11-06 17:29:40,761] WARNING: fork:50: Failed child '<function forked_check_and_perform_work at 0x2376230>': 'zzztest:nova'
[2015-11-06 17:29:40,761] ERROR: fork:51: Failed child '<function forked_check_and_perform_work at 0x2376230>': 'zzztest:nova'
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/glideinwms/lib/fork.py", line 46, in fork_in_bg
    out = function(*args)
  File "/usr/sbin/glideFactoryEntryGroup.py", line 216, in forked_check_and_perform_work
    factory_in_downtime, entry, work[entry.name])
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 943, in check_and_perform_work
    params, in_downtime, condorQ)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1304, in unit_work_v3
    frontend_name, client_web, params)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1616, in perform_work_v3
    log=entry.log, factoryConfig=entry.gflFactoryConfig)
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLib.py", line 537, in keepIdleGlideins
    if glidein_totals.has_sec_class_exceeded_max_held(frontend_name):
  File "/usr/lib/python2.6/site-packages/glideinwms/factory/glideFactoryLib.py", line 1740, in has_sec_class_exceeded_max_held
    return self.frontend_limits[frontend_name]['held'] >= self.frontend_limits[frontend_name]['max_held']
KeyError: 'zzztest:nova'
[2015-11-06 17:29:40,769] WARNING: fork:149: Failed to extract info from child 'FNAL_HEPCLOUD_AWS_West_2a_m3.2xlarge_NOVA'
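
The failure at the bottom of the traceback can be sketched as follows. The shape of the limits table (a dictionary keyed by the "name:security_class" string, with 'held' and 'max_held' values) is taken from the traceback; the table contents, the assumption that the limits were registered under "nova:nova", and the helper build_key are hypothetical and only serve the illustration.

```python
# Hypothetical limits table, keyed by "<frontend name>:<security class>".
# Assumption: the limits were registered under the name the client really uses.
frontend_limits = {
    "nova:nova": {"held": 0, "max_held": 100},
}


def build_key(resolved_name, security_class):
    """Compose the key the same way the ticket's snippet does."""
    return "%s:%s" % (resolved_name, security_class)


def has_sec_class_exceeded_max_held(frontend_name):
    """Mirror of the failing lookup shown in the traceback."""
    limits = frontend_limits[frontend_name]
    return limits["held"] >= limits["max_held"]


# With the right frontend name the lookup succeeds.
print(has_sec_class_exceeded_max_held(build_key("nova", "nova")))

# With a name picked from another frontend sharing the identity it fails.
try:
    has_sec_class_exceeded_max_held(build_key("zzztest", "nova"))
except KeyError as err:
    print("KeyError:", err)
```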

History

#1 Updated by Marco Mambelli almost 4 years ago

The workaround for this is to use different identities (not one like identity="").
This requires different DNs that Condor maps to different identities.

#2 Updated by Marco Mambelli almost 4 years ago

  • Assignee changed from Marco Mambelli to Parag Mhashilkar

In meetings and discussions on 11/9 we agreed that this is a misinterpretation of the documentation.
There is a one-to-one relationship between frontend name and identity.

That is, in the Factory configuration (the test line can be removed):

      <frontends>
         <frontend name="hepcloudFE" identity="hepcloudFE@cmssrv280.fnal.gov">
            <security_classes>
               <security_class name="frontend" username="hepcloud_1"/>
               <security_class name="cms" username="cms_1"/>
               <security_class name="kisti" username="kisti_1"/>
               <security_class name="nova" username="nova_1"/>
               <security_class name="zzztest_sc" username="pnova_1"/>
            </security_classes>
         </frontend>
      </frontends>

The generic security configuration of the frontend includes the FE name (security_name):
   <security classad_proxy="/fe_proxy" proxy_DN="/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=host.fnal.gov" proxy_selection_plugin="ProxyAll" security_name="hepcloudFE" sym_key="aes_256_cbc">
      <credentials>
      </credentials>
   </security>

The group security configuration does not (no security_name and no DN):
         <security>
            <credentials>
               <credential absfname="/accesskey" keyabsfname="/secretkey" pilotabsfname="/cloud_proxy" security_class="cms" trust_domain="HEPCloud_AWS_USEAST_1_CMS" type="key_pair+vm_id" vm_id="ami-1111111"/>
               <credential ... />
               <credential ... />
            </credentials>
         </security>

The documentation (both web and OSG wiki) should be updated to reinforce this one-to-one relationship.
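
Besides the documentation, a check on the Factory configuration could flag a shared identity early. The following is a minimal sketch, not part of glideinWMS: the function find_shared_identities is hypothetical, and it only assumes the frontends/frontend elements and the name and identity attributes shown in the configuration above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def find_shared_identities(frontends_xml):
    """Return {identity: [frontend names]} for identities used more than once."""
    root = ET.fromstring(frontends_xml)
    names_by_identity = defaultdict(list)
    for frontend in root.iter("frontend"):
        names_by_identity[frontend.get("identity")].append(frontend.get("name"))
    return {
        identity: names
        for identity, names in names_by_identity.items()
        if len(names) > 1
    }


# Reduced version of the problematic configuration from this ticket.
BAD_CONFIG = """
<frontends>
   <frontend name="hepcloudFE" identity="hepcloudFE@cmssrv280.fnal.gov">
      <security_classes>
         <security_class name="frontend" username="hepcloud_1"/>
      </security_classes>
   </frontend>
   <frontend name="nova" identity="hepcloudFE@cmssrv280.fnal.gov">
      <security_classes>
         <security_class name="nova" username="nova_1"/>
      </security_classes>
   </frontend>
</frontends>
"""

for identity, names in find_shared_identities(BAD_CONFIG).items():
    print("identity %s is shared by frontends: %s" % (identity, ", ".join(names)))
```

An empty result means every identity maps to exactly one frontend name, which is the relationship agreed on above.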

Furthermore, fork.py could be modified to expose failures in the invoked function.
SL6 and later (Python 2.6) support a combined try/except/finally; Python 2.4 requires two nested try blocks.
Something like the following would show the error in the log file:

        try:
            try:
                out = function(*args)
                os.write(w, cPickle.dumps(out))
            except Exception, e:
                # Log the failure (message and traceback) so it shows up in the log file
                logSupport.log.warning("Failed child '%s': %s" % (function, e))
                logSupport.log.exception("Failed child '%s': %s" % (function, e))
        finally:
            os.close(w)
            # Exit immediately. No cleanup is wanted, since this child was
            # created just to perform the work
            os._exit(0)
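
For comparison, the same pattern with a single try/except/finally can be sketched as a self-contained Python 3 example. This is an illustration under assumptions, not the actual fork.py: the return value of fork_in_bg, the helper fetch_result, and the logger named fork_sketch (standing in for logSupport.log) are all hypothetical.

```python
import logging
import os
import pickle

log = logging.getLogger("fork_sketch")  # stand-in for logSupport.log


def fork_in_bg(function, *args):
    """Run function(*args) in a forked child; return (pid, read_fd)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the work, log any failure, then exit without cleanup.
        os.close(r)
        try:
            out = function(*args)
            os.write(w, pickle.dumps(out))
        except Exception as e:
            # Logs the message and the traceback of the failed call.
            log.exception("Failed child '%s': %s", function, e)
        finally:
            os.close(w)
            os._exit(0)
    # Parent: keep only the read end of the pipe.
    os.close(w)
    return pid, r


def fetch_result(pid, r):
    """Read the child's pickled result; return None if it wrote nothing."""
    chunks = []
    while True:
        chunk = os.read(r, 65536)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    os.waitpid(pid, 0)
    data = b"".join(chunks)
    return pickle.loads(data) if data else None
```

A parent would call fetch_result(*fork_in_bg(work, arg)) and treat None as a failed child; the cause is then visible in the log instead of being lost when the child exits.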

Passing the ticket to Parag,
Marco


