Project

General

Profile

Bug #21898

Error preventing the Frontend fo match jobs

Added by Marco Mambelli 3 months ago. Updated about 1 month ago.

Status:
Resolved
Priority:
High
Category:
Frontend
Target version:
Start date:
02/20/2019
Due date:
% Done:

0%

Estimated time:
(Total: 0.00 h)
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Edgar is having problems with GlideinWMS 3.4.3 Frontend. It seems the matching expression is not working properly.
Here his message:
I tested this on my LIGO frontend today. Without changing any of the configurations after upgrading to glideinwms-3.4.3 I see jobs that were matching in my main group are no longer matching. As soon as I go back to version 3.4.2 this seems fixed.
This is the match expression I use

 <match match_expr='((glidein["attrs"]["GLIDEIN_Site"] in job.get("DESIRED_Sites","").split(",")) or (glidein["attrs"]["GLIDEIN_Site"] in job.get("DESIRED_XSEDE_Sites","").split(","))  or (job.get("DESIRED_Sites\
","nosite")=="nosite")) and (job.get("RequestMemory", 1020)&lt;=glidein["attrs"]["GLIDEIN_MaxMemMBs"]) and ((job.get("REQUIRED_OS", "any")=="any") or (job.get("REQUIRED_OS")==glidein["attrs"]["GLIDEIN_REQUIRED_OS"\
])) and (job.get("UNDESIRED_Sites","nosite")=="nosite" or (glidein["attrs"]["GLIDEIN_Site"] not in job.get("UNDESIRED_Sites","").split(",")))' start_expr='(OpenScienceGrid =?=True || is_itb=?=True)&amp;&amp;((DESI\
RED_Sites=?=undefined) || stringListMember(GLIDEIN_Site,DESIRED_Sites,",")) &amp;&amp;(RequestMemory &lt;=GLIDEIN_MaxMemMBs)&amp;&amp; ((isUndefined(REQUIRED_OS)) || (REQUIRED_OS=?="any") || (REQUIRED_OS=?=GLIDEIN\
_REQUIRED_OS))&amp;&amp;(UNDESIRED_Sites=?=undefined || !stringListMember(GLIDEIN_Site,UNDESIRED_Sites,",")) '>
      <factory query_expr="GLIDEIN_MaxMemMBs=!=UNDEFINED">
         <match_attrs>
            <match_attr name="GLIDEIN_MaxMemMBs" type="int"/>
            <match_attr name="GLIDEIN_REQUIRED_OS" type="string"/>
            <match_attr name="GLIDEIN_Site" type="string"/>
         </match_attrs>
         <collectors>
            <collector DN="/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=gfactory-1.t2.ucsd.edu" factory_identity="gfactory@gfactory-1.t2.ucsd.edu" my_identity="feligo@gfactory-1.t2.ucsd.edu" node=\
"gfactory-1.t2.ucsd.edu"/>
         </collectors>
      </factory>
      <job comment="Define job constraint and schedds globally for simplicity" query_expr="(OpenScienceGrid =?=True || is_itb=?=True)&amp;&amp; (JobUniverse==5)&amp;&amp;(GLIDEIN_Is_Monitor =!= TRUE)&amp;&amp;(JOB\
_Is_Monitor =!= TRUE)">
         <match_attrs>
            <match_attr name="DESIRED_Sites" type="string"/>
            <match_attr name="DESIRED_XSEDE_Sites" type="string"/>
            <match_attr name="REQUIRED_OS" type="string"/>
            <match_attr name="RequestMemory" type="int"/>
            <match_attr name="UNDESIRED_Sites" type="string"/>

This should be investigated understanding why the matching is broken and fixing the frontend if it is an improper handling of variables or the comparison:
- find out why the match is different in the 2 versions
- if 3.4.2 is incorrect, document the change for the release notes
- if 3.4.3 is incorrect, fix the problem, we may release a 3.4.4 (or 3.4.3.1)


Subtasks

Support #21940: Unit tests for boolean and string valuesFeedbackDennis Box


Related issues

Related to glideinWMS - Support #21537: Double-check functions that deal with boolean ClassAd facing possible misleading behaviorNew2018-12-12

Related to glideinWMS - Bug #22354: Strings evaluated as booleanClosed2019-04-10

History

#1 Updated by Marco Mambelli 3 months ago

The info form Edgar is also in the OSG ticket: https://opensciencegrid.atlassian.net/browse/SOFTWARE-3545

#2 Updated by Marco Mambelli 3 months ago

More from Edgar:
The frontend config is here:

https://github.com/opensciencegrid/osg-ligo-fe/blob/master/config/frontend.xml

And an example submit script would be

universe = vanilla
executable = run_bayeswave.sh
should_transfer_files = YES
log = osg-default-singularity.log
error = osg-default-singularity-$(cluster)-$(process)-$(node).err
output = osg-default-singularity-$(cluster)-$(process)-$(node).out
+DESIRED_Sites="RAL" 
+OpenScienceGrid = true
+SingularityImage = "/cvmfs/ligo-containers.opensciencegrid.org/james-clark/bayeswave:master" 
x509userproxy = pilot_proxy
Requirements = (HAS_SINGULARITY =?=True)
notification = never
queue 5

Since Marco Mascheroni technically is employed by UCSD I can probably somehow bend the rules to give him access to the ligo host and momentarily upgrade. It is pretty easy to rollback. The reason why I upgraded is that I wanted to run the same version that the submit host since the submit hosts are running latest greatest. The second reason is that I wanted to move LIGO to the gwms based singularity since GLUEX has been working fine.

#3 Updated by Marco Mascheroni 3 months ago

I got access to the LIGO frontend (production, Edgar is a really brave man! :p). Here is more details. I put a breakpoint here right before the match is evaluated in glideinFrontendLib.py (and I run the main group by hand):

318                      try:
319                          # Evaluate the Compiled object first.
320                          # Evaluation order does not really matter.
321                          if glidein["attrs"]["GLIDEIN_Site"] == 'RAL' and job['DESIRED_Sites'] == 'RAL':
322                              import pdb; pdb.set_trace()
323  ->                        match = eval(match_obj)

The global match expression was not giving problem:

(Pdb) [1325] root@osg-ligo-1 ~# sudo -u frontend /usr/bin/python /usr/sbin/glideinFrontendElement.py 324344 /var/lib/gwms-frontend/vofrontend main run
> /usr/lib/python2.6/site-packages/glideinwms/frontend/glideinFrontendLib.py(323)countMatch()
-> match = eval(match_obj)
(Pdb) ((glidein["attrs"]["GLIDEIN_Site"] in job.get("DESIRED_Sites","").split(",")) or (glidein["attrs"]["GLIDEIN_Site"] in job.get("DESIRED_XSEDE_Sites","").split(","))  or (job.get("DESIRED_Sites","nosite")=="nosite")) and (job.get("RequestMemory", 1020)<=glidein["attrs"]["GLIDEIN_MaxMemMBs"]) and ((job.get("REQUIRED_OS", "any")=="any") or (job.get("REQUIRED_OS")==glidein["attrs"]["GLIDEIN_REQUIRED_OS"])) and (job.get("UNDESIRED_Sites","nosite")=="nosite" or (glidein["attrs"]["GLIDEIN_Site"] not in job.get("UNDESIRED_Sites","").split(",")))
True
(Pdb) match
False

But as you can see match was still evaluating to False (ofc I evaluated it after I stepped over the match instruction).

I noticed that in the group_main/group.descript there is this (I actually haven't found out where is this coming from yet):

MatchExpr (True) and (glidein["attrs"].get("GLIDEIN_REQUIRE_GLEXEC_USE", False) is False)

And that part of the expression is evaluating to False:

MatchExpr (True) and (glidein["attrs"].get("GLIDEIN_REQUIRE_GLEXEC_USE", False) is False)

Reason is that this attr is a string which is comparing to a boolean:

(Pdb) glidein["attrs"]["GLIDEIN_REQUIRE_GLEXEC_USE"]
'False'
(Pdb) type(glidein["attrs"]["GLIDEIN_REQUIRE_GLEXEC_USE"])
<type 'str'>

The "incriminated" change is this one (I am diff-ing 343 and 342):

diff --git a/frontend/glideinFrontendElement.py b/frontend/glideinFrontendElement.py
index 66fd6eb..d33284e 100755
--- a/frontend/glideinFrontendElement.py
+++ b/frontend/glideinFrontendElement.py
@@ -560,7 +560,7 @@ class glideinFrontendElement:

             glidein_el = self.glidein_dict[glideid]
             glidein_in_downtime = \
-                glidein_el['attrs'].get('GLIDEIN_In_Downtime') == 'True'
+                glidein_el['attrs'].get('GLIDEIN_In_Downtime', False) is True

             count_jobs = {}   # straight match
             prop_jobs = {}    # proportional subset for this entry
@@ -580,10 +580,10 @@ class glideinFrontendElement:
             # Note: if GLEXEC is set to NEVER, the site will never see
             # the proxy, so it can be avoided.
             if (self.glexec != 'NEVER'):
-                if (glidein_el['attrs'].get('GLIDEIN_REQUIRE_VOMS')=="True"):
+                if (glidein_el['attrs'].get('GLIDEIN_REQUIRE_VOMS', False) is True):
                         prop_jobs['Idle']=prop_jobs['VomsIdle']
                         logSupport.log.info("Voms proxy required, limiting idle glideins to: %i" % prop_jobs['Idle'])
-                elif (glidein_el['attrs'].get('GLIDEIN_REQUIRE_GLEXEC_USE')=="True"):
+                elif (glidein_el['attrs'].get('GLIDEIN_REQUIRE_GLEXEC_USE', False) is True):
                         prop_jobs['Idle']=prop_jobs['ProxyIdle']

I believe it comes from this ticket: https://cdcvs.fnal.gov/redmine/issues/21325#note-1

Should we just revert the changes for GLIDEIN_REQUIRE_GLEXEC_USE and GLIDEIN_REQUIRE_VOMS? Notice that GLIDEIN_In_Downtime is a bool so Lorena's change is justified for that. But it is not for the other two attrs that are str.

(Pdb) type(glidein["attrs"]["GLIDEIN_In_Downtime"])
<type 'bool'>

#4 Updated by Marco Mambelli 3 months ago

My suggestion (considering Marco's findings and Lorena's indications) is:
1. to use a comparison that can work w/ both strings and booleans (at least for python 2), e.g. using str():

if (str(glidein_el['attrs'].get('GLIDEIN_REQUIRE_VOMS'))=="True")

2. to add a unit test testin for both booleans and string values in the inputs, and check if the comparison is python3 proof
3. to add more comnversions and checks (using str2bool())

To minimize the changes and possibility of further problems:
1 has to be done urgently as part of this ticket
3 should be done in a separate ticket or at least a separate branch not to be merged now
2 can be done either w/ 1 or w/ 3, definitely before committing 3

3

#5 Updated by Marco Mascheroni 3 months ago

I'd call lower on the string and compare it to 'true'. Just in case..

if (str.lower(str(glidein_el['attrs'].get('GLIDEIN_REQUIRE_VOMS')()=="true")

#6 Updated by Lorena Lobato Pardavila 3 months ago

Comments regarding the points mentioned in Marco Mambelli's last comment:

1. I added the change in v35/21898. Would like to see feedback from Marco Mascheroni for this specific use case. Will keep doing further tests, anyway
2. Talked to Dennis about. He told me he'll try to add the unit tests today
3. A while ago, I had already opened a ticket a concerning to: https://cdcvs.fnal.gov/redmine/issues/21537. We can follow it this point up there.

#7 Updated by Lorena Lobato Pardavila 3 months ago

  • Related to Support #21537: Double-check functions that deal with boolean ClassAd facing possible misleading behavior added

#8 Updated by Lorena Lobato Pardavila 3 months ago

  • Related to Support #21537: Double-check functions that deal with boolean ClassAd facing possible misleading behavior added

#9 Updated by Lorena Lobato Pardavila 3 months ago

  • Related to deleted (Support #21537: Double-check functions that deal with boolean ClassAd facing possible misleading behavior)

#10 Updated by Marco Mascheroni 3 months ago

  • Status changed from New to Resolved

#11 Updated by Lorena Lobato Pardavila 3 months ago

  • Start date changed from 02/13/2019 to 02/20/2019
  • Due date set to 02/20/2019

due to changes in a related task: #21940

#12 Updated by Marco Mambelli 3 months ago

  • Target version changed from v3_5 to v3_4_4

#13 Updated by Marco Mascheroni about 1 month ago

  • Assignee changed from Marco Mascheroni to Lorena Lobato Pardavila
  • Status changed from Resolved to Feedback

Turned out that this issue was not actually solved. The changes I made did fix an issue with the unsanitized classad GLIDEIN_REQUIRE_GLEXEC_USE , but the source of the original issue was in another place that is only triggered when glidein_glexec_use is set to NEVER (which I had not in my tests!).

This time I checked the fix on the frontend Edgar gave me access to. Changes are in v34/21898_1

#14 Updated by Marco Mambelli about 1 month ago

  • Related to Bug #22354: Strings evaluated as boolean added

#15 Updated by Marco Mambelli about 1 month ago

  • Status changed from Feedback to Resolved

Let's leave this ticket closed and solve the issue in the new ticket [#22354]



Also available in: Atom PDF