Bug #22370

Periodic scripts seem to use the prefix inconsistently, only when invoked by startd cron

Added by Marco Mambelli 8 months ago. Updated 6 months ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Start date:
04/12/2019
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

Based on the message from Krista below, the behavior of periodic scripts seems to be inconsistent.
They do use the prefix when invoked as startd cron scripts. They use no prefix when invoked before starting the glidein.
This brings them to define different attributes and should be avoided.
If this is what happens, two solutions could be considered:

1- add the prefix also when the scripts are running outside startd cron
2- do not run the periodic scripts outside startd cron

Note: Option 1 is preferred
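
Option 1 can be sketched as a single renaming step applied to the script output in both contexts (first run and startd cron). This is a minimal illustration in Python, not GlideinWMS code: the function name apply_prefix and the NAME=VALUE line format are assumptions made for the example, and "NOPREFIX" is used as the "no prefix" value only because the email quoted in this ticket mentions it.

```python
def apply_prefix(output_lines, prefix="GLIDEIN_PS_"):
    """Return the lines with every NAME=VALUE attribute renamed to PREFIX+NAME.

    Hypothetical helper: the same function would be used for the first run
    and for the periodic runs, so a given script line always yields one name.
    """
    # Assumption: "NOPREFIX" means "do not add any prefix".
    effective = "" if prefix == "NOPREFIX" else prefix
    result = []
    for line in output_lines:
        name, sep, value = line.partition("=")
        name = name.strip()
        if sep and name.isidentifier():
            result.append(f"{effective}{name}={value.strip()}")
        else:
            # Not an attribute assignment: pass it through unchanged.
            result.append(line)
    return result


# The same script output gives the same attribute name in both contexts.
first_run = apply_prefix(["FOO=BAR"])
periodic_run = apply_prefix(["FOO=BAR"])
print(first_run == periodic_run)  # True
```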

I finally found the email from Diego that talked about periodic scripts for CMS.  I have not done any research to see if this applies here but wanted to update the thread in case it helps.  We can also check this as we continue the blacklist testing.  It’s from a while ago (and it was an old observation from him) so I have no idea what version of gwms or condor applies:

I dealt with this a while ago, so maybe my description is not 100% accurate, but if I recall correctly, this is what I observed:

Depending on whether you have declared a prefix on the FE or not, these periodic scripts will work differently.
If you set, for example, prefix="GLIDEIN_PS_" in the frontend.xml and your script echoes the attribute FOO=BAR, then the first time the script is executed you will see the attribute FOO in the startd ClassAd. In the subsequent "periodic" executions, the attribute that is propagated is GLIDEIN_PS_FOO (FOO keeps its initial value). The tricky part is that if FOO is not echoed in one of the periodic executions for any reason (the script broke, the flow of the script changed, etc.), GLIDEIN_PS_FOO becomes UNDEFINED.

Now, if you are not using a prefix (prefix="NOPREFIX" or something like that), every subsequent "periodic" execution overwrites the value set in the first execution of the script and, if FOO is not echoed for any reason, it takes the initial value (from the first execution).
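
The behavior described in this email can be summarized with a toy model. The sketch below only restates the recollection above in Python; it is not HTCondor or GlideinWMS code, and the names simulate and UNDEFINED are hypothetical. Each run is represented as a dict of the attributes the script echoed, with the first dict being the execution before the glidein starts.

```python
UNDEFINED = "UNDEFINED"  # stand-in for a ClassAd attribute that evaluates to UNDEFINED


def simulate(runs, prefix=""):
    """Return the attributes visible after the last run, per the description above."""
    initial = dict(runs[0])  # first execution: names are never prefixed
    ad = dict(initial)
    published = set()        # names echoed by at least one periodic run
    for output in runs[1:]:
        published.update(output)
        ad = dict(initial)   # values from the first execution stay in place
        if prefix:
            # Periodic runs publish PREFIX+NAME; a name that is no longer
            # echoed becomes UNDEFINED.
            for name in published:
                ad[prefix + name] = output.get(name, UNDEFINED)
        else:
            # No prefix: periodic values overwrite the initial ones; a name
            # that is no longer echoed falls back to its initial value.
            ad.update(output)
    return ad


runs = [{"FOO": "BAR"}, {"FOO": "BAZ"}, {}]  # the third run fails to echo FOO
print(simulate(runs, "GLIDEIN_PS_"))  # {'FOO': 'BAR', 'GLIDEIN_PS_FOO': 'UNDEFINED'}
print(simulate(runs))                 # {'FOO': 'BAR'}
```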

History

#1 Updated by Lorena Lobato Pardavila 8 months ago

  • Subject changed from Periodic scripts seem to use the the prefix inconsistently, only when invoked by startd cron to Periodic scripts seem to use the prefix inconsistently, only when invoked by startd cron

#2 Updated by Lorena Lobato Pardavila 8 months ago

  • Description updated (diff)

#3 Updated by Marco Mambelli 8 months ago

Clarifying my ticket request:

1. Krista seems to describe this behavior: "They do use the prefix when invoked as startd cron scripts. They use no prefix when invoked before starting the glidein." : Is this the correct behavior or not?
- Any behavior is OK as long as it is documented.
- Still, behaving always the same way is preferred because of point 2 below.

2. "This brings them to define different attributes and should be avoided." : Which kind of attributes, for example? Do you mean that, as a consequence of the behavior mentioned above, they have to define new attributes to correct it?
- If there is a prefix, e.g. GLIDEIN_PS_, and an attribute, e.g. ATTR1, then according to point 1 the script defines 2 attributes: ATTR1 when invoked the first time, GLIDEIN_PS_ATTR1 when invoked periodically. This is not desired; always using GLIDEIN_PS_ATTR1 would be better.

3. "If this is what happens, two..": By "this", do you mean defining new attributes, or the behavior described above being what we observe?
- "this" = point 1. If the behavior described in point 1 is confirmed (first verify the behavior of attributes from periodic scripts), then add the prefix also when the scripts are run outside startd cron.

Note that things could differ if the auxiliary function (ADD_CONFIG_LINE_SOURCE) is used or if the script writes directly to stdout.

What we want to avoid is that the same line in a script produces a different result when running the first time in glideinwms or when running periodically in condor startd.

#4 Updated by Lorena Lobato Pardavila 8 months ago

In other words:

- If you use a prefix (e.g. GLIDEIN_PS_) and an attribute (e.g. ATTR1), the attribute name changes during the lifetime of the pilot depending on when the script is executed: ATTR1 when invoked the first time, GLIDEIN_PS_ATTR1 when invoked periodically. The ClassAd then contains two attributes, and if there is a failure the attribute goes to undefined.
- If you use no prefix, the name is consistent and the attribute reverts to the last value if there is a failure.
- The problem is not only in the names (which should be consistent when a prefix is used); it is also in the values the attributes get, especially in the failure case.

Update from my discussion with Diego Davila: based on his last comment [1], I will verify that there is in fact a duplicated attribute when a prefix is used, and correct it by adding the prefix also when the scripts run outside startd cron.

I think you got it right; what you have described is what I observed, but I think this is all condor. I once talked to the devs and they told me that's the way condor crons are supposed to work. The only issue that could be coming from GlideinWMS (and I'm not entirely sure) is the fact of getting a duplicated attribute when a prefix is used (one with the prefix, one without it).

#5 Updated by Lorena Lobato Pardavila 7 months ago

  • Status changed from New to Resolved

Fixed. What was happening:

Context
The script they were having trouble with is this:
https://gitlab.cern.ch/CMSSI/CMSglideinWMSValidation/blob/master/singularity_validation.sh

In the CMS Global pool you can currently see that pilots advertise the attributes produced by that script twice: once with the prefix and once without it. For example:

$ condor_status -pool vocms0815 -const 'slottype=="Partitionable" && GLIDEIN_CMSSite=="T1_ES_PIC"' -limit 1 -af:h HAS_SINGULARITY GLIDEIN_PS_HAS_SINGULARITY GLIDEIN_PS_GPUDetection GPUDetection
HAS_SINGULARITY GLIDEIN_PS_HAS_SINGULARITY GLIDEIN_PS_GPUDetection GPUDetection    
true            true                       No GPUs detected        No GPUs detected 

Explanation
The advertise function of that singularity_validation script produces several outputs. According to http://glideinwms.fnal.gov/doc.prd/factory/custom_scripts.html#periodic, a script should call either "echo" or "add_condor_vars_line", but not both. The issue here was that the "echo" output got the prefix, while the duplicated attributes we have been seeing (without prefix) came from the call to "add_condor_vars_line". Thus, that line should be removed.

Solution
Removed that line (add_condor_vars_line $key "$atype" "-" "+" "Y" "Y" "+") from the advertise function. Confirmed that there are no duplicates now:

[llobato@fermicloud133 llobato]$ condor_status -l | grep -i "singularity" 
GLIDEIN_PS_CVMFS_singularity_opensciencegrid_org_REVISION = 62318
GLIDEIN_PS_HAS_CVMFS_singularity_opensciencegrid_org = true
GLIDEIN_PS_HAS_SINGULARITY = true
GLIDEIN_PS_OSG_SINGULARITY_IMAGE_DEFAULT = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6" 
GLIDEIN_PS_OSG_SINGULARITY_PATH = "/bin/singularity" 
GLIDEIN_PS_OSG_SINGULARITY_VERSION = "2.6.1-dist" 
GLIDEIN_PS_SINGULARITY_VALIDATION_TIME = ".246398702" 
GLIDEIN_Singularity_Use = "DISABLE_GWMS" 
HasSingularity = true
SingularityVersion = "2.6.1-dist" 
StarterAbilityList = "HasTDP,HasFileTransferPluginMethods,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasSingularity,HasPerFileEncryption,HasFileTransfer,HasTransferInputRemaps,HasVM,HasReconnect,HasMPI" 
GLIDEIN_PS_CVMFS_singularity_opensciencegrid_org_REVISION = 62318
GLIDEIN_PS_HAS_CVMFS_singularity_opensciencegrid_org = true
GLIDEIN_PS_HAS_SINGULARITY = true
GLIDEIN_PS_OSG_SINGULARITY_IMAGE_DEFAULT = "/cvmfs/singularity.opensciencegrid.org/bbockelm/cms:rhel6" 
GLIDEIN_PS_OSG_SINGULARITY_PATH = "/bin/singularity" 
GLIDEIN_PS_OSG_SINGULARITY_VERSION = "2.6.1-dist" 
GLIDEIN_PS_SINGULARITY_VALIDATION_TIME = ".195535660" 
GLIDEIN_Singularity_Use = "DISABLE_GWMS" 
HasSingularity = true
SingularityVersion = "2.6.1-dist" 
StarterAbilityList = "HasTDP,HasFileTransferPluginMethods,HasJobDeferral,HasJICLocalConfig,HasJICLocalStdin,HasSingularity,HasPerFileEncryption,HasFileTransfer,HasTransferInputRemaps,HasVM,HasReconnect,HasMPI" 

Resolving ticket.
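
A check like the one done by eye above could be automated. The following is a rough sketch, assuming ClassAd text in the "Name = value" long format printed by condor_status -l; the function name find_duplicated_attributes is hypothetical, and the "before" sample is rewritten by hand from the condor_status -af:h output earlier in this comment. Names are compared case-sensitively here for simplicity.

```python
def find_duplicated_attributes(classad_text, prefix="GLIDEIN_PS_"):
    """Return the names advertised both as NAME and as PREFIX+NAME."""
    names = set()
    for line in classad_text.splitlines():
        name, sep, _ = line.partition(" = ")
        if sep:
            names.add(name.strip())
    return sorted(
        n for n in names if not n.startswith(prefix) and prefix + n in names
    )


# Before the fix (illustrative, adapted from the CMS Global pool output above).
before_fix = """HAS_SINGULARITY = true
GLIDEIN_PS_HAS_SINGULARITY = true
GLIDEIN_PS_GPUDetection = "No GPUs detected"
GPUDetection = "No GPUs detected"
"""

# After the fix (a few lines from the condor_status -l output above).
after_fix = """GLIDEIN_PS_HAS_SINGULARITY = true
GLIDEIN_PS_OSG_SINGULARITY_PATH = "/bin/singularity"
GLIDEIN_Singularity_Use = "DISABLE_GWMS"
HasSingularity = true
"""

print(find_duplicated_attributes(before_fix))  # ['GPUDetection', 'HAS_SINGULARITY']
print(find_duplicated_attributes(after_fix))   # []
```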

#6 Updated by Marco Mambelli 6 months ago

  • Status changed from Resolved to Closed

