Support #22233

Feature #20505: Consider if the blacklisting script from FIFE can be added as GlideinWMS frontend feature

Troubleshoot FIFE blacklist and period attribute

Added by Lorena Lobato Pardavila 8 months ago. Updated 6 months ago.

Status:
Closed
Priority:
Normal
Category:
-
Target version:
Start date:
03/27/2019
Due date:
% Done:

0%

Estimated time:
Stakeholders:
Duration:

Description

FIFE folks have a periodic script [1] which checks, as often as the period is set, whether the node is in a blacklist housed on the central web server that Fermilab runs.
They have this script set with period=3600 (seconds) [2], so it is supposed to be executed every hour.

All scripts are executed once before the HTCondor glidein starts. Periodic scripts are also invoked later, repeatedly, according to the specified period (in seconds).

It seems the script is being executed only once (at the beginning), so we were asked to take a look and figure out what is happening.

[1] https://cdcvs.fnal.gov/redmine/projects/discompsupp/wiki/Managing_the_Blacklist
[2]

<file absfname="/etc/gwms-frontend/scripts/blacklist.sh"
      after_entry="True"
      after_group="False"
      const="True"
      executable="True"
      period="3600"
      untar="False"
      wrapper="False">
    <untar_options cond_attr="TRUE"/>
</file>

History

#1 Updated by Lorena Lobato Pardavila 8 months ago

  • Status changed from New to Work in progress
  • Tracker changed from Bug to Support

#2 Updated by Lorena Lobato Pardavila 8 months ago

  • Parent task set to #20505

#3 Updated by Lorena Lobato Pardavila 8 months ago

  • Target version set to v3_5
  • Assignee set to Lorena Lobato Pardavila

It seems the script was getting stuck in what looked like an infinite loop here, namely when executing the curl command and "sleep 60":

# Max 5 retries
n=0
until [ $n -ge 5 ]
do
    #Replace the next line with the real webserver/blacklist file
    curl -s --insecure $blacklist_url > $TMPFILE && break
    n=$[$n+1]
    sleep 60
done

After several tests, I believe that to make it work the curl command should be given a maximum time allowed for the whole operation (it worked for me with --max-time 60 and sleep 60). This prevents batch jobs from hanging for hours due to slow networks or links going down.
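For illustration, the retry loop with a per-attempt timeout could look roughly like the sketch below. This is not the actual FIFE script: the function name fetch_blacklist and the variables MAX_RETRIES, CURL_MAX_TIME and RETRY_SLEEP are hypothetical names introduced here. The values (5 retries, 60 seconds) are the ones mentioned in this ticket.

```shell
#!/bin/bash
# Sketch only: names below are illustrative, not taken from the FIFE script.
MAX_RETRIES=${MAX_RETRIES:-5}       # same retry count as the original loop
CURL_MAX_TIME=${CURL_MAX_TIME:-60}  # seconds allowed for one whole transfer
RETRY_SLEEP=${RETRY_SLEEP:-60}      # pause between attempts, in seconds

# Download the blacklist from URL ($1) into OUTFILE ($2).
# Returns 0 on the first successful download, 1 once all retries have failed.
fetch_blacklist() {
    local url=$1 outfile=$2 n=0
    until [ "$n" -ge "$MAX_RETRIES" ]; do
        # --max-time caps the total time of the transfer, so a stalled
        # connection cannot block this attempt indefinitely.
        if curl -s --insecure --max-time "$CURL_MAX_TIME" "$url" > "$outfile"; then
            return 0
        fi
        n=$((n + 1))
        sleep "$RETRY_SLEEP"
    done
    return 1
}
```

With these values the loop is bounded at about 5 x (60 + 60) = 600 seconds in the worst case, well inside the 3600-second period, so one slow fetch should not hold up the next periodic run.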

On the other hand, we also suspect that some interaction between the script and HTCondor may be involved. We'll talk to the HTCondor team at the next meeting to get more information.

#4 Updated by Lorena Lobato Pardavila 8 months ago

  • Status changed from Work in progress to Resolved

Confirmed with the HTCondor team that there is no hidden interaction between HTCondor and the curl command that could have an effect in this case. They agreed that adding --max-time, as suggested by our tests, could help avoid the curl command hanging.

I am resolving the ticket, as no changes are needed on our side.

#5 Updated by Marco Mambelli 6 months ago

  • Status changed from Resolved to Closed

