Bug #2970

after_filelist is not populated with correct condor tarball

Added by Parag Mhashilkar over 7 years ago. Updated about 7 years ago.

Status:
Closed
Priority:
High
Assignee:
Parag Mhashilkar
Category:
-
Target version:
Start date:
09/24/2012
Due date:
% Done:

0%

Estimated time:
First Occurred:
Occurs In:
Stakeholders:
Duration:

Description

In an internal test running Condor 7.8.3 glideins on gWMS
branch_v2_6_1_gf1, we cannot get 7.8.3 glideins, even though the frontend
requests only 7.8.3; we get 7.8.2 glideins instead. We were able to get
7.8.3 glideins to run only after removing all Condor 7.8.2 references
from the factories' glideinWMS.xml.

We saw this behavior on two separate instances of the factory. Here are
the logs from one of them:

from job.709.0.err showing 7.8.3 was selected:
CONDOR_PLATFORM_7.8.3-default-default 1

from job.709.0.err startdlog showing 7.8.2 was actually run:
09/17/12 11:01:14 (pid:30720) **********************************************
09/17/12 11:01:14 (pid:30720) * condor_startd (CONDOR_STARTD) STARTING UP
09/17/12 11:01:14 (pid:30720) * /data2/condor_local/execute/dir_22474/glide_U22614/main/condor/sbin/condor_startd
09/17/12 11:01:14 (pid:30720) * SubsystemInfo: name=STARTD type=STARTD class=DAEMON
09/17/12 11:01:14 (pid:30720) * Configuration: subsystem:STARTD local:<NONE> class:DAEMON
09/17/12 11:01:14 (pid:30720) * $CondorVersion: 7.8.2 Aug 08 2012 $
09/17/12 11:01:14 (pid:30720) * $CondorPlatform: x86_rhap_5.8 $
09/17/12 11:01:14 (pid:30720) * PID = 30720
09/17/12 11:01:14 (pid:30720) * Log last touched time unavailable (No such file or directory)
09/17/12 11:01:14 (pid:30720)

from glideinWMS.xml:
<condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="default"/>
<condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="7.8.3"/>
<condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="7.8.2"/>

Tim Mortensen
OSG Glidein Factory Operations

History

#1 Updated by Parag Mhashilkar over 7 years ago

Update on what I found out (from an email I sent to Tim):

After spending some time on extensive analysis, here is what I could
find:

[0937] gfactory@cabinet-10-10-6 /tmp/parag$ ls -al /home/gfactory/glideinsubmit/glidein_v1_2/client_log/user_fe7/entry_CMS_T2_US_UCSD_gw2/job.708.0.err
-rw-rw-r-- 1 fe7 fe7 50042 Sep 17 11:21 /home/gfactory/glideinsubmit/glidein_v1_2/client_log/user_fe7/entry_CMS_T2_US_UCSD_gw2/job.708.0.err

The error file was created at 11:21 and picked up the Condor tarball
info from after_file_list.c9haCf.lst, which lists every Condor tarball,
irrespective of version, as condor_bin.c9haCf.tgz.

That file is a Condor 7.8.2 tarball:

[0942] gfactory@cabinet-10-10-6 /tmp/parag$ strings -a sbin/condor_startd | grep -i condorversion
CondorVersion
$CondorVersion: 7.8.2 Aug 08 2012 $
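
The same check can be scripted, for instance to inspect several unpacked tarballs in one pass. This is a minimal sketch in Python, assuming the version banner is embedded in the binary as the literal `$CondorVersion: ... $` string shown above. `find_condor_version` and `version_of_file` are hypothetical helper names, not part of glideinWMS or Condor.

```python
import re

# Matches the banner embedded in the binary, e.g.
# "$CondorVersion: 7.8.2 Aug 08 2012 $", and captures the version number.
VERSION_RE = re.compile(rb"\$CondorVersion: (\d+(?:\.\d+)+)[^$]*\$")


def find_condor_version(data: bytes):
    """Return the first Condor version found in raw binary data, or None."""
    match = VERSION_RE.search(data)
    return match.group(1).decode("ascii") if match else None


def version_of_file(path):
    """Read a binary such as sbin/condor_startd and report its version."""
    with open(path, "rb") as handle:
        return find_condor_version(handle.read())
```

Calling `version_of_file("sbin/condor_startd")` from the directory where the tarball was unpacked should report the same version that `strings` and `grep` show.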

glideinWMS.c9haCf.xml points to the correct Condor version tarball. So
it looks like something went wrong during the reconfig and the
after_filelist was not updated correctly.

I can see that the current after_file_list.c9hgtW.lst in the staging area
is correct, with tarballs pointing to the right versions. So something
you did fixed the problem and recreated the after_filelist correctly.

So the way it is set up now, when CONDOR_VERSION=default, it will fetch
7.8.3. When CONDOR_VERSION is set explicitly, it will fetch the correct
tarball, either 7.8.3 or 7.8.2, based on the version requested.
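
To make the matching concrete, here is a minimal sketch in Python. It assumes the glidein picks the tarball row whose Condition column equals the `CONDOR_PLATFORM_<version>-<os>-<arch>` name seen in job.709.0.err. The function names and the exact-match rule are illustrative assumptions, not the actual condor_platform_select.sh logic. The rows are copied from after_file_list.c9lfMe.lst, quoted later in this ticket.

```python
def platform_condition(version="default", os_name="default", arch="default"):
    """Build the condition name, e.g. CONDOR_PLATFORM_7.8.3-default-default."""
    return f"CONDOR_PLATFORM_{version}-{os_name}-{arch}"


def select_tarball(rows, version="default", os_name="default", arch="default"):
    """Return the staged InFile whose condition matches the request, or None."""
    wanted = platform_condition(version, os_name, arch)
    for infile, condition in rows:
        if condition == wanted:
            return infile
    return None


# (InFile, Condition) pairs taken from after_file_list.c9lfMe.lst
ROWS = [
    ("condor_bin.c9lfL5.tgz", "CONDOR_PLATFORM_default-default-default"),
    ("condor_bin.c9lfMe.tgz", "CONDOR_PLATFORM_7.8.3-default-default"),
    ("condor_bin.c9lfL5.tgz", "CONDOR_PLATFORM_7.8.2-default-default"),
]

print(select_tarball(ROWS, version="7.8.3"))
print(select_tarball(ROWS, version="7.8.2"))
```

Under this model, a list in which every row carries the same InFile, as in after_file_list.c9lfL5.lst, resolves every requested version to the same tarball, which matches the symptom reported in the description.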

I am interested to know what you did in the first place that caused this
issue, and then the later steps that resulted in it being fixed. Do you
remember whether, by any chance, you had tar_file in <condor_tarball>
pointing to the wrong version at any time?

#2 Updated by Parag Mhashilkar over 7 years ago

Tips on how to reproduce the problem (from Tim):

  • Remove all tarball references in glideinWMS.xml except for the default and
    do a reconfig. The following is produced:
<condor_tarballs>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="default"/>
</condor_tarballs>
http://cabinet-10-10-6.t2.ucsd.edu/glidefactory/stage/glidein_v1_2/after_file_list.c9lfKq.lst
# File: after_file_list.c9lfKq.lst
#
# Outfile     InFile             Cache/exec     Condition     ConfigOut
##############################################################################
validate_node.sh     validate_node.c96hTg.sh     exec     TRUE     FALSE
condor_platform_select.sh     condor_platform_select.c96hTg.sh     exec     TRUE     FALSE
condor_bin_default-default-default.tgz     condor_bin.c9lfKq.tgz     untar     CONDOR_PLATFORM_default-default-default     CONDOR_DIR
create_mapfile.sh     create_mapfile.c96hTg.sh     exec     TRUE     FALSE
collector_setup.sh     collector_setup.c96hTg.sh     exec     TRUE     FALSE
gcb_setup.sh     gcb_setup.c6le0z.sh     exec     TRUE     FALSE
glexec_setup.sh     glexec_setup.c96hTg.sh     exec     TRUE     FALSE
java_setup.sh     java_setup.c8gdK4.sh     exec     TRUE     FALSE
glidein_memory_setup.sh     glidein_memory_setup.c96hTg.sh     exec     TRUE     FALSE
condor_startup.sh     condor_startup.c96hTg.sh     exec     TRUE     FALSE
  • Then put the config back to original and do another reconfig:
<condor_tarballs>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="default"/>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="7.8.3"/>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="7.8.2"/>
</condor_tarballs>
http://cabinet-10-10-6.t2.ucsd.edu/glidefactory/stage/glidein_v1_2/after_file_list.c9lfL5.lst
# File: after_file_list.c9lfL5.lst
#
# Outfile     InFile             Cache/exec     Condition     ConfigOut
##############################################################################
validate_node.sh     validate_node.c96hTg.sh     exec     TRUE     FALSE
condor_platform_select.sh     condor_platform_select.c96hTg.sh     exec     TRUE     FALSE
condor_bin_default-default-default.tgz     condor_bin.c9lfL5.tgz     untar     CONDOR_PLATFORM_default-default-default     CONDOR_DIR
condor_bin_7.8.3-default-default.tgz     condor_bin.c9lfL5.tgz     untar     CONDOR_PLATFORM_7.8.3-default-default     CONDOR_DIR
condor_bin_7.8.2-default-default.tgz     condor_bin.c9lfL5.tgz     untar     CONDOR_PLATFORM_7.8.2-default-default     CONDOR_DIR
create_mapfile.sh     create_mapfile.c96hTg.sh     exec     TRUE     FALSE
collector_setup.sh     collector_setup.c96hTg.sh     exec     TRUE     FALSE
gcb_setup.sh     gcb_setup.c6le0z.sh     exec     TRUE     FALSE
glexec_setup.sh     glexec_setup.c96hTg.sh     exec     TRUE     FALSE
java_setup.sh     java_setup.c8gdK4.sh     exec     TRUE     FALSE
glidein_memory_setup.sh     glidein_memory_setup.c96hTg.sh     exec     TRUE     FALSE
condor_startup.sh     condor_startup.c96hTg.sh     exec     TRUE     FALSE
  • Finally do an upgrade without changing anything:
<condor_tarballs>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="default"/>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.3-x86_rhap_5.8-stripped.tar.gz" version="7.8.3"/>
   <condor_tarball arch="default" os="default" tar_file="/home/gfactory/glideinsubmit/glidein_v1_2.cfg/Prestage/gfactory-2.6.1-condor-7.8.2-x86_rhap_5.8-stripped.tar.gz" version="7.8.2"/>
</condor_tarballs>
http://cabinet-10-10-6.t2.ucsd.edu/glidefactory/stage/glidein_v1_2/after_file_list.c9lfMe.lst
# File: after_file_list.c9lfMe.lst
#
# Outfile     InFile             Cache/exec     Condition     ConfigOut
##############################################################################
validate_node.sh     validate_node.c96hTg.sh     exec     TRUE     FALSE
condor_platform_select.sh     condor_platform_select.c96hTg.sh     exec     TRUE     FALSE
condor_bin_default-default-default.tgz     condor_bin.c9lfL5.tgz     untar     CONDOR_PLATFORM_default-default-default     CONDOR_DIR
condor_bin_7.8.3-default-default.tgz     condor_bin.c9lfMe.tgz     untar     CONDOR_PLATFORM_7.8.3-default-default     CONDOR_DIR
condor_bin_7.8.2-default-default.tgz     condor_bin.c9lfL5.tgz     untar     CONDOR_PLATFORM_7.8.2-default-default     CONDOR_DIR
create_mapfile.sh     create_mapfile.c96hTg.sh     exec     TRUE     FALSE
collector_setup.sh     collector_setup.c96hTg.sh     exec     TRUE     FALSE
gcb_setup.sh     gcb_setup.c6le0z.sh     exec     TRUE     FALSE
glexec_setup.sh     glexec_setup.c96hTg.sh     exec     TRUE     FALSE
java_setup.sh     java_setup.c8gdK4.sh     exec     TRUE     FALSE
glidein_memory_setup.sh     glidein_memory_setup.c96hTg.sh     exec     TRUE     FALSE
condor_startup.sh     condor_startup.c96hTg.sh     exec     TRUE     FALSE

So apparently there is something wrong with the reconfig, and I think I
ran an upgrade (but I didn't retest) before you started troubleshooting,
which "fixed" the problem.
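
The mismatch in the second listing can be detected mechanically by comparing the `<condor_tarballs>` config against the staged file list. This is a minimal sketch in Python, assuming the five whitespace-separated columns shown in the listings above. The function names are hypothetical and nothing here is part of glideinWMS. It reports any two platforms whose configured tar_file differs while their staged InFile is the same.

```python
import xml.etree.ElementTree as ET

PREFIX = "CONDOR_PLATFORM_"


def parse_tarballs(xml_text):
    """Map 'version-os-arch' -> tar_file from a <condor_tarballs> element."""
    root = ET.fromstring(xml_text)
    return {
        f"{t.get('version')}-{t.get('os')}-{t.get('arch')}": t.get("tar_file")
        for t in root.iter("condor_tarball")
    }


def parse_untar_entries(list_text):
    """Map 'version-os-arch' -> InFile for the condor_bin untar rows."""
    staged = {}
    for raw in list_text.splitlines():
        line = raw.strip()
        fields = line.split()
        # Skip comments, the column header, and anything not five columns wide.
        if line.startswith("#") or len(fields) != 5:
            continue
        _outfile, infile, action, condition, _config_out = fields
        if action == "untar" and condition.startswith(PREFIX):
            staged[condition[len(PREFIX):]] = infile
    return staged


def find_conflicts(xml_text, list_text):
    """Return platform pairs with different source tarballs but one staged file."""
    sources = parse_tarballs(xml_text)
    staged = parse_untar_entries(list_text)
    keys = sorted(k for k in sources if k in staged)
    return [
        (a, b)
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
        if sources[a] != sources[b] and staged[a] == staged[b]
    ]
```

Fed the three-tarball config and after_file_list.c9lfL5.lst, this flags 7.8.3 as sharing a staged file with both the default and 7.8.2 entries. With after_file_list.c9lfMe.lst it returns an empty list, since default and 7.8.2 legitimately share the same source tarball.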

#3 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Assigned to Feedback
  • Assignee changed from Parag Mhashilkar to Douglas Strain

Committed to branch_v2plus_2970

I have now tested several combinations and tried to cover as many possibilities as I could.

#4 Updated by Douglas Strain over 7 years ago

  • Assignee changed from Douglas Strain to Parag Mhashilkar

Parag walked me through the changes. I approve, and this can be merged into v2plus. It should probably be merged into master as well.

#5 Updated by Parag Mhashilkar over 7 years ago

  • Status changed from Feedback to Resolved

branch_v2plus: commit:899fe76
master: commit:839165b

#6 Updated by Parag Mhashilkar about 7 years ago

  • Status changed from Resolved to Closed

